
INFORMATION RETRIEVAL ACROSS MULTIPLE

INFORMATION SOURCES USING A KNOWLEDGE BASED

METHODOLOGY

A THESIS

SUBMITTED TO THE DEPARTMENT OF CIVIL AND ENVIRONMENTAL

ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF ENGINEER

Siddharth Taduri

March 2012


http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/jx742kr9947

© 2012 by Siddharth S Taduri. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.


Approved for the department.

Kincho Law, Adviser

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this thesis in electronic format. An original signed hard copy of the signature page is on file in University Archives.


ABSTRACT

Recent years have seen tremendous growth in research and development in science and technology, and a growing emphasis on obtaining Intellectual Property (IP) protection for one's innovations. Information pertaining to IP for science and technology is siloed into many diverse sources and consists of laws, regulations, patents, court litigations, scientific publications, and more. Although a great deal of legal and scientific information is now available online, the scattered distribution of the information, combined with its enormous size and complexity, makes any attempt to gather relevant IP-related information on a specific technology a daunting task. In this thesis, we develop a knowledge-based software framework to facilitate retrieval of patents and related information across multiple diverse and uncoordinated information sources in the US patent system. The document corpus covers issued US patents, court litigations, scientific publications, and patent file wrappers in the biomedical technology domain.

A document repository is populated with issued US patents, court cases, scientific publications, and file wrappers in XML format. Parsers are developed to automatically download documents from the information sources, extract metadata and textual content from the downloaded documents, and populate the XML repository. A text index is built over the repository using Apache Lucene to facilitate search and retrieval of documents.


Based on the document repository, the underlying methodology to search across multiple information sources in the patent system is discussed. The methodology is divided into two major parts. First, we develop a knowledge-based query expansion methodology to tackle domain terminological inconsistencies in the documents; relevant knowledge is retrieved from external sources such as domain ontologies. Second, since our goal is to retrieve a collection of relevant documents across multiple sources, we develop a patent system ontology to provide interoperability between the different types of documents and to facilitate information integration. We discuss the Information Retrieval (IR) framework, which combines the knowledge-based query expansion methodology with the patent system ontology to provide a multi-domain search methodology. A visualization tool based on term co-occurrence is developed that can be used to browse the document repository through the class hierarchies of domain ontologies.

The knowledge-based query expansion methodology is evaluated using formal measures such as precision and recall. A simple term-based search is used as a baseline reference for comparison, and results from related work are also used for comparison. A series of common questions asked during patent prior art searches and infringement analysis is generated to evaluate the patent system ontology. A summary of the results and analysis is provided.


ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest thanks to my advisor, Prof. Kincho H. Law, for providing me with a wonderful opportunity in the form of this project. His continued support, patience, and belief got me through my graduate studies at Stanford. He is a role model to me and continues to inspire the decisions I make in life.

I would like to thank Prof. Jay Kesan, School of Law at the University of Illinois at Urbana-Champaign, and Dr. Gloria Lau, Consulting Associate Professor at Stanford University, for their constant guidance and comments on this project. Their experience has been an invaluable resource to me. I would also like to thank my uncle, Dr. Sudarsan Rachuri, for helping me make informed career choices over the years.

My stay at Stanford has made me realize the importance of family more than ever before. I would like to thank all my family, especially my parents, sister, and brother-in-law, whose motivation immensely helped me get here. I wish I could show my work to my late grandfather, who spoke about technology and entrepreneurship years ago when I could barely spell the words.

My office mates, former and present members of the Engineering Informatics Group, have made the long hours spent at the office enjoyable. I would like to thank Vladimir Fedorov, Zan Chu, Kay Smarsly, Baryam Aygun, and Jinkyoo Park. My close friends Varun Sheth, Khushnuma Irani, Smit Shah, Gautham Sista, Saurabh Saraf, Reuben Joseph, Siddharth Ahuja, and Siddharth Kumar have been the closest to family I have had here, and I would like to thank them for their constant encouragement.

I would also like to thank the university and library staff, especially Kim Vonner, Brenda Sampson, and Jill Nomura, for all the help they have offered. This research is partially supported by the National Science Foundation, Grant Number 0811460, and by the Information Technology Laboratory at the National Institute of Standards and Technology. Any opinions and findings are those of the author and do not necessarily reflect the views of the National Science Foundation or the National Institute of Standards and Technology.


TABLE OF CONTENTS

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1. Introduction
  1.1 Motivation and Problem Statement
  1.2 Goals of this Research
  1.3 Background and Related Research
    1.3.1 Background on the Patent System
    1.3.2 Related Work
  1.4 Thesis Outline

Chapter 2. Document Repository
  2.1 Introduction
  2.2 Use Case
  2.3 Document Collection and Parsing
    2.3.1 Patents
    2.3.2 Court Cases
    2.3.3 Publications
      2.3.3.1 Identifying Ground Truth from TREC Corpus
    2.3.4 File Wrappers
  2.4 Evaluation and Accuracy
    2.4.1 Evaluation of the Extracted Patent Data
  2.5 Text Index
    2.5.1 Vector Space Model
    2.5.2 TF-IDF
    2.5.3 Fields and Schema
    2.5.4 Solr
  2.6 Related Work
    2.6.1 Interoperability, Information Frameworks and Semantic Web
    2.6.2 Digital Repositories
    2.6.3 Document Parsing and Information Extraction

Chapter 3. Methodology
  3.1 Introduction
  3.2 Bio-Ontologies
    3.2.1 Query Expansion: General Form
    3.2.2 Effects of Choosing the Right Ontology
    3.2.3 Effects of Indexing Parameters
    3.2.4 Scope of the Query Terms
    3.2.5 Interactive Model for Visualization
  3.3 Patent Ontology
    3.3.1 Defining Scope of the Ontology
    3.3.2 Conceptualization
    3.3.3 Populating the Ontology
    3.3.4 Using the Declarative Syntax: Expressing Queries and Developing Rules
      3.3.4.1 Expressing Competency Questions as SPARQL Queries
      3.3.4.2 Expressing Heuristics as Rules
  3.4 IR Framework
    3.4.1 Implementation Details
  3.5 Related Work
    3.5.1 Knowledge-Based IR
    3.5.2 Other Approaches to IR
    3.5.3 Ontology Development and Interoperability

Chapter 4. Performance Evaluation
  4.1 Introduction
  4.2 Background and Related Work
    4.2.1 Evaluation Metrics
    4.2.2 SPARQL
  4.3 Knowledge-Based Methodology using Bio-Ontologies
    4.3.1 Baseline
    4.3.2 Query Expansion
      4.3.2.1 Query Expansion for Retrieval of Patent Documents
      4.3.2.2 Query Expansion for Retrieval of Scientific Publications
  4.4 Evaluating Patent System Ontology and IR Framework
    4.4.1 Use Case Scenario: Patent Prior Art Search
    4.4.2 Use Case Scenario: File Wrapper Example
    4.4.3 Other Benefits of the Patent System Ontology
  4.5 Summary

Chapter 5. Conclusion and Future Work
  5.1 Summary
  5.2 Future Work
    5.2.1 Digital Repositories
    5.2.2 User Relevancy Feedback
    5.2.3 Query Expansion, Semantic Indexing and Other Methodologies
    5.2.4 Scaling to More Applications, More Data Sources, and More Subject Domains

Bibliography


LIST OF TABLES

Table 2.1: Patent XML Element Descriptions
Table 2.2: Field-by-Field Accuracy of Extracted Patent Data
Table 3.1: Summary of the Selected Biomedical Ontologies
Table 3.2: Effect of the Distance between Search Clauses
Table 3.3: Expressing Competency Questions in SPARQL
Table 3.4: Expressing SWRL Rules
Table 4.1: Baseline Reference: Rank of Core Patents
Table 4.2: Baseline Reference for Evaluating the Query Expansion Methodology
Table 4.3: Change in Average Rank of Core Patents with Level of Expansion
Table 4.4: Precision and Average Rank of Core Patents for Fielded Search on Patent Documents
Table 4.5: Pre-Processed Queries to Evaluate Query Expansion on Scientific Publications
Table 4.6: Precision for Results Obtained by Querying Patent System Ontology for Documents Related to a Set of Inventors, Assignees or US Classification


LIST OF FIGURES

Figure 2.1: Sample Patent Document
Figure 2.2: Sample Patent XML Document
Figure 2.3: Sample Court Case Document
Figure 2.4: Sample Court Case XML Document
Figure 2.5: Sample Publication in XML
Figure 2.6: Contents of a File Wrapper
Figure 2.7: Sample Rejection Letter (Office Action)
Figure 2.8: Sample Interference Document
Figure 2.9: Sample File Wrapper in XML
Figure 2.10: Cosine Similarity in VSM
Figure 3.1: The Importance of Domain Knowledge in Retrieving Scientific Publications
Figure 3.2: The Importance of Domain Knowledge in Retrieving Patent Documents
Figure 3.3: Query Expansion along MeSH Hierarchy to Retrieve Relevant Documents
Figure 3.4: Relations in Domain Ontologies
Figure 3.5: General Form of the Expanded Query
Figure 3.6: Comparison between Multiple Biomedical Ontologies
Figure 3.7: Visualizing Concept Co-occurrences using MINOE
Figure 3.8: Conceptual View of Patent Documents
Figure 3.9: Conceptual View of Court Case
Figure 3.10: Events Contained in a File Wrapper
Figure 3.11: Excerpt from the Patent System Ontology: Rejection Class
Figure 3.12: Top Level Ontology for the Patent System
Figure 3.13: Cross-Referencing between Documents in the Patent System
Figure 3.14: Populating the Patent System Ontology
Figure 3.15: Expressing Heuristics through Rules in Patent System Ontology
Figure 3.16: IR Framework
Figure 3.17: Example to Illustrate IR Framework
Figure 3.18: Current Implementation of the IR Framework Methodology
Figure 4.1: Average Precision and Recall for Query Expansions on Patent Documents
Figure 4.2: Comparison between Use of Multiple Ontologies vs. Individual Ontologies
Figure 4.3: Effect of Depth of Query Expansion on Retrieval of Scientific Publications
Figure 4.4: Performance of Query Expansion on Individual Topics
Figure 4.5: Number of Query Terms with Increasing Depth of Query Expansion
Figure 4.6: SPARQL Query to Retrieve Court Cases Related to Erythropoietin
Figure 4.7: SPARQL Query to Retrieve Patents Involved in Court Cases Related to Erythropoietin
Figure 4.8: SPARQL Query to Extract US Patent Classification, Names of Assignees and Inventors from Patent Documents
Figure 4.9: SPARQL Query to Extract Patent Documents Related to a Set of Inventors, Assignees and/or US Patent Classification
Figure 4.10: Querying Patent System Ontology for Backward Citations
Figure 4.11: SPARQL Query to Display Contents of a File Wrapper, Ordered by the Date
Figure 4.12: SPARQL Query to Extract the Text of Claims from the Original Patent Application
Figure 4.13: Class View of Patent Examiner's Restriction in File Wrapper for US Patent 5,955,422
Figure 4.14: Example to Illustrate a Simple Rule-Based Similarity Measure


Chapter 1.

INTRODUCTION

1.1 MOTIVATION AND PROBLEM STATEMENT

Recent years have seen tremendous growth in research and development in science and technology, and a growing emphasis on obtaining Intellectual Property (IP) protection for one's innovations. IP is an important asset for any organization. In a study of over 9,000 European patents granted between 1993 and 1997, the median value of a patent was estimated to be EUR 300,000, with 10% of the owners reporting a value of EUR 10 million or more.1 Clearly, any company or inventor would want to protect the rights to use, make, or sell their invention. During the lifetime of a patent, from its initial filing and issuance through disputes and litigations, the patent system will constantly be searched for information. Information pertaining to IP and the patent system for science and technology is siloed into many diverse sources and consists of laws, regulations, patents, court litigations, scientific publications, and more. Although a great deal of legal and scientific information is now available online, the scattered distribution of the information, combined with its enormous size and complexity, makes any attempt to gather relevant IP-related information on a specific technology a daunting task. Currently, this task is performed manually and is both laborious and expensive, a burden that falls disproportionately on smaller firms, start-ups, and individual inventors with very limited resources. In this thesis, we develop a methodology to facilitate retrieval of patents and related information across multiple diverse and uncoordinated information sources in the US patent system. The following scenarios illustrate some of the issues faced with the current patent system:

1 Study on Evaluating the Knowledge Economy – What are Patents Actually Worth? http://ec.europa.eu/internal_market/indprop/docs/patent/studies/patentstudy-report_en.pdf (Accessed on 03/01/2012).

A company looking to patent its technology on medical imaging devices, for example, is required to perform an initial patentability search and establish the usefulness, novelty, and non-obviousness of the technology [1]. The patentability search involves a thorough study of prior art including scientific literature and patent databases, competitor analysis, existing litigations on similar technologies, and regulations issued by government agencies such as the Food and Drug Administration (FDA) (or any agency enforcing laws with respect to medical imaging devices and related technologies).

Similar to the patent applicant, a patent examiner performs a patentability search when examining an application. As of 2009, the United States Patent and Trademark Office (USPTO) employed about 6,242 patent examiners and received 456,106 utility patent applications.2 Roughly, this translates to around 73 applications per examiner annually, or approximately 1.4 per examiner per week. Although a patent examiner is generally well-versed in the technological domain of the patent application, this workload imposes a serious time constraint during the review process. Each application therefore receives less time, potentially leading to incomplete examination, and possibly to infringement or invalidation at a later stage.

2 The USPTO's annual statistics can be accessed at http://www.uspto.gov/web/offices/ac/ido/oeip/taf/reports.htm (Accessed on 03/01/2012).


To protect its IP, a company may perform an infringement analysis to ensure that a particular patent's rights are not being infringed. The consequences of an infringement can be severe and result in heavy losses. For example, Microsoft Inc. paid a settlement of US $521 mil. to Eolas Inc. over a single patent in 2007.3 Other notable settlements include the litigations between Medtronic and Michelson (US $1.57 billion)4 and between Kodak and Polaroid (US $909 mil.).5 Infringement analysis involves a thorough search of the issued patents database, the patent application database, prior court litigations, regulations, and any form of documented evidence to help assert the infringement or invalidate a patent's claims as a defensive measure.

Irrespective of the scenario, whether a company intends to patent its technology or to perform an infringement analysis, or a patent examiner intends to perform a patentability search, several questions arise:

- What are the issued patents in related technologies?
- What is the legal scope of similar patents?
- Who are the competitors?
- Have any similar patents been challenged in court?
- How can one work around the existing body of knowledge?
- Are there any scientific publications or regulations which can potentially be used to challenge and invalidate a patent's claims?

3 The details of the settlement between Eolas Inc. and Microsoft Inc. can be viewed at http://en.wikipedia.org/wiki/Eolas (Accessed on 03/01/2012).
4 The details of the settlement between Medtronic and Michelson can be viewed at http://www.nytimes.com/2005/04/23/business/23medronic.html (Accessed on 03/01/2012).
5 The details of the settlement between Kodak and Polaroid can be viewed at http://articles.latimes.com/1990-10-13/business/fi-1997_1_instant-photography (Accessed on 03/01/2012).


These questions cannot be answered from any single information source. An integration framework is needed to enable the retrieval of relevant information from diverse sources. In this thesis, we explore a knowledge-based approach to address two fundamental information integration issues: (a) the lack of interoperability among the information sources in the current patent system; and (b) the varying information needs of the users of the patent system. The work presented will positively impact small businesses and independent researchers, as well as professionals such as lawyers and patent examiners, with the potential to influence the use of IT in the current patent system.

1.2 GOALS OF THIS RESEARCH

The objective of this research is to develop a methodology that can facilitate the retrieval of patent-related information from heterogeneous sources. To limit the scope of the research, we focus on the technology space of biomedicine, i.e., patents, scientific publications, laws, and regulations that broadly fall under the area of biomedicine. The heterogeneous nature of the information sources results in different language conventions, terminology, publication formats, etc. In fact, the documents belonging to the various information sources are almost entirely different.

In order to understand the challenges currently faced in gathering and searching the patent system, our first step is to construct a document repository. We study the current publication formats, structure, and style of language, and identify the critical elements in each information source. Our corpus consists of patent documents, court litigations, and scientific publications related to a biomedical use case: 'erythropoietin'. The corpus also includes a patent file wrapper, which is a collection of all documents and communication between the patent applicant and the patent office during the application phase of a patent. Since XML is a stable format to represent structured and semi-structured information, the documents are appropriately parsed and stored as XML files. The repository is made searchable via Apache Lucene, a text mining library [5].
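As a minimal illustration of the parsing step, the sketch below extracts metadata fields from a simplified patent record into a flat structure ready for an XML repository or a text indexer. The XML layout and field names here are hypothetical stand-ins; actual USPTO formats, and the parsers developed in this thesis, are far richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified patent record -- not a real USPTO schema.
SAMPLE = """
<patent>
  <number>5955422</number>
  <title>Production of erythropoietin</title>
  <inventor>Fu-Kuen Lin</inventor>
  <abstract>Purified erythropoietin and DNA sequences encoding it.</abstract>
</patent>
"""

def parse_patent(xml_text):
    """Extract metadata and textual content into a flat dict,
    one entry per top-level XML element."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}

record = parse_patent(SAMPLE)
print(record["number"], "-", record["title"])
```

A real pipeline would walk a directory of downloaded documents, apply a parser per source type (patent, court case, publication, file wrapper), and hand each resulting record to the indexer.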


Terminological variations such as synonymy and polysemy are a common source of problems that often hinder the effectiveness of traditional term-based Information Retrieval (IR) methods. We develop a knowledge-based method that uses external knowledge sources, such as domain ontologies, to provide the semantics required to resolve terminological inconsistencies and improve semantic interoperability between information sources. The study examines current trends, sources, and applications of domain ontologies. While the primary focus of this research is the use of biomedical ontologies, multiple domain ontologies spanning both the legal and technical domains are needed in order to achieve information interoperability.
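The core idea of ontology-driven query expansion can be sketched as follows: a query term is widened with its synonyms and with narrower concepts from a domain ontology, so documents using different vocabulary for the same concept still match. The tiny ontology fragment below is hand-made for illustration; real sources such as MeSH are far larger and are accessed through their own interfaces.

```python
# Hypothetical ontology fragment: synonyms and narrower (child) concepts.
ONTOLOGY = {
    "erythropoietin": {
        "synonyms": ["EPO", "epoetin"],
        "narrower": ["epoetin alfa", "epoetin beta"],
    },
    "epoetin alfa": {"synonyms": ["Epogen"], "narrower": []},
    "epoetin beta": {"synonyms": [], "narrower": []},
}

def expand_query(term, ontology, depth=1):
    """Collect the term, its synonyms, and narrower concepts down to
    the given depth, joined into a single OR query."""
    entry = ontology.get(term, {"synonyms": [], "narrower": []})
    terms = [term] + entry["synonyms"]
    if depth > 0:
        for child in entry["narrower"]:
            terms += expand_query(child, ontology, depth - 1).split(" OR ")
    # keep the first occurrence of each term, preserving order
    unique = list(dict.fromkeys(terms))
    return " OR ".join(unique)

print(expand_query("erythropoietin", ONTOLOGY, depth=1))
```

The `depth` parameter mirrors the notion of expansion level evaluated later in the thesis: deeper expansion pulls in more specific concepts at the cost of longer, noisier queries.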

An important step in achieving interoperability is to allow the information sources to communicate with one another. To achieve this, the information sources must use a standardized and structured representation for documents. We develop a Patent System Ontology (PSO) to standardize the representation of the information sources and achieve interoperability. While the documents are vastly diverse, the information is implicitly cross-referenced, and these cross-references can be used as relevancy measures between documents. For example, a court document that involves a particular patent reveals a high relevancy between the two documents. Such relevancy measures are central to our method for multi-source IR and will be discussed in detail in this thesis.

We design an Information Retrieval (IR) framework which integrates the patent

system ontology and the domain ontologies to retrieve a set of related documents

across multiple sources in an iterative manner. Since the potential user base can range

from lawyers to technical professionals, understanding the user’s intent from a single

query becomes challenging. We incorporate user feedback into the framework in order

to capture the user’s true information needs.


A fully-functional tool is developed based on the proposed IR framework. We

discuss the requirements of such a tool in detail, and include several features in order

to provide a good user experience. These features include faceting, tag clouds, co-occurrence graphs, and so on. We also extend the visualization module of MINOE, a

software tool originally developed to explore regulations in ocean ecosystems [37], to

browse the document repository through the hierarchies of biomedical ontologies.

The research primarily revolves around the use of knowledge-based methodologies and information modeling. IR strongly relies on natural language understanding, data and text mining, and machine learning, which are evolving quickly and producing promising results. This research provides ample scope to incorporate

such emerging techniques into the framework to further enhance the quality of the

results.

1.3 BACKGROUND AND RELATED RESEARCH

1.3.1 BACKGROUND ON THE PATENT SYSTEM

This section provides some basic but necessary background on the patent system. The patent system is a two-stage system where the first stage includes the

acquisition of patents, and the second includes their enforcement. In the acquisition

phase, a patent application is prosecuted by the USPTO and finally issued or rejected

based on the patent examiner’s decision. The prosecution history, also known as the

file wrapper, is documented for that issued patent or application. The various

documents involved in the acquisition phase are the patent applications, file wrappers,

issued patents and any form of prior art such as scientific publications.

The enforcement stage of the patent system comes into play once the patent is

issued. In case of infringement of patent claims, the infringer of a patent can be tried

in court in a patent litigation. The enforcement stage revisits the steps taken in

Page 25: INFORMATION RETRIEVAL ACROSS MULTIPLE INFORMATION …

CHAPTER 1. INTRODUCTION 7

acquisition stage, and can invalidate an entire patent based on its findings. The

documents involved in the enforcement stage include patent applications, issued

patents, file wrappers, court cases, other forms of prior art including scientific

publications, and appropriate chapters of the United States Code (U.S.C.) and the

Code of Federal Regulations (C.F.R.).

1.3.2 RELATED WORK

The patent system contains a wealth of technology-related information, distributed under a regulatory system. Many government agencies are moving toward digital libraries to publish and archive information [133]. Information Technology (IT) is becoming indispensable to government operations and to facilitating access to government data.

Continuing development of IR methodologies is necessary to keep up with the

information growth. Furthermore, with information being created and managed by

different organizations and agencies, establishing interoperability between the

information is essential [56,57]. Since the patent system covers technical and legal

domains and involves a variety of information sources, a thorough literature review

would require studying the current state-of-the-art IR methods for each source, and in

each domain. Our attempt is to abstract methodologies and recent related works that

are most applicable to facilitating IR in the patent system.

Almost all information sources and documents contain metadata. For example, all documents have a title, a date, and so on. Such metadata is rather generic and is not tied to a

specific domain. For this reason, IR methods such as link analysis, citation analysis,

bibliographic ranking, and other metadata related approaches are commonly used in

both legal texts and technical texts [43,79,100]. However, such IR mechanisms based

simply on metadata are not sufficient and are typically used in conjunction with term-based IR models such as the Vector Space Model (VSM) [111]. The VSM, which

represents each unique word in a corpus as a separate dimension, suffers from the ‘curse-of-dimensionality’ in the sense that the high number of dimensions causes data sparseness and other computational issues [84]. To overcome the ‘curse-of-dimensionality’ of the VSM, Latent Semantic Indexing (LSI) and its variants such as

the Probabilistic LSI have gained interest in both legal and technical IR communities

[8,29,60,84,135]. Other alternative models include the Okapi Probabilistic Retrieval

Framework and the Divergence From Randomness (DFR) probabilistic model

[4,97,108].
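As a concrete illustration of the VSM just described, the sketch below builds raw term-frequency vectors over a toy corpus and ranks by cosine similarity. Real systems typically apply tf-idf weighting and normalization; plain counts keep the example short, and the code is illustrative rather than any cited system's implementation.

```java
import java.util.*;

// Vector Space Model sketch: each unique term is a dimension, documents
// become term-frequency vectors, and relevance is the cosine of the angle
// between the query vector and each document vector.
public class VectorSpaceModel {

    // Split text on non-word characters and count term occurrences.
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity between two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double na = 0, nb = 0;
        for (int v : a.values()) na += (double) v * v;
        for (int v : b.values()) nb += (double) v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = termFreq("production of erythropoietin in mammalian cells");
        Map<String, Integer> query = termFreq("erythropoietin production");
        System.out.printf("similarity = %.3f%n", cosine(query, doc));
    }
}
```

The dimensionality problem discussed above is visible even here: every distinct corpus term adds a dimension, so realistic corpora yield extremely sparse vectors.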

The combination of the large amount of information with the lack of standard

terminology among the diverse information sources renders simple term-based models

ineffective. Simple term-based models are not sufficient to capture user context and

information needs. LSI attempts to capture the relationships between terms in similar

context, addressing issues such as synonymy and polysemy, but the method is not

sufficient to capture the domain content of the information. Domain knowledge,

created by experts of a specific domain, can be valuable to conceptualize and express

the semantics, i.e. the meaning of terms and their relationships. Thus, newer methods

often incorporate the semantics of a domain through external knowledge sources such

as ontologies, taxonomies, vocabularies and thesauri [35,88]. Knowledge-based

approaches are commonly applied to technical information sources such as

publications [35], and are slowly moving into the legal domain as well. In fact, legal

ontologies and general language ontologies such as WordNet are becoming popular in

IR applications [39].

Most research in IR has focused on a single domain (such as biomedicine) and a

single information source (such as scientific publications). The nature of the problem

we are addressing demands that information be retrieved from multiple sources and

domains. The problem of IR from multiple diverse information sources has been

studied [8,9,84]. In order to facilitate multi-source IR, both structural and semantic

interoperability are required [115,131]. Semantic interoperability can be achieved

through the use of several domain, legal, and general-purpose ontologies [57,103].

Studies have made use of top level ontologies to provide structural interoperability

between sources [131].

Automatic ontology learning techniques aim to extract relevant concepts and their

relationships from a corpus. Recent interest in ontologies has led to research in the

area of automatic ontology learning [103,139]. Unlike biomedicine, most technical

domains lack domain ontologies that sufficiently cover the sub-topics. Hence,

automatic ontology learning techniques can be potentially used to learn concepts and

relationships where domain knowledge is sparse.

Patents and patent statistics provide valuable information. For example, they can be used as indicators to study technological growth and change, knowledge flows, the position of companies and organizations in a technological space, and so on [52]. Similarly, data analysis, data mining, and machine learning are relevant in IR

research. Some of the more important methodologies include NLP techniques such as

feature extraction and statistical parsing [14,40,75]. Feature extraction has been used

to identify concepts such as genes and person names, which can provide important information to be incorporated into search mechanisms. Although it is believed

that NLP techniques cannot be easily applied to legal texts [21], studies are attempting

to parse patent claims and other legal corpora to extract important phrases, extract dependencies between terms, and facilitate machine translation [114].

1.4 THESIS OUTLINE

In this research, we address the problem faced in accessing information across

multiple diverse yet related sources in the US patent system. We discuss our

methodology to improve information source interoperability through the use of

ontologies. We present an IR framework to improve retrieval of related documents

from the patent system. The work also presents a preliminary study on user relevancy


and user experience through the implementation of a fully functional tool built upon

the IR framework. The thesis is organized as follows:

Chapter 2 discusses the development of the document repository. We

specifically deal with four types of documents – (1) issued patents; (2) court

cases; (3) patent file wrappers; and (4) scientific publications. We present a

detailed description of the challenges faced in gathering information and of the current state-of-the-art tools used to access and parse relevant information from the documents. A thorough evaluation of the document repository is provided to

ensure the accuracy of the parsed information. The chapter also discusses the

text indexes and schemas which are used throughout this study for inspection

and analysis.

Chapter 3 discusses our methodology in three parts. The first part explores the

use of domain ontologies to tackle terminological inconsistencies and

incorporate domain specific semantics to improve access and retrieval within a

single information source. The effects of ontology selection, document

structure, and indexing schemes on the methodology are explained. The second

part discusses our Patent System Ontology (PSO), designed to improve

structural interoperability between information sources. One of the key

contributions of the PSO is the ability to express cross-referenced information

and user heuristics using declarative syntax. In the third part, we combine the

application of domain ontologies and the PSO to illustrate a powerful

methodology for improving information access and retrieval in the patent

system. We briefly discuss user relevancy feedback techniques and attempt to

illustrate the importance of good user experience through a well-designed tool.

Chapter 4 presents an evaluation and analysis of our methodology. Through

several use case scenarios, the practical applications and the potential impacts

of this multi-disciplinary research are discussed. The analysis provides a solid


foundation for potential future work and studies the requirements to develop a

valuable exploratory tool.

Chapter 5 summarizes the contents of this thesis and discusses the broader

impacts of the research.


Chapter 2.

DOCUMENT REPOSITORY

2.1 INTRODUCTION

The complexity of the patent system makes retrieval of relevant information a

challenging task. The existence of multiple information silos results in heterogeneity

in almost every aspect of the documents – in structure, in semantics, in format and in

system [115]. Information pertaining to one type of document may be available from several sources within the same silo. For example, the information silo representing scientific literature could comprise repositories such as IEEE Xplore, ACM, and PubMed [2,64,105]. Similarly, patent documents can be accessed from multiple

sources such as USPTO and Google Patents [50,129]. Modern-day applications demand a high degree of integration between information sources to facilitate cross-domain Information Retrieval (IR). As explained in Chapter 1, there is a lack of a

standardized framework that facilitates information integration and the development of

tools to improve accessibility and retrieval of documents. The first step towards

developing such a framework involves studying the state-of-the-art publishing

standards and accessibility tools. This is best learnt by constructing a representative

document repository which includes the diverse information sources that we are

addressing. In this chapter, we describe the development of the document repository


from various information silos in the patent system which will also serve as our

experimental data set for the development and evaluation of our IR framework.

Our main goal is to develop a document repository which contains a collection of

related documents that encompasses – (a) issued patent documents; (b) scientific

publications; (c) court litigations and (d) USPTO file wrappers. To our knowledge,

currently no such data set spanning the patent system readily exists. The chapter

presents a thorough discussion exposing the structural diversity, inconsistencies in

publication standards, and accessibility of these information sources, and lays the foundation for the development of our methodology discussed in Chapter 3.

The rest of the chapter is organized as follows: Section 2.2 describes our use case

and the contents of the repository. Section 2.3 discusses the challenges associated with

interfacing and accessing the information sources and our methodology to collect the

documents. The documents are typically lengthy and contain a large amount of

information. In practice, applications seldom use the entire content of the document.

By discussing examples of end user applications, relevant portions of the documents

are identified. We parse this information and convert the documents to a common

structured format. We choose XML to store the parsed information due to the

abundance of supporting software libraries and parsing tools.6 Section 2.4 presents a

formal evaluation of the data set to ensure usability. The document repository is

implemented using Apache Lucene, a widely used text mining library. Using a Java

interface, the XML files created in Section 2.3 are used to build and search the

document repository. Section 2.5 describes the text indexes implemented using

Lucene, which serve as the basis for the preliminary evaluations of our methodology.

Section 2.6 summarizes related work and discusses potential future directions.

6 For a list of XML parsers in Java, see http://en.wikipedia.org/wiki/Java_XML (Accessed on 03/01/2012).
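As background on what such a text index does, here is a minimal, stdlib-only Java sketch of the inverted index idea at Lucene's core: each term maps to the set of documents containing it, and a conjunctive query is answered by intersecting posting sets. Lucene itself adds tokenization/analysis, relevance ranking, and on-disk storage; this toy is illustrative only, not our Lucene-based implementation.

```java
import java.util.*;

// Minimal inverted index: term -> sorted set of document ids containing it.
public class MiniIndex {

    final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index a document: record its id under every term it contains.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // All documents containing every query term (boolean AND).
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            Set<Integer> docs = postings.getOrDefault(term, Set.of());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        MiniIndex index = new MiniIndex();
        index.add(1, "Production of erythropoietin");
        index.add(2, "Erythropoietin receptor antibodies");
        index.add(3, "Vascular stent coating");
        System.out.println(index.search("erythropoietin")); // docs 1 and 2
    }
}
```
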


2.2 USE CASE

Recent advances in the biomedical domain have led to the creation of several

external and manually curated knowledge-bases and ontologies, far more than most

other disciplines. This prompted us to choose our use case in the biomedical domain,

since it will give us immediate access to the existing knowledge to implement our

knowledge-based approach. These recent advancements are also reflected in the patent system, as evident from the increased number of patent applications, scientific

publications, and court activities. For example, in 2010, among the 219,614 patents

that were granted by the USPTO, 21,840 patents were roughly classified as chemical

patents (an increase of 5,672 patents from the previous year)7; MEDLINE, a database which consists of over 21 million citations from over 5,000 journals, has accepted 38 new journal titles between June 2011 and October 2011.8

We build the document repository around the concept ‘erythropoietin’, a hormone responsible for the production of red blood cells in the human body. The synthetic

production of erythropoietin has led to the treatment of chronic diseases such as

anemia. Epogen, the brand of synthetic erythropoietin manufactured by the pharmaceutical giant Amgen Inc., is protected by five core patents, namely US 5,547,933, US 5,618,698, US 5,621,080, US 5,756,349, and US 5,955,422. These

patents have been central to many related court cases involving other pharmaceutical

companies such as Hoechst Marion Roussel and Transkaryotic Therapies, and heavily

cite scientific literature from top journals.

In order to compensate for terminological inconsistencies caused by synonymy, hyponymy, and abbreviations, we identified 43 concepts related to

7 USPTO statistics can be accessed at http://www.uspto.gov/web/offices/ac/ido/oeip/taf/stchem.htm (Accessed on 03/01/2012).
8 Information regarding newly accepted journal titles in MEDLINE can be accessed at http://www.nlm.nih.gov/bsd/lstrc/new_titles.html (Accessed on 03/01/2012).


erythropoietin (“the 43 concepts” hereafter) by searching bio-ontologies for

synonyms, subclasses, and super-classes in BioPortal [95]. As of January 2010, we

downloaded the top 50-100 documents for each of the 43 concepts from the USPTO

issued patent database, collecting a total of 1150 patent documents. The 1150 patent

documents cover various aspects of erythropoietin such as its production, its usage, and related procedures. Hence, using the 43 concepts to gather the documents provides us with a data set broad enough to cover the erythropoietin

use case. Among these 1150 patent documents, we identified 135 highly relevant

patent documents (“the 135 patents” hereafter) by following the forward and the

backward citations to the five core patents. The 135 patents will serve as our ground

truth for any experiments that are to follow.
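The citation-following step described above can be sketched as a simple traversal over a citation map: starting from the core patents, collect every document they cite (backward citations) and every document that cites them (forward citations). The patent numbers below are the five core Epogen patents, but the citation map itself is illustrative dummy data, not the real citation lists.

```java
import java.util.*;

// Sketch of assembling a citation-based ground truth from a core patent set.
public class CitationGroundTruth {

    // cites: docId -> set of documents it cites (its backward citations).
    static Set<String> related(Map<String, Set<String>> cites, Set<String> core) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : cites.entrySet()) {
            if (core.contains(e.getKey())) {
                result.addAll(e.getValue());          // backward: cited by a core patent
            } else if (!Collections.disjoint(e.getValue(), core)) {
                result.add(e.getKey());               // forward: cites a core patent
            }
        }
        result.removeAll(core);                       // keep only the neighbors
        return result;
    }

    public static void main(String[] args) {
        Set<String> core = Set.of("5547933", "5618698", "5621080", "5756349", "5955422");
        Map<String, Set<String>> cites = Map.of(
                "5955422", Set.of("4703008"),         // a core patent citing earlier work
                "7645898", Set.of("5955422"),         // a later patent citing a core patent
                "6291243", Set.of("4999291"));        // unrelated document
        System.out.println(related(cites, core));     // the two citation neighbors
    }
}
```
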

In order to gather court documents, we searched several court litigations dating back to the 1980s. The repository contains 30 court documents (“the 30 court

documents” hereafter) which directly or indirectly involve Amgen Inc. and the five

core patents. This search was performed using erythropoietin and the 43 concepts on

Google Scholar [51].

PubMed is a comprehensive index of over 5000 biomedical journals indexing over

20 million MEDLINE citations [85,105]. In building our approach towards multiple

information source retrieval, we also wish to study the application of biomedical

ontologies in each individual document domain. For this purpose, we would like to

have a comprehensive biomedical dataset that we can experiment on. The Text

Retrieval Conference (TREC) organized by the National Institute of Standards and

Technology (NIST) is a well-known and prestigious competition that produces high-quality datasets every year.9

9 Information regarding the Text Retrieval Conference (TREC) can be accessed at http://trec.nist.gov/ (Accessed on 03/01/2012).

The TREC 2007 Genomics data set consists of over 162,000 scientific publications from 49 prominent biomedical journals.10 The data set

provides a well-defined ground truth for experimentation with around 36 topics

representing varying information needs. However, in building the dataset for our use

case of multi-domain retrieval, we must first identify documents related to the use

case. We listed over 3000 publications (“the 3000+ publications” hereafter) following

citations from the 135 patents as the ground truth in the publication domain. Out of the

3000+ publications, we identified around 1,737 publications in the TREC dataset.

Section 2.3.3 provides a more detailed explanation of how this mapping is made.

A patent document is the outcome of years of negotiations between the patent

office and the applicant. All the negotiations, including the original application, office

actions, amendments, and the final issued patent document are bundled together in a

file history or file wrapper. File wrappers provide very detailed information about the patent, including the original claims, the final claims, and the added and deleted citations. Such information is critical in defining the scope and validity of the patent, especially during the litigation or enforcement phase. Due to the logistics involved in

gathering file wrappers (described in detail in Section 2.3.4), currently our corpus

includes only one file wrapper for the core patent U.S. 5,955,422.

All in all, this document repository represents the unique problem in IR involving

multiple information sources in the patent system and provides an experimental

platform for developing and evaluating our methodology.

2.3 DOCUMENT COLLECTION AND PARSING

A quick study of the information sources reveals several inconsistencies in terms of publication standards, document structure, and accessibility. For example – (1) PubMed provides scientific publications in well-defined XML files while USPTO provides issued patents as HTML files; (2) PubMed provides APIs or web services to access information while USPTO still lacks such interfaces; and so on.

10 The 2007 TREC Genomics track can be accessed at http://ir.ohsu.edu/genomics/2007data.html (Accessed on 03/12/2012).

In this section,

we explore available sources for documents, their publication standards and available

web-services to programmatically access data. Whether we deal with patents or

scientific publications, each document contains a wealth of information. Applications

seldom use the entire information available in a document, but rather use specific and

much smaller portions such as the metadata, or simply the title and so on. When

viewed from an application’s (or user’s) point of view, it becomes clear what aspects

of the documents are crucial. This gives us an estimate as to how much metadata and

textual information we need to parse from the documents to make the data set useful.

We study the structure of the documents and develop parsers to extract information

from the documents. We place the extracted data from the documents in well-defined

XML files in order to maintain a consistent format.
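To illustrate the kind of conversion involved, the sketch below emits two parsed fields as a well-formed XML record using the standard Java DOM API. The element names follow the sample shown later in Figure 2.2, but the helper itself is a hypothetical illustration, not the production parser developed for this thesis.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Emit extracted patent fields as a small, well-formed XML record.
public class PatentXmlWriter {

    static String toXml(String title, String assignee) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element patent = doc.createElement("Patent");
            doc.appendChild(patent);

            Element t = doc.createElement("Title");
            t.setTextContent(title);              // parsed field -> XML element
            patent.appendChild(t);

            Element a = doc.createElement("Assignee");
            a.setTextContent(assignee);           // DOM escapes special characters
            patent.appendChild(a);

            StringWriter out = new StringWriter();
            Transformer tr = TransformerFactory.newInstance().newTransformer();
            tr.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            tr.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("Production of erythropoietin", "Kirin-Amgen, Inc."));
    }
}
```

Using the DOM API rather than string concatenation guarantees the output stays well-formed even when field values contain characters such as `&` or `<`.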

2.3.1 PATENTS

There are over 41 different patent-issuing authorities across the world, including

the European, Japanese, and German Patent Offices [38,44,68]. The Derwent World

Patents Index (DWPI) is one of the largest patent databases with documents indexed

from 41 patent-issuing authorities [35]. HeinOnline, LexisNexis and WestLaw are

other libraries for IP related legal information [58,78,134]. Google now makes all

USPTO products freely available online [49]. Thomson Innovation and Dialog LLC

provide tools to help in information mining of patent documents and other scientific

literature through services such as Delphion and Web of Science [123,124]. Our

current focus involves only the USPTO. The USPTO maintains a public database for

issued patents, patent applications, copyrights, and trademarks. There are currently

over 7 million patents issued in the US. In 2009 alone, 485,312 patent applications

were filed with the USPTO. Proprietary websites like Delphion do not allow


automated downloading or crawling of patent documents. Moreover, since any new document is first published by the USPTO, we use the USPTO as our source for patent documents.

To our knowledge, USPTO does not provide a standard API to access and

download documents. However, patents can be downloaded as HTML pages using a simple script based on wget.11 Currently, we do not download images or figures, as our methodology primarily focuses on text. It must be noted that USPTO maintains full text only for documents issued after 1973. If necessary, patents issued prior to 1973 are available, but only as image files. The wget script we developed takes two forms of

input – (a) a list of patent numbers we wish to download; and (b) a list of keywords.

We manually downloaded the five core patents, and generated the list of patents we

ultimately wish to download by parsing the backward and forward citations. This gave

us a list of the 135 patents which form the ground truth for the use case. Next, we used the 43 concepts as a list input to the script and downloaded the top 50-100 documents for each concept. Including the 135 patents, the script downloaded a total of 1150 patent documents. Upon downloading the HTML files, the full text is parsed and stripped of all HTML tags using available HTML parsers.12

A patent document is essentially a combination of two distinct sections: one that is entirely technical and the other that is entirely legal. Applications dealing with patent

documents have specific requirements that deal with smaller portions of the

documents. For example, a common strategy involves filtering documents based on

only the abstract and technology class until a manageable list of a few hundred

documents is obtained [113]. At this stage, the claims and full technical description

may be referred to with more importance. Patent claim invalidations strongly emphasize the claims, the limitations, and the priority dates.

11 Wget is a tool to retrieve information from web servers: http://www.gnu.org/software/wget/ (Accessed on 03/01/2012).
12 The HTML parser used in our research can be downloaded from http://htmlparser.sourceforge.net/ (Accessed on 03/01/2012).

Citations including

both patents and scientific literature can hold important information. Infringement

analysis often requires specific information such as priority dates and application information. Additionally, patent documents contain valuable metadata

information such as inventors, assignees, technology classifications, etc., which can

act as filters to quickly narrow the search to appropriate results.

Although the patent documents are not explicitly marked up, fields in the

documents are clearly defined with section headers (see Figure 2.1). Using these

section headers as markers, we carefully coded a regular expression based script to

parse and extract various fields. Since the documents we are parsing span several years, there are subtle variations among them that can cause parsing

inaccuracies. Moreover, some documents have information that is missing in others.

For example, some documents contain an additional ‘Assistant Examiner’ field. This

requires a regular expression, or a set of regular expressions, which can handle multiple such cases. The regular expressions are implemented in both Java and Perl.

United States Patent 5,955,422: “Production of Erythropoietin”
September 21, 1999
Abstract: Disclosed are novel polypeptides possessing part or all of the primary structural conformation … of mammalian erythropoietin (“EPO”) … polynucleotides in a heterologous cellular or viral sample prepared from, e.g., DNA present in a plasmid or viral-borne cDNA or genomic DNA “library” …
Inventors: Lin; Fu-Kuen (Thousand Oaks, CA)
Assignee: Kirin-Amgen, Inc. (Thousand Oaks, CA)
Claims: 1. A pharmaceutical composition comprising a therapeutically effective amount of human erythropoietin …
Description: ….

Figure 2.1: Sample Patent Document
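The extraction step can be sketched as follows. The two patterns shown are simplified illustrations against the section headers visible in Figure 2.1; the real script uses many more expressions and handles per-field variations (such as the optional ‘Assistant Examiner’ line mentioned above).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Section-header-driven field extraction from the stripped text of a patent
// page: one regular expression per field, each capturing the rest of the line.
public class FieldExtractor {

    static final Map<String, Pattern> FIELDS = Map.of(
            "Inventors", Pattern.compile("Inventors?:\\s*(.+)"),
            "Assignee", Pattern.compile("Assignee:\\s*(.+)"));

    static Map<String, String> extract(String text) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, Pattern> e : FIELDS.entrySet()) {
            Matcher m = e.getValue().matcher(text);
            if (m.find()) out.put(e.getKey(), m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String page = "Inventors: Lin; Fu-Kuen (Thousand Oaks, CA)\n"
                + "Assignee: Kirin-Amgen, Inc. (Thousand Oaks, CA)\n";
        System.out.println(extract(page));
    }
}
```

Each extracted value can then be written into the corresponding element of the XML output format.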

Once the HTML tags are stripped out using a standard HTML parser, the text

information is passed as an input to the script, which extracts the information and

converts it into a fully marked up XML document. Figure 2.2 shows a sample patent

in the resulting XML format. Since patent documents are very lengthy, only a small

portion of the document is shown in the figure. A full list of the extracted fields is displayed in Table 2.1.13,14

13 The International Patent Classification system codes can be accessed at http://www.wipo.int/classifications/ipc/en/ (Accessed on 03/01/2012).
14 The United States Patent Classification codes can be accessed at http://www.uspto.gov/web/patents/classification/ (Accessed on 03/01/2012).

<Patent>
  <Title>Production of erythropoietin</Title>
  <Assignee>Kirin-Amgen, Inc.</Assignee>
  ….
  <Inventor>Lin Fu-Kuen</Inventor>
  ….
  <Citation>3033753</Citation>
  <InwardCitation>7645898</InwardCitation>
  <InwardCitation>7645733</InwardCitation>
  ….
  <Pub>The Polycythemias: Diagnosis and Treatment,</Pub>
  ….
  <Claim>A process for the preparation of an in vivo biologically active erythropoietin product comprising the steps of…</Claim>
  ….
</Patent>

Figure 2.2: Sample Patent XML Document

2.3.2 COURT CASES

Court documents can be obtained from several sources. Public Access to Court Electronic Records (PACER) is an electronic system to access the databases of the 94

District Courts and 13 Courts of Appeals (CAFC) [99]. PACER is an initiative toward

developing a centralized system for accessing court data and contains the most

updated information. DocketX is a privately owned company which has taken up the task of converting all PACER documents into full text [34]. Their services are

currently available under a paid subscription. Other sources for case documents

include LexisNexis, WestLaw, and Google Scholar which may also provide additional

supporting materials such as case analyses etc. [51,78,134].

Table 2.1: Patent XML Element Descriptions

Field                   Description
Patent Number           Unique document identifier provided by the USPTO
Date of Issue           The date from which the patent is considered active
Inventor                The inventor of the patent
Inventor Location       The location is often used in knowledge transfer research studies [67]
Assignee                The individual or company that owns the patent
Assignee Location       Location of the patent owner
Title                   The title of the patent document
Abstract                The abstract of the patent document
Examiner                The examiner who examined the patent application
IPC Classification      Technology class as per the International Patent Classification system
US Classification       Technology class as per the USPTO classification system
Claims                  Statements indicating the legal scope of the invention
Technical Description   The remaining portion of the patent document

15 A court docket is the official summary of the proceedings of a case.

Unlike USPTO, PACER has several challenges which make it hard to automatically fetch documents and docket information.15 Firstly, PACER does not provide a keyword-based search. Documents must be searched using specific metadata such as the party involved, case numbers, or case types.16

Secondly, the case documents are available in image form and are sometimes very illegible. This makes full-text extraction very cumbersome, as an immense amount of time must be spent on

manual curation, even after using modern Optical Character Recognition (OCR)

techniques. Moreover, each court database must be manually searched for the

specified search criteria. Being a paid service, this can be both time consuming and

economically infeasible.

Sources such as LexisNexis and Google Scholar provide a keyword based search.

Case documents are available in several formats from which full text can be easily

extracted. A search for erythropoietin and its related concepts resulted in around 30

documents. However, neither LexisNexis nor Google Scholar has APIs or web

services that can be used to automate downloading a large number of case documents.

Hence, we manually downloaded the 30 court cases as text documents. A docket is an

official summary of the proceedings in a court. Ideally, we would also include

docket information, which our corpus does not currently hold, as it is critical to

some applications and users.

As of today, there are millions of active patents in various technology classes.

Many patent infringement cases are filed every year. For example, the number of

patent infringement appeals in fiscal year 2011 increased to 426, a 7.5%

increase over the average of the past four years.17

This clearly establishes the importance

of court documents in both the patent acquisition and enforcement stages. Information

such as the plaintiff, defendant, court name, case title, and case type are

important fields for any application dealing with court cases. “Designing around an

existing patent” typically uses information such as the patents involved, important scientific literature citations, and names of inventors. This information is available in the body of the court cases.

16 PACER uses Nature of Suit codes to classify cases. Code 830 represents patent infringement cases and must be used when searching PACER.
17 Statistics related to court litigations can be accessed at http://www.cafc.uscourts.gov/the-court/statistics.html/ (Accessed on 03/01/2012).

Unfortunately, the information contained in the body of a court case is not

standard across all patent litigations. As a result, the court documents downloaded

from LexisNexis are far less structured than patent documents

downloaded from the USPTO (see Figure 2.3). Since we are dealing with a small number

of documents (around 30), we manually parsed and marked up the data into XML

files. Currently, the marked up fields include (a) Case Title and Number; (b) Plaintiff;

(c) Defendant; (d) Court Type and Name; (e) Case Type; (f) Date of proceeding/hearing or decision; (g) Presiding Judge; (h) Patents Involved; and (i) General Case Body (see Figure 2.4).

927 F.2d 1200 (1991)
AMGEN, INC., Plaintiff/Cross-Appellant,
v.
CHUGAI PHARMACEUTICAL CO., LTD., and Genetics Institute, Inc.,
Defendants-Appellants. Nos. 90-1273, 90-1275.
United States Court of Appeals, Federal Circuit. March 5, 1991.
Suggestion for Rehearing Declined May 20, 1991.
Before MARKEY, LOURIE and CLEVENGER, Circuit Judges.
THE PATENTS On June 30, 1987, the United States Patent and Trademark Office (PTO) issued to Dr. Rodney Hewick U.S. Patent 4,677,195, entitled "Method …” … claims of the '195 patent are:
1. Homogeneous erythropoietin characterized by a molecular weight of about 34,000 Daltons … 280 nanometers.
3. A pharmaceutical composition for the treatment of anemia … homogeneous erythropoietin … vehicle.
4. Homogeneous erythropoietin … 34,000 Daltons on SDS PAGE … 280 nanometers.
DISCUSSION

Figure 2.3: Sample Court Case Document

2.3.3 PUBLICATIONS

In the biomedical domain, PubMed is the most comprehensive and updated library

indexing over 5000 biomedical journals from areas like medicine, nursing, pharmacy,

dentistry, healthcare, biochemistry, and bioinformatics. The National Center for

Biotechnology Information (NCBI) uses Entrez to search and retrieve data from several

databases including PubMed, nucleotide databases, protein structures, and many more

[92]. Such databases can be very valuable in terms of providing additional knowledge

that can be applied to searching scientific publications. Our current focus is on

retrieving scientific publications from PubMed. There are alternatives to search

PubMed other than Entrez. GoPubMed is a search engine which searches PubMed with the help of annotations from the Gene Ontology (GO) [7,35]. HubMed is a new interface which provides many features to browse and search the PubMed repository [36].

<Case> <Title>Amgen, Inc. v. Chugai Pharmaceutical Co., Ltd.</Title>
<CaseNum>706 F. Supp. 94</CaseNum>
<Plaintiff>Amgen, Inc.</Plaintiff>
…. <Defendant>Chugai Pharmaceutical Co., Ltd.</Defendant>
<Defendant>Genetics Institute, Inc.</Defendant>
…. <Misc>Civ. A. No. 87-2617-Y.</Misc>
<Court>United States District Court, D. Massachusetts.</Court>
<Date>January 31, 1989.</Date>
….. <Judge>YOUNG, District Judge</Judge>
<Body> This action involves the alleged infringement of several patents covering erythropoietin, a protein which circulates in the blood and stimulates the production of red blood …
</Body>
</Case>

Figure 2.4: Sample Court Case XML Document

We must note that PubMed is only an index of biomedical publications. Hence,

the full text of an article may not be readily available. Access to the full text

of biomedical publications can be very important in determining relevancy. For this

purpose, we download the latest TREC genomics dataset (2007), which has been

widely used in the TREC competitions organized by NIST. The TREC dataset (fully

downloaded and indexed on our local computer) contains over 162,000 documents

from 49 journals (dated after 1994). These are supported with their respective

MEDLINE citations and are referred to by their unique PubMed IDs (PMID) (see

Figure 2.5). Any services and databases managed by the National Library of

Medicine (NLM) have very well defined Document Type Definitions (DTDs).18,19

Citations conforming to the DTD can be

alternatively downloaded in XML format via Entrez.

MeSH descriptors are typically a group of concepts in the MeSH vocabulary

which describe a topic or a set of topics that the scientific article refers to [90]. In

some sense, this can be viewed as a classification scheme for the publications

according to the MeSH ontology. MeSH descriptors are valuable and could play an

important role during the search and retrieval process [120]. In our data set, we choose

to work with a smaller subset of the MEDLINE DTD. Using standard XML parsers,

we specifically extract – (a) list of authors; (b) article title; (c) journal title; (d) PMID;

(e) abstract; (f) MeSH descriptors; and (g) MeSH qualifiers. Currently we are not

indexing the publication-publication citations although they would provide yet another

valuable set of information to enhance the search and retrieval process. However, if needed, the Entrez DTD provides the missing publication-publication citation information. Since the services offered by the National Library of Medicine (NLM) provide well defined DTDs, updating our local index with newly parsed elements, should we decide to do so in the future, should be trivial.

18 The Document Type Definitions for files hosted by NLM can be accessed at http://www.nlm.nih.gov/databases/dtd/ (Accessed on 03/01/2012).
19 The descriptions of the DTD elements for databases hosted by the NLM can be accessed at http://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html (Accessed on 03/01/2012).
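The extraction of the fields listed above can be sketched with Python's standard XML parser; the element names follow Figure 2.5, but the helper below is an illustrative sketch, not the parser used in this work.

```python
import xml.etree.ElementTree as ET

# A trimmed MEDLINE-style citation, following the element names in Figure 2.5.
SAMPLE = """
<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID>10022466</PMID>
    <Article PubModel="Print">
      <Journal><Title>The Journal of clinical endocrinology and metabolism</Title></Journal>
      <ArticleTitle>About the use of an ACTH 1-39 assay</ArticleTitle>
      <AuthorList>
        <Author><LastName>Grino</LastName><Initials>M</Initials></Author>
      </AuthorList>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName MajorTopicYN="Y">Corticotropin</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>
"""

def parse_citation(xml_text):
    """Pull the subset of MEDLINE fields we index out of one citation."""
    root = ET.fromstring(xml_text)
    return {
        "pmid": root.findtext(".//PMID"),
        "article_title": root.findtext(".//ArticleTitle"),
        "journal_title": root.findtext(".//Journal/Title"),
        "authors": [a.findtext("LastName") for a in root.iter("Author")],
        "mesh_descriptors": [d.text for d in root.iter("DescriptorName")],
    }

record = parse_citation(SAMPLE)
```

Because the NLM DTDs are stable, the same handful of path expressions works across every citation downloaded via Entrez.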

<PubmedArticle> <MedlineCitation Owner="NLM" Status="MEDLINE">

<PMID>10022466</PMID>

<DateCreated> <Year>1999</Year> <Month>02</Month> <Day>25</Day>

</DateCreated>

…. <Article PubModel="Print">

<Journal>

…. <JournalIssue CitedMedium="Print">

<Volume>84</Volume> <Issue>2</Issue>

….

</JournalIssue> <Title>The Journal of clinical endocrinology and metabolism</Title>

<ISOAbbreviation>J. Clin. Endocrinol. Metab.</ISOAbbreviation>

</Journal> <ArticleTitle>About the use … of an ACTH 1-39 ….</ArticleTitle>

…. <AuthorList CompleteYN="Y">

<Author ValidYN="Y">

<LastName>Grino</LastName>

<ForeName>M</ForeName>

<Initials>M</Initials>

</Author>

….

</AuthorList>

….

<MeshHeadingList>

<MeshHeading> <DescriptorName MajorTopicYN="Y">Corticotropin</DescriptorName>

</MeshHeading>

….

Figure 2.5: Sample Publication in XML


2.3.3.1 Identifying Ground Truth from TREC Corpus

The TREC corpus provides an excellent experimentation platform. However, we

must identify which of the 3000+ publications cited by our patents also exist in the TREC corpus. In

PubMed, publications use PMIDs to cite other publications. However, patent

documents from USPTO do not follow the same citation standards. Hence, in order to

identify the 3000+ publications that co-exist in the TREC corpus, we need to parse

each citation string in the patent documents and somehow identify its PMID. PubMed

provides a citation matcher tool which allows us to map any information we have to a

specific citation and hence a unique PMID. However, this is not an easy task since the

citation strings parsed from the patent documents are not consistent enough to use this

tool. For example, consider the following citation strings retrieved from multiple

patent documents:

1. Hansen, Jan E. et al. 1997. "O-GLYCBASE Version 2.0: A Revised Database

of O-Glycosylated Proteins." Nucleic Acid Research. vol. 25, No. 1, pp. 278-

282. cited by other

2. Daubas et al., Nucleic Acids Research, 16(4) 1251-1271 (1988).

3. Altschul et al., "Gapped BLAST and PSI-BLAST: a new generation of protein

database search programs", Nucleic Acids Res. 25:3389-3402 (1997). cited by

other

The citations do not follow consistent patterns and hence, simple regular expression

based parsers will not perform well. Moreover, the citations are often incomplete. For

example, some citations are missing the title of the article while others may have

an incomplete author list. In addition, some citation strings use full journal titles while

others simply use abbreviations.

The TREC corpus only contains more recent articles (post-1994) while the 135

patents cite articles as early as the 1970s. We begin by listing the citation strings of


the 3000+ publications in a text file and filter out the citation strings that are not in the

TREC corpus. Our first filtering criterion removes any citation string which does not

belong to one of the 49 journals in the TREC corpus. However, due to the inconsistent

use of abbreviations and full journal titles as explained earlier, we must first convert

all citation strings to a consistent format. NLM provides standard abbreviations for

each journal.20

For each of the 49 journals available in the TREC corpus, we extract

the standard abbreviations and convert the citation strings to a consistent format as

shown below:

1. Daubas et al., Nucleic Acids Research, 16(4) 1251-1271 (1988).

2. Altschul et al., "Gapped BLAST…database search programs", Nucleic Acids

Research, …

Our second filtering criterion removes all citation strings which represent

publications dated prior to 1994. Since the TREC corpus is complete with all articles

for the 49 journals post 1994, we assume that every resulting citation string is

available in the TREC corpus. This procedure results in a total of 1737 publications

from the TREC corpus that serve as the ground truth for evaluating our methodology

within the ‘erythropoietin’ use case.
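The two-stage filter can be sketched in a few lines of Python; `ABBREVIATIONS` below is a hypothetical one-entry stand-in for the full NLM table covering all 49 journals, and the helper names are illustrative.

```python
import re

# Hypothetical stand-in for the NLM abbreviation table (J_Medline.txt);
# the real map covers all 49 TREC journals.
ABBREVIATIONS = {"Nucleic Acids Res": "Nucleic Acids Research"}
TREC_JOURNALS = {"Nucleic Acids Research"}

def normalize(citation):
    """Expand known journal abbreviations to their full titles."""
    for abbrev, full in ABBREVIATIONS.items():
        # The negative lookahead keeps us from matching inside a full title.
        citation = re.sub(re.escape(abbrev) + r"\.?(?![A-Za-z])", full, citation)
    return citation

def in_trec_corpus(citation):
    """Filter 1: journal must be one of the 49 TREC journals.
       Filter 2: publication year must be 1994 or later."""
    c = normalize(citation)
    if not any(journal in c for journal in TREC_JOURNALS):
        return False
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", c)]
    return bool(years) and max(years) >= 1994
```

Applied to the example citation strings above, this keeps the 1997 Altschul citation and drops the 1988 Daubas citation.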

2.3.4 FILE WRAPPERS

In 2003, the USPTO introduced the Image File Wrapper (IFW) system to replace

the paper based system. IFWs are publicly available on the Patent Application

Information Retrieval (PAIR) service offered by the USPTO.21

Google has recently

started indexing these documents and provides a web service to download these files

[49].

20 The standard journal title abbreviations defined by NLM can be accessed at ftp://ftp.ncbi.nih.gov/pubmed/J_Medline.txt (Accessed on 03/01/2012).
21 The Patent Application Information Retrieval system can be accessed at http://portal.uspto.gov/external/portal/pair (Accessed on 03/01/2012).

The major challenge faced with both PAIR and Google is that the files are

available only as images, which means additional processing and smart OCR algorithms

are required to extract text from them. In addition, the PAIR system blocks automatic

downloads and crawlers by enforcing CAPTCHA verification. Currently, to access file

wrappers prior to 2003, a third party agent is the best solution to convert the paper

based file wrappers to text-readable file wrappers. IFW Insight is a tool which has

indexed over 1,000 IFWs and allows one to navigate and search for critical

information contained within them [65]. However, the IFWs indexed by this tool are not

relevant to our use case.

Due to the challenges in obtaining file wrappers, we currently include only one

file wrapper in our corpus, for US 5,955,422. The file wrapper contains around 50

documents (office actions, amendments, etc.). The total length of the file wrapper is

around 500 pages. We received the file wrapper as an OCR’ed text file, which implies

the text can be copied and extracted but with some inaccuracies. Nevertheless, the file

wrapper is very useful for our preliminary experimental study.

Every patent application goes through a different cycle over a varying time frame. The time frame can be lengthy, and the recorded communications between the patent applicant and the examiner often lack structure or order. In fact, file wrappers differ so much that some contain special documents such as an interference (see Figure 2.6). The first challenge in parsing file wrappers is to deal with such unstructured information.

wrappers is to deal with such non-structured information.

In order to understand how the file wrappers can be useful, let us examine an

example. In infringement analysis, to determine whether a patent is infringed or not, it

is important to understand the scope of the claims.22

This in turn requires

understanding how the patent evolved from its original application. This involves studying how the claims, citations (both patent and scientific literature), and technical descriptions changed with every amendment or office action. File wrappers play a crucial role as they contain the information needed for this purpose.

22 The word ‘scope’ is used to represent the extent of legal protection the patent claims offer.

For example (see Figure 2.7), the examiner’s rejection letter shows the following

differences in the claims for the patent U.S. 5,955,422:

(1) Out of the original 60 claims, none were pursued further.

(2) Three additional claims were filed (claims 61-63), of which only claims 61 and 62 were accepted.

Knowing why claims were rejected could provide key information for anyone

performing an infringement analysis. Other information contained in the file wrapper

includes added or deleted references, relevant laws, and regulations which were

enforced, and so on.

Figure 2.6: Contents of a File Wrapper

Figure 2.8 shows a sample interference document which brings

out a strong relation between two patents U.S. 5,955,422 and U.S. 4,879,272, which is

otherwise not obvious from either patent. A pharmaceutical company entering the

drug market for ‘erythropoietin’ will find this information very valuable. It is worth

noticing that the interference document (shown in Figure 2.8) is very different from

the rejection (shown in Figure 2.7). Several miscellaneous documents such as the fee

structure are ignored in our model. Furthermore, each of these documents (such as

rejection, interference, etc.) is generally in the form of a letter in which important

information such as restricted claims, allowed claims, rejected claims, and

corresponding arguments is expressed in mixed form within the text (see Figures

2.7 and 2.8). The second challenge faced in parsing file wrappers is associated with (a)

modeling each of these documents individually; and (b) extracting relevant

information from unstructured text. Since we are dealing with a single file wrapper,

we manually parse information in order to facilitate some amount of experimentation.

Office Action – Rejection (Date: 11-06-1991)

During a telephone conversation with Mr. Kokulis on March 25, 1992 a

provisional election was made with traverse to prosecute the invention of Group

VII, claims 61-63. Affirmation of this election must be made by applicant in

responding to this Office action. Claims 1-60 are withdrawn from further

consideration by the Examiner, 37 CFR 1.142(b), as being drawn to a non-elected

invention.

Claim 63 is rejected under 35 U.S.C. § 112, second paragraph, as being

indefinite for failing to particularly point out and distinctly claim the subject matter

which applicant regards as the invention.

Claim 63 is vague and indefinite in the recitation of "recombinant

erythropoietin". The specification discusses several different recombinant systems

for production of EPO. It appears that different recombinant systems produce

different modifications of the protein. It is not clear that all different modifications

are intended to be encompassed by the claims.

Claims 61 and 62 are allowed.

Figure 2.7: Sample Rejection Letter (Office Action)


Specifically, we extract information such as claims and citations from (a) original

patent application; (b) amendments; (c) rejections; and (d) interference documents.

Figure 2.9 shows a sample XML representation of the file wrapper.

2.4 EVALUATION AND ACCURACY

In Section 2.3, we described our methodology to download and parse documents

to extract relevant information. The extracted information is reconstructed into XML

files using appropriate field mark-ups. In order to ensure the usability of the data, a

formal analysis of the quality of the data is required. We discuss potential sources of

errors and suggest possible solutions to reduce the errors. Section 2.4.1 evaluates the automatic parser discussed in Section 2.3.1.

Office Action – Interference (Date: 11-20-1992)

The cases involved in this interference are:
Junior Party Patentees: Naoto Shimoda and Tsutoiau Kawaguchi
…. Serial No.: 06/784,640 filed 10/04/85, Patent No. 4,879,272 issued 11/07/89
For: Method and Composition for Preventing the Absorption of a Medicine
Assignees: Chugai Seiyaku Kabushiki Kaisha, Ukina, Kita-Tokyo, Japan
...
Senior Party Applicant: Fu-Kuen Lin
…. For: PRODUCTION OF ERYTHROPOIETIN
Serial No. 007/609,741
Assignees: Amgen, Inc., Thousand Oaks, California, A Corporation of Delaware
….
Count 1 “An erythropoietin-containing, pharmaceutically acceptable composition wherein human serum albumin is mixed with erythropoietin.”
The claims of the parties which correspond to Count 1 are:
Lin: Claims 61-63
Shimoda et al.: Claims 3-4
….

Figure 2.8: Sample Interference Document

<FileWrapper> <AppNumber>957013</AppNumber>
<Date>11-06-90</Date>
<Examiner>Sharon Nolan</Examiner>
<Assignee>Kirin-Amgen Inc.</Assignee>
<Inventor>Fu-Kuen Lin</Inventor>
<Application> <Number>957013</Number>
<Claim number=“3”>A polypeptide according to claim 1 wherein the exogenous DNA sequence is a cDNA sequence</Claim>
<Description> … </Description>
….
</Application>
<Rejection> <Date>11-06-91</Date>
<RejectedClaims> <Claim>A composition according to claim 61 containing a therapeutically effective amount of recombinant erythropoietin.</Claim>
</RejectedClaims>
<AcceptedClaims> <Claim>61</Claim>
<Claim>62</Claim>
</AcceptedClaims>
<WithdrawnClaims> <Claim>A purified and isolated polypeptide having part or all of the primary structural conformation and ….</Claim>
…. <Claim>An improvement in the method for detection of a specific single stranded polynucleotide of unknown sequence in a heterogeneous cellular….</Claim>
</WithdrawnClaims>
….
</Rejection>
</FileWrapper>

Figure 2.9: Sample File Wrapper in XML

2.4.1 EVALUATION OF THE EXTRACTED PATENT DATA

Our document repository has been manually constructed from scratch. Hence, we do not have any pre-labeled ground truth against which the extracted patent data can be evaluated. For the purpose of this evaluation, we randomly choose 50 patents out of

the total 1,150 patents in the repository (~1/20th). These 50 patents are manually

marked up with the ground truth and also stored as XML files with the exact same

structure as the automatically parsed patent documents. An evaluation script compares

the automatically parsed XML files field by field with the manually marked up files.

The precision and recall for the extracted data are shown in Table 2.2.23
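An illustrative sketch of such an evaluation script (not the original) is shown below; it treats a field extraction as correct only on an exact string match with the ground truth, and the document and field names are hypothetical.

```python
# Per-field precision and recall: for each field, precision = matching
# extractions / fields extracted, recall = matching extractions / fields
# present in the ground truth.

def field_scores(parsed_docs, truth_docs, field):
    extracted = matched = present = 0
    for parsed, truth in zip(parsed_docs, truth_docs):
        if field in truth:
            present += 1
        if field in parsed:
            extracted += 1
            if parsed[field] == truth.get(field):
                matched += 1
    precision = matched / extracted if extracted else 0.0
    recall = matched / present if present else 0.0
    return precision, recall

# Two toy documents; field names follow Table 2.1.
parsed = [{"Inventor": "Fu-Kuen Lin"}, {"Inventor": "R. Hewick", "Title": "Method"}]
truth = [{"Inventor": "Fu-Kuen Lin"}, {"Inventor": "Rodney Hewick", "Title": "Method"}]
```

Running `field_scores(parsed, truth, "Inventor")` on this toy pair yields a precision and recall of 0.5 each, since one of the two extracted inventor names matches exactly.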

23 Precision is the fraction of retrieved instances that are relevant. Recall is the fraction of relevant instances that are retrieved [M2].

Table 2.2: Field-by-Field Accuracy of Extracted Patent Data

Field | Precision | Recall
Inventor | 0.96 | 1.0
Assignee | 0.96 | 1.0
Title | 1.0 | 1.0
Abstract | 1.0 | 1.0
Examiner | 1.0 | 0.96
Claims | 1.0 | 1.0
Technical Description | 1.0 | 1.0

2.5 TEXT INDEX

An inverted text index, similar to the index at the back of a book, maps every unique token (usually words or some grouping of characters in text) to all its occurrences in the corpus. The most basic indexes store only the list of documents each word appears in and support simple boolean queries using logical operators such as AND, OR, and NOT. Depending on the information need and complexity, indexes can get quite complex, sometimes even larger than the original text documents, storing additional data to support more complex queries.

Full-text search along with metadata is at the heart of many IR tools. Apache Lucene is a free text mining library completely written in Java that provides a large


variety of functions to create, modify, and search text indexes [5]. It is based on the

Vector Space Model (VSM) and supports a scoring function based on term frequency

and inverse document frequency (tf-idf) [84,111]. This section describes the

development of the text indexes used throughout this research. We provide some

necessary background related to text-indexes and Lucene. Sections 2.5.1 and 2.5.2

give brief introductions to the VSM and tf-idf respectively. Section 2.5.3 introduces

the notion of fields and how they help specific information needs. Section 2.5.4

introduces Apache Solr, a search library built on top of Lucene and describes the

indexes we developed.

2.5.1 VECTOR SPACE MODEL

The VSM is an algebraic model most commonly used in IR for representation of

documents [111]. Each document is represented in n dimensions, each dimension

representing one unique token in the vocabulary. Such a representation allows for

computing the similarity of documents with each other. A query can also be

represented as a document, which enables computing its similarity with the documents to

perform IR.

Several similarity measures are used to score documents, such as Euclidean

distance, Manhattan distance, and Jaccard similarity. Unlike Euclidean

distance, cosine similarity measures the angle between documents and is not affected

by the mere length of the document. Cosine similarity is the most preferred scoring

measure in IR:

\[
\cos(q,d)=\frac{\sum_{i=1}^{n} q_i\, d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\;\sqrt{\sum_{i=1}^{n} d_i^2}}
\]

where q and d are the VSM representations of the query and the document respectively, and n is the number of dimensions. A simple example is shown in Figure 2.10.

Figure 2.10: Cosine Similarity in VSM

2.5.2 TF-IDF

Term frequency (tf) is the number of times the term occurs in a document. It is based on the notion that the most frequent words of a document describe its major theme or content:

\[
\mathrm{tf}_{t,d}=\frac{C_{t,d}}{T_d}
\]

where tf_{t,d} is the term frequency of the term t in document d, C_{t,d} is the count of the term in the document, and T_d is the total number of words in the document.

Common words such as ‘the’, ‘if’, and ‘and’, also known as stop words, are used very frequently across all documents and do not provide any true information content. Hence, such terms should receive a low score, irrespective of their high frequencies. The inverse document frequency (idf) measures the general importance of a term in the corpus, penalizing terms that appear frequently across a large number of documents, including stop words:

\[
\mathrm{idf}_t=\log\frac{N}{\mathrm{df}_t}
\]

where t represents the term, N is the total number of documents in the corpus, and df_t is the document frequency of the term.

against a query [84]. Several modifications of tf-idf have been introduced including

Okapi BM25 and BM25F [108]. Lucene uses query boosting and document boosting

in addition to tf-idf to score documents against a query.24
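The tf, idf, and cosine formulas above can be combined into a toy scorer; this is a sketch of plain tf-idf weighting as defined in this section, not Lucene's boosted scoring, and the corpus and query are invented.

```python
import math
from collections import Counter

def tfidf_vector(tokens, corpus):
    """Weight each term by tf = count/len(tokens) and idf = log(N/df)."""
    n_docs = len(corpus)
    vec = {}
    for term, count in Counter(tokens).items():
        df = sum(1 for doc in corpus if term in doc)  # document frequency
        idf = math.log(n_docs / df) if df else 0.0
        vec[term] = (count / len(tokens)) * idf
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda vec: math.sqrt(sum(w * w for w in vec.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [
    ["erythropoietin", "claims", "patent"],
    ["erythropoietin", "anemia", "treatment"],
    ["court", "infringement", "patent"],
]
query = ["erythropoietin", "patent"]
q_vec = tfidf_vector(query, corpus)
scores = [cosine(q_vec, tfidf_vector(doc, corpus)) for doc in corpus]
```

The first document, which shares both query terms, scores highest; a term appearing in every document would get idf = log(1) = 0 and contribute nothing, which is exactly the stop-word penalty described above.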

2.5.3 FIELDS AND SCHEMA

A Field is an arbitrary portion of the document that could include textual

information or metadata. Fields may also overlap in content and are defined for

specific application needs. For example, an application that only searches for authors

of publications would clearly benefit from an index which defines a field whose

vocabulary consists of Authors in the corpus. Similarly an application which defines

specific search criteria over the Abstract of a publication would require an index

defined over only the Abstracts of publications in the corpus. Indexes can support

multiple applications by defining more than one field. For example, both applications

described above could be used over an index which defines two fields – one over the

Authors of publications and the other over the Abstract. Additionally, fields can be

scored independently, hence improving the focus of the search. Each field can be

indexed with different parameters such as different stop word lists, tokenizers25

and

filters. The storage options for each field can be independently specified. Usually,

shorter fields such as titles and metadata are stored in the index in order to be retrieved

during the searching phase. Lengthier fields are often indexed, but not stored, to save space. Having established that fields are advantageous for various reasons, they need to be predefined in an index schema. For each document type, i.e. patents, court cases, and publications, we have built a text index based on the XML schemas discussed in Section 2.3. These indexes are used for IR throughout the rest of this research.

24 Details on Apache Lucene’s scoring function can be found at http://lucene.apache.org/java/3_0_0/scoring.html (Accessed on 03/01/2012).
25 Tokenization is the process of parsing characters in a stream based on a certain pattern. For example, the white space tokenizer identifies tokens that are separated by white spaces.

2.5.4 SOLR

Apache Solr is a search library based on Lucene [6]. Solr provides several added

functionalities that are common today across many existing search

engines on the web, including aggregations, faceting, and dynamic

fields. Faceting is the process of grouping search results based on a

particular property they share. For example, a search for books on

Amazon allows the user to filter books or view the search results through the

Author facet. This functionality is extremely useful, especially when querying

large amounts of data, as it quickly narrows the results to a relevant set. We especially use the

dynamic fields feature of Solr to create a common schema for all documents. All text

based fields are configured with a suffix “_text” and all metadata fields with “_meta”.

Creating a schema this way allows us to arbitrarily modify the fields for current

documents and add new information sources without having to modify other code

interfacing with the schema.
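The suffix convention described above corresponds to dynamic field rules in Solr's schema.xml of roughly the following shape; this is an illustrative fragment using Solr's stock text_general and string field types, not our exact schema.

```xml
<!-- Any field name ending in _text: tokenized and searchable, not stored. -->
<dynamicField name="*_text" type="text_general" indexed="true" stored="false"/>
<!-- Any field name ending in _meta: stored verbatim for retrieval and faceting. -->
<dynamicField name="*_meta" type="string" indexed="true" stored="true"/>
```

With such rules in place, a patent can be posted with fields like claims_text or examiner_meta, and a new information source can introduce fields without any schema change.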

2.6 RELATED WORK

The discussion in this chapter focuses on – (1) the current publication

standards and challenges faced in accessing information from the patent system; and

(2) the unstructured nature of documents that makes additional parsing techniques a

necessity to extract relevant information. There is a great deal of research in the areas

of information interoperability, management, and extraction which are closely related

to the development of document repositories. This section provides a brief overview of

existing research related to these areas and discusses possible future extensions to our

document repositories.

2.6.1 INTEROPERABILITY, INFORMATION FRAMEWORKS AND SEMANTIC WEB

Interoperability between various entities in the government is a very important

factor [56,112]. Due to the need for interoperability, many governments are adopting

interoperability frameworks which support a wide range of document formats such as

PDF, HTML for web publishing, XML for semi-structured representation, PNG and

JPEG for images, and standard web services such as REST [56]. The problem of

interoperability is rather more general and not limited to the government sources

alone. While existing interoperability frameworks deal mainly with system

heterogeneities, the ‘linked open data’ community strongly believes that the internet

is transforming into a web of data as opposed to simply a web of documents [13]. The

goal of the semantic web is to make the information computer understandable, rather

than simply computer readable [13]. Several governments are realizing the importance

of semantics and are strongly supporting ontologies and external knowledge entities in

their interoperability frameworks [57]. One future direction is to study the impacts of

such frameworks on improving access to the information in the patent system. In the

context of scientific publications, Berners-Lee and Hendler claim – “In the next few

years, we expect that tools for publishing papers on the web will automatically help

users to include more of this machine-readable markup in the papers they produce”

[10]. Future research can also explore techniques to improve publishing of legal and

government data in the patent system with the help of automated tools to annotate

data.

2.6.2 DIGITAL REPOSITORIES

Academic institutions are increasingly using digital repositories such as DSpace

and Fedora to publish, access and archive educational material [17,27,82,125]. Such


repositories can be used to manage any form of digital data including documents.

Branin claims that such repositories are slowly being adopted by non-academic

institutions such as smaller government entities as well [17]. Studying how such

repositories can help advance the current state of information management in the

patent system can be a fruitful area of research for several reasons. Firstly, digital

repositories such as DSpace and Fedora comply with standards for repository

interoperability such as OAI-PMH.26 Additionally, they support the use of ontologies such as Dublin Core27 for metadata and domain knowledge based on OWL/RDF, which can be layered on top of the digital repositories to improve retrieval

[82,125]. While DSpace, Fedora, and the like are still very much evolving, they offer

a lot of potential for growth and integration with existing database and text indexing

technologies and thus provide a very strong platform for building document

repositories.
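The OAI-PMH harvesting mentioned above is easy to make concrete. The sketch below builds a ListRecords request URL of the kind a DSpace or Fedora instance would answer. The repository base URL is a hypothetical placeholder; the `verb` and `metadataPrefix` arguments are defined by the OAI-PMH specification, with `oai_dc` (Dublin Core) support mandatory for compliant repositories.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real DSpace/Fedora installation exposes its own.
BASE_URL = "https://repository.example.edu/oai/request"

def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for metadata harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional selective harvesting by set
    return base_url + "?" + urlencode(params)

print(build_list_records_url(BASE_URL))
# https://repository.example.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc
```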

2.6.3 DOCUMENT PARSING AND INFORMATION EXTRACTION

Feature extraction and document parsing involve several subtasks based on

Natural Language Processing and related fields [84]. The addition of these extracted

features can potentially enhance the quality of the document repository by aiding in

browsing and retrieval [77]. Named Entity Recognition (NER) has been used to

categorize terms in text into biomedical entities such as genes and drugs [40]. The

information in documents not only exists as terms or shorter phrases, but also in the

form of longer sentences and fields such as the claims of a patent. Difficulties in

parsing patent claims and potential solutions to the same have been discussed

[114,116,130]. In general, identifying claims in text can provide important information

about that document. Blake discusses a methodology based on statistical parsing of

26 Open Archives Initiative – Protocol for Metadata Harvesting.

http://www.openarchives.org/OAI/openarchivesprotocol.html (Accessed on 03/01/2012).

27 Dublin Core Metadata Initiative Specifications – http://dublincore.org/specifications/

(Accessed on 03/01/2012).


sentences to identify scientific claims in publications [14]. Ultimately, the vocabulary

used among the various information sources can be immense and so is the scope for

feature extraction. Techniques such as NER and statistical parsing can be further

enhanced and trained on the data from this repository to improve feature extraction.
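As a minimal illustration of what feature extraction over this repository could look like, the sketch below flags candidate gene-style identifiers with a hand-written pattern. The pattern and example sentence are assumptions for illustration only; the NER systems cited above rely on trained statistical models rather than regular expressions.

```python
import re

# Illustrative pattern for identifiers such as 'BRCA-1' or 'TP53':
# an uppercase letter, a few uppercase letters/digits, an optional
# hyphen, and trailing digits.
GENE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]{1,5}-?\d+\b")

def tag_genes(text):
    """Return (mention, start, end) tuples for candidate gene mentions."""
    return [(m.group(), m.start(), m.end()) for m in GENE_PATTERN.finditer(text)]

hits = tag_genes("Mutations in BRCA-1 and TP53 are discussed in the claims.")
print([h[0] for h in hits])  # ['BRCA-1', 'TP53']
```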

Chapter 3 explains our knowledge-based approach which dynamically annotates

knowledge to the documents based on the information need.


Chapter 3.

METHODOLOGY

3.1 INTRODUCTION

The patent system is comprised of many information sources which collectively

provide a valuable source of knowledge for any technology related task. However, the

diversity among the information sources makes information retrieval from the patent

system challenging. Firstly, technology (domain) specific terminological

inconsistencies drastically affect search. Traditional term based search methodologies

do not account for the use of synonyms, abbreviations, hyponyms, etc. Secondly,

there is little or no interoperability between sources, caused by the fact that each

information source is managed by independent and disjoint organizations and

agencies. Lastly, most current methodologies tackle terminological inconsistencies

and information source interoperability as separate issues. An integrated framework

for IR would require combining both the methodologies to search and integrate

multiple sources, while keeping in mind the user’s context and underlying information

need. In this chapter, we will discuss three distinct methodologies addressing the

above issues – (a) knowledge based approach using domain knowledge to tackle

terminological issues; (b) developing a Patent System Ontology (PSO) to provide a

shared vocabulary between information sources and interoperability; and (c) an


information retrieval framework that combines methodologies from (a) and (b) along

with user feedback to search and integrate information across multiple sources.

Terminological inconsistencies are very typical, especially in domain specific text.

These inconsistencies are caused by the variant usage of a term, i.e. its synonyms, abbreviations, parent concepts, etc. For example, the terms ‘Whale’ and ‘Cetacea’ are synonymous.28

While a domain expert may understand the meaning of ‘Cetacea’, it is

harder for one who is not an expert in animal nomenclature. Similarly, legal

terminology may not be well understood by technical experts. Domain ontologies are

sources of knowledge, developed by experts in the field to produce a shared

vocabulary within a technical domain. Gruber defines ontologies as – “formal, explicit

specification of a shared conceptualization” [53]. Several studies have looked at using

domain knowledge to improve IR [8,35,45,46,47,63,84,88,132]. However, the

terminological usage significantly varies between information sources. In fact, domain

knowledge from several areas, e.g. technical and legal, are simultaneously required to

achieve high level of semantic interoperability in the patent system. Our knowledge

based methodology builds on existing developments and addresses the above issue of

applying domain knowledge to different information sources. Specifically, we deal

with biomedical ontologies to enhance a user’s query to include related terms and

discuss how the technique can be modified in order to improve precision and recall.29

The various types of documents in the patent system such as patents, court cases,

and scientific publications are very strongly inter-related, even though they are

semantically, syntactically, and structurally very different. For example, a patent

litigation document frequently refers to related patent numbers, scientific publications,

patent inventors and assignees, and domain experts such as authors and editors of

prominent journals. These cross-references are seen across all other document types in

28 Wikipedia article on Whales – http://en.wikipedia.org/wiki/Whale (Accessed on 03/01/2012).

29 The measures Precision and Recall are defined in Section 4.2.


the patent system and implicitly provide strong relevancy measures between

documents. Since our goal is not explicitly targeted to produce the best search results

for a query from all information sources, but to provide a set of strongly related

documents instead, we develop a Patent System Ontology (PSO) to formalize the

representation of documents and explicitly state the cross-references (or relations)

between them.

The use of domain knowledge and the patent system ontology provide the basis for

searching and integrating multiple sources. IR is seldom a one-step process, but in fact

a multi-stage process. Information from the results of one search forms the query for

another search and so on. For example, the search for prominent court cases could

potentially lead to a more focused search of the patents involved. Moreover, it is hard

to disambiguate the context of the user query in a single step and thus, user input must

also be given significance. We develop the IR framework which combines the use of

domain knowledge, the patent system ontology and user feedback to provide a

powerful multi-domain search.

The rest of this chapter is organized as follows: Section 3.2 presents our

knowledge-based methodology to expand the user’s query in order to provide higher

recall. We realize that this isn’t sufficient to produce high precision results, and thus

discuss strategies to provide high coverage while maintaining acceptable precision. Section 3.3

presents the patent system ontology, which provides a structured and standardized

representation for the information sources in order to facilitate information source

interoperability. We present a detailed discussion regarding the development of the

ontology and its advantages. Section 3.4 presents the IR framework; an iterative

methodology to search and integrate information across multiple sources in the patent

system. The implementation details of our tool are briefly discussed in this section.

There is a plethora of related research which forms the basis for our work; related

research is discussed in Section 3.5.


3.2 BIO-ONTOLOGIES

Biomedicine and related fields are rapidly advancing, giving rise to an exponential

growth in information and data. This rise in information is exposing the lack of

standards for terminology, representation and information exchange within sub-

domains. This affects both researchers and applications which rely on the generated

biomedical data. For example, if researchers are allowed to coin their own term for an

existing concept each time they write about it, it would be impossible to maintain a

shared vocabulary and understanding between researchers in the domain. Over the

past decade, bio-ontologies have extensively been developed and used in the field of

biology. Bodenreider and Stevens argue that although ontologies initially started out

as primarily a Computer Science (CS) effort to help annotate biological data, the

ontologies have been increasingly adopted by the biologists themselves to annotate

and share biomedical data [15]. Also, unlike fields such as physics or chemistry,

biological data is seldom represented in pure mathematical form. Hence, sharing

knowledge has been the driving force for such transition from a pure CS effort to a

combined effort with biologists playing an equally important role [15]. The resulting

domain knowledge is being used by a wide range of applications including

genome/genotype/phenotype tagging [76], information retrieval [35,63], and cross-database searching [70,103]. Such a wide range of applications clearly establishes

the significance of biomedical ontologies in such a rapidly advancing field. In this

section, we will explain how applications benefit from the use of bio-ontologies

through examples and discuss our methodology in using biomedical ontologies for

information retrieval in the patent system.

There are several initiatives and groups which develop and maintain biomedical

ontologies aimed at providing a shared vocabulary and advancing research in the

domain. The Gene Ontology (GO) provides a controlled vocabulary of terms for gene

and gene product characteristics [7]. On the other hand, the Symptom Ontology (SO)


covers purely signs and symptoms [122]. The ontologies vary drastically in their

domains (e.g. genes v/s symptoms), size (e.g. the GO has 35786 concepts while the

SO has 934 classes), and representation languages (OWL, OBO, RDF, etc.). This

results in inconsistencies between available biomedical ontologies. For example, if an

application needs to use two ontologies with completely different representation

languages, the application will most likely have to support two entirely different APIs.

These inconsistencies are resolved by BioPortal, an online open repository of over 250

biomedical ontologies in various forms such as OBO, OWL, RDF and Protégé Frames

[95]. BioPortal provides a comprehensive set of web services to query the ontologies,

abstracting the various underlying formats to a standard API. The web services

provide convenient, programmatic access to the biomedical ontologies that can

be conveniently integrated into several applications, avoiding the need to separately

index each of the ontologies. BioPortal has grown from 72 ontologies in 2008, to 134

ontologies in 2009; it continues to grow and is clearly the largest such repository available online [95]. Our

work uses BioPortal for querying biomedical ontologies. However, based on the use

case and the data set that we are working with, we limit our usage of ontologies to a

much smaller subset to keep the results and methodology tractable. The selected

ontologies are summarized in Table 3.1.
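A sketch of how such a service response might be consumed: the function below collects a concept’s preferred label and synonyms from a JSON record for use in query expansion. The field names (`prefLabel`, `synonym`) follow the general shape of BioPortal’s REST responses but should be treated as assumptions, since the exact schema depends on the service version.

```python
def extract_synonyms(concept):
    """Collect the preferred label and synonyms from a concept record,
    lower-cased for query construction."""
    terms = [concept.get("prefLabel", "")]
    terms += concept.get("synonym", [])
    return [t.lower() for t in terms if t]

# A toy record shaped like a BioPortal concept (an assumption, see above):
sample = {"prefLabel": "Erythropoietin", "synonym": ["EPO", "Epoetin"]}
print(extract_synonyms(sample))  # ['erythropoietin', 'epo', 'epoetin']
```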

In order to understand the importance of bio-ontologies, let us consider an

example. Suppose we want to find any study on ‘chronic kidney disease’. In order to

limit our focus, we will add a geographic constraint on the study, such that any

reported results must be correlated with Tyrol, Austria. Hence, the formulated query

will look like: {‘chronic kidney disease’ AND Tyrol}. A search for this query on the

local TREC corpus retrieves zero documents. In this example (see Figure 3.1), we use

the National Drug File Ontology (NDF) to extract the semantics of the phrase ‘chronic

kidney disease’ to include synonyms such as ‘esrd’, ‘end stage renal disease’, ‘end

stage kidney disease’ and so on. The new query is represented following the PubMed


representation as follows: {‘chronic kidney disease’ [NDF] AND Tyrol} where

[NDF] indicates that the preceding term or phrase is expanded using the NDF

ontology. Upon further examination, we realize that the phrase ‘chronic kidney

disease’ is actually never used in the text of the document; however, its synonyms,

‘esrd’ and ‘end stage renal disease’ are used. This clearly shows that without the

synonymy knowledge from NDF, this document would have never been retrieved. We

also observe that the terms ‘DM’, ‘DM-2’ and ‘type 2 diabetes mellitus’ are

synonymously used. This shows the inconsistent terminological usage not only between documents and authors, but also within the same document. This example presents

a very restricted case. However, if we were to relax the constraints by moving up the

hierarchy in NDF to ‘kidney diseases’, we arrive at a slightly broader set of 3

publications in the TREC corpus. In fact, if we were to move to a geographically broader region and query for ‘kidney diseases’ with correlation to Austria, this would yield many more results.

Table 3.1: Summary of the Selected Biomedical Ontologies

Ontology                                            Number of Classes   Details
Medical Subject Headings (MeSH)                     229698              National Library of Medicine’s controlled vocabulary and classification [N1].
National Cancer Institute Thesaurus (NCI Thesaurus) 89129               Clinical care and health care [G4].
National Drug File (NDF)                            40104               Classification of drugs, ingredients and their clinical use [B5].
Gene Ontology (GO)                                  35786               Provides a controlled vocabulary for genes and gene product characteristics [A1].
COSTART                                             1641                Maintained by the Food and Drug Administration (FDA) for controlling adverse reaction terminology [C4].
Symptom Ontology                                    934                 Provides a controlled vocabulary for signs and symptoms, and their relationships [S5].
International Classification of Diseases (ICD-9)    21669               Standard classification for diseases [W4].

The application of biomedical ontologies is not limited to biomedical publications

alone. Figure 3.2 illustrates the use of the NCI Thesaurus for information retrieval in

patent documents. Following the previous example in Figure 3.1, we search for the

concept ‘epor’ in the claims of all patents in our repository. The search results in zero

patents being retrieved, as in the previous example. NCI provides knowledge that

‘epor’ is synonymous with ‘erythropoietin receptor’ and ‘epo-r’. The new query thus

retrieves a total of 20 patents, each of which mentions the concept ‘epor’ in their

claims. Another interesting observation is that one retrieved patent is titled “Use of cytokine receptors …”, where ‘cytokine receptor’ is a parent concept of ‘epor’ according to the NCI Thesaurus. This

expansion of the user query forms the basis of our methodology.

Figure 3.1: The Importance of Domain Knowledge in Retrieving Scientific

Publications


The rest of this section is organized as follows: Section 3.2.1 discusses the general

form of the expanded query. Query expansion has been reported to perform erratically;

sometimes the techniques improve performance, and at other times they degrade it [59]. Having established in the previous examples that synonymy is an important aspect of query expansion irrespective of the type of document, we must understand the causes of this erratic behavior of synonymy-based and related

expansions. Section 3.2.2 discusses the effects of choosing the correct source for query

expansions. Section 3.2.3 discusses how different indexing parameters such as scoring

functions can affect the search. Section 3.2.4 discusses the effects of varying the

granularity of the query, i.e. at the sentence level, paragraph level or the whole

document level. The effects of querying different fields in the documents such as the

Title, Abstract, etc., are also discussed. We realize that automatic expansion

techniques may not always produce good results. Hence, Section 3.2.5 discusses an

extension of a co-occurrence visualization based tool, MINOE [37], to allow users to navigate ontology hierarchies and manually include search terms as an exploratory model.

Figure 3.2: The Importance of Domain Knowledge in Retrieving Patent Documents

3.2.1 QUERY EXPANSION: GENERAL FORM

Query expansion techniques have been around in IR for quite some time [8,84].

They can be categorized into three general forms based on user assistance, manual

thesaurus, and automatic thesaurus construction [84]. Query expansion techniques

which rely on an external resource such as thesaurus or an ontology are increasingly

being adopted in IR methodologies [11]. In this section, we will focus on using

ontologies to expand the user’s initial query. A mathematical form for the same is

presented.

In addition to synonyms, domain ontologies provide additional relation between

terms in the form of hierarchical categorization into subclasses and super-classes (via

the rdfs:subClassOf relation). Figure 3.3 describes an example where both synonymy

relations and hierarchical relations are used to expand the user query. As an example, we

take the TREC topic 236 – “What [TUMOR_TYPES] are found in zebrafish?” and

attempt to illustrate how ontological relations are used. We assume the baseline query

for this topic is ‘Tumor AND Zebrafish’. For the sake of representation, we follow the

PubMed syntax, where ‘Tumor [MeSH]’ indicates that the term ‘Tumor’ is to be

expanded using the MeSH ontology. In order to extract synonyms from the MeSH

ontology, the baseline query ‘Tumor AND Zebrafish’ can be rewritten as ‘Tumor

[MeSH] AND Zebrafish [MeSH]’, which actually translates into:

Q: {Tumor OR Cancer OR Neoplasm …} AND {Zebrafish OR Danio Rerio …}30

30 The default query expansion uses the OR operator to expand synonyms and the AND operator

between search clauses.


However, this search results in a large collection of documents. We navigate the

MeSH hierarchy to include more specific concepts such as ‘Leukemia’ and

‘Melanoma’ and perform the search, which results in a smaller and possibly more precise set of documents (see Figure 3.3). For the sake of readability, other sub-classes

of ‘Tumor’ are not displayed. In cases where the user query is more specific, it helps

to move up the hierarchy and include parent concepts as well. This form of vertical

expansion could proceed in both directions resulting in a query which looks like:

Qtumor := {Tumor OR Cancer OR Neoplasm OR Leukemia OR Melanoma OR Diseases OR …}

Qtumor := {{Tumor OR Cancer OR Neoplasm OR …} OR {Leukemia OR Melanoma OR …} OR {Diseases OR …} OR …}

Qtumor := {{synonyms} OR {children} OR {parents} OR …}

Figure 3.3: Query Expansion along MeSH Hierarchy to Retrieve Relevant Documents

Alternatively, this can be represented as a vector of terms:

Mtumor = [Tumor, Cancer, Neoplasm, Leukemia, Melanoma, Diseases, …]

In fact, domain ontologies provide even more knowledge than synonyms and class

hierarchy. For example, NDF provides over 100 related drug names for the disease

‘Anemia’ (see Figure 3.4) via the ‘contraindicated_drug’ property. Although this is

very specific to this ontology, future enhancements to this methodology could include

such information, if it pertains to the user query. For example, if a query specifically

asked for drugs related to the disease ‘Anemia’ from the NDF ontology, the property

‘contraindicated_drug’ would be useful.

Figure 3.4: Relations in Domain Ontologies

Generally, including high level concepts will improve the recall but will affect the precision of the search. Hence, we would like to penalize the more general terms, and

boost the more specific ones by weighting the query appropriately. This grouping

allows us to assign weights to query terms, so that we have a chance to manipulate

precision. Therefore,

QTumor = Tumor [MeSH] => W^T MTumor

where W is a vector of weights in the range [0, 1].

Different documents make different use of technical language. For example, a

court case makes far less use of technical jargon than a scientific publication would. If

the same expansion scheme is applied to both types of documents, the results could be

imprecise. In some cases, it helps to expand to more general terms and in other cases

to more specific terms. Hence, it is important to estimate what form of expansion is

appropriate for different types of the documents. Also, we cannot apply the ranking

schemes or query expansion schemes to all types of documents alike. Therefore, we

define independent weight vectors for the expanded query as appropriate for each

information source (see Figure 3.5). Hence the resulting queries for patents and court

cases are:

QPatent, Tumor = WPat^T MTumor

QCase, Tumor = WCase^T MTumor

where WPat and WCase are different weight vectors corresponding to the patent and court case information sources, respectively.

A similar procedure is followed for the other bio-terms in the query. Ideally, the

weight vectors should be learned, but for now we heuristically assign weights to the

expanded query.
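The heuristic weighting described above can be sketched as follows: expansion terms are grouped into synonyms, children, and parents, each information source gets its own weight vector, and the result is rendered as a Lucene-style boolean query with boost factors. The term groups and weight values below are illustrative placeholders, not learned weights.

```python
# Expansion groups for 'Tumor' (an illustrative subset of the MeSH example):
EXPANSION = {
    "synonyms": ["tumor", "cancer", "neoplasm"],
    "children": ["leukemia", "melanoma"],
    "parents":  ["diseases"],
}

# Heuristic per-source weights: boost specific terms, penalize general ones.
WEIGHTS = {
    "patent": {"synonyms": 1.0, "children": 0.8, "parents": 0.2},
    "case":   {"synonyms": 1.0, "children": 0.5, "parents": 0.5},
}

def build_query(expansion, weights):
    """Render a Lucene-style boolean query with boost factors (term^w)."""
    clauses = []
    for group, terms in expansion.items():
        w = weights[group]
        clauses += ["%s^%.1f" % (t, w) for t in terms]
    return "(" + " OR ".join(clauses) + ")"

print(build_query(EXPANSION, WEIGHTS["patent"]))
```

The same expansion vector is reused across sources; only the weight vector changes.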


3.2.2 EFFECTS OF CHOOSING THE RIGHT ONTOLOGY

The methodology described in Section 3.2.1 expands the query terms to related

terms using one or more biomedical ontologies. Using multiple ontologies for

expansion can potentially increase the coverage (recall) of the search. Such high

recalls may be desirable for some fraction of applications. However, as explained

earlier in Section 3.2, groups developing biomedical ontologies each focus on a different sub-domain; hence, using multiple ontologies can introduce inconsistencies which improve recall but lower precision. In this section, we outline potential

sources of imprecision by comparing several ontologies and discuss how selection of

ontologies can affect the search results.

Figure 3.6 shows a comparison of three different ontologies for the same concept

‘erythropoietin’. Each ontology classifies the concept under different contexts. For

example, NCI thesaurus classifies ‘erythropoietin’ in the context of a ‘protein’ or

‘amino acid’, while NDF additionally classifies ‘erythropoietin’ in the context of a ‘carbohydrate’ and a ‘hormone’.

Figure 3.5: General Form of the Expanded Query

While these higher level contexts are still highly

related, expansion along NDF may result in terms such as ‘carbohydrates’,

‘chemical’, and ‘drug’, which will not be derived from NCI. Moreover, concepts from

different ontologies may contain conflicting information. For example, ‘epoetin alfa’

and ‘erythropoietin’ are synonymous as per NCI thesaurus and have a hyponym-

hypernym relation in NDF. While the knowledge provided by both the ontologies is

correct in the context under which they are classified, choosing one ontology over the

other could alter our search results. Furthermore, choosing both the ontologies may

produce a conflict as to whether the term ‘epoetin alfa’ should be considered a synonym or a hyponym. Depending on the vocabulary of the ontology, some query terms may not

even be covered under the ontology’s domain. For example, GO classifies

‘erythropoietin receptor binding’ (a synonym of ‘erythropoietin’) as a ‘molecular

function’.

Figure 3.6: Comparison between Multiple Biomedical Ontologies

In the case of certain queries, selection of ontologies is very obvious and easy. For example, an obvious source for gene names is GO and an obvious source for drug names is NDF. Additionally, it also helps if the user of an application knows exactly which ontology to select for query expansion. However, given that (a) not all queries can be disambiguated easily and (b) not all users are domain experts, some criterion

for ontology selection becomes important. In the description of the TREC data set by

Hersh et al [59], a source of terms is suggested for each of the 36 topics. This is a great

starting point for continuing query expansion and performing experiments. In

reference to automated ontology evaluation and selection, Sabou et al [110] claim that

ontology selection is generally based on algorithms which compute the popularity

[32], richness of knowledge [3], and topic coverage [22]. Maiga and Williams present

a user-input based ontology evaluation and selection tool [83]. However, for our

problem, where we are given a query and need to choose the appropriate ontology, the

problem of ontology selection strongly depends on the context more than ontology

parameters such as size and popularity. A potential research direction could involve

studying how word sense disambiguation techniques and simple classification models

can help in ontology selection [91,102].

Although this provides an exciting sub-topic for research, it is out of the scope of this

thesis. Thus, we will manually choose ontologies in order to perform query expansion.

In Chapter 4, we perform experiments by manually choosing ontologies and study the

effect of ontologies on information retrieval in the patent system.
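For completeness, a naive coverage-based selection criterion of the kind surveyed above can be sketched in a few lines. The toy vocabularies below stand in for full ontologies such as GO and NDF, and, as noted, this heuristic ignores the context sensitivity that makes the real selection problem hard.

```python
def select_ontology(query_terms, ontology_vocabularies):
    """Pick the ontology whose vocabulary covers the most query terms."""
    def coverage(vocab):
        return sum(1 for t in query_terms if t.lower() in vocab)
    return max(ontology_vocabularies,
               key=lambda name: coverage(ontology_vocabularies[name]))

# Toy vocabularies standing in for the real ontologies (an assumption):
vocabularies = {
    "GO":  {"gene", "zebrafish", "molecular function"},
    "NDF": {"drug", "anemia", "erythropoietin"},
}
print(select_ontology(["erythropoietin", "anemia"], vocabularies))  # NDF
```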

3.2.3 EFFECTS OF INDEXING PARAMETERS

Section 2.4 discusses several parameters including choice of tokenizers, stop word

lists, stemmers and scoring functions that can be manipulated when indexing

documents. These parameters could apply to specific fields or the entire document

index and eventually affect Information Retrieval (IR) either positively or negatively.

For example, experiments by Ide et al suggest that morphological expansion provides

better results than using stemming [63]. Also, instead of using the standard English

stop word list, some studies use special stop word lists in order to filter common

words specific to that domain [138]. Other possible variations in indexing techniques


include the usage of different tokenizers. The standard English tokenizer ignores

punctuation, white space, and special symbols. However, in the biomedical domain,

several names of genes or drugs are a combination of special symbols, numbers and

characters such as ‘BRCA-1’ and ‘p53-gene’. Ide et al claim they achieved the best

results for tokenizers which indexed at the most granular level, and then combined all

characters to form the original biomedical term during the querying phase [63]. Most

of the variations discussed so far account for little improvement in the overall

performance of an IR system [59]. Scoring functions are another important parameter

to choose when constructing text indexes. Okapi-BM25 and BM25F are variations of

the original tf-idf scoring model, which have shown improved IR performance

[84,108]. BM25 is defined by [108]:

score(Q, D) = Σ_{qi ∈ Q} IDF(qi) · tf(qi, D) · (k + 1) / ( tf(qi, D) + k · (1 − b + b · |D| / avgdl) )

where tf(qi, D) represents the term frequency of query term qi in document D, |D| is the length of the document and avgdl is the average length of all documents. The constant k is usually chosen in the range [1.2, 2.0] and b is usually set to 0.75. The inverse document frequency is given by [108]:

IDF(qi) = log( (N − n(qi) + 0.5) / (n(qi) + 0.5) )

where N is the total number of documents in the corpus and n(qi) is the number of documents containing the query term qi.
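The scoring function above translates directly into code; the sketch below assumes precomputed term and document frequencies and mirrors the formula term by term.

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avgdl, df, N, k=1.2, b=0.75):
    """Okapi BM25 as defined above.

    doc_tf maps a term to its frequency in the document; df maps a term
    to the number of documents containing it; N is the corpus size.
    """
    score = 0.0
    for q in query_terms:
        tf = doc_tf.get(q, 0)
        if tf == 0:
            continue  # the term contributes nothing to this document
        idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5))
        score += idf * (tf * (k + 1)) / (tf + k * (1 - b + b * doc_len / avgdl))
    return score

# A document of average length mentioning 'tumor' three times:
print(bm25_score(["tumor"], {"tumor": 3}, 100, 100.0, {"tumor": 2}, 10))
```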

In our methodology, we implement both scoring functions and study how they affect the retrieved information, particularly in combination with the extracted domain knowledge, along with the implementation effort involved. Pérez-Iglesias provides a Lucene implementation of the


BM25 scoring function which makes it easier to integrate with our work flow and

framework [66].

3.2.4 SCOPE OF THE QUERY TERMS

In this section, we experiment with the scope of the query terms using two

parameters – (1) limiting terms to specific fields such as titles and abstracts; and (2)

the distance between multiple (AND) clauses in the query.

Different fields of a document such as titles and abstracts provide different depths

of detail about the documents. While the abstract may provide an overview, the title, in the majority of cases, tries to capture the major theme in a single sentence.

Following this notion, we assume that the terms appearing in the title can potentially

act as the strong descriptors of the document. Certain patent-related applications

place particular emphasis on the terms used in the claims, rather than on other fields.

Scientific publications available from PubMed do not always contain the full-text. In

fact, the documents are indexed with their descriptors, which are derived from the

MeSH vocabulary. We study the effect of the field of search in our methodology, by

limiting searches to specific fields of the documents such as the title, abstract, MeSH

descriptors for PubMed documents and so on. Based on the results, it would be

possible to derive an interpolated model such that each field is individually weighted

in accordance with its importance for that specific application.
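The interpolated model suggested above amounts to a weighted sum of per-field retrieval scores. The field weights below are hypothetical placeholders; in practice they would be tuned or learned per application (e.g. boosting claims for patent-related tasks).

```python
# Hypothetical field weights (assumed values for illustration):
FIELD_WEIGHTS = {"title": 0.5, "abstract": 0.3, "claims": 0.2}

def interpolated_score(field_scores, field_weights):
    """Combine per-field retrieval scores into one document score."""
    return sum(field_weights.get(f, 0.0) * s for f, s in field_scores.items())

score = interpolated_score({"title": 2.0, "abstract": 1.0, "claims": 0.0},
                           FIELD_WEIGHTS)
print(score)  # 0.5*2.0 + 0.3*1.0 = 1.3
```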

When searching for specific content, very short queries (e.g. single term queries)

will not be very effective due to the volume of available information. Adding more

terms to a query in AND clauses is equivalent to adding more constraints, thus,

making the search more specific. However, the tf-idf model can give a high score to

documents which contain the search clauses, even if they are not in relation to one

another. To ensure that the documents more relevant to the query get a higher score,

we impose a distance constraint on the search clauses, on the assumption that they will not be very far away from one another in a relevant document. Table 3.2

shows the retrieval results for the TREC topic 231 for different distances between

search clauses. This preliminary experiment validates our hypothesis.
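The distance constraint can be checked directly from token positions; a minimal sketch with hypothetical positions:

```python
def within_distance(positions_a, positions_b, max_dist):
    """True if any occurrence of clause A lies within max_dist tokens
    of an occurrence of clause B."""
    return any(abs(a - b) <= max_dist
               for a in positions_a for b in positions_b)

# Hypothetical token positions of two query clauses in a document:
print(within_distance([4, 120], [17, 300], 25))  # True  (17 - 4 = 13)
print(within_distance([4], [300], 25))           # False (296 > 25)
```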

3.2.5 INTERACTIVE MODEL FOR VISUALIZATION

Information needs come in various abstraction levels. For example, the TREC

topic 231 (“What [TUMOR_TYPES] are found in Zebrafish?”) has a specific

information need. The expected results are also very precise, usually within a few

sentences. In contrast, a general search for technologies in the medical imaging space

is much broader, resulting in a much larger number of documents. These varying

abstraction levels of information needs are hard to be captured in the user’s query. As

an alternative to the automatic query expansion, we developed a visual exploratory

model based on term co-occurrence.

Term co-occurrence is a strong indicator of context and association. A visual
model of term co-occurrence can provide significant information about the query
terms, their associations, and other surrounding terms. We extend the visualization
module of one such co-occurrence based model, MINOE, originally designed for
exploring marine ecosystems [37]. In Section 3.2.1, we explained that both vertical
and horizontal (synonym) expansion of terms is useful. As an alternative to
automatic query expansion, we annotate MINOE’s visual co-occurrence graphs with
domain ontologies, allowing users to manually explore the hierarchies of biomedical
ontologies over the document repository.

Table 3.2: Effect of the Distance between Search Clauses

    Distance           Precision   Recall   F-Measure
    Entire document    0.03        0.876    0.05
    Within 25 terms    0.574       0.876    0.69


The user interface for this tool is flexible and has several features (see Figure 3.7).
Each term represents an entire concept, including its synonyms; hence, the search
automatically includes all synonyms. A connection between two terms represents an
association (co-occurrence) between them. The sizes of the terms and of the
connections on the graph reflect the strength of the association. Users can navigate
the hierarchies by adding child or parent concepts until a satisfactory abstraction
level is reached. The integration of domain ontologies with MINOE’s visualization
module results in a powerful exploratory tool.
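The underlying co-occurrence graph can be sketched as follows. This is an illustrative Python sketch (MINOE itself is a separate tool); the document term lists are mock data, and for simplicity co-occurrence is counted at the document level.

```python
# Build a term co-occurrence graph: nodes are terms, and the weight of
# an edge counts how often two terms appear in the same document.
# The edge weight drives the rendered size of a connection.

from itertools import combinations
from collections import Counter

def cooccurrence_graph(documents):
    edges = Counter()
    for doc in documents:
        terms = sorted(set(doc))          # unique terms, stable pair order
        for a, b in combinations(terms, 2):
            edges[(a, b)] += 1
    return edges

docs = [
    ["erythropoietin", "anemia", "kidney"],
    ["erythropoietin", "anemia"],
    ["erythropoietin", "zebrafish"],
]
graph = cooccurrence_graph(docs)
print(graph[("anemia", "erythropoietin")])  # 2
```

In the annotated version of the graph, each node would stand for an ontology concept (with all its synonyms) rather than a literal term.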

3.3 PATENT ONTOLOGY

Interoperability between information sources is essential in order to perform multi-source
IR. In this section, we describe a patent system ontology which provides a
standardized representation and a shared vocabulary for the information sources to
facilitate interoperability. The ontology also provides the declarative syntax required
to express multi-source queries, rules, and relevancy metrics.

Figure 3.7: Visualizing Concept Co-occurrences using MINOE


There is a large community working on ontology development, knowledge
representation, and knowledge engineering [16,28,53,54,55,72,81,93,94,127,128].
Several ontology development methodologies have been proposed and implemented
over the years. We reviewed the methodologies most applicable to the development
of our patent system ontology [20,28,54,93]. In general, the development of an
ontology consists of several steps: conceptualizing the domain, defining the
properties inter-relating the defined classes, instantiating the classes with physical
objects, and verifying the constructed ontology. In their paper Ontology 101, Noy
and McGuinness state that ontology development is essentially an iterative process in
which the ontology evolves to satisfy the requirements of the application it is being
designed for [93]. We follow the Ontology 101 development methodology to (1)
define the scope and the application of the ontology; (2) conceptualize each
information source and build a hierarchy of classes; and (3) define properties and
relations on each of the classes. The resulting ontology is instantiated with actual
physical documents from the document repository.

It is important to determine the specification language in which the ontology will
be coded. Several specification languages have evolved over the years, including
frame-based languages such as F-Logic and OIL, and description logic based
languages such as the DARPA Agent Markup Language with the Ontology Inference
Layer (DAML+OIL), the Resource Description Framework (RDF), and the Web
Ontology Language (OWL) [98,107]. Description Logic (DL) based languages were
developed to overcome the lack of formal logic-based semantics in frame-based
languages. Several factors need to be considered when choosing a specification
language for the ontology, including expressivity, semantics, reasoning capabilities,
availability of tools, re-use, and personal preference. RDF is a widely used language
for conceptualizing domains. OWL is a W3C recommendation built on top of the
semantics of RDF to provide higher levels of expressivity. These higher expressivity
levels allow us to define, among others, disjoint classes, equivalence (‘sameAs’) or
distinctness between individuals, and property restrictions on classes


[98]. Several tools have also been developed for the construction and modeling of
ontologies, such as Protégé and Chimaera [24,104]. Protégé is widely used in the
ontology engineering community; it supports both OWL and RDF, and provides
useful features and plugins that allow us to query and visualize the ontology. Taking
the above considerations into account, we choose OWL as the specification language
and Protégé-3.4 as our development tool for the patent system ontology. However,
reasoning over some OWL axioms does not scale well; hence, to the extent possible,
we make maximum use of the RDF subset of the OWL axioms.

The rest of this section is organized as follows. Section 3.3.1 presents a list of
competency questions, which are used to define the scope of the ontology and to
perform a preliminary evaluation in Section 3.3.4. The generated competency
questions are typical application scenarios and directly reflect the potential of the
ontology. In Section 3.3.2, the domains are conceptualized and classes are extracted
based on the competency questions. Relations are defined over the classes and
cross-references are explicitly stated. The resulting ontology is populated with actual
instances of physical documents for further evaluation and use. The current scope of
the ontology is limited to patents, court cases, and file wrappers.

3.3.1 DEFINING SCOPE OF THE ONTOLOGY

Ontologies are typically developed with specific applications as targets. Gruninger
and Fox suggest developing a set of competency questions; these are questions that
the ontology is expected to answer [54]. Developing these questions not only helps
define the scope of our ontology but also allows us to verify its usefulness both
during and after the development phase [93]. In Chapters 1 and 2, we mentioned a
few of our target applications, such as patent prior art search, patent claim
invalidation, and patent infringement analysis. These applications are not very
different from one another and, in most scenarios, go hand in hand.


Keeping these applications in mind, we define a set of competency questions, some
confined to a single domain such as patents and others spanning multiple domains.
The competency questions in no way limit the use of the ontology to these
applications alone; rather, they are examples of questions the ontology must, at a
minimum, be capable of answering. The list of competency questions presented is
not meant to be exhaustive, but illustrates how the metadata and text fields parsed
from the documents in Chapter 2 are used in the context of the patent-related
applications.

Patent Domain:

Return all patent documents which contain the phrase ‘recombinant erythropoietin
receptor’ in the claims.

Return all patent documents which contain the phrase ‘recombinant erythropoietin
receptor’, have at least 3 claims, were issued before 02-02-1999, and are assigned
to Genetics Inc.

Court Case Domain:

Return all court cases which contain the term ‘erythropoietin’.

Return all court cases which involve the company Amgen Inc., either as the
plaintiff or the defendant, and are from the District Court of Massachusetts.

Scientific Publication Domain:

What percentage of articles in the journal Blood are contributed by authors
located outside the US?

Return all articles by author John Doe from the journal Nature.

Multi-domain:


Return all patents which contain the term ‘erythropoietin’ in their claims and are
involved in at least one court litigation.

Search the titles of scientific publications for the terms from the claims of patent
5,955,422.

The questions can get more complex depending on the requirements of the user.
The results of one query can be further re-filtered with additional constraints:

Return all court cases with the term ‘erythropoietin’. From these court cases,
return the patents involved. From these patents, follow the backward and forward
citations to identify more important patents.

Notice that this last question describes the method we followed to identify the 5 core
patents assigned to Amgen and the 135 patents relevant to our use case. In each of
the questions, the main terms (or objects) suggest that there is some relationship
between them. First, these terms are grouped into concepts, or classes, such that each
represents a collection of items corresponding to that term. Second, relations are
drawn between classes such that the competency questions can be sufficiently
expressed as queries using those classes and relationships. This is also known as a
bottom-up approach to constructing an ontology.

Relations in OWL are binary, i.e., they relate exactly two classes, two individuals,
or an individual to a value. These can be represented in triple form as {subject,
predicate, object}. The values that the subject and object take on can be restricted by
defining the domain and the range of the relation, where the domain refers to the
subject end of the relation and the range refers to the object end [61]. OWL
additionally allows us to define logical characteristics such as transitivity and
symmetry on these binary relations, which enrich their meaning. For example, if the
‘=>’ relation is defined as a symmetric relation, then {A => B} can be used to infer
{B => A}; if it is defined as transitive, then {A => B} and {B => C} can be used to
infer {A => C}. Hence, if properly defined, new knowledge can be derived


from existing knowledge. Additionally, we can define necessary and sufficient
conditions on classes, which can be used to logically classify instances into classes
[61]. For example, we could define a patent document to be a document with exactly
one Title and Abstract, and at least one Claim. Then, even if we do not explicitly
state that a certain document with exactly one title and abstract, and at least one
claim, is a Patent, this can be inferred. However, as mentioned in Chapter 2, the
information sources are very diverse, which raises many issues when extracting
information from the documents. Potential issues include erroneous or missing
information; if we were to define very strict properties, a patent document could be
misclassified because some information was missing. For this reason, we relax the
properties on the relations in our implementation of the patent ontology.
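The effect of symmetry and transitivity on inference can be sketched with a toy forward-chaining loop over triples. This Python sketch stands in for an OWL reasoner such as Pellet; the property name and instances are hypothetical.

```python
# Toy forward-chaining over {subject, predicate, object} triples:
# a symmetric property p infers (b, p, a) from (a, p, b); a transitive
# property p infers (a, p, c) from (a, p, b) and (b, p, c). Iterate
# until no new triples appear (a fixed point is reached).

def infer(triples, symmetric=(), transitive=()):
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (a, p, b) in facts:
            if p in symmetric:
                new.add((b, p, a))
            if p in transitive:
                for (x, q, c) in facts:
                    if q == p and x == b:
                        new.add((a, p, c))
        if not new <= facts:
            facts |= new
            changed = True
    return facts

facts = infer({("pat1", "similarTo", "pat2"), ("pat2", "similarTo", "pat3")},
              symmetric={"similarTo"}, transitive={"similarTo"})
print(("pat3", "similarTo", "pat1") in facts)  # True
```

A production reasoner additionally handles class membership, necessary and sufficient conditions, and consistency checking, none of which this sketch attempts.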

3.3.2 CONCEPTUALIZATION

Figures 3.8 and 3.9 show conceptual views of the patent and court case
documents, respectively. The relations between two entities (shown as black lines)
are directional, from patents and court cases out to the other classes, e.g., {Patent,
hasTitle, Title}. The relations are not symmetric and hence the inverse {Title,
hasTitle, Patent} does not hold. In both Figures 3.8 and 3.9, we notice that the
remaining classes can be grouped under either metadata or textual information. This
form of classification helps address all the metadata at once, instead of calling out
each item individually. For example, if an application requests all metadata of a
patent, using the ontology we can return all metadata entities such as Title, Date,
Classification, etc. We can further group metadata and textual information into a
single parent node, Information. When the patent and court case hierarchies are
combined, classes which are common to both documents refer to the same concept
rather than to two different concepts.


This form of abstraction is possible not only for classes but also for relations,
through the rdfs:subPropertyOf construct. Court cases and patents are related to each
of the classes shown in Figures 3.8 and 3.9. These relations, such as ‘hasTitle’,
‘hasAbstract’, and ‘hasPlaintiff’, can also be abstracted into a common parent
relation, ‘hasInformation’, whose domain is either Patent or Court Case and whose
range is Information.
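The effect of this property abstraction can be sketched as follows. This is an illustrative Python stand-in, not the RDF machinery itself; the triples and the one-level subproperty map are hypothetical examples.

```python
# rdfs:subPropertyOf-style abstraction: a triple asserted with a child
# property (hasTitle, hasPlaintiff, ...) also answers a query on the
# parent property (hasInformation). One level of hierarchy for brevity.

SUBPROPERTY_OF = {"hasTitle": "hasInformation",
                  "hasAbstract": "hasInformation",
                  "hasPlaintiff": "hasInformation"}

triples = [("pat1", "hasTitle", "Erythropoietin receptor"),
           ("caseA", "hasPlaintiff", "Amgen Inc."),
           ("pat1", "cites", "pat0")]

def query(predicate):
    """Return (subject, object) pairs asserted under `predicate`
    or under any of its subproperties."""
    return [(s, o) for (s, p, o) in triples
            if p == predicate or SUBPROPERTY_OF.get(p) == predicate]

print(query("hasInformation"))
# [('pat1', 'Erythropoietin receptor'), ('caseA', 'Amgen Inc.')]
```

Note that 'cites' is not a subproperty of 'hasInformation', so its triple is excluded from the result.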

Figure 3.8: Conceptual View of Patent Documents

Figure 3.9: Conceptual View of Court Case


File wrappers are not documents themselves but collections of documents.
This makes modeling file wrappers trickier than the other documents, such as patents
and court cases. First, a vocabulary of all kinds of documents contained within the
file wrapper must be defined. Since each of these documents refers to a particular
event of communication between the applicant and the patent office, we call it an
Event rather than a document, to avoid confusion between the class Document and a
file wrapper event. The events of importance to us are shown in Figure 3.10. We
group application events and office actions separately to allow the representation of
queries such as “Return all office actions for file wrapper A”. Each file wrapper
event must be individually modeled keeping in mind the information it contains. For
example, each examiner Rejection contains critical information such as the allowed
claims, the rejected claims, and the withdrawn claims (see Figure 3.11). Similarly,
other events such as Interference, Restriction, and Amendment are also modeled in
our patent system ontology.

The Patent, Court Case, and File Wrapper classes shown in Figures 3.8-3.10 are

different types of documents available from different information sources. The patent

system comprises many such information sources and many such documents. In the

Figure 3.10: Events Contained in a File Wrapper


top-level ontology for the patent system (shown in Figure 3.12), all types of
documents are abstracted into a single parent class, Document. The Document class
can be sub-classed any number of times to include other forms of documents, such as
regulations and laws, which are currently not in the scope of our study. The classes
Document, Information, and Event correspond to the three root nodes of the patent
system ontology. Additionally, the classes Inventor, Examiner, Author, and Judge
can be abstracted into a common parent node such as Person.

As mentioned earlier, information sources in the patent system implicitly cross-reference
one another (see Figure 3.13). These implicit cross-references serve as
relevancy measures when comparing documents from different information silos.
When manually comparing two documents, the cross-references are rather obvious to
the human eye; for example, a human can easily spot a reference to a patent
document in a court case. These references can very quickly help identify documents
relevant to a user query. The true power of the patent system ontology lies in its
ability to integrate information across multiple information sources. The patent
system ontology is extended to relate classes or individuals from different domains,
so that the cross-references are explicitly represented. Applications built around the
patent system ontology can dynamically derive relevancy based on the pre-defined
cross-references.

Figure 3.11: Excerpt from the Patent System Ontology: Rejection class


Figure 3.12: Top Level Ontology for the Patent System

Figure 3.13: Cross-Referencing between Documents in the Patent System


3.3.3 POPULATING THE ONTOLOGY

The ontology is populated with information from actual physical documents in
the document repository. The XML files are parsed, and for each parent-child node:

(a) Instances of both the parent and the child are created. If these instances already
exist, they are only updated with new information, if any.

(b) The parent and child instances are related to one another through the
appropriate object or data-type property. If the property does not exist, it is
created (see Figure 3.14).
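Steps (a) and (b) can be sketched as follows. This is a plain-dict Python stand-in to show the control flow only; the actual instantiation uses the Jena and Protégé Java libraries, and the identifiers and title text below are hypothetical.

```python
# Populating sketch: for each parsed parent-child pair, (a) get or
# create both instances, filling in only missing information, and
# (b) relate them through the named property.

instances = {}     # instance id -> attribute dict
relations = set()  # (parent_id, property, child_id)

def add_node(node_id, **attrs):
    """Step (a): create the instance if absent, else merge new attributes."""
    inst = instances.setdefault(node_id, {})
    for k, v in attrs.items():
        inst.setdefault(k, v)   # existing values are never overwritten
    return node_id

def relate(parent_id, prop, child_id):
    """Step (b): relate parent and child through the given property."""
    relations.add((parent_id, prop, child_id))

# Pairs as they might be parsed from a patent XML file (mock data):
add_node("pat:5955422", type="Patent")
add_node("title:5955422", type="Title", text="Production of erythropoietin")
relate("pat:5955422", "hasTitle", "title:5955422")

# Re-parsing the same document only updates; it never duplicates:
add_node("pat:5955422", type="Patent")
print(len(instances), len(relations))  # 2 1
```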

The instantiation is done automatically using the standard Jena and Protégé Java

libraries. Once the instantiation is complete, an OWL reasoner such as Pellet is

triggered to check for consistency and make inferences. For example, an entity in the

class ‘Patent’ will be additionally classified as a ‘Document’, since ‘Patent’ is a

subclass of ‘Document’. The current version of the knowledge-base is populated with

the 1150 U.S. patents and 30 court cases from our corpus. Other patent documents

which may have been found in court cases or through patent citations, but not in the

Figure 3.14: Populating the Patent System Ontology


original 1150 documents, are instantiated but contain no information about the
patent, since the original document itself is unavailable in our corpus. However, we
ignore any documents which are not a part of our corpus when performing the tests.
The file wrapper for U.S. patent 5,955,422 has also been partially incorporated into
the knowledge-base; currently, only the first amendment, rejection, and interference,
and the original application from the file wrapper are populated.

Triple stores are specialized databases for managing large amounts of information
written in RDF [18,89,96]. Most triple stores also have limited support for OWL.
Due to the size of the ontology, we create a local instance of a triple store (Virtuoso)
and store all the triples in it. Using a triple store allows us to scale our ontology to
millions of instances. Moreover, ontology editors such as Protégé require loading the
ontology each time the application is executed, whereas triple stores provide a
persistent store for the triples and significantly lower the loading time. Currently, the
ontology can be queried using SPARQL through both the Protégé and Virtuoso
interfaces [6,104].

3.3.4 USING THE DECLARATIVE SYNTAX: EXPRESSING QUERIES AND DEVELOPING RULES

The patent system ontology provides the declarative syntax (RDF and OWL)
needed to express queries and to embed heuristics as rules. This section provides
examples of how this is done.

3.3.4.1 Expressing Competency Questions as SPARQL queries

Table 3.3 shows examples of how we can represent a natural language question
in SPARQL to query the ontology, as long as the classes and relations required to
express the query are defined in the ontology. The queries do not always have to
return documents; they can return other classes, such as Inventors or Examiners, as
well. These SPARQL queries will generally be handled at the application level and will be


abstracted from users. Applications can request any information they want from the
ontology; in fact, applications do not even need to know the full details of the
ontology. The ontology can be queried for all its relations for a particular class or
between two classes. For example, the query:

SELECT ?rel WHERE {
  ?pat type Patent .
  ?pat ?rel Information
}

will return all relations (variable ?rel) which have the class Patent as the domain. In
other words, all relations defined on patents, such as hasTitle, hasAbstract, and
hasIPCClass, will be returned. Hence, updating the underlying ontology with new
information automatically updates the application using it as well.

3.3.4.2 Expressing Heuristics as Rules

Rules are declarative statements which operate over the entities defined in the
ontology, providing a way to express, using if-then clauses, relations that are more
than simple binary relations. The Semantic Web Rule Language (SWRL), which
combines OWL and RuleML, extends the expressivity of OWL [62]. An inference

Table 3.3: Expressing Competency Questions in SPARQL

Competency question: Return all court cases which involve the company Amgen Inc.
as the plaintiff and from the District Court of Massachusetts

SPARQL query:
SELECT ?case WHERE {
  ?case type CourtCase .
  ?case hasPlaintiff “Amgen Inc.” .
  ?case hasCourt “District Court…”
}

Competency question: Return all patents which contain the phrase ‘recombinant
erythropoietin receptor’ in the claims and IPC class “A61K”

SPARQL query:
SELECT ?pat WHERE {
  ?pat type Patent .
  ?pat hasClaim ?clm .
  ?clm hasTerm1 “recombinant …” .
  ?pat hasIPCClass “A61K” .
}


engine or a reasoner executes the rules and infers new facts in the knowledge base.
SWRL, however, comes at the price of decidability and computational complexity
[101]; the use of DL-safe rules is suggested to keep the complexity reasonable [87].
We use the Pellet reasoner and the Jess inference engine to reason over the
developed rules [41,117]. The rules are developed based on similarity heuristics
between documents; examples of these heuristics are shown in Table 3.4.

The rules operate over the metadata and cross-referenced properties defined in the
patent system ontology and infer pairs of similar documents. In order to differentiate
between the inferences made by each rule, we define a property
hasSimilarDocument_* for each rule, where * is the identifier of the rule. This
allows us to apply weighting schemes to the rules to distinguish the more general
rules from the more specific, and hence more important, ones. To illustrate, consider
the example shown in Figure 3.15, where Patents 1 and 2 are both owned by the
same company, Amgen, and invented by the same inventor. According to our rule
base, these patents should be considered similar to one another by at least two rules.
Intuitively, however, a large company such as Amgen is likely to own patents
covering a broader range of topics than a single inventor would. If Amgen has ‘n’
patents, then we will assume each link contributes a weight of 1/n. Similarly, if Inventor1 has

Table 3.4: Expressing SWRL Rules

Heuristic: Two patent documents by the same inventor are potentially similar.
Rule: hasInventor(?pat1, ?inv1) ∧ hasInventor(?pat2, ?inv1) →
hasSimilarDocument_1(?pat1, ?pat2)

Heuristic: Two patents that appear in the same court litigation are potentially similar;
the court case is also related to both patents.
Rule: patentsInvolved(?case, ?pat1) ∧ patentsInvolved(?case, ?pat2) →
hasSimilarDocument_2(?pat1, ?pat2) ∧ hasSimilarDocument_2(?case, ?pat1) ∧
hasSimilarDocument_2(?case, ?pat2) ∧ hasSimilarDocument_2(?pat1, ?case)


‘m’ patents, then each link has a weight of 1/m. Since n>m, the more general rules

would be assigned a lesser weight. The resulting similarity score between the

documents is a weighted sum of the number of rules that infer the two documents as

similar:

( ) ∑ ( )

where Wi represents the importance of the rule and inference(i) = 1 if ‘A

hasSimilarDocument_i B’ or 0 otherwise. For illustration purpose, in this paper we

simply give all rules equal weights and the score is equal to the number of rules that

have concluded that the two documents are similar.
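The weighted sum can be sketched as follows. This is an illustrative Python sketch; the rule identifiers, document names, and weight values are hypothetical.

```python
# Similarity as a weighted sum over rule inferences:
#   similarity(A, B) = sum_i W_i * inference(i)
# where inference(i) is 1 if rule i concluded 'A hasSimilarDocument_i B'.

def similarity(a, b, rule_weights, inferred):
    """`inferred` maps rule id -> set of (doc, doc) pairs it concluded."""
    return sum(w for rule, w in rule_weights.items()
               if (a, b) in inferred.get(rule, set()))

inferred = {1: {("pat1", "pat2")},                     # same inventor
            2: {("pat1", "pat2"), ("caseA", "pat1")}}  # same litigation

# Equal weights reduce the score to a rule count, as used in this thesis:
equal = {1: 1.0, 2: 1.0}
print(similarity("pat1", "pat2", equal, inferred))  # 2.0

# Down-weighting the more general rule (e.g. 1/n for an assignee holding
# n patents) lowers that rule's contribution to the score:
weighted = {1: 1.0 / 4, 2: 1.0 / 2}
print(similarity("pat1", "pat2", weighted, inferred))  # 0.75
```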

3.4 IR FRAMEWORK

In IR, the desired information is seldom obtained with a single query. Queries are
typically reformulated several times based on intermediate search results until the
information need is satisfied [119]. This reformulation may include the addition of
synonyms, new search terms, and other constraints. When performing multi-source

Figure 3.15: Expressing Heuristics through Rules in Patent System Ontology


search, information obtained from searching one domain is applied to another. The
methodologies in Sections 3.2 and 3.3 provide the backbone for automating this
process: the domain ontologies ensure that the correct semantics are applied for
efficient retrieval, while the patent system ontology standardizes domain
representation and integration. In this section, we present an Information Retrieval
(IR) framework which integrates the methodologies from Sections 3.2 and 3.3 in
multiple stages to enhance multi-source IR (see Figure 3.16):

1. Expand Query: In this stage, the user’s initial query is expanded according to
the methodology described in Section 3.2.1. Manually selected bio-ontologies
act as the source of concepts, and appropriate weight vectors are selected.

2. Search Information Sources: The information sources are independently
searched with the applicable restrictions from Section 3.2.4, i.e., the scope of
the search. The vocabulary and syntax required for searching the information
sources are contained in the patent system ontology; for example, it provides
the syntax for searching the titles of documents – hasTitle: ‘erythropoietin’.
The information sources are searched independently in this stage to retrieve
highly relevant documents from each source.

3. Cross-Reference Information: The cross-referenced information holds the key to

Figure 3.16: IR Framework


multi-domain retrieval. The cross-references explicitly defined in the patent
system ontology are used as relevancy measures to correlate search results
between information sources. For example, a relation defined in the patent
system ontology, {caseA, patentsInvolved, patentA}, will cause the
framework to extract patent numbers from the court case. These patent
numbers can be used to repeat or enhance the search in the patent domain.
Similarly, biomedical terminology can be extracted from one document and
used to search other documents. For example, if the drug ontology is used to
identify drugs in the abstracts of publications, the newly identified drugs can
be used to search the patent domain. In fact, they can feed directly back into
Stage 1, where the newly added terms are expanded using the biomedical
ontologies. Likewise, in Stage 2, the new search terms could be limited to
searching only the claims of the patents.

4. User Feedback: Besides the diverse information and knowledge sources, the
users in the patent system domain also come from diverse backgrounds:
scientific/technical, legal, business, and more. The intention of the user must
be captured through the search process in order to ensure that the retrieved
results are indeed relevant to the user. User-relevance feedback has been an
important part of IR research [8,84]. The user-relevance feedback stage is
outside the scope of this thesis and will not be discussed; however, it is an
important component of the framework and will be included in future
implementations.

To illustrate the methodology and the IR framework, consider the example shown in
Figure 3.17. Based on the initial query ‘erythropoietin’, Stage I of the framework
expands the query using the Drug Ontology to:

QInitial = Erythropoietin [Drug Ontology] = {{erythropoietin, epo}, {epoetin alfa,
epogen, procrit…}…}


In Stage II of the framework, the expanded query is used to search the TREC
corpus, and associated diseases such as anemia are extracted. These terms can be fed
back to Stage I to re-apply the expansion on the new terms, giving us:

Anemia [Disease Ontology] = {anemia, aplastic anemia…}

Similarly,

ESRD [Disease Ontology] = {esrd, chronic kidney disease…}

In Stage III, these new terms are used to search the claims of the US patent
documents in conjunction with the original query, retrieving highly relevant patent
documents: {5,955,422; 5,547,933; 5,618,298; 5,620,868; 5,756,349; …}. This
process can continue as long as desired and operates on the other information sources
as well.
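The iterative Stage I-III loop of this example can be sketched as follows. This is an illustrative Python sketch in which the ontologies and the corpus lookup are mock data standing in for BioPortal ontologies and the TREC corpus search.

```python
# Iterative multi-source retrieval: expand the query (Stage I), search
# one source (Stage II), extract cross-referenced terms (Stage III),
# and feed them back into expansion. All data below is mocked.

def expand(term, ontology):
    """Stage I: a term stands for its concept, including synonyms."""
    return {term} | ontology.get(term, set())

drug_ontology = {"erythropoietin": {"epo", "epoetin alfa"}}
disease_ontology = {"anemia": {"aplastic anemia"}}

# Stage II stand-in: diseases the searched corpus associates with a term.
corpus_diseases = {"epo": {"anemia"}}

query = expand("erythropoietin", drug_ontology)

# Stage III: terms extracted from the results are expanded in turn.
extracted = set()
for term in query:
    extracted |= corpus_diseases.get(term, set())
for disease in sorted(extracted):
    query |= expand(disease, disease_ontology)

print(sorted(query))
# ['anemia', 'aplastic anemia', 'epo', 'epoetin alfa', 'erythropoietin']
```

The enlarged query would then be applied, with field restrictions, to the next information source, e.g. the claims of the patent corpus.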

Figure 3.17: Example to Illustrate IR Framework


3.4.1 IMPLEMENTATION DETAILS

In this section, we provide a brief overview of the implementation of the IR
framework and its basic features. The IR framework is implemented entirely in Java,
with abstractions of several modules that are critical for the system. Some features of
the IR framework include:

- Feature modules for query expansion, such as the one explained in Section 3.2.1
- A generic API for integration with sources of domain knowledge, such as BioPortal
- Jena libraries and triple store integration for modifying the patent system
  ontology through new constructs, cross-references, or rules
- Solr and Lucene libraries to create, update, and query the text indexes
- Automatic query generation, abstracting the syntactic details from the user
- Automatic UI and search configuration through a pre-defined properties file

The current implementation does not directly interface with the information
sources; rather, it interfaces with a local copy of the document repository. The work
flow (see Figure 3.18) is divided into two stages. The first stage, the offline phase,
consists of (a) parsing the document repository; (b) updating the references and rules
in the patent system ontology; and (c) creating or updating the text indexes according
to the patent system ontology. The patent system ontology is not queried directly, for
the following reasons:

1. Semantic technologies do not scale to larger amounts of data.

2. Text mining libraries, such as Lucene, outperform triple store implementations.

The second stage, the online phase, involves the UI, which communicates with the
text indexes and fetches domain knowledge dynamically from BioPortal. The tool


implements the four stages of the IR framework described in Section 3.4 in the
backend, while the UI is used to display search results, collect user feedback, and so on.

3.5 RELATED WORK

There is a wealth of research in the area of IR and related topics such as
Information Extraction, Document Summarization, Text Mining, Data Mining, and
Machine Learning. The methodology discussed in this chapter is based on (a)
knowledge-based methods, such as query expansion, which make use of domain
ontologies; and (b) the use of ontologies to achieve interoperability between
information sources and thereby facilitate multi-domain searching. This section
summarizes the work most closely related to our methodology.

Figure 3.18: Current Implementation of the IR Framework Methodology


3.5.1 KNOWLEDGE-BASED IR

Several studies in recent years have made use of domain ontologies and

derived knowledge annotations for IR and related tasks [11]. GoPubMed is a search

engine which uses the MeSH and the GO ontologies to annotate and search the

PubMed index to retrieve biomedical publications [35]. The TREC Genomics track

(2003-2007) had several research groups working on information retrieval on a subset

of the PubMed index [59]. Some of the more successful methodologies employ the use

of domain knowledge, especially for synonymy [63,121]. Although the use of

synonymy is reported to be erratic [59], it accounts for a majority of the improvement

in the top performing systems. Domain knowledge has been used to improve retrieval

in the patent document space as well [45-47,88,132]. Mukherjea and Bamba use

knowledge sources to annotate the physical documents to improve recall [88]. Their

ranking mechanism, however, is based on non-semantic measures such as citation

counts. The use of domain knowledge for other related tasks such as summarization,

clustering and visualization has been shown [74,126,132]. The PATExpert project has

developed an ontology for patent documents which focuses on the European patent

system [46,47,132]. However, most of the above methods are tuned to work with a

single information silo, and must be extended to work with multiple information

sources.

3.5.2 OTHER APPROACHES TO IR

Several methods approach document retrieval from a non-semantics perspective.

These methods typically use metadata information to cluster and classify relevant

documents. Citation analysis and link analysis typically focus on the incoming and

outgoing citations of a document [42,100]. Other general metadata-based methodologies rank documents based on bibliographic information such as the rank of a journal [31]. Kang et al. cluster patent documents based on their technology


classification to improve retrieval [73]. Xue and Croft explore an automatic query

generation method to retrieve patent documents which extracts noun phrases from pre-

specified fields of the patent document [137]. However, these methodologies are

outperformed by knowledge-based methodologies. Potential future work could explore

how to best combine semantic methodologies with the others.

3.5.3 ONTOLOGY DEVELOPMENT AND INTEROPERABILITY

The nature of the problem we are addressing demands information that is scattered across many diverse information sources in the patent system.

Interoperability between these information sources is essential to facilitate multi-

domain searching. A variety of ontology-based methods have been proposed for

integrating diverse knowledge domains [86,106,115,131]. While some support having

a single unified ontology for all purposes, such an ontology is not scalable. Furthermore, no single organization is likely to take charge of maintaining it.

Alternative architectures suggest having separate ontologies representing each

knowledge domain, and integrating them through either the application directly,

providing ontology mappings, or via a top level ontology [103,131,132].

In our methodology, we develop the patent system ontology, which provides

structural interoperability between the information sources. In some sense, we achieve

semantic interoperability by using domain ontologies to integrate information from

several domains. However, a much higher level of interoperability can be achieved if

legal ontologies and biomedical ontologies are combined. Examples of legal

ontologies include structural ontologies, technology classifications (USPC[31] and IPC[32]), and so on. We develop the IR framework to combine the patent system ontology and the domain knowledge from biomedical ontologies as a first step towards this goal.

[31] The United States Patent Classification codes can be accessed at http://www.uspto.gov/web/patents/classification/ (Accessed on 03/01/2012).
[32] The International Patent Classification codes can be accessed at http://www.wipo.int/classifications/ipc/en/ (Accessed on 03/01/2012).

Page 101: INFORMATION RETRIEVAL ACROSS MULTIPLE INFORMATION …

Chapter 4.

PERFORMANCE EVALUATION

4.1 INTRODUCTION

Performance evaluations help establish aspects of the system that perform well and

give insight into how the methodology can potentially be improved. In this

chapter, we perform a formal evaluation of our methodology against the document

repository described in Chapter 2. In our methodology, the problem of retrieving

information across multiple sources in the patent system is tackled in multiple stages.

First, the query expansion methodology is integrated with domain knowledge to

improve retrieval from a single information source at a time. Next, the patent system

ontology is used to integrate information across multiple sources and retrieve a set of

highly relevant documents. Since both methodologies focus on different stages of the

IR framework, their experimental setups and evaluation criteria differ. Hence, we

evaluate the query expansion methodology and the patent system ontology

independently.

The chapter is divided into three parts: Section 4.2 provides some necessary

background on SPARQL, a language to query RDF ontologies, and formal evaluation

measures used in IR such as precision and recall. Section 4.3 evaluates the

performance of the knowledge-based query expansion methodology on the documents.


The results are compared to baseline references that are generated by querying the

document corpus without the use of domain knowledge. Section 4.4 demonstrates the

functionality of the patent system ontology through use case scenarios based on two

applications – (1) patent prior art search, and (2) infringement analysis. A series of

questions that are typical in the applications is generated in order to query the

ontology. Finally, Section 4.5 provides a summary of the discussion, abstracting the benefits and limitations of the methodology and laying a strong foundation for future experimentation and potential improvements.

4.2 BACKGROUND AND RELATED WORK

Many formal measures are defined in IR literature to evaluate the performance of

systems. In this section, we provide some background on the various formal measures

that will be used throughout the chapter. Specifically, Section 4.2.1 defines recall,

precision, f-measure, average precision, document mean average precision, and

‘precision @ k’. The patent system ontology is evaluated based on a series of queries

representing two application use cases. The queries are written in SPARQL and

require some understanding of the syntax of the language. Section 4.2.2 provides a

brief overview on the SPARQL language and some common constructs that are used

in this chapter.

4.2.1 EVALUATION METRICS

The most common evaluation metrics in IR are recall and precision measures.

Statistically speaking, recall measures the coverage of the search, or the fraction of

relevant documents retrieved, and can be defined as [84]:

    Recall = TP / (TP + FN)

where TP is the number of true positives and FN is the number of false negatives.

Precision measures the number of relevant documents out of the total number of

documents retrieved, and is defined as [84]:

    Precision = TP / (TP + FP)

where FP is the number of false positives. A third measure used in IR, the F-measure, is the harmonic mean of the precision and recall, and is defined as [84]:

    F = 2 × Precision × Recall / (Precision + Recall)

The Average Precision is the mean of precision values calculated at each position

where a relevant result is found. The Mean Average Precision (MAP) is the mean of

the Average Precision for a set of queries over a corpus. The MAP measure is

increasingly being used to evaluate search results [59]. The TREC corpus uses MAP

to evaluate results at the passage and document levels. However, we will use MAP only to evaluate the results of document retrieval.

Since most users only view the results from the top 10-30 hits, the precision and

recall measured over the entire set of results are of limited relevance. The precision at smaller numbers of retrieved results is much more informative. Thus, we report

‘precision @ k’, where k is the number of retrieved results at which the precision is

reported.
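These measures can be stated compactly in code. The following is a minimal sketch; the ranked list and relevance judgments are invented for illustration, not taken from the experiments:

```python
def recall(tp, fn):
    # Fraction of relevant documents that were retrieved.
    return tp / (tp + fn)

def precision(tp, fp):
    # Fraction of retrieved documents that are relevant.
    return tp / (tp + fp)

def f_measure(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

def average_precision(ranked, relevant):
    # Mean of the precision values at each rank where a relevant result occurs.
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k):
    # 'Precision @ k': precision over the first k retrieved results.
    return sum(1 for d in ranked[:k] if d in relevant) / k

# Toy ranking: five retrieved documents, three of which are relevant.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5"}
print(average_precision(ranked, relevant))  # relevant results at ranks 1, 3 and 5
print(precision_at_k(ranked, relevant, 3))
```

MAP is then simply the mean of average_precision over the set of queries run against the corpus.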

4.2.2 SPARQL

Over the years, several query languages for RDF graphs have been developed.

Some of the commonly used ontology query languages include RDF Query Language

(RDQL), SPARQL Protocol and RDF Query Language (SPARQL) and Semantic Web


Rule Language (SWRL) [62]. In this section, we present some background on

SPARQL as a means to query RDF graphs. SPARQL is syntactically very similar to

Structured Query Language (SQL), a language commonly used to query relational

databases. Similar to SQL, SPARQL provides many features and clauses such as

CONSTRUCT, DESCRIBE and ORDER BY amongst many others enabling the

creation of complex queries. Although SPARQL is a query language for RDF, since

OWL is built over the RDF semantics, SPARQL can be used to query OWL

ontologies as well. The simplicity and ease of use of SPARQL, which is built into the OWL API[33], has encouraged us to use SPARQL to query the patent system ontology.

The SPARQL queries that have been used in this chapter mainly consist of two

parts – the query variation, and the triples (see Figures 4.6 and 4.7 for example).

SPARQL provides different query variations that can be used to query RDF graphs.

These are SELECT, DESCRIBE, CONSTRUCT and ASK. We use the SELECT

keyword to extract raw values from the graph. The other variations are not used in our

work, but highly useful when dealing with RDF graphs. The query triples are used to

specify information that needs to be extracted. The triples are of the form “?subject

?predicate ?object” where any term with a leading question mark is a variable that

can match multiple entities. For example, “?subject a CourtCase” will return all

entities in the ontology, which are of type CourtCase. A detailed description of the

SPARQL query language is available in the W3C documentation [118].
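For instance, a SELECT query of this kind, retrieving every court case together with the patent it cites, might look as follows; the ps: namespace and the property name are illustrative, not the actual patent system ontology vocabulary:

```sparql
PREFIX ps: <http://example.org/patentsystem#>

# Every entity typed as CourtCase, together with the patent it cites.
SELECT ?case ?patent
WHERE {
    ?case a ps:CourtCase .
    ?case ps:cites ?patent .
}
```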

4.3 KNOWLEDGE-BASED METHODOLOGY USING BIO-ONTOLOGIES

Terminological inconsistencies and heavy use of domain-specific semantics render pure term-based methodologies ineffective. In order to retrieve relevant information, a

strong integration of domain knowledge is important. In addition to domain

[33] The Javadocs for the OWL API can be found at http://owlapi.sourceforge.net/documentation.html (Accessed on 03/01/2012).


knowledge integration, methodologies must be able to construct complex queries that are

capable of improving document retrieval in an automated fashion. The query

expansion methodology described in Section 3.2 attempts to achieve this by extracting

semantically relevant terms from external domain ontologies. In this section, we will

evaluate our query expansion methodology with a reference to existing literature and

baseline results. The results are supported with a thorough analysis.

The document types differ significantly in many aspects, and thus the methodology will apply in different ways to each. For this reason, the methodology is tested independently on each document type. We generate the baseline references by using a

simple term based model to query the document corpus. Additionally, for the

publication data set, the results from the 2007 TREC genomics competition are used

as a reference. Our goal is to improve performance with respect to these baseline

values and construct a strong foundation for future improvements.

This section is organized as follows: Section 4.3.1 queries the document corpus

without the use of domain knowledge, to generate the baseline estimates. Section 4.3.2

explains the general experimental setup for integrating domain knowledge and query

expansion. Based on the general experimental setup, Sections 4.3.2.1 and 4.3.2.2

evaluate the methodology on the patent and scientific publication corpora independently.

4.3.1 BASELINE

The first step in the evaluation is to establish the baseline references for

comparison in each document type. In our use case, the keyword ‘erythropoietin’ is

used to search through the patent database and to generate a baseline reference. The

search for ‘erythropoietin’ in the patent data set results in a large number of

documents. We use the 135 patents as the ground truths to calculate the precision and

recall measures. The benchmarking search results in a recall of 0.67 but with a low


precision of 0.125. As users rarely look beyond the top 10-20 search results [84], this

baseline search alone with a large number of documents is ineffective. In addition to

precision and recall results, we are especially interested in retrieving the five core

patents, since they are important in our use case. We compute the average rank of

these five core patents to further evaluate the effectiveness of our system. An average

rank of 3 would indicate all five core patents are retrieved in the top 5 results. The

average rank of the five core patents found to be 51.4 out of the 1150 patents. Table

4.1 lists the rank of the five core patents for the baseline search.

For the publication data set, the original topics in TREC are used without

modifications as queries to provide the baseline reference against our methodology. In

addition, the published results from the 2007 TREC genomics competition are also

used as a reference [59]. The baseline document MAP is 0.036, which is better than

the minimum document MAP of 0.032 but worse than the top and median scores of 0.328 and 0.186, respectively, achieved in the TREC competition. The

baseline results are summarized in Table 4.2.

As discussed in Chapter 2, court cases are written for general consumption and

make little use of technical jargon. A search for the term ‘erythropoietin’ alone

retrieves all 30 relevant court cases resulting in a recall (and precision) of 100%. Since

our court case database is currently limited, we focus on evaluating our methodology

on the patent and scientific publication documents.

Table 4.1: Baseline Reference: Rank of Core Patents

Patent Number Rank out of 1150 Patents

5,547,933 49

5,621,080 50

5,618,698 51

5,955,422 53

5,756,349 54


4.3.2 QUERY EXPANSION

Term-based models search the underlying corpus for the terms specified in the

user queries. In Chapter 3, we illustrated that these terms alone are not sufficient to

retrieve documents due to the heavy use of synonymy in the documents. The

terminological inconsistencies are tackled by including synonyms along with the

original query terms to search the documents. The query expansion method explained

in Section 3.2 queries external knowledge sources such as domain ontologies to

extract the required semantics. In order to facilitate expansion, each query term is

treated as a concept, i.e. a collection of terms and phrases that are interchangeably

used in the texts of the documents. For example, the concept ‘erythropoietin’ is a

collection of the terms – {‘epo’, ‘erythropoietin’, ‘epoetin alfa’ …}. Additionally,

related concepts through hierarchical expansions are also included in the query to

provide a broader coverage. However, expanding the original query terms could also

potentially lead to imprecise results. In this section, we describe the experimental

setup to evaluate the knowledge-based query expansion methodology.

The first step in expanding queries is to map the terms in the query to the actual

concepts in the biomedical ontologies. The mapping is done by searching BioPortal

for the query terms and retrieving concept Uniform Resource Identifiers (URI). The

mapping process not only retrieves concepts that have the query term as a preferred

name for the concept, but also those concepts which have the query term listed as a

synonym. For example, the term ‘tumor’ is mapped to the concept ‘neoplasms’ in MeSH.

Table 4.2: Baseline Reference for Evaluating the Query Expansion Methodology

Type            Recall    Average Precision
Patent          0.67      0.125
Publications    0.76      0.0361
Court Case      1         1

Once the concept URIs are fetched, the ontologies are traversed hierarchically

to retrieve parent and child concepts as well as the concepts in the several levels above

and below the concept. The newly added hierarchical concepts automatically include

their synonyms. For example, if ‘colony stimulating factors’ is identified as a parent

concept, its synonyms such as ‘csf’ and ‘mgif’ will also be included into the query as

parent concepts. The resulting expanded query will be of the form:

Qterm = term[ALL] = [term, synonym_1, …, synonym_n]

where term[ALL] is used to indicate that initially ALL ontologies are searched. In order to vary the depth of expansion, we use several weighting schemes, such as WSyn, WPar, WGPar, WChi and WGChi, which represent expansions including only synonymy, parent concepts up to one level, parent concepts up to two levels, child concepts up to one level and child concepts up to two levels, respectively. Similarly,

the terms can be expanded all the way up to the roots, or the leaves of the hierarchy.
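This hierarchical traversal can be sketched as a repeated walk over a parent map. The hierarchy fragment below is hypothetical, not the actual MeSH structure, and the function names are ours:

```python
def ancestors(parents, concept, depth):
    """Collect parent concepts up to `depth` levels above `concept`.
    `parents` maps each concept to its list of direct parents."""
    frontier, found = {concept}, set()
    for _ in range(depth):
        frontier = {p for c in frontier for p in parents.get(c, [])}
        found |= frontier
    return found

# Hypothetical fragment of a MeSH-like hierarchy.
hierarchy = {
    "erythropoietin": ["colony stimulating factors"],
    "colony stimulating factors": ["growth substances"],
}
print(sorted(ancestors(hierarchy, "erythropoietin", 1)))
print(sorted(ancestors(hierarchy, "erythropoietin", 2)))
```

A symmetric child map would be traversed the same way to collect subclasses down to the leaves.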

Queries can consist of many terms. Not all the query terms need to be expanded. Before

automatically expanding the queries, the queries are pre-processed to indicate which

terms need to be expanded. For example, the query “What tumor types are found in

Zebrafish?” is pre-processed to “What [neoplasms][MeSH] are found in Zebrafish

[NDF]?”. The processed query indicates that the term ‘neoplasm’ must be expanded

using the MeSH ontology and the term ‘Zebrafish’ must be expanded using the NDF.

This forms the basis for our experiments that are to follow. In the process of


experimentation, several modifications based on different weighting schemes and

different ontologies, etc., will be studied.

4.3.2.1 Query Expansion for Retrieval of Patent Documents

The query ‘[erythropoietin][ALL]’ is chosen as the starting point for expansion.

We aim to retrieve the relevant documents from the set of 1150 patents in our

repository. A search for the term ‘erythropoietin’ in BioPortal returns results from 4

ontologies – MeSH, NDF, NCI Thesaurus and GO. All four ontologies are used to

expand the query to up to two levels of parent and child concepts using the weighting

schemes described earlier. Figure 4.1 shows the recall and the average precision for

the query expansions on patent documents.

Figure 4.1: Average Precision and Recall for Query Expansions on Patent Documents

As shown in Figure 4.1, using synonymy (i.e. the concept ‘erythropoietin’) alone does not improve recall. In fact, the average

precision drops to 0.114 when compared to the baseline reference of 0.124. However,

the hierarchical expansions show improvements in both recall and precision. The

addition of only one level of parents, or children improves recall to 0.97 and precision

to 0.131. While the results improve with the addition of immediate parent and child

concepts, adding concepts any farther away in the hierarchy do not change the results

significantly. This is mainly due to the fact that the ontologies provide very few terms

beyond immediate parents and children.

All forms of hierarchical query expansions result in about the same precision and

recall. Although it is difficult to distinguish between the different hierarchical results,

the expansions have an effect on the average rank of the five core patents. As we

expand to parent concepts that are farther away from the original concept, the average

rank of the 5 core patents deteriorates (e.g. above 450 for grandparents). Intuitively,

this makes sense because as we traverse higher up in the hierarchy to parent concepts,

we are generalizing the search. On the other hand, the average rank improves to

around 67 as we add child concepts. While this is still lower than the baseline search,

we attempt to improve the average rank with further experimentation. Table 4.3 shows

the average rank of the five core patents for the expanded queries.

The current weighting schemes give equal weights to all concepts. Having

achieved a high recall, we attempt to further improve the precision by applying

different weights to the terms. Based on the results, the expansion to child concepts

showed most improvement over the baseline reference. In order to study the effect of

weighting, we experiment with one level each of parent concepts and child concepts.

We define three heuristic weighting functions to analyze how they affect the search

results. These heuristic functions, W1, W2 and W3, assign different relative weights to the original concept and to the expanded parent and child concepts.

However, the use of different weighting functions only has a marginal effect on the

results (e.g. for W3, precision goes up from 0.1310 to 0.1314). Ideally, these

weighting vectors should be automatically learnt from the corpus.

The four bio-ontologies used for expansion may share some terminology.

However, since they cover different sub-domains, the terminology may be classified

differently. This could lead to potential conflicts from the use of multiple ontologies.

For example, the NDF ontology states ‘epoetin alfa’ as a child concept of

‘erythropoietin’ whereas the NCI Thesaurus states they are synonyms. One way of

resolving conflict is to give one level of concepts preference over the other. In our

expansion, we gave precedence to the concept that is closest to the leaves (i.e. child

concepts) in the tree hierarchy. A second way to resolve conflicts is to selectively use

ontologies. This may reduce the overall coverage, but in turn have a positive effect on

precision. We compare the use of individual ontologies versus the use of multiple

ontologies (see Figure 4.2).

Table 4.3: Change in Average Rank of Core Patents with Level of Expansion

Level of Expansion               Average Rank of Core Patents
Synonymy                         133
Parents                          428
Grandparents                     469
Children                         67
Grandchildren                    67
Parents and Children             232
Grandparents and Grandchildren   270

The domain knowledge provided by the NDF improves the precision to 0.161 when compared to the other ontologies. The recall drops from 0.97

to 0.95, which is acceptable in most cases. Clearly, the NDF performs better than the other ontologies. Upon examination, we realize that the terms extracted from the NDF

include industry standard drug names such as Epogen, which are commonly seen

across relevant documents. This implies that the selection of ontologies is an important

aspect in the expansion of queries. The average rank of the core patents also improves

to less than 50.

The low values of precision can be attributed to the fact that the concept

‘erythropoietin’ is used in many different contexts.

Figure 4.2: Comparison between Use of Multiple Ontologies vs. Individual Ontologies

This implies that ‘erythropoietin’ itself is a general term, and hence a search for it in the patent database would return all

documents covering a wide range of aspects including its production, its composition,

etc.. Since the ground truth is defined by following forward and backward citations to

the five core patents, the ground truths themselves cover a wide range of topics

related to erythropoietin. While the query expansion using biomedical ontologies

improves recall by a significant amount, it is difficult to generate a query which covers

all 135 documents with a high precision. However, a more specific query can be

constructed by adding more clauses, in order to retrieve a subset of documents and as

a result improve precision. By adding more keywords and restrictions such as fields to

search (Title, Claims, etc., instead of the entire document), the size of the search results

will tend to be more manageable. Since the expected results are fewer, we measure the

precision @ 15 (precision at the first 15 retrieved documents). These results are

summarized in Table 4.4.

Table 4.4: Precision and Average Rank of Core Patents for Fielded Search on Patent Documents

Query                                          Precision @ 15   Avg. Rank of Core Patents
‘Erythropoietin’ in All Fields                 0.18             49.4
‘Production of Erythropoietin’ in All Fields   0.23             47.4
‘Production of Erythropoietin’ in Title        0.50             3
‘Production of Erythropoietin’ in Abstract     0.31             19.4
‘Production of Erythropoietin’ in Claims       0.12             6.5
‘Production of Erythropoietin’ in Description  0.21             41.8

4.3.2.2 Query Expansion for Retrieval of Scientific Publications

The TREC data set provides 36 topics over which the methodologies are evaluated. Each topic is a question asking for a list of specific entity types. The 14 entity types, such as Proteins, Genes and Diseases, are based on terminology from

different biomedical sources such as MeSH and GO [7,90]. The rules allow us to

modify the original query, but the interaction with knowledge sources must be in an

automated fashion. For the analysis presented in this section, we restrict ourselves to

all entity types that can be extracted from the MeSH ontology. This results in a total of

10 topics from the original 36 specified in the TREC data set. The resulting 10 topics

are pre-processed to clearly specify the terms that must be used for expansion. The

terms that are to be expanded are renamed to match the exact concept name used in

MeSH to avoid any errors in mapping. For example, the entity type ‘Proteins’ is

renamed to ‘Amino Acids, Peptides or Proteins’. All other noun phrases are used to

query, but not expanded. Since the entity types are fairly general, we only expand to

the subclasses. In order to study the effect of the depth of expansion on retrieval, we

extract terms up to 7 levels of subclasses, starting at the entity type. Table 4.5

summarizes the modified queries, the selected knowledge sources, and the entity

types.

We developed a query parser and constructor which is responsible for query

formulation and ensures the automatically generated queries are syntactically correct.

The expanded terms are arranged in a series of ‘OR’ boolean clauses and replace the

original term in the query that was expanded. For example, in the query “[Tumor]

AND Zebrafish”, if [Tumor] is expanded to ‘Neoplasm’ ‘Leukemia’, and ‘nerve

sheath tumor’, then the original query will be automatically transformed as follows:

“[Tumor] AND Zebrafish” → (“Neoplasm” OR “Leukemia” OR “Nerve Sheath Tumor”) AND (Zebrafish)
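A minimal sketch of this transformation, with the quoting behavior described below; the function names are ours and the expansion list is the example's:

```python
def or_clause(terms):
    """Join expanded terms with OR; quote multi-word terms so they are
    searched as phrases rather than as separate OR'ed words."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

def expand_query(query, expansions):
    """Replace each [Bracketed] marker with the OR clause of its expansions."""
    for marker, terms in expansions.items():
        query = query.replace(marker, or_clause(terms))
    return query

q = expand_query(
    "[Tumor] AND (Zebrafish)",
    {"[Tumor]": ["Neoplasm", "Leukemia", "nerve sheath tumor"]},
)
print(q)
```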

Subtle modifications in the way the query is generated can result in unexpected

behavior. For example, if “nerve sheath tumor” was not enclosed in quotes as a

phrase query, the search would include the terms ‘nerve’, ‘sheath’ and ‘tumor’ in


separate OR clauses. This is different from searching for the phrase “nerve sheath

tumor” and may result in a low precision. The parser ensures that phrases are properly

enclosed in quotes to avoid inaccuracies. Figure 4.3 summarizes the results. The best

performance is observed at a depth of 3 with a document MAP of 0.199. These results are a significant improvement over the baseline queries and compare to average performances in the 2007 TREC genomics competition results.

Table 4.5: Pre-Processed Queries to Evaluate Query Expansion on Scientific Publications

Topic 200
  Original: What serum [PROTEINS] change expression in association with high disease activity in lupus?
  Pre-processed: [Amino Acids, Proteins or Peptides][MeSH] AND lupus

Topic 203
  Original: What [CELL OR TISSUE TYPES] express receptor binding sites for vasoactive intestinal peptide (VIP) on their cell surface?
  Pre-processed: [Cells OR Tissues][MeSH] AND (receptor binding sites) AND (vasoactive intestinal peptide VIP) AND (cell surface)

Topic 204
  Original: What nervous system [CELL OR TISSUE TYPES] synthesize neurosteroids in the brain?
  Pre-processed: [Cells OR Tissues][MeSH] AND (nervous system) AND neurosteroid AND brain

Topic 211
  Original: What [ANTIBODIES] have been used to detect protein PSD-95?
  Pre-processed: [Antibodies] AND PSD-95

Topic 215
  Original: What [PROTEINS] are involved in actin polymerization in smooth muscle?
  Pre-processed: [Amino Acids, Proteins or Peptides][MeSH] AND "smooth muscle"

Topic 217
  Original: What [PROTEINS] in rats perform functions different from those of their human homologs?
  Pre-processed: [Amino Acids, Proteins or Peptides][MeSH] AND (rat AND human AND homolog AND function)

Topic 219
  Original: In what [DISEASES] of brain development do centrosomal genes play a role?
  Pre-processed: [Brain Diseases][MeSH] AND (centrosome "brain development")

Topic 220
  Original: What [PROTEINS] are involved in the activation or recognition mechanism for PmrD?
  Pre-processed: [Amino Acids, Proteins or Peptides][MeSH] AND (involved in the activation or recognition mechanism for PmrD)

Topic 226
  Original: What [PROTEINS] make up the murine signal recognition particle?
  Pre-processed: [Amino Acids, Proteins or Peptides][MeSH] AND (murine AND signal AND particle AND recognition)

Topic 231
  Original: What [TUMOR TYPES] are found in Zebrafish?
  Pre-processed: [Tumor][MeSH] AND Zebrafish

Upon further examination, we realize some queries perform better than the others

(see Figure 4.4).

Figure 4.3: Effect of Depth of Query Expansion on Retrieval of Scientific Publications

Figure 4.4: Performance of Query Expansion on Individual Topics

The reason is that many of the terms actually appearing in the text of the publications are not available under the selected concept for expansion.

For example, in the query, “[Brain Diseases][MeSH] AND (centrosome "brain

development”)”, the ground truth contains – ‘Schizophrenia’. MeSH classifies this

under a different parent and not ‘[Brain Diseases]’. Since we only extract subclasses

of the concept ‘Brain Diseases’, the term ‘Schizophrenia’ is never retrieved from

MeSH. Hence, choosing appropriate domain knowledge and mapping the query terms

to the correct concepts becomes important. For example, if we used ‘Central Nervous

System Diseases’ as our starting concept for expansion, the term ‘Schizophrenia’

would have been retrieved, improving search results. However, this will drastically

increase the number of query terms resulting in long querying times. Figure 4.5 shows

the increase in the number of query terms as the depth of expansion increases. Our

goal is to choose a depth of expansion that gives us good results, and yet provides

reasonable query times. If we consider only those queries for which appropriate

domain knowledge is available, a high MAP can be achieved.
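The trade-off between depth of expansion and query size described above can be sketched as a breadth-first traversal over a concept hierarchy. The sketch below uses a toy parent-to-children mapping, not the actual MeSH data or the thesis implementation:

```python
from collections import deque

def expand_concept(hierarchy, root, max_depth):
    """Collect all subclasses of `root` up to `max_depth` levels below it,
    using a breadth-first traversal of a parent -> children mapping."""
    terms, queue = [root], deque([(root, 0)])
    while queue:
        concept, depth = queue.popleft()
        if depth == max_depth:
            continue
        for child in hierarchy.get(concept, []):
            terms.append(child)
            queue.append((child, depth + 1))
    return terms

# Toy hierarchy standing in for a slice of a domain ontology.
toy = {
    "Tumor": ["Neoplasm", "Leukemia"],
    "Neoplasm": ["Nerve Sheath Tumor"],
}

# Depth 1 keeps the query small; depth 2 pulls in deeper subclasses.
print(expand_concept(toy, "Tumor", 1))       # ['Tumor', 'Neoplasm', 'Leukemia']
print(len(expand_concept(toy, "Tumor", 2)))  # 4
```

Capping `max_depth` is exactly the lever discussed above: a deeper traversal raises recall but inflates the term list and the querying time.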

Figure 4.5: Number of Query Terms with Increasing Depth of Query Expansion

In most disciplines, journals publicly share only the metadata and Abstracts of publications instead of the full text. In order to see how our methodology would perform on the PubMed index (which includes Abstracts, Article Titles, and other metadata such as Journal Titles, Date of Publication, etc.), we restrict the query to only the Abstracts of the documents. The search results in a Document MAP of 0.100. While this value is lower than the searches performed on full-text, it is still a significant improvement over the baseline values.

The proximity of the query terms to one another is also an important factor to be

considered. Generally, if concepts in a query are very far apart in a document, the

document is less likely to be relevant to the query. We modify the query parser to generate proximity queries such that the original Boolean query

(“Neoplasm” OR “Leukemia” OR “Nerve Sheath Tumor”) AND (Zebrafish)

is rewritten as

“Neoplasm Zebrafish”~100 OR “Leukemia Zebrafish”~100 OR “Nerve Sheath Tumor Zebrafish”~100

where “Neoplasm Zebrafish”~100 implies that the terms ‘neoplasm’ and ‘zebrafish’

must be within 100 words of each other. The proximity queries perform extremely

well for some queries, but decrease the overall document MAP to 0.052.
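The Boolean-to-proximity rewriting above can be expressed as a small helper function. This is an illustrative sketch, not the actual query parser; it assumes the query has already been split into the group of expanded terms and the single constraint term:

```python
def to_proximity_query(expanded_terms, constraint, slop=100):
    """Rewrite (t1 OR t2 OR ...) AND (constraint) into a disjunction of
    Lucene-style proximity clauses "t constraint"~slop."""
    clauses = ['"%s %s"~%d' % (term, constraint, slop) for term in expanded_terms]
    return " OR ".join(clauses)

query = to_proximity_query(["Neoplasm", "Leukemia", "Nerve Sheath Tumor"], "Zebrafish")
print(query)
# "Neoplasm Zebrafish"~100 OR "Leukemia Zebrafish"~100 OR "Nerve Sheath Tumor Zebrafish"~100
```

The `slop` parameter corresponds to the maximum word distance (100 in the example above) and could be tuned per query.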

There are various other characteristics of scientific publications that can be

exploited to improve retrieval of documents. The MeSH descriptors are indexed along

with scientific publications to indicate the general theme of the topic. Especially in the

absence of full-text, MeSH descriptors have been shown to improve retrieval in

conjunction with searching Abstracts [63]. Other forms of experimentation may

include expanding more than one set of terms in the query. For example, in the query

“[Tumor][MeSH] AND Zebrafish”, the term ‘Zebrafish’ could also be expanded to

include synonyms such as ‘Danio Rerio’ such that the query becomes:

“[Tumor][MeSH] AND [Zebrafish][MeSH]”
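Expanding a second query term with its synonyms, as in the ‘Zebrafish’/‘Danio Rerio’ example, amounts to replacing each mapped term with an OR group. The sketch below assumes a toy synonym table; in the actual methodology such variants would come from MeSH:

```python
def expand_with_synonyms(terms, synonyms):
    """Replace each term that has known synonyms with an OR group
    covering the term and all of its synonyms, joined by AND."""
    groups = []
    for term in terms:
        variants = [term] + synonyms.get(term, [])
        groups.append("(" + " OR ".join('"%s"' % v for v in variants) + ")")
    return " AND ".join(groups)

# Toy synonym table standing in for MeSH entry terms.
syn = {"Zebrafish": ["Danio Rerio"], "Tumor": ["Neoplasm"]}
print(expand_with_synonyms(["Tumor", "Zebrafish"], syn))
# ("Tumor" OR "Neoplasm") AND ("Zebrafish" OR "Danio Rerio")
```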


4.4 EVALUATING PATENT SYSTEM ONTOLOGY AND IR FRAMEWORK

Our goal is to facilitate the retrieval of a collection of relevant documents across

multiple information sources in the patent system. The diversity of the information

sources, combined with little or no interoperability between them, imposes a

serious challenge for such retrieval. Our patent system ontology, described in Section

3.3, provides a standardized representation for the various types of documents and

explicitly relates them based on the cross-references. As a result, the patent system

ontology facilitates information integration across multiple types of documents. This

section evaluates the patent system ontology based on its capability to answer a series

of questions, generated to represent two use case scenarios – (1) a patent prior art

search, and (2) infringement analysis. The queries, partly borrowed from the

competency questions described in Section 3.3.1, are translated into equivalent

SPARQL queries. The ontology is queried through a Virtuoso SPARQL endpoint [96]. The main focus is on illustrating the use of cross-references, although formal

measures such as precision and recall are provided where applicable. Since the current

implementation of the patent system ontology does not include scientific publications,

we constrain the evaluation to patent documents, court cases and file wrappers.

The standardized terminology to represent documents in the patent system

ontology can potentially serve as a backbone for applications. Applications can query

the patent system ontology for required terminology, or guidelines. For example, an

application can request the patent system ontology to explain the contents of a patent

document. This would result in a response that indicates the various metadata and textual fields contained in a patent document, and their relationships with other documents. Additionally, the declarative syntax can be used to express heuristics

the form of rules to represent similarity measures, or guidelines for applications to

follow. We evaluate a simple rule-based methodology to express similarity heuristics

via the patent system ontology through an example.


The rest of this section is organized as follows: Sections 4.4.1 and 4.4.2 describe

the use case scenarios and evaluate the patent system ontology through a series of

well-constructed queries. Section 4.4.3 illustrates the rule-based similarity measures

through an example.

4.4.1 USE CASE SCENARIO: PATENT PRIOR ART SEARCH

A patent prior art search is required during both the acquisition phase and the

enforcement phase of the patent system. For example, a patent examiner may want to

do a prior art search in order to examine a patent application, or an inventor may need

to determine the patentability of an invention. The prior art search is done to ensure

the patentability of the invention, i.e. that it is novel and non-obvious. Patent prior art can

be any printed publication in the form of patents, scientific publications, or even PhD

theses. However, for this example, we will limit the prior art to issued patents and

court litigations.

Patent prior art searches are driven by heuristics and strategies that vary from user

to user. However, most users follow a general outline. The search is based on

exploring and learning information from the results and constantly refining the query.

Typically, the first step in patent prior art research is to search using a keyword that

broadly relates to the information need. Considering the volume of patent documents,

a search for a broad keyword could result in several thousands of patents. For

example, a search for the concept ‘protein’ returns over 100,000 documents. This is

also seen from the results of Section 4.3.1.1, where the concept ‘erythropoietin’

covered a large collection of documents. It is possible to reduce the search space by

adding more terms or constraints to the query such as field restrictions, date

restrictions, etc. With a reduced search space, it is possible to scan the abstracts of

some patent documents and identify the important technology classes. Searching for

the keywords under those specific classes will result in more patent documents which


may or may not be relevant. After identifying some relevant patents, some of the

possible next steps could be to follow the forward and backward citations, study the

patents of the most relevant inventor or assignee, etc., to get more relevant results. At

every stage, new keywords can be added and this process is typically repeated until the

results start to converge. The search is then independently applied to the patent

application database, scientific publications, etc.

Patents which have been involved in court cases have an obvious importance and

provide a good starting point for conducting the patent prior art search. In this

exercise, we choose to first search the court case documents and then extract relevant

patents.

Step – I: Search for all court cases containing the term ‘erythropoietin’

The SPARQL query shown in Figure 4.6 searches all documents of the type court

case for the concept ‘erythropoietin’. To perform this search, first the bodies of the

court cases are retrieved. We use the FILTER REGEX clause to search the extracted

text via the ‘resourceVal’ property to retrieve only those court cases which contain the

term ‘erythropoietin’. Ideally, in the IR framework, the term-based search would be handled by the knowledge-based query expansion method. As mentioned in Section

4.2.1, all 30 court documents are returned for the baseline query ‘erythropoietin’.

SELECT DISTINCT ?case
WHERE {
  ?case a CourtCase .
  ?case hasBody ?body .
  ?body resourceVal ?text .
  FILTER REGEX (?text, "erythropoietin", "i") .
}

Figure 4.6: SPARQL Query to Retrieve Court Cases Related to

Erythropoietin


Hence, for the purpose of demonstrating the patent system ontology, we continue to

use SPARQL’s FILTER REGEX clause.

Step – II: List the patents involved in these court cases

The query in Figure 4.7 requests all the patents which have been involved in the 30 court cases related to ‘erythropoietin’ via the ‘patentsInvolved’ property. Eleven patent documents are retrieved with a precision of 0.72. It must be noted that not all 11 patents

may be present in our corpus of 1150 patents. Hence during our instantiation process,

no further information about these patents such as inventors, assignees, etc., may be

available.
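The two-step pattern above — filter court cases by a term, then follow the ‘patentsInvolved’ cross-references — is essentially a join over triples. The sketch below emulates that join in plain Python over a toy list of triples with hypothetical identifiers; the actual system issues the equivalent SPARQL against the Virtuoso endpoint:

```python
import re

# Toy triples (subject, predicate, object); identifiers are illustrative only.
triples = [
    ("case1", "type", "CourtCase"),
    ("case1", "bodyText", "... the erythropoietin patent was asserted ..."),
    ("case1", "patentsInvolved", "5955422"),
    ("case1", "patentsInvolved", "4677195"),
    ("case2", "type", "CourtCase"),
    ("case2", "bodyText", "... unrelated contract dispute ..."),
    ("case2", "patentsInvolved", "1234567"),
]

def patents_in_cases_matching(term):
    """Find court cases whose body matches `term` (case-insensitively),
    then follow the patentsInvolved cross-references to collect patents."""
    cases = {s for s, p, o in triples
             if p == "bodyText" and re.search(term, o, re.IGNORECASE)}
    return sorted({o for s, p, o in triples
                   if p == "patentsInvolved" and s in cases})

print(patents_in_cases_matching("erythropoietin"))  # ['4677195', '5955422']
```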

SELECT DISTINCT ?pat
WHERE {
  ?case a CourtCase .
  ?case hasBody ?body .
  ?body resourceVal ?text .
  FILTER REGEX (?text, "erythropoietin", "i") .
  ?case patentsInvolved ?pat .
}

Results

5411868, 5621080, 5547933, 5618698, 5756349, 5955422, 5441868, 4703008, 4677195, 5547993, 5322837

Figure 4.7: SPARQL Query to Retrieve Patents Involved in Court Cases Related to Erythropoietin

Step – III: Identify the U.S. class, inventors and assignees of these patents

For all the patents retrieved in Step-II that are available in the knowledge base, we identify the most prominent inventors, the assignees, and the technology classes the patents fall under. This is done by adding SPARQL query triples that request individuals along the ‘hasUSClass’, ‘hasInventor’ and ‘hasAssignee’ properties of the extracted patents. Figure 4.8 summarizes these results. By removing the DISTINCT clause, it is possible to estimate which of these results occur most frequently. The figure shows the top 5 occurring technology classes, inventors, and assignees for the query triples that are added.

SELECT ?usclass ?inv ?assignee
WHERE {
  ?case a CourtCase .
  ?case hasBody ?body .
  ?body resourceVal ?text .
  FILTER REGEX (?text, "erythropoietin", "i") .
  ?case patentsInvolved ?pat .
  ?pat hasUSClass ?usclass .
  ?pat hasAssignee ?assignee .
  ?pat hasInventor ?inv .
}

Results

US Class  | Inventor           | Assignee
514/8     | Lin, Fu-Kuen       | Kirin-Amgen, Inc.
530/350   | Hewick, Rodney, M. | Amgen, Inc.
536/23.51 | Seehra, Jasbir, S. | Kiren-Amgen, Inc.
435/325   | Seenra, Jasbir, S. | Genetics Institute, Inc.
435/69.6  |                    |

Figure 4.8: SPARQL Query to Extract US Patent Classification, Names of Assignees and Inventors from Patent Documents

Step – IV: Extract patents with specified technology class, inventors or assignees

The extracted technology classes, inventors, and assignees are used to query the patent corpus to extract additional patents that were not initially retrieved through the term-based search. The query is shown in Figure 4.9 and the results are summarized in Table 4.6. The new patents retrieved based on the inventors result in a higher precision than those retrieved based on the technology classes or assignees. This is because the set of inventors is specific, while the technology classes and assignees cover a broader range of topics.

SELECT DISTINCT ?pat
WHERE {
  { ?pat hasInventor Lin_Fu-Kuen . }
  UNION
  { ?pat hasInventor Seenra_Jasbir_S . }
  UNION
  { ?pat hasAssignee Genetics_Institute_Inc . }
  UNION
  { ?pat hasAssignee Kiren-Amgen_Inc . }
}

Figure 4.9: SPARQL Query to Extract Patent Documents Related to a Set of Inventors, Assignees and/or US Patent Classification

Table 4.6: Precision for Results Obtained by Querying Patent System Ontology for Documents Related to a Set of Inventors, Assignees or US Classification

Query                    | Precision
Top 5 Technology Classes | 0.183
Inventors                | 0.8
Assignees                | 0.256
Combined                 | 0.186

Step – V: Search backward citations of the patents

Alternatively, the backward US patent citations are extracted for each of the 11 patents returned by the query shown in Figure 4.10. Many of these patents can have overlapping backward citations; however, with the DISTINCT clause the size of the resulting list of patents is around 40. This query results in a precision of 0.93, with a recall of 0.29. If we also search the forward citations, we will generate a larger list of patents, some of which may be highly relevant. The ground truths for the patent set were developed by following the forward and backward citations. Hence, this query is expected to yield high-precision results, but it is discussed for demonstration purposes.
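The precision and recall values reported in this section follow the standard set-based definitions, which can be sketched as follows (the document sets below are small illustrative stand-ins, not the actual retrieval results):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall: the fraction of retrieved documents
    that are relevant, and the fraction of relevant documents retrieved."""
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

retrieved = {"p1", "p2", "p3", "p4"}                         # e.g. patents from a citation query
relevant = {"p1", "p2", "p3", "p5", "p6", "p7", "p8", "p9"}  # e.g. the ground truth set
p, r = precision_recall(retrieved, relevant)
print(round(p, 2), round(r, 2))  # 0.75 0.38
```

A citation-driven query like Figure 4.10 tends to behave this way: most of what it returns is relevant (high precision), but it reaches only a fraction of the ground truth (low recall).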

The knowledge base can be incrementally searched based on the results obtained

in Figures 4.6-4.10. Furthermore, the court cases and scientific publications can be

searched for cross-referenced entities such as the newly retrieved patents or inventor

names, etc.

4.4.2 USE CASE SCENARIO: FILE WRAPPER EXAMPLE

In this section, we build on the previous example of the patent prior art search, to

illustrate the process of infringement analysis. An infringement analysis is necessary

to enforce the rights of a patent and prevent others from infringing the inventor’s

rights. An infringement analysis is typically conducted by three different parties – (1)

the company whose patent is being infringed, (2) the company who is infringing the

patent, and (3) the court. Literal infringement is the type of infringement where the claim of one patent literally states the exact same limitations as the claim in another patent. Literal infringement cases are easy to resolve, but extremely rare [113]. When

the claims of two patents do not literally infringe, it is important to determine the

scope of each limitation of the claim under the ‘doctrine of equivalents’ [113]. For

this, the patent’s entire file history has to be studied, with the focus on the wording of the claims and how they evolved. As in the previous example, a series of questions are developed to represent an infringement analysis use case.

SELECT DISTINCT ?pat2
WHERE {
  ?case a CourtCase .
  ?case hasBody ?body .
  ?body resourceVal ?text .
  FILTER REGEX (?text, "erythropoietin", "i") .
  ?case patentsInvolved ?pat .
  ?pat hasCitation ?pat2 .
}

Figure 4.10: Querying Patent System Ontology for Backward Citations

Figure 4.7 shows the list of patents involved in these court cases. Among these

patents, US patent 5,955,422 is identified as a very frequently occurring patent, which

also happens to be one of Amgen’s core patents. We choose to study the file wrapper

of the US patent 5,955,422 to analyze the evolution of the claims.

Step – I: List the contents of the file wrapper

The query shown in Figure 4.11 displays all the events contained within the file

wrapper. This list is obtained via the ‘contains’ property. We order the results by the

date on which they occurred. Notice that the initial application (07/609741) and the

final issued patent (5,955,422) are both part of the file wrapper.

One of the important aspects of a litigation is to determine the priority date34 of a patent. The patent system ontology enables us to view the nature of the application, i.e. whether it was filed as a continuation, continuation-in-part, divisional or a fresh application, and determine the original priority date that applies to the claims.35 If this application is a continuation or a divisional, a more complex query would make it possible to trace back to the priority dates, i.e. to the parent application.

34 The priority date of a patent application is the date used to establish the novelty and non-obviousness of the invention. Priority dates can also date back to parent applications.

SELECT DISTINCT ?doc
WHERE {
  ont:FileWrapper_5955422 ont:contains ?doc .
  ?doc ont:hasDate ?date .
}
ORDER BY ?date

Results

Type               | Name
Patent Application | 07_609741
Applicant Event    | 07/609741_Amendment_1
Applicant Event    | 07/609741_Interference_1
Office Action      | 07_609741_Rejection_1
Applicant Event    | 07/957073_Amendment_1
Issued Patent      | 5955422

Figure 4.11: SPARQL Query to Display Contents of a File Wrapper, Ordered by the Date
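Tracing a priority date back through parent applications amounts to walking a parent-application chain until no parent remains. In the sketch below, the parent link and the filing dates are hypothetical placeholders; the real chain would be recovered from the ontology with a more complex query:

```python
def priority_date(app, parent_of, filing_date):
    """Walk the (hypothetical) parent-application chain from `app`
    and return the filing date of the earliest ancestor."""
    while app in parent_of:
        app = parent_of[app]
    return filing_date[app]

# Hypothetical chain: application 07/957073 is a continuation of 07/609741.
# The filing dates below are placeholders, not the actual dates.
parent_of = {"07/957073": "07/609741"}
filing_date = {"07/609741": "1990-11-06", "07/957073": "1992-10-06"}

print(priority_date("07/957073", parent_of, filing_date))  # 1990-11-06
```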

Step – II: Extract the initial claims

The initial claims as filed by the applicant are generally very different from those finally allowed. When determining the scope of the claims, the differences between the initial claims and the final accepted claims provide important information. The scope of the claims is determined by the added limitations36 which make the claims acceptable. The issued patent by itself will not contain the original claims. However, from a file wrapper, this information can be extracted as shown in Figure 4.12.

SELECT DISTINCT ?claim ?text
WHERE {
  07_609741 hasClaim ?claim .
  ?claim resourceVal ?text .
}

Claim | Claim Text
1     | A purified and isolated polypeptide having part or all of the primary structural conformation … naturally occurring erythropoietin and characterized by being the … of an exogenous DNA sequence.
2     | A polypeptide according to claim 1 further characterized by being free of association with any mammalian protein.
10    | A polypeptide according to claim 1 which has the in vivo biological activity of naturally occurring erythropoietin.

Figure 4.12: SPARQL Query to Extract the Text of Claims from the Original Patent Application

35 For definitions of legal terms, please refer to http://www.uspto.gov/main/glossary/ (Accessed on 03/01/2012).
36 Limitations are individual clauses which form a single patent claim.

Step – III: Study the examiner’s rejection

Figure 4.13 provides a snapshot of the subsequent rejection by the examiner as an


instance of the Rejection class (taken from Protégé). The rejection provides information regarding claims that are allowed, withdrawn, and other claims that are disallowed under a restriction.37 Restrictions can be viewed via the ‘hasRestriction’ property. Since the document is in the form of a letter, the actual text of the rejection, including any restriction, is stored and accessible via the ‘resourceVal’ annotation property. This facilitates searching for information that is not explicitly modeled, such as relevant U.S.C. codes or other regulations that may have led to the rejection or restriction. From Figure 4.13, we see that out of the original 63 claims, claims 1-60 are withdrawn, and of the remaining 3 claims, claims 61-62 are accepted and claim 63 is rejected.

37 For the definition of legal terms, please refer to http://www.uspto.gov/main/glossary/ (Accessed on 03/01/2012).

Figure 4.13: Class View of Patent Examiner’s Restriction in File Wrapper for US Patent 5,955,422

Step – IV: Compare rejected claims and accepted claims

The claims that are allowed can be accessed via the ‘allowedClaim’ property. The

text of the claims can also be viewed as shown in Figure 4.12. The difference between the two claims is very subtle. Claim 62 states –

“A preparation according to claim 61 containing a therapeutically effective amount of

erythropoietin.”,

and claim 63 states –

“A composition according to claim 61 containing a therapeutically effective amount of

recombinant erythropoietin.”

However, claim 63 is rejected on the grounds of being too vague. In a similar fashion,

we can compare the text of the claims at every stage of the prosecution of the

application including the final claims to identify the added limitation which made the

claims acceptable to the examiner.
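Comparing the wording of claims at successive stages of the prosecution can be automated with a token-level diff. The sketch below uses Python's standard difflib on the texts of claims 62 and 63 as quoted above:

```python
import difflib

claim_62 = ("A preparation according to claim 61 containing a "
            "therapeutically effective amount of erythropoietin.")
claim_63 = ("A composition according to claim 61 containing a "
            "therapeutically effective amount of recombinant erythropoietin.")

# Token-level diff: '-' marks words only in claim 62, '+' words only in claim 63.
diff = [tok for tok in difflib.ndiff(claim_62.split(), claim_63.split())
        if tok.startswith(("-", "+"))]
print(diff)
# ['- preparation', '+ composition', '+ recombinant']
```

The surviving tokens surface exactly the changed limitation terms, which is the information an infringement analysis needs at each stage.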

This process of querying the file wrapper can continue as long as desired. The true

potential of the ontology will be visible when complex queries spanning more than

one information domain are presented. The ontology takes advantage of the highly

cross-referenced information and provides the required semantics to jump from one

domain to another with ease. However, this is still a daunting task to perform

manually. The semantics will allow methodologies to automatically process the

information and answer complex queries. The fine granularity of the ontology can

support different applications and users.


4.4.3 OTHER BENEFITS OF THE PATENT SYSTEM ONTOLOGY

The standardized representation of the patent system ontology allows us to use

declarative syntax such as SPARQL and SWRL to query the ontology and define

rules. In Section 3.3.4, we described how SWRL rules can be defined to specify

similarity heuristics over the patent system ontology. We defined 10 rules for

similarity which operate over the metadata and cross-references. In this section, we

present an example to illustrate how these rules can be used to infer document

similarity. We use the Jess rule engine to perform forward chaining over these rules. Three related patent documents – patent 5,955,422 (Doc1), patent

4,677,195 (Doc2) and 4,999,291 (Doc3) – are shown in Figure 4.14.

Figure 4.14: Example to Illustrate a Simple Rule-Based Similarity Measure

In order to compute similarity between the documents, we use the Abstract of Doc1 to query the patent index. Without the use of domain knowledge, the similarity score of Doc1 and Doc3 is low. Using the query expansion methodology, the similarity score between the documents improves to 0.59. The results give Doc3 a higher score than Doc2.

However, Doc1 and Doc2 are highly related and have been challenged together in

several court litigations. This relationship between the documents is not captured

through the bio-ontologies. The rule-based inferences identify this relationship and

give the two documents a similarity score of 0.2 (i.e. 2 out of 10 rules infer that Doc1

and Doc2 are similar). Hence, an appropriate linear combination of the two scores will give Doc2 a higher score than Doc3 in the results.
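The re-ranking suggested above can be sketched as a weighted linear combination of the text-based and rule-based scores. Only the 0.59 text score for Doc3 and the 0.2 rule score for Doc2 come from the example; Doc2's text score and the weights below are hypothetical stand-ins:

```python
def combined_score(text_score, rule_score, w_text=0.5, w_rules=2.0):
    """Linearly combine text-based and rule-based similarity.
    The weights are illustrative and would need tuning."""
    return w_text * text_score + w_rules * rule_score

# Scores against Doc1: Doc3's text score (0.59) and Doc2's rule score (0.2)
# come from the example; Doc2's text score (0.30) is a hypothetical stand-in.
doc2 = combined_score(text_score=0.30, rule_score=0.2)
doc3 = combined_score(text_score=0.59, rule_score=0.0)
print(doc2 > doc3)  # True
```

With a large enough weight on the rule-based score, Doc2 overtakes Doc3, reflecting the cross-reference evidence that text similarity alone misses.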

4.5 SUMMARY

This chapter evaluates the performance of our methodology. We provide a brief

background on formal evaluation measures such as precision, recall, and f-measure. In

addition, we discuss the average precision and ‘precision @ k’ measures, commonly

used in IR literature and TREC competitions as alternatives. In order to help

understand the evaluations of the patent system ontology better, some background on

SPARQL, a query language for RDF, is provided. The knowledge-based query

expansion methodology and the patent system ontology are evaluated independently.

To evaluate the query expansion methodology, first, we generated baseline

references for each type of document. For patents and court cases, we used the query

‘erythropoietin’ to query the document corpus, without any added knowledge to

generate the baseline reference. For publications, we used the original TREC queries

without any modification to generate the baseline references. Additionally, we use the

results from the 2007 TREC genomics competition as a reference. Since our court case

corpus is limited, we continue the analysis on patents and scientific publications.

The queries are generally expanded to several levels of concepts above and below

the original terms. By using weighting functions, the expanded query is evaluated at

different levels such as parents only, children only, and parents and children. In

order to automatically query the domain ontologies, we pre-process queries to specify


which terms need to be expanded. Once the pre-processing of the queries is complete,

the knowledge base is queried for related concepts. A query parser ensures that the

expanded query is syntactically correct.

For patent documents, the effect of expansions is clearly seen in terms of recall, but little improvement over the baseline precision is observed. Weighting the terms in the query also showed little improvement in results. Since the five core patents are important to our use case, we also compute their average rank. The average rank of the core patents shows improvement as subclasses of ‘erythropoietin’ are added to the expanded query. Traversing to superclasses in the hierarchies of the ontologies makes the search more general, decreasing the average rank of the core patents. Generally, the

precision for this data set is observed to be low, irrespective of the query. This is

because the data set covers a broad range of topics related to ‘erythropoietin’. In order

to narrow the scope of the search, we add additional terms in the query as constraints

and restrict the queries to specific fields on the patent document such as the Abstract,

or the Title. The ‘precision @ 15’ and the average rank of the core patents both

improved significantly over the baseline reference for this search.

For publications, we report results using the document MAP measure, which is

also used in the TREC competitions. This allows us to compare our methodology with

the results from the TREC competitions. The scope of our evaluation involves only those queries for which terms can be extracted from the MeSH ontology. The query expansion methodology shows a significant improvement in the document MAP. The results are comparable to other related works demonstrated in TREC. However, considerable scope for improvement remains. We realize that some queries perform better

than others. The reason for the inconsistent performance among queries is the

insufficient domain knowledge provided by the domain ontologies. Hence, the

selection of appropriate domain ontologies is critical for good performance.

Expanding to deeper levels of subclasses below the original concept in the hierarchies


shows only a marginal improvement in results, but a drastic growth in the number of

query terms, increasing the time to execute the query.

Since PubMed indexes only metadata and Abstracts of scientific publications, the

full-text for most publications is not readily available. We evaluated our methodology

using only the Abstracts of documents instead of the full-text. The results are poorer

than full-text retrieval, but significantly better than the baseline results. However,

more experimentation is required to improve the performance using only Abstracts.

The patent system ontology provides a standardized representation for the different

types of documents, enabling information to be integrated. Our main focus was to

illustrate the use of the cross references to relate documents from multiple sources.

The patent system ontology is evaluated through two use case scenarios – a patent

prior art search, and an infringement analysis example. A list of questions is generated

based on typical questions that arise in the use case scenarios. These questions are

translated into SPARQL queries to query the patent system ontology. The cross-

references provide strong relevancy measures, helping us quickly identify important documents. In addition, the use of metadata such as technology classifications, inventor and assignee names, etc., shows improvement in the results.

The patent system ontology can also be used to express heuristics through SWRL

rules. We discuss an example of expressing simple similarity heuristics through rules.

The example shows that good heuristics can be used to improve similarity rankings

between relevant documents. In general, in addition to the interoperability provided by

the patent system ontology, the declarative syntax allows additional knowledge related

to the patent system to be encoded into the ontology. Applications which build around

the patent system ontology can derive any additional information such as similarity

heuristics, dynamically from the ontology.


Chapter 5.

CONCLUSION AND FUTURE WORK

Advancements in computer science and information technology have enabled us to

address serious issues with respect to information growth and management in the

science and technology space. In this chapter, we will provide a brief summary of our

methodology to retrieve information from multiple diverse information sources in the

patent system. Based on the developed framework, potential future directions are

discussed.

5.1 SUMMARY

There has been tremendous growth in research and development in science and technology. Intellectual Property (IP) related information for science and technology is

distributed across several heterogeneous information silos. The scattered distribution

of information, combined with the enormous sizes and complexities, makes any attempt

to collect IP-related information for a particular technology a daunting task. Hence,

there is a need for a software framework which facilitates semantic and structural

interoperability between the diverse and un-coordinated information sources in the

patent system. Such a framework would form a basis for information integration and

retrieval across multiple sources. This thesis presents a methodology and a framework

toward improving retrieval of information from the patent system.


We developed a repository of documents comprising (a) issued patents; (b) federal and district patent litigations; (c) scientific publications; and (d) file wrappers. Specifically, we developed the repository around a use case in the biomedical domain, erythropoietin, a hormone responsible for the production of red blood cells. The document repository consists of 1150 issued patents, around 30 court litigation documents, 162,000+ scientific publications from the TREC 2007 Genomics data set,38 and the file wrapper for US patent 5,955,422. Common challenges faced in

collecting documents from the information sources include – (1) varying publication

formats such as HTML, XML and image files; (2) incompatible or missing interfaces

and web services to access information; and (3) unstructured representation of

information. Parsers are developed to automatically download and extract important

information. The extracted metadata and textual information are stored in well-marked-up XML files. The repository is made searchable through text indexes and a search

interface constructed using Apache Lucene and Apache Solr, respectively [5,6].

Based on the document repository, we discuss the underlying methodology to

integrate and search information across multiple information sources. First, we discuss

a knowledge-based query expansion methodology to enhance document retrieval

within a single information source. Domain knowledge is extracted from BioPortal, a

library of over 250 biomedical ontologies that have been created and maintained by

subject experts. Next, a patent system ontology is developed to improve structural

interoperability between information sources. The ontology provides the necessary

semantics for integrating information across multiple sources. Finally, the IR

framework provides an iterative multi-domain search methodology that combines the

knowledge-based query expansion methodology and the patent system ontology.

Several examples are presented to illustrate the methodology.

38 For details regarding the 2007 TREC Genomics data set, see

http://ir.ohsu.edu/genomics/2007data.html (Accessed on 03/01/2012).
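The core of the knowledge-based expansion step can be sketched as follows. The ontology fragment here is a hypothetical stand-in for the concept-to-synonym mappings a real system would pull from BioPortal ontologies; the function simply appends synonyms for any concept found in the user's query.

```python
# Toy ontology fragment: preferred term -> synonyms. These entries are
# illustrative only; a real system would query BioPortal for them.
ONTOLOGY = {
    "erythropoietin": ["epo", "epoetin"],
    "red blood cell": ["erythrocyte"],
}

def expand_query(query, ontology):
    """Append ontology synonyms for any concept found in the query."""
    terms = [query]
    for concept, synonyms in ontology.items():
        if concept in query.lower():
            terms.extend(synonyms)
    return " OR ".join(terms)

expanded = expand_query("erythropoietin receptor", ONTOLOGY)
```

The OR-joined form is one common way to submit the expanded terms to a text index; weighting the original query terms above the added synonyms is a typical refinement.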

The knowledge-based query expansion methodology is evaluated for patent

documents and scientific publications. Results are reported using standard

measures such as recall and precision. The expanded queries showed improved

performance over the unexpanded queries for both types of documents and are

comparable to other successful implementations demonstrated in TREC [59]. The

patent system ontology is evaluated through queries spanning multiple information

sources. The queries simulate real-world patent-related applications, namely prior art

searches and infringement analysis. Results show that the methodology not only

improves retrieval, but also allows custom search strategies to be encoded into the IR

framework through the patent system ontology. Several limitations of the

methodology are identified during the analysis and discussed.
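The two measures used in the evaluation are set-based: for a retrieved set R and a relevant set T, precision = |R ∩ T| / |R| and recall = |R ∩ T| / |T|. A minimal computation (the document ids are invented):

```python
def precision_recall(retrieved, relevant):
    """Standard set-based IR measures.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Retrieving 4 documents of which 2 are among the 3 relevant ones:
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```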

5.2 FUTURE WORK

The development of our methodology is an important step toward intelligent

applications for multi-domain IR. In this section, we discuss potential research

directions based on our work. First, research in the application of digital repositories to

manage information in the patent system is suggested. We then explain the importance

of user relevancy feedback and related techniques to enhance user interaction with the

search process. Several related methodologies for IR are discussed that can be built on

top of the current methodology. Finally, extensions for the current methodology are

suggested as future research directions to scale to other technology domains and

information sources.

5.2.1 DIGITAL REPOSITORIES

Our methodology heavily relies on the textual content and metadata available from

the information sources. For the purpose of prototyping the methodologies, we

downloaded documents from the sources to a local repository. If we were to directly

interface with the information sources, their diversity and limited interoperability would pose a

major limitation to our methodology. Digital repository software tools such as Fedora

and DSpace are increasingly being adopted by institutions to publish and preserve

internal research and data [82,125]. Digital repositories support standard protocols for

information exchange and representation such as the Open Archives Initiative –

Protocol for Metadata Harvesting (OAI-PMH) and the Dublin Core Metadata

Initiative (DCMI), which facilitate easy integration with other information on the

internet.39,40
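Harvesting from an OAI-PMH-compliant repository amounts to issuing simple HTTP requests. The sketch below builds a `ListRecords` request URL; `verb` and `metadataPrefix` are standard OAI-PMH parameters, and `oai_dc` (Dublin Core) is the metadata format every compliant repository must support. The base URL is illustrative, not a real endpoint.

```python
from urllib.parse import urlencode

def oai_listrecords_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    oai_dc is the mandatory Dublin Core metadata format; an optional
    set specifier restricts harvesting to one collection.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

url = oai_listrecords_url("http://repository.example.org/oai")
```

A harvester would fetch this URL, parse the returned XML, and follow the protocol's `resumptionToken` to page through large result sets.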

A great deal of information, especially in the US patent system, is still preserved

and expressed through images and other multimedia. Several researchers have

attempted to study retrieval of images and other forms of non-textual content [80,132].

In addition to documents, digital repositories support many forms of digital media

such as images, audio, and other multimedia. Additionally, domain

ontologies can be superimposed on top of the repositories enabling knowledge-based

methodologies, such as ours, to be easily integrated with digital repositories. Many

information sources in the scientific publication domain such as IEEE and ACM

already make use of such digital repositories [2,64]. Future research could study the

impact of digital repositories as tools for information management in the government.

5.2.2 USER RELEVANCY FEEDBACK

Information in the patent system is consumed by a wide range of users with both

technical and legal expertise, from lawyers and patent examiners, to technical

organizations. The information needs of users vary from one another. For example, a

lawyer performing an invalidity search may be interested in the legal aspects of the

documents, while a technical startup company performing a patentability search may be

interested in learning and applying the technology. Understanding such diverse

information needs from short queries is a hard problem. Hence, inputs to the IR

system must include contextual information about the user and relevancy feedback in

addition to standard queries.

39 Open Archives Initiative – Protocol for Metadata Harvesting.

http://www.openarchives.org/OAI/openarchivesprotocol.html (Accessed on 03/01/2012).

40 Dublin Core Metadata Initiative Specifications.

http://dublincore.org/specifications/ (Accessed on 03/01/2012).

User relevancy feedback has been studied and shows promise for improving

retrieval [71,109,119]. In addition, a good user interface and user experience are

important to capture user feedback. We implement one such feature, faceting, which is

the process of aggregating the results over a defined property of the system. For

example, if a property ‘Inventors’ is defined for patent documents, a facet on

‘Inventors’ will show the number of result documents associated with each

inventor. Faceting along such properties can provide implicit information (quick

statistics in terms of counts) regarding the prominent entities in the results and will

enable users to quickly narrow down to relevant results. Other features include tag

clouds [26] and co-occurrence graphs [69]. Although user relevancy feedback is out

of the scope of the current research, we implemented faceting as a feature in the search

tool (Section 3.4.1) using Apache Solr libraries [6].
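The facet computation itself is a per-field aggregation over the result set, which is what Solr's facet parameters return. A minimal sketch (the result documents and field values are invented; a multi-valued field like ‘Inventors’ contributes one count per value):

```python
from collections import Counter

def facet_counts(documents, field):
    """Aggregate result counts over one (possibly multi-valued) field,
    as a search server does when faceting on that field."""
    counts = Counter()
    for doc in documents:
        for value in doc.get(field, []):
            counts[value] += 1
    return counts

# Toy search results with multi-valued metadata fields.
results = [
    {"inventors": ["Lin"], "year": ["1999"]},
    {"inventors": ["Lin", "Smith"], "year": ["2001"]},
    {"inventors": ["Smith"], "year": ["2001"]},
]
inventor_facet = facet_counts(results, "inventors")
```

Presenting these counts next to the result list lets a user click a prominent value to filter the results, which is the narrowing-down behavior described above.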

5.2.3 QUERY EXPANSION, SEMANTIC INDEXING AND OTHER METHODOLOGIES

Query expansion techniques modify user queries by appending new terms that are

derived from either external sources or from the search results [11,84]. As a result,

query expansion techniques are not bound to any technology domain, as long as

relevant domain knowledge is available. A limitation to this approach is that queries

can often become lengthy and may result in undesired side effects such as delays in

the retrieval of results or overloading of the servers that host the text indexes. As an

alternative, some studies have explored the possibility of indexing the terms along

with the domain knowledge such that the queries will still return the same results

without having to query external knowledge sources, thus reducing query time

[88]. However, this methodology does not have the flexibility to dynamically choose

domain knowledge, as the query expansion methodologies do, but is limited to the domain

knowledge already indexed. Every time a new domain ontology is required, the entire

index will have to be re-built, consuming additional space and effort. A potential

future direction could study how these two methodologies can be integrated into a

hybrid approach such that more common domain ontologies are indexed along with

the documents, and other ontologies are queried dynamically when needed. Other

related research in the areas of natural language processing [14,33], distributed

semantic computing [25], and application of tensor algebra [12] are also valuable

additions to the current prototype.
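The hybrid strategy proposed above can be sketched as two expansion points: synonyms from common ontologies are folded into the document terms at indexing time, while rarer ontologies are consulted only at query time. All ontology entries and terms here are toy data, not drawn from any real ontology.

```python
# Common ontology, baked into the index when documents are indexed.
INDEXED_ONTOLOGY = {"erythropoietin": ["epo"]}
# Rarer ontology, consulted dynamically at query time only.
DYNAMIC_ONTOLOGY = {"anemia": ["anaemia"]}

def index_terms(text):
    """Indexing-time expansion: store common-ontology synonyms
    alongside the document's own terms."""
    terms = set(text.lower().split())
    for concept, synonyms in INDEXED_ONTOLOGY.items():
        if concept in terms:
            terms.update(synonyms)
    return terms

def query_terms(query):
    """Query-time expansion: only the rare ontologies need a lookup,
    so queries stay short and the index stays stable."""
    terms = set(query.lower().split())
    for concept, synonyms in DYNAMIC_ONTOLOGY.items():
        if concept in terms:
            terms.update(synonyms)
    return terms
```

Under this split, adding a rare ontology requires no re-indexing, while the frequently used domain knowledge imposes no per-query lookup cost.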

5.2.4 SCALING TO MORE APPLICATIONS, MORE DATA SOURCES, AND MORE

SUBJECT DOMAINS

The scope of this thesis involves IP-related information for a biomedical use case.

However, the patent system covers a wide range of technology areas such as

environmental engineering and mechanical devices. The domain knowledge for other

technologies may not be as advanced and complete as that of the biomedical domain. Hence,

for technologies which have little or no domain knowledge already available,

automatic ontology learning is a promising field of study to learn the required domain

knowledge [103,139]. Furthermore, the patent system involves several other

information sources, such as laws and regulations, and other agency repositories such as the

FDA drug database that are also valuable sources of knowledge. Future research

directions can explore how our methodology will scale to other subject domains and

other information sources.

BIBLIOGRAPHY

1. 35 U.S.C. Sec. 103 (United States Code). “Conditions for Patentability; Non-

Obvious Subject Matter,” 2010.

2. ACM Digital Library. http://dl.acm.org/ (Accessed on 03/01/2012)

3. Alani, H., and Brewster, C., “Ontology Ranking based on the Analysis of

Concept Structures,” In Proceedings of the Third International Conference on

Knowledge Capture (K-CAP 05), Banff, Canada, 2005.

4. Amati, G. and Van Rijsbergen, C., J., “Probabilistic Models of Information

Retrieval Based on Measuring the Divergence from Randomness,” ACM

Trans. Inf. Syst., 20 (4):357-389, October 2002.

5. Apache Lucene. http://lucene.apache.org/

6. Apache Solr. http://lucene.apache.org/solr/

7. Ashburner, M., Ball, C., A., Blake, J., A., Botstein, D., Butler, H., Cherry, J.,

M., Davis, A., P, Dolinski, K., Dwight, S., S., Eppig, J., T., Harris, M., A.,

Hill, D., P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J., C.,

Richardson, J., E., Ringwald, M., Rubin, G., M. and Sherlock, G., “Gene

Ontology: Tool for the Unification of Biology,” (The Gene Ontology

Consortium), Nature Genetics, 25 (1):25-29, May 2000.

8. Baeza-Yates, R. and Ribeiro-Neto. B., Modern Information Retrieval, ACM

Press, 1999.

9. Baron, J., R. and Thompson, P., “The Search Problem Posed by Large

Heterogeneous Data Sets in Litigation: Possible Future Approaches to

Research,” Proceedings of the 11th International Conference on Artificial

Intelligence and Law (ICAIL 2007), Stanford, CA, Jun 4-8, 2007.

10. Berners-Lee, T., Hendler, J. and Lassila, O., “The Semantic Web,” Sci. Am.,

284 (5):34–43, 2001.

11. Bhogal, J., Macfarlane, A. and Smith, P., “A Review of Ontology Based Query

Expansion,” Information Processing and Management, 43 (4):866-886, July

2007.

12. Biswas, A., Mohan, S. and Mahapatra, R., “Semantic Technologies for

Searching e-Science Grids,” In H. Chen et.al (eds), Semantic e-Science, Annals

of Information Systems, 11:141-187, 2010.

13. Bizer, C., Heath, T. and Berners-Lee, T., “Linked Data - The Story So Far,”

International Journal on Semantic Web and Information Systems, 5 (3), 2009.

14. Blake, C., “Beyond Genes, Proteins, and Abstracts: Identifying Scientific

Claims from Full-Text Biomedical Articles,” Journal of Biomedical

Informatics, 43 (2):173-189, April 2010.

15. Bodenreider, O. and Stevens, R., “Bio-Ontologies: Current Trends and Future

Directions,” Brief Bioinform, 7 (3):256–274, September 2006.

16. Bodenreider, O., “The Unified Medical Language System (UMLS): Integrating

Biomedical Terminology,” Nucleic Acids Research, 32(1):267-270, January

2004.

17. Branin, J., J., “Institutional Repositories,” In Drake, M. A. (Ed.), Encyclopedia

of Library and Information Science, Boca Raton, FL: Taylor & Francis Group,

LLC, pp. 237-248, 2005.

18. Broekstra, J., Kampman, A. and Harmelen, F., V., “Sesame: A Generic

Architecture for Storing and Querying RDF and RDF Schema”, The Semantic

Web – ISWC 2002, Lecture Notes in Computer Science, 2342:54-68, 2002.

19. Brown, S., H., Elkin, P., L., Rosenbloom, S., T., Husser, C., Bauer, B., A.,

Lincoln, M., J., Carter, J., Erlbaum, M. and Tuttle, M., S., “VA National Drug

File Reference Terminology: A Cross-Institutional Content Coverage Study,”

Stud. Health Technol. Inform., 107(1):477-81, 2004.

20. Bruijn, J., D. et al., “State-of-the-art Survey on Ontology Merging and

Aligning,” V1. SEKT-project report D4.2.1 (WP4), IST-2003-506826, 2003.

21. Bruninghaus, S. and Ashley, K., D., “Improving the Representation of Legal

Case Texts with Information Extraction Methods,” Proceedings of the 8th

International Conference on Artificial Intelligence and Law (ICAIL), St. Louis,

Missouri, pp. 42-51, 2001.

22. Buitelaar, P., Eigner, T. and Declerck, T., “OntoSelect: A Dynamic Ontology

Library with Support for Ontology Selection,” In Proceedings of the Demo

Session at the International Semantic Web Conference, Hiroshima, Japan,

2004.

23. Center for Drug Evaluation and Research, Office of Epidemiology and

Biostatistics, “COSTART: Coding Symbols for Thesaurus of Adverse

Reaction Terms,” 4th ed. Bethesda, M: US Food and Drug Administration,

Publication PB93-209138, 1993.

24. Chimaera Website. http://www.ksl.stanford.edu/software/chimaera (Accessed

on 03/01/2012).

25. Cohen, T. and Widdows, D., “Empirical Distributed Semantics: Methods and

Biomedical Applications,” Journal of Biomedical Informatics, 42:390-405,

2009.

26. Collins, C., Viegas, F., B. and Wattenberg, M., "Parallel Tag Clouds to

Explore and Analyze Faceted Text Corpora," IEEE Symposium on Visual

Analytics Science and Technology, pp. 91-98, October 2009.

27. Crow, R., “The Case for Institutional Repositories: A SPARC Position Paper,”

The Scholarly Publishing and Academic Resources Coalition, Washington,

DC, 2001.

28. De Nicola, A., Missikoff, M. and Navigli, R., "A Software Engineering

Approach to Ontology Building,” Information Systems, 34(2):258-275, 2009.

29. Deerwester, S., Dumais, S., Landauer, T., Furnas, G. and Harshman, R.,

“Indexing by Latent Semantic Analysis,” J. Amer. Soc. Info. Sci., 41:391-407,

1990.

30. Derwent World Patent Index.

http://thomsonreuters.com/products_services/legal/legal_products/a-z/derwent_world_patents_index/ (Accessed on 03/01/2012).

31. Devezas, J., L., Nunes, S. and Ribeiro, C., “FEUP at TREC 2010 Blog Track:

Using H-Index for Blog Ranking,” In The Nineteenth Text REtrieval

Conference Proceedings (TREC 2010), 2010.

32. Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R., S., Peng, Y., Reddivari, P.,

Doshi, V. and Sachs, J., “Swoogle: A Search and Metadata Engine for the

Semantic Web,” In Proceedings of the thirteenth ACM International

Conference on Information and Knowledge Management (CIKM '04), ACM,

New York, NY, USA, pp. 652-659, 2004.

33. Dingare, S., Finkel, J., Nissim, M., Manning, C. and Grover, C., “A

System For Identifying Named Entities in Biomedical Text: How Results From

Two Evaluations Reflect on Both the System and the Evaluations,” The 2004

BioLink Meeting: Linking Literature, Information and Knowledge for Biology,

ISMB, 2004.

34. DocketX. https://www.docketx.com/ (Accessed on 03/01/2012).

35. Doms, A. and Schroeder, M., “GoPubMed: Exploring Pubmed with the Gene

Ontology,” Nucleic Acids Research, 33:783-786, July 2005.

36. Eaton, A., D., “HubMed: A Web-Based Biomedical Literature Search

Interface,” Nucl. Acids Res., 34 (2):W745-W747, July 2006.

37. Ekstrom, J. A., Lau, G. T., Spiteri, D., Cheng, J. C. P. and Law, K. H.,

"MINOE: A Software Tool to Evaluate Ocean Management in the Context of

Ecosystems,” Coastal Management, 38(5):457-473, First published on: 21 July

2010 (iFirst).

38. European Patent Office. http://www.epo.org/ (Accessed on 03/01/2012).

39. Fellbaum, C., “WordNet,” Theory and Applications of Ontology: Computer

Applications, pp. 231-243, 2010.

40. Finkel, J., Dingare, S., Nguyen, H., Nissim, M., Manning, M., and Sinclair, G.,

“Exploiting Context for Biomedical Entity Recognition: From Syntax to the

Web,” Joint Workshop on Natural Language Processing in Biomedicine and

its Applications, Coling, 2004.

41. Friedman-Hill, E., “Jess, the Rule Engine for the Java Platform,”

http://herzberg.ca.sandia.gov/jess/ (Accessed on 03/01/2012).

42. Fujii, A., “Enhancing Patent Retrieval by Citation Analysis,” In Proceedings of

the 30th Annual International ACM SIGIR Conference on Research and

Development in Information Retrieval, New York, pp. 793-794, 2007.

43. Garfield, E., “New International Professional Society Signals the Maturing of

Scientometrics and Informetrics,” The Scientist, 9 (16), 1995.

44. German Patent Office. http://www.dpma.de/english/index.html (Accessed on

03/01/2012).

45. Ghoula, N., Khelif, K. and Dieng-Kuntz, R., "Supporting Patent Mining by

using Ontology-Based Semantic Annotations,” IEEE/WIC/ACM International

Conference on Web Intelligence (WI'07), pp. 435-438, 2007.

46. Giereth, M., Brugmann, S., Stabler, A., Rotard, M. and Ertl, T., “Application

of Semantic Technologies for Representing Patent Metadata,” First

International Workshop on Applications of Semantic Technologies, 2006.

47. Giereth, M., Koch, S., Kompatsiaris, Y., Papadopoulos, S., Pianta, E., Serafini,

L. and Wanner, L., “A Modular Framework for Ontology-Based

Representation of Patent Information,” Proceeding of the 2007 Conference on

Legal Knowledge and Information Systems: JURIX 2007, 165:49-58, 2007.

48. Golbeck, J., Fragoso, G., Hartel, F., Hendler, J., Parsia, B. and Oberthaler, J.,

“The National Cancer Institute’s Thesaurus and Ontology,” Journal of Web

Semantics, 1(1), 2003.

49. Google and USPTO. http://www.google.com/googlebooks/uspto.html

(Accessed on 03/01/2012).

50. Google Patents. http://www.google.com/patents (Accessed on 03/01/2012).

51. Google Scholar. http://scholar.google.com/ (Accessed on 03/01/2012).

52. Griliches, Z., “Patent Statistics as Economic Indicators: A Survey,” Journal of

Economic Literature, 28 (4):1661–1707, 1990.

53. Gruber, T., R., “Toward Principles for the Design of Ontologies used for

Knowledge Sharing,” Int. J. Hum.-Comput. Stud., 43(5-6):907-928, November

1995.

54. Gruninger, M. and Fox, M., S., “Methodology for the Design and Evaluation

of Ontologies,” In: Proceedings of the Workshop on Basic Ontological Issues

in Knowledge Sharing, IJCAI-95, Montreal, 1995.

55. Guarino, N., “Formal Ontology and Information Systems,” 1998.

56. Guijarro, L., “Interoperability Frameworks and Enterprise Architectures in e-

Government Initiatives in Europe and the United States,” Government

Information Quarterly, 24 (1):89-101, January 2007.

57. Guijarro, L., “Semantic Interoperability in eGovernment Initiatives,” Comput.

Stand. Interfaces, 2008.

58. Hein Online IP Library. http://heinonline.org/ (Accessed on 03/01/2012).

59. Hersh, W. and Voorhees, E., “TREC Genomics Special Issue Overview,”

Information Retrieval, Special Issue on TREC Genomics Track: Guest Editor:

Ellen Voorhees, 12(1):1-15, 2009.

60. Hofmann, T., “Probabilistic Latent Semantic Indexing,” In Proceedings of the

22nd Annual ACM Conference on Research and Development in Information

Retrieval, Berkeley, California, pp. 50-57, 1999.

61. Horridge, M. A Practical Guide To Building OWL Ontologies Using Protégé 4

and CO-ODE Tools. The University of Manchester, March 2011.

62. Horrocks, I., Patel-Schneider, P., F., Boley, H., et al., “SWRL: A Semantic Web

Rule Language Combining OWL and RuleML,” W3C Member Submission, 21

May 2004.

63. Ide, N., C., Russell, F., L. and Demner-Fushman, D., “Essie: A Concept-Based

Search Engine for Structured Biomedical Text,” J Am Med Inform Assoc.,

14:253-263, 2007.

64. IEEE Xplore Digital Library. http://ieeexplore.ieee.org/Xplore/guesthome.jsp

(Accessed on 03/01/2012).

65. IFW Insight. http://ifwinsight.com/ (Accessed on 03/01/2012).

66. Pérez-Iglesias, J., Pérez-Agüera, J., R., Fresno, V. and Feinstein, Y., Z., “Integrating the

Probabilistic Models BM25/BM25F into Lucene,” 30 Nov 2009.

67. Jaffe, A., B., Trajtenberg, M. and Henderson, R., “Geographic Localization of

Knowledge Spillovers as Evidenced by Patent Citations,” The Quarterly

Journal of Economics, 108 (3):577-598, 1993.

68. Japan Patent Office. http://www.jpo.go.jp/ (Accessed on 03/01/2012).

69. Jensen, L., J., Saric, J. and Bork, P., “Literature Mining for the Biologist: From

Information Retrieval to Biological Discovery,” Nature Reviews Genetics,

7:119-129, February 2006.

70. Jonquet, C., Musen, M. A. and Shah, N. H., “A System for Ontology-Based

Annotation of Biomedical Data,” International Workshop on Data Integration

in The Life Sciences 2008, DILS'08, Evry, France, Springer-Verlag, 5109,

Lecture Notes in BioInformatics, pp. 144-152, 2008.

71. Jordan, C. and Watters, C., “Extending the Rocchio Relevance Feedback

Algorithm to Provide Contextual Retrieval,” Advances in Web Intelligence,

Lecture Notes in Computer Science, 3034:135-144, 2004.

72. Kalfoglou, Y. and Schorlemmer, M., “Ontology Mapping: The State of the

Art,” Knowl. Eng. Rev., 18(1):1-31, Jan 2003.

73. Kang, I., Na, S., Kim, J. and Lee, J., “Cluster-Based Patent Retrieval,”

Information Processing and Management, 43(5):1173-1182, Sep 2007.

74. Khelif, K., Hedhili, A. and Collard, M., “Semantic Patent Clustering for

Biomedical Communities,” Proceedings of the 2008 IEEE/WIC/ACM

International Conference on Web Intelligence and Intelligent Agent

Technology, 1:419-422, 2008.

75. Klein, D. and Manning, C., D., “Accurate Unlexicalized Parsing,” Proceedings

of the 41st Meeting of the Association for Computational Linguistics, pp. 423-

430, 2003.

76. Klein, T., E., Chang, J., T., Cho, M., K., et al., “Integrating Genotype and

Phenotype Information: An Overview of the PharmGKB Project,”

Pharmacogenomics, 1:167–70, 2001.

77. Lau, G., T. A Comparative Analysis Framework for Semi-structured

Documents, with Applications to Government Regulations. Ph.D. Thesis,

Department of Civil and Environmental Engineering, Stanford University,

Stanford, CA, August 2004.

78. LexisNexis. http://www.lexisnexis.com/en-us/home.page (Accessed on

03/01/2012).

79. Li, H., Councill, I., Lee, W., C. and Giles, C., L., “CiteSeerX: An Architecture

and Web Service Design for an Academic Document Search Engine,” In

Proceedings of the 15th International Conference on World Wide Web (WWW

'06), ACM, New York, NY, USA, pp. 883-884, 2006.

80. Liu, Y., Zhang, L., G. and Ma, W., Y., “A Survey of Content-Based Image

Retrieval with High-Level Semantics,” Pattern Recognition, 40 (1):262-282,

January 2007.

81. Lopez, M., F., Perez, A., G. and Juristo, N., “METHONTOLOGY: from

Ontological Art towards Ontological Engineering,” In Proceedings of the AAAI

‘97 Spring Symposium, Stanford, USA, pp. 33-40, March 1997.

82. MacKenzie, S. et al., “DSpace: An Open Source Dynamic Digital Repository,”

D-Lib Magazine, 9 (1), January 2003.

83. Maiga, G., and Williams, D., “A Flexible Approach for User Evaluation of

Biomedical Ontologies,” International Journal of Computing and ICT

Research, 2:62-74, December 2008.

84. Manning, C., D., Raghavan, P. and Schutze, H. An Introduction to Information

Retrieval. Cambridge University Press, 2009.

85. MEDLINE. http://www.nlm.nih.gov/pubs/factsheets/medline.html (Accessed

on 03/01/2012).

86. Mitra, P., Wiederhold, G. and Decker, S., “A Scalable Framework for

Interoperation of Information Sources,” Proceedings of the 1st International

Semantic Web Working Symposium (SWWS `01), Stanford University,

Stanford, CA, July 29-Aug 1, 2001.

87. Motik, B., Sattler, U. and Studer, R., “Query Answering for OWL-DL with

Rules,” Web Semantics: Science, Services and Agents on the World Wide Web,

3(1):41-60, Rules Systems, July 2005.

88. Mukherjea, S. and Bamba, B., “BioPatentMiner: An Information Retrieval

System for Biomedical Patents,” In Proceedings of the Thirtieth International

Conference on Very Large Data Bases, 30:1066-1077, 2004.

89. Mulgara Triplestore. http://www.mulgara.org/ (Accessed on

03/01/2012).

90. National Library of Medicine, "Medical Subject Headings (MeSH) Fact

sheet,” May 2005.

91. Navigli, R., “Word Sense Disambiguation: A Survey,” ACM Comput. Surv.,

41(2), Article 10, 69 pages, 2009.

92. NCBI Entrez Cross-Database Search. http://www.ncbi.nlm.nih.gov/Entrez/

(Accessed on 03/01/2012).

93. Noy, N., F. and McGuinness, D., “Ontology Development 101: A Guide to

Creating your First Ontology,” Stanford Knowledge Systems Laboratory

Technical Report KSL-01-05 and Stanford Medical Informatics Technical

Report SMI-2001-0880, March 2001.

94. Noy, N., F., “Semantic Integration: A Survey of Ontology-Based Approaches,”

SIGMOD Rec., 33(4):65-70, Dec 2004.

95. Noy, N., F., Shah, N., H., Whetzel, P., L., Dai, B., Dorf, M., Griffith, N.,

Jonquet, C., Rubin, D., L., Storey, M., A., Chute, C., G. and Musen, M., A.,

“BioPortal: Ontologies and Integrated Data Resources at the Click of a

Mouse,” Nucl. Acids Res., 37(2):W170-W173, 2009.

96. OpenLink Virtuoso. http://virtuoso.openlinksw.com/ (Accessed on

03/01/2012).

97. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C. and Johnson, D.,

“Terrier Information Retrieval Platform,” In Proceedings of ECIR 2005, Vol.

3408:517-519, Lecture Notes in Computer Science, Springer, 2005.

98. Dean, M. and Schreiber, G. (Eds.). OWL Web Ontology Language Reference.

W3C Recommendation, 10 February 2004.

99. PACER. http://www.pacer.gov/ (Accessed on 03/01/2012).

100. Page, L., Brin, S., Motwani, R. and Winograd, T., “The PageRank Citation

Ranking: Bringing Order to the Web,” Technical Report Stanford InfoLab,

1999.

101. Parsia, B., Sirin, E., Grau, B., C., Ruckhaus, E. and Hewlett, D., “Cautiously

Approaching SWRL,” Technical Report, University of Maryland, 2005.

102. Pedersen, T., “A Simple Approach to Building Ensembles of Naive Bayesian

Classifiers for Word Sense Disambiguation,” In Proceedings of the 1st North

American Chapter of the Association for Computational Linguistics

Conference (NAACL 2000), Association for Computational Linguistics,

Stroudsburg, PA, USA, pp. 63-69, 2000.

103. Perez, A., G., Lopez, M., F. and Corcho, O., “Ontological Engineering: With

Examples from the Areas of Knowledge Management, E-Commerce and the

Semantic Web,” Advanced Information and Knowledge Processing, Springer-

Verlag, New York, Inc., Secaucus, NJ, USA, 2007.

104. Protégé Website. http://protege.stanford.edu/ (Accessed on 03/01/2012).

105. PubMed. http://www.ncbi.nlm.nih.gov/pubmed/ (Accessed on 03/01/2012).

106. Ray, S., “Interoperability Standards in the Semantic Web,” Journal of

Computing and Information Science in Engineering, ASME, 2:65-69, March

2002.

107. Resource Description Framework (RDF) Model and Syntax, W3C

Recommendation, 22 February 1999.

108. Robertson, S., E., Walker, S., Jones, S., Hancock-Beaulieu, M. and Gatford,

M., “Okapi at TREC-3,” In Proceedings of the Third Text REtrieval

Conference (TREC 1994), Gaithersburg, USA, November 1994.

109. Rocchio, J., J., “Relevance Feedback in Information Retrieval,” In Salton

(1971b), pp. 313-323, 1971.

110. Sabou, M., Lopez, V., Motta, E. and Uren, V., “Ontology Selection: Ontology

Evaluation on the Real Semantic Web,” In Proceedings of the 4th

International EON Workshop, Evaluation of Ontologies for the Web, colocated

with WWW2006, 2006.

111. Salton, G., Wong, A. and Yang, C., S., "A Vector Space Model for Automatic

Indexing," Communications of the ACM, 18 (11):613–620, 1975.

112. Scholl, H., J. and Klischewski, R., “E-Government Integration and

Interoperability: Framing the Research Agenda,” International Journal of Public

Administration, 30 (8-9), 2007.

113. Schox, J. Not So Obvious: A Guide to Patent Law and Strategy for Inventors

and Entrepreneurs. 2010.

114. Sheremetyeva, S., “Natural Language Analysis of Patent Claims,” In

Proceedings of the ACL Workshop on Patent Corpus Processing, Sapporo,

2003.

115. Sheth, A., P., “Changing Focus on Interoperability in Information Systems:

From System, Syntax, Structure to Semantics,” In Interoperating, Geographic

Information Systems, pp. 5-30, 1998.

116. Shinmori, A., Okumura, M., Marukawa, Y. and Iwayama, M., “Patent Claim

Processing for Readability: Structure Analysis and Term Explanation,” In

Proceedings of the ACL-2003 Workshop on Patent Corpus Processing,

Sapporo, Japan, Association for Computational Linguistics, Stroudsburg, pp.

56–65, 2003.

117. Sirin, E., Parsia, B., Grau, B., C., Kalyanpur, A. and Katz, Y., “Pellet: A

Practical OWL-DL Reasoner,” Web Semantics: Science, Services and Agents

on the World Wide Web, 5 (2):51-53, Software Engineering and the Semantic

Web, June 2007.

118. Prud'hommeaux, E. and Seaborne, A. SPARQL W3C Submission.

http://www.w3.org/TR/rdf-sparql-query/ (Accessed on 03/01/2012).

119. Spink, A., “A User-Centered Approach to Evaluating Human Interaction with

Web Search Engines: An Exploratory Study,” Information Processing and

Management, 38(3):401-426, May 2002.

120. Stave C., D. Field Guide to MEDLINE: Making Searching Simple. National

Library of Medicine (US), Philadelphia, PA: Lippincott Williams & Wilkins,

2003.

121. Strohman, T., Metzler, D., Turtle, H. and Croft, W., B., "Indri: A Language

Model-Based Search Engine for Complex Queries," Proceedings of

International Conference on New Methods in Intelligence Analysis, 2004.

122. Symptom Ontology.

http://symptomontologywiki.igs.umaryland.edu/wiki/index.php/Main_Page (Accessed on 03/01/2012).

123. Thomson Delphion. http://www.delphion.com/ (Accessed on 03/01/2012).

124. Thomson Web of Science.

http://thomsonreuters.com/products_services/science/science_products/a-z/web_of_science/ (Accessed on 03/01/2012).

125. Thornton, S., Wayland, R. and Payette, S., “The Fedora Project: An Open-

Source Digital Object Repository Management System,” D-Lib Magazine, 9

(4), April 2003.

126. Trappey, A., J., C., Trappey, C., V. and Wu, C., Y., “Automatic Patent

Document Summarization for Collaborative Knowledge Systems and

Services,” Journal of Systems Science and Systems Engineering, 18 (1):71-94,

2009.

127. Uschold, M. and Gruninger, M., “Ontologies: Principles, Methods, and

Applications,” Knowledge Engineering Review, 11 (2):93-155, 1996.

128. Uschold, M., “Creating, Integrating, and Maintaining Local and Global

Ontologies,” Proceedings of the First Workshop on Ontology Learning (OL-

2000) in conjunction with the 14th European Conference on Artificial

Intelligence (ECAI-2000), Berlin, Germany, 2000.

129. USPTO. http://www.uspto.gov/ (Accessed on 03/01/2012).

130. Verberne, S., D’hondt, E., Oostdijk, N. and Koster, C., H., “Quantifying the

Challenges in Parsing Patent Claims,” In Proceedings of the 1st International

Workshop on Advances in Patent Information Retrieval (AsPIRe 2010), Milton

Keynes, UK, pp. 14-21, 2010.

131. Wache, H., Vogele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann

H. and Hubner, S., “Ontology-Based Integration of Information - A Survey of

Existing Approaches,” In Proceedings of IJCAI-01 Workshop: Ontologies and

Information Sharing, Seattle, WA, pp. 108-117, 2001.

132. Wanner, L., Baeza-Yates, R., Brugmann, S., Codina, J., Diallo, B., Escorsa, E.,

Giereth, M., Kompatsiaris, Y., Papadopoulos, S., Pianta, E., Piella, G.,

Puhlmann, I., Rao, G., Rotard, M., Schoester, P., Serafini, L. and Zervaki, V.,

“Towards Content-Oriented Patent Document Processing,” World Patent

Information, 30(1):21-23, March 2008.

133. West, D., M. Digital Government Technology and Public Sector Performance.

Princeton University Press, Princeton, NJ, 2005.

134. WestLaw. http://www.westlaw.com (Accessed on 03/01/2012).

135. Wiemer-Hastings, P., “Latent Semantic Analysis,” In Encyclopedia of

Language and Linguistics, Elsevier, Oxford, UK, 2nd edition, pp. 706-709,

2004.

136. World Health Organization, “Manual of the International Statistical

Classification of Diseases, Injuries, and Causes of Death, 9th Revision,”

Geneva, Switzerland, 1977.

137. Xue, X. and Croft, W., B., “Automatic Query Generation for Patent Search,”

In Proceeding of the 18th ACM Conference on information and Knowledge

Management, Hong Kong, China, pp. 2037-2040, Nov 2009.

138. Yang, Y., “Noise Reduction in a Statistical Approach to Text Categorization,”

In Proceedings of the 18th Annual International ACM SIGIR Conference on

Research and Development in Information Retrieval (SIGIR '95), Edward A.

Fox, Peter Ingwersen, and Raya Fidel (Eds.), ACM, New York, NY, USA, pp.

256-263, 1995.

139. Zheng, W. and Blake, C., “Bootstrapping Location Relations from Text,”

American Society for Information Science and Technology Annual Meeting,

Pittsburgh, PA, 2010.