Information Retrieval, Information Extraction, and Text Mining Applications for Biology

Slides by Suleyman Cetintas & Luo Si

Outline

Introduction

Overview of Literature Data Sources

PubMed, HighWire Press, Google Scholar, Other Sources

Structure of Biomedical Language

Biological Terminology

Lexical and Semantic Sources for Biology

Biomedical Literature Processing Applications

Beyond BioCreative: Advanced Applications

Summary

References

Introduction

Life-science research

Large and heterogeneous biological data

in the form of protein and genomic sequence data, expression profiles,

protein structures

Yet a significant amount of information is in natural language

Most discoveries are communicated in natural language

via publications, patents, reports, and e-texts on the Web

controlled vocabulary terms used for other biological sources:

gene product annotations (e.g., Gene Ontology [GO] terms)

Database records (e.g., UniProt), containing comments, keywords,

descriptions etc.

Introduction

Structured database entries

enable efficient data retrieval, exchange, and analysis

recent tendency to enrich annotation records

general annotation databases such as UniProt (with 134K citations as of

2008) are of great practical value

Yet, only capable of covering a small fraction of biological

context information

can’t capture the richness of scientific information, argumentation in

the literature

Hard to keep up with the rapid accumulation of new publications

Text mining can help to link the database entries to the

evidence and argumentation in the literature

Introduction

Online literature collections

e.g., PubMed

70 million queries every month, >20 million publications (as of 2010)

crucial importance to experimental biologists, biomedical researchers,

database curators, etc.

Face double-exponential growth rates (due to new

journals & an increasing number of journal articles)

Different needs

Scientific community needs efficient and effective information

retrieval for targeted literature searches

Pharmaceutical industry uses text-mining systems for their

competitive intelligence

Government institutions use such tools to have a global view of

the current research state

Overview of Literature Data Sources

Several efforts to make medical and life-science journal

information electronically accessible to the public through

the worldwide web

These efforts can be grouped into 3 categories:

I) Centralized institutional (PubMed) or academic (HighWire

Press & Hollis) repositories of peer-reviewed articles or

abstracts

II) Article collection repositories by publishers (e.g.,

BioMed Central, EMBASE)

III) Access to indexed scholarly articles (e.g., Google Scholar,

Scirus) via web crawlers

PubMed

The most important resource for text mining applications

Includes citations (i.e., title, abstract, authors, and source

information) supplied by participating publishers

Maintained by the National Center for Biotechnology Information (NCBI)

at the National Library of Medicine (NLM)

Basic Search:

can be accessed online through Entrez, a text-based search and

retrieval system

Entrez improves basic keyword searches by translating the user

query to Medical Subject Headings (MeSH) terms

MeSH: controlled vocabulary terms covering the medical domain, chemicals,

genes, proteins, etc.
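
The query translation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two-entry lookup table `TOY_MESH` and the function `expand_query` are hypothetical, and the real mapping maintained by NLM is far larger and more elaborate. The `[MeSH Terms]` and `[All Fields]` suffixes are PubMed search field tags.

```python
# Toy synonym table: free-text phrase -> MeSH heading.
# Illustrative only; this is NOT the mapping PubMed actually uses.
TOY_MESH = {
    "heart attack": "Myocardial Infarction",
    "high blood pressure": "Hypertension",
}


def expand_query(query: str) -> str:
    """Return a boolean query that ORs a known MeSH heading with the raw text."""
    raw = query.strip()
    heading = TOY_MESH.get(raw.lower())
    if heading is None:
        # No heading known for this phrase: fall back to a plain search.
        return f"{raw}[All Fields]"
    return f'"{heading}"[MeSH Terms] OR "{raw}"[All Fields]'


print(expand_query("heart attack"))
```

The point of the expansion is that a citation indexed under the controlled term is still found when the user types a lay synonym.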

PubMed

Growth of PubMed citations between 1986-2010

PubMed

Technology development timeline for PubMed (in light green

color) and other biomedical literature search tools (in light

orange color)

PubMed

Programmatic Access:

PubMed also offers programmatic access to its content

through:

Entrez Programming Utilities (E-utilities)

Open Source Projects

BioPerl, Biopython, BioJava, etc. for biologist programmers

The NCBI provides the My NCBI service, to periodically

retrieve new publications in PubMed matching a predefined

user query

The requester receives a corresponding notification via an e-mail alert

system
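
As a rough sketch of what programmatic access can look like, the snippet below builds an E-utilities ESearch request URL and parses PubMed identifiers out of an ESearch-style XML reply. It makes no network call: `SAMPLE_RESPONSE` is a made-up reply with placeholder IDs, and the helper names `build_esearch_url` and `parse_pmids` are assumptions introduced for illustration. NCBI's usage guidelines (rate limits, API keys) should be checked before sending real requests.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that asks PubMed for IDs matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "xml"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def parse_pmids(esearch_xml: str) -> list[str]:
    """Extract the PubMed IDs from an ESearch XML reply."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]


# Made-up reply in the general shape of an ESearch result (placeholder IDs).
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>11111111</Id><Id>22222222</Id></IdList>
</eSearchResult>"""

url = build_esearch_url("VRK1 AND c-Jun", retmax=5)
print(url)
print(parse_pmids(SAMPLE_RESPONSE))
```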

PubMed

For a Local PubMed

it is possible to have a local relational database of all PubMed

citations

Obtain a licensed copy of the whole PubMed containing XML-

formatted citation records from NLM/NCBI

Mobile Access

Txt2MEDLINE: use SMS to access PubMed

PubMed Informer: Web-based PubMed monitoring tool,

facilitates PDA downloads and RSS feeds
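
For the local PubMed option above, a minimal sketch of loading XML-formatted citation records into a relational database might look like this. It assumes SQLite as the database and a three-column table; the table layout, the function name `load_citations`, and the sample record (including its PMID) are made up for illustration, and only a few fields of the citation XML are read.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up record in the general shape of PubMed citation XML.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345678</PMID>
      <Article>
        <ArticleTitle>An example title about VRK1</ArticleTitle>
        <Abstract><AbstractText>Example abstract text.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def load_citations(xml_text: str, conn: sqlite3.Connection) -> int:
    """Insert every citation found in `xml_text`; return the number loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation "
        "(pmid INTEGER PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    rows = []
    for art in ET.fromstring(xml_text).iter("PubmedArticle"):
        pmid = int(art.findtext("MedlineCitation/PMID"))
        title = art.findtext("MedlineCitation/Article/ArticleTitle", default="")
        abstract = art.findtext(
            "MedlineCitation/Article/Abstract/AbstractText", default=""
        )
        rows.append((pmid, title, abstract))
    conn.executemany("INSERT OR REPLACE INTO citation VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
n_loaded = load_citations(SAMPLE_XML, conn)
hits = conn.execute(
    "SELECT pmid FROM citation WHERE title LIKE ?", ("%VRK1%",)
).fetchall()
print(n_loaded, hits)
```

Once the records are local, arbitrary SQL (or a full-text index built on top) can be run without querying the public service.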

Google Scholar

alternative to PubMed

covers not only peer-reviewed articles, but also other scholarly texts

such as theses, books, and preprints

often returns larger retrieval sets (yet with a substantial number

of link-outs to PubMed records)

does not offer the advanced search functions that PubMed

offers

HighWire Press

alternative to PubMed

an initiative of Stanford University

represents another complementary resource to PubMed

Access to peer-reviewed articles, providing a search interface to

over 1160 journals and 4.8 million full-text articles (with over 1.9

million articles available free from HighWire partner publishers)

shares many search characteristics with PubMed (though the two

also differ)

HighWire additionally

has a graphical representation of each article’s citation map

lets the user specify where to conduct the search (title, abstract, etc.)

Other resources

PubMed Central

Free access to full-text articles (not

only to abstracts)

contains articles published before 1966

publishers have also developed

platforms of searchable article

repositories such as EMBASE and

BioMed Central to improve the access

to their articles

Structure of Biomedical Language

Homologous protein sequences often

share a common structural fold and tend to exhibit a

similar function

Similarly, in natural language, a particular meaning may be expressed

using different but largely synonymous expressions

Natural language processing (NLP) is used to ‘decode’

human language

exploiting the regularities and constraints that occur at

multiple levels in human language

The 4 levels: words, syntax, semantics, pragmatics

Structure of Biomed. Lang.: Words

Tokenization and morphology: identification of words in

biology text

in English, word boundaries are marked by whitespace and sentence

boundaries by ‘.’ (period or full stop)

but there are many complications (e.g., periods inside abbreviations, hyphens inside gene names)

the JULIE (Jena University Language and Information

Engineering) laboratory provides tools for token and sentence

boundary detection
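
The kind of complication mentioned above can be shown with a small sketch. The rules here are simplistic heuristics written for this example (they are not the JULIE lab tools): `naive_split` breaks after every period, `careful_split` refuses to break after a single capital-letter abbreviation such as "E.", and `tokenize` keeps hyphenated names such as "c-Jun" together.

```python
import re

TEXT = "E. coli expresses the protein. c-Jun is activated by VRK1."


def naive_split(text: str) -> list[str]:
    """Split after every period that is followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


def careful_split(text: str) -> list[str]:
    """Like naive_split, but never split right after a one-letter initial."""
    return [s for s in re.split(r"(?<!\b[A-Z]\.)(?<=\.)\s+", text) if s]


def tokenize(sentence: str) -> list[str]:
    """Keep hyphenated alphanumeric names whole; emit punctuation separately."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)


print(naive_split(TEXT))    # wrongly cuts after "E."
print(careful_split(TEXT))  # two sentences
print(tokenize("c-Jun is activated by VRK1."))
```

Note that a rule such as "a sentence starts with an uppercase letter" would fail here, because the second sentence starts with the lowercase gene name "c-Jun".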

Structure of Biomed. Lang.: Words

Tokenization and morphology: identification of words in

biology text

very important stage

gene mention identification (BioCreative II – gene mention

task)

some teams explored the integration of publicly available gene

mention taggers, e.g. the ABNER application or the LingPipe system

linking these mentions to specific entries in biological

resources (gene normalization)

stemming – convert words to their roots, reduce variability

general stemmer – the Porter stemmer

specific biomedical stemmers

Structure of Biomed. Lang.: Syntax

Syntax: syntax or grammar of a language controls how

words are grouped into meaningful phrases

words can be associated with parts of speech (POS) tags

POS taggers are based on machine learning algorithms (e.g.,

hidden Markov models) trained on manually annotated corpora

the biomedical POS distribution differs slightly from that of general English

special taggers for the biomedical domain: MedPost tagger, dTagger

POS tagging can be useful to

detect textual patterns expressing protein interaction

locate gene and protein mentions
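
To make the hidden Markov model idea concrete, here is a minimal Viterbi decoder over a two-tag toy model. All probabilities are invented for the example (a real tagger such as MedPost estimates them from an annotated corpus and uses a much larger tag set), and the function assumes a non-empty word list.

```python
STATES = ["NOUN", "VERB"]
# Hand-set toy parameters; NOT estimated from any corpus.
START = {"NOUN": 0.8, "VERB": 0.2}
TRANS = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
EMIT = {
    "NOUN": {"VRK1": 0.45, "c-Jun": 0.45, "activates": 0.10},
    "VERB": {"VRK1": 0.05, "c-Jun": 0.05, "activates": 0.90},
}
UNSEEN = 1e-6  # tiny probability for words missing from the toy model


def viterbi(words):
    """Return the most probable tag sequence for a non-empty word list."""
    # Each column maps state -> (best path probability, previous state).
    table = [{s: (START[s] * EMIT[s].get(words[0], UNSEEN), None) for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            prev, score = max(
                ((p, table[-1][p][0] * TRANS[p][s]) for p in STATES),
                key=lambda pair: pair[1],
            )
            column[s] = (score * EMIT[s].get(word, UNSEEN), prev)
        table.append(column)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: table[-1][s][0])
    path = [state]
    for column in reversed(table[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


tags = viterbi(["VRK1", "activates", "c-Jun"])
print(tags)
```

A NOUN VERB NOUN pattern around an interaction verb is one of the textual patterns that can hint at a protein interaction statement.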

Structure of Biomed. Lang.: Semantics &

Pragmatics

Semantics: capture the meaning

e.g., ‘c-Jun is activated by VRK1’ can be represented as an

operator ‘activate(VRK1,c-Jun)’

semantic representation abstracts away the syntax

Pragmatics: capture the larger context and its

contribution to meaning

text mining systems often rely on sentences as the basic processing

unit for extracting associations between biological entities

yet descriptions of those relations can go beyond sentence

boundaries and make use of referring expressions
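
The idea that a semantic representation abstracts away the syntax can be illustrated with a pattern-based sketch: an active and a passive sentence are mapped to the same operator form. The two regular expressions and the small verb table are assumptions made for this example and cover only these sentence shapes; real systems rely on parsing and much richer lexicons.

```python
import re

ENTITY = r"[A-Za-z0-9][A-Za-z0-9-]*"
# Tiny illustrative verb table mapping inflected forms to a base form.
VERB_LEMMAS = {
    "activates": "activate",
    "activated": "activate",
    "inhibits": "inhibit",
    "inhibited": "inhibit",
}
PASSIVE = re.compile(
    rf"(?P<patient>{ENTITY}) is (?P<verb>[a-z]+) by (?P<agent>{ENTITY})"
)
ACTIVE = re.compile(
    rf"(?P<agent>{ENTITY}) (?P<verb>[a-z]+) (?P<patient>{ENTITY})"
)


def to_predicate(sentence):
    """Map a simple active or passive sentence to 'verb(agent,patient)'."""
    for pattern in (PASSIVE, ACTIVE):  # passive first: it is more specific
        match = pattern.search(sentence)
        if match and match.group("verb") in VERB_LEMMAS:
            lemma = VERB_LEMMAS[match.group("verb")]
            return f"{lemma}({match.group('agent')},{match.group('patient')})"
    return None


print(to_predicate("c-Jun is activated by VRK1"))
print(to_predicate("VRK1 activates c-Jun"))
```

Both calls yield the same operator, which is the sense in which the representation is independent of the surface syntax. A sentence-level pattern like this cannot resolve referring expressions ("it", "this kinase") that point across sentence boundaries.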

Structure of Biomedical Language

Main NLP levels, from word tokenization to semantics

Biological Terminology

Biological literature characterized by

heavy use of domain-specific terminology

~12% of all terms in biochemistry publications are technical terms

a need for recognizing medical terms & their variations

automatically

2 main challenges

constant formation of new terms and new short forms

ambiguity or polysemy (multiple meanings of the same word)

Biological Terminology

ambiguity or polysemy

text mining tools must select the correct sense of the word,

using the surrounding context to disambiguate

gene names are a problem, as they are often shared across species

general English => 0.57% ambiguity

medical terms => 1.01% ambiguity

gene names => 14.20% ambiguity

biomedical & life science literature depends heavily on short

forms => further ambiguity

online tools for acronym-full name pairs:

ADAM, the Abbreviation Server, and AcroMine
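
A simple way to collect acronym and full-name pairs from text is to look for "long form (SHORT)" patterns and check that the initials line up. The sketch below is a deliberately simplified heuristic written for this example; it is not the method used by ADAM, the Abbreviation Server, or AcroMine, and it misses short forms whose letters come from inside words.

```python
import re


def find_abbreviations(text):
    """Return (short form, long form) pairs where word initials spell the short form."""
    pairs = []
    for match in re.finditer(r"\(([A-Z][A-Za-z0-9]{1,9})\)", text):
        short = match.group(1)
        words = text[: match.start()].split()
        candidate = words[-len(short):]  # one preceding word per letter
        initials = "".join(word[0] for word in candidate)
        if initials.upper() == short.upper():
            pairs.append((short, " ".join(candidate)))
    return pairs


text = "We used natural language processing (NLP) and the Gene Ontology (GO)."
pairs = find_abbreviations(text)
print(pairs)
```

Collecting such pairs does not remove the ambiguity by itself: when one short form has several long forms, the surrounding context still has to decide which one is meant.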

Lexical and Semantic Sources for Biology

domain-specific technical terms

used for expressing functional descriptions of bio-entities,

relevant biological processes, experimental techniques

terminological repositories & dictionaries

important resources to interpret scientific articles

many have been developed

ontologies

developed for various subfields of biology

Gene Ontology (GO)

widely used as controlled vocabulary to describe biologically relevant

aspects of gene products

Lexical and Semantic Sources for Biology

Ontologies

Gene Ontology (GO)

Although primarily designed for annotation purposes, it can also be used

as a lexical resource for indexing via the GoPubMed application

GOAnnotator

allows extraction of text-based GO annotations for a given protein

identifier (Swiss-Prot accession number)

GO Annotation Task in BioCreative I

showed that automatic detection of GO terms is more effective for

short terms

Lexical and Semantic Sources for Biology

Word Level

SwissProt

biological annotation database

BioThesaurus

widely used resource combining gene and protein names from

multiple sources

TerMine

developed at the National Center for Text Mining (NaCTeM)

integrates automatic term recognition approach using linguistic and

statistical analysis of candidate terms

Biomedical Literature Processing Applications

Provide access to information in scientific articles at

various levels of granularity

Building blocks for biomedical text processing can be

grouped with respect to the BioCreative tasks:

Document retrieval: core of the ‘interaction article’ subtask, to

select articles about protein-protein interactions

Entity mention: identification of mentions of biological entities

Entity normalization: linking biological entities (e.g., genes,

proteins, etc.) to biological resources (e.g., SwissProt, Entrez

Gene, etc.)
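
Entity normalization can be sketched as a dictionary lookup that is tolerant to case and punctuation variants. The lexicon below is hypothetical: the identifiers `GENE:0001` and `GENE:0002` are placeholders rather than real database accessions, and the synonym lists are tiny. Real systems also need context to resolve names shared by several genes or species.

```python
import re

# Hypothetical lexicon: placeholder identifiers, not real accessions.
LEXICON = {
    "GENE:0001": ["c-Jun", "JUN", "Jun proto-oncogene"],
    "GENE:0002": ["VRK1", "vaccinia related kinase 1"],
}


def normalize_key(name: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Index every synonym under its normalized spelling.
INDEX = {
    normalize_key(synonym): gene_id
    for gene_id, synonyms in LEXICON.items()
    for synonym in synonyms
}


def normalize_mention(mention: str):
    """Return the identifier for a mention, or None if it is not in the lexicon."""
    return INDEX.get(normalize_key(mention))


print(normalize_mention("C-JUN"))
print(normalize_mention("Vaccinia-related kinase 1"))
print(normalize_mention("unknownGene"))
```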

BioMed. Lit. Proc. Apps: Document Retrieval

Requires the ability to process and index massive volumes

of data (e.g., the entire MEDLINE collection)

robust, efficient wrt space and time

Look for keywords that characterize a collection of

papers, based on keyword frequency

basis of neighbor searches in MEDLINE (the predecessor of

eTBlast)

still the most heavily used system

Statistical analysis of word occurrences

many current literature mining systems rely on it

calculated over the whole PubMed database, resulting in

weighted associations between biological entities
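
Frequency-based retrieval of this kind can be illustrated with TF-IDF weighting and cosine similarity. TF-IDF is a standard technique swapped in here for illustration; it is not claimed to be the exact weighting used by the MEDLINE neighbor search or by eTBlast. The three documents are made-up one-line stand-ins for abstracts.

```python
import math
from collections import Counter

DOCS = {
    "doc1": "vrk1 activates c-jun in human cells",
    "doc2": "c-jun is a transcription factor",
    "doc3": "microarray expression profiles of yeast genes",
}


def rank(query, docs):
    """Rank documents by TF-IDF cosine similarity to the query."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    idf = {t: math.log(n_docs / d) + 1.0 for t, d in df.items()}

    def vectorize(tokens):
        tf = Counter(t for t in tokens if t in idf)  # ignore unseen terms
        return {t: count * idf[t] for t, count in tf.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
            sum(w * w for w in b.values())
        )
        return dot / norm if norm else 0.0

    q_vec = vectorize(query.lower().split())
    scores = {name: cosine(q_vec, vectorize(toks)) for name, toks in tokenized.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank("c-jun activation by vrk1", DOCS)
print([name for name, _ in ranking])
```

Note that "activation" in the query does not match "activates" in doc1; this is the gap that stemming is meant to close.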

BioMed. Lit. Proc. Apps: Document Retrieval

Statistical analysis of word occurrences

underlying assumption is that if two biological entities

frequently co-occur, they should have some biological

relationship

can provide high recall

challenge in human interpretation

lacks semantic information on the type of biological association

CoPub Mapper system

provides online access to ranked co-occurrence associations extracted

from PubMed (between genes and biological terms)

PubGene system

Generates a graphical protein interaction network based on protein-

protein literature co-occurrences
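
The co-occurrence assumption can be sketched by counting how often pairs of known entity names appear in the same abstract. The abstracts and the entity list below are made up for the example, matching is by exact whitespace-separated token, and raw counts stand in for the statistical weighting that systems such as CoPub Mapper or PubGene apply.

```python
from collections import Counter
from itertools import combinations

# Made-up one-sentence "abstracts" used only to exercise the counting.
ABSTRACTS = [
    "VRK1 phosphorylates c-Jun in response to stress",
    "c-Jun and VRK1 were co-expressed in tumour cells",
    "TP53 is stabilised by VRK1",
    "BRCA1 mutations were screened in the cohort",
]
ENTITIES = {"VRK1", "c-Jun", "TP53", "BRCA1"}


def cooccurrence_counts(abstracts, entities):
    """Count, per entity pair, the number of abstracts mentioning both."""
    counts = Counter()
    for text in abstracts:
        present = sorted(entities.intersection(text.split()))
        counts.update(combinations(present, 2))
    return counts


counts = cooccurrence_counts(ABSTRACTS, ENTITIES)
print(counts.most_common())
```

The counts link VRK1 with c-Jun and with TP53 but say nothing about whether the relation is phosphorylation, co-expression, or stabilisation, which is the missing semantic information noted above.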

BioMed. Lit. Proc. Apps: Document Retrieval

Stemming

converts words into standardized forms (stems)

essential component of IR systems and search engines

one common shortcoming

two semantically different words can be collapsed to a common stem

used by systems such as eTBlast to quantify the similarity between

documents

CoPub System

detects over-represented terms from multiple abstract

collections

eTBlast

ranks retrieved PubMed records given an input article
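
Both the benefit and the shortcoming of stemming can be seen with a toy suffix stripper. This is a made-up rule list for illustration, much cruder than the Porter stemmer or any biomedical stemmer, and the overlap score is a plain Jaccard measure rather than the similarity function of eTBlast.

```python
# Longest suffixes first, so "ation" is tried before "s".
SUFFIXES = ("ations", "ation", "ating", "ated", "ates", "ate", "ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_overlap(text_a: str, text_b: str) -> float:
    """Jaccard overlap between the sets of stems of two texts."""
    stems_a = {toy_stem(w) for w in text_a.split()}
    stems_b = {toy_stem(w) for w in text_b.split()}
    return len(stems_a & stems_b) / len(stems_a | stems_b)


# Benefit: inflected variants share one stem.
print(toy_stem("activates"), toy_stem("activated"), toy_stem("activation"))
# Shortcoming: unrelated words can collapse to the same stem.
print(toy_stem("probes"), toy_stem("probation"))
print(stem_overlap("VRK1 activates c-Jun", "activation of c-Jun by VRK1"))
```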

BioMed. Lit. Proc. Apps: Document Retrieval

Clustering Algorithms

commonly used to group genes according to their expression profiles in

microarray experiments

clustering based on document similarity calculation has been used

by PubClust and McSyBi

[Table: systems for clustering and similarity ranking]

Biomedical Literature Processing Applications:

Gene Mention & Gene Normalization

Biologists search the annotation databases using

gene/protein names or symbols as queries

these names have been manually extracted from the literature

too time-consuming

unable to cover all synonyms or naming variants used by the

biologists

Automatic detection of protein & gene mentions

improves the coverage of annotation databases

enables semantically refined literature search

constitutes a crucial initial step for other text mining systems

focus of the BioCreative gene mention task

performance of 90% F-measure

training data of 15K sentences & 5K test sentences
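
For reference, the F-measure reported for such tasks is the harmonic mean of precision and recall. The sketch below computes it for exact-match mention spans; the character offsets are invented, and the official BioCreative scoring rules (for example, accepted alternative boundaries) are not reproduced here.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over sets of mention spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented (start, end) character offsets of gene mentions in one sentence.
gold_spans = {(0, 4), (18, 23), (40, 44)}
predicted_spans = {(0, 4), (18, 23), (30, 35)}

precision, recall, f1 = precision_recall_f1(predicted_spans, gold_spans)
print(precision, recall, f1)
```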

Biomedical Literature Processing Applications:

Gene Mention & Gene Normalization

Most current bio-entity recognition systems

e.g., GAPSCORE, ABGENE

Can label text for protein or gene mentions

other systems such as ABNER

also identify cell lines or cell types

Chemical compound mentions

Another set of biological entities of interest

Oscar, an open source system for chemical entity recognition

integrates a dictionary of compound names

and also uses regular expressions, heuristics, and certain word

combinations to find chemical names in text

Biomedical Literature Processing Applications:

Gene Mention & Gene Normalization

Mentions of species and taxonomic names

important for the emerging field of biodiversity

crucial step to link gene mentions to corresponding organism

source

Detecting bio-entity mentions alone is often not enough

to retrieve informative sentences

BioIE system detects (for a given query keyword) only

sentences related to protein families, functions, etc.

Other applications such as iHOP

given a gene or protein, maps it to its corresponding db identifier, and

retrieves related sentences with definition info, etc.

Biomedical Literature Processing Applications:

Gene Mention & Gene Normalization

Detecting bio-entity mentions alone is often not enough

to retrieve informative sentences

EBIMed and FACTA systems

for a given query protein, present a summary table of co-occurring

concepts based on PubMed abstracts

FABLE

retrieves co-occurring gene and protein mentions for a query keyword

results can be downloaded in XML or Excel format

For searching functional information for gene products

search with protein sequences is possible through the METIS and

MedBlast systems

query sequence is linked to corresponding db record, and the

associated literature is retrieved afterwards

Beyond BioCreative: Advanced Applications

iHOP and InfoPubMed

allow retrieval of protein interaction sentences from PubMed

Chilibot

to find supporting relationship evidence between two

predefined entities of interest (genes, proteins, keywords)

MutationFinder

to extract amino-acid mutation mentions from large text

collections

MarkerInfoFinder

to detect information related to sequence variants of human

genes
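
Extraction of amino-acid mutation mentions is often pattern-based. The sketch below recognises only the compact wild-type/position/mutant notation in one-letter ("A123T") or three-letter ("Gly12Val") form. The pattern is an assumption written for this example and is not the rule set of MutationFinder; it will also produce false positives on look-alike strings such as the histone name "H2A".

```python
import re

AA3 = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)
AA1 = "ACDEFGHIKLMNPQRSTVWY"
POINT_MUTATION = re.compile(
    rf"\b(?:(?P<w3>{AA3})(?P<p3>[1-9]\d*)(?P<m3>{AA3})"
    rf"|(?P<w1>[{AA1}])(?P<p1>[1-9]\d*)(?P<m1>[{AA1}]))\b"
)


def find_point_mutations(text):
    """Return (wild-type residue, position, mutant residue) triples."""
    hits = []
    for match in POINT_MUTATION.finditer(text):
        if match.group("w3"):
            hits.append((match.group("w3"), int(match.group("p3")), match.group("m3")))
        else:
            hits.append((match.group("w1"), int(match.group("p1")), match.group("m1")))
    return hits


mutations = find_point_mutations(
    "The A123T and Gly12Val substitutions abolished binding."
)
print(mutations)
```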

Beyond BioCreative: Advanced Applications

PepBank Database (of peptide sequences)

a text-mining system was used to automatically detect and

extract peptide sequences from abstracts and full-text papers

Phospho.ELM Database

integrated a text-mining system to detect S/T/Y

phosphorylation sites from the literature

MeInfoText & PubMeth

use text-mining to provide detailed information on gene

methylation and association with cancer

Epiloc System

a text-based subcellular location prediction tool

(complementing alternative sequence-based localization algorithms)

Summary: Biological Text Mining Applications

from the Biology User Perspective

Protein relations

Function annotation & localization relations

Gene group & list analysis

Summary: Biological Text Mining Applications

from the Biology User Perspective

Acronym and term extraction

Gene-disease association

Summary: Biological Text Mining Applications

from the Biology User Perspective

Gene-disease association

Bio-entity tagging

Text retrieval, classification, clustering, similarity ranking

Summary: Biological Text Mining Applications

from the Biology User Perspective

Protein sequence

Gene group & list analysis

References

Main References:

M. Krallinger, A. Valencia, L. Hirschman. Linking genes to

literature: text mining, information extraction, and retrieval

applications for biology. Genome Biol. 2008; 9:S8.

Z. Lu. PubMed and beyond: a survey of web tools for searching

biomedical literature. Database. 2011.

For original images & references to the mentioned tools, please

either conduct an online search with their names or refer to

the original articles above

Questions?

Please let us know in case of any

questions/issues!

Further info: {scetinta, lsi}@cs.purdue.edu
