Implementing Dictionary-Based NER Solutions for Mining Biomedical Literature Karen Dowell, Monica McAndrews-Hill, David Hill, Harold Drabkin, Judith Blake

ProMiner at MGI: Implementing Dictionary-Based NER Solutions for Mining Biomedical Literature

Karen Dowell, Monica McAndrews-Hill, David Hill, Harold Drabkin, Judith Blake

7th Fraunhofer Symposium on Text Mining
October 6, 2009

From Algorithms to Applications

ProMiner at Mouse Genome Informatics (MGI)

Background on MGI and our biocuration process
Applying Named Entity Recognition (NER) applications to improve MGI curator efficiency and minimize bottlenecks
Our implementation and results to date using ProMiner to annotate full-text scientific journal articles in HTML and PDF format

MGI Model Organism Database

A comprehensive, integrated public information resource for mouse genetics, genomics, and biology
Facilitates use of the laboratory mouse as a model for human biology
Provides extensively curated mouse data

www.informatics.jax.org

The MGI website presents information on mouse biology in a publicly accessible, content-rich, continually updated online database.

Mouse Genome Informatics

MGI content spans from DNA sequence to disease phenotype

The Mouse Information Resource

MGI integrates information on mouse genes and experimental data through a combination of manual curation, computational curation, and collaboration with other online resources.

MGI Biocuration Workflow

Workflow stages: Primary Triage, Secondary Triage, Master Bibliography, Indexing, Expert Curation

For literature curation we
Review more than 160 scientific journals each month
Screen more than 12,000 articles a year

[Figure: Literature Acquisition at MGI. Number of publications added (y-axis, 0 to 140,000) by year (x-axis, 1996 to 2008).]


Curators pick papers based on:
Expression, Mapping, Homology, New Genes, Gene Ontology (GO), Alleles & Phenotypes, Sequences, Inbred Strain, Tumor, Nomenclature, General Interest

Screen for references to mouse, mice, murine


Selected articles are assigned reference numbers and entered into a master bibliography

In 2009… 10,097 articles added, ~1,122 per month (as of September 29, 2009)


Indexing is our internal process of associating article reference numbers with at least one entity within the MGI database. For gene indexing, that entity is a gene.


Curators read each paper and enter information into the MGI database using controlled vocabularies

Articles annotated based on:
Expression, Mapping, Homology, New Genes, Sequences, Inbred Strains, Tumors, Alleles & Phenotypes

Literature Acquisition at MGI

Papers Added           2006-2007       2007-2008       2008-2009
Master Bibliography    12,979          13,231          14,190
Phenotype Papers       9,681 (75%)     10,322 (78%)    10,689 (75%)
GO Papers              8,364 (64%)     7,716 (58%)     9,913 (70%)
Selected for Both      5,974 (46%)     6,688 (51%)     7,231 (51%)
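The percentages in the table appear to be each category's share of that year's master bibliography total. The short Python sketch below recomputes them from the counts on the slide as a consistency check. The variable and function names are illustrative and do not come from MGI.

```python
# Counts copied from the "Papers Added" table; names are illustrative only.
MASTER_BIBLIOGRAPHY = {"2006-2007": 12979, "2007-2008": 13231, "2008-2009": 14190}

SELECTED = {
    "Phenotype Papers": {"2006-2007": 9681, "2007-2008": 10322, "2008-2009": 10689},
    "GO Papers": {"2006-2007": 8364, "2007-2008": 7716, "2008-2009": 9913},
    "Selected for Both": {"2006-2007": 5974, "2007-2008": 6688, "2008-2009": 7231},
}


def share_of_master(category: str, period: str) -> int:
    """Return the category count as a whole-number percentage of the master total."""
    return round(100 * SELECTED[category][period] / MASTER_BIBLIOGRAPHY[period])


# Percentage shares per category, in period order.
shares = {
    category: [share_of_master(category, period) for period in MASTER_BIBLIOGRAPHY]
    for category in SELECTED
}

for category, values in shares.items():
    print(category, values)
```

Under that reading, the recomputed shares agree with the percentages printed on the slide.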

Text Mining and MGI Biocuration

Many areas could benefit from text mining (as tools, not replacements for human curators)
Selected gene indexing as a prototype project to minimize a bottleneck within our curation workflow

Articles added to pipeline each month: 1,100
70% are selected for GO: 770
Articles gene indexed each month: 200
More than 2,000 articles in gene indexing pipeline
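The bottleneck can be made concrete with a little arithmetic. The Python sketch below uses only the monthly figures from the slide. It rests on one assumption that the slide does not state: that every article selected for GO also needs gene indexing. The backlog of 2,000 is a lower bound, since the slide says "more than 2000".

```python
# Monthly figures from the slide; constant names are illustrative.
ARTICLES_ADDED_PER_MONTH = 1100
GO_SELECTION_PERCENT = 70
GENE_INDEXED_PER_MONTH = 200
PIPELINE_BACKLOG = 2000  # lower bound: the slide says "more than 2000"

# Integer arithmetic avoids floating-point noise in 1100 * 0.7.
go_selected_per_month = ARTICLES_ADDED_PER_MONTH * GO_SELECTION_PERCENT // 100

# ASSUMPTION (not stated on the slide): all GO-selected articles need gene indexing.
net_backlog_growth_per_month = go_selected_per_month - GENE_INDEXED_PER_MONTH

# Time to clear the existing backlog if no new articles arrived at all.
months_to_clear_frozen_backlog = PIPELINE_BACKLOG / GENE_INDEXED_PER_MONTH

print(go_selected_per_month, net_backlog_growth_per_month, months_to_clear_frozen_backlog)
```

Under that assumption the pipeline grows by several hundred articles a month, which is why gene indexing was a natural target for a prototype.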

Our Ideal System

A dictionary-based named entity recognition (NER) system that
Complements our existing biocuration processes and workflow
Processes full-text PDF files in batch
Uses MGI or comparable dictionaries of mouse symbols, synonyms, and human orthologs
Produces meaningful reports that aid curators
Provides visualization tools
Achieves high F-scores in published evaluations
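To illustrate what dictionary-based NER means in practice, here is a minimal Python sketch that matches gene symbols and synonyms in text and maps each hit back to a canonical symbol. It is not ProMiner's algorithm and it does not use the MGI dictionary. The tiny dictionary, the synonyms, the sample sentence, and all function names are assumptions made for illustration. Matching is case-insensitive for simplicity, which a real mouse/human system would likely need to refine.

```python
import re

# Hypothetical mini-dictionary: canonical symbol -> names and synonyms.
# The synonyms are illustrative and are not taken from the MGI dictionary.
GENE_DICTIONARY = {
    "Tlr4": ["Tlr4", "toll-like receptor 4"],
    "Ly96": ["Ly96", "MD-2", "lymphocyte antigen 96"],
}


def build_matcher(dictionary):
    """Compile one regex over all names and a lowercase name -> symbol lookup."""
    lookup = {}
    for symbol, names in dictionary.items():
        for name in names:
            lookup[name.lower()] = symbol
    # Longest names first so multi-word names win over shorter overlapping ones.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])("
        + "|".join(re.escape(name) for name in alternatives)
        + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return pattern, lookup


def annotate(text, dictionary=GENE_DICTIONARY):
    """Return (offset, matched text, canonical symbol) for every dictionary hit."""
    pattern, lookup = build_matcher(dictionary)
    return [
        (match.start(), match.group(0), lookup[match.group(0).lower()])
        for match in pattern.finditer(text)
    ]


def index_genes(text, dictionary=GENE_DICTIONARY):
    """Return the sorted set of canonical symbols mentioned in the text."""
    return sorted({symbol for _, _, symbol in annotate(text, dictionary)})


# Made-up sample sentence, not quoted from any article.
sample = "MD-2 associates with toll-like receptor 4 (TLR4) on the cell surface."
print(annotate(sample))
print(index_genes(sample))
```

A report for curators could be built from the `annotate` output (where each mention occurs), while `index_genes` gives the candidate gene list for the article.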

ProMiner at MGI

Of all the dictionary-based NER tools we evaluated, ProMiner most closely fit our needs
Rule-based protein and entity recognition using pre-processed dictionaries (Entrez Gene, SwissProt, ATCC, and ECACC)
Batch processing of PDF files (beta release)
Standard and custom reports
Customizable annotation projects and dictionaries/term lists

Initiated collaborative pilot project between SCAI and MGI

ProMiner Technical Specifications

System requirements
Runs on Linux systems, Sun-Ultra, and other UNIX-based systems
Requires minimum 1 GB RAM, 500 MB disk space
Java (v1.5 or higher) and Perl (v5.8 or higher)
Uses GeneDB to retrieve data (requires 1 GB to store index files)
Includes an HTML-based (CGI) viewer

One processor can update ~1000 articles per project
On a cluster of 16 processors, ProMiner can search the entire MEDLINE literature base with 1 dictionary in ~2 hours

ProMiner Implementation at MGI

MGI Operating Environment
Dedicated Sun Fire X4100 Server with two dual-core AMD Opteron processors, 2.8 GHz, 64-bit
Solaris 10 V. 508 operating system, Java5 built-in
Adobe Acrobat Pro Version 9.1

SCAI delivered…
Installation scripts, ProMiner scripts and dictionaries
Documentation and demos
MGI project definition files for annotation using human and mouse dictionaries

Testing, Testing, Testing

HTML Version 6.4 implemented in March

PDF Version 7.1 delivered in August

Reports to Scan for Gene References

This paper was indexed to mouse genes Tlr4 and Ly96

Annotation Dictionary Layers


Preliminary Results

1 part-time curator working 5.5 hours a day, processing batches of 10 articles at a time
8 of 10 PDFs processed correctly, without errors
Some PDF format (PDF/A) and color labeling errors
We provide feedback to SCAI to enhance dictionaries and PDF formatting

                       Manual Indexing     Indexing with ProMiner
Time per article       30 minutes          18-24 minutes
Articles per week      50                  60-70

F-Score performance measurements in progress
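The slides do not say how the F-score will be measured. As a hedged illustration, the Python sketch below uses the standard definition (harmonic mean of precision and recall) over sets of gene symbols for a single article. The gold set reuses the Tlr4 and Ly96 example from the report slide. The predicted set, including the extra symbol Cd14, is invented toy data and not a ProMiner result.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 for two collections of gene symbols."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Genes a curator indexed for the article (example from the report slide).
gold_genes = {"Tlr4", "Ly96"}
# TOY DATA: hypothetical tool output with one invented false positive.
predicted_genes = {"Tlr4", "Ly96", "Cd14"}

precision, recall, f1 = precision_recall_f1(predicted_genes, gold_genes)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Averaging these per-article scores, or pooling counts across articles, are both common choices; which one applies here is not stated in the slides.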

ProMiner 7.1 annotates 75 full-text articles in PDF format in less than 20 minutes on our server

Processing time = 0.2333 × (No. Articles) + 0.5751
R² = 0.9905
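The linear fit can be evaluated directly. The Python sketch below plugs in the 75-article batch mentioned on the slide. It assumes the fitted processing time is in minutes, which the slide does not state explicitly but which matches the "less than 20 minutes" claim. The function name is illustrative.

```python
# Coefficients of the linear fit reported on the slide.
SLOPE_PER_ARTICLE = 0.2333
INTERCEPT = 0.5751


def estimated_processing_time(num_articles: int) -> float:
    """Estimate batch processing time (ASSUMED to be minutes) from the fit."""
    return SLOPE_PER_ARTICLE * num_articles + INTERCEPT


batch_of_75 = estimated_processing_time(75)
print(f"{batch_of_75:.2f}")  # about 18.07, consistent with "less than 20 minutes"
```

Because the fit is based on the batch sizes tested on the MGI server, estimates far outside that range should be treated with caution.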

Next Steps

Complete performance testing and evaluate status of pilot project with SCAI
Consider extending pilot to continue testing ProMiner 7.1
Explore future collaborations: Gene Ontology terms, protein-protein interactions, other curation functions

Acknowledgments

MGI: Judith Blake, Nancy Butler, Harold Drabkin, Alex Diehl, David Hill, Monica McAndrews-Hill, Sue McClatchy, David Shaw, Dmitry Sitnikov
MGI System Administration: Matt Baya, Mike McCrossin, Iry Witham
Fraunhofer SCAI: Juliane Fluck, Heinz-Theodor Mevissen, Symposium Organizers
MITRE Corporation: Lynette Hirschman

Journal of Immunology
