ProMiner at MGI
Implementing Dictionary-Based NER Solutions for Mining Biomedical Literature
Karen Dowell, Monica McAndrews-Hill, David Hill, Harold Drabkin, Judith Blake
7th Fraunhofer Symposium on Text Mining
October 6, 2009
From Algorithms to Applications
ProMiner at Mouse Genome Informatics (MGI)
  Background on MGI and our biocuration process
  Applying Named Entity Recognition (NER) applications to improve MGI curator efficiency and minimize bottlenecks
  Our implementation and results to date using ProMiner to annotate full-text scientific journal articles in HTML and PDF format
MGI Model Organism Database
  A comprehensive, integrated public information resource for mouse genetics, genomics and biology
  Facilitates use of the laboratory mouse as a model for human biology
  Provides extensively curated mouse data
www.informatics.jax.org
The MGI website presents information on mouse biology in a publicly accessible, content-rich, continually updated online database
Mouse Genome Informatics
MGI content spans from DNA sequence to disease phenotype
The Mouse Information Resource
MGI integrates information on mouse genes and experimental data through a combination of manual curation, computational
curation, and collaboration with other online resources.
MGI Biocuration Workflow
Primary Triage
Secondary Triage
Master Bibliography
Indexing
Expert Curation
For literature curation we
  Review more than 160 scientific journals each month
  Screen more than 12,000 articles a year
[Figure: Literature Acquisition at MGI. X-axis: Year (1996 to 2008). Y-axis: Number of Publications Added (scale 0 to 140,000).]
MGI Biocuration Workflow
Curators pick papers based on
  Expression
  Mapping
  Homology
  New Genes
  Gene Ontology (GO)
  Alleles & Phenotypes
  Sequences
  Inbred Strain
  Tumor
  Nomenclature
  General Interest
Screen for references to mouse, mice, murine
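The keyword screen can be illustrated with a short sketch. This is not the tooling MGI uses (the slides describe curators doing the triage); it only assumes a case-insensitive, whole-word match on the three terms named above. The function names are hypothetical.

```python
import re

# Assumption: whole-word, case-insensitive matching on the three screening
# terms from the slide. Real triage is performed by curators.
MOUSE_TERMS = re.compile(r"\b(mouse|mice|murine)\b", re.IGNORECASE)


def mentions_mouse(text: str) -> bool:
    """Return True if the text contains any of the screening terms."""
    return MOUSE_TERMS.search(text) is not None


def count_mouse_terms(text: str) -> dict:
    """Count how often each screening term occurs (case-insensitive)."""
    counts = {"mouse": 0, "mice": 0, "murine": 0}
    for match in MOUSE_TERMS.finditer(text):
        counts[match.group(1).lower()] += 1
    return counts
```

A call such as `mentions_mouse(abstract_text)` would flag an article for a curator to look at; the word boundaries keep words like "mousetrap" from matching.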
MGI Biocuration Workflow
Selected articles are assigned reference numbers and entered into a master bibliography
In 2009… 10,097 articles added, ~1,122 per month (as of September 29, 2009)
MGI Biocuration Workflow
Indexing is our internal process of associating each article reference number with at least one entity within the MGI database. For gene indexing, that entity is a gene.
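As a rough illustration of what gene indexing records, the sketch below keeps a two-way mapping between reference numbers and gene symbols. It is not MGI's database schema. The class name, method names, and the reference number "REF-0001" are hypothetical; the gene symbols Tlr4 and Ly96 are the ones mentioned elsewhere in this deck.

```python
from collections import defaultdict


class GeneIndex:
    """Toy two-way index between article reference numbers and gene symbols."""

    def __init__(self):
        self._genes_by_ref = defaultdict(set)
        self._refs_by_gene = defaultdict(set)

    def index(self, ref_id: str, gene_symbol: str) -> None:
        """Associate one reference number with one gene."""
        self._genes_by_ref[ref_id].add(gene_symbol)
        self._refs_by_gene[gene_symbol].add(ref_id)

    def genes_for(self, ref_id: str) -> list:
        """Genes a reference has been indexed to, sorted alphabetically."""
        return sorted(self._genes_by_ref.get(ref_id, ()))

    def refs_for(self, gene_symbol: str) -> list:
        """References indexed to a gene, sorted alphabetically."""
        return sorted(self._refs_by_gene.get(gene_symbol, ()))

    def is_indexed(self, ref_id: str) -> bool:
        """A reference counts as indexed once it has at least one gene."""
        return len(self._genes_by_ref.get(ref_id, ())) >= 1


# Usage with a hypothetical reference number.
idx = GeneIndex()
idx.index("REF-0001", "Tlr4")
idx.index("REF-0001", "Ly96")
```

Here `idx.genes_for("REF-0001")` lists both genes, and `idx.is_indexed("REF-0002")` is false because that reference has no gene association yet.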
MGI Biocuration Workflow
Curators read each paper and enter information into the MGI database using controlled vocabularies
Articles annotated based on
  Expression
  Mapping
  Homology
  New Genes
  Sequences
  Inbred Strains
  Tumors
  Alleles & Phenotypes
Literature Acquisition at MGI
Papers Added          2006-2007      2007-2008      2008-2009
Master Bibliography   12,979         13,231         14,190
Phenotype Papers      9,681 (75%)    10,322 (78%)   10,689 (75%)
GO Papers             8,364 (64%)    7,716 (58%)    9,913 (70%)
Selected for Both     5,974 (46%)    6,688 (51%)    7,231 (51%)
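The percentages in the table appear to be each subset's share of the master bibliography for the same period, rounded to the nearest whole percent. That reading is an assumption, but the short check below reproduces every percentage from the counts in the table. The variable and function names are illustrative.

```python
# Counts copied from the table; the periods are the column headers.
MASTER = {"2006-2007": 12979, "2007-2008": 13231, "2008-2009": 14190}
SUBSETS = {
    "Phenotype Papers": {"2006-2007": 9681, "2007-2008": 10322, "2008-2009": 10689},
    "GO Papers": {"2006-2007": 8364, "2007-2008": 7716, "2008-2009": 9913},
    "Selected for Both": {"2006-2007": 5974, "2007-2008": 6688, "2008-2009": 7231},
}


def share_percent(count: int, total: int) -> int:
    """Share of the master bibliography, rounded to a whole percent."""
    return round(100 * count / total)


def shares() -> dict:
    """Recompute the percentage for every row and period."""
    return {
        row: {period: share_percent(n, MASTER[period]) for period, n in counts.items()}
        for row, counts in SUBSETS.items()
    }
```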
Text Mining and MGI Biocuration
  Many areas could benefit from text mining (as tools, not replacements for human curators)
  Selected gene indexing as a prototype project to minimize a bottleneck within our curation workflow
Articles added to pipeline each month: 1,100
70% are selected for GO: 770
Articles gene indexed each month: 200
More than 2,000 articles in gene indexing pipeline
Our Ideal System
A dictionary-based named entity recognition (NER) system that
  Complements our existing biocuration processes and workflow
  Processes full-text PDF files in batch
  Uses MGI or comparable dictionaries of mouse symbols, synonyms, and human orthologs
  Produces meaningful reports that aid curators
  Provides visualization tools
  Achieves high F-scores in published evaluations
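To make "dictionary-based NER" concrete, here is a minimal sketch of the general technique: look up every known symbol, synonym, or name in the text and map each hit back to its canonical gene symbol. It is not ProMiner's algorithm, and the three-term toy dictionary is illustrative only, not an MGI dictionary. It also ignores case, which a real system would likely not do, since case often separates mouse symbols from human ones.

```python
import re
from dataclasses import dataclass

# Toy dictionary (illustrative): canonical symbol -> symbol, synonyms, names.
TOY_DICTIONARY = {
    "Tlr4": ["Tlr4", "toll-like receptor 4"],
    "Ly96": ["Ly96", "MD-2", "lymphocyte antigen 96"],
}


@dataclass(frozen=True)
class Mention:
    symbol: str  # canonical gene symbol from the dictionary
    text: str    # the matched text as it appears in the document
    start: int   # character offset where the match begins
    end: int     # character offset just past the match


def build_matcher(dictionary):
    """Compile one regex over all terms; longer terms are tried first."""
    term_to_symbol = {}
    for symbol, terms in dictionary.items():
        for term in terms:
            term_to_symbol[term.lower()] = symbol
    alternatives = sorted(term_to_symbol, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])("
        + "|".join(re.escape(term) for term in alternatives)
        + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return pattern, term_to_symbol


def annotate(text, dictionary=TOY_DICTIONARY):
    """Return every dictionary hit in the text as a Mention."""
    pattern, term_to_symbol = build_matcher(dictionary)
    return [
        Mention(term_to_symbol[m.group(1).lower()], m.group(1), m.start(1), m.end(1))
        for m in pattern.finditer(text)
    ]


def genes_mentioned(text, dictionary=TOY_DICTIONARY):
    """Distinct canonical symbols found in the text, sorted."""
    return sorted({mention.symbol for mention in annotate(text, dictionary)})
```

The character offsets in each `Mention` are what a viewer would need to highlight hits in the article text, and `genes_mentioned` gives the kind of per-article gene list a curator could scan before indexing.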
ProMiner at MGI
Of all the dictionary-based NER tools we evaluated, ProMiner most closely fit our needs
  Rule-based protein and entity recognition using pre-processed dictionaries (Entrez Gene, SwissProt, ATCC, and ECACC)
  Batch processing of PDF files (beta release)
  Standard and custom reports
  Customizable annotation projects and dictionaries/term lists
Initiated collaborative pilot project between SCAI and MGI
ProMiner Technical Specifications
System requirements
  Runs on Linux systems, Sun-Ultra, and other UNIX-based systems
  Requires minimum 1 GB RAM, 500 MB disk space
  Java (v1.5 or higher) and Perl (v5.8 or higher)
  Uses GeneDB to retrieve data (requires 1 GB to store index files)
  Includes an HTML-based (CGI) viewer
One processor can update ~1000 articles per project
On a cluster of 16 processors, ProMiner can search the entire MEDLINE literature base with 1 dictionary in ~2 hours
ProMiner Implementation at MGI
MGI Operating Environment
  Dedicated Sun Fire X4100 Server with two dual core AMD Opteron processors, 2.8 GHz, 64 bit
  Solaris 10 V. 508 operating system, Java5 built-in
  Adobe Acrobat Pro Version 9.1
SCAI delivered…
  Installation scripts, ProMiner scripts and dictionaries
  Documentation and demos
  MGI project definition files for annotation using human and mouse dictionaries
Testing, Testing, Testing
HTML Version 6.4 implemented in March
PDF Version 7.1 delivered in August
Reports to Scan for Gene References
This paper was indexed to mouse genes Tlr4 and Ly96
Annotation Dictionary Layers
Preliminary Results
  1 part-time curator working 5.5 hours a day processing batches of 10 articles at a time
  8 of 10 PDFs processed correctly, without errors
  Some PDF format (PDF/A) and color labeling errors
  We provide feedback to SCAI to enhance dictionaries and PDF formatting
Manual Indexing           Indexing with ProMiner
30 minutes per article    18-24 minutes per article
50 articles per week      60-70 articles per week
F-Score performance measurements in progress
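For reference, the F-score combines precision and recall into a single number. The sketch below uses the standard definition and compares the genes a tool proposes for an article against the genes a curator indexed. The gene sets in the usage lines are made-up examples, not MGI results, and the function name is hypothetical.

```python
def precision_recall_f(predicted, gold, beta: float = 1.0):
    """Return (precision, recall, F-beta) for two collections of gene symbols.

    predicted: symbols proposed by the tool
    gold:      symbols assigned by a curator (treated as the truth)
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    beta_sq = beta * beta
    f_score = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return precision, recall, f_score


# Made-up example: the tool proposes one gene the curator did not index.
p, r, f1 = precision_recall_f({"Tlr4", "Ly96", "Cd14"}, {"Tlr4", "Ly96"})
```

In this example precision is 2/3, recall is 1, and F1 works out to 0.8.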
ProMiner 7.1 annotates 75 full-text articles in PDF format in less than 20 minutes on our server
Processing time = 0.2333 × (No. Articles) + 0.5751
R² = 0.9905
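The fitted line can be evaluated directly. The sketch below assumes the processing time is in minutes, which the slide does not state but which matches the "75 articles in less than 20 minutes" figure. Batch sizes outside the tested range are extrapolation. The function names are illustrative.

```python
import math

# Coefficients from the slide's linear fit. Unit assumed to be minutes.
SLOPE = 0.2333      # added time per article
INTERCEPT = 0.5751  # fixed overhead per batch


def predicted_processing_time(n_articles: int) -> float:
    """Predicted batch processing time for n_articles, per the fitted line."""
    if n_articles < 0:
        raise ValueError("n_articles must be non-negative")
    return SLOPE * n_articles + INTERCEPT


def max_articles_within(budget: float) -> int:
    """Largest batch whose predicted time fits within the time budget."""
    if budget < INTERCEPT:
        return 0
    return math.floor((budget - INTERCEPT) / SLOPE)
```

For a batch of 75 articles the line predicts about 18.07, consistent with the "less than 20 minutes" observation. Taken at face value, the same line suggests roughly 83 articles would fit in a 20-minute window.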
ProMiner at MGI
Next Steps
  Complete performance testing and evaluate status of pilot project with SCAI
  Consider extending pilot to continue testing ProMiner 7.1
  Explore future collaborations
    Gene Ontology terms
    Protein-protein interactions
    Other curation functions
Acknowledgments
MGI: Judith Blake, Nancy Butler, Harold Drabkin, Alex Diehl, David Hill, Monica McAndrews-Hill, Sue McClatchy, David Shaw, Dmitry Sitnikov
MGI System Administration: Matt Baya, Mike McCrossin, Iry Witham
Fraunhofer SCAI: Juliane Fluck, Heinz-Theodor Mevissen, Symposium Organizers
MITRE Corporation: Lynette Hirschman
Journal of Immunology