Lynette Hirschman The MITRE Corporation Bedford, MA, USA RegCreative Jamboree Nov 29-Dec 1, 2006

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED.


Text Mining for Biology


Outline

Overview of text mining
- Retrieval and extraction
- Where are we?

How text mining can help
- Database consistency assessment
- Tools to aid curators

Conclusions


Text Mining Overview

Information Retrieval: retrieve & classify documents via key words

Information Extraction: identify, extract & normalize entities, relations

Question Answering: question to answer

Sources: MEDLINE, PIR, Genbank

Scale of the text units handled
- Collections: Gigabytes
- Documents: Megabytes
- Lists, Tables: Kilobytes
- Phrases: Bytes (e.g., "Protease-resistant prion protein interacts with...")


The MOD Curation Pipeline and Text Mining

MEDLINE

1. Select papers

2. List genes for curation

3. Curate genes from paper

Text mining tasks
- KDD 2002 Task 1; TREC Genomics 2004 Task 2; BioCreAtIvE II: PPI article selection
- BioCreAtIvE: Gene Normalization. Extract gene names & normalize: 20 participants
- BioCreAtIvE II: Protein annotation. Find relations & supporting evidence in text: 28 participants


ORegAnno Curation Pipeline & Text Mining

MEDLINE

1. Select papers

2. List TFBS for curation

3. Curate genes from paper

Text mining tasks
- Gene & TF Normalization: extract gene, protein names & normalize to standard ID
- Extract evidence passages and map to evidence types/sub-types
- Curation queue management


State of the Art: Document Retrieval

Input: query words

Output: ranked list of documents

Approach
- Speed, scalability, domain independence and robustness are critical for access to large collections of documents

Techniques
- Shallow processing provides coarse-grained result (entire documents or passages)
- Query is transformed to a collection of words, but grammatical relations between words are lost
- Documents are indexed by word occurrences
- Search matches query bag-of-words against indexed documents using Boolean combination of terms, or vector of word occurrences, or language model
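The bag-of-words matching described above can be illustrated with a minimal sketch. It assumes the "vector of word occurrences" option, scored with cosine similarity; the tokenizer, the function names and the three toy documents are hypothetical and are not taken from any system discussed in the talk.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word occurrences; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return document IDs ordered by decreasing similarity to the query."""
    query_vector = bag_of_words(query)
    scored = [
        (cosine(query_vector, bag_of_words(text)), doc_id)
        for doc_id, text in documents.items()
    ]
    # Sort by score (highest first); break ties by document ID.
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "prion protein interacts with receptor",
    "d2": "gene expression in Drosophila",
    "d3": "prion protein",
}
print(rank("prion protein", docs))  # ['d3', 'd1', 'd2']
```

Because the query and documents are reduced to word counts, "protein interacts with prion" and "prion interacts with protein" score identically, which is the loss of grammatical relations noted in the slide.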


State of the Art: Extraction

For news, automated systems exist now that can:
- Identify entities (90-95% F-measure*)
- Extract relations among entities (70-80% F) (information extraction)
- Answer simple factual questions using large document collections at 75-85% accuracy (question answering)

How good is text mining applied to biology?
- Is biology easier, because it has structured resources (ontology, synonym lists)?
- Is it harder because of specialized biological language, complex biological reasoning?

* F-measure is the harmonic mean of precision and recall: 2*P*R/(P+R). Precision = TP/(TP+FP); Recall = TP/(TP+FN)
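The footnote definitions translate directly into a few lines of code. This is a minimal sketch; the function name and the zero-division handling (returning 0.0 when a denominator is empty) are assumptions made for the illustration, and the counts in the example are invented.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F-measure from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Empty denominators yield 0.0 (an assumption of this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented counts: 3 correct answers returned, 1 wrong answer, 3 answers missed.
p, r, f = precision_recall_f(tp=3, fp=1, fn=3)
print(p, r, f)
```

Because F is a harmonic mean, it stays close to the lower of the two values: a system cannot reach a high F-measure by maximizing precision alone or recall alone.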


Assessments: Document Classification

TREC Genomics track focused on retrieval
- Part of the Text Retrieval Conference (TREC), run by the National Institute of Standards and Technology
- Tasks have included retrieval of
    - Documents to identify gene function
    - Documents for MGI curation pipeline
    - Documents, passages to answer queries, e.g., “what effect does the insulin receptor gene have on tumorigenesis?”
- 40+ groups participating starting 2004

KDD Challenge Cup task 2002
- Yeh et al, MITRE; Gelbart, Mathew et al, FlyBase task


KDD Challenge Cup

Task: automate part of FlyBase curation:
- Determine which papers need to be curated for Drosophila gene expression information
- Curate only those papers containing experimental results on gene products (RNA transcripts and proteins)

Teamed with FlyBase, who provided
- Data annotation plus biological expertise
- Input on the task formulation

Venue: ACM conference on Knowledge Discovery and Data Mining (KDD)
- Alex Yeh (MITRE) ran Challenge Cup task


FlyBase: Evidence for Gene Products


Results

18 teams submitted results (32 entries)

Winner: a team from ClearForest and Celera
- Used manually generated rules and patterns to perform information extraction

Subtask results

Subtask                     Best   Median
Ranked-list for curation    84%    69%
Yes/No curate paper         78%    58%
Yes/No gene products        67%    35%

Conclusion: ranking papers for curation is promising; open question: would this help curators?


BioCreAtIvE I: Workshop March 2004
- Tasks (Participation)
    - Gene Mention (15)
    - Gene Normalization: Fly, Mouse, Yeast (8)
    - Functional Annotation (8)

BioCreAtIvE II: Workshop April 2007
- Tasks (Participation)
    - Gene Mention (21)
    - Gene Normalization: Human (20)
    - Protein-Protein Interaction (28)


Gene Normalization

Task: list unique gene IDs for Fly, Mouse, Yeast abstracts

Sample abstract:

A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to 3-56.7 on the standard genetic map (Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated.

Abstract ID           Organism Gene ID
fly_00035_training    FBgn0000592
fly_00035_training    FBgn0026412

Sample Gene ID and synonyms:
FBgn0000592: Est-6, Esterase 6, CG6917, Est-D, EST6, est-6, Est6, Est, EST-6, Esterase-6, est6, Est-5, Carboxyl ester hydrolase
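To make the task concrete, here is a minimal sketch of dictionary-based normalization using the FBgn0000592 synonym list above. Plain case-insensitive lexicon lookup is an assumption chosen for illustration, not the method of any BioCreAtIvE participant; the function names are hypothetical, and only the one gene whose synonyms appear on the slide is included.

```python
import re

# Synonym list for one gene, copied from the slide. A real lexicon would
# cover every gene of the organism.
SYNONYMS = {
    "FBgn0000592": [
        "Est-6", "Esterase 6", "CG6917", "Est-D", "EST6", "est-6", "Est6",
        "Est", "EST-6", "Esterase-6", "est6", "Est-5",
        "Carboxyl ester hydrolase",
    ],
}


def build_lexicon(synonyms):
    """Map each lowercased name to the set of gene IDs it may denote."""
    lexicon = {}
    for gene_id, names in synonyms.items():
        for name in names:
            lexicon.setdefault(name.lower(), set()).add(gene_id)
    return lexicon


def normalize(text, lexicon):
    """Return the sorted list of unique gene IDs whose names occur in the text."""
    lowered = text.lower()
    found = set()
    for name, gene_ids in lexicon.items():
        # Require that the name is not embedded inside a longer token.
        pattern = r"(?<![a-z0-9])" + re.escape(name) + r"(?![a-z0-9])"
        if re.search(pattern, lowered):
            found |= gene_ids
    return sorted(found)


lexicon = build_lexicon(SYNONYMS)
excerpt = (
    "A locus has been found, an allele of which causes a modification of "
    "some allozymes of the enzyme esterase 6 in Drosophila melanogaster. "
    "The locus responsible has been mapped to 3-56.7 on the standard "
    "genetic map (Est-6 is at 3-36.8)."
)
print(normalize(excerpt, lexicon))  # ['FBgn0000592']
```

A lookup like this returns every ID that shares a name, so short or shared synonyms such as "Est" inflate false positives when many genes are loaded; handling that ambiguity is the hard part of the task.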


BioCreAtIvE I Results: Gene Normalization

[Figure: precision (y-axis) vs. recall (x-axis) plot of gene normalization results for Fly, Mouse and Yeast, with 0.8 and 0.9 F-measure curves]

- Yeast results good
    - High: 0.93 F
    - Smallest vocab
    - Short names
    - Little ambiguity
- Fly: 0.82 F
    - High ambiguity
- Mouse: 0.79 F
    - Large vocabulary
    - Long names
- Human: ~80% (BioCreAtIvE II)


Impact of BioCreAtIvE I

BioCreAtIvE showed state of the art:
- Gene name mentions: F = 0.83
- Normalized gene IDs: F = 0.8 - 0.9
- Functional annotation: F ~ 0.3

BioCreAtIvE II
- Participation 2-3x higher!
- Results and workshop April 23-25, Madrid

What next?
- New model of curator/text mining cooperation
    - Have biological curators contribute data (training and test sets)
    - Text mining developers work on real biological problems
- RegCreative is an instance of this model


How Text Mining Can Help

Quality & Consistency
- Assess consistency of annotation
- First step is to determine consistency of human performance on classification or annotation tasks
- Use agreement studies to improve annotation guidelines and resources (training materials, annotated data)

Coverage
- Text mining can speed up curation to achieve better coverage

Currency
- Faster curation improves currency of annotations


Inter-Annotator Agreement

Thesis: if people cannot do a task consistently, it will be hard to automate the task
- Also, data will be less valuable

Method
- Two humans perform same classification task on a “blind” data set, using classification guidelines (after some designated training)
- Results are compared via a scoring metric

Outcome: Determine whether guidelines are sufficient to ensure consistent classification

Study can be informal
- Used to flag places that need improvement
- Or more formal, to measure progress over time
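The slide does not name a scoring metric. One common choice for comparing two annotators on a classification task is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes that metric; the function name and the "curate"/"skip" labels are invented for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented example: two curators decide whether four papers need curation.
curator_1 = ["curate", "curate", "skip", "skip"]
curator_2 = ["curate", "skip", "skip", "skip"]
print(cohen_kappa(curator_1, curator_2))  # 0.5
```

Here the curators agree on three of four papers (75% raw agreement), but half of that would be expected by chance given their label frequencies, so kappa is 0.5. Tracking a score like this across guideline revisions is one way to run the more formal kind of study.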


Checking Interannotator Agreement: An Experiment from BioCreAtIvE I

Camon et al did 1st inter-curator agreement experiment*
- 3 EBI GOA annotators annotated 12 overlapping documents for GO terms (4 docs/pair of curators)
- Results after developing consensus gold standard:
    - Avg precision (% annotations correct): ~95%
    - Avg recall (% correct annotations found): ~72%

Lessons learned
- Very few wrong annotations, but some were missed
- Annotators differed on specificity of annotation, depending on their biological knowledge
- Annotation by paper meant evidence standard was less clear (normal annotation is by protein)
- Annotation is a complex task for people!

* Camon et al., BMC Bioinformatics 2005, 6(Suppl 1):S17


Conclusions

Text mining can provide a methodology to assess consistency of annotation

Text mining can provide tools
- To manage the curation queue
- To assist curators, particularly in normalization & mapping into ontologies

Next steps
- Define intended uses of RegCreative data
- Establish curator training materials
- Identify key bottlenecks in curation
- Provide data, user input to develop tools

Major stumbling block for text mining
- Handling of PDF documents!


Acknowledgements

US National Science Foundation for funding of BioCreAtIvE I and BioCreAtIvE II*

MITRE colleagues who worked on BioCreAtIvE
- Alex Morgan (now at Stanford)
- Marc Colosimo
- Jeff Colombe
- Alex Yeh (also KDD Challenge Cup)

Collaborators at CNB and CNIO
- Alfonso Valencia
- Christian Blaschke (now at bioalma)
- Martin Krallinger

* Contract numbers EIA-0326404 and IIS-0640153.