Semi-Automatic Indexing of Full Text Biomedical Articles
Washington D.C. October 25, 2005
Clifford W. Gay
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Acknowledgments
Alan R. Aronson, PhD.
Mehmet Kayaalp, M.D., PhD.
Outline
• Introduction
  – The System: Medical Text Indexer (MTI)
  – The Data: Online biomedical journals
  – The Task: Emulate Medline indexing using full text
• Results
  – Observations on PubMed Central articles
  – Model selection results
  – Recent work
Introduction
The System: Medical Text Indexer (MTI)
Why Semi-Automatic Indexing?
• U.S. National Library of Medicine indexes 5000 journal titles
• Supports over 60 million PubMed searches each month
• Has 130 indexers
• Indexed 570,000 articles in 2004
• Will need to index 1,000,000 very soon
• Automated support is helping to meet this demand
  – MTI was used on 26% of articles in 2004
More about MTI
Aronson AR, Mork JG, Gay CW, Humphrey SM, Rogers WJ. The NLM Indexing Initiative's Medical Text Indexer. Medinfo. 2004; 11(Pt 1): 268-72. PMID: 15360816
Medical Text Indexer (MTI)
[Figure: MTI processing diagram. Labels: Title + Abstract et al.; MetaMap; Phrases; UMLS Concepts; Trigram Phrase Matching; PubMed Related Citations; Rel. Cits.; Restrict to MeSH; Extract MeSH; MeSH Headings; Postprocessing; Ordered list of MeSH Terms]
DCMS with MTI Suggestions
Introduction
The Data: Online biomedical journals
Why Full Text?
• Medical Text Indexer uses article title and abstract
• However
  – Human indexers taught not to use abstract
  – Author’s complete intent may not be in abstract
  – Check tags may only appear in a table or methods section
• If MTI indexes from full text articles it may
  – Find central concepts missing from abstract
  – Identify terms when article has no abstract
  – More accurately select check tags
  – Be in better compliance with indexing policy
Test Collection Selection
• Available online from PubMed Central
• Consistent XML format
  – Identifies title, abstract, sections, tables, figures, references, etc.
• 500 articles from 17 diverse biomedical journals
• Did not use:
  – References
  – Graphics
  – Math
Test Collection
• 5 Clinical journals (165):
  – Breast Cancer Research (11)
  – Journal of Clinical Microbiology (80)
• 3 Organization based journals (28):
  – Journal of American Medical Informatics Assoc. (10)
  – Proceedings of the National Academy of Sciences (11)
• 9 Journals in other categories:
  – Pharmacology (65); Biochemistry (65); Plants (46); Molecular Biology (45); Learning (30); Hospitals (22)
Introduction
The Task: Emulate Medline indexing using full text
Indexing Task
[Figure: MTI processing diagram, shown twice. Labels: Title + Abstract et al.; MetaMap; Phrases; UMLS Concepts; Trigram Phrase Matching; PubMed Related Citations; Rel. Cits.; Restrict to MeSH; Extract MeSH; MeSH Headings; Postprocessing; Ordered list of MeSH Terms]
Example Article

Medline Indexing
• beta-Lactamases /*genetics /*metabolism
• Enterobacteriaceae /drug effects /*enzymology /genetics
• Plasmids /*genetics
• Genes, Bacterial /genetics
• Genotype
• Kinetics
• Microbial Sensitivity Tests
• Molecular Sequence Data
• Research Support, Non-U.S. Gov't

MTI Indexing (each term is marked on the slide by source: MMI, REL, or MMI & REL)
• DNA Transposable Elements
• Escherichia coli
• Genes, Bacterial
• Cloning, Molecular
• Klebsiella pneumoniae
• Amino Acid Sequence
• Microbial Sensitivity Tests
• Cephalothin
• Proteus mirabilis
• Erwinia
• Salmonella typhimurium
• Enterobacteriaceae Infections
• Lactams

• beta-Lactamases
• Plasmids
• Enterobacteriaceae
• beta-Lactam Resistance
• Conjugation, Genetic
• Cephalosporin Resistance
• Cefotaxime
• Nucleotide Sequences
• Molecular Sequence Data
• Cephalosporins
• Chromosomes, Bacterial
• DNA, Bacterial

Recall = 0.67, Precision = 0.24, F2 measure = 0.492
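The recall, precision, and F2 figures reported for the example article can be reproduced from the two term lists. The sketch below rests on assumptions that the slide does not state: a suggestion counts as correct when its main heading matches a Medline main heading exactly (subheadings and star markings ignored), "Research Support, Non-U.S. Gov't" is counted among the Medline terms, and F2 is the standard F-beta measure with beta = 2. The variable names are illustrative.

```python
# Main headings assigned by the Medline indexers for the example article.
# Assumption: subheadings such as /*genetics are ignored when matching.
medline = {
    "beta-Lactamases", "Enterobacteriaceae", "Plasmids", "Genes, Bacterial",
    "Genotype", "Kinetics", "Microbial Sensitivity Tests",
    "Molecular Sequence Data", "Research Support, Non-U.S. Gov't",
}

# The 25 terms that MTI suggested for the same article.
mti = {
    "DNA Transposable Elements", "Escherichia coli", "Genes, Bacterial",
    "Cloning, Molecular", "Klebsiella pneumoniae", "Amino Acid Sequence",
    "Microbial Sensitivity Tests", "Cephalothin", "Proteus mirabilis",
    "Erwinia", "Salmonella typhimurium", "Enterobacteriaceae Infections",
    "Lactams", "beta-Lactamases", "Plasmids", "Enterobacteriaceae",
    "beta-Lactam Resistance", "Conjugation, Genetic",
    "Cephalosporin Resistance", "Cefotaxime", "Nucleotide Sequences",
    "Molecular Sequence Data", "Cephalosporins", "Chromosomes, Bacterial",
    "DNA, Bacterial",
}

matched = medline & mti                      # terms both sides agree on
recall = len(matched) / len(medline)         # share of Medline terms found
precision = len(matched) / len(mti)          # share of suggestions that are right
f2 = 5 * precision * recall / (4 * precision + recall)  # F-beta with beta = 2

print(f"recall={recall:.2f} precision={precision:.2f} F2={f2:.3f}")
```

Under these assumptions six of the nine Medline terms are matched by the 25 suggestions, which agrees with the values on the slide.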
Evaluation
• F2 measure
  – Weighted harmonic mean of Recall and Precision
  – Weights Recall twice as important as Precision
  – Values: 0.0 to 1.0
• Computed for each article and averaged
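The slide does not spell out the formula. Assuming the standard F-beta measure with beta = 2 (an assumption, though it is consistent with the example article, where R = 0.67 and P = 0.24 give F2 = 0.492), the per-article score and the collection average over N articles would be:

```latex
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R},
\qquad
F_2 = \frac{5\,P\,R}{4P + R},
\qquad
\overline{F_2} = \frac{1}{N}\sum_{i=1}^{N} F_2^{(i)}
```

Here P and R are the precision and recall of one article, and the superscript (i) indexes articles. Because the average is taken over per-article scores, the averaged F2 generally differs from the F2 computed from averaged recall and precision.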
Results
Observations on PubMed Central articles
Section Header Classes
• Semantically equivalent section headers
• MATERIALS AND METHODS class:
  – Materials and Method(s)
  – Method(s)
  – Scoring Methods
  – Experimental Procedures
  – Other Methods Tested
• CAPTIONS class:
  – the titles and captions from tables and figures
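One way to picture the grouping of headers into classes is a lookup table keyed on a normalized header string. The sketch below is illustrative only: the variant list comes from the slide, but the normalization rules, the function names, and the fallback to the OTHER and NO HEADER classes are assumptions, not the authors' procedure.

```python
import re
from typing import Optional

# Header variants listed on the slide for one class. Other classes would
# need their own variant lists, which the slide does not give.
HEADER_CLASSES = {
    "MATERIALS AND METHODS": [
        "Materials and Methods", "Materials and Method", "Methods", "Method",
        "Scoring Methods", "Experimental Procedures", "Other Methods Tested",
    ],
}

def normalize(header: str) -> str:
    """Collapse whitespace, trim, drop a trailing period, and lowercase."""
    return re.sub(r"\s+", " ", header).strip().rstrip(".").lower()

# Reverse index: normalized header variant -> section class.
LOOKUP = {
    normalize(variant): section_class_name
    for section_class_name, variants in HEADER_CLASSES.items()
    for variant in variants
}

def section_class(header: Optional[str]) -> str:
    """Map a raw section header to a class (fallback rules are assumed)."""
    if header is None or not header.strip():
        return "NO HEADER"
    return LOOKUP.get(normalize(header), "OTHER")

print(section_class("Experimental Procedures"))
print(section_class("Bacterial strains."))
```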
Section Class Performance

Section Class    Average F2
CAPTIONS         0.3175
ABSTRACT         0.2960
INTRODUCTION     0.2869
RESULTS          0.2790
DISCUSSION       0.2734
NO HEADER        0.2574
…                …
CONCLUSIONS      0.1961
ABBREVIATIONS    0.1304
Results
Model selection results
Experiments
• Varied MTI components used
  – MetaMap Indexing (MMI)
  – Related Citations (REL)
• Varied section classes processed
  – Used model selection
  – Used binary weighting for sections
• A model is
  – A selection of section classes and
  – The text in those sections
  – That represents the article
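A model in this sense can be sketched as a mapping from section class to a 0/1 weight, applied to an article to produce the text handed to the indexer. The data layout, the function name `represent`, and the sample article below are hypothetical; only the idea of binary weights over section classes comes from the slide.

```python
from typing import Dict, List, Tuple

# One article = a list of (section_class, text) pairs, in document order.
Article = List[Tuple[str, str]]

def represent(article: Article, model: Dict[str, int]) -> str:
    """Return the text of every section whose class has weight 1 in the model.

    Binary weighting: a class is either included (1) or excluded (0).
    Classes missing from the model are treated as excluded.
    """
    return "\n".join(text for cls, text in article if model.get(cls, 0) == 1)

# Hypothetical article and model, for illustration only.
article = [
    ("TITLE", "Example title"),
    ("ABSTRACT", "Example abstract."),
    ("MATERIALS AND METHODS", "Example methods text."),
    ("RESULTS", "Example results text."),
    ("CAPTIONS", "Table 1. Example caption."),
]
model = {"TITLE": 1, "ABSTRACT": 1, "MATERIALS AND METHODS": 0,
         "RESULTS": 1, "CAPTIONS": 1}

print(represent(article, model))
```

Model selection then amounts to trying different weight assignments and keeping the one with the best average F2 on the test collection.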
Production Baseline
Diagram labels: Title+Abstract; MMI; REL
F2 = 0.457
Naive Mode
Diagram labels: Title+Abstract; MMI; REL; Materials and Methods; Results and Discussion; No Header; All Section Classes
F2 = 0.453 (-0.9%)
MetaMap Indexing Mode
Diagram labels: Title+Abstract; MMI; REL; Introduction; Results; Discussion; Other; No Header; Captions
F2 = 0.373 (-18.4%)
Augmented Mode
Diagram labels: Title+Abstract; MMI; REL; Introduction; Results; Discussion; Other; No Header; Captions
F2 = 0.475 (+3.9%)
Refined Augmented Mode
Diagram labels: Title+Abstract; MMI; REL; Captions; Results; Background
F2 = 0.485 (+6.1%)
Full MTI Mode
Diagram labels: Title+Abstract; MMI; REL; Introduction; Results; Discussion; Other; No Header; Captions; MMI model
F2 = 0.488 (+6.8%)
Refined Full MTI
Diagram labels: Title+Abstract; MMI; REL; Results; Results and Discussion; No Header; Captions; Conclusions
F2 = 0.491 (+7.4%)
MTI Performance Summary

Indexing Model                          Recall  Precision  Avg. F2
Production Baseline (Ti, Ab)            0.53    0.32       0.457
Naive Mode (full text)                  0.57    0.27       0.453
Augmented Mode (MMI + REL (Ti, Ab))     0.59    0.29       0.475
Augmented Mode (refined)                0.60    0.30       0.485
Full MTI (MMI + REL common sections)    0.60    0.30       0.488
Full MTI (refined)                      0.60    0.31       0.491
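The percentages quoted on the individual mode slides appear to be relative changes in average F2 against the production baseline. A minimal sketch of that arithmetic, assuming the change is (F2 - baseline) / baseline and using a hypothetical helper name:

```python
BASELINE_F2 = 0.457  # Production Baseline (Ti, Ab)

# Average F2 per indexing model, as listed in the summary table.
avg_f2 = {
    "Naive Mode (full text)": 0.453,
    "Augmented Mode (MMI + REL (Ti, Ab))": 0.475,
    "Augmented Mode (refined)": 0.485,
    "Full MTI (MMI + REL common sections)": 0.488,
    "Full MTI (refined)": 0.491,
}

def relative_change(f2: float, baseline: float = BASELINE_F2) -> float:
    """Percentage change of f2 relative to the baseline, to one decimal."""
    return round((f2 - baseline) / baseline * 100, 1)

for name, f2 in avg_f2.items():
    print(f"{name}: {relative_change(f2):+.1f}%")
```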
Results
Recent work
Improvement Potential
• With current model
  – No cut off at 25 terms yields maximum recall of 0.79
  – If all good terms prioritized correctly F2 = 0.64
• Improvement over baseline: 7% (current model), 40% (potential)
Increase REL Citations
• MTI currently uses 10 Related Citations
• Optimal number for full text articles is 15
• Best model confirmed for this setting
• Additional improvement in F2 = 0.01
Summarization
• Selecting important text before MTI processing
• Using Yeh, Ke, Yang, Meng approach
  – Combines Latent Semantic Analysis and Salton’s Text Relationship Map
• Start with current model
• Document representation includes
  – Bag of words
  – MetaMap identified concepts
NLM Indexing Initiative
Clifford W. Gay
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Contact: [email protected] (nlm.nih.gov)
Web: ii.nlm.nih.gov/fulltext.shtml
NONE Sections
• Most appear in articles that have no abstract (20/23)
• Some are errors
  – 4 have “Introduction” header in publisher version
  – 2 appear within other sections with headers
• Many contain the primary text of the article
  – Comments, Editorials, Letters (11/23)
Other Sections
• Other section class has 525 sections (16%)
• Non-standard article organization
  – Common in Review articles
• Example: β-Lactamases of Kluyvera ascorbata, Probable Progenitors of Some Plasmid-Encoded CTX-M Types
  – Bacterial strains.
  – Antimicrobial agents and susceptibility testing.
  – Kinetic and IEF analyses.
  – Genetic characterization of blaKLUA.
  – Genetic environment of blaKLUA-1.
  – Arguments for mobilization of chromosomal blaKLUA gene.
Ranking Function
• Made ranking function for Related Citations more like MetaMap Indexing
• Resulted in a more inclusive model
  – Materials and Methods
  – Introduction
• F2 measure = 0.4865
Tuning Path Weight
• Ratio of weights between the two indexing paths
  – MetaMap Indexing: 7
  – Related Citations: 2
• No improvement possible
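The slide gives only the 7:2 ratio, not how the two paths are merged. As a rough illustration, the sketch below assumes a simple linear combination of per-term scores from each path, followed by ranking and a cut-off at 25 terms (the cut-off mentioned under Improvement Potential). The function name, the score scale, and the sample scores are all hypothetical and are not MTI's actual ranking function.

```python
from typing import Dict, List

MMI_WEIGHT = 7   # MetaMap Indexing path weight (from the slide)
REL_WEIGHT = 2   # Related Citations path weight (from the slide)
MAX_TERMS = 25   # cut-off on the number of suggested terms

def combine(mmi: Dict[str, float], rel: Dict[str, float],
            max_terms: int = MAX_TERMS) -> List[str]:
    """Rank terms by an assumed weighted sum of the two path scores."""
    terms = set(mmi) | set(rel)
    combined = {
        t: MMI_WEIGHT * mmi.get(t, 0.0) + REL_WEIGHT * rel.get(t, 0.0)
        for t in terms
    }
    # Highest combined score first; ties broken alphabetically.
    return sorted(combined, key=lambda t: (-combined[t], t))[:max_terms]

# Hypothetical per-path scores, for illustration only.
mmi_scores = {"Plasmids": 0.5, "Kinetics": 0.2}
rel_scores = {"Plasmids": 0.4, "Genotype": 0.9}

print(combine(mmi_scores, rel_scores))
```

Tuning the ratio would mean re-running the evaluation with different values of the two weights and comparing the average F2.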
Partial Weight for Singleton Headers
• OTHER section class
  – Header is unique
  – Contain content terms
• Gave section class weight between 0 and 1
  – Some recall improvement
  – No collection wide improvement in F2