Page 1: Amia06

AMIA-2006 1

A Comparative Study of Supervised Learning

as Applied to Acronym Expansion in

Clinical Reports

Mahesh Joshi, Serguei Pakhomov, Ted Pedersen, Christopher G. Chute

University of Minnesota, Duluth

Mayo College of Medicine, Rochester

Page 2: Amia06

Overview

• Acronyms are ambiguous
  – in general, and in more specialized domains

• Acronyms can be disambiguated by expansion
  – expansions act as senses or definitions

• Acronym expansion can be viewed as word sense disambiguation
  – supervised learning from annotated examples

• Features trump learning algorithms
  – unigrams dominant

Page 3: Amia06

AMIA - Top Google Results

• American Medical Informatics Association

• Association of Moving Image Archivists

• Anglican Mission in America

• Asociación Mutual Israelita Argentina

Page 4: Amia06

RN in Wikipedia

• Registered Nurse

• Royal Navy

• Radio National

• Radio Nederland

• Richard Nixon

• Registered Identification Number

• Renovación Nacional

Page 5: Amia06

Acronym Ambiguity not just a problem for General English…

• 33% of acronyms in UMLS are ambiguous
  – Liu et al., AMIA-2001

• 81% of acronyms in MEDLINE abstracts are ambiguous, with an average of 16 expansions
  – Liu et al., AMIA-2002

Page 6: Amia06

We view AE as WSD

• AE
  – sense 1: American Eagle
  – sense 2: Arab Emirates
  – sense 3: acronym expansion

• WSD
  – sense 1: Washington School for the Deaf
  – sense 2: web server director
  – sense 3: word sense disambiguation

Page 7: Amia06

Methodology

• Identify 16 ambiguous acronyms
  – 9 from Pakhomov et al., AMIA-2005
  – 7 newly annotated for this study

• Manually annotate in clinical notes
  – 7,738 total instances from Mayo Clinic database of clinical notes

• Use as training data for supervised learning

Page 8: Amia06

Acronyms (majority < 50%)

• AC
  – Acromioclavicular
  – Antitussive with Codeine
  – Acid Controller
  – 10 more expansions

• APC
  – Argon Plasma Coagulation
  – Adenomatous Polyposis Coli
  – Atrial Premature Contraction
  – 10 more expansions

• LE
  – Limited Exam
  – Lower Extremity
  – Initials
  – 5 more expansions

• PE
  – Pulmonary Embolism
  – Pressure Equalizing
  – Patient Education
  – 12 more expansions

Page 9: Amia06

Acronyms (50% < majority < 80%)

• CP
  – Chest Pain
  – Cerebral Palsy
  – Cerebellopontine
  – 19 more expansions

• HD
  – Huntington's Disease
  – Hemodialysis
  – Hospital Day
  – 9 more expansions

• CF
  – Cystic Fibrosis
  – Cold Formula
  – Complement Fixation
  – 6 more expansions

• MCI
  – Mild Cognitive Impairment
  – Methylchloroisothiazolinone
  – Microwave Communications, Inc.
  – 5 more expansions

• ID
  – Infectious Disease
  – Identification
  – Idaho Identified
  – 4 more expansions

• LA
  – Long Acting
  – Person
  – Left Atrium
  – 5 more expansions

Page 10: Amia06

Acronyms (majority > 80%)

• MI
  – Myocardial Infarction
  – Michigan
  – Unknown
  – 2 more expansions

• ACA
  – Adenocarcinoma
  – Anterior Cerebral Artery
  – Anterior Communicating Artery
  – 3 more expansions

• GE
  – Gastroesophageal
  – General Exam
  – Generose
  – General Electric

• HA
  – Headache
  – Hearing Aid
  – Hydroxyapatite
  – 2 more expansions

• FEN
  – Fluids, Electrolytes and Nutrition
  – Drug Fen Phen
  – Unknown

• NSR
  – Normal Sinus Rhythm
  – Nasoseptal Reconstruction
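The three acronym groups above are defined by how much of an acronym's annotated instances belong to its most frequent expansion. A minimal sketch of that grouping follows. The function names and the toy label counts are hypothetical, and the slides do not say how a share of exactly 50% or 80% is assigned, so the boundary handling here is an assumption.

```python
from collections import Counter

def majority_share(labels):
    """Return (most frequent expansion, its share of all instances)."""
    counts = Counter(labels)
    sense, freq = counts.most_common(1)[0]
    return sense, freq / len(labels)

def majority_group(share):
    """Bucket an acronym by majority-sense share.

    Assumption: boundary values (exactly 0.5 or 0.8) fall into the
    higher bucket; the slides leave this unspecified.
    """
    if share < 0.5:
        return "majority < 50%"
    if share < 0.8:
        return "50% < majority < 80%"
    return "majority > 80%"

# Hypothetical annotation counts, for illustration only.
toy_labels = ["Normal Sinus Rhythm"] * 9 + ["Nasoseptal Reconstruction"]
toy_sense, toy_share = majority_share(toy_labels)
toy_group = majority_group(toy_share)
```

The majority share is also the accuracy of a baseline that always predicts the most frequent expansion, which is what the "Majority" series in the results charts refers to.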

Page 11: Amia06

Experimental Objectives

• Compare performance of ML methods
  – Naïve Bayesian classifier
  – J48/C4.5 decision tree learner
  – Support vector machine (SMO)

• Compare four different feature sets
  – POS tags from Brill-Hepple Tagger
  – Unigrams that occur 5 or more times
    • Flexible window of size 5 around target
  – Bigrams that occur 5 or more times
    • Flexible window of size 5 around target
  – Unigrams + Bigrams + POS tags
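To make the unigram feature set and the Naïve Bayesian classifier concrete, here is a minimal pure-Python sketch. It is not the WEKA implementation used in the study: the multinomial formulation, the add-one smoothing, all function names, and the toy contexts are assumptions for illustration. The toy data uses a cutoff of 2 instead of 5 only because it is so small.

```python
import math
from collections import Counter, defaultdict

def select_features(contexts, min_count=5):
    """Keep unigrams that occur at least min_count times in training."""
    totals = Counter(tok for ctx in contexts for tok in ctx)
    return {tok for tok, n in totals.items() if n >= min_count}

def train_nb(contexts, labels, vocab):
    """Count expansions and, per expansion, occurrences of each feature."""
    sense_counts = Counter(labels)
    feat_counts = defaultdict(Counter)
    for ctx, sense in zip(contexts, labels):
        feat_counts[sense].update(t for t in ctx if t in vocab)
    return sense_counts, feat_counts

def predict_nb(ctx, sense_counts, feat_counts, vocab):
    """Return the expansion with the highest smoothed log posterior."""
    total = sum(sense_counts.values())
    best, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        # Add-one smoothing over the selected vocabulary (an assumption).
        denom = sum(feat_counts[sense].values()) + len(vocab)
        score = math.log(n / total)
        for tok in ctx:
            if tok in vocab:
                score += math.log((feat_counts[sense][tok] + 1) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

# Hypothetical contexts for the acronym CF, for illustration only.
toy_contexts = [
    ["carrier", "testing", "husband"],
    ["carrier", "gene", "mutation"],
    ["cough", "cold", "syrup"],
    ["cold", "cough", "nightly"],
]
toy_senses = ["Cystic Fibrosis", "Cystic Fibrosis",
              "Cold Formula", "Cold Formula"]
toy_vocab = select_features(toy_contexts, min_count=2)
toy_model = train_nb(toy_contexts, toy_senses, toy_vocab)
toy_guess = predict_nb(["carrier", "screening"], *toy_model, toy_vocab)
```

Words below the frequency cutoff, such as "screening" here, are simply ignored at prediction time.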

Page 12: Amia06

Feature Extraction

• Horizon: up to 5 content words to left and right of target

• Boundaries: cross sentences, but not clinical notes

• Skip stop words

• Bigrams are pairs of contiguous content words

• Example (CF is target):
  – Unigrams: “if she is found to be a carrier, then they will follow with CF carrier testing in her husband.”
  – Bigrams: “if she is found to be a carrier, then they will follow with CF carrier testing in her husband.”
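The extraction rules above can be sketched in a few lines of Python, applied to the CF example sentence. The stop list, the tokenizer, and the function names are hypothetical, since the slides do not give them. Two further assumptions: "contiguous" is read as adjacent once stop words are skipped, and bigrams never span the target acronym. The frequency cutoff (5 or more occurrences) is not applied here.

```python
import re

# Hypothetical stop list; the list actually used is not given in the slides.
STOP_WORDS = {"if", "she", "is", "to", "be", "a", "then", "they", "will",
              "with", "in", "her", "the", "and", "of"}

def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

def content_window(tokens, target_index, horizon=5):
    """Collect up to `horizon` content words on each side of the target."""
    def is_content(tok):
        return tok.lower() not in STOP_WORDS
    left = [t.lower() for t in tokens[:target_index] if is_content(t)]
    right = [t.lower() for t in tokens[target_index + 1:] if is_content(t)]
    return left[-horizon:], right[:horizon]

def extract_features(text, target, horizon=5):
    """Return (unigrams, bigrams) around the first occurrence of target."""
    tokens = tokenize(text)
    left, right = content_window(tokens, tokens.index(target), horizon)
    unigrams = left + right
    # Pairs of neighbouring content words on the same side of the target.
    bigrams = list(zip(left, left[1:])) + list(zip(right, right[1:]))
    return unigrams, bigrams

SENTENCE = ("if she is found to be a carrier, then they will follow "
            "with CF carrier testing in her husband.")
cf_unigrams, cf_bigrams = extract_features(SENTENCE, "CF")
# cf_unigrams: found, carrier, follow, carrier, testing, husband
# cf_bigrams: (found, carrier), (carrier, follow),
#             (carrier, testing), (testing, husband)
```

Because the window is counted in content words rather than raw tokens, "found" is still captured even though it sits ten tokens to the left of CF.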

Page 13: Amia06

Results (majority < 50%)

[Figure: Feature Comparison (AC, APC, LE, PE). Chart of accuracy (%), axis from 30 to 100, by classifier (Decision Trees, Naïve Bayes, SVM); series: POS, bigrams, unigrams, ALL, Majority.]

Page 14: Amia06

Results (50% < majority < 80%)

[Figure: Feature Comparison (CP, HD, CF, MCI, ID, LA). Chart of accuracy (%), axis from 30 to 100, by classifier (Decision Trees, Naïve Bayes, SVM); series: POS, bigrams, unigrams, ALL, Majority.]

Page 15: Amia06

Results (majority > 80%)

[Figure: Feature Comparison (MI, ACA, GE, HA, FEN, NSR). Chart of accuracy (%), axis from 30 to 100, by classifier (Decision Trees, Naïve Bayes, SVM); series: POS, bigrams, unigrams, ALL, Majority.]

Page 16: Amia06

Results (flexible window)

[Figure: Fixed vs. Flexible Window Performance. Plot of accuracy (%), axis from 70 to 95, against window size (1 to 10); series: fixed-bigrams, fixed-unigrams, fixed-unigrams+bigrams, flexi-bigrams, flexi-unigrams, flexi-unigrams+bigrams.]
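The slides do not define the two window types being compared, so the following sketch rests on an assumed reading: a fixed window inspects exactly n tokens on each side of the target and keeps whichever of them are selected features, while a flexible window keeps the n nearest selected features on each side, however far away they lie. The function names and the feature set are hypothetical.

```python
def fixed_window(tokens, target_index, size, features):
    """Inspect exactly `size` tokens per side; keep those that are features."""
    left = tokens[max(0, target_index - size):target_index]
    right = tokens[target_index + 1:target_index + 1 + size]
    return [t for t in left + right if t in features]

def flexible_window(tokens, target_index, size, features):
    """Keep the `size` nearest feature tokens per side, at any distance."""
    left = [t for t in tokens[:target_index] if t in features][-size:]
    right = [t for t in tokens[target_index + 1:] if t in features][:size]
    return left + right

window_tokens = ("if she is found to be a carrier then they will follow "
                 "with CF carrier testing in her husband").split()
# Hypothetical set of selected unigram features.
window_features = {"found", "carrier", "follow", "testing", "husband"}
cf_index = window_tokens.index("CF")

fixed_feats = fixed_window(window_tokens, cf_index, 2, window_features)
flexi_feats = flexible_window(window_tokens, cf_index, 2, window_features)
# fixed_feats: follow, carrier, testing
# flexi_feats: carrier, follow, carrier, testing
```

Under this reading, the fixed window of size 2 loses the left-hand "carrier" because a non-feature token ("with") uses up one of its two slots, whereas the flexible window reaches past it.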

Page 17: Amia06

Conclusions

• Overall expansion accuracy at or above 90% regardless of distribution

• Differences in accuracy are largely due to features, not ML algorithms

• Addition of bigrams and POS tags helps performance, but unigrams dominant

• Flexible window improves upon fixed window feature selection

Page 18: Amia06

Future Work

• Expand all acronyms in a text, not just a select few
  – expand based on prior expansions
  – utilize one sense per discourse constraint

• Integrate supervised methods with knowledge-based approaches and clustering methods to reduce need for annotated examples
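The slides name the one sense per discourse constraint only as future work and do not say how it would be applied. One possible reading, sketched below, is a majority vote: every instance of an acronym within a clinical note is relabeled with the expansion predicted most often for that acronym in that note. The function name, the tuple layout, and the toy predictions are hypothetical.

```python
from collections import Counter

def one_sense_per_discourse(predictions):
    """Relabel instances with the majority expansion per (note, acronym).

    predictions: list of (note_id, acronym, predicted_expansion) tuples.
    """
    votes = {}
    for note_id, acronym, expansion in predictions:
        votes.setdefault((note_id, acronym), Counter())[expansion] += 1
    return [
        (note_id, acronym, votes[(note_id, acronym)].most_common(1)[0][0])
        for note_id, acronym, _ in predictions
    ]

# Hypothetical classifier output, for illustration only.
toy_predictions = [
    ("note1", "PE", "Pulmonary Embolism"),
    ("note1", "PE", "Patient Education"),
    ("note1", "PE", "Pulmonary Embolism"),
    ("note2", "PE", "Patient Education"),
]
smoothed = one_sense_per_discourse(toy_predictions)
```

In this toy case the single "Patient Education" prediction in note1 is overridden by the two "Pulmonary Embolism" predictions, while note2 is left unchanged.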

Page 19: Amia06

Acknowledgments

• We would like to thank our annotators Barbara Abbott, Debra Albrecht and Pauline Funk.

• This work was supported in part by the NLM Training Grant (T15 LM07041-19) and the NIH Roadmap Multidisciplinary Clinical Research Career Development Award (K12/NICHD)-HD49078.

• Dr. Pedersen has been partially supported by a National Science Foundation Faculty Early CAREER Development Award (#0092784).

Page 20: Amia06

Software Resources

• NSPGate (from Duluth/Mayo)
  – http://nspgate.sourceforge.net/

• Ngram Statistics Package (from Duluth)
  – http://ngram.sourceforge.net/

• WSDGate (from Duluth/Mayo)
  – http://wsdgate.sourceforge.net/

• WEKA (from Waikato)
  – http://www.cs.waikato.ac.nz/ml/weka/

• GATE (from Sheffield)
  – http://gate.ac.uk/