Introducing ODIE
NCBO Seminar Series
February 18, 2009
Example
IE using ontologies
Diagnosis: Malignant Melanoma
Breslow Depth: 0.72 mm
Lateral Margin: Positive
Regression: Probable
Ulceration: Negative
TIL: Focally Brisk
OE using documents
punch biopsy
junctional component
pagetoid spread
dermal melanocytes
Breslow depth
lymphocytic infiltrates
regression
microscopic satellites
vascular invasion
tumor infiltrating lymphocytes
Spitz nevus
epithelioid nevus
Two Tasks ~ One problem
Ontology
Text
Ontology Enrichment: Uses text as a source of concepts and relationships to enrich and validate the ontology (Specific Aims 2, 3, 4)
Information Extraction: Uses the ontology as a source of concepts and relationships to extract information from text (Specific Aims 1, 3, 5)
Specific Aims
Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including:
Named Entity Recognition (NER)
Co‐reference Resolution (CR)
Discourse Reasoning (DR)
Attribute Value Extraction (AVE)
Specific Aim 2: Develop and evaluate general methods for clinical‐text mining to assist in ontology development, including:
Concept Discovery (CD)
Concept Clustering (CC)
Taxonomic Positioning (TP)
Specific Aim 3: Develop reusable software for performing information extraction and ontology development leveraging existing NCBO tools and compatible with NCBO architecture.
Specific Aim 4: Enhance National Cancer Institute Thesaurus Ontology using the ODIE toolkit.
Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers.
Ontology Enrichment
• Machine assisted
- Extraction
- Filtering and Organization
- Visualization
- Suggestions
• Human decision‐maker (developer, curator)
• Feedback and improvement of OE
Project Organization
Concept Discovery
Kaihong Liu, Rebecca Crowley, Wendy Chapman, Kevin Mitchell
Study and compare methods for ontology enrichment; design methods for evaluation
Coreference Resolution
Wendy Chapman, Guergana Savova, Melissa Castine
Develop annotation scheme; create Reference Standard; consider and test existing algorithms; design, implement and test new algorithms
ODIE 0.5
Rebecca Crowley, Kevin Mitchell, Girish Chavan, Eugene Tseytlin
Develop and implement architecture and UI; create framework for using results of research; implement work of research groups
Domain
Will attempt to develop general tools whenever possible
• Priorities for evaluation of components:
Radiology and pathology reports
NCIT as well as clinically relevant OBO ontologies (e.g. RadLex, FMA)
Cancer domains (including hematologic oncology)
Progress
• ODIE 0.5 pre‐release on NCBO SourceForge
• Annotation software and document sets
• Res Proj #1: LSP annotation project
• Res Proj #2: Coreference resolution annotation
• Starting Res Proj #3: Discourse Reasoning
• Toolkit for developers of NLP applications and ontologies
• Pre‐released on NCBO SourceForge as ODIE 0.5
• Current release focuses on NER and CD
• Support interaction and experimentation
• Package systems at the conclusion of working with ODIE
• Foster cycle of enrichment and extraction needed to advance development of NLP systems
• Ontology enrichment as opposed to de novo development
• Human‐machine collaboration as opposed to fully automated learning
ODIE Software
ODIE Download/Info
ODIE Installer:
http://caties.cabig.upmc.edu/ODIE/odieinstaller.exe
GForge Site: https://bmir-gforge.stanford.edu/gf/project/odie/
User Forums: https://bmir-gforge.stanford.edu/gf/project/odie/forum/
ODIE on NCBO Tools Page: http://bioontology.org/tools/ODIE.html
Users/Workflow
ODIE is intended for:
• users who want to use NCBO ontologies to perform various NLP tasks (they may need to add concepts locally to achieve sufficient performance)
• users who want to enrich ontologies using concepts derived from documents (very early in process of ontology development)
Plans for ODIE 1.0
Ability to import additional ontologies from BioPortal or from OWL files
Ability to export proposal/enriched ontologies.
Ability to add and configure new processing resources (UIMA or GATE based)
Ability to build processing pipelines using processing resources
Will come out of the box with a processing pipeline and processing resources for NER, CD and COREF.
Research Project 1: Ontology Enrichment
Nearly completed survey of lexical, statistical and hybrid methods for ontology enrichment
Methodology to study “utility” of various approaches (Liu, PhD Thesis in progress)
First project underway involves the simplest of the methods to be studied – Lexicosyntactic Patterns (LSP) – regular expressions over POS
Concept Discovery
Kaihong Liu, Rebecca Crowley, Wendy Chapman, Kevin Mitchell
Study and compare methods for ontology enrichment; design methods for evaluation
LSP Patterns
The presence of certain “lexico‐syntactic patterns” can indicate a particular semantic relationship between two nouns
Example:
DIFFERENTIAL DIAGNOSIS INCLUDES, BUT IS NOT LIMITED TO, SPINDLE CELL NEOPLASM OF PERINEURIAL ORIGIN (SUCH AS SCHWANNOMA) AND SPINDLE CELL MALIGNANT MELANOMA
“such as” indicates a hyponym relationship between two noun phrases
Technique 1 - LSP
PRURIGO NODULE (aka LICHEN SIMPLEX CHRONICUS)
COMPATIBLE WITH BENIGN ECCRINE NEOPLASIA, SUCH AS NODULAR HIDROADENOMA
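As a rough illustration of how an LSP cue separates two candidate phrases, the sketch below splits a sentence on the first cue it finds. It is only a surface-level approximation: the project describes LSPs as regular expressions over POS, which this sketch does not attempt, and the function name `split_on_lsp` and the cue list are assumptions made for the example. The text before the cue still contains non-term words (such as "COMPATIBLE WITH"), which is why a human annotator picks out the meaningful phrases afterwards.

```python
import re

# Cue phrases taken from the slides; matching them on raw text is an
# assumption of this sketch (the project works with patterns over POS tags).
LSP_CUES = ["such as", "including", "aka"]
_CUE_RE = re.compile(
    r"\b(" + "|".join(re.escape(cue) for cue in LSP_CUES) + r")\b",
    re.IGNORECASE,
)


def split_on_lsp(sentence):
    """Return (before, cue, after) for the first LSP cue, or None if absent."""
    match = _CUE_RE.search(sentence)
    if match is None:
        return None
    strip_chars = " ,()."  # drop punctuation left around the cue
    before = sentence[:match.start()].strip(strip_chars)
    after = sentence[match.end():].strip(strip_chars)
    return before, match.group(1).lower(), after


print(split_on_lsp("PRURIGO NODULE (aka LICHEN SIMPLEX CHRONICUS)"))
print(split_on_lsp(
    "COMPATIBLE WITH BENIGN ECCRINE NEOPLASIA, SUCH AS NODULAR HIDROADENOMA"
))
```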
LSP distribution results
Number of sentences containing lexico-syntactic patterns
Pathology Corpus: 852764 reports, 16157608 sentences
Radiology Corpus: 209997 reports, 4057228 sentences

Pattern              Pathology     Pathology     Radiology     Radiology
                     # sentences   # unique      # sentences   # unique
NP especially NP     14            11            19            10
NP also called NP    48            37            29            22
NP such as NP        98            95            906           251
NP's NP              202           45            5             2
NP in NP             4851          1689          106           47
NP aka NP            5396          460           2             2
NP including NP      6291          4952          1403          747
NP other NP          6940          2251          10622         1407
NP like NP           7649          2267          410           235
NP, NP               8211          5351          7385          3889
NP of NP             14275         4032          2906          607
NP in the NP         47124         23178         64044         29285
NP is NP             92374         25024         7349          2896
NP of the NP         246798        70735         173016        54895
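A minimal sketch of how per-pattern sentence counts of this kind could be tallied is shown below. It assumes that "unique" means distinct sentence strings after case and whitespace normalization, which the slides do not state, and it matches cue words only rather than full "NP cue NP" patterns (those would need noun-phrase chunking). The function name and the two example patterns are hypothetical.

```python
import re
from collections import defaultdict


def count_pattern_sentences(sentences, patterns):
    """Return {pattern name: (matching sentences, distinct matching sentences)}."""
    totals = defaultdict(int)
    uniques = defaultdict(set)
    for sentence in sentences:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        normalized = " ".join(sentence.lower().split())
        for name, regex in patterns.items():
            if regex.search(normalized):
                totals[name] += 1
                uniques[name].add(normalized)
    return {name: (totals[name], len(uniques[name])) for name in patterns}


# Hypothetical cue-only patterns, not the NP-anchored patterns of the study.
patterns = {
    "such as": re.compile(r"\bsuch as\b"),
    "aka": re.compile(r"\baka\b"),
}
sentences = [
    "Prurigo nodule (aka lichen simplex chronicus)",
    "PRURIGO NODULE (aka LICHEN SIMPLEX CHRONICUS)",
    "Benign eccrine neoplasia, such as nodular hidroadenoma",
    "No tumor seen",
]
counts = count_pattern_sentences(sentences, patterns)
print(counts)
```

Under this reading, a large gap between the two counts for a pattern would be consistent with the same sentence recurring across many reports.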
Step 1 - Domain Expert Annotation
• Annotation tasks:
1. Meaningful medical phrases (MMP) that can stand alone before the LSP and after the LSP.
2. The phrases before and after the LSP have to be related.
[Diagram: Before LSP | LSP | After LSP]

Term1                       Term2
PRURIGO NODULE              LICHEN SIMPLEX CHRONICUS
BENIGN ECCRINE NEOPLASIA    NODULAR HIDROADENOMA
…                           …

• Calculate: total # of MMP, # of MMP per LSP
PRURIGO NODULE (aka LICHEN SIMPLEX CHRONICUS)
COMPATIBLE WITH BENIGN ECCRINE NEOPLASIA, SUCH AS NODULAR HIDROADENOMA
Step 2 ‐ Curator Judgment
For each term:
1. Is the concept in the ontology?
2. If not, should it be added to the ontology?
3. If not, what is the reason?
For each pair of terms:
1. What is the relationship between them?
2. Does this relationship exist in the ontology?
3. If not, should it be added to the ontology?
4. If not, what is the reason?

Term1                       Term2
PRURIGO NODULE              LICHEN SIMPLEX CHRONICUS
BENIGN ECCRINE NEOPLASIA    NODULAR HIDROADENOMA
…                           …
New Concept and Relationship Suggestion Rates
New Concept and Relationship Acceptance Rates
First experiment result: concept enrichment
(MMP = meaningful medical phrase)

Radiology Reports
             Preceding the LSP                 Following the LSP
LSP          Total # MMP    # MMP / # LSP      Total # MMP    # MMP / # LSP
such as      17             100%               31             124%
including    27             159%               66             264%

Pathology Reports
             Preceding the LSP                 Following the LSP
LSP          Total # MMP    # MMP / # LSP (25) Total # MMP    # MMP / # LSP (25)
such as      27             108%               55             220%
including    24             96%                35             233%
aka          25             100%               28             112%
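The "# of MMP per LSP" figures can exceed 100% because one LSP instance may yield more than one meaningful medical phrase on a given side. The sketch below shows the arithmetic, assuming the ratio is simply total phrases divided by the number of LSP instances; the function name is hypothetical. With a denominator of 25 it reproduces the "such as" and "aka" rows for pathology reports.

```python
def mmp_per_lsp(total_mmp, lsp_count):
    """Meaningful medical phrases per LSP instance, as a rounded percentage."""
    if lsp_count <= 0:
        raise ValueError("lsp_count must be positive")
    return round(100 * total_mmp / lsp_count)


# Pathology reports, 25 LSP instances (values from the table above).
print(mmp_per_lsp(27, 25))  # "such as", preceding the LSP
print(mmp_per_lsp(55, 25))  # "such as", following the LSP
print(mmp_per_lsp(28, 25))  # "aka", following the LSP
```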
First experiment result: concept enrichment (NCIT)
[Bar chart: suggestion rate and acceptance rate for each LSP (such as, including, aka); vertical axis 0% to 50%]
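First experiment: extracted relationships
[Bar chart: share of hyponym, meronym, synonym, and other relationships for each LSP (such as, including, aka); horizontal axis 0% to 100%. Data labels in extraction order, not mapped to bars: 36%, 11%, 15%, 11%, 67%, 64%, 75%, 19%]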
First experiment: extracted relationships
[Bar chart: for each LSP (such as, including, aka), the percentage of hyponym relationships that are not in the NCIT and that should be added to the NCIT; vertical axis 0 to 100]
First experiment: concept enrichment for RadLex

                 # of     Not in    In               Should be added   Suggestion   Acceptance
                 terms    RadLex    RadLex   Blank   to RadLex         rate         rate
Preceding LSP    29       11        16       2       10                38%          91%
Following LSP    68       24        41       3       10                35%          42%
Total            97       35        57       5       20                36%          57%
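The rates in the RadLex table are consistent with two simple ratios: suggestion rate as terms not in the ontology divided by all terms, and acceptance rate as terms the curator would add divided by terms not in the ontology. These definitions are inferred from the arithmetic of the table rather than stated on the slides, and the function name below is hypothetical.

```python
def enrichment_rates(n_terms, n_not_in_ontology, n_should_add):
    """Return (suggestion rate, acceptance rate) as rounded percentages.

    Assumed definitions, inferred from the RadLex table:
      suggestion rate = terms not in ontology / all terms
      acceptance rate = terms that should be added / terms not in ontology
    """
    if n_terms <= 0 or n_not_in_ontology <= 0:
        raise ValueError("counts used as denominators must be positive")
    suggestion = 100 * n_not_in_ontology / n_terms
    acceptance = 100 * n_should_add / n_not_in_ontology
    return round(suggestion), round(acceptance)


print(enrichment_rates(29, 11, 10))  # preceding LSP row
print(enrichment_rates(68, 24, 10))  # following LSP row
print(enrichment_rates(97, 35, 20))  # total row
```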
Research Project 2: Coreference Resolution
Anaphoric relations are relations between linguistic expressions where the interpretation of one linguistic expression (the anaphor) relies on the interpretation of another linguistic expression (the antecedent)
Examples of types of anaphoric relations:
Identity (or coreference)
Set/subset
Part/whole
Anaphora resolution is a computational technique for the discovery of anaphoric relations
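To make the identity (coreference) relation concrete, the sketch below groups mentions into chains from pairwise anaphor/antecedent links using a small union-find. It is an illustration only, not the project's algorithm: mentions are plain strings here (a real system would track document offsets), the example mentions are invented, and set/subset or part/whole relations are not handled because they do not merge mentions into one entity.

```python
def build_chains(identity_links):
    """Group mentions into coreference chains from (anaphor, antecedent) pairs."""
    parent = {}

    def find(mention):
        # Register unseen mentions as their own root, then walk to the root
        # with path halving.
        parent.setdefault(mention, mention)
        while parent[mention] != mention:
            parent[mention] = parent[parent[mention]]
            mention = parent[mention]
        return mention

    for anaphor, antecedent in identity_links:
        parent[find(anaphor)] = find(antecedent)

    chains = {}
    for mention in list(parent):
        chains.setdefault(find(mention), []).append(mention)
    return sorted(sorted(chain) for chain in chains.values())


# Invented example mentions, for illustration only.
links = [
    ("it", "the mass"),
    ("the lesion", "the mass"),
    ("this nodule", "a nodule"),
]
chains = build_chains(links)
print(chains)
```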
Coreference Resolution
Wendy Chapman, Guergana Savova, Melissa Castine
Develop annotation scheme; create Reference Standard, consider and test existing algorithms; design, implement & test new algorithms
Definitions
Anaphoric relations are relations between linguistic expressions where the interpretation of one linguistic expression (the anaphor) relies on the interpretation of another linguistic expression (the antecedent)
Types of anaphoric relations:
Identity (or coreference)
Set/subset
Part/whole
Other
Anaphora resolution is a computational technique for the discovery of anaphoric relations
Progress
Completed and Ongoing:
Annotation schema development
Guidelines
Training of annotators
4 training sessions
IAA (inter-annotator agreement) after session 1: in the 40s
IAA after session 3: in the 60s
Planned:
Complete Reference Standard (RS)
Algorithm testing and further development
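The slides do not say which agreement measure the IAA figures use. As one possibility, the sketch below computes a pairwise F1 score between two annotators' sets of coreference links, treating one annotator as the reference. The metric choice, the function name, and the example links are all assumptions made for illustration.

```python
def pairwise_f1(annotator_a, annotator_b):
    """F1 agreement between two annotators' sets of annotated links."""
    a, b = set(annotator_a), set(annotator_b)
    if not a and not b:
        return 1.0  # nothing annotated by either: treated as full agreement
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)  # annotator A taken as the reference
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)


# Invented links, written as (anaphor id, antecedent id) pairs.
links_a = {(1, 2), (3, 4), (5, 6)}
links_b = {(1, 2), (3, 4), (7, 8), (9, 10)}
score = pairwise_f1(links_a, links_b)
print(f"{100 * score:.0f}%")
```

F1 is symmetric in the two annotators, so the choice of reference does not change the score.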
Data Sets for RS
50 clinical notes (named entities annotated)
50 Pathology (disorders, tumors)
20 Pathology (conditions)
20 Radiology (conditions)
20 Discharge summaries (conditions)
20 ED (conditions)
20 ED (respiratory conditions)
• Mayo
• Pitt
QUESTIONS ?
Visualization of document set
NER – viewing concepts
Multiple Ontologies
OE – Concept Suggestion
Ranked Suggestions
Adding Proposals