23
Porting a Natural Language Processing Algorithm to Extract Findings from Colonoscopy Pathology Reports Jennifer A. Pacheco 1 ; Andrew J. Gawron, MD, PhD 2 ; Kenneth Borthwick 3 ; Peggy L. Peissig, PhD 4 ; David S Carrell, PhD 5 ; Luke V. Rasmussen 1 ; Huan Mo, MD 6 ; William K. Thompson, PhD 1 1 Northwestern University, Chicago, IL; 2 University of Utah, Salt Lake City, UT; 3 Geisinger Health System, Danville, PA; 4 Marshfield Clinic, Marshfield, WI; 4 Group Health Research Institute, Seattle, WA; 6 Vanderbilt University, Nashville, TN March 22, 2016

Mayo Clinic - Porting a Natural Language Processing ...informatics.mayo.edu/phema/images/8/8f/TBI_2016_Will_NLP...Results I Using the annotator tool at a single site, a corpus of 200

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

  • Porting a Natural Language Processing Algorithmto Extract Findings from Colonoscopy Pathology

    Reports

    Jennifer A. Pacheco1; Andrew J. Gawron, MD, PhD2; KennethBorthwick3; Peggy L. Peissig, PhD4; David S Carrell, PhD5; Luke

    V. Rasmussen1; Huan Mo, MD6; William K. Thompson, PhD1

    1Northwestern University, Chicago, IL; 2University of Utah, Salt Lake City, UT; 3GeisingerHealth System, Danville, PA; 4Marshfield Clinic, Marshfield, WI; 4Group Health Research

    Institute, Seattle, WA; 6Vanderbilt University, Nashville, TN

    March 22, 2016

  • Disclosure

    I Will Thompson is co-founder of Textractor Technologies LLC

  • Funding

    This work has been supported in part by funding from PhEMA (R01GM105688) and eMERGE (U01 HG006379, U01 HG006378 and U01HG006388).

  • Electronic Medical Records and Genomics Network

    I NHGRI funded project (currentlyin 3rd funding cycle)

    I Development of methods andbest practices for using the EHRas a tool for genomic research

    I Network members link EHR datato DNA biobanks, pool resultsacross sites

    I Several dozen electronicphenotypes completed and inprogress (www.PheKB.org)

    Example Phenotypes: Type 2 Diabetes,Peripheral Arterial Disease, QRS Duration,Colon Polyps, Resistant Hypertension

    http://www.phekb.org/

  • Phenotype: Colon Polyps

    I Colorectal cancer (CRC) is the 2nd leading cause ofcancer-related mortality in the United States

    I Colonoscopy is a widely used CRC screening modality, results inpolyp biopsies with histologies described in pathology notes

    I Serrated vs. adenoma cancer pathway

    I Goal: EHR-based phenotype and GWAS across eMERGE sites

  • Developing and Porting the NLP-based Algorithm

    [A] NLP pipeline; [B] Document annotator tool; [C] KNIME workflow forexecuting NLP; [D] KNIME workflow for evaluation of results againstgold standard

  • NLP Modules

  • Groovy Script UIMA Annotator

    Groovy script UIMA annotators are very easily configured, and scriptscan be updated and re-loaded without recompiling the entire NLPlibrary. Built-in functionality for Groovy scripts includes:

    I Support for cTAKES common type system

    I Selecting and filtering annotations

    I Creating annotations

    I Matching patterns of annotations and text

  • Groovy Script Annotator Implementation

    @Overridevoid initialize(UimaContext context) {super.initialize(context)config = new CompilerConfiguration()config.setScriptBaseClass("org.northshore.cbri.UIMAUtil")

    shell = new GroovyShell(config)// load in script file contentsthis.script = shell.parse(scriptContents)

    }

    @Overridevoid process(JCas jcas) {

    UIMAUtil.setJCas(jcas)this.script.run()

    }

  • Annotation Selection

    // select all Sentences containing// an EntityMentionsents = select

    type:Sentence,filter:contains(EntityMention)

    I Built-in support for cTAKES common type system

    I Pre-defined and on-the-fly filters

    I Compositional filters w. boolean functions

    I First class support for regex strings & operators

  • Annotation Selection (More Complex Example)

    // select all Sentence annotations contained in// "Findings" Segment that also contains an// EntityMention annotation and ends with// text "tubular adenoma"select(type:Segment).grep { seg ->seg.id == "FINDINGS"}.each {select(type:Sentence, filter:(and (coveredBy(seg),{it.coveredText==∼/.+tubular\s+adenoma},contains(EntityMention)))

    }}

  • Annotation Creation

    create(type:EntityMention,begin:0, end:10,polarity:1, uncertainty:0,ontologyConcepts:[create(type:UmlsConcept, cui:"C01"),create(type:UmlsConcept, cui:"C02")]

    )

    I Built-in support for cTAKES common types

    I Can nest calls to create on the fly

  • Annotation Matching

    sents = select(type:Sentence)patterns = [∼/(?i)(tubular|villous)\s+adenoma/]

    match(sents, patterns,{ create(type:EntityMention,

    begin:it.start(1), end:it.end(1),polarity:1, uncertainty:0,ontologyConcepts:[create(type:UmlsConcept, cui:"C01")]

    )})

    I Patterns can be specified over text and/or annotations

    I Specified functions are applied to every match

    I Action taken can be anything (create annotation one possibility)

  • Annotations Matching (More Complex Example)

    pat = (∼/(?s)(?@Head)(?=(?@Head)|\Z)/)AnnotationMatcher matcher =pat.matcher(includeText:false)

    matcher.each{ Map binding ->create(type:Segment,begin:binding.get("h1").begin,end:(binding.get("h2") ?

    binding.get("h2").begin: jcas.documentText.length()))}

  • Pathology Note Concept Annotator

  • Abstractor Tool

    https://github.com/lrasmus/DocumentAbstraction

    https://github.com/lrasmus/DocumentAbstraction

  • KNIME Execution Workflow

    https://www.knime.org

    https://www.knime.org

  • KNIME Integration with NLP Pipeline

  • KNIME Validation Workflow

  • Sharing KNIME Workflows

    Exporting a KNIME workflow Importing a KNIME workflow

  • Results

    I Using the annotator tool at a single site, a corpus of 200randomly selected pathology notes was created. At the polypfinding level (histology + location match), the algorithm achievedsensitivity of 0.95 and PPV 0.95.

    I The algorithm and tools were subsequently shared with threeadditional eMERGE sites

    I Porting to new sites required modifications to script filesregulating sectionizing and relation extraction.

    I After running the algorithm across all four sites, we were able toextract a genotyped cohort of 5839 patients.

    I Our qualitative evaluation indicated that using the toolssignificantly sped up the porting process, and is a promisingapproach to similar tasks involving NLP-based phenotypealgorithms.

  • Conclusion

    I Porting NLP-based phenotype algorithms across sites presentstime-consuming challenges, limiting cross-site collaborationopportunities.

    I We developed a set of tools to help make it easier to share,execute, modify, and validate NLP algorithms.

    I In future work we will peform a quantitive evaluation of thebenefits of using these tools for porting NLP-based phenotypealgorithms.

    I Pathology notes are relatively uniform and lend themselves topattern-based rules; more complex notes or tasks may not lendthemselves as well to this approach.

    I The integration of NLP with KNIME can be improved bydeveloping extension nodes that allow for direct editing of scriptfiles without requiring re-generation of NLP libraries.

  • Thanks!