Karin Verspoor, Ph.D.
Faculty, Computational Bioscience Program
University of Colorado School of Medicine
[email protected]
http://compbio.ucdenver.edu/Hunter_lab/Verspoor

Research in the Verspoor Lab

Text Mining

•Information extraction from the biomedical literature
–Entity recognition and normalization
–Relation and event extraction

•Last time, I promised that we would look at:
–Ontologies as constraints for information extraction

Making BioNLP relevant

•Recognition of OBO terms, relations

•CRAFT corpus (first release later this year)

OpenDMAP extracts typed relations from the literature

•Concept recognition tool
– Connect ontological terms to literature instances
– Built on Protégé knowledge representation system

•Language patterns associated with concepts and slots
– Patterns can contain text literals, other concepts, constraints (conceptual or syntactic), ordering information, or outputs of other processing.
– Linked to many text analysis engines via UIMA

•Best performance in BioCreative II IPS task

•>500,000 instances of three predicates (with arguments) extracted from MEDLINE abstracts

•[Hunter, et al., 2008] http://bionlp.sourceforge.net

OpenDMAP

[Figure: ontology, patterns, and free text go into OpenDMAP; extracted information comes out]

OpenDMAP

Cyclin E2 interacts with Cdk2 in a functional kinase complex.

<ontology>

Protein protein interaction := [int1] interacts with [int2]

protein protein interaction:
  interactor1: cyclin E2
  interactor2: cdk2

OpenDMAP

PROTÉGÉ ONTOLOGY

CLASS: protein protein interaction
  SLOT: interactor1 TYPE: molecule
  SLOT: interactor2 TYPE: molecule

PATTERNS

{c-interact} := [interactor1] interacts with [interactor2]
{c-interact} := [interactor1] is bound by [interactor2]
…

BioCreative II Example

• Some BioCreative patterns for interact
{c-interact} := [interactor1] {w-is} {w-interact-verb1} {w-preposition} the? [interactor2];
{w-is} := is, are, was, were;
{w-interact-verb1} := co-immunoprecipitate, co-immunoprecipitates, co-immunoprecipitated, co-localize, co-localizes, co-localized;
{w-preposition} := among, between, by, of, with, to;

• Matched text:
PMID 16494873, SENT_ID 16494873_114

Upon precipitation of the SOX10 protein with anti-HA antibody, Western blot detection revealed expression of UBC9-V5 (25 kDa) in the sample (Fig. 1, line 6), indicating that {UBC9 was co-immunoprecipitated with SOX10}.

INTERACTOR_1: UBC9 resolved to UniProtID: UBC9_RAT
INTERACTOR_2: SOX10 resolved to UniProtID: SOX10_RAT
{c-interact} := [UBC9_RAT]interactor_1, [SOX10_RAT]interactor_2
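The way the word classes expand into a concrete matcher can be approximated with ordinary regular expressions. The following is only a rough sketch and not OpenDMAP itself: the word lists are copied from the pattern definitions above, while the function names and the single-token slot filler are simplifying assumptions (real slot fillers are ontology-typed entities, and normalization to UniProt IDs is omitted).

```python
import re

# Word classes copied from the pattern definitions on the slide.
WORD_CLASSES = {
    "w-is": ["is", "are", "was", "were"],
    "w-interact-verb1": [
        "co-immunoprecipitate", "co-immunoprecipitates", "co-immunoprecipitated",
        "co-localize", "co-localizes", "co-localized",
    ],
    "w-preposition": ["among", "between", "by", "of", "with", "to"],
}


def alternation(name):
    """Turn a word class into a regex alternation, longest words first."""
    words = sorted(WORD_CLASSES[name], key=len, reverse=True)
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"


# Assumption: a slot filler is one token of letters, digits, and hyphens.
SLOT = r"[A-Za-z0-9-]+"

C_INTERACT = re.compile(
    rf"(?P<interactor1>{SLOT})\s+{alternation('w-is')}\s+"
    rf"{alternation('w-interact-verb1')}\s+{alternation('w-preposition')}\s+"
    rf"(?:the\s+)?(?P<interactor2>{SLOT})"
)


def extract_interaction(sentence):
    """Return the two slot fillers as a dict, or None if the pattern fails."""
    match = C_INTERACT.search(sentence)
    return match.groupdict() if match else None


print(extract_interaction(
    "Western blot detection revealed expression of UBC9-V5, "
    "indicating that UBC9 was co-immunoprecipitated with SOX10."
))
```

The optional "the?" element becomes an optional non-capturing group, which is why the slot after the preposition still captures the entity name rather than the determiner.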

BioCreative Results

•359 full-text articles in the test set

•385 interaction assertions produced

•Performance averaged per article (to avoid dominance of a few assertion-heavy articles)

P = 0.39, R = 0.31, F = 0.29

•Best result in the evaluation!
–F score 10% higher than next-scoring system
–F score > 3 standard deviations above mean
–Recall 20% higher than next-scoring system
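Per-article averaging means the reported F is not simply the harmonic mean of the reported P and R. A minimal sketch, assuming each of the three scores is computed per article and then averaged; the counts below are invented for illustration and are not BioCreative data.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F for one article from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_average(per_article_counts):
    """Average P, R, and F over articles, so each article counts equally."""
    scores = [prf(*counts) for counts in per_article_counts]
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))


# Hypothetical (tp, fp, fn) counts for two articles.
counts = [(1, 0, 3), (1, 3, 0)]
p, r, f = macro_average(counts)
print(p, r, f)
```

In this toy case the averaged P and R are both 0.625 while the averaged F is 0.4, which shows how an averaged F can fall below the harmonic mean of the averaged P and R.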

BioCreative conclusions

•Information extraction in biomedical text is hard
– Linguistic variability in how concepts are expressed
– Complex concepts with multiple “slots”

•OpenDMAP advances the state of the art
– Use of an ontology grounds the search for information
– Flexibility of the pattern language to incorporate constraints at different levels (conceptual, lexical, word order, linguistic)

BioNLP’09: Methods

Protein_transport := [TRANSPORTED-ENTITY] translocation @(from {DET}? [TRANSPORT-ORIGIN]) @(to {DET}? [TRANSPORT-DESTINATION])

Bax translocation to mitochondria from the cytosol
Bax translocation from the cytosol to the mitochondria
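The two example sentences suggest that the @(...) constituents may appear in either order. The sketch below assumes that reading and also treats both constituents as optional; it is a regex approximation, not OpenDMAP's pattern engine. The determiner list and the single-token entities are assumptions made for brevity.

```python
import re

DET = r"(?:(?:the|a|an)\s+)?"   # assumed contents of the {DET} word class
ENTITY = r"(\w+)"               # assumption: slot fillers are single tokens

HEAD = re.compile(rf"{ENTITY}\s+translocation\b")
ORIGIN = re.compile(rf"\bfrom\s+{DET}{ENTITY}")
DESTINATION = re.compile(rf"\bto\s+{DET}{ENTITY}")


def extract_transport(sentence):
    """Fill a Protein_transport frame; origin and destination in any order."""
    head = HEAD.search(sentence)
    if head is None:
        return None
    frame = {"TRANSPORTED-ENTITY": head.group(1)}
    rest = sentence[head.end():]
    # Each constituent is searched independently in the remaining text,
    # which is what makes their relative order irrelevant.
    origin = ORIGIN.search(rest)
    if origin:
        frame["TRANSPORT-ORIGIN"] = origin.group(1)
    destination = DESTINATION.search(rest)
    if destination:
        frame["TRANSPORT-DESTINATION"] = destination.group(1)
    return frame


for text in (
    "Bax translocation to mitochondria from the cytosol",
    "Bax translocation from the cytosol to the mitochondria",
):
    print(extract_transport(text))
```

Both sentences yield the same frame, so one pattern covers both word orders without enumerating them.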

Slide credit: Kevin B. Cohen

BioNLP’09: Methods

Protein_transport := [TRANSPORTED-ENTITY] translocation @(from {DET}? [TRANSPORT-ORIGIN]) @(to {DET}? [TRANSPORT-DESTINATION])

Protein (Sequence Ontology)

Cellular Component (Gene Ontology)

Slide credit: Kevin B. Cohen

BioNLP’09: Methods

Slide credit: Kevin B. Cohen

BioNLP’09: Methods

• All event types represented as frames
– Elements from ontology constrain every slot

EVENT TYPE: REGULATION
AtLoc: instance of biological_entity
Cause: instance of protein
CSite: instance of biological_concept or polypeptide_region
Event_action: instance of trigger_word or detection_method
Site: instance of biological_concept or polypeptide_region
Theme: instance of protein or biological_process
ToLoc: instance of biological_entity

Sequence Ontology
Molecular Interaction Ontology
Gene Ontology
Cell Cycle Ontology

Slide credit: Kevin B. Cohen

BioNLP’09: Methods

Partial view of ontology—reality is a little bit less clean

Slide credit: Kevin B. Cohen

BioNLP’09: Methods

Event type | Site | AtLoc | ToLoc
Binding | protein domain (SO), binding site (SO), DNA (SO), chromosome (SO) | |
Gene expression | gene (SO), biological entity (CCO) | tissue (BTO), cell type (CTO), cellular component (GO) |
Localization | | cellular component (GO) | cellular component (GO)
Phosphorylation | amino acid (FMA), polypeptide region (SO) | |
Protein catabolism | | cellular component (GO) |
Transcription | gene (SO), biological entity (CCO) | |

BTO: BRENDA Tissue Ontology
CCO: Cell Cycle Ontology
CTO: Cell Type Ontology
GO: Gene Ontology
SO: Sequence Ontology

Slide credit: Kevin B. Cohen

BioNLP’09: Methods

•Manual pattern-writing
– Before availability of training data: based on native speaker intuitions, examples from PubMed, and variations on same, as in Cohen et al. (2004)

– After release of training data: based on examination of corpus data, targeting high-frequency predicates only

– Nominalizations predominated; used insights from Cohen et al. (2008) regarding Theme placement

– Protein binding rules re-used from BioCreative II protein-protein interaction task

– Eschewed use of wildcards

Slide credit: Kevin B. Cohen

BioNLP’09: Results

         Our system             Best team              Best P/R/F
         P      R      F        P      R      F        P      R      F
Task 1   71.81  13.45  22.66    58.48  46.73  51.95    71.81  46.73  51.95
Task 2   70.97  13.25  22.33    54.08  35.86  43.12    70.97  35.86  43.12
Task 3   57.40  12.33  20.30    60.83  32.68  42.52    60.83  32.68  42.52

Task 1: P 10 points higher than second-highest
Task 2: P 14 points higher than second-highest
Task 3: P 3.4 points lower than highest (3/6)

Slide credit: Kevin B. Cohen

BioNLP’09: Results

                   P      R      F
Official results   71.81  13.45  22.66
With bug fixes     67.19  17.38  27.62

Still the highest precision (#2 was 62.21)

Unofficial results: contribution of bug repairs

Slide credit: Kevin B. Cohen

BioNLP’09: Results

•Contribution of coördination-handling
–Bug-fixed results: F 27.62 (Task 1)
–Without coördination-handling: F 24.72
–Decrease in F of 2.9 without coördination-handling
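The F values quoted on these results slides can be checked against precision and recall, since F here is their harmonic mean. A small sketch using the Task 1 figures from the slides (official: P 71.81, R 13.45; with bug fixes: P 67.19, R 17.38); the function name is just an illustrative choice.

```python
def f_measure(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


official = round(f_measure(71.81, 13.45), 2)
bug_fixed = round(f_measure(67.19, 17.38), 2)
print(official, bug_fixed)

# Contribution of coordination handling, from the two F values on the slide.
print(round(27.62 - 24.72, 1))
```

The gain from the bug fixes comes from trading a few points of precision for a larger relative gain in recall, which the harmonic mean rewards because recall is the much smaller of the two numbers.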

Slide credit: Kevin B. Cohen

Syntax helps

• 125I-labeled C3b was covalently deposited on CR2, when hemolytically active 125I-labeled C3 was added to Raji cells preincubated with iC3, factor B, properdin, and factor D, thus proving functionality of CR2-bound C3 convertase.
<cr2> BINDS <c3 convertase>

• CD8alpha(alpha) binds one HLA-A2/peptide molecule, interfacing with the alpha2 and alpha3 domains of HLA-A2 and also contacting beta2-microglobulin.
<cd8alpha ( alpha )> BINDS <hla a2 / peptide molecule>

• The binding of 109Cd to metallothionein and the thiol density of the protein were determined after incubation of a purified Zn/Cd-metallothionein preparation with either hydrogen peroxide alone, or with a number of free radical generating systems.
<109cd> BINDS <metallothionein>

• Although these shifts in alpha3 may provide a synergistic modulation of affinity, the binding of CD8 to MHC is clearly consistent with an avidity-based contribution from CD8 to TCR-peptide-MHC interactions.
<Cd8> BINDS <major histocompatibility complex>

More complex examples

•Complex noun phrases
• The inactive C3 (iC3), which forms spontaneously in serum in low amounts by reaction of native C3 with H2O, binds noncovalently to the N-terminal part of CR2.
<inactive c3> BINDS <cr2>
• RelB binds transcriptionally active kappaB motifs in the TNF-alpha promoter in normal cells, and in vitro studies with macrophages isolated from RelB-deficient animals revealed impaired production of TNF-alpha in response to LPS and IFN-gamma.
<relb> BINDS <tnf - alpha promoter>

•Negation
• TNP-BSA, however, did not bind to the CD4 receptor.
<trinitrophenyl-bovine serum albumin> DOES_NOT_BIND <cd4 receptor>
• Similarly, when cells expressing the wild type FSHR were treated with tunicamycin to prevent N-linked glycosylation, the resulting nonglycosylated FSHR was not able to bind FSH.
<resulting nonglycosylated fsh receptor> DOES_NOT_BIND <follicle-stimulating hormone>

Coordination is particularly hard

In contrast both the S4GGnM-R and the Man-R are able to bind Man-BSA.
<mannose receptor> BINDS <man bsa>
<s4ggnm - r> BINDS <man bsa>

Purified recombinant NC1, like authentic NC1, also bound specifically to fibronectin, collagen type I, and a laminin 5/6 complex.
<authentic nc1> BINDS <laminin 5 / 6 complex>
<authentic nc1> BINDS <collagen type I>
<authentic nc1> BINDS <fibronectin>
<purified recombinant nc1> BINDS <laminin 5 / 6 complex>
<purified recombinant nc1> BINDS <collagen type I>
<purified recombinant nc1> BINDS <fibronectin>

The nonvisual arrestins, beta-arrestin and arrestin3, but not visual arrestin, bind specifically to a glutathione S-transferase-clathrin terminal domain fusion protein.
*<Arrestin3> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
<beta arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
<nonvisual arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
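The NC1 example shows why coordination multiplies output: two coordinated subjects and three coordinated objects produce six binding assertions. A toy sketch of that expansion, assuming the coordinated noun phrases have already been located and passed in as strings; the splitting rule is a deliberately naive assumption and does not handle cases such as "but not visual arrestin".

```python
import re
from itertools import product


def split_coordination(noun_phrase):
    """Split 'A, B, and C' or 'A and B' into its conjuncts (naive rule)."""
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", noun_phrase)
    return [part.strip() for part in parts if part.strip()]


def expand_binding(subjects, objects):
    """Emit one BINDS triple for every subject/object pair."""
    return [
        (subj, "BINDS", obj)
        for subj, obj in product(split_coordination(subjects),
                                 split_coordination(objects))
    ]


triples = expand_binding(
    "purified recombinant NC1, authentic NC1",
    "fibronectin, collagen type I, and laminin 5/6 complex",
)
for triple in triples:
    print(triple)
```

The cross product is the easy part; deciding where each conjunct starts and ends in running text is what makes coordination hard in practice.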

BioNLP Shared Task ‘11

•Extension of BioNLP’09 tasks
–Generalization to full text (from abstracts)
–Additional event types: post-translational modifications and catalysis

•Methods:
–Based on empirically derived patterns
–Derived from training data + manual refinement
–Using dependency relations (syntax)
–Work of Haibin Liu (postdoc)

Integrating background knowledge

•Can improve OpenDMAP precision with minimal cost to recall
–Take advantage of background knowledge
–Tighten constraints on slot fillers in the ontology
–No change to existing patterns

•Proof of concept:
–Distinguish among several types of protein activation (enzyme and receptor) in GeneRIFs
–Utilize Gene Ontology annotations

Refining selectional restrictions

TP: [GeneRIF 104155] an ER stress induces the activation of [caspase-12_protein - catalytic activity]activated_entity via [caspase-3_protein]activator

Prevented FP: [GeneRIF 105594] factor Xa can induce mesangial cell proliferation through the activation of ERK_protein via PAR2_protein in mesangial cells
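One way to picture tightened selectional restrictions is as a filter applied to slot fillers after pattern matching, with the patterns themselves left unchanged. The sketch below is only an illustration of that idea: the entity names, annotation table, and constraint strings are all hypothetical and are not taken from the actual GeneRIF experiment or from real Gene Ontology annotation files.

```python
# Hypothetical background knowledge: entity -> set of annotation labels.
ANNOTATIONS = {
    "enzyme_X": {"catalytic activity"},
    "receptor_Y": {"receptor activity"},
    "protein_Z": set(),
}

# Hypothetical tightened constraint on the activated_entity slot,
# one required annotation per event type.
SLOT_CONSTRAINTS = {
    "enzyme_activation": "catalytic activity",
    "receptor_activation": "receptor activity",
}


def satisfies_constraint(event):
    """True if the activated entity carries the annotation its event type needs."""
    required = SLOT_CONSTRAINTS[event["type"]]
    return required in ANNOTATIONS.get(event["activated_entity"], set())


def filter_events(events):
    """Keep only the candidate events whose slot filler passes the constraint."""
    return [event for event in events if satisfies_constraint(event)]


candidates = [
    {"type": "enzyme_activation", "activated_entity": "enzyme_X", "activator": "protein_Z"},
    {"type": "enzyme_activation", "activated_entity": "protein_Z", "activator": "receptor_Y"},
]
print(filter_events(candidates))
```

A filter of this kind can only remove candidates, which fits the expected trade-off: precision can rise, and recall can only stay the same or fall when annotations are missing.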

Results

                              Original   Additional Memory   Difference
Enzyme Events     Precision   0.24       0.37                 0.13
                  Recall      0.27       0.20                -0.07
                  F-measure   0.26       0.26                 0.00
Receptor Events   Precision   0.08       0.34                 0.26
                  Recall      0.17       0.12                -0.05
                  F-measure   0.11       0.18                 0.07
Total             Precision   0.16       0.36                 0.20
                  Recall      0.24       0.18                -0.06
                  F-measure   0.19       0.24                 0.05

Biological entities

•Genes (and their products) are particularly valuable to recognize, but are not the only entities of interest:

–Diseases

–Drugs, Chemicals, and other treatments

–Anatomical and other locations

–Time and temporal relationships

–Methods and evidence

–Molecular functions, biological processes

Biological Concept Recognition

Two dictionary-based tools tested against CRAFT

•UIMA ConceptMapper
http://incubator.apache.org/uima/sandbox.html#concept.mapper.annotator
– stemming and case matching relaxation
– non-contiguous spans
– ignore stopwords
– order-independent lookup

•Open Biomedical Annotator
http://bioportal.bioontology.org/annotator
– ignore stopwords
– partial word matches
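The matching strategies listed above can be pictured as normalization steps applied to both the dictionary terms and the text mentions before lookup. The following toy matcher is a sketch of that idea only; it is not the ConceptMapper or Open Biomedical Annotator implementation, its option names are invented for illustration, its stemmer is a crude plural stripper, and the identifier "EX:0001" is a placeholder rather than a real ontology ID.

```python
STOPWORDS = {"of", "the", "a", "an", "in"}  # assumed stopword list


def stem(token):
    """Crude stand-in for a stemmer: strip a trailing plural 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def normalize(term, *, case_insensitive=True, stemming=True,
              ignore_stopwords=True, order_independent=True):
    """Reduce a term to a lookup key according to the enabled strategies."""
    tokens = term.split()
    if case_insensitive:
        tokens = [t.lower() for t in tokens]
    if ignore_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stemming:
        tokens = [stem(t) for t in tokens]
    if order_independent:
        tokens = sorted(tokens)
    return tuple(tokens)


def build_index(dictionary, **options):
    """Map normalized keys to concept identifiers."""
    return {normalize(name, **options): cid for cid, name in dictionary.items()}


def lookup(mention, index, **options):
    """Return the concept identifier for a mention, or None."""
    return index.get(normalize(mention, **options))


DICTIONARY = {"GO:0005694": "chromosome", "EX:0001": "induction of apoptosis"}

relaxed = build_index(DICTIONARY)
print(lookup("Chromosomes", relaxed))
print(lookup("apoptosis induction", relaxed))

strict = dict(case_insensitive=False, stemming=False,
              ignore_stopwords=False, order_independent=False)
print(lookup("Chromosomes", build_index(DICTIONARY, **strict), **strict))
```

Each relaxation widens what counts as a match, so which combination works best is likely to depend on the kinds of terms in the ontology being matched.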

Best run results

• CM/CTO: stemming + FindAllMatches: false

• OBA/CTO: using default stop words

• CM/GO_CC: stemming + caseMatch: insensitive

• CM/ChEBI: caseMatch: sensitive

Concept Matching Conclusions

•The kinds of terms in the ontology matter

•The strategies used in the dictionary matching tools matter

•OpenDMAP will support strategies that go beyond dictionary matching …

Evaluation via Test Suite

• Big picture: How to evaluate ontology concept recognition systems?
• Traditional approach: “corpus”
• Expensive
• Time-consuming to produce
• Redundancy for some things…
• …underrepresentation of others

• Immediate (narrow) goal of this work: Use techniques from software testing and descriptive linguistics to build test suites that:
– Control test data
– Eliminate redundancy
– Systematic coverage (Oepen 1998)

• Immediate (broad) goal of this work: Are there general principles for test suite design?

Slide credit: Kevin B. Cohen

Methods

•Steps: develop “catalogue” of dimensions along which terms vary

•Use insights from linguistics and from how we know concept recognition systems work

–Structural aspects: length

–Content aspects: typography, orthography, lexical contents (function words)…

•…to build a structured set of test cases

•Also compare to other test suite work (Cohen et al. 2004) to look for common principles

Slide credit: Kevin B. Cohen

Structured test suite

Canonical

• GO:0000133 Polarisome

• GO:0000108 Repairosome

• GO:0000786 Nucleosome

• GO:0001660 Fever

• GO:0001726 Ruffle

• GO:0005623 Cell

• GO:0005694 Chromosome

• GO:0005814 Centriole

• GO:0005874 Microtubule

Non-canonical

• GO:0000133 Polarisomes

• GO:0000108 Repairosomes

• GO:0000786 Nucleosomes

• GO:0001660 Fevers

• GO:0001726 Ruffles

• GO:0005623 Cells

• GO:0005694 Chromosomes

• GO:0005814 Centrioles

• GO:0005874 Microtubules

induction of apoptosis -> apoptosis induction (Syntax)
cell migration -> cell migrated (Part of speech)
ensheathment of neurons -> ensheathment of some neurons
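The pairing of canonical and non-canonical forms lends itself to programmatic generation. A minimal sketch, using three of the GO terms listed above; the pluralization rule, the function names, and the exact-match recognizer are assumptions made for illustration, and the recognizer is a stand-in rather than the publicly available system that was actually evaluated.

```python
# Canonical terms taken from the slide.
CANONICAL = {
    "GO:0000786": "nucleosome",
    "GO:0005694": "chromosome",
    "GO:0005874": "microtubule",
}


def build_test_suite(canonical):
    """Pair each canonical term with one controlled non-canonical variant."""
    cases = []
    for concept_id, term in canonical.items():
        cases.append({"id": concept_id, "text": term, "kind": "canonical"})
        # Naive plural: one variation dimension, applied systematically.
        cases.append({"id": concept_id, "text": term + "s", "kind": "non-canonical"})
    return cases


def exact_match_recognizer(text, canonical):
    """Stand-in system under test: case-insensitive exact string match."""
    for concept_id, term in canonical.items():
        if text.lower() == term.lower():
            return concept_id
    return None


def recall_by_kind(cases, canonical):
    """Fraction of test cases recognized, reported per kind of case."""
    totals, hits = {}, {}
    for case in cases:
        kind = case["kind"]
        totals[kind] = totals.get(kind, 0) + 1
        if exact_match_recognizer(case["text"], canonical) == case["id"]:
            hits[kind] = hits.get(kind, 0) + 1
    return {kind: hits.get(kind, 0) / totals[kind] for kind in totals}


print(recall_by_kind(build_test_suite(CANONICAL), CANONICAL))
```

Because every case is labeled with the dimension it varies, a failure points directly at the capability that is missing, which is harder to read off from corpus-level scores.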

Slide credit: Kevin B. Cohen

Methods/Results

•Gene Ontology, revision 9/24/2009

•Canonical: 188

•Non-canonical: 117

•Observation:
–A 5:1 ratio of “dirty” to “clean” tests is the mark of “mature” testing

•Applied publicly available concept recognition system

Slide credit: Kevin B. Cohen

Results

•97.9% of canonical terms were recognized
–All exceptions contain the word “in”

•No non-canonical terms were recognized

•What would it take to recognize the error pattern in canonical terms using a corpus-based approach?

•General principles: Length, ortho/typography (numerals/punctuation), function/stopwords, syntactic context

Slide credit: Kevin B. Cohen