Karin Verspoor, Ph.D.
Faculty, Computational Bioscience Program
University of Colorado School of Medicine
[email protected]
http://compbio.ucdenver.edu/Hunter_lab/Verspoor
Research in the Verspoor Lab
Text Mining
•Information extraction from the biomedical literature
–Entity recognition and normalization
–Relation and event extraction
•Last time, I promised that we would look at:
–Ontologies as constraints for information extraction
Making BioNLP relevant
•Recognition of OBO terms, relations
•CRAFT corpus (first release later this year)
OpenDMAP extracts typed relations from the literature
•Concept recognition tool
– Connect ontological terms to literature instances
– Built on Protégé knowledge representation system
•Language patterns associated with concepts and slots
– Patterns can contain text literals, other concepts, constraints (conceptual or syntactic), ordering information, or outputs of other processing.
– Linked to many text analysis engines via UIMA
•Best performance in BioCreative II IPS task
•>500,000 instances of three predicates (with arguments) extracted from Medline Abstracts
•[Hunter, et al., 2008] http://bionlp.sourceforge.net
OpenDMAP
[Diagram: free text and ontology patterns are fed to OpenDMAP, which outputs extracted information]
Free text: Cyclin E2 interacts with Cdk2 in a functional kinase complex.
Ontology pattern: Protein protein interaction := [int1] interacts with [int2]
Extracted information: protein protein interaction: interactor1: cyclin E2 interactor2: cdk2
OpenDMAP
PROTÉGÉ ONTOLOGY
CLASS: protein protein interaction
SLOT: interactor1 TYPE: molecule
SLOT: interactor2 TYPE: molecule
PATTERNS
{c-interact} := [interactor1] interacts with [interactor2]
{c-interact} := [interactor1] is bound by [interactor2]
…
BioCreative II Example
• Some BioCreative patterns for interact
{c-interact} := [interactor1] {w-is} {w-interact-verb1} {w-preposition} the? [interactor2];
{w-is} := is, are, was, were;
{w-interact-verb1} := co-immunoprecipitate, co-immunoprecipitates, co-immunoprecipitated, co-localize, co-localizes, co-localized;
{w-preposition} := among, between, by, of, with, to;
• Matched text: PMID 16494873, SENT_ID 16494873_114
Upon precipitation of the SOX10 protein with anti-HA antibody, Western blot detection revealed expression of UBC9-V5 (25 kDa) in the sample (Fig. 1, line 6), indicating that {UBC9 was co-immunoprecipitated with SOX10}.
INTERACTOR_1: UBC9 resolved to UniProtID: UBC9_RAT
INTERACTOR_2: SOX10 resolved to UniProtID: SOX10_RAT
{c-interact} := [UBC9_RAT]interactor_1, [SOX10_RAT]interactor_2
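To make the pattern notation concrete, here is a minimal Python sketch that translates the {c-interact} pattern and its word classes into a regular expression and applies it to the matched sentence. It is not how OpenDMAP works internally: the `compile_pattern` and `match_interaction` helpers are hypothetical, slot fillers are assumed to be single tokens, and the pattern is assumed to end with a required element. Normalization to UniProt identifiers is left out.

```python
import re

# Word classes copied from the slide.
WORD_CLASSES = {
    "w-is": ["is", "are", "was", "were"],
    "w-interact-verb1": [
        "co-immunoprecipitate", "co-immunoprecipitates", "co-immunoprecipitated",
        "co-localize", "co-localizes", "co-localized",
    ],
    "w-preposition": ["among", "between", "by", "of", "with", "to"],
}
PATTERN = "[interactor1] {w-is} {w-interact-verb1} {w-preposition} the? [interactor2]"

# Assumption: a slot filler is a single token such as "UBC9" or "SOX10".
SLOT_TOKEN = r"[A-Za-z0-9][A-Za-z0-9-]*"


def compile_pattern(pattern, word_classes):
    """Translate a space-separated pattern into a compiled regular expression."""
    pieces = []  # (regex fragment, is_optional)
    for element in pattern.split():
        name = element[1:-1]
        if element.startswith("[") and element.endswith("]"):
            # Slot: capture the filler under the slot's name.
            pieces.append((rf"(?P<{name}>{SLOT_TOKEN})", False))
        elif element.startswith("{") and element.endswith("}"):
            # Word class: any listed alternative, longest first.
            words = sorted(word_classes[name], key=len, reverse=True)
            pieces.append(("(?:" + "|".join(re.escape(w) for w in words) + ")", False))
        elif element.endswith("?"):
            # Optional literal such as "the?".
            pieces.append((re.escape(element[:-1]), True))
        else:
            pieces.append((re.escape(element), False))

    regex = r"\b"
    for index, (piece, optional) in enumerate(pieces):
        is_last = index == len(pieces) - 1
        if optional:
            regex += rf"(?:{piece}\s+)?"
        else:
            regex += piece + ("" if is_last else r"\s+")
    return re.compile(regex)


C_INTERACT = compile_pattern(PATTERN, WORD_CLASSES)


def match_interaction(sentence):
    """Return the slot fillers of the first match, or None if nothing matches."""
    match = C_INTERACT.search(sentence)
    return match.groupdict() if match else None


SENTENCE = "indicating that UBC9 was co-immunoprecipitated with SOX10."
print(match_interaction(SENTENCE))
```

A real system would also check each filler against the slot's ontology type (here, molecule) before accepting the match; the regular expression only covers the lexical side of the pattern.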
BioCreative Results
•359 full-text articles in the test set
•385 interaction assertions produced
•Performance averaged per article (to avoid dominance of a few assertion-heavy articles)
P = 0.39, R = 0.31, F = 0.29
•Best result in the evaluation!
–F score 10% higher than next-scoring system
–F score > 3 standard deviations above mean
–Recall 20% higher than next-scoring system
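Because the scores are averaged per article, the reported F (0.29) does not have to equal the harmonic mean of the reported P and R (which would be about 0.35). The following Python sketch illustrates the effect with made-up per-article counts; the `ARTICLES` data and function names are hypothetical and are not taken from the BioCreative evaluation.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_average(articles):
    """Score every article separately, then average the per-article scores."""
    scores = [prf(tp, fp, fn) for tp, fp, fn in articles]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


def micro_average(articles):
    """Pool all counts first; assertion-heavy articles dominate the result."""
    tp = sum(a[0] for a in articles)
    fp = sum(a[1] for a in articles)
    fn = sum(a[2] for a in articles)
    return prf(tp, fp, fn)


# Hypothetical (true positive, false positive, false negative) counts per article.
ARTICLES = [(1, 0, 3), (0, 2, 1), (8, 8, 2)]

macro_p, macro_r, macro_f = macro_average(ARTICLES)
harmonic_of_averages = 2 * macro_p * macro_r / (macro_p + macro_r)
print(f"macro P={macro_p:.2f} R={macro_r:.2f} F={macro_f:.2f}")
print(f"harmonic mean of macro P and R: {harmonic_of_averages:.2f}")
print("micro P/R/F:", tuple(round(x, 2) for x in micro_average(ARTICLES)))
```

In this toy data the averaged F is lower than the harmonic mean of the averaged P and R, the same direction as in the reported figures.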
BioCreative conclusions
•Information extraction in biomedical text is hard
– Linguistic variability in how concepts are expressed
– Complex concepts with multiple “slots”
•OpenDMAP advances the state of the art
– Use of an ontology grounds the search for information
– Flexibility of the pattern language to incorporate constraints at different levels (conceptual, lexical, word order, linguistic)
BioNLP’09: Methods
Protein_transport := [TRANSPORTED-ENTITY] translocation @(from {DET}? [TRANSPORT-ORIGIN]) @(to {DET}? [TRANSPORT-DESTINATION])
Bax translocation to mitochondria from the cytosol
Bax translocation from the cytosol to the mitochondria
Slide credit: Kevin B. Cohen
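The two example sentences differ only in the order of the "from" and "to" phrases, which suggests that the @(...) elements may match in either order. The Python sketch below imitates that behaviour under this assumption: it anchors on the head of the pattern and then looks for each optional phrase independently. The regular expressions, the single-token slot fillers and the `match_transport` helper are illustrative choices, not OpenDMAP's implementation.

```python
import re

DET = r"(?:(?:the|a|an)\s+)?"      # optional determiner, standing in for {DET}?
ENTITY = r"[A-Za-z0-9-]+"          # assumption: slot fillers are single tokens

# Head of the pattern: [TRANSPORTED-ENTITY] translocation
HEAD = re.compile(rf"(?P<entity>{ENTITY})\s+translocation\b(?P<rest>.*)")

# Each @(...) phrase is searched for on its own, so order does not matter.
PHRASES = {
    "transport_origin": re.compile(rf"\bfrom\s+{DET}(?P<filler>{ENTITY})"),
    "transport_destination": re.compile(rf"\bto\s+{DET}(?P<filler>{ENTITY})"),
}


def match_transport(text):
    """Fill a Protein_transport frame from text, or return None if no head is found."""
    head = HEAD.search(text)
    if head is None:
        return None
    frame = {"transported_entity": head.group("entity")}
    for slot, phrase in PHRASES.items():
        found = phrase.search(head.group("rest"))
        if found:
            frame[slot] = found.group("filler")
    return frame


first = match_transport("Bax translocation to mitochondria from the cytosol")
second = match_transport("Bax translocation from the cytosol to the mitochondria")
print(first)
print(second)
```

Both sentences yield the same frame, with Bax as the transported entity, cytosol as the origin and mitochondria as the destination.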
BioNLP’09: Methods
Protein_transport := [TRANSPORTED-ENTITY] translocation @(from {DET}? [TRANSPORT-ORIGIN]) @(to {DET}? [TRANSPORT-DESTINATION])
Protein (Sequence Ontology)
Cellular Component (Gene Ontology)
Slide credit: Kevin B. Cohen
BioNLP’09: Methods
• All event types represented as frames
– Elements from ontology constrain every slot
EVENT TYPE: REGULATION
AtLoc: instance of biological_entity
Cause: instance of protein
CSite: instance of biological_concept or polypeptide_region
Event_action: instance of trigger_word or detection_method
Site: instance of biological_concept or polypeptide_region
Theme: instance of protein or biological_process
ToLoc: instance of biological_entity
Sequence Ontology
Molecular Interaction Ontology
Gene Ontology
Cell Cycle Ontology
Slide credit: Kevin B. Cohen
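The REGULATION frame can be read as a table of slot names and the ontology classes allowed in each slot. The Python sketch below checks candidate fillers against that table. The slot constraints are copied from the slide; the tiny is-a hierarchy in `PARENT` and the `violations` helper are hypothetical and stand in for a real ontology lookup.

```python
# Slot constraints copied from the REGULATION frame on the slide.
REGULATION_SLOTS = {
    "AtLoc": {"biological_entity"},
    "Cause": {"protein"},
    "CSite": {"biological_concept", "polypeptide_region"},
    "Event_action": {"trigger_word", "detection_method"},
    "Site": {"biological_concept", "polypeptide_region"},
    "Theme": {"protein", "biological_process"},
    "ToLoc": {"biological_entity"},
}

# Hypothetical is-a links, for illustration only.
PARENT = {"kinase": "protein", "western_blot": "detection_method"}


def ancestors(cls):
    """Return the class followed by all of its ancestors in the toy hierarchy."""
    chain = [cls]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def violations(frame, slots=REGULATION_SLOTS):
    """List the slots whose filler class is not allowed (unknown slots count too)."""
    bad = []
    for slot, filler_class in frame.items():
        allowed = slots.get(slot, set())
        if not allowed.intersection(ancestors(filler_class)):
            bad.append(slot)
    return bad


good_frame = {"Cause": "kinase", "Theme": "biological_process", "Event_action": "western_blot"}
bad_frame = {"Cause": "biological_process", "Theme": "protein"}
print(violations(good_frame))
print(violations(bad_frame))
```

A candidate event whose frame has any violation would be rejected, which is how ontology constraints can prune matches that a purely lexical pattern would accept.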
BioNLP’09: Methods
Partial view of ontology—reality is a little bit less clean
Slide credit: Kevin B. Cohen
BioNLP’09: Methods
Columns: Event type, Site, AtLoc, ToLoc (only the non-empty cells are listed for each event type)
Binding: protein domain (SO), binding site (SO), DNA (SO), chromosome (SO)
Gene expression: gene (SO), biological entity (CCO); tissue (BTO), cell type (CTO), cellular component (GO)
Localization: cellular component (GO); cellular component (GO)
Phosphorylation: amino acid (FMA), polypeptide region (SO)
Protein catabolism: cellular component (GO)
Transcription: gene (SO), biological entity (CCO)
BTO: BRENDA Tissue Ontology
CCO: Cell Cycle Ontology
CTO: Cell Type Ontology
FMA: Foundational Model of Anatomy
GO: Gene Ontology
SO: Sequence Ontology
Slide credit: Kevin B. Cohen
BioNLP’09: Methods
•Manual pattern-writing
– Before availability of training data: based on native speaker intuitions, examples from PubMed, and variations on same, as in Cohen et al. (2004)
– After release of training data: based on examination of corpus data, targeting high-frequency predicates only
– Nominalizations predominated; used insights from Cohen et al. (2008) regarding Theme placement
– Protein binding rules re-used from BioCreative II protein-protein interaction task
– Eschewed use of wildcards
Slide credit: Kevin B. Cohen
BioNLP’09: Results
        Our system           Best team            Best P/R/F
        P      R      F      P      R      F      P      R      F
Task 1  71.81  13.45  22.66  58.48  46.73  51.95  71.81  46.73  51.95
Task 2  70.97  13.25  22.33  54.08  35.86  43.12  70.97  35.86  43.12
Task 3  57.40  12.33  20.30  60.83  32.68  42.52  60.83  32.68  42.52
Task 1: P 10 points higher than second-highest
Task 2: P 14 points higher than second-highest
Task 3: P 3.4 points lower than highest (3/6)
Slide credit: Kevin B. Cohen
BioNLP’09: Results
Unofficial results: contribution of bug repairs
                  P      R      F
Official results  71.81  13.45  22.66
With bug fixes    67.19  17.38  27.62
Still the highest precision (#2 was 62.21)
Slide credit: Kevin B. Cohen
BioNLP’09: Results
•Contribution of coördination-handling
–Bug-fixed results: F 27.62 (Task 1)
–Without coordination-handling: F 24.72
–Decrease in F of 2.9 without coördination-handling
Slide credit: Kevin B. Cohen
Syntax helps
• 125I-labeled C3b was covalently deposited on CR2, when hemolytically active 125I-labeled C3 was added to Raji cells preincubated with iC3, factor B, properdin, and factor D, thus proving functionality of CR2-bound C3 convertase.
<cr2> BINDS <c3 convertase>
• CD8alpha(alpha) binds one HLA-A2/peptide molecule, interfacing with the alpha2 and alpha3 domains of HLA-A2 and also contacting beta2-microglobulin.
<cd8alpha ( alpha )> BINDS <hla a2 / peptide molecule>
• The binding of 109Cd to metallothionein and the thiol density of the protein were determined after incubation of a purified Zn/Cd-metallothionein preparation with either hydrogen peroxide alone, or with a number of free radical generating systems.
<109cd> BINDS <metallothionein>
• Although these shifts in alpha3 may provide a synergistic modulation of affinity, the binding of CD8 to MHC is clearly consistent with an avidity-based contribution from CD8 to TCR-peptide-MHC interactions.
<Cd8> BINDS <major histocompatibility complex>
More complex examples
•Complex noun phrases
• The inactive C3 (iC3), which forms spontaneously in serum in low amounts by reaction of native C3 with H2O, binds noncovalently to the N-terminal part of CR2.
<inactive c3> BINDS <cr2>
• RelB binds transcriptionally active kappaB motifs in the TNF-alpha promoter in normal cells, and in vitro studies with macrophages isolated from RelB-deficient animals revealed impaired production of TNF-alpha in response to LPS and IFN-gamma.
<relb> BINDS <tnf - alpha promoter>
•Negation
• TNP-BSA, however, did not bind to the CD4 receptor.
<trinitrophenyl-bovine serum albumin> DOES_NOT_BIND <cd4 receptor>
• Similarly, when cells expressing the wild type FSHR were treated with tunicamycin to prevent N-linked glycosylation, the resulting nonglycosylated FSHR was not able to bind FSH.
<resulting nonglycosylated fsh receptor> DOES_NOT_BIND <follicle-stimulating hormone>
Coordination is particularly hard
In contrast both the S4GGnM-R and the Man-R are able to bind Man-BSA.
<mannose receptor> BINDS <man bsa>
<s4ggnm - r> BINDS <man bsa>
Purified recombinant NC1, like authentic NC1, also bound specifically to fibronectin, collagen type I, and a laminin 5/6 complex.
<authentic nc1> BINDS <laminin 5 / 6 complex>
<authentic nc1> BINDS <collagen type I>
<authentic nc1> BINDS <fibronectin>
<purified recombinant nc1> BINDS <laminin 5 / 6 complex>
<purified recombinant nc1> BINDS <collagen type I>
<purified recombinant nc1> BINDS <fibronectin>
The nonvisual arrestins, beta-arrestin and arrestin3, but not visual arrestin, bind specifically to a glutathione S-transferase-clathrin terminal domain fusion protein.
*<Arrestin3> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
<beta arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
<nonvisual arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
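Once the coordinated noun phrases have been identified, distributing the predicate over them is a cross product, as the NC1 example shows (two subjects and three objects give six BINDS assertions). The Python sketch below covers only that last step; finding the conjuncts in the parse is the hard part and is assumed to have been done already. The `expand_coordination` helper is hypothetical.

```python
from itertools import product


def expand_coordination(subjects, predicate, objects):
    """Emit one (subject, predicate, object) triple per subject/object pair."""
    return [(s, predicate, o) for s, o in product(subjects, objects)]


# Conjuncts taken from the NC1 sentence on the slide.
subjects = ["purified recombinant nc1", "authentic nc1"]
objects = ["fibronectin", "collagen type I", "laminin 5 / 6 complex"]

triples = expand_coordination(subjects, "BINDS", objects)
for subject, predicate, obj in triples:
    print(f"<{subject}> {predicate} <{obj}>")
```

Negated conjuncts, such as "but not visual arrestin" in the last example, would have to be filtered out or emitted with a different predicate before this step.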
BioNLP Shared Task ‘11
•Extension of BioNLP’09 tasks
–Generalization to full text (from abstracts)
–Additional event types: post-translational modifications and catalysis
•Methods:
–Based on empirically derived patterns
–Derived from training data + manual refinement
–Using dependency relations (syntax)
–Work of Haibin Liu (postdoc)
Integrating background knowledge
•Can improve OpenDMAP precision with minimal cost to recall
–Take advantage of background knowledge
–Tighten constraints on slot fillers in the ontology
–No change to existing patterns
•Proof of concept:
–Distinguish among several types of protein activation (enzyme and receptor) in GeneRIFs
–Utilize Gene Ontology annotations
Refining selectional restrictions
TP: [GeneRIF 104155] an ER stress induces the activation of [caspase-12_protein - catalytic activity]activated_entity via [caspase-3_protein]activator
Prevented FP: [GeneRIF 105594] factor Xa can induce mesangial cell proliferation through the activation of ERK_protein via PAR2_protein in mesangial cells
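One way to picture the tightened selectional restriction is as a filter applied after pattern matching: an activation event is kept only if the activated entity carries the Gene Ontology annotation required for that event type. The Python sketch below is an illustration under stated assumptions, not the lab's implementation. The mapping in `REQUIRED_FUNCTION`, the annotation table and every entity name other than caspase-12 are hypothetical.

```python
# Assumption: each event type requires one GO molecular function on the activated entity.
REQUIRED_FUNCTION = {
    "enzyme_activation": "catalytic activity",
    "receptor_activation": "receptor activity",
}

# Toy annotation lookup. Only the caspase-12 entry mirrors the slide;
# the other entries are hypothetical.
GO_ANNOTATIONS = {
    "caspase-12": {"catalytic activity"},
    "hypothetical_receptor": {"receptor activity"},
    "hypothetical_unannotated": set(),
}


def passes_restriction(event_type, activated_entity, annotations=GO_ANNOTATIONS):
    """True if the activated entity has the annotation the event type requires."""
    required = REQUIRED_FUNCTION[event_type]
    return required in annotations.get(activated_entity, set())


def filter_events(events, annotations=GO_ANNOTATIONS):
    """Keep only the (event_type, activated_entity, activator) triples that pass."""
    return [e for e in events if passes_restriction(e[0], e[1], annotations)]


candidates = [
    ("enzyme_activation", "caspase-12", "caspase-3"),
    ("enzyme_activation", "hypothetical_unannotated", "hypothetical_receptor"),
    ("receptor_activation", "hypothetical_receptor", "caspase-3"),
]
kept = filter_events(candidates)
print(kept)
```

Entities that lack the required annotation are dropped even when the event is real, which is consistent with a gain in precision at some cost in recall.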
Results
                            Original  Additional Memory  Difference
Enzyme Events    Precision  0.24      0.37               0.13
                 Recall     0.27      0.20               -0.07
                 F-measure  0.26      0.26               0.00
Receptor Events  Precision  0.08      0.34               0.26
                 Recall     0.17      0.12               -0.05
                 F-measure  0.11      0.18               0.07
Total            Precision  0.16      0.36               0.20
                 Recall     0.24      0.18               -0.06
                 F-measure  0.19      0.24               0.05
Biological entities
•Genes (and their products) are particularly valuable to recognize, but are not the only entities of interest:
–Diseases
–Drugs, Chemicals, and other treatments
–Anatomical and other locations
–Time and temporal relationships
–Methods and evidence
–Molecular functions, biological processes
Two dictionary-based tools tested against CRAFT
•UIMA ConceptMapper http://incubator.apache.org/uima/sandbox.html#concept.mapper.annotator
– stemming and case matching relaxation
– non-contiguous spans
– ignore stopwords
– order-independent lookup
•Open Biomedical Annotator http://bioportal.bioontology.org/annotator
– ignore stopwords
– partial word matches
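The matching options listed above can be pictured as normalization steps applied to both the dictionary entries and the text mention before lookup. The Python sketch below shows this with a toy dictionary. It does not use the ConceptMapper or Open Biomedical Annotator APIs; the stemmer is a crude plural stripper, the stopword list is illustrative, and the identifier `EX:0000001` is a made-up placeholder. Non-contiguous spans and partial word matches are not covered.

```python
import re

STOPWORDS = {"of", "the", "in", "a", "an"}  # illustrative list only


def crude_stem(token):
    """Strip a final plural 's'; a stand-in for a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(text, stem=True, ignore_case=True, drop_stopwords=True,
              order_independent=True):
    """Turn a term or mention into a lookup key under the chosen options."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if ignore_case:
        tokens = [t.lower() for t in tokens]
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    if order_independent:
        tokens = sorted(tokens)
    return tuple(tokens)


def build_index(terms, **options):
    """Map normalized labels to term identifiers."""
    return {normalize(label, **options): term_id for term_id, label in terms.items()}


def lookup(mention, index, **options):
    """Return the identifier for a mention, or None if it is not in the index."""
    return index.get(normalize(mention, **options))


TERMS = {
    "GO:0005874": "Microtubule",
    "GO:0005694": "Chromosome",
    "EX:0000001": "induction of apoptosis",  # placeholder identifier
}

relaxed = build_index(TERMS)
STRICT = dict(stem=False, ignore_case=False, drop_stopwords=False,
              order_independent=False)
strict = build_index(TERMS, **STRICT)

print(lookup("microtubules", relaxed))
print(lookup("apoptosis induction", relaxed))
print(lookup("microtubules", strict, **STRICT))
```

The same options must be used when building the index and when looking up a mention, otherwise the keys will not line up.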
Best run results
• CM/CTO: stemming + FindAllMatches: false
• OBA/CTO: using default stop words
• CM/GO_CC: stemming + caseMatch: insensitive
• CM/ChEBI: caseMatch: sensitive
Concept Matching Conclusions
•The kinds of terms in the ontology matter
•The strategies used in the dictionary matching tools matter
•OpenDMAP will support strategies that go beyond dictionary matching …
Evaluation via Test Suite
• Big picture: How to evaluate ontology concept recognition systems?
• Traditional approach: “corpus”
– Expensive
– Time-consuming to produce
– Redundancy for some things…
– …underrepresentation of others
• Immediate (narrow) goal of this work: Use techniques from software testing and descriptive linguistics to build test suites that:
– Control test data
– Eliminate redundancy
– Systematic coverage (Oepen 1998)
• Immediate (broad) goal of this work: Are there general principles for test suite design?
Slide credit: Kevin B. Cohen
Methods
•Steps: develop “catalogue” of dimensions along which terms vary
•Use insights from linguistics and from how we know concept recognition systems work
–Structural aspects: length
–Content aspects: typography, orthography, lexical contents (function words)…
•…to build a structured set of test cases
•Also compare to other test suite work (Cohen et al. 2004) to look for common principles
Slide credit: Kevin B. Cohen
Structured test suite
Canonical
• GO:0000133 Polarisome
• GO:0000108 Repairosome
• GO:0000786 Nucleosome
• GO:0001660 Fever
• GO:0001726 Ruffle
• GO:0005623 Cell
• GO:0005694 Chromosome
• GO:0005814 Centriole
• GO:0005874 Microtubule
Non-canonical
• GO:0000133 Polarisomes
• GO:0000108 Repairosomes
• GO:0000786 Nucleosomes
• GO:0001660 Fevers
• GO:0001726 Ruffles
• GO:0005623 Cells
• GO:0005694 Chromosomes
• GO:0005814 Centrioles
• GO:0005874 Microtubules
induction of apoptosis -> apoptosis induction (Syntax)
cell migration -> cell migrated (Part of speech)
ensheathment of neurons -> ensheathment of some neurons
Slide credit: Kevin B. Cohen
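The non-canonical cases on this slide follow simple rules, so a structured test suite can be generated from the canonical terms. The Python sketch below covers three of the dimensions shown (plural, syntactic reordering of "X of Y", and an inserted determiner). The rule functions and the `build_test_cases` helper are hypothetical, the pluralization is naive, and the part-of-speech variant is not attempted.

```python
import re

# Matches labels of the form "<head> of <modifier>".
OF_PHRASE = re.compile(r"^(?P<head>.+?) of (?P<modifier>.+)$")


def plural_variant(label):
    """Naive plural: append 's' (adequate for the single-word terms on the slide)."""
    return label + "s"


def reorder_variant(label):
    """'induction of apoptosis' -> 'apoptosis induction'; None if not applicable."""
    m = OF_PHRASE.match(label)
    return f"{m.group('modifier')} {m.group('head')}" if m else None


def determiner_variant(label, determiner="some"):
    """'ensheathment of neurons' -> 'ensheathment of some neurons'."""
    m = OF_PHRASE.match(label)
    return f"{m.group('head')} of {determiner} {m.group('modifier')}" if m else None


def build_test_cases(terms):
    """Return (term_id, text, kind) triples: the canonical form plus its variants."""
    cases = []
    for term_id, label in terms.items():
        cases.append((term_id, label, "canonical"))
        if " " not in label:
            cases.append((term_id, plural_variant(label), "plural"))
        for kind, rule in (("syntax", reorder_variant), ("determiner", determiner_variant)):
            variant = rule(label)
            if variant:
                cases.append((term_id, variant, kind))
    return cases


for case in build_test_cases({"GO:0005623": "Cell", "GO:0001726": "Ruffle"}):
    print(case)
```

Each generated case keeps the identifier of its canonical term, so a recognizer's output can be scored per variation type rather than only in aggregate.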
Methods/Results
•Gene Ontology, revision 9/24/2009
•Canonical: 188
•Non-canonical: 117
•Observation:
–5:1 “dirty” versus 5:1 “clean” is mark of “mature” testing
•Applied publicly available concept recognition system
Slide credit: Kevin B. Cohen
Results
•97.9% of canonical terms were recognized
–All exceptions contain the word “in”
•No non-canonical terms were recognized
•What would it take to recognize the error pattern with canonical terms with a corpus-based approach?
•General principles: Length, ortho/typography (numerals/punctuation), function/stopwords, syntactic context
Slide credit: Kevin B. Cohen