A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf


  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf


    A semi-supervised approach to extracting concepts and relationships from clinical text

    Siddhartha Jonnalagadda

    5/18/2010 1ASU Biomedicine


    Abstract

    Health care industry: trillions of dollars of market share; information-rich clinical records abound

    Goal: Extract mentions of entities such as treatments, lab tests, and medical problems, as well as associations among them. Enable secondary use of this data:

    Tracking performance

    Optimizing resources

    Biosurveillance

    Clinical Decision Support

    Structure the unstructured narratives in clinical records using Information Extraction (NLP)


    Applications of NLP for Biomedical Informatics

    Bioinformatics: Curation of PPIs into STRING and SNPs into modSNP

    Clinical Informatics: De-identification of patient information and extraction of coded information from clinical records

    Public Health Informatics: Computational biosurveillance. For example, BioCaster tracks the distribution of infectious disease outbreaks using linguistic signals from the Web.

    Imaging Informatics: Literature-based discovery for improving the state of the art of medical imaging and biomedical image search


    Biomedical NLP vs. Clinical NLP

    | | Biomedical NLP | Clinical NLP |
    |---|---|---|
    | Input | Scientific literature | Patient reports |
    | Type of relations found | Complex relations between biomolecular substances | Descriptive |
    | Grammar | Relations based on verbs | Nouns and adjectives |
    | User needs | Focused: Literature Search, Curation & Hypothesis Testing | Diverse: Coding, Dec. support, Terminology Mgmt, Lit. Search, Hypo. Testing |
    | Availability | Open access of scientific literature | Privacy concerns due to HIPAA |
    | Quality | Peer-reviewed and written in English | Local languages that aren't peer-reviewed |
    | Motivation | Scientifically appealing and new discoveries | Philanthropic, humanitarian and medico-economic motivation |
    | Funding | Stable since genome sequencing | Fluctuations over years |
    | Shared tasks | BioCreative shared task | i2b2 NLP shared task |

    Overlap: Tissues, cells, molecular components and diseases


    REFERENCE: Pierre Zweigenbaum, Natural Language Processing in the Medical and Biological Domains: a Parallel Perspective. Invited Talk, SMBM 2008


    Hypothesis: Biomedical to Clinical

    Could our methods in biomedical NLP be adapted to clinical NLP?

    Problem: extracting medical problems, tests, and treatments, and relations among them, in clinical narratives

    Large corpora of text unavailable

    Very little annotated text

    Approach: 1) Vector similarity approach using Distributional Semantics


    Objectives

    1. Evaluate the effectiveness of distributional semantics for entity recognition and association extraction from clinical records.

    2. Develop a system for the automatic extraction of treatment, test, and medical problem associations from clinical records

    3. Deploy the system as an i2b2 plug-in that could be used for clinical decision support and also improve the extraction engine.


    Clinical Information Extraction

    Extracting concepts:

    Medical problem

    Test

    Treatment

    Extracting relations (between concepts):

    which treatment improves a medical condition

    which treatment worsens a medical condition

    which treatment causes a medical problem

    which treatment is administered for a medical problem

    which treatment is not administered for a medical problem

    which test reveals a medical problem

    which test is conducted to reveal a medical problem

    which medical problem indicates another medical problem


    Corpus

    Downloaded from i2b2/VA shared task.

    ~100 de-identified clinical notes annotated for concepts and relations

    ~1000 unlabeled de-identified clinical notes

    Compiled from

    discharge summaries from Partners HealthCare

    discharge summaries from Beth Israel Deaconess Medical Center

    discharge summaries and progress notes from University of Pittsburgh Medical Center

    Sign Data Usage and Confidentiality agreement

    Compulsory participation in the shared task


    Extracting concepts

    Also known as Named Entity Recognition; studied for the last two decades in the general domain and for about a decade in the medical domain

    Can be dictionary-based, rule-based, or machine learning

    Medical Language Extraction and Encoding System (MedLEE, 1997) generates coded information for general clinical notes, using syntactic patterns from 1000 grammar rules and some lexicon

    MetaMap (2001) by NLM maps text to the UMLS Metathesaurus using a knowledge-intensive approach: it detects noun phrases and then employs around 1 M Metathesaurus strings

    cTAKES (2008) by Mayo uses UIMA for extracting clinical concepts. Their Naive Bayes classifier with syntactic and morphological features achieved an F-score of 0.56 based on strict matching

    Patrick et al. (2010)'s baseline for the i2b2/VA NLP shared task has an F-score of 64%

    Clinical NLP is lagging behind biomedical NLP (Meystre et al., 2008)

    Main reason: scarcity of labeled data and wider variation in the format of free text

    Solution: unsupervised or semi-supervised learning


    | | Unsupervised | Semi-Supervised | Supervised |
    |---|---|---|---|
    | Goal | Uncover hidden regularities or detect anomalies in the data | Training on labeled data and unlabeled data, frequently resulting in a more accurate classifier | Predict the label of an unseen example based on labels of the seen examples |
    | Input | Only unlabeled data | Labeled data and unlabeled data | Only labeled data |
    | Example | K-Means | Co-training, ASO | SVM |
    | Output | Clusters | Usually classes | Classes |
    | Cost | Low | Moderate | High |
    | Accuracy | Low | High | Moderate |


    Kernel Methods framework


    ASO: Success of semi-supervised learning in biological domain

    IBM's Alternating Structure Optimization (ASO) implementation of a gene tagger (entered in BioCreative II, 2007) used 5 M MEDLINE abstracts as unlabeled data. Result: ranked first in that competition

    Liu and Ng (2007) tried ASO for SRL but failed, because limitations in computational resources kept them from using all the unlabeled data.


    Two take-home messages from ASO

    Unlabeled data can be very useful if the data is LARGE

    It is worth finding out whether a kernel built using term similarity in the combined word space of MEDLINE and unlabeled clinical documents would improve the performance of clinical concept extraction

    However, biomedical and clinical domains are usually considered different sublanguages

    Pan et al. (2010) classified polarity of sentiments in one domain using the annotations of a related domain by simultaneously co-clustering them in a common latent space.

    Need a computationally scalable model (linear in space and time complexity) for building the kernel using LARGE data

    Random Indexing is linear in space and time complexity


    Use of Random Indexing based word space models for designing kernel

    Random Indexing helps to reduce the dimensionality of unlabeled data by mapping the terms into random index vectors

    Semantic term vectors are built from random index vectors by considering the context around the terms.

    Sahlgren's permutation model uses permutations of terms in a sliding window surrounding the term to build a paradigmatic model of semantic term vectors

    A semantic sentence vector for each sentence in the labeled corpus is built by adding the individual term vectors

    Kernels for terms and sentences are built by calculating the dot product of the corresponding semantic vectors
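The steps on this slide can be illustrated with a minimal sketch of plain sliding-window Random Indexing (not the permutation variant). The values of `DIM`, `SEED_LENGTH` and `HALF_WINDOW`, the toy corpus, and all function names are assumptions made for illustration; the kernel here is the dot product of L2-normalised vectors, which is one possible reading of the slide.

```python
import math
import random
from collections import defaultdict

DIM = 200         # dimensions of the reduced space (hypothetical value)
SEED_LENGTH = 10  # non-zero entries per random index vector (hypothetical value)
HALF_WINDOW = 2   # context terms considered on each side (hypothetical value)


def index_vector(term, dim=DIM, seed_length=SEED_LENGTH):
    """Sparse random index vector with +1/-1 entries, deterministic per term."""
    rng = random.Random(term)
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), seed_length):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def add_into(target, source):
    """Add source to target element-wise, in place."""
    for i, value in enumerate(source):
        target[i] += value


def build_term_vectors(sentences, half_window=HALF_WINDOW):
    """Semantic term vector = sum of index vectors of the terms in its window."""
    term_vectors = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo = max(0, i - half_window)
            hi = min(len(tokens), i + half_window + 1)
            for j in range(lo, hi):
                if j != i:
                    add_into(term_vectors[term], index_vector(tokens[j]))
    return term_vectors


def sentence_vector(tokens, term_vectors):
    """Semantic sentence vector = sum of the semantic term vectors."""
    vec = [0.0] * DIM
    for term in tokens:
        if term in term_vectors:
            add_into(vec, term_vectors[term])
    return vec


def kernel(u, v):
    """Dot product of the L2-normalised vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


corpus = [
    "aspirin relieved the chest pain".split(),
    "ibuprofen relieved the chest pain".split(),
    "the echocardiogram revealed stenosis".split(),
]
vectors = build_term_vectors(corpus)
k_same = kernel(vectors["aspirin"], vectors["ibuprofen"])
k_diff = kernel(vectors["aspirin"], vectors["echocardiogram"])
k_sent = kernel(sentence_vector(corpus[0], vectors),
                sentence_vector(corpus[1], vectors))
```

In the toy corpus, "aspirin" and "ibuprofen" occur in identical contexts, so their term kernel is higher than that of "aspirin" and "echocardiogram".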


    Approach

    Labeled data: 100 annotated clinical documents

    Unlabeled data: 1000 unlabeled clinical notes and Medline abstracts

    Design kernel(s) (similarity metric(s)) using the unlabeled data

    Implement the kernel algorithm using the labeled data

    Advantage: Use of unlabeled data that holds knowledge of how words and sentences are related to each other.


    Finding optimal parameters

    Different parameters that need to be tested to find the optimal settings

    Dimensions in reduced space

    Seed length

    Half-window size

    Threshold for term-term similarity

    Threshold for sentence-sentence similarity

    Number of similar sentences to consider

    The first three parameters are specific to the model; the last three are universal, as they belong to SimFind
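One simple way to explore these six parameters is an exhaustive grid search, sketched below. The candidate values are hypothetical (the slides do not list the ranges that were tested), and `evaluate` stands for any scoring function, such as an F-score on held-out annotated notes.

```python
from itertools import product

# Hypothetical candidate values, for illustration only.
grid = {
    "dimensions": [500, 1000, 2000],
    "seed_length": [5, 10, 20],
    "half_window": [1, 2, 3],
    "term_threshold": [0.4, 0.5, 0.6],
    "sentence_threshold": [0.4, 0.5, 0.6],
    "num_similar_sentences": [50, 100, 200],
}


def configurations(grid):
    """Yield every combination of parameter values as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def best_configuration(grid, evaluate):
    """Return the configuration that maximises the score given by evaluate."""
    return max(configurations(grid), key=evaluate)
```

Because the first three parameters require rebuilding the word space while the last three only affect SimFind, a practical search would probably cache one word space per model setting and sweep the SimFind thresholds on top of it.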


    Finding optimal models

    Different possible paradigmatic models

    Sahlgren's permutation-based order vector model (2008)

    Sahlgren's directional vector model (2008)

    Hyperspace Analog to Language (HAL) (1996)

    Jones' convolution-based BEAGLE model (2007)

    Cohen's Reflective Random Indexing (2010)

    Sahlgren's models are computationally scalable compared to Jones' BEAGLE

    Sahlgren's directional model performed better than the conventionally used permutation model

    HAL uses SVD, but encodes direction, not order


    Kernel Algorithm: SimFind

    (Architecture)


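    Kernel Algorithm: SimFind (Pseudocode)

    SimFind(targetToken, Line) {
        // Gold labels of the tokens in the 100 sentences most similar to Line
        List simSentences = getSimilarSentences(Line, 100);
        List goldenTokenLabels = getTokenLabels(simSentences);

        STEP1:  // the target token itself is labeled in a similar sentence
        FOREACH (goldenTokenLabel IN goldenTokenLabels)
            IF (goldenTokenLabel has targetToken as token)
                RETURN goldenTokenLabel;

        STEP2:  // stop words receive no label
        IF (targetToken IN STOPLIST)
            RETURN ;

        terms = 1;

        STEP3:  // look at ten times more similar words on every pass
        terms *= 10;
        List equivTokens = getSimWords(targetToken, terms);
        FOREACH (equivToken IN equivTokens)
            FOREACH (goldenTokenLabel IN goldenTokenLabels)
                IF (goldenTokenLabel has equivToken as token)
                    RETURN goldenTokenLabel;
        IF (simIndex > 0.5)
            goto STEP3;
        RETURN ;
    }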
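A runnable sketch of the same control flow follows. The helper callables, the toy data, the `max_terms` cap and the reading of `simIndex` as the similarity of the weakest neighbour retrieved so far are all assumptions made for illustration; the slides do not define them.

```python
def sim_find(target_token, sentence, get_similar_sentences, get_token_labels,
             get_sim_words, stoplist, min_similarity=0.5, max_terms=1000):
    """Return a label borrowed from similar sentences, or None if none is found."""
    sim_sentences = get_similar_sentences(sentence, 100)
    golden = get_token_labels(sim_sentences)  # dict: token -> gold label

    # STEP 1: the token itself is labeled in one of the similar sentences.
    if target_token in golden:
        return golden[target_token]

    # STEP 2: stop words never receive a label.
    if target_token in stoplist:
        return None

    # STEP 3: widen the set of similar words tenfold on every pass.
    terms = 1
    while terms < max_terms:
        terms *= 10
        neighbours = get_sim_words(target_token, terms)  # [(token, sim)], best first
        for token, _similarity in neighbours:
            if token in golden:
                return golden[token]
        # Assumption: stop once the weakest neighbour is no longer similar
        # enough, or when there are no further neighbours to retrieve.
        if len(neighbours) < terms or neighbours[-1][1] <= min_similarity:
            break
    return None


# Toy stand-ins for the kernels (hypothetical data).
LABELLED = {"aspirin relieved the pain": {"aspirin": "treatment", "pain": "problem"}}
NEIGHBOURS = {"ibuprofen": [("naproxen", 0.9), ("aspirin", 0.8), ("tablet", 0.4)]}
STOPLIST = {"the", "was", "on"}


def toy_similar_sentences(sentence, k):
    return list(LABELLED)[:k]


def toy_token_labels(sentences):
    labels = {}
    for s in sentences:
        labels.update(LABELLED[s])
    return labels


def toy_sim_words(token, n):
    return NEIGHBOURS.get(token, [])[:n]


def label(token):
    return sim_find(token, "ibuprofen relieved the pain", toy_similar_sentences,
                    toy_token_labels, toy_sim_words, STOPLIST)
```

With the toy data, "ibuprofen" inherits the label of its neighbour "aspirin", while a stop word or a token without labeled neighbours gets no label.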


    SimFind vs. Linear Discriminant Analysis

    | SimFind | Linear Discriminant Analysis |
    |---|---|
    | Uses Random Indexing for dimensionality reduction, which is O(N) and almost perfect | Uses Singular Value Decomposition, which is O(N^3) and perfect |
    | Unsupervised dimensionality reduction | Supervised dimensionality reduction |
    | Scalable to large amounts of unlabeled data | Applicable only to labeled data |
    | Random Indexing fixes the number of dimensions a priori | Finds the significant dimensions in order, like LSA |
    | Doesn't employ kernel trick | Employs kernel trick |
    | Uses 2 kernels | Uses single kernel |


    Perils of lazy learning

    SimFind is a special case of K-Nearest Neighbor, a supervised machine learning algorithm also known as a lazy learning algorithm for its short training time and long testing time

    Time complexity: O(N*T), where N is the number of terms in the training set and T is the number of tokens in the input or test set. N ~ 500,000

    Unfortunately, well-known applications that use large unlabeled data use K-NN. For example: PRC, MeSH Up and RRI-based MeSH indexing


    Overcoming the perils

    Ando (2007) removed sentences with words that already occurred 25 times

    Vasuki and Cohen (2010) used parallel processing to minimize I/O

    Observation:

    elements of the kernel are computed during the execution of the kernel learning algorithm

    the kernel learning algorithm only needs terms closely related to the terms in the corpus

    what terms are needed for the task has a high correlation with what terms are present in the corpus

    Solution: Modify kernel design to calculate the kernel matrix at once rather than postpone it to the learning step


    Modified Kernel design step

    Use Sahlgren's permutation model, or a better one with optimal parameters, to build a reduced-dimensional word space for MEDLINE and clinical notes

    For each term in the labeled corpus, store only the terms from the word space whose cosine similarity to it is above a threshold

    Also store the cosine similarities

    The second kernel, for sentence similarity, however has to be calculated during the learning step

    Time complexity: O(S*T), where S is the number of sentences in the input or test set and T is the average number of tokens in a sentence; that is, O(T) per sentence
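The precomputation step can be sketched as a lookup table built once, before learning. The function names, the three-dimensional toy word space and the threshold of 0.5 are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def precompute_term_kernel(corpus_terms, word_space, threshold=0.5):
    """For each labeled-corpus term, store {similar term: cosine} above threshold."""
    table = {}
    for term in corpus_terms:
        if term not in word_space:
            continue
        similar = {}
        for other, vector in word_space.items():
            if other == term:
                continue
            similarity = cosine(word_space[term], vector)
            if similarity >= threshold:
                similar[other] = similarity
        table[term] = similar
    return table


# Toy word space (hypothetical vectors).
word_space = {
    "aspirin": [1.0, 0.0, 0.2],
    "ibuprofen": [0.9, 0.1, 0.2],
    "stenosis": [0.0, 1.0, 0.0],
    "echocardiogram": [0.1, 0.9, 0.3],
}
table = precompute_term_kernel(["aspirin", "stenosis"], word_space)
```

At learning time, finding the words similar to a corpus term becomes a dictionary lookup in `table` instead of a scan over the whole word space.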


    Integrating SimFind with other features

    Naive Bayes classifier trained on the noun phrases and adjective phrases in the annotated corpus

    Features:

    Section of the sentence

    Lexical features

    Grammatical correctness

    output label from SimFind for the target token.

    most similar token from the corpus to the target token

    second most similar token from the corpus to the target token

    third most similar token from the corpus to the target token

    most similar token to the target token from the 100 most similar sentences to the target sentence

    second most similar token to the target token from the 100 most similar sentences to the target sentence

    third most similar token to the target token from the 100 most similar sentences to the target sentence


    Moving on to extracting relations

    An extension to concept extraction

    Types of relations

    which treatment improves a medical condition

    which treatment worsens a medical condition

    which treatment causes a medical problem

    which treatment is administered for a medical problem

    which treatment is not administered for a medical problem

    which test reveals a medical problem

    which test is conducted to reveal a medical problem

    which medical problem indicates another medical problem

    From BioCreative experience: the better the concept extraction, the better the relation extraction


    Approaches

    Machine Learning: less precise because of over-fitting, but high recall

    Rule-based or pattern-matching: lower recall because of limited patterns, but high precision

    Solution: Increase recall of pattern-matching by

    allowing for fuzzy matching based on vector similarity

    unmasking relationships hiding in the syntactic jungle using sentence simplification


    Vector similarity of a sentence and a pattern

    Without order, a sentence or a pattern can be represented in word space as the vector sum of the individual terms

    Order can be encoded by using the permutation model of Sahlgren

    For example, the vector for "hypertension was controlled on hydrochlorothiazide" would be ||Π^0(hypertension) + Π^1(controlled) + Π^2(hydrochlorothiazide)||, where Π is a random permutation and || || is the L2-norm

    While this appears theoretically sound, empirically the permutation model performed suboptimally for large windows

    Probable reason: it increases the ratio of seed length to the number of dimensions, thus compromising the Johnson-Lindenstrauss lemma
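A small sketch of this order encoding is given below. As an assumption, the random permutation is replaced by a one-step rotation of the coordinates (a common stand-in), and the four-dimensional vectors are toy values; `permute` and `encode` are illustrative names.

```python
import math


def permute(vec, times):
    """Apply the permutation `times` times; here, rotate right by one per step."""
    shift = times % len(vec)
    return vec[-shift:] + vec[:-shift] if shift else list(vec)


def l2_normalize(vec):
    """Scale the vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def encode(tokens, vectors):
    """Sum the term vectors, permuting the i-th term's vector i times."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for position, token in enumerate(tokens):
        for i, value in enumerate(permute(vectors[token], position)):
            total[i] += value
    return l2_normalize(total)


# Toy term vectors (hypothetical values).
toy = {"a": [1.0, 0.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0, 0.0]}
ab = encode(["a", "b"], toy)
ba = encode(["b", "a"], toy)
```

Unlike a plain vector sum, the two orderings of the same terms produce different vectors, which is what lets a pattern such as "treatment improves problem" be told apart from its reverse.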


    Concatenation as a Means to Encode Order in Word Space

    Define C(a, b) to be the vector formed by concatenating the vectors corresponding to a and b

    Order can be partially encoded by concatenating n-grams and summing the vectors

    For n=2, the vector for "hypertension was controlled on hydrochlorothiazide" would be |C(^, hypertension) + C(hypertension, controlled) + C(controlled, hydrochlorothiazide) + C(hydrochlorothiazide, $)|, where ^ and $ are random vectors assigned to the beginning and end of the sentence

    A random vector can also be assigned to * in a pattern expression such as P regulates Tr, so that the pattern is also expressed as a vector


    Example sentences from one discharge summary

    The patient has a history of exertional angina and chest pain associated with light-headedness for nine years which was noted to increase in frequency over the past year , then upon admission had light-headedness and chest pain with a dull pressure in her neck to her substernal area , with only minimal exertion such as " walking across the room " .

    She underwent angiography on 5-9-92 which showed the right internal carotid artery to be patent , however there was highly significant stenosis of the left carotid artery and she was taken to the operating room later that day for a left carotid endarterectomy .

    She was taken postoperatively the ICU where she was extubated on postoperative day number one and by 5-11-92 she was noted to have markedly increased use of her right side , resolving aphasia and she was transferred to the floor with the residual deficit only noted to be some left upper extremity weakness .


    BioSimplify

    GOAL: Create a bag of simplified sentences for automatic discourse analysis.

    Application: information extraction on biomedical and clinical text

    Existing methods: features like POS tags, parse trees and dependencies, informally known as bag-of-NLP

    BOSS (bag of simplified sentences): standardize the representation of grammatical information in elemental chunks


    BioSimplify (Architecture)

    Noun Phrase Replacement

    Using POS tags and noun phrase chunker

    POS tags: LingPipe

    Chunker: OpenNLP

    Syntactic Simplification

    Use any parser to produce CFG Penn tree

    Parser: McClosky retraining parser 88%

    Information Extraction System

    Example: PIE


    [Figure: BioSimplify architecture. Components shown: Sentence, NP Replacement (with NP Chunker), Syntactic Simplification (with Parser), BOSS, Information Extraction system]


    Noun Phrase Replacement

    A noun phrase consists of an optional determinative, an optional premodification, a mandatory head, and an optional postmodification

    Noun phrase chunkers return all the noun phrases of the smallest length, thus always excluding the postmodifications

    The last word of the identified noun phrase is the head noun

    Removal of the optional determinative makes the sentence ungrammatical

    All tokens other than the head noun and the starting determinative or numeral (if it exists) are removed

    For example, the noun phrase "the recently discovered murine glucocorticoid" is replaced with "the glucocorticoid".
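The replacement rule can be sketched on an already chunked and tagged noun phrase. The function name is illustrative, and treating the Penn Treebank tags DT and CD as "determinative or numeral" is an assumption; in the system described, the tags and chunks would come from LingPipe and OpenNLP.

```python
def simplify_noun_phrase(tagged_tokens):
    """Keep the head noun and a leading determiner or numeral, if present.

    tagged_tokens: list of (word, POS tag) pairs for one base noun phrase,
    whose last word is taken to be the head noun.
    """
    if not tagged_tokens:
        return []
    head = tagged_tokens[-1][0]
    first_word, first_tag = tagged_tokens[0]
    if len(tagged_tokens) > 1 and first_tag in {"DT", "CD"}:
        return [first_word, head]
    return [head]


phrase = [("the", "DT"), ("recently", "RB"), ("discovered", "VBN"),
          ("murine", "JJ"), ("glucocorticoid", "NN")]
simplified = simplify_noun_phrase(phrase)
```

For the example phrase from the slide, `simplified` holds the two tokens "the" and "glucocorticoid".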


    Syntactic Simplification

    synSimp(t), where t is the Penn tree of the given sentence:

    - Initialize simpTrees, the ordered set containing the Penn trees of all simplified sentences.

    - FOREACH subtree of t, traversed in depth-first order

        - perform the necessary simplifications at that node, i.e. the simplifications that need not be repeated for all the parents of this node

    - Add the present tree to simpTrees

    - FOREACH unprocessed tree in simpTrees

        - FOREACH subtree of that tree, traversed in depth-first order

            - perform the simplifications for this node

            - add new trees to simpTrees if applicable

    - RETURN the sentences represented by the trees in simpTrees

    END


    Example Rule

    Rule: NP[NP SBAR] ~ [NP], {SBAR - WHNP + NP}

    Condition:

    Wh-NP in the relative clause is replaced by NP from main clause

    Explanation: Relative Clause

    Example: To characterize these pathways, we focused on changes in the cyclin-dependent kinase inhibitors and their binding partners that underlie the cell cycle arrest at senescence.

    Result:

    changes in the cyclin-dependent kinase inhibitors and their binding partners.

    The cyclin-dependent kinase inhibitors and their binding partners underlie
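The relative-clause rule can be sketched on a simplified tree representation. As assumptions, trees are nested tuples of the form (label, child, ...) with words as plain strings (real Penn trees also carry POS preterminals), and the function names and the shortened example are illustrative.

```python
def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def apply_relative_clause_rule(node):
    """For NP -> NP SBAR where SBAR starts with WHNP, return two token lists:
    the main NP alone, and the clause with WHNP replaced by the main NP.
    Return None when the rule does not apply to this node."""
    if isinstance(node, str) or node[0] != "NP" or len(node) != 3:
        return None
    main_np, sbar = node[1], node[2]
    if isinstance(main_np, str) or main_np[0] != "NP":
        return None
    if isinstance(sbar, str) or sbar[0] != "SBAR" or len(sbar) < 2:
        return None
    wh = sbar[1]
    if isinstance(wh, str) or wh[0] != "WHNP":
        return None
    clause = leaves(main_np)
    for child in sbar[2:]:
        clause.extend(leaves(child))
    return leaves(main_np), clause


tree = ("NP",
        ("NP", "the", "inhibitors"),
        ("SBAR",
         ("WHNP", "that"),
         ("S", ("VP", "underlie", ("NP", "the", "arrest")))))
main, clause = apply_relative_clause_rule(tree)
```

Here `main` is the noun phrase without its relative clause, and `clause` is a new sentence in which "that" has been replaced by the noun phrase it refers to.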


    Using BioSimplify for Information Extraction


    [Figure: Work Flow for Each Abstract. Boxes shown: Doc from corpus, BioSimplify, Simplified docs, Remove Annotations, IE system (twice), Results for original sentences, Results for simplified sentences, Corpus, Comparison of different methods]


    Fuzzy pattern matching approach

    Generating patterns: Reduce each sentence to the snippets with concepts from the ground truth and some interaction keywords

    Manually find synonyms of the keywords: the paradigmatic vector model we built for the concept extraction task

    Strict pattern matching using OpenDMAP

    Vector representation of sentences by encoding order using Sahlgren's permutation model or a proposed concatenation model

    Use BioSimplify to transform the focus sentence into a bag of simplified sentences and search for patterns in each sentence
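The fuzzy matching step can be sketched as a cosine comparison between each simplified sentence vector and each pattern vector. The threshold of 0.8, the pattern names and the toy vectors are assumptions made for illustration; in the approach described, the vectors would come from the order-encoding word space models.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuzzy_match(sentence_vectors, pattern_vectors, threshold=0.8):
    """Return (sentence index, pattern name, similarity) for every pair whose
    cosine similarity reaches the threshold."""
    matches = []
    for index, sentence in enumerate(sentence_vectors):
        for name, pattern in pattern_vectors.items():
            similarity = cosine(sentence, pattern)
            if similarity >= threshold:
                matches.append((index, name, similarity))
    return matches


# Toy vectors for a bag of two simplified sentences and two patterns.
bag = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
patterns = {
    "treatment_improves_problem": [1.0, 0.0, 0.0],
    "test_reveals_problem": [0.0, 1.0, 0.0],
}
matches = fuzzy_match(bag, patterns)
```

A strict matcher would accept only the first sentence; lowering the bar to a similarity threshold also lets the second, slightly different sentence match the same pattern, which is the intended gain in recall.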


    Creating symbiotic relationship with clinicians

    "People are willing to do for free what they are not willing to do for small amounts of money", Spolsky, founder of stackoverflow.com

    Online servers for hospitals to use for extraction of clinical concepts and relations as part of the i2b2 hive

    Hospital staff verify the output of our system and correct it if necessary

    De-identified corrections retrieved regularly from i2b2 to improve the system


    Broader impacts and Sustainability Plans

    Proposed using unlabeled data for improving extraction of concepts and relationships using distributional semantics methods

    Suggested using sentence simplification for aiding interaction extraction

    Novel additions to clinical domain

    Successful adaptation would have a transformative influence on the field of information extraction

    To sustain the application of these methods, we will

    evaluate them against other competitive systems by participating in international competitions like the i2b2/VA NLP shared tasks

    create a service-oriented architecture to offer the services of our system to the world free of cost

    Wider dissemination

    Obtainment of useful feedback


    Team: Siddhartha Jonnalagadda (PI)

    Currently pursuing PhD in Biomedical Informatics

    B.Tech in CSE from IIT Kharagpur (5th in GPA among the class of 2008)

    Inlaks Awards of Excellence at IITs [2006]

    10th rank in All India Engineering Entrance among 730,000 students [2004]

    Indian National Physics Olympiad Gold medalist [2004]

    Regional Mathematics Olympiad Silver medalist [2003]

    National Talent Search Examination Scholarship [2002]

    Relevant Publications:

    S Jonnalagadda, L Tari, J Hakenberg, G Gonzalez. Towards Effective Sentence Simplification for Automatic Processing of Biomedical Text. NAACL 2009

    S Jonnalagadda, P Topham, G Gonzalez. Towards Automatic Extraction of Social Networks of Organizations in PubMed Abstracts. GTBN workshop in IEEE BIBM 2009

    S Jonnalagadda, G Gonzalez. Sentence Simplification Aids Protein-Protein Interaction Extraction. LBM 2009

    S Jonnalagadda, P Topham, G Gonzalez. ONER: Tool for Organization Named Entity Recognition from Affiliation Strings in PubMed Abstracts. LBM 2009

    S Jonnalagadda, R Leaman, T Cohen, G Gonzalez. A Distributional Semantics Approach to Simultaneous Recognition of Multiple Classes of Named Entities. CICLing 2010, LNCS 6008

    J Hakenberg, R Leaman, V Nguyen, S Jonnalagadda, et al. Efficient extraction of protein-protein interactions from full-text articles. Accepted by IEEE/ACM TCBB. 2010.

    S Jonnalagadda, G Gonzalez. BioSimplify: an open source sentence simplification engine to improve recall in automatic biomedical information extraction. Submitted to AMIA 2010


    Team: Dr. Graciela Gonzalez (Co-PI)

    Assistant Professor in Biomedical Informatics at Arizona State University

    Research Interests: natural language processing, knowledge representation, and translational bioinformatics

    NSF panelist

    Member of Biomedical Library and Informatics Review Committee (BLIRC)

    Director of Data Management and Statistics Core of the Arizona Alzheimer's Disease Center

    Director of DIEGO: Data Integration and Extraction of Genomic and Clinical Ontologies

    Will oversee the project

    Will provide other student researchers from her lab, who come from informatics and clinical backgrounds


    Team: Dr. Trevor Cohen (Co-PI)

    Assistant Professor, School of Health Information Sciences, University of Texas, Houston

    Research Interest: Empirical distributional semantics

    One of the developers of the Semantic Vectors package

    Proposed a novel word space model called Reflective Random Indexing

    Advise on use of distributional semantics methods

    Offered a 16 GB quad-core Opteron server primarily for this project


    Team: Mr. Robert Leaman (Co-PI)

    Member of DIEGO lab

    Bachelor of Science degree in Computer Science from Brigham Young University

    Several years in industry

    Ph.D. student in Computer Science at Arizona State University

    Research interests: Computational Linguistics, text mining and Named Entity Recognition

    Developed BANNER, an open-source biomedical NER system

    Mr. Leaman and Siddhartha have been working together in the BioCreative shared task


    Time Plan

    Month 1: Evaluate different word space models along with their parameters to empirically discover the best vector-based kernel for biomedical and clinical text.

    Month 2: Build a preliminary system for concept extraction that uses the SimFind learning algorithm and the optimal kernel discovered above, using the clinical corpus provided by i2b2/VA for the shared task.

    Month 3: Statistically ensemble the outputs of SimFind along with other morphological and contextual features into a machine-learning framework.

    Month 4: Reduce the number of term vectors in the distributional semantics model for unlabeled data and achieve efficient integration of the distributional features of MEDLINE and unlabeled clinical documents into the system

    Month 5: Adapt our protein-protein interaction system for relationship extraction on clinical text and introduce the novel features of fuzzy pattern matching and bag-of-simplified-sentences

    Month 6-8: Create the symbiotic platform with clinicians


    Thanks

    Site Visit Team

    Members of DIEGO lab

    Faculty and students of BMI

    Questions?


    Appendix


  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    46/147

    Novel Approach to Biomedical Text Mining: Case Study on PPI Extraction

    Siddhartha Jonnalagadda

    PhD Candidate

    Department of Biomedical Informatics

    "Although the world is full of suffering, it is full also of the overcoming of it." (Helen Keller)

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    47/147

    Motivation

    Information extraction from biomedical text is still an open problem

        BioCreative Protein Identification: 42.9%

        BioCreative PPI Extraction: 22.1%

    Exploit Discourse Analysis Approach

    Application of Distributional Semantics

    Need for high-performance, scalable, open-source, and adaptable BioNLP systems

    Integrate NLP, Linguistics & Machine Learning

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    48/147

    Complexity of Biomedical Sentences

    Compared to regular English:

    1. More words per sentence

    2. Inconsistent use of nouns and partial words

    3. Higher perplexity measures

    4. Greater lexical density

    5. More relative clauses and prepositional phrases

    6. Specialized names (e.g. p53, c-Abl)

    7. Chemical names with commas and parentheses (e.g. 1,25(OH)2D3)

    8. More coordination ellipsis (e.g. alpha- and beta-catenin)

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    49/147

    Hypothesis

    1. Removing complexity from biomedical sentences helps NLP systems unmask relationships

    2. Random-indexing-based measures are effective, and more efficient in comparing patterns of words or sentences than traditional machine learning approaches using non-semantic features

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    50/147

    Thesis ONTOLOGY

    Protein-Protein Interaction Extraction
        Named Entity Recognition
            SimFind
            BANNER + SimFind
        Normalization
            External Features
                * Affiliation
                * Author
        Relationship Extraction
            Discourse Analysis
                Bag of Simplified Sentences model ~ Shot-gun sequencing
            Distributional Semantics
                Word-order preserving pattern extraction

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    52/147

    Protein-Protein Interactions

    Central tenet of modern translational and genomic research

    Discovery methods: mass spectrometry, immunoprecipitation, yeast two-hybrid (Y2H), and, recently, domain-based computational techniques

    Increasingly important for understanding human diseases at the system-wide and genomic levels

        Example: pathogenesis of Huntington's disease

    Strong functional correlation with genes

        Create a positive bias for genome-wide association analyses and reduce the computational burden

    Drug discovery, Disease prognosis, Genetic Epidemiology,

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    53/147

    PPI Resources

    Numerous publicly available databases, mostly human-curated

        Examples: HPRD, BioGrid, BIND, MINT, DIP, Reactome, UniHi, HPID, IntAct, STRING, GeneNetwork

    Manual curation, despite years of effort, has only made a small dent (around 7%)

    Text mining systems that automatically extract PPIs are available

        GENIES, BioRAT, GeneWays, MedScan, YAPPIE, AKANE

    Online tools for biologists

        PIE, SPIES, Whatizit, RelEx, PolySearch, PubGENE, CBioC

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    54/147

    Why another Novel Approach?

    F-score of the best system in BioCreative II (2007): 30%

    F-score of the best system in BioCreative II.5 (2009): 22% (different test set and online competition)

    Reason: many systems reach an 80-90% F-score on a corpus of fewer than 10k sentences, but that doesn't scale to random documents from (say) PubMed Central

        Humans are linguistically creative and paraphrase concepts by varying both vocabulary and syntax

        Distributional semantics addresses the vocabulary issue

        Sentence-simplification-based discourse analysis takes care of the syntax issue

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    56/147

    Thesis ONTOLOGY

    Protein-Protein Interaction Extraction

    Named Entity Recognition

    SimFind

    BANNER + SimFind

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    57/147

    Named Entity Recognition

    Task:

    Locate names in natural language text

    Specify their type

    Example entities:

    Newswire: people, organization, location

    Biomedical: gene, protein, species, disease, drug

    Motivation: e.g. relationship extraction; not possible to perform manually

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    58/147

    Biomedical Named Entity Recognition

    Examples from GENIA gold standard:

    "IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase."

    Lexicon                   Semantics
    IL-2 gene expression      Other name
    IL-2 gene                 DNA domain or region
    NF-kappa B activation     Other name
    NF-kappa B                Protein molecule
    CD28                      Protein molecule
    5-lipoxygenase            Protein molecule

    Multi-labeling problem

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    59/147

    Semantics in Biomedical NER

    Early results: dictionary-based (Settles 2004)

        Usually does not help (!)

        Dictionary gives a binary true / false

    Alternating Structure Optimization (Ando 2007)

        Uses 5 million Medline abstracts as unlabeled data

        Good performance

        Too computationally intensive

    Word Clustering (Finkel 2009)

        Improvement not reported

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    60/147

    Distributional Semantics

    Main types: probabilistic & geometric

    Geometric methods represent terms as vectors in an N-dimensional space

        LSA uses a term-document matrix

        HAL uses a term-term matrix

        Schütze's Wordspace uses a matrix of terms against four-grams of words

    A large number of dimensions implies high computational & storage cost

    Need dimensionality reduction

        Example: LSA uses SVD

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    61/147

    Random Indexing

    Geometric method with reduced dimensionality

        Generates the reduced matrix directly

        Johnson-Lindenstrauss lemma: distances between points in a vector space are approximately preserved when projected into a reduced-dimensional subspace of sufficient dimensionality

    Computationally efficient

        O(n) in the size of the corpus

        LSA (SVD) is O(n^3)

        Allows efficient incremental updates

    Accuracy comparable to LSA

        e.g. performs as well as SVD methods on the TOEFL synonym test

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    62/147

    Random Indexing

    Each context is assigned a vector

        High dimensional: n ≈ 1000 (not that high)

        Sparse: 1% of entries assigned {+1, -1}

        Randomly generated; zero-sum

    The large number of possible permutations implies that the vectors will be close to orthogonal

    Semantic term vectors are then just the linear sum of the term vectors in the sliding-window contexts where they occur
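    The construction described on the two Random Indexing slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Semantic Vectors implementation: the function names (`elemental_vector`, `train`, `cosine`), the toy sentences, and the default window of 2 are all hypothetical, and the sparsity (10 non-zero entries out of 1000) simply follows the "1%" figure on the slide.

```python
import math
import random
from collections import defaultdict


def elemental_vector(dim, nonzero, rng):
    """Sparse random vector: nonzero/2 entries of +1 and nonzero/2 of -1 (zero-sum)."""
    vec = [0] * dim
    for k, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if k % 2 == 0 else -1
    return vec


def train(sentences, dim=1000, nonzero=10, window=2, seed=0):
    """Build elemental and semantic vectors from tokenized sentences."""
    rng = random.Random(seed)
    elemental = {}
    semantic = defaultdict(lambda: [0] * dim)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                neighbour = tokens[j]
                if neighbour not in elemental:
                    elemental[neighbour] = elemental_vector(dim, nonzero, rng)
                # The semantic vector is the running sum of neighbour elemental vectors.
                vec = semantic[term]
                for k, value in enumerate(elemental[neighbour]):
                    vec[k] += value
    return elemental, dict(semantic)


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


sentences = [
    ["staphylococcus", "aureus", "infection"],
    ["staphylococcus", "epidermidis", "infection"],
    ["blood", "pressure", "systolic"],
]
elemental, semantic = train(sentences)
```

    In this toy corpus "aureus" and "epidermidis" occur in identical contexts, so their semantic vectors coincide, while "pressure" shares no context with either and stays close to orthogonal. Because each new sentence only adds to existing sums, the model can be updated incrementally, which is the property the slides contrast with SVD.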

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    63/147

    Random Indexing: Visualization

    Semantic term vector for "expression" = sum(elemental vectors)

    [Figure: each word of the text (il-2, gene, expression, and, nf) is shown with its elemental vector, and the elemental vectors are summed (SUM).]

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    64/147

    Term-Term Similarity Example

    Staphylococcus          Antidepressants         Pressure
    0.61: aureus            0.41: tricyclic         0.34: blood
    0.32: methicillin       0.33: antidepressant    0.33: systolic
    0.29: epidermidis       0.18: reuptake          0.28: pressures
    0.23: coagulase         0.17: tcas              0.28: mmhg
    0.21: mrsa              0.15: tricyclics        0.26: diastolic
    0.18: staphylococci     0.14: ssris             0.25: hg

    Source: T. Cohen and D. Widdows, "Empirical distributional semantics: Methods and biomedical applications," Journal of Biomedical Informatics, vol. 42, 2009, pp. 390-405.

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    65/147

    Encoding Word Order

    Sequential structure of language is often important

        "Migraines cause nausea" ≠ "nausea causes migraines"

    Can be captured in RI using a permutation operation

        Creates a new orthogonal vector

        Reversible; can recreate the original vector
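    One simple way to realize such a permutation is a cyclic rotation of the elemental vector by the word's offset from the focus term. The sketch below is an assumption-laden illustration (the rotation is one possible permutation; `permute`, `context_vector`, and the toy vocabulary are hypothetical names, and the focus word itself is skipped):

```python
import random


def elemental_vector(dim, nonzero, rng):
    """Sparse random vector with equal numbers of +1 and -1 entries."""
    vec = [0] * dim
    for k, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if k % 2 == 0 else -1
    return vec


def permute(vec, shift):
    """Cyclically rotate vec by `shift` positions; rotating by -shift undoes it."""
    shift %= len(vec)
    return vec[-shift:] + vec[:-shift] if shift else list(vec)


def context_vector(tokens, i, elemental, window=2, ordered=True):
    """Sum the (optionally permuted) elemental vectors of the neighbours of tokens[i]."""
    total = [0] * len(next(iter(elemental.values())))
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j == i:
            continue
        vec = elemental[tokens[j]]
        if ordered:
            vec = permute(vec, j - i)  # encode the relative position j - i
        total = [a + b for a, b in zip(total, vec)]
    return total


rng = random.Random(0)
elemental = {w: elemental_vector(1000, 10, rng) for w in ("migraines", "cause", "nausea")}
a = ["migraines", "cause", "nausea"]
b = ["nausea", "cause", "migraines"]
```

    Without the permutation, the context of "cause" is the same bag of words in both sentences; with it, the two contexts produce different vectors, which is the distinction the slide draws.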

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    66/147

    RI + Word Order: Visualization

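    Semantic term vector for "expression" = sum(permuted vectors)

    [Figure: the elemental vector of each word of the text (il-2, gene, expression, and, nf) is permuted according to its offset from "expression" (-2, -1, 0, +1, +2), and the permuted vectors are summed (SUM).]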

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    67/147

    SimFind Algorithm

    SimFind(targetToken, Line) {
        List simSentences = getSimilarSentences(Line, 100);
        List goldenTokenLabels = getTokenLabels(simSentences);
    STEP1:
        FOREACH (goldenTokenLabel)
            IF (goldenTokenLabel has targetToken as token)
                return goldenTokenLabel;
    STEP2:
        IF (targetToken IN STOPLIST) return none;
        terms = 1;
    STEP3:
        terms *= 10;
        List equivTokens = getSimWords(targetToken, terms);
        FOREACH (equivToken)
            FOREACH (goldenTokenLabel)
                IF (goldenTokenLabel has equivToken as token) return goldenTokenLabel;
        IF (simIndex > 0.5)
            goto STEP3;
        return none;
    }
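    The pseudocode above can be turned into a runnable Python sketch. This is an interpretation, not the original implementation: the three lookups are passed in as hypothetical callables (`similar_sentences`, `token_labels`, `similar_words`), `simIndex` is assumed to be the similarity of the least similar word retrieved so far, and the loop additionally stops once the vocabulary is exhausted (a safeguard the slide does not mention).

```python
def sim_find(target, line, similar_sentences, token_labels, similar_words,
             stoplist=frozenset(), threshold=0.5):
    """Label `target` using labels of tokens found in sentences similar to `line`.

    similar_sentences(line, n) -> list of sentences          (hypothetical helper)
    token_labels(sentences)    -> dict mapping token -> label (hypothetical helper)
    similar_words(token, n)    -> [(token, similarity)], most similar first
    """
    gold = token_labels(similar_sentences(line, 100))

    # STEP 1: the token itself is labeled in a similar sentence.
    if target in gold:
        return gold[target]

    # STEP 2: stop words never receive a label.
    if target in stoplist:
        return None

    terms = 1
    while True:
        # STEP 3: widen the neighbourhood tenfold and look for a labeled neighbour.
        terms *= 10
        neighbours = similar_words(target, terms)
        for token, _similarity in neighbours:
            if token in gold:
                return gold[token]
        exhausted = len(neighbours) < terms  # assumption: no more words to retrieve
        if exhausted or neighbours[-1][1] <= threshold:
            return None
```

    With the sample on the "Sample Similar Tokens" slide, the first four neighbours of "HU" are not labeled in the similar sentences, so the label comes from the fifth, "ald".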

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    68/147

    System Architecture

    [Figure: system architecture. Labeled components: GENIA; Lucene tokenizer; stop word list; Apache Lucene / Semantic Vectors; random index vectors; training with annotation; testing without annotation; testing with annotation; SimFind.]

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    69/147

    Sample Similar Sentences

    SENTENCE: This activation was due to the translocation of p65 and c-Rel NF.kappa B proteins from cytoplasmic stores to the nucleus, where they bound the kappa B sequence of the IL-2R alpha promoter either as p50.p65 or as p50.c-Rel heterodimers.

    1. The active nuclear form of the NF-kappa B transcription factor complex is composed of two DNA binding subunits, NF-kappa B p65 and NF-kappa B p50, both of which share extensive N-terminal sequence homology with the v-rel oncogene product.

    2. Transcriptional activation of the human TF gene in monocytic cells exposed to bacterial lipopolysaccharide (LPS) is mediated by binding of c-Rel/p65 heterodimers to a kappa B site in the TF promoter.

    3. In contrast to induction of STATs by cytokines, the IRF-1 GAS-binding complex activated by CD40, TNF-alpha, or EBV contains Rel proteins, specifically p50 and p65.

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    70/147

    Sample Similar Tokens

    SENTENCE: These results strongly suggest that HU induces both transcriptional and post-transcription regulation of c-jun during erythroid differentiation.

    RESULT: HU is assigned label other_organic_compound

    Rank  Token          Label(s)                             Present in N most similar sentences?
    1     dexamethasone  lipid                                No
    2     tpa            other_name & other_organic_compound  No
    3     jun            DNA_domain_or_region                 No
    4     ap-1           DNA_domain_or_region                 No
    5     ald            other_organic_compound               Yes

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    71/147

    Sample SimFind Output

    Token        Label
    il-2         other_name & protein_molecule
    gene         other_name & DNA_domain_or_region
    expression   other_name
    and          none
    nf           protein_complex
    kappa        other_organic_compound & protein_molecule
    b            other_organic_compound & protein_molecule
    activation   none
    through      none
    cd28         protein_molecule
    requires     none
    reactive     inorganic
    oxygen       inorganic
    ...

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    72/147

    Details

    Lucene tokenizer

    Sliding window model

        Dimensionality: 200

        Window size: 11

    No stop words for the vector space model

    Stop words for SimFind: 421, derived from the Brown corpus

    IO tagging scheme

    Implemented using the open-source Semantic Vectors package

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    73/147

    GENIA Corpus

    400,000 words from 2000 PubMed abstracts

    100,000 annotations

    47 entity types, hierarchical

    17% of the entities are embedded in another entity

    Previous uses of this corpus recognized

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    74/147

    GENIA Ontology

    Root
        source
            natural
                organism
                    multi-cell
                    mono_cell
                    virus
                body_part
                tissue
                cell_type
            artificial
                cell_line
                other_artificial_source
        substance
            atom
            compound
                organic
                    amino_acid
                        protein
                            protein_molecule
                            protein_family_or_group
                            protein_domain_or_region
                            protein_substructure
                            protein_subunit
                            protein_complex
                            protein_N/A
                        peptide
                        amino_acid_monomer
                    nucleic_acid
                        DNA
                            DNA_molecule
                            DNA_family_or_group
                            DNA_domain_or_region
                            DNA_substructure
                            DNA_N/A
                        RNA
                            RNA_molecule
                            RNA_family_or_group
                            RNA_domain_or_region
                            RNA_substructure
                            RNA_N/A
                        polynucleotide
                        nucleotide
                    lipid
                    carbohydrate
                    other_organic_compound
                inorganic
        Other_name

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    76/147

    Evaluation

    Used 5 x 2 cross-validation on the GENIA corpus

    Tokens may have more than one label due to embedded entities

    Precision, recall and F-score calculated with respect to fragment match

        Must find the correct node or a child; no partial credit for ancestors or cousins

    Overall micro-averaged F-score: 67.3%

    More than half of the entities have an F-score greater than 50.0%

    RESULTS

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    77/147

    Entity             Precision (%)   Recall (%)   F-score   Random F-score
    Bio-entity         78.9            82.5         80.7      26.22
    Substance          77.0            79.6         78.3      20.92
    Organic compound   77.0            79.5         78.2      20.82
    Compound           77.0            79.5         78.2      20.82
    Amino acid         69.4            71.1         70.3      13.69
    Protein            69.2            71.0         70.1      13.37
    Lipid              66.1            67.0         66.5      0.66
    Virus              65.6            67.3         66.4      0.86
    Source             61.4            66.2         63.7      5.62
    Atom               62.0            60.2         61.1      0.10
    Nucleotide         57.0            64.4         60.5      0.05
    Organism           59.6            58.8         59.2      1.23
    Carbohydrate       63.2            45.7         53.1      0.05
    DNA                48.3            52.6         50.4      5.31
    Cell type          42.7            50.7         46.3      2.14
    Cell line          44.0            44.9         44.5      1.87
    RNA                47.0            41.2         43.9      0.61
    Body part          39.6            45.0         42.1      0.10
    Peptide            41.9            32.7         36.7      0.15
    Polynucleotide     44.9            27.0         33.7      0.10
    Tissue             22.8            23.7         23.3      0.20
    Overall Score      66.3            68.4         67.3

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    78/147

    Future Work in NER

    Integrate into BANNER

        Corpus: GeneTag

        If time permits: inter-corpus evaluation

    The best-performing protein NER system, Ando (2007), uses 5 million PubMed abstracts as unlabeled data

        Improvement from using semantic features from unlabeled data: 2%

        Hopefully, we will make it too!

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    80/147

    Thesis ONTOLOGY

    Protein-Protein Interaction Extraction

    Normalization

        External Features

            * Affiliation

            * Author

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    81/147

    Protein Normalization

    It is useful (though not essential) to identify the interacting proteins with a standard ID

        Required in BioCreative

    Extending our motto of proposing novel approaches to counter creative human authors, I plan to use two additional external features along with our lab's system:

        Author names of the closest PubMed abstract

        Geopolitical information of the first author

    The idea is that words in papers from similar people (same last name) and similar places have the same sense

    The closest PubMed abstract is found using a distributional-semantics-based index

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    82/147

    Finding Geopolitical Information

    Use the affiliation sentence in the closest PubMed abstract

    Different features available

        Country

        City

        State

        Address, Email, URL

        Less useful information like names of buildings

        Organization name and sub-organization names

    The last ones conform least to a general pattern because of the idiosyncrasies of the varied people responsible for naming an organization

        Perhaps they need normalization too

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    83/147

    Multiple Layers, Multiple Rules,

    Multiple Dictionaries

    "Neti, Neti": elimination of untruth till one lands up at the feet of absolute truth

    If finding an organization is difficult, let us find what is not an organization

    Easier things first

    Each subtask: multiple rules and multiple dictionaries

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    84/147

    Country Detection

    1. Exact or approximate (hidden within a phrase) search for the country name, with or without diacritics

    2. Exact search for names of important cities

    3. Exact search for region names

    4. Exact search for city names

    5. Email aliases

    6. Approximate search for names of important cities

    7. Approximate search for region names

    8. Approximate search for city names
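    The ordered rules above behave like a cascade in which the first rule that fires decides the country. The sketch below illustrates only that control flow with three of the rule types (country name without diacritics, city dictionary, email domain). The dictionaries, their contents, and all function names are hypothetical placeholders, and the approximate-search rules are omitted.

```python
import re
import unicodedata

# Hypothetical miniature dictionaries; a real system would use full gazetteers.
COUNTRIES = {"usa": "USA", "united states": "USA", "japan": "Japan", "mexico": "Mexico"}
CITIES = {"durham": "USA", "rochester": "USA", "tokyo": "Japan"}
EMAIL_DOMAINS = {"jp": "Japan", "mx": "Mexico"}


def strip_diacritics(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_in_dictionary(text, dictionary):
    """Exact whole-word search for any dictionary key inside the text."""
    for name, country in dictionary.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            return country
    return None


def by_email(text):
    """Map the top-level domain of an email address to a country."""
    match = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)*\.([a-z]{2,})\b", text)
    return EMAIL_DOMAINS.get(match.group(1)) if match else None


# Rules are tried in order; easier, more reliable rules come first.
RULES = [
    lambda text: find_in_dictionary(text, COUNTRIES),
    lambda text: find_in_dictionary(text, CITIES),
    by_email,
]


def detect_country(affiliation):
    text = strip_diacritics(affiliation).lower()
    for rule in RULES:
        country = rule(text)
        if country:
            return country
    return None
```

    Ordering the rules from most to least reliable follows the "easier things first" idea from the previous slide: a later, weaker rule is consulted only when every stronger one has failed.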

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    85/147

    Organization Detection

    Most of the phrases are already identified as not an organization

    Bootstrap acronym and replace

    Check O-Key

    Check person name or place name

        Person name: bootstrap

        Place name: dictionary

    And many more subtle rules

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    86/147

    Error Analysis* OCH data

    Total number of countries: 4910

    Number of countries detected: 4758

        True positives: 4746

        False positives: 12

        False positives after correction: 1

    Number of countries not detected: 152

        True negatives: 51

        False negatives: 102

        False negatives after correction: 23 (all of them have some suggestions)

    *by 3 annotators: Divya, George and Siddhartha

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    87/147

    Results Summary for Country Detection

    Precision, the percentage of returned results that are correct = TP/(TP+FP) = 99.8%

        Precision after corrections = 99.98%

    Recall, the fraction of all correct results that are returned = TP/(TP+FN) = 97.9%

        Recall after corrections = 99.54%

    F-measure, the harmonic mean of precision and recall = 2PR/(P+R) = 98.8%

        F-measure after corrections = 99.76%

    For the second best known system (Yu, et al.):

        Precision = 94.0%; Recall = 92.1%; F-measure = 93.0%
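    As a quick check of the three formulas, the uncorrected counts from the error analysis (4746 true positives, 12 false positives, 102 false negatives) can be plugged in directly. The helper name `precision_recall_f` is hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) as fractions in [0, 1]."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Country detection counts before corrections.
p, r, f = precision_recall_f(tp=4746, fp=12, fn=102)
```

    These counts reproduce the reported precision, recall and F-measure to within rounding.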

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    88/147

    Analysis* using Staph Aureus data

    Organizations:

        Number of affiliation sentences = 4000

        True Positives = 3989

        False Positives = 0

        False Negatives = 11

        Precision = 100%; Recall = 97.5%; F-measure = 98.7%

    States

        True Positives = 3528

        False Positives = 470

        False Negatives = 2

        Precision = 88.2%; Recall = 99.9%; F-measure = 93.7%

    Cities

        True Positives = 3611

        False Positives = 2

        False Negatives = 387

        Precision = 99.9%; Recall = 90.3%; F-measure = 94.9%

    For the second best known system (Yu, et al.): Precision = 86.8%; Recall = 91.3%; F-measure = 89.0%

    *Done by: Divya

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    89/147

    Organization Normalization

    Two types of named entities

        Described entities: those that uniquely identify a real-world organization, given the GPE

            All organizations containing a person name, a place name, or a directional modifier

            Examples: Jerome Lipper Center for Multiple Myeloma, University of Texas, and North Western University

        Descriptor entities, also called sub-organizations: those that do not uniquely identify a real-world organization except in the presence of a Described entity, and whose primary role is to give more specific information about a Described entity

            Examples: School of Informatics and Dept. of Biomedical Informatics

    We are interested only in Described entities, also called organizations

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    90/147

    Ambiguities

    No polysemy at entity class: single entity

    Polysemy within entity class:

        Mayo Clinic, USA

        Mayo Clinic, Rochester, USA

    Synonymy at word level because of NSWs

        Formatting errors, mainly because of OCR in MARS

        Spelling mistakes

        (Abbreviations are taken care of at the NER stage)

    Synonymy at inter-word level due to lack of consensus in the choice of words when referring to an organization

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    91/147

    Example of Synonymy

    Washington University School of Medicine

    Washington University School of Medicine and St. Louis Children's Hospital

    School of Medicine

    Washington University School of Medicine at Barnes-Jewish Hospital

    Barnes-Jewish Hospital at Washington University School of Medicine

    Washington University

    Washington University School of Medicien*

    Washington University School of Medicine and Metropolitan St. Louis Psychiatric Center

    Barnes Retina Institute and Washington University School of Meidcine*

    Division of Gastroenterology Washington University School of Medicine St. Louis

    *ibid

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    92/147

    Common Approach

    Compare against a list or dictionary of organizations

        Used in gene normalization

            Map gene names mentioned in PubMed/MEDLINE abstracts to Entrez Gene identifiers

        Unlike genes, many organizations get renamed and some become defunct

        No community interest in maintaining a database

    Our approach:

        Automatically build a database of organization clusters, OrgDB, from 100,000 randomly selected affiliation sentences from PubMed published between 1998 and 2008

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    93/147

    Clustering

    Entries in OrgDB

        Centroid string: has the least sum of distances in DIST

        List of all organizations in each entry or cluster

        DIST: matrix containing inter-component distances

        PubMed IDs of the components

        GPE: city, state and country of the cluster

    An organization is added if it is similar according to the String Similarity Metric

        And all entries in the cluster are updated
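    A cluster entry of this kind can be sketched as a small data structure that keeps the member names and their pairwise distance matrix, and derives the centroid as the member with the least row sum. The class name `Cluster`, its fields, and the toy distance used in the usage example are assumptions for illustration; PubMed IDs and GPE fields are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    names: list = field(default_factory=list)  # organizations in this cluster
    dist: list = field(default_factory=list)   # square matrix of pairwise distances

    def add(self, name, distance):
        """Add a member and extend the distance matrix by one row and one column."""
        row = [distance(name, other) for other in self.names]
        for existing_row, d in zip(self.dist, row):
            existing_row.append(d)
        self.dist.append(row + [0])
        self.names.append(name)

    def centroid(self):
        """Member with the least sum of distances to all other members."""
        sums = [sum(row) for row in self.dist]
        return self.names[sums.index(min(sums))]


# Toy distance (difference in length), used only to exercise the structure.
toy_distance = lambda a, b: abs(len(a) - len(b))
cluster = Cluster()
for member in ("aa", "aaaa", "aaa"):
    cluster.add(member, toy_distance)
```

    Recomputing the centroid after every insertion is what "all entries in the cluster are updated" amounts to in this sketch.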

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    94/147

    Two types of Sequence Alignment

    Global sequence alignment

        Implemented by the Needleman-Wunsch algorithm, using dynamic programming

    Local sequence alignment

        Implemented by the Smith-Waterman algorithm

        Also uses dynamic programming

    Global sequence alignment is too strict

    Local sequence alignment is inaccurate
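    The contrast between the two alignments can be made concrete with a word-level sketch of both dynamic programs. The scoring scheme (match +1, mismatch -1, gap -1) and the example names are assumptions chosen for illustration, not the parameters used in the original system.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score: every word of both sequences must be accounted for."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score: best-scoring pair of contiguous regions."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


full = "washington university school of medicine".split()
short = "school of medicine".split()
other = "harvard school of medicine".split()
```

    Under this scoring, the global score of `full` against `short` drops to 1 because the two unmatched words are penalized (too strict), while the local score is 3 for both `full` and `other` against `short`, so local alignment alone cannot tell the two universities apart (inaccurate).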

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    95/147

    Example


  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    96/147

    Limitation of Global Alignment


  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    97/147

    Limitation of Local Alignment


  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    98/147

    Local Learning

    Use local information from the training data to further enhance the value of the training set

        Have a Tight String Similarity Metric

        Understand the data by finding connected components

        Precise computation

    We call it "Recalculation through self-training"

        Inspired by the Charniak-McClosky parser (Brown University), currently the best

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    99/147

    Tight String Similarity (TSS) Metric

    Levenshtein distance between the two organization names

        NOT at the character level

        BUT at the word level

        AFTER removing stop words

    Criteria for two words a & b to be same:

    Parameters for the Levenshtein distance

        Penalty of a gap = length of the word

        Penalty of a mismatch = sum of the lengths of the two words

    Sentences are the same if the distance between them is not more than 4
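    A word-level Levenshtein distance with these penalties can be sketched as follows. Two pieces are assumptions: the word-equality criterion did not survive on the slide, so case-insensitive exact match is used as a stand-in (replaceable through the hypothetical `same_word` parameter), and the stop-word list is a small placeholder.

```python
STOP_WORDS = frozenset({"of", "the", "and", "at", "for", "in"})  # hypothetical list


def tss_distance(name_a, name_b, same_word=lambda a, b: a == b):
    """Word-level Levenshtein distance with length-weighted penalties."""
    a = [w for w in name_a.lower().split() if w not in STOP_WORDS]
    b = [w for w in name_b.lower().split() if w not in STOP_WORDS]

    # First row: inserting each word of b costs its length (gap penalty).
    prev = [0]
    for word in b:
        prev.append(prev[-1] + len(word))

    for wa in a:
        cur = [prev[0] + len(wa)]
        for j, wb in enumerate(b, start=1):
            # Mismatch penalty = sum of the lengths of the two words.
            substitute = prev[j - 1] + (0 if same_word(wa, wb) else len(wa) + len(wb))
            delete = prev[j] + len(wa)
            insert = cur[j - 1] + len(wb)
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[-1]


def same_organization(name_a, name_b, threshold=4):
    """Names are treated as the same if their distance is not more than the threshold."""
    return tss_distance(name_a, name_b) <= threshold
```

    Because penalties grow with word length, dropping or adding a short token costs little, while a missing long word such as a city name pushes the pair well past the threshold of 4. That is what makes the metric "tight".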

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    100/147

    Recalculation

    TSS addresses synonymy caused by NSWs

    Synonymy still exists among clusters in OrgDB

        Due to the lack of consensus in the choice of words

        Example: "The David Geffen School of Medicine at The University Of California" and "DG School of Medicine at The University Of California at Los Angeles"

    Recalculate: synonyms of the current cluster

        Equivalent to finding the connected component in the corresponding graph

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    101/147

    Seeing it as a Graph

    OrgDB is equivalent to an undirected graph, OrgG

        Vertices: clusters

        Edges: between any two clusters

            One of which almost contains the other, like "David Geffen School of Medicine" and "DG School of Medicine"

            Both have the same GPE

    "Almost contains" if the Extended Smith-Waterman Score (ESS) is more than 0.90

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    102/147

    Finding Connected Component

    Initialization: add the vertex (cluster) we are concerned with

    Propagation: iteratively visit each unvisited vertex (cluster) closest to the root

        Add all the vertices (clusters) that are adjacent to it and not already in the connected component

    Pruning: from depth 3, add only those vertices

        that have an organization that collaborated on an article with an organization in one of the vertices (clusters) already in the connected component

        Prevents errors
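    The three steps amount to a breadth-first search with a collaboration filter applied to distant vertices. The sketch below assumes the root is at depth 0, represents the graph as an adjacency dictionary, and takes the collaboration check as a hypothetical callable `collaborated(a, b)`; none of these representation choices come from the slides.

```python
from collections import deque


def connected_component(graph, root, collaborated, prune_depth=3):
    """Breadth-first expansion from `root` with pruning from `prune_depth` onward.

    graph:        dict mapping each cluster to its adjacent clusters
    collaborated: callable(a, b) -> True if the clusters co-authored an article
    """
    component = {root}              # Initialization
    queue = deque([(root, 0)])
    while queue:
        vertex, depth = queue.popleft()   # Propagation: closest to the root first
        for neighbour in graph.get(vertex, ()):
            if neighbour in component:
                continue
            # Pruning: distant vertices need a collaboration with the component.
            if depth + 1 >= prune_depth and not any(
                collaborated(neighbour, member) for member in component
            ):
                continue
            component.add(neighbour)
            queue.append((neighbour, depth + 1))
    return component


# Toy chain O1 - O2 - O3 - O4 - O5; only O4 and O2 share an article.
graph = {
    "O1": ["O2"],
    "O2": ["O1", "O3"],
    "O3": ["O2", "O4"],
    "O4": ["O3", "O5"],
    "O5": ["O4"],
}
pairs = {frozenset(("O4", "O2"))}
collaborated = lambda a, b: frozenset((a, b)) in pairs
```

    In the toy chain, O4 sits at depth 3 and is admitted only because it collaborated with O2, while O5 has no collaboration with the component and is left out. Unrelated organizations that merely share words are kept out in the same way.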

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    103/147

    An Example

    Input:

        PubMed ID: 16849888

        Affiliation sentence: Duke University Medical Center and Duke Clinical Research Institute, Durham, NC 27710, USA.

    NER:

        Organization: Duke University Medical Center and Duke Clinical Research Institute (O1)

        Country: USA

        State: North Carolina

        Zip Code: 27710

        City: DURHAM

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    104/147

    Example Continued

    Adding to OrgDB or OrgG

        No cluster in OrgDB is close to O1

        Add a new cluster for O1

    Recalculation Step 1

        Add O1 to the connected component, CC

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    105/147

    Example Continued

    Recalculation Step 2

        Organizations adjacent to O1 are:

            Duke Clinical Research Institute (O2)

            Duke University Medicalcenter (O3)

            Duke University Medical Center (O4)

            Duke University (O5)

        Add these to CC

  • 7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

    106/147

    Continued

    Recalculation Step 3

    Expand all the nodes at level 2. For example, O2.

    The organization adjacent to O2 is Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute (O6). Add this to CC

    Repeat it for O3, O4 and O5.

    We get 14 more organizations in CC: O7 through O20

    Recalculation Step 4

    Consider expanding O19 to :

    Durham Veterans Affairs Medical Center

    Veterans Affairs Medical Center. An examination of

    These two didn't collaborate (in any of the 100,000 publications) with any organization in the CC.

    This justifies our not adding these organizations to the CC.

    Step 4 is continued for the rest of the organizations

    Finally!

    Connected component gave us the set of all the synonyms of the organization

    Depending on the objectives of normalization, the criterion to choose varies

    We picked the centroid string of the cluster with the largest number of publications as the normalized name

    Going by this criterion: Duke University Medical Center becomes the normalized name for Duke University Medical Center and Duke Clinical Research Institute
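
    The selection criterion above amounts to a single maximum over the clusters in the connected component. A minimal sketch, assuming each cluster is available as a (centroid string, publication count) pair; that representation and the counts below are invented for illustration and do not come from the slides.

```python
def normalized_name(clusters):
    """Return the centroid string of the cluster with the most publications.

    clusters -- list of (centroid_string, publication_count) pairs; this
                representation is an assumption made for illustration.
    """
    centroid, _ = max(clusters, key=lambda cluster: cluster[1])
    return centroid


# Publication counts below are invented purely to exercise the function.
synonym_clusters = [
    ("Duke Clinical Research Institute", 40),
    ("Duke University Medical Center", 120),
    ("Duke University", 75),
]
print(normalized_name(synonym_clusters))
```

    A different normalization objective would only change the `key` function.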

    Analysis

    Test set: obtained 4135 articles related to a study on Antiangiogenesis indexed in PubMed between 2004 and 2008

    The normalization process identified each article with a unique standard organization

    182 unique organizations were identified (13.8 articles per organization)

    Overall 13 errors; 5 were caused only by NER, which means the Normalization process alone has a precision of 99.5% with 100% recall

    Magic Mappings

    Discovering a richer set of synonyms than navetes

    (Digression) Top-10 organizations in terms of the number of publications

    (Digression) Top-10 most influential organizations

    Work Left

    Normalization is done only for USA; it needs to be extended globally

    Incorporating these features into the existing system should be easy

    Analyze the improvement in performance

    Corpus: BioCreative III training set and/or test set

    Thesis ONTOLOGY

    Protein-Protein Interaction Extraction

    Named Entity Recognition

    SimFind

    BANNER + SimFind

    Normalization

    External Features

    * Affiliation

    * Author

    Relationship Extraction

    Discourse Analysis

    Bag of Simplified Sentences model ~ Shot-gun sequencing

    Distributional Semantics

    Word-order preserving pattern extraction

    Thesis ONTOLOGY

    Protein-Protein Interaction Extraction

    Relationship Extraction

    Discourse Analysis

    Bag of Simplified Sentences model ~ Shot-gun sequencing

    Distributional Semantics

    Word-order preserving pattern extraction

    Discourse Analysis

    "Language is not merely a bag of words but a tool with particular properties. The linguist's work is precisely to discover these properties, whether for descriptive analysis or for the synthesis of quasi-linguistic systems." (Harris, 1954)

    For this, DAs break the sentence into simpler clauses.

    Even Quirk's "simple sentence" can still be a complex clause:

    We have identified a new TNF-related ligand, designated human GITR ligand (hGITRL), and its human receptor (hGITR), an ortholog of the recently discovered murine glucocorticoid-induced TNFR-related (mGITR) protein.

    DAs critically analyze discourse using the integration tool

    Discourse Analysis Tools

    Deixis Tool: ask how deictics are being used to tie what is said to context and to make assumptions about what authors already know or can figure out

    Pronoun Resolution

    Appositives

    Fill in Tool: Based on what was said and the context in which it was said, what needs to be filled in here to achieve clarity?

    Distributional semantics

    Discourse Analysis Tools

    Integration Tool: Find how clauses were integrated or packaged into main, subordinate, and embedded (relative) clauses

    What was missing and what got added?

    Example: It has been shown that LIGHT triggers apoptosis of various tumor cells including HT29 cells that express both lymphotoxin beta receptor (LTbetaR) and HVEM / TR2 receptors.

    LIGHT triggers apoptosis of various tumor cells.

    LIGHT triggers apoptosis of HT29 cells.

    HT29 cells express both lymphotoxin beta receptor and HVEM / TR2 receptors.

    Halliday's Systemic Functional Grammar

    Three ways clauses expand to sentence

    Elaborating its existing structure

    Example: Relative clause

    Extending it by addition or replacement

    Example: Coordination

    Enhancing its environment

    Example: Cause-conditional

    We will be using these guidelines to design rules for creating simpler sentences

    BioSimplify

    GOAL: Create bag of simplified sentences for automatic discourse analysis.

    Application: information extraction on biomedical text

    Existing methods: features like POS tags, parse trees and dependencies, informally known as bag-of-NLP

    BOSS: standardize the representation of grammatical information in elemental chunks

    Sentence Simplification: Motivations

    Improve human readability

    Shorter

    Grammatical

    Cohesive

    Information-preserving

    Text summarization

    Shorter

    Preserve only important

    Improve parser performance or Relationship Extraction

    Shorter

    Grammatical

    Information-preserving

    Architecture

    Noun Phrase Replacement

    Using POS tags and noun phrase chunker

    POS tags: LingPipe

    Chunker: OpenNLP

    Syntactic Simplification

    Use any parser to produce CFG penn tree

    Parser: McClosky retraining parser 88%

    Information Extraction System

    Example: PIE

    [Architecture diagram: Sentence, NP Replacement (NP Chunker), Syntactic Simplification (Parser), BOSS, Information Extraction system]

    Noun Phrase Replacement

    A noun phrase consists of an optional determinative, an optional premodification, a mandatory head, and an optional postmodification

    Noun phrase chunkers return all the noun phrases of the smallest length, thus always excluding the postmodifications

    The last word of the identified noun phrase is the head noun

    Removal of the optional determinative makes the sentence ungrammatical

    All tokens other than the head noun and the starting determinative or numeral (if it exists) are removed

    For example, the noun phrase "the recently discovered murine glucocorticoid" is replaced with "the glucocorticoid".
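
    The replacement rule above can be sketched on an already-chunked noun phrase. This is a simplified illustration, not the LingPipe/OpenNLP pipeline itself: a small closed list of determiners and a digit check stand in for POS tags, and the function name `replace_noun_phrase` is hypothetical.

```python
# A small closed list stands in for the POS tagger here (assumption).
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}


def is_numeral(token):
    """Recognize digit numerals such as 2, 1,000 or 0.5 (spelled-out numbers are not handled)."""
    return token.replace(",", "").replace(".", "", 1).isdigit()


def replace_noun_phrase(tokens):
    """Keep the starting determinative or numeral (if any) and the head noun."""
    if not tokens:
        return []
    head = tokens[-1]  # the last word of a chunked noun phrase is its head
    first = tokens[0]
    if len(tokens) > 1 and (first.lower() in DETERMINERS or is_numeral(first)):
        return [first, head]
    return [head]


phrase = "the recently discovered murine glucocorticoid".split()
print(" ".join(replace_noun_phrase(phrase)))
```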

    Syntactic Simplification

    synSimp(t), where t is the Penn tree of the given sentence:

    - Initialize simpTrees, the ordered set containing the Penn trees of all simplified sentences.

    - FOREACH subtree of t, traversed in depth-first order

        - perform the necessary simplifications at that node, i.e. the simplifications that need not be repeated for all the parents of this node

    - Add the present tree to simpTrees

    - FOREACH unprocessed tree in simpTrees

        - FOREACH subtree of that tree, traversed in depth-first order

            - perform the simplifications for this node

            - add new trees to simpTrees if applicable

    - return the sentences represented by the trees in simpTrees

    END
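
    The worklist structure of synSimp can be sketched as follows. This is a simplified reading rather than the BioSimplify implementation: trees are nested tuples of the form (label, child, ...), each rule is a function that takes a node and yields new standalone trees, and `drop_leading_pp` is a hypothetical toy rule used only to drive the loop. The loop terminates only if rules always yield strictly smaller trees.

```python
from collections import deque


def subtrees(tree):
    """Yield a tree and all of its subtrees in depth-first (pre-order) order."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def leaves(tree):
    """Return the words at the leaves of a tree, left to right."""
    words = []
    for child in tree[1:]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def syn_simp(tree, rules):
    """Apply every rule at every node until no new simplified tree appears."""
    simp_trees = [tree]          # ordered, duplicate-free collection
    unprocessed = deque([tree])
    while unprocessed:
        current = unprocessed.popleft()
        for node in subtrees(current):
            for rule in rules:
                for new_tree in rule(node):
                    if new_tree not in simp_trees:
                        simp_trees.append(new_tree)
                        unprocessed.append(new_tree)
    return [" ".join(leaves(t)) for t in simp_trees]


def drop_leading_pp(node):
    """Hypothetical toy rule: S[PP , NP VP] yields S[NP VP]."""
    if node[0] == "S" and len(node) > 3 and node[1][0] == "PP" and node[2][0] == ",":
        yield ("S",) + node[3:]


tree = (
    "S",
    ("PP", ("IN", "In"), ("NP", ("NN", "culture"))),
    (",", ","),
    ("NP", ("NNS", "cells")),
    ("VP", ("VBP", "divide")),
)
print(syn_simp(tree, [drop_leading_pp]))
```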

    Rules

    Rule: S ~ {S}.

    Condition:

    S contains NP

    S contains VP

    Explanation: Adds all simple sentences into bag

    Example: In differentiating C2C12 cells, E2F complexes switch and DNA synthesis in response to serum are prevented when MyoD DNA binding activity and the cdks inhibitor MyoD downstream effector p21 are induced.

    Result: MyoD DNA binding activity and the cdks inhibitor MyoD downstream effector p21 are induced.
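
    A small sketch of the S ~ {S} rule: walk a parse tree and add every S node that has both an NP and a VP to the bag. Trees are represented as nested tuples (label, child, ...) and "contains" is read as "has as a direct child"; both are assumptions made for illustration, and the toy tree is a shortened stand-in for the example sentence.

```python
def leaves(tree):
    """Return the words at the leaves of a tree, left to right."""
    words = []
    for child in tree[1:]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def clause_bag(tree):
    """Collect, in depth-first order, every S that has both an NP and a VP child."""
    bag = []
    if tree[0] == "S":
        labels = [child[0] for child in tree[1:] if isinstance(child, tuple)]
        if "NP" in labels and "VP" in labels:
            bag.append(" ".join(leaves(tree)))
    for child in tree[1:]:
        if isinstance(child, tuple):
            bag.extend(clause_bag(child))
    return bag


toy_tree = (
    "S",
    ("NP", ("NNS", "complexes")),
    ("VP", ("VBP", "switch"),
        ("SBAR", ("WHADVP", ("WRB", "when")),
            ("S", ("NP", ("NN", "p21")),
                  ("VP", ("VBZ", "is"), ("VP", ("VBN", "induced")))))),
)
print(clause_bag(toy_tree))
```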

    Rules

    Rule: NP[NP1 VP1*] ~ [NP1] {NP1 "can be" VP1}

    Condition:

    VP1 starts with a gerund, present participle or past participle

    Explanation: Postmodification by verb phrase

    Example:

    The cloning of members of these gene families and the identification of the protein-interaction motifs found within their gene products has initiated the molecular identity of factors (TRADD, FADD/MORT, RIP, FLICE/MACH, and TRAFs) associated with both of the p60 and p80 forms of the TNF receptor and with other members of the TNF receptor superfamily.

    Result:

    The cloning of members of these gene families and the identification of the protein-interaction motifs has initiated the molecular identity of factors

    The protein-interaction motifs can be found within their gene products.

    Rules

    Rule: NP[NP1 ADJP1] ~ [NP1] {NP1 "can be" ADJP1}

    Explanation: Postmodification by adjective phrase

    Example:

    Src homology domain-2 (SH2)/SH3 domain-containing adapters such as Grb2, Crk, and Crk-L, which interact with guanine nucleotide exchange factors specific for the Ras family.

    Result:

    interact with guanine nucleotide exchange factors.

    Guanine nucleotide exchange factors can be specific for the Ras family.

    Rules

    Rule: NP[NP1 PRN] ~ [NP1] [PRN - LRB - RRB]

    Explanation:

    Add two sentences: one with the abbreviation removed, the other with the NP replaced by the abbreviation

    Example: Coexpression of the alpha and betaL subunits of the human interferon alpha (IFNalpha) receptor is required for the induction of an antiviral state by human IFNalpha.

    Result:

    Coexpression of the alpha and betaL subunits of the human interferon alpha receptor is

    Coexpression of the alpha and betaL subunits of the human IFNalpha is

    Rules

    Rule: NP[NP1 PP] ~ [NP1]

    Explanation: Postmodification by prepositional phrase

    Example:

    To explore the role of the different domains of the betaL subunit in IFNalpha signaling, we coexpressed wild-type alpha subunit and truncated forms of the betaL chain in L-929 cells.

    Result:

    To explore the role in IFNalpha signaling, we coexpressed wild-type alpha subunit and truncated forms of the betaL chain in L-929 cells.

    Rules

    Rule: VP[MD VP1 , S*] ~ [MD VP1]

    Condition:

    S contains VP and not NP

    Explanation: Postmodification by verb phrase

    Example: T lymphocytes can be activated normally in response to either stimulus, demonstrating that the effects of the inactive CaMKIV on activation are reversible.

    Result: T lymphocytes can be activated normally in response to either stimulus.

    Rules

    Rule: NP[NP : S*] ~ [S*]

    Condition:

    S contains VP or NP

    Explanation: Section indicator

    Example: OBJECTIVE: To investigate the relationship between the expression of Th1/Th2 type cytokines and the effect of interferon-alpha therapy.

    Result:

    To investigate the relationship between the expression of Th1/Th2 type cytokines and the effect of interferon-alpha therapy.

    Rules

    Rule: S[S1 , NP VP] ~ [NP VP]

    Condition:

    S1 doesn't contain both NP and VP

    Explanation: Content clause

    Example: To characterize these pathways, we focused on changes in the cyclin-dependent kinase inhibitors and their binding partners that underlie the cell cycle arrest at senescence.

    Result:

    We focused on changes in the cyclin-dependent kinase inhibitors and their binding partners that underlie the cell cycle arrest at senescence.

    Rules

    Rule: NP[NP SBAR] ~ [NP], {SBAR -WHNP + NP}

    Condition:

    Wh-NP in the relative clause is replaced by NP from main clause

    Explanation: Relative Clause

    Example: To characterize these pathways, we focused on changes in the cyclin-dependent kinase inhibitors and their binding partners that underlie the cell cycle arrest at senescence.

    Result:

    changes in the cyclin-dependent kinase inhibitors and their binding partners.

    The cyclin-dependent kinase inhibitors and their binding partners underlie

    Rules

    Rule: VP*, SBAR+ ~ -, SBAR

    Explanation: Relative Clause

    Example: As [Ca2+]o increased, [Ca2+]i rapidly increased, as monitored by fluorometry.

    Result: As [Ca2+]o increased, [Ca2+]i rapidly increased.

    Rules

    Rule: VP[VP1 , CC VP2] ~ [VP1] [VP2]

    Explanation: Coordination of verb phrases

    Example:

    These mechanisms must be understood in order to prevent, or combat, the emergence of a virulent, multidrug-resistant form of the bacillus that would be uncontrollable by means of today's treatment strategies.

    Result:

    These mechanisms must be understood in order to prevent, the emergence of a virulent, multidrug

    These mechanisms must be understood in order to combat, the

    Rules

    Rule: VP[... , PP] ~ }, PP{

    Explanation: Postmodification by prepositional phrase

    Terminal prepositional phrase and preceding comma are removed from verb phrase

    Example: Because cell lines can lose their differentiated phenotype in culture across passages, documentation of gene expression must be determined for passage populations, for us to have knowledge of cell behavior in vitro.

    Result: Because cell lines can lose their differentiated phenotype in culture across passages, documentation of gene expression must be determined for passage populations.

    BioSimplify Parser

    The rules are perfect, so BioSimplify is as perfect as the Penn trees are.

    The Penn trees are as perfect as the performance of parsers in the biomedical domain

    Increasing parser performance (F-measure)

    Stanford (lexicalised) 72.5% (2003)

    Link Grammar ~70% (2006)

    Charniak-Lease 81% (2005)

    Charniak-McClosky 84% (2008)

    Charniak-McClosky 88% (2009)

    Calculated Penn tree databases also available

    ASU BMI (& CSE)'s PTDB parse tree database

    NLP web service provided by NCIBI

    Analysis

    Worst-case Time Complexity: O(n^2 * R), where

    n = number of tokens in the sentence

    R = number of simplification rules

    Average Time Complexity: O(n log(n) * R)

    Better than our prototype system (2009) based on the Link Grammar parser

    Time complexity: O(n^3 * R)

    Link Grammar is getting behind in the race

    The dependencies produced aren't standard like the Penn trees

    BioSimplify Accuracy: precision of 90%, recall of 99% and f-score of 95%

    Test Set: 404 sentences from AIMed

    Old Model

    Preprocessing

    Removal of sentence indicators

    Removal of phrases in parentheses

    Partial resolution of coordination ellipsis

    Gene Entity Replacement

    Noun Phrase Replacement

    Syntactic transformation

    Grammatical correctness using GRAM vector

    Old Model: GRAM vector

    Every sentence can be uniquely associated with the 2-tuple of null count and disjunct cost (n, d)

    A null count (which represents unwanted words) needs more attention than the disjunct cost (which represents less likely words)

    We define the 2-tuple (n1, d1) to be greater than (n2, d2) if and only if n1 is greater than n2, or n1 is equal to n2 and d1 is greater than d2.

    The grammatical correctness of a collection of sentences is measured by the 2-tuple of the sum of the null counts of the individual sentences and the sum of the disjunct costs of the individual sentences, respectively.

    Since null counts and disjunct costs are typically less than 10 (i.e., one-digit numbers), for the purpose of easy comparison and for capturing the 2-tuples in one dimension, we define a new cost vector GRAM which is equal to 10*UNUSED + DIS.
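
    The ordering and the GRAM score can be checked with a few lines of Python, whose built-in tuple comparison is already lexicographic. This is a sketch under one assumption: UNUSED is taken to be the null count and DIS the disjunct cost; the function names are hypothetical.

```python
def collection_cost(sentences):
    """Sum null counts and disjunct costs over (null_count, disjunct_cost) pairs."""
    total_nulls = sum(n for n, _ in sentences)
    total_disjuncts = sum(d for _, d in sentences)
    return total_nulls, total_disjuncts


def gram(null_count, disjunct_cost):
    """One-dimensional score: 10 * UNUSED + DIS (UNUSED = null count is an assumption)."""
    return 10 * null_count + disjunct_cost


a, b = (1, 2), (0, 9)
# Tuples compare by null count first, then by disjunct cost.
print(a > b, gram(*a) > gram(*b))

# The scalar agrees with the tuple order only while disjunct costs stay below 10.
c, d = (1, 1), (0, 12)
print(c > d, gram(*c) > gram(*d))
```

    The second pair shows why the one-digit assumption matters: once a disjunct cost reaches 10 or more, the scalar score can rank two sentences differently from the tuple ordering.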

    Old Model: Overview of Rules

    Rules for prefix subordination, infix subordination and if-then coordination (details in Siddharthan, 2003)

    These rules were also adapted recently by SimText (Ong, et al., 2008), a text simplification system for improving the readability of medical literature, but without a mechanism to judge the grammatical correctness.

    Differences

                           New Version                    Old Version

    Time Complexity        O(n log(n) * R)                O(n^3 * R)

    Dependencies           PTB format                     Non-standard LG linkages

    NP chunking            LingPipe + OpenNLP             Stanford

    Parser                 Charniak-McClosky              Link Grammar

    Domain Adaptability    Yes                            No

    Protein Replacement

    Scalability            Yes                            No

    Customizability        Yes                            No

    Model                  Bag of simplified sentences    Minimal number of information-preserving, maximally simple sentences

    PPI Extraction Experiment

    [Figure: Work Flow for Each Abstract. Boxes: Abstract from AIMed, Remove Annotations, bioSimplify, Simplified Abstract, PIE (run twice), Results for original sentences, Results for simplified sentences, AIMed, Comparison of different methods]

    Results

                                                   Precision  Recall  F-score

    Original sentences                             46         58      51

    Sentences simplified by older BioSimplify      51         64      57

    Sentences simplified by current BioSimplify    46         82      60