A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf

1/147

A semi-supervised approach to

extracting concepts and

relationships from clinical text

Siddhartha Jonnalagadda

5/18/2010 ASU Biomedicine


2/147

Abstract

Health care is an industry with trillions of dollars of market share, and information-rich clinical records abound

Goal: Extract mentions of entities such as treatments, lab tests, and medical problems, as well as the associations among them

Enable secondary use of this data:

Tracking performance

Optimizing resources

Biosurveillance

Clinical decision support

Structure the unstructured narratives in clinical records using Information Extraction (NLP)



3/147

Applications of NLP for Biomedical Informatics

Bioinformatics: Curation of PPIs into STRING and SNPs into modSNP

Clinical Informatics: De-identification of patient information and extraction of coded information from clinical records

Public Health Informatics: Computational biosurveillance. For example, BioCaster tracks the distribution of infectious disease outbreaks using linguistic signals from the Web.

Imaging Informatics: Literature-based discovery for improving the state of the art of medical imaging and biomedical image search



4/147

Biomedical NLP vs. Clinical NLP

Aspect: Biomedical NLP | Clinical NLP
Input: Scientific literature | Patient reports
Type of relations found: Complex relations between biomolecular substances | Descriptive
Grammar: Relations based on verbs | Nouns and adjectives
Overlap (both): Tissues, cells, molecular components, and diseases
User needs: Focused: literature search, curation, and hypothesis testing | Diverse: coding, decision support, terminology management, literature search, hypothesis testing
Availability: Open access of scientific literature | Privacy concerns due to HIPAA
Quality: Peer-reviewed and written in English | Local languages, not peer-reviewed
Motivation: Scientifically appealing and new discoveries | Philanthropic, humanitarian, and medico-economic motivation
Funding: Stable since genome sequencing | Fluctuations over years
Shared tasks: BioCreative shared task | i2b2 NLP shared task


REFERENCE: Pierre Zweigenbaum, Natural Language Processing in the Medical and Biological Domains: a Parallel Perspective. Invited Talk, SMBM 2008


5/147

Hypothesis: Biomedical → Clinical

Could our methods in biomedical NLP be adapted to clinical NLP?

Problem: extracting medical problems, tests, and treatments, and the relations among them, in clinical narratives

Large corpora of text unavailable

Very little annotated text

Approach: 1) Vector similarity approach using Distributional Semantics



6/147

Objectives

1. Evaluate the effectiveness of distributional semantics for entity recognition and association extraction from clinical records.

2. Develop a system for the automatic extraction of treatment, test, and medical problem associations from clinical records.

3. Deploy the system as an i2b2 plug-in that could be used for clinical decision support and also improve the extraction engine.



7/147

Clinical Information Extraction

Extracting concepts:

Medical problem

Test

Treatment

Extracting relations (between concepts):

which treatment improves a medical condition

which treatment worsens a medical condition

which treatment causes a medical problem

which treatment is administered for a medical problem

which treatment is not administered for a medical problem

which test reveals a medical problem

which test is conducted to reveal a medical problem

which medical problem indicates another medical problem



8/147

Corpus

Downloaded from the i2b2/VA shared task

~100 de-identified clinical notes annotated for concepts and relations

~1000 unlabeled de-identified clinical notes

Compiled from:

discharge summaries from Partners HealthCare

discharge summaries from Beth Israel Deaconess Medical Center

discharge summaries and progress notes from University of Pittsburgh Medical Center

Access requires signing the Data Usage and Confidentiality agreement

Participation in the shared task is compulsory



9/147

Extracting concepts, also known as Named Entity Recognition, has been studied for the last two decades in the general domain and for about a decade in the medical domain

Can be dictionary-based, rule-based, or machine-learning-based

Medical Language Extraction and Encoding System (MedLEE, 1997) generates coded information for general clinical notes using syntactic patterns from 1000 grammar rules and a lexicon

MetaMap (2001) by NLM maps text to the UMLS Metathesaurus using a knowledge-intensive approach: detecting noun phrases and then employing around 1 M Metathesaurus strings

cTAKES (2008) by Mayo uses UIMA for extracting clinical concepts. Its Naive Bayes classifier with syntactic and morphological features achieved an F-score of 0.56 based on strict matching

The baseline of Patrick et al. (2010) for the i2b2/VA NLP shared task has an F-score of 64%

Clinical NLP is lagging behind biomedical NLP (Meystre et al., 2008)

Main reason: scarcity of labeled data and wider variation in the format of free text

Solution: unsupervised or semi-supervised learning



10/147

Aspect: Unsupervised | Semi-Supervised | Supervised
Goal: Uncover hidden regularities or detect anomalies in the data | Train on labeled and unlabeled data, frequently resulting in a more accurate classifier | Predict the label of an unseen example based on the labels of the seen examples
Input: Only unlabeled data | Labeled data and unlabeled data | Only labeled data
Example: K-Means | Co-training, ASO | SVM
Output: Clusters | Usually classes | Classes
Cost: Low | Moderate | High
Accuracy: Low | High | Moderate



11/147

Kernel Methods framework



12/147

ASO: Success of semi-supervised learning in the biological domain

IBM's Alternating Structure Optimization (ASO) implementation of a gene tagger (entered in BioCreative II, 2007) used 5 M MEDLINE abstracts as unlabeled data. Result: ranked first in that competition

Liu and Ng (2007) tried ASO for SRL but failed, because limited computational resources kept them from using all the unlabeled data



13/147

Two take-home messages from ASO

Unlabeled data can be very useful if the data is LARGE

It is worth finding out whether a kernel built using term similarity in the combined word space of MEDLINE and unlabeled clinical documents would improve the performance of clinical concept extraction

However, biomedical and clinical domains are usually considered different sublanguages

Pan et al. (2010) classified the polarity of sentiments in one domain using the annotations of a related domain by simultaneously co-clustering them in a common latent space

Need a computationally scalable model (linear in space and time complexity) for building the kernel using LARGE data

Random Indexing is linear in space and time complexity


14/147

Use of Random Indexing based word space models for designing the kernel

Random Indexing reduces the dimensionality of the unlabeled data by mapping the terms into random index vectors

Semantic term vectors are built from random index vectors by considering the context around the terms

Sahlgren's permutation model uses permutations of terms in a sliding window surrounding the term to build a paradigmatic model of semantic term vectors

A semantic sentence vector for each sentence in the labeled corpus is built by adding the individual term vectors

Kernels for terms and sentences are built by calculating the dot product of the corresponding semantic vectors
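The steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Semantic Vectors implementation: the values of DIM, SEED_LENGTH and HALF_WINDOW, and all function names, are hypothetical, and the plain sliding-window sum is used instead of Sahlgren's permutation model.

```python
import random
from collections import defaultdict

DIM = 200         # dimensions of the reduced space (assumed value)
SEED_LENGTH = 10  # non-zero entries per index vector (assumed value)
HALF_WINDOW = 2   # neighbours considered on each side of a term (assumed value)

def index_vector(term):
    """Sparse random index vector with SEED_LENGTH entries, half +1 and half -1."""
    rng = random.Random(term)  # seeding by the term makes the vector reproducible
    vec = [0.0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), SEED_LENGTH)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def dot(a, b):
    """Kernel value: dot product of two semantic vectors."""
    return sum(x * y for x, y in zip(a, b))

def build_term_vectors(sentences):
    """Semantic term vector = sum of the index vectors of the window neighbours."""
    vectors = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo = max(0, i - HALF_WINDOW)
            hi = min(len(tokens), i + HALF_WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term] = add(vectors[term], index_vector(tokens[j]))
    return vectors

def sentence_vector(tokens, term_vectors):
    """Semantic sentence vector = sum of the semantic vectors of its terms."""
    total = [0.0] * DIM
    for token in tokens:
        if token in term_vectors:
            total = add(total, term_vectors[token])
    return total

sentences = [["blood", "pressure", "was", "elevated"],
             ["systolic", "pressure", "was", "high"]]
term_vectors = build_term_vectors(sentences)
term_kernel = dot(term_vectors["blood"], term_vectors["systolic"])
sentence_kernel = dot(sentence_vector(sentences[0], term_vectors),
                      sentence_vector(sentences[1], term_vectors))
```

In this toy corpus "blood" and "systolic" occur in identical contexts, so they receive identical semantic vectors, which is the paradigmatic similarity the kernel is meant to capture.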



15/147

Approach

Labeled data: 100 annotated clinical documents

Unlabeled data: 1000 unlabeled clinical notes and MEDLINE abstracts

Design kernel(s) (similarity metric(s)) using the unlabeled data

Implement the kernel algorithm using the labeled data

Advantage: Use of unlabeled data that holds knowledge of how words and sentences are related to each other.



16/147

Finding optimal parameters

Parameters that need to be tested to find the optimal settings:

Dimensions in reduced space

Seed length

Half-window size

Threshold for term-term similarity

Threshold for sentence-sentence similarity

Number of similar sentences to consider

The first three parameters are specific to the model; the last three are universal, as they belong to SimFind



17/147

Finding optimal models

Different possible paradigmatic models:

Sahlgren's permutation-based order vector model (2008)

Sahlgren's directional vector model (2008)

Hyperspace Analog to Language (HAL) (1996)

Jones' convolution-based BEAGLE model (2007)

Cohen's Reflective Random Indexing (2010)

Sahlgren's models are computationally scalable compared to Jones' BEAGLE

Sahlgren's directional model performed better than the conventionally used permutation model

HAL uses SVD, but encodes direction, not order



18/147

Kernel Algorithm: SimFind

(Architecture)



19/147

Kernel Algorithm: SimFind

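(Pseudocode)

SimFind(targetToken, Line){
  // labeled tokens from the 100 sentences most similar to Line
  List simSentences = getSimilarSentences(Line, 100);
  List goldenTokenLabels = getTokenLabels(simSentences);
  STEP1: // exact match of the target token
  FOREACH (goldenTokenLabel IN goldenTokenLabels)
    IF (goldenTokenLabel has targetToken as token)
      RETURN goldenTokenLabel;
  STEP2: // stop words get no label
  IF (targetToken IN STOPLIST)
    RETURN NULL;
  terms = 1;
  STEP3: // widen the search to 10, 100, ... similar words
  terms *= 10;
  equivTokens = getSimWords(targetToken, terms);
  FOREACH (equivToken IN equivTokens)
    FOREACH (goldenTokenLabel IN goldenTokenLabels)
      IF (goldenTokenLabel has equivToken as token)
        RETURN goldenTokenLabel;
  IF (simIndex > 0.5)
    GOTO STEP3;
  RETURN NULL;
}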
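A minimal Python sketch of the SimFind lookup, under stated assumptions. The slide does not define simIndex; here it is assumed to be the similarity of the least similar word retrieved so far. The max_terms cap, the STOPLIST contents, the get_sim_words callback, and the similarity scores in the example are all hypothetical and exist only to make the sketch self-contained and guaranteed to terminate.

```python
STOPLIST = {"the", "of", "and", "was", "on"}  # illustrative stop words only

def sim_find(target_token, similar_sentence_labels, get_sim_words,
             min_similarity=0.5, max_terms=1000):
    """Return a label for target_token, or None when no label is found.

    similar_sentence_labels: dict mapping token -> label, gathered from the
        annotated sentences most similar to the input line.
    get_sim_words: callable (token, n) -> list of (similar_token, similarity)
        pairs sorted by decreasing similarity (hypothetical helper).
    """
    # STEP 1: the token itself is labeled in one of the similar sentences
    if target_token in similar_sentence_labels:
        return similar_sentence_labels[target_token]
    # STEP 2: stop words never receive a label
    if target_token in STOPLIST:
        return None
    # STEP 3: look at the 10, 100, ... most similar words
    terms = 1
    while terms < max_terms:
        terms *= 10
        neighbours = get_sim_words(target_token, terms)
        for equiv_token, _similarity in neighbours:
            if equiv_token in similar_sentence_labels:
                return similar_sentence_labels[equiv_token]
        # assumed meaning of simIndex: similarity of the weakest neighbour
        if not neighbours or neighbours[-1][1] <= min_similarity:
            return None
    return None

def toy_sim_words(token, n):
    """Stand-in for a word-space lookup; the scores are invented."""
    ranked = [("dexamethasone", 0.9), ("tpa", 0.8), ("jun", 0.7),
              ("ap-1", 0.65), ("ald", 0.6)]
    return ranked[:n]

labels = {"ald": "other_organic_compound"}
hu_label = sim_find("hu", labels, toy_sim_words)
```

Here "hu" is not labeled in the similar sentences, so the lookup falls through to its nearest neighbours and picks up the label of "ald", the first neighbour that is labeled there.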



20/147

SimFind vs. Linear Discriminant

Analysis

SimFind | Linear Discriminant Analysis
Uses Random Indexing for dimensionality reduction, which is O(N) and almost perfect | Uses Singular Value Decomposition, which is O(N^3) and perfect
Unsupervised dimensionality reduction | Supervised dimensionality reduction
Scalable to large amounts of unlabeled data | Applicable only to labeled data
Random Indexing fixes the number of dimensions a priori | Finds the significant dimensions in order, like LSA
Doesn't employ the kernel trick | Employs the kernel trick
Uses 2 kernels | Uses a single kernel



21/147

Perils of lazy learning

SimFind is a special case of K-Nearest Neighbor, a supervised machine learning algorithm also known as a lazy learning algorithm for its short training time and long testing time

Time complexity: O(N*T), where N is the number of terms in the training set and T is the number of tokens in the input or test set. N ~ 500,000

Unfortunately, well-known applications that use large unlabeled data use K-NN. For example: PRC, MESH UP, and RRI-based MESH indexing



22/147

Overcoming the perils

Ando (2007) removed sentences with words that had already occurred 25 times

Vasuki and Cohen (2010) used parallel processing to minimize I/O

Observation:

elements of the kernel are computed during the execution of the kernel learning algorithm

the kernel learning algorithm only needs terms closely related to the terms in the corpus

what terms are needed for the task has a high correlation with what terms are present in the corpus

Solution: Modify the kernel design to calculate the kernel matrix at once rather than postpone it to the learning step



23/147

Modified Kernel design step

Use Sahlgren's permutation model, or a better one, with optimal parameters to build a reduced-dimensional word space for MEDLINE and clinical notes

For each term in the labeled corpus, store only the terms from the word space whose cosine similarity to it passes a threshold

Also store the cosine similarities

The second kernel, for sentence similarity, still has to be calculated during the learning step

Time complexity: O(S*T), where S is the number of sentences in the input or test set and T is the average number of tokens in a sentence; that is, O(T) per sentence
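The precomputation described on this slide could look roughly like the sketch below. It assumes the word space is available as a plain dictionary from term to vector; the function names, the threshold value, and the toy vectors are illustrative and not taken from the original system.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def precompute_term_kernel(corpus_terms, word_space, threshold=0.5):
    """For each labeled-corpus term, keep only the word-space terms whose
    cosine similarity reaches the threshold, together with that similarity."""
    kernel = {}
    for term in corpus_terms:
        if term not in word_space:
            continue
        neighbours = []
        for other, vector in word_space.items():
            if other == term:
                continue
            similarity = cosine(word_space[term], vector)
            if similarity >= threshold:
                neighbours.append((other, similarity))
        # most similar first, so lookups at learning time need no sorting
        neighbours.sort(key=lambda pair: pair[1], reverse=True)
        kernel[term] = neighbours
    return kernel

# Toy word space with hand-made vectors (not real semantic vectors)
word_space = {
    "pressure": [1.0, 0.0, 0.0],
    "systolic": [0.9, 0.1, 0.0],
    "aspirin": [0.0, 0.0, 1.0],
}
term_kernel = precompute_term_kernel(["pressure"], word_space, threshold=0.5)
```

Because the kernel is built once, the learning step only performs dictionary lookups for the terms it actually meets, rather than scanning the whole word space for every token.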



24/147

Integrating SimFind with other features

Naive Bayes classifier trained on the noun phrases and adjective phrases in the annotated corpus

Features:

Section of the sentence

Lexical features

Grammatical correctness

Output label from SimFind for the target token

Most similar token from the corpus to the target token

Second most similar token from the corpus to the target token

Third most similar token from the corpus to the target token

Most similar token to the target token from the 100 most similar sentences to the target sentence

Second most similar token to the target token from the 100 most similar sentences to the target sentence

Third most similar token to the target token from the 100 most similar sentences to the target sentence



25/147

Moving on to extracting relations

An extension to concept extraction

Types of relations:

which treatment improves a medical condition

which treatment worsens a medical condition

which treatment causes a medical problem

which treatment is administered for a medical problem

which treatment is not administered for a medical problem

which test reveals a medical problem

which test is conducted to reveal a medical problem

which medical problem indicates another medical problem

From BioCreative experience: the better the concept extraction, the better the relation extraction



26/147

Approaches

Machine learning: less precise because of over-fitting, but high recall

Rule-based or pattern-matching: lower recall because of limited patterns, but high precision

Solution: Increase the recall of pattern matching by

allowing for fuzzy matching based on vector similarity

unmasking relationships hiding in the syntactic jungle using sentence simplification



27/147

Vector similarity of a sentence and a pattern

Without order, a sentence or a pattern can be represented in word space as the vector sum of the individual terms

Order can be encoded by using the permutation model of Sahlgren

For example, the vector for "hypertension was controlled on hydrochlorothiazide" would be |Π^0(hypertension) + Π^1(controlled) + Π^2(hydrochlorothiazide)|, where Π is a random permutation and || is the L2-norm

While this appears theoretically sound, empirically the permutation model performed suboptimally for large windows

Probable reason: it increases the ratio of seed length to the number of dimensions, thus compromising the Johnson-Lindenstrauss Lemma
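A small sketch of order encoding by permutation, under stated assumptions: a rotation of the coordinates stands in for the random permutation, each token is permuted once per position, and the dimensionality, seed length, and function names are all hypothetical.

```python
import math
import random

DIM = 100         # assumed dimensionality
SEED_LENGTH = 10  # assumed number of non-zero entries per index vector

def index_vector(term):
    """Reproducible sparse random vector with half +1 and half -1 entries."""
    rng = random.Random(term)
    vec = [0.0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), SEED_LENGTH)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec

def permute(vec, times):
    """Apply the permutation `times` times; here the permutation is a rotation
    by one coordinate, so negative values undo the rotation."""
    shift = times % len(vec)
    return vec[-shift:] + vec[:-shift] if shift else list(vec)

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def encode_ordered(tokens):
    """Normalized sum of the index vectors, each permuted by its position."""
    total = [0.0] * DIM
    for position, token in enumerate(tokens):
        permuted = permute(index_vector(token), position)
        total = [x + y for x, y in zip(total, permuted)]
    return l2_normalize(total)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

forward = encode_ordered(["hypertension", "controlled", "hydrochlorothiazide"])
reverse = encode_ordered(["hydrochlorothiazide", "controlled", "hypertension"])
similarity = dot(forward, reverse)  # below 1: the two word orders differ
```

The same three words in a different order give a different vector, which is the point of the encoding; a plain sum of index vectors would make the two sentences indistinguishable.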



28/147

Concatenation as a Means to Encode Order in Word Space

Define C(a, b) to be the vector formed by concatenating the vectors corresponding to a and b

Order can be partially encoded by concatenating n-grams and summing the vectors

For n=2, the vector for "hypertension was controlled on hydrochlorothiazide" would be |C(^, hypertension) + C(hypertension, controlled) + C(controlled, hydrochlorothiazide) + C(hydrochlorothiazide, $)|, where ^ and $ are random vectors assigned to the beginning and end of the sentence

A random vector can also be assigned to * in a pattern expression such as P regulates Tr so that the pattern is also expressed as a vector
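The concatenation model for n=2 can be sketched as follows. This is an illustration only: the index vectors, the dimensionality, and the function names are assumptions, and the input is taken to be the content words of the sentence, as in the example on the slide.

```python
import math
import random

DIM = 100         # assumed dimensionality of each index vector
SEED_LENGTH = 10  # assumed number of non-zero entries

def index_vector(term):
    """Reproducible sparse random vector; also used for the ^ and $ markers."""
    rng = random.Random(term)
    vec = [0.0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), SEED_LENGTH)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec

def bigrams(tokens):
    """Adjacent pairs, padded with the start marker ^ and the end marker $."""
    padded = ["^"] + list(tokens) + ["$"]
    return list(zip(padded, padded[1:]))

def concat(a, b):
    """C(a, b): the vector for a followed by the vector for b (length 2 * DIM)."""
    return index_vector(a) + index_vector(b)

def encode_bigrams(tokens):
    """L2-normalized sum of C(a, b) over all adjacent pairs."""
    total = [0.0] * (2 * DIM)
    for left, right in bigrams(tokens):
        total = [x + y for x, y in zip(total, concat(left, right))]
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm else total

content_words = ["hypertension", "controlled", "hydrochlorothiazide"]
pairs = bigrams(content_words)
sentence_vec = encode_bigrams(content_words)
```

Each word contributes to the first half of the vector when it is the left member of a pair and to the second half when it is the right member, so only adjacent order is captured, which is why the slide calls the encoding partial.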



29/147

Example sentences from one discharge summary

The patient has a history of exertional angina and chest pain associated with light-headedness for nine years which was noted to increase in frequency over the past year , then upon admission had light-headedness and chest pain with a dull pressure in her neck to her substernal area , with only minimal exertion such as " walking across the room " .

She underwent angiography on 5-9-92 which showed the right internal carotid artery to be patent , however there was highly significant stenosis of the left carotid artery and she was taken to the operating room later that day for a left carotid endarterectomy .

She was taken postoperatively the ICU where she was extubated on postoperative day number one and by 5-11-92 she was noted to have markedly increased use of her right side , resolving aphasia and she was transferred to the floor with the residual deficit only noted to be some left upper extremity weakness .



30/147

BioSimplify

GOAL: Create a bag of simplified sentences for automatic discourse analysis

Application: information extraction on biomedical and clinical text

Existing methods: features like POS tags, parse trees, and dependencies, informally known as bag-of-NLP

BOSS: standardize the representation of grammatical information in elemental chunks



31/147

BioSimplify (Architecture)

Noun Phrase Replacement

Using POS tags and a noun phrase chunker

POS tags: LingPipe

Chunker: OpenNLP

Syntactic Simplification

Use any parser to produce a CFG Penn tree

Parser: McClosky retraining parser 88%

Information Extraction System

Example: PIE


[Figure: BioSimplify architecture. Components: Sentence, NP Replacement, Syntactic Simplification, BOSS, NP Chunker, Parser, Information Extraction system]


32/147


A noun phrase consists of an optional determinative, an optional premodification, a mandatory head, and an optional postmodification

Noun phrase chunkers return all the noun phrases of the smallest length, thus always excluding the postmodifications

The last word of the identified noun phrase is the head noun

Removal of the optional determinative makes the sentence ungrammatical

All tokens other than the head noun and the starting determinative or numeral (if one exists) are removed

For example, the noun phrase "the recently discovered murine glucocorticoid" is replaced with "the glucocorticoid".
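The replacement rule can be sketched on an already tagged and chunked noun phrase. The sketch assumes Penn Treebank POS tags (DT for determiners, CD for cardinal numbers) supplied by hand; it does not call LingPipe or OpenNLP, and the function name is hypothetical.

```python
KEEP_LEADING_TAGS = {"DT", "CD"}  # determiner or numeral at the start of the phrase

def simplify_noun_phrase(tagged_np):
    """Reduce a chunked noun phrase to its starting determiner or numeral
    (if present) plus its head noun, taken to be the last token.

    tagged_np: list of (token, pos_tag) pairs for one noun phrase.
    """
    if not tagged_np:
        return []
    head = tagged_np[-1][0]
    first_token, first_tag = tagged_np[0]
    if len(tagged_np) > 1 and first_tag in KEEP_LEADING_TAGS:
        return [first_token, head]
    return [head]

noun_phrase = [("the", "DT"), ("recently", "RB"), ("discovered", "VBN"),
               ("murine", "JJ"), ("glucocorticoid", "NN")]
simplified = simplify_noun_phrase(noun_phrase)  # ["the", "glucocorticoid"]
```

Keeping the determiner preserves grammaticality for the downstream parser, while dropping the premodifiers shortens the phrase the information extraction system has to handle.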



33/147

Syntactic Simplification

synSimp(t), where t is the Penn tree of the given sentence:

-Initialize simpTrees, the ordered set containing the Penn trees of all simplified sentences.

-FOREACH subtree of t, traversed in depth-first order

- perform the necessary simplifications at that node, i.e. the simplifications that need not be repeated for all the parents of this node

-Add the present tree to simpTrees

-FOREACH unprocessed tree in simpTrees

- FOREACH subtree of that tree, traversed in depth-first order

- perform the simplifications for this node

- add new trees to simpTrees if applicable

-Return the sentences represented by the trees in simpTrees

END



34/147

Example Rule

Rule: NP[NP SBAR] ~ [NP], {SBAR -WHNP + NP}

Condition:

Wh-NP in the relative clause is replaced by NP from main clause

Explanation: Relative Clause

Example: "To characterize these pathways, we focused on changes in the cyclin-dependent kinase inhibitors and their binding partners that underlie the cell cycle arrest at senescence."

Result:

changes in the cyclin-dependent kinase inhibitors and their binding partners.

The cyclin-dependent kinase inhibitors and their binding partners underlie



35/147

Using BioSimplify for Information

Extraction


[Figure: Work flow for each abstract. Boxes: Doc from corpus, BioSimplify, Simplified docs, IE system (twice), Remove annotations, Results for original sentences, Results for simplified sentences, Corpus, Comparison of different methods]


36/147

Fuzzy pattern matching approach

Generating patterns: Reduce each sentence to the snippets with concepts from the ground truth and some interaction keywords

Manually find synonyms of the keywords: the paradigmatic vector model we built for the concept extraction task

Strict pattern matching using OpenDMAP

Vector representation of sentences by encoding order using Sahlgren's permutation model or the proposed concatenation model

Use BioSimplify to transform the focus sentence into a bag of simplified sentences and search for patterns in each sentence



37/147

Creating symbiotic relationship with

clinicians

"People are willing to do for free what they are not willing to do for small amounts of money" (Spolsky, founder of stackoverflow.com)

Online servers for hospitals to use for extraction of clinical concepts and relations as part of the i2b2 hive

Hospital staff verify the output of our system and correct it if necessary

De-identified corrections retrieved regularly from i2b2 to improve the system



38/147

Broader impacts and Sustainability Plans

Proposed using unlabeled data for improving extraction of concepts and relationships using distributional semantics methods

Suggested using sentence simplification for aiding interaction extraction

Novel additions to the clinical domain

Successful adaptation would have a transformative influence on the field of information extraction

To sustain the application of these methods, we will:

evaluate them against other competitive systems by participating in international competitions like the i2b2/VA NLP shared tasks

create a service-oriented architecture to offer the services of our system to the world free of cost

Wider dissemination

Obtainment of useful feedback




39/147

Team: Siddhartha Jonnalagadda (PI)

Currently pursuing PhD in Biomedical Informatics

B.Tech in CSE from IIT Kharagpur (5th in GPA among the class of 2008)

Inlaks Awards of Excellence at IITs [2006]

10th rank in All India Engineering Entrance among 730,000 students [2004]

Indian National Physics Olympiad Gold medalist [2004]

Regional Mathematics Olympiad Silver medalist [2003]

National Talent Search Examination Scholarship [2002]

Relevant Publications:

S Jonnalagadda, L Tari, J Hakenberg, G Gonzalez. Towards Effective Sentence Simplification for Automatic Processing of Biomedical Text. NAACL 2009

S Jonnalagadda, P Topham, G Gonzalez. Towards Automatic Extraction of Social Networks of Organizations in PubMed Abstracts. GTBN workshop in IEEE BIBM 2009

S Jonnalagadda, G Gonzalez. Sentence Simplification Aids Protein-Protein Interaction Extraction. LBM 2009

S Jonnalagadda, P Topham, G Gonzalez. ONER: Tool for Organization Named Entity Recognition from Affiliation Strings in PubMed Abstracts. LBM 2009

S Jonnalagadda, R Leaman, T Cohen, G Gonzalez. A Distributional Semantics Approach to Simultaneous Recognition of Multiple Classes of Named Entities. CICLing 2010, LNCS 6008

J Hakenberg, R Leaman, V Nguyen, S Jonnalagadda, et al. Efficient extraction of protein-protein interactions from full-text articles. Accepted by IEEE/ACM TCBB. 2010.

S Jonnalagadda, G Gonzalez. BioSimplify: an open source sentence simplification engine to improve recall in automatic biomedical information extraction. Submitted to AMIA 2010



40/147

Team: Dr. Graciela Gonzalez (Co-PI)

Assistant Professor in Biomedical Informatics at Arizona State University

Research Interests: natural language processing, knowledge representation, and translational bioinformatics

NSF panelist

Member of Biomedical Library and Informatics Review Committee (BLIRC)

Director of Data Management and Statistics Core of the Arizona Alzheimer's Disease Center

Director of DIEGO: Data Integration and Extraction of Genomic and Clinical Ontologies

Will oversee the project

Will provide other student researchers from her lab, who come from informatics and clinical backgrounds



41/147

Team: Dr. Trevor Cohen (Co-PI)

Assistant Professor, School of Health Information Sciences, University of Texas, Houston

Research Interest: Empirical distributional semantics

One of the developers of the Semantic Vectors package

Proposed a novel word space model called Reflective Random Indexing

Will advise on the use of distributional semantics methods

Offered a 16 GB quad-core Opteron server primarily for this project



42/147

Team: Mr. Robert Leaman (Co-PI)

Member of DIEGO lab

Bachelor of Science degree in Computer Science from Brigham Young University

Several years in industry

Ph.D. student in Computer Science at Arizona State University

Research interests: Computational Linguistics, text mining, and Named Entity Recognition

Developed BANNER, an open-source biomedical NER system

Mr. Leaman and Siddhartha have been working together in the BioCreative shared task



43/147

Time Plan

Month 1: Evaluate different word space models along with their parameters to empirically discover the best vector-based kernel for biomedical and clinical text.

Month 2: Build a preliminary system for concept extraction that uses the SimFind learning algorithm and the optimal kernel discovered above, using the clinical corpus provided by the i2b2/VA for the shared task.

Month 3: Statistically ensemble the outputs of SimFind along with other morphological and contextual features into a machine-learning framework.

Month 4: Reduce the number of term vectors in the distributional semantics model for unlabeled data and achieve efficient integration of MEDLINE and unlabeled clinical documents' distributional features into the system.

Month 5: Adapt our protein-protein interaction system for relationship extraction on clinical text and introduce the novel features of fuzzy pattern matching and bag-of-simplified-sentences.

Month 6-8: Create the symbiotic platform with clinicians.



44/147

Thanks

Site Visit Team

Members of DIEGO lab

Faculty and students of BMI

Questions?



45/147

Appendix



46/147

Novel Approach to Biomedical Text Mining: Case Study on PPI Extraction

Siddhartha Jonnalagadda

PhD Candidate

Department of Biomedical Informatics

"Although the world is full of suffering, it is full also of the overcoming of it." Helen Keller


47/147

Motivation

Information extraction from biomedical text is still an open problem

BioCreative Protein Identification: 42.9%

BioCreative PPI Extraction: 22.1%

Exploit Discourse Analysis Approach

Application of Distributional Semantics

Need for high-performance, scalable, open-source, and adaptable BioNLP systems

Integrate NLP, Linguistics & Machine Learning


48/147

Complexity of Biomedical Sentences

Compared to regular English:

1. More words per sentence

2. Inconsistent use of nouns and partial words

3. Higher perplexity measures

4. Greater lexical density

5. Increased number of relative clauses and prepositional phrases

6. Specialized names (e.g. p53, c-Abl)

7. Chemical names with commas and parentheses (e.g. 1,25(OH)2D3)

8. More coordination ellipsis (e.g. alpha- and beta-catenin)


49/147

Hypothesis

1. Removing complexity in biomedical sentences helps unmask relationships that are hidden from NLP systems

2. Random indexing based measures are effective, and more efficient in comparing patterns of words or sentences than traditional machine learning approaches using non-semantic features


50/147

Thesis ONTOLOGY

[Figure: Thesis ontology. Nodes: Protein-Protein Interaction Extraction; Named Entity Recognition (SimFind, BANNER + SimFind); Normalization (External Features: Affiliation, Author); Relationship Extraction; Discourse Analysis (Bag of Simplified Sentences model ~ Shot-gun sequencing); Distributional Semantics (Word-order preserving pattern extraction)]




52/147

Protein-Protein Interactions

Central tenet of modern translational and genomic research

Discovery methods: mass spectrometry, immunoprecipitation, yeast two-hybrid (Y2H), and recently domain-based computational techniques

Becoming increasingly important in understanding human diseases at the system-wide and genomic level

Ex: pathogenesis of Huntington's disease

Strong functional correlation with genes

Create positive bias for genome-wide association analyses and reduce computation burden

Drug discovery, Disease prognosis, Genetic Epidemiology,


53/147

PPI Resources

Numerous publicly available databases, mostly human-curated

Ex: HPRD, BioGrid, BIND, MINT, DIP, Reactome, UniHi, HPID, IntAct, STRING, GeneNetwork

Manual curation, despite years of effort, has only made a small dent (around 7%)

Text mining systems to automatically extract PPIs are available:

GENIES, BioRAT, GeneWays, MedScan, YAPPIE, AKANE

Online tools for biologists:

PIE, SPIES, Whatizit, RelEx, PolySearch, PubGENE, CBioC


54/147

Why another Novel Approach?

F-score of the best system in BioCreative II (2007) is 30%

F-score of the best system in BioCreative II.5 (2009) is 22% (different test set and online competition)

Reason: Many systems have 80-90% F-score on a corpus of less than 10k sentences, but that doesn't scale to random documents from (say) PubMed Central

Humans are linguistically creative and paraphrase concepts in different ways by varying both vocabulary and syntax

Distributional semantics addresses the vocabulary issue

Sentence simplification based discourse analysis takes care of the syntax issue




56/147

Thesis ONTOLOGY


Named Entity Recognition

SimFind

BANNER + SimFind


57/147

Named Entity Recognition

Task:

Locate names in natural language text

Specify their type

Example entities:

Newswire: people, organization, location

Biomedical: gene, protein, species, disease, drug

Motivation: e.g. relationship extraction; not possible to perform manually


58/147

Biomedical Named Entity Recognition

Examples from GENIA gold standard:

"IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase."

Lexicon | Semantics
IL-2 gene expression | Other name
IL-2 gene | DNA domain or region
NF-kappa B activation | Other name
NF-kappa B | Protein molecule
CD28 | Protein molecule
5-lipoxygenase | Protein molecule

Multi-labeling problem


59/147

Semantics in Biomedical NER

Early results: dictionary based (Settles 2004)

Usually does not help (!)

Dictionary gives a binary true / false

Alternating Structure Optimization (Ando 2007)

Use 5 million Medline abstracts as unlabeled data

Good performance

Too computationally intensive

Word Clustering (Finkel 2009)

Improvement not reported


60/147


Main types: probabilistic & geometric

Geometric methods represent terms as vectors in an N-dimensional space

LSA uses a term-document matrix

HAL uses a term-term matrix

Schütze's Wordspace uses term - four-grams of words

A large number of dimensions implies high computational & storage cost

Need dimensionality reduction

Example: LSA uses SVD


61/147

Random Indexing

Geometric method with reduced dimensionality

Generates the reduced matrix directly

Johnson-Lindenstrauss Lemma: distances between points in a vector space will be approximately preserved when projected into a reduced-dimensional subspace of sufficient dimensionality

Computationally efficient

O(n) in the size of the corpus

LSA (SVD) is O(n^3)

Allows efficient incremental updates

Accuracy comparable to LSA

e.g. performs as well as SVD methods on the TOEFL synonym test


62/147

Random Indexing

Each context is assigned a vector

High dimensional: n ~ 1000 (not that high)

Sparse: 1% of entries assigned {+1, -1}

Randomly generated; zero-sum

The large number of possible permutations implies vectors will be close to orthogonal

Semantic term vectors are then just the linear sum of the term vectors in the sliding window context where they occur
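The near-orthogonality claim can be checked empirically with a short sketch. The dimensionality of 1000 and the 1% density follow the slide; the number of vectors, the random seed, and the function names are arbitrary choices for the illustration.

```python
import math
import random

def random_index_vector(rng, dim=1000, density=0.01):
    """Sparse ternary vector: about density * dim non-zero entries,
    split evenly between +1 and -1 so that the entries sum to zero."""
    nonzero = max(2, int(dim * density))
    nonzero -= nonzero % 2  # even count keeps the vector zero-sum
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec

def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def max_abs_cosine(vectors):
    """Largest absolute cosine over all distinct pairs of vectors."""
    return max(abs(cosine(vectors[i], vectors[j]))
               for i in range(len(vectors))
               for j in range(i + 1, len(vectors)))

rng = random.Random(0)
vectors = [random_index_vector(rng) for _ in range(50)]
worst_case = max_abs_cosine(vectors)  # stays far from 1 for independent vectors
```

Two such vectors only interact where their few non-zero positions happen to coincide, so most pairs have a cosine of exactly zero and the rest are small.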


63/147

Random Indexing: Visualization

Semantic term vector for "expression" = sum(elemental vectors)

[Figure: the elemental vectors of the text tokens il-2, gene, expression, and, nf are summed]


64/147

Term-Term Similarity Example

Staphylococcus | Antidepressants | Pressure
0.61: aureus | 0.41: tricyclic | 0.34: blood
0.32: methicillin | 0.33: antidepressant | 0.33: systolic
0.29: epidermidis | 0.18: reuptake | 0.28: pressures
0.23: coagulase | 0.17: tcas | 0.28: mmhg
0.21: mrsa | 0.15: tricyclics | 0.26: diastolic
0.18: staphylococci | 0.14: ssris | 0.25: hg

Source: T. Cohen and D. Widdows, "Empirical distributional semantics: Methods and biomedical applications," Journal of Biomedical Informatics, vol. 42, 2009, pp. 390-405.


65/147

Encoding Word Order

Sequential structure of language is often important

"Migraines cause nausea" ≠ "nausea causes migraines"

Can be captured in RI using a permutation operation

Creates a new orthogonal vector

Reversible; can recreate the original vector


66/147

RI + Word Order: Visualization

Semantic term vector for "expression" = sum(permuted vectors)

[Figure: the elemental vectors of the text tokens il-2, gene, expression, and, nf are permuted according to their offsets, p(·, -2), p(·, -1), p(·, 0), p(·, +1), p(·, +2), and then summed]


67/147

SimFind Algorithm

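SimFind(targetToken, Line){
  // labeled tokens from the 100 sentences most similar to Line
  List simSentences = getSimilarSentences(Line, 100);
  List goldenTokenLabels = getTokenLabels(simSentences);
  STEP1: // exact match of the target token
  FOREACH (goldenTokenLabel IN goldenTokenLabels)
    IF (goldenTokenLabel has targetToken as token)
      RETURN goldenTokenLabel;
  STEP2: // stop words get no label
  IF (targetToken IN STOPLIST)
    RETURN NULL;
  terms = 1;
  STEP3: // widen the search to 10, 100, ... similar words
  terms *= 10;
  equivTokens = getSimWords(targetToken, terms);
  FOREACH (equivToken IN equivTokens)
    FOREACH (goldenTokenLabel IN goldenTokenLabels)
      IF (goldenTokenLabel has equivToken as token)
        RETURN goldenTokenLabel;
  IF (simIndex > 0.5)
    GOTO STEP3;
  RETURN NULL;
}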


68/147

System Architecture

[Figure: System Architecture. Boxes: GENIA; Training with Annotation; Testing w/o Annotation; Testing with Annotation; Lucene Tokenizer; Stop Word List; Apache Lucene / Semantic Vectors; Random Index Vectors; SimFind.]


69/147

Sample Similar Sentences

SENTENCE: This activation was due to the translocation of p65 and c-Rel NF.kappa B proteins from cytoplasmic stores to the nucleus, where they bound the kappa B sequence of the IL-2R alpha promoter either as p50.p65 or as p50.c-Rel heterodimers.

1. The active nuclear form of the NF-kappa B transcription factor complex is composed of two DNA binding subunits, NF-kappa B p65 and NF-kappa B p50, both of which share extensive N-terminal sequence homology with the v-rel oncogene product.

2. Transcriptional activation of the human TF gene in monocytic cells exposed to bacterial lipopolysaccharide (LPS) is mediated by binding of c-Rel/p65 heterodimers to a kappa B site in the TF promoter.

3. In contrast to induction of STATs by cytokines, the IRF-1 GAS-binding complex activated by CD40, TNF-alpha, or EBV contains Rel proteins, specifically p50 and p65.


70/147

Sample Similar Tokens

SENTENCE: These results strongly suggest that HU induces both transcriptional and post-transcription regulation of c-jun during erythroid differentiation.

RESULT: HU is assigned label other_organic_compound

Rank  Token          Label(s)                             Present in N-most similar sentences?
1     dexamethasone  lipid                                No
2     tpa            other_name & other_organic_compound  No
3     jun            DNA_domain_or_region                 No
4     ap-1           DNA_domain_or_region                 No
5     ald            other_organic_compound               Yes


71/147

Sample SimFind Output

Token       Label
il-2        other_name & protein_molecule
gene        other_name & DNA_domain_or_region
expression  other_name
and         none
nf          protein_complex
kappa       other_organic_compound & protein_molecule
b           other_organic_compound & protein_molecule
activation  none
through     none
cd28        protein_molecule
requires    none
reactive    inorganic
oxygen      inorganic
...


72/147

Details

Lucene tokenizer

Sliding window model

Dimensionality: 200

Window size: 11

No stop-words for the vector space model

Stop words for SimFind: 421 derived from Brown corpus

IO tagging scheme

Implemented using open source SemanticVectors package


73/147

GENIA Corpus

400,000 words from 2000 PUBMED abstracts

100,000 annotations

47 entity types, hierarchical

17% of the entities are embedded in another entity

Previous uses of this corpus recognized


74/147

GENIA Ontology

Root
  source
    natural
      organism
        multi-cell
        mono_cell
        virus
      body_part
      tissue
      cell_type
    artificial
      cell_line
      other_artificial_source
  substance
    compound
      organic
        amino_acid
          protein
            protein_molecule
            protein_family_or_group
            protein_domain_or_region
            protein_substructure
            protein_subunit
            protein_complex
            protein_N/A
          peptide
          amino_acid_monomer
        nucleic_acid
          DNA
            DNA_molecule
            DNA_family_or_group
            DNA_domain_or_region
            DNA_substructure
            DNA_N/A
          RNA
            RNA_molecule
            RNA_family_or_group
            RNA_domain_or_region
            RNA_substructure
            RNA_N/A
          polynucleotide
          nucleotide
        lipid
        carbohydrate
        other_organic_compound
      inorganic
    atom
  Other_name


75/147

Example from GENIA gold standard

IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.

Lexicon                 Semantics
IL-2 gene expression    Other name
IL-2 gene               DNA domain or region
NF-kappa B activation   Other name
NF-kappa B              Protein molecule
CD28                    Protein molecule
5-lipoxygenase          Protein molecule

Multi-labeling problem


76/147

Evaluation

Used 5 x 2 cross validation on the GENIA corpus

Tokens may have more than one label due to embedded entities

Precision, recall and F-score calculated with respect to fragment match

Must find correct node or child, no partial credit for ancestors or cousins

Overall micro-averaged F-score 67.3%

More than half of the entities have an F-score greater than 50.0%

RESULTS


77/147

Entity            Precision (%)  Recall (%)  F-score  Random F-score
Bio-entity        78.9           82.5        80.7     26.22
Substance         77.0           79.6        78.3     20.92
Organic compound  77.0           79.5        78.2     20.82
Compound          77.0           79.5        78.2     20.82
Amino acid        69.4           71.1        70.3     13.69
Protein           69.2           71.0        70.1     13.37
Lipid             66.1           67.0        66.5     0.66
Virus             65.6           67.3        66.4     0.86
Source            61.4           66.2        63.7     5.62
Atom              62.0           60.2        61.1     0.10
Nucleotide        57.0           64.4        60.5     0.05
Organism          59.6           58.8        59.2     1.23
Carbohydrate      63.2           45.7        53.1     0.05
DNA               48.3           52.6        50.4     5.31
Cell type         42.7           50.7        46.3     2.14
Cell line         44.0           44.9        44.5     1.87
RNA               47.0           41.2        43.9     0.61
Body part         39.6           45.0        42.1     0.10
Peptide           41.9           32.7        36.7     0.15
Polynucleotide    44.9           27.0        33.7     0.10
Tissue            22.8           23.7        23.3     0.20
Overall Score     66.3           68.4        67.3


78/147

Future Work in NER

Integrate into BANNER

Corpus: GeneTag

If time permits: Inter-corpus evaluation

The best performing Protein NER system, Ando (2007), uses 5 million PubMed abstracts as unlabeled data

Improvement because of using semantic features from unlabeled data: 2%

Hopefully, we will make it too!


79/147

Thesis ONTOLOGY

Named Entity Recognition

SimFind

BANNER + SimFind

Normalization

External Features

* Affiliation

* Author

Discourse Analysis

Bag of Simplified Sentences model ~ Shot-gun sequencing

Word-order preserving pattern extraction


80/147

Thesis ONTOLOGY

Normalization

External Features

* Affiliation

* Author


81/147

Protein Normalization

It is useful (though not essential) to identify the interacting proteins with a standard id

Required in BioCreative

Extending our motto of proposing novel approaches to counter the creative human authors, I plan on using two additional external features along with our lab's system:

Author names of the closest PubMed abstract

Geopolitical information of the first author

The idea is that words in papers from similar people (same last name) and similar places have the same sense

The closest PubMed abstract is found using a distributional semantics based index


82/147

Finding Geopolitical Information

Use affiliation sentence in the closest PubMed abstract

Different features available

Country

City

State

Address, Email, URL

Less useful information like names of buildings

Organization name and Sub-organization names

The last ones conform least to a general pattern because of idiosyncrasies of the variegated peoples responsible for naming an organization.

Perhaps needs normalization too


83/147

Multiple Layers, Multiple Rules, Multiple Dictionaries

Neti, Neti: Elimination of untruth till one lands up at the feet of absolute truth

If finding an organization is difficult, let us find what is not an organization

Easier things first.

Each subtask - multiple rules and multiple dictionaries


84/147

Country Detection

1. Exact or approximate (hidden within a phrase) search for the country name with or without diacritics

2. Exact search for names of important cities

3. Exact search for region names

4. Exact search for city names

5. Email aliases

6. Approximate search for important city names

7. Approximate search for region names

8. Approximate search for city names
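The layered order above (most reliable evidence first, weaker evidence only as a fallback) can be illustrated with a short sketch. It collapses the eight steps into three layers, and everything in it is an assumption made for the example: the toy dictionaries, the function names, and the reading of "email aliases" as the last label of the e-mail domain.

```python
import re
import unicodedata

# Toy dictionaries; the real system's dictionaries are not shown in the slides.
COUNTRIES = {"usa": "USA", "mexico": "Mexico"}
CITY_TO_COUNTRY = {"durham": "USA"}
EMAIL_SUFFIX_TO_COUNTRY = {"mx": "Mexico"}


def strip_diacritics(text):
    """Remove combining marks so that 'México' matches 'Mexico'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def detect_country(affiliation):
    """Try the most reliable layer first and fall through to weaker evidence."""
    text = strip_diacritics(affiliation).lower()
    tokens = re.findall(r"[a-z]+", text)
    for tok in tokens:                      # layer 1: country name
        if tok in COUNTRIES:
            return COUNTRIES[tok], "country name"
    for tok in tokens:                      # layer 2: city name
        if tok in CITY_TO_COUNTRY:
            return CITY_TO_COUNTRY[tok], "city name"
    match = re.search(r"@[\w.-]+\.([a-z]{2,})\b", text)   # layer 3: e-mail suffix
    if match and match.group(1) in EMAIL_SUFFIX_TO_COUNTRY:
        return EMAIL_SUFFIX_TO_COUNTRY[match.group(1)], "email"
    return None, None
```

Returning the name of the layer that fired makes later error analysis easier, since a false positive can be traced back to the rule or dictionary that produced it.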



85/147

Organization Detection

Most of the phrases already identified as not Organization

Bootstrap acronym and Replace

Check O-Key

Check Person Name or Place Name

Person Name: Bootstrap

Place Name: Dictionary

And many more subtle rules


86/147

Error Analysis* OCH data

Total Number of Countries 4910

Number of Countries Detected 4758

True Positives 4746

False Positives 12

False Positives after correction 1

Number of Countries Not Detected 152

True Negatives 51

False Negatives 102

False Negatives after correction 23 (all of them have some suggestions)

*by 3 annotators: Divya, George and Siddhartha

87/147

Results Summary for Country Detection

Precision, the percentage of results returned that are correct = TP/(TP+FP) = 99.8%

Precision after corrections = 99.98%

Recall, the fraction of (all) correct results returned = TP/(TP+FN) = 97.9%

Recall after corrections = 99.54%

F-measure, the harmonic mean of precision and recall = 2PR/(P+R) = 98.8%

F-measure after corrections = 99.76%

For the second best known system (Yu, et al.):

Precision = 94.0%; Recall = 92.1%; F-measure = 93.0%



88/147

Analysis* using Staph Aureus data

Organizations:

Number of Affiliation sentences = 4000

True Positives = 3989

False Positives = 0

False Negatives = 11

Precision = 100%; Recall = 97.5%; F-measure = 98.7%

States:

True Positives = 3528

False Positives = 470

False Negatives = 2

Precision = 88.2%; Recall = 99.9%; F-measure = 93.7%

Cities:

True Positives = 3611

False Positives = 2

False Negatives = 387

Precision = 99.9%; Recall = 90.3%; F-measure = 94.9%

For the second best known system (Yu, et al.): Precision = 86.8%; Recall = 91.3%; F-measure = 89.0%

*Done by: Divya
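As a worked check of the formulas used in these summaries, the three measures can be recomputed from the raw counts. The function name is an assumption; the counts are the "States" figures from this slide.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and their harmonic mean from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


p, r, f = precision_recall_f(tp=3528, fp=470, fn=2)   # "States" counts from the slide
print(f"P = {p:.1%}, R = {r:.1%}, F = {f:.1%}")       # P = 88.2%, R = 99.9%, F = 93.7%
```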



89/147

Organization Normalization

Two types of Named Entities

Described entities: those which uniquely identify with a real world organization given the GPE

All Organizations containing a person name, a place name, or a directional modifier

Ex:- Jerome Lipper Center for Multiple Myeloma, University of Texas and North Western University.

Descriptor entities, also sub-organizations: those which don't uniquely identify with a real world organization unless in the presence of a Described entity and whose primary role is to give more specific information about a Described entity

Ex:- School of informatics and Dept. of Biomedical Informatics

We are interested only in Described entities, also called organizations


90/147

Ambiguities

No polysemy at entity class. Single entity.

Polysemy within entity class:

Mayo Clinic, USA

Mayo Clinic, Rochester, USA

Synonymy at Word level because of NSWs

Formatting errors mainly because of OCR in MARS

Spelling mistakes

(Abbreviations taken care of at NER stage)

Synonymy at Inter-Word level due to lack of consensus in the choice of words while referring to an organization


91/147

Example of Synonymy

Washington University School of Medicine

Washington University School of Medicine and St. Louis Children's Hospital

School of Medicine

Washington University School of Medicine at Barnes-Jewish Hospital

Barnes-Jewish Hospital at Washington University School of Medicine

Washington University

Washington University School of Medicien*

Washington University School of Medicine and Metropolitan St. Louis Psychiatric Center

Barnes Retina Institute and Washington University School of Meidcine*

Division of Gastroenterology Washington University School of Medicine St. Louis

*ibid


92/147

Common Approach

Compare against a list or dictionary of organizations

Used in gene normalization

Map the Entrez Gene identifiers for gene names mentioned in PubMed/MEDLINE abstracts

Unlike genes, many organizations get renamed and some become defunct

No community interest in maintaining a database

Our Approach:

automatically build a database of Organization clusters, OrgDB, from 100,000 randomly selected affiliation sentences from PubMed published between the years 1998 and 2008


93/147

Clustering

Entries in OrgDB

Centroid string: has the least sum of distances in DIST

List of all organizations in each entry or cluster

DIST matrix containing inter-component distance

PubMed IDs of the components

GPE: city, state and country of Cluster

An organization is added if it is similar according to String Similarity Metric

And all entries in the cluster are updated
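The centroid rule (the member with the least sum of distances in DIST) is simple enough to show directly. The function name and the distance values below are invented for illustration; only the selection rule comes from the slide.

```python
def centroid_string(names, dist):
    """Return the member whose summed distance to all other members is smallest."""
    totals = [sum(row) for row in dist]
    return names[totals.index(min(totals))]


names = [
    "Washington University School of Medicine",
    "Washington University School of Medicien",
    "Washington University",
]
# Hypothetical symmetric DIST matrix; dist[i][j] is the distance between names[i] and names[j].
dist = [
    [0, 1, 3],
    [1, 0, 4],
    [3, 4, 0],
]
print(centroid_string(names, dist))   # Washington University School of Medicine
```

When a new organization joins a cluster, one row and one column are added to DIST and the centroid is re-selected, which matches the note that all entries in the cluster are updated.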



94/147

Two types of Sequence Alignment

Global Sequence Alignment

Implemented by Needleman-Wunsch Algorithm using dynamic programming

Local Sequence Alignment

Implemented by Smith-Waterman Algorithm

Also uses dynamic programming

Global Seq. Alignment is too strict

Local Seq. Alignment is inaccurate
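To make the "too strict" versus "inaccurate" contrast concrete, here is a compact sketch of both dynamic programs applied to word tokens. The scoring values (match +1, mismatch -1, gap -1) and the choice to align words rather than characters are assumptions for the example, not parameters taken from the slides.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score when both sequences must be aligned end to end."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of sub-sequences (never below zero)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best


full = "Washington University School of Medicine".split()
part = "School of Medicine".split()
print(global_score(full, part), local_score(full, part))   # 1 3
```

The global score is dragged down by the two unmatched leading words, while the local score rewards any shared fragment: "Duke University" and "Washington University" still score 1 locally although they name different organizations.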



95/147

Example


96/147

Limitation of Global Alignment


97/147

Limitation of Local Alignment


98/147

Local Learning

Use local information from the training data to further enhance the value of the training set

Have a Tight String Similarity Metric

Understand the data by finding connected component

Precise Computation

We call it Recalculation through self-training

Inspired by the Charniak-McClosky parser, Brown University. Currently the best.


99/147

Tight String Similarity (TSS) Metric

Levenshtein distance between the two organization names

NOT at the character level

BUT at the word level

AFTER removing stop words

Criteria for two words a & b to be same:

Parameters for Levenshtein Distance

Penalty of gap = length of the word

Penalty of mismatch = sum of the lengths of the words

Sentences same if the distance between them is not more than 4
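A minimal sketch of a word-level Levenshtein distance with the penalties listed above. The slide's criterion for treating two words as the same did not survive extraction, so the sketch assumes exact match after lower-casing; the stop-word list and all function names are likewise hypothetical.

```python
STOP_WORDS = {"of", "the", "and", "at", "for", "in"}   # hypothetical stop list


def words(name):
    """Lower-case, drop commas and stop words."""
    return [w for w in name.lower().replace(",", " ").split() if w not in STOP_WORDS]


def same_word(a, b):
    # Assumption: the original word-equality criterion is not recoverable from the slide.
    return a == b


def tss_distance(x, y):
    """Word-level edit distance: gap costs len(word), mismatch costs len(a) + len(b)."""
    a, b = words(x), words(y)
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + len(a[i - 1])
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + len(b[j - 1])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if same_word(a[i - 1], b[j - 1]) else len(a[i - 1]) + len(b[j - 1])
            d[i][j] = min(d[i - 1][j - 1] + sub,
                          d[i - 1][j] + len(a[i - 1]),
                          d[i][j - 1] + len(b[j - 1]))
    return d[-1][-1]


def same_organization(x, y, threshold=4):
    return tss_distance(x, y) <= threshold
```

Under these assumptions "The Mayo Clinic" and "Mayo Clinic" are at distance 0, while "Washington University School of Medicine" and "Washington University" are at distance 14 (the lengths of "school" and "medicine"). Weighting by word length means dropping a short token is cheap and dropping a long, informative word is expensive.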



100/147

Recalculation

TSS addresses Synonymy caused by NSWs

Still Synonymy exists among clusters in OrgDB

Due to the lack of consensus in the choice of words

Ex:- The David Geffen School of Medicine at The University Of California and DG School of Medicine at The University Of California at Los Angeles.

Recalculate: Synonyms of the current cluster

Equivalent to finding connected component in the corresponding graph.


101/147

Seeing it as a Graph

OrgDB equivalent to an undirected graph OrgG

Vertices: clusters

Edges: Between any two clusters

One of which almost contains another like David Geffen School of Medicine and DG School of Medicine

Both have same GPE

Almost contains if Extended Smith-Waterman Score (ESS) is more than 0.90


102/147

Finding Connected Component

Initialization: Add the vertex (cluster) we are concerned with

Propagation: Iteratively visit each unvisited vertex (cluster) closest to the root

add all the vertices (clusters) adjacent to it that are not already in the connected component

Pruning: From depth 3, add only those vertices:

which have an organization that collaborated in an article with an organization in one of the vertices (clusters) already in the connected component

Prevents errors
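The three steps can be sketched as a breadth-first search with a depth-dependent admission test. This is an illustration under assumptions: the root is taken to be at depth 1 (so its neighbours are at depth 2), the graph is an adjacency dictionary, collaborations are stored as unordered pairs, and all names are hypothetical.

```python
from collections import deque


def connected_component(root, adjacent, collaborated, prune_from_depth=3):
    """Breadth-first expansion from root; from prune_from_depth onwards a vertex is
    admitted only if it collaborated with a vertex already in the component."""
    depth = {root: 1}                       # assumption: the root sits at depth 1
    queue = deque([root])
    while queue:
        vertex = queue.popleft()            # closest-to-root unvisited vertex first
        for nxt in adjacent.get(vertex, ()):
            if nxt in depth:
                continue
            d = depth[vertex] + 1
            if d >= prune_from_depth and not any(
                frozenset((nxt, member)) in collaborated for member in depth
            ):
                continue                    # pruned: no collaboration evidence
            depth[nxt] = d
            queue.append(nxt)
    return set(depth)


# Hypothetical toy graph: O6 co-authored with O1, O7 did not collaborate with anyone.
adjacent = {"O1": ["O2", "O3"], "O2": ["O6"], "O3": ["O7"], "O7": ["O8"]}
collaborated = {frozenset(("O6", "O1"))}
component = connected_component("O1", adjacent, collaborated)
```

In the toy graph the result is O1, O2, O3 and O6. O7 is rejected at depth 3 for lack of collaboration evidence, and O8 is never reached because expansion stops at pruned vertices.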



103/147

An Example

Input:

PubMed ID: 16849888

Affiliation sentence: Duke University Medical Center and Duke Clinical Research Institute, Durham, NC 27710, USA.

NER:

Organization: Duke University Medical Center and Duke Clinical Research Institute (O1)

Country: USA

State: North Carolina

Zip Code: 27710

City: DURHAM


104/147

Example Continued

Adding to OrgDB or OrgG

No cluster in OrgDB is close to O1

Add new cluster for O1.

Recalculation Step 1

Add O1 to the connected component, CC


105/147

Example Continued

Recalculation Step 2

Organizations adjacent to O1 are:

Duke Clinical Research Institute (O2)

Duke University Medicalcenter (O3)

Duke University Medical Center (O4)

Duke University (O5)

Add these to CC


106/147

Continued

Recalculation Step 3

Expand all the nodes at level 2. For example, O2.

The organization adjacent to O2 is Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute (O6). Add this to CC

Repeat it for O3, O4 and O5.

We get 14 more organizations in CC: O7 through O20


107/147

108/147

Recalculation Step 4

Consider expanding O19 to:

Durham Veterans Affairs Medical Center

Veterans Affairs Medical Center. An examination of

These two didn't collaborate (in any of the 100,000 publications) with any organization in the CC.

This justifies our not adding these organizations in the CC.

Step 4 is continued for the rest of the organizations


109/147

Finally!

Connected component gave us the set of all the synonyms of the organization

Depending on the objectives of normalization, the criterion to choose varies

We picked the centroid string of the cluster with the largest number of publications as the normalized name

Going by this criterion: Duke University Medical Center becomes the normalized name for Duke University Medical Center and Duke Clinical Research Institute


110/147

Analysis

Test set: obtained 4135 articles related to a study on Antiangiogenesis indexed in PubMed between 2004 and 2008

The normalization process identified each article with a unique standard organization

182 unique organizations were identified (13.8 articles per organization)

Overall 13 errors; 5 were caused only by NER which means the Normalization process alone has a precision of 99.5% with 100% recall


111/147

Magic Mappings

Discovering a richer set of synonyms than naïvetés


112/147

(Digression) Top-10 organizations in terms of the number of publications


113/147

(Digression) Top-10 most influential organizations


114/147

Work Left

Normalization is done only for USA; needs to be extended globally

Incorporating these features into the existing system should be easy

Analyze the improvement in performance

Corpus: BioCreative III training set and/or test set


115/147

Thesis ONTOLOGY

Named Entity Recognition

SimFind

BANNER + SimFind

Normalization

External Features

* Affiliation

* Author

Discourse Analysis

Bag of Simplified Sentences model ~ Shot-gun sequencing

Word-order preserving pattern extraction


116/147

Thesis ONTOLOGY

Discourse Analysis

Bag of Simplified Sentences model ~ Shot-gun sequencing

Word-order preserving pattern extraction


117/147

Discourse Analysis

"Language is not merely a bag of words but a tool with particular properties ... The linguist's work is precisely to discover these properties, whether for descriptive analysis or for the synthesis of quasi-linguistic systems." Harris, 1954

For this, DAs break the sentence into simpler clauses.

Even Quirk's simple sentence can still be a complex clause

We have identified a new TNF-related ligand, designated human GITR ligand (hGITRL), and its human receptor (hGITR), an ortholog of the recently discovered murine glucocorticoid-induced TNFR-related (mGITR) protein.

DAs critically analyze discourse using the integration tool

Discourse Analysis Tools


118/147


Deixis Tool: ask how deictics are being used to tie what is said to context and to make assumptions about what authors already know or can figure out

Pronoun Resolution

Appositives

Fill in Tool: Based on what was said and the context in which it was said, what needs to be filled in here to achieve clarity?

Distributional semantics


120/147


Integration Tool: Find how clauses were integrated or packaged into main, subordinate, and embedded (relative) clauses

What was missing and what got added?

Example: It has been shown that LIGHT triggers apoptosis of various tumor cells including HT29 cells that express both lymphotoxin beta receptor (LTbetaR) and HVEM / TR2 receptors.

LIGHT triggers apoptosis of various tumor cells.

LIGHT triggers apoptosis of HT29 cells.

HT29 cells express both lymphotoxin beta receptor and HVEM / TR2 receptors.

Halliday's Systemic Functional Grammar


121/147

Three ways clauses expand to sentence

Elaborating its existing structure

Example: Relative clause

Extending it by addition or replacement

Example: Coordination

Enhancing its environment

Example: Cause-conditional

We will be using these guidelines to design rules for creating simpler sentences


122/147

BioSimplify

GOAL: Create bag of simplified sentences for automatic discourse analysis.

Application: information extraction on biomedical text

Existing methods: features like POS tags, parse trees and dependencies, informally known as bag-of-NLP

BOSS: standardize the representation of grammatical information in elemental chunks

Sentence Simplification: Motivations


123/147

Improve human readability

Shorter

Grammatical

Cohesive

Information-preserving

Text summarization

Shorter

Preserve only important

Improve parser performance or Relationship Extraction

Shorter

Grammatical


Architecture


124/147

Noun Phrase Replacement

Using POS tags and Noun phrase chunker

POS tags: LingPipe

Chunker: OpenNLP

Syntactic Simplification

Use any parser to produce CFG penn tree

Parser: McClosky retraining parser, 88%

Information Extraction System

Example: PIE

[Figure: Sentence; NP Replacement (NP Chunker); Syntactic Simplification (Parser); BOSS; Information Extraction system.]



125/147

NP Replacement

Noun phrase consists of an optional determinative, an optional premodification, a mandatory head, and an optional postmodification

Noun Phrase chunkers return all the noun phrases of the smallest length, thus always excluding the postmodifications

Last word of the identified noun phrase is the head noun

Removal of the optional determinative makes the sentence ungrammatical

All tokens other than the head noun and the starting determinative or numeral (if exists) are removed

For example, the noun phrase "the recently discovered murine glucocorticoid" is replaced with "the glucocorticoid".
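The replacement rule can be written as a few lines over a chunk of (token, POS tag) pairs. The sketch assumes Penn Treebank tags, with DT for determiners and CD for numerals, and the function name is hypothetical; the chunker and tagger themselves (OpenNLP and LingPipe in the slides) are not called here.

```python
KEEP_FIRST = {"DT", "CD"}   # assumption: Penn Treebank tags for determiner and numeral


def reduce_noun_phrase(chunk):
    """chunk: list of (token, pos) pairs for one base noun phrase.

    Keeps a leading determiner or numeral (if present) and the head noun,
    which is taken to be the last token of the chunk.
    """
    if not chunk:
        return []
    head = chunk[-1][0]
    if len(chunk) > 1 and chunk[0][1] in KEEP_FIRST:
        return [chunk[0][0], head]
    return [head]


chunk = [("the", "DT"), ("recently", "RB"), ("discovered", "VBN"),
         ("murine", "JJ"), ("glucocorticoid", "NN")]
print(" ".join(reduce_noun_phrase(chunk)))   # the glucocorticoid
```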



126/147

Syntactic Simplification

synSimp(t), t is the penn tree of the given sentence:
- Initialize simpTrees, the ordered set containing the penn trees of all simplified sentences.
- FOREACH subtree of t traversed in the order of depth-first traversal
    - perform necessary simplifications at that node, which are the simplifications that needn't be repeated for all the parents of this node
- Add the present tree to simpTrees
- FOREACH unprocessed tree in simpTrees
    - FOREACH subtree of that tree traversed in the order of depth-first traversal
        - perform the simplifications for this node
        - add new trees in simpTrees if applicable
- return the sentences represented by the trees in simpTrees
END

Rules


127/147

Rule: S ~ {S}.

Condition:

S contains NP

S contains VP

Explanation: Adds all simple sentences into bag

Example: In differentiating C2C12 cells, E2F complexes switch and DNA synthesis in response to serum are prevented when MyoD DNA binding activity and the cdks inhibitor MyoD downstream effector p21 are induced.

Result: MyoD DNA binding activity and the cdks inhibitor MyoD downstream effector p21 are induced.

Rules


128/147

Rule: NP[NP1 VP1*] ~ [NP1] {NP1 "can be" VP1}

Condition:

VP1 starts with a gerund, present participle or past participle

Explanation: Postmodification by verb phrase

Example:

The cloning of members of these gene families and the identification of the protein-interaction motifs found within their gene products has initiated the molecular identity of factors (TRADD, FADD/MORT, RIP, FLICE/MACH, and TRAFs) associated with both of the p60 and p80 forms of the TNF receptor and with other members of the TNF receptor superfamily.

Result:

The cloning of members of these gene families and the identification of the protein-interaction motifs has initiated the molecular identity of factors

The protein-interaction motifs can be found within their gene products.

Rules


129/147

Rule: NP[NP1 ADJP1] ~ [NP1] {NP1 "can be" ADJP1}

Explanation: Postmodification by adjective phrase

Example:

Src homology domain-2 (SH2)/SH3 domain - can be containing adapters such as Grb2, Crk, and Crk-L, which interact with guanine nucleotide exchange factors specific for the Ras family.

Result:

interact with guanine nucleotide exchange factors.

Guanine nucleotide exchange factors can be specific for the Ras family.

Rules


130/147

Rule: NP[NP1 PRN] ~ [NP1] [PRN - LRB - RRB]

Explanation:

Add two sentences: one with the abbreviation removed, the other with NP replaced by the abbreviation

Example: Coexpression of the alpha and betaL subunits of the human interferon alpha (IFNalpha) receptor is required for the induction of an antiviral state by human IFNalpha.

Result:

Coexpression of the alpha and betaL subunits of the human interferon alpha receptor is ...

Coexpression of the alpha and betaL subunits of the human IFNalpha is ...

Rules


131/147

Rule: NP[NP1 PP] ~ [NP1]

Explanation: Postmodification by prepositional phrase

Example:

To explore the role of the different domains of the betaL subunit in IFNalpha signaling, we coexpressed wild-type alpha subunit and truncated forms of the betaL chain in L-929 cells.

Result:

To explore the role in IFNalpha signaling, we coexpressed wild-type alpha subunit and truncated forms of the betaL chain in L-929 cells.

Rules


132/147

Rule: VP[MD VP1 , S*] ~ [MD VP1]

Condition:

S contains VP and not NP

Explanation: Postmodification by verb phrase

Example: T lymphocytes can be activated normally in response to either stimulus, demonstrating that the effects of the inactive CaMKIV on activation are reversible.

Result: T lymphocytes can be activated normally in response to either stimulus.

Rules


133/147

Rule: NP[NP : S*] ~ [S*]

Condition:

S contains VP or NP

Explanation: Section indicator

Example: OBJECTIVE: To investigate the relationship between the expression of Th1/Th2 type cytokines and the effect of interferon-alpha therapy.

Result:

To investigate the relationship between the expression of Th1/Th2 type cytokines and the effect of interferon-alpha therapy.

Rules


134/147

Rule: S[S1 , NP VP] ~ [NP VP]

Condition:

S1 doesn't contain both NP and VP

Explanation: Content clause

Example: To characterize these pathways, we focused on changes in the cyclin-dependent kinase inhibitors and their binding partners that underlie the cell cycle arrest at senescence.

Result:

We focused on changes in the cyclin-dependent kinase inhibitors and their binding partners that underlie the cell cycle arrest at senescence.

Rules


135/147

Rule: NP[NP SBAR] ~ [NP], {SBAR -WHNP + NP}

Condition:

Wh-NP in the relative clause is replaced by NP from main clause


Example: To characterize these pathways, we focused on changes in the cyclin-dependent kinase inhibitors and their binding partners that underlie the cell cycle arrest at senescence.

Result:

changes in the cyclin-dependent kinase inhibitors and their binding partners.

The cyclin-dependent kinase inhibitors and their binding partners underlie ...

Rules


136/147

Rule: VP*, SBAR+ ~ -, SBAR


Example: As [Ca2+]o increased, [Ca2+]i rapidly increased, as monitored by fluorometry.

Result: As [Ca2+]o increased, [Ca2+]i rapidly increased.

Rules


137/147

Rule: VP[VP1 , CC VP2] ~ [VP1] [VP2]

Explanation: Coordination of verb phrases

Example:

These mechanisms must be understood in order to prevent, or combat, the emergence of a virulent, multidrug-resistant form of the bacillus that would be uncontrollable by means of today's treatment strategies.

Result:

These mechanisms must be understood in order to prevent, the emergence of a virulent, multidrug ...

These mechanisms must be understood in order to combat, the ...

Rules


138/147

Rule: VP[... , PP] ~ }, PP{

Explanation: Postmodification by prepositional phrase

Terminal prepositional phrase and preceding comma are removed from verb phrase

Example: Because cell lines can lose their differentiated phenotype in culture across passages, documentation of gene expression must be determined for passage populations, for us to have knowledge of cell behavior in vitro.

Result: Because cell lines can lose their differentiated phenotype in culture across passages, documentation of gene expression must be determined for passage populations.

BioSimplify Parser


139/147

The rules are perfect, so BioSimplify is as perfect as the penn trees are.

The Penn trees are as perfect as the performance of parsers in biomedical domain

Increasing parser performance (F-measure)

Stanford (lexicalised) 72.5% (2003)

Link Grammar ~70% (2006)

Charniak-Lease 81% (2005)

Charniak-McClosky 84% (2008)

Charniak-McClosky 88% (2009)

Calculated penn tree databases also available

ASU BMI (& CSE)'s PTDB parse tree database

NLP web service provided by NCIBI

Analysis


140/147

Worst-case Time Complexity O(n^2 * R), where

n = number of tokens in the sentence

R = number of simplification rules

Average Time Complexity O(n log(n) * R)

Better than our prototype system (2009) based on Link Grammar parser

Time complexity O(n^3 * R)

Link Grammar is getting behind in the race

The dependencies produced aren't standard like the penn trees

BioSimplify Accuracy: precision of 90%, recall of 99% and f-score of 95%

Test Set: 404 sentences from AIMed

Old Model


141/147

Preprocessing

Removal of sentence indicators

Removal of phrases in parentheses

Partial resolution of coordination ellipsis

Gene Entity Replacement

Syntactic transformation

Grammatical correctness using GRAM vector

Old Model: GRAM vector


142/147

Every sentence can be uniquely associated with the 2-tuple of null count and disjunct cost (n, d)

A null count (which represents unwanted words) needs more attention than the disjunct cost (which represents less likely words)

We define the 2-tuple (n1, d1) to be greater than (n2, d2), if and only if n1 is greater than n2, or, n1 is equal to n2 and d1 is greater than d2.

The grammatical correctness of a collection of sentences is measured by the 2-tuple of the sum of the null counts of the individual sentences and the sum of the disjunct costs of the individual sentences respectively.

Since null counts and disjunct costs are typically less than 10 (i.e., one-digit numbers), for the purpose of easy comparison and for capturing the 2-tuples in one dimension, we define a new cost vector GRAM which is equal to 10*UNUSED + DIS.
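The ordering and the scalar shortcut can be checked with a few lines. The sketch takes UNUSED to be the (summed) null count and DIS the (summed) disjunct cost, which is how the slide appears to use them; the function names are hypothetical.

```python
def gram(null_count, disjunct_cost):
    """Scalar stand-in for the 2-tuple: GRAM = 10 * UNUSED + DIS."""
    return 10 * null_count + disjunct_cost


def collection_gram(sentences):
    """sentences: list of (null_count, disjunct_cost) pairs, one per sentence."""
    total_nulls = sum(n for n, _ in sentences)
    total_cost = sum(d for _, d in sentences)
    return gram(total_nulls, total_cost)


# Python compares tuples lexicographically, which is exactly the ordering defined above.
print((1, 2) > (0, 9), gram(1, 2) > gram(0, 9))     # True True
print((1, 0) > (0, 12), gram(1, 0) > gram(0, 12))   # True False
```

The second line shows the limit of the shortcut: once a disjunct cost reaches 10 or more (for instance after summing over a collection), the scalar GRAM can disagree with the tuple ordering, so the one-digit assumption matters.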

OLD Model:Overview of Rules


143/147

Rules for prefix subordination, infix subordination and if-then coordination (details in Siddharthan, 2003)

These rules were also adapted recently by SimText (Ong, et al., 2008), a text simplification system for improving the readability of medical literature, but without a mechanism to judge the grammatical correctness.

Differences


144/147

                     New Version                  Old version
Time Complexity      O(n log(n) * R)              O(n^3 * R)
Dependencies         PTB format                   Non-standard LG linkages
NP chunking          LingPipe + OpenNLP           Stanford
Parser               Charniak-McClosky            Link Grammar
Domain Adaptability  Yes                          No
Protein Replacement
Scalability          Yes                          No
Customizability      Yes                          No
Model                Bag of simplified sentences  Minimal number of maximally simple sentences

PPI Extraction Experiment


145/147

[Figure: Work Flow for Each Abstract. Boxes: Abstract from AIMed; Remove Annotations; bioSimplify; Simplified Abstract; PIE (run twice); Results for original sentences; Results for Simplified sentences; AIMed; Comparison of different methods.]

Results


146/147

                                             Precision  Recall  F-score
Original sentences                           46         58      51
Sentences simplified by older BioSimplify    51         64      57
Sentences simplified by current BioSimplify  46         82      60
