7/27/2019 A semi-supervised approach to extracting concepts and relationships_v2_final-1.pdf
1/147
A semi-supervised approach to
extracting concepts and
relationships from clinical text
Siddhartha Jonnalagadda
5/18/2010 ASU Biomedicine 1
Abstract
Health care industry: trillions of dollars of market share; information-rich clinical records abound
Goal: Extract mentions of entities such as treatments, lab tests, and medical problems, as well as associations among them.
Enable secondary use of this data:
Tracking performance
Optimizing resources
Biosurveillance
Clinical Decision Support
Structure the unstructured narratives in clinical records using Information Extraction (NLP)
Applications of NLP for Biomedical Informatics
Bioinformatics: Curation of PPIs into STRING and SNPs into modSNP
Clinical Informatics: De-identification of patient information and extraction of coded information from clinical records
Public Health Informatics: Computational biosurveillance. For example, BioCaster tracks the distribution of infectious disease outbreaks from linguistic signals on the Web.
Imaging Informatics: Literature-based discovery for improving the state of the art of medical imaging and biomedical image search
Biomedical NLP vs. Clinical NLP
 | Biomedical NLP | Clinical NLP
Input | Scientific literature | Patient reports
Type of relations found | Complex relations between biomolecular substances | Descriptive
Grammar | Relations based on verbs | Nouns and adjectives
Overlap | Tissues, cells, molecular components and diseases (shared by both)
User needs | Focused: literature search, curation and hypothesis testing | Diverse: coding, decision support, terminology management, literature search, hypothesis testing
Availability | Open access of scientific literature | Privacy concerns due to HIPAA
Quality | Peer-reviewed and written in English | Local languages, not peer-reviewed
Motivation | Scientifically appealing and new discoveries | Philanthropic, humanitarian and medico-economic motivation
Funding | Stable since genome sequencing | Fluctuations over years
Shared tasks | BioCreative shared task | i2b2 NLP shared task
REFERENCE: Pierre Zweigenbaum, Natural Language Processing in the Medical and Biological Domains: a Parallel Perspective. Invited Talk, SMBM 2008
Hypothesis: Biomedical to Clinical
Could our methods in biomedical NLP be adapted to clinical NLP?
Problem: extracting medical problems, tests, and treatments, and the relations among them, in clinical narratives
Large corpora of text unavailable
Very little annotated text
Approach: 1) Vector similarity approach using Distributional Semantics
Objectives
1. Evaluate the effectiveness of distributional semantics for entity recognition and association extraction from clinical records.
2. Develop a system for the automatic extraction of treatment, test, and medical problem associations from clinical records.
3. Deploy the system as an i2b2 plug-in that could be used for clinical decision support and also improve the extraction engine.
Clinical Information Extraction
Extracting concepts:
Medical problem
Test
Treatment
Extracting relations (between concepts):
which treatment improves a medical condition
which treatment worsens a medical condition
which treatment causes a medical problem
which treatment is administered for a medical problem
which treatment is not administered for a medical problem
which test reveals a medical problem
which test is conducted to reveal a medical problem
which medical problem indicates another medical problem
Corpus
Downloaded from i2b2/VA shared task.
~100 de-identified clinical notes annotated for concepts and relations
~1000 unlabeled de-identified clinical notes
Compiled from:
discharge summaries from Partners HealthCare
discharge summaries from Beth Israel Deaconess Medical Center
discharge summaries and progress notes from University of Pittsburgh Medical Center
Sign Data Usage and Confidentiality agreement
Compulsory participation in the shared task
Extracting concepts
Also known as Named Entity Recognition; studied for the last two decades in the general domain and for about a decade in the medical domain
Can be dictionary-based, rule-based, or machine learning based
Medical Language Extraction and Encoding System (MedLEE, 1997) generates coded information for general clinical notes, using syntactic patterns from 1000 grammar rules and some lexicon
MetaMap (2001) by NLM maps text to the UMLS Metathesaurus; it uses a knowledge-intensive approach, detecting noun phrases and then employing around 1 M Metathesaurus strings
cTAKES (2008) by Mayo uses UIMA for extracting clinical concepts. Their Naive Bayes classifier with syntactic and morphological features achieved an F-score of 0.56 based on strict matching
Patrick et al. (2010)'s baseline for the i2b2/VA NLP shared task has an F-score of 64%
Clinical NLP is lagging behind biomedical NLP (Meystre et al., 2008)
Main reason: scarcity of labeled data, wider variation in the format of free text
Solution: unsupervised or semi-supervised learning
 | Unsupervised | Semi-Supervised | Supervised
Goal | Uncover hidden regularities or detect anomalies in the data | Train on labeled and unlabeled data, frequently resulting in a more accurate classifier | Predict the label of an unseen example based on the labels of the seen examples
Input | Only unlabeled data | Labeled data and unlabeled data | Only labeled data
Example | K-Means | Co-training, ASO | SVM
Output | Clusters | Usually classes | Classes
Cost | Low | Moderate | High
Accuracy | Low | High | Moderate
Kernel Methods framework
ASO: Success of semi-supervised learning in biological domain
IBM's Alternating Structure Optimization (ASO) implementation of Gene Tagger (entered in BioCreative II, 2007) used 5 M MEDLINE abstracts as unlabeled data. Result: ranked first in that competition
Liu and Ng (2007) tried ASO for SRL but failed because they could not use all the unlabeled data, owing to limitations in computational resources.
Two take-home messages from ASO
Unlabeled data can be very useful if the data is LARGE
It is worth finding out whether a kernel built using term similarity in the combined word space of MEDLINE and unlabeled clinical documents would improve the performance of clinical concept extraction
However, the biomedical and clinical domains are usually considered different sublanguages
Pan et al. (2010) classified the polarity of sentiments in one domain using the annotation of a related domain by simultaneously co-clustering them in a common latent space.
A computationally scalable model (linear in space and time complexity) is needed for building a kernel from LARGE data
Random Indexing is linear in space and time complexity
Use of Random Indexing based word space models for designing kernel
Random Indexing reduces the dimensionality of the unlabeled data by mapping the terms into random index vectors
Semantic term vectors are built from random index vectors by considering the context around the terms.
Sahlgren's permutation model uses permutations of terms in a sliding window surrounding the term to build a paradigmatic model of semantic term vectors
A semantic sentence vector for each sentence in the labeled corpus is built by adding the individual term vectors
Kernels for terms and sentences are built by calculating the dot product of the corresponding semantic vectors
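The steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's code: it uses plain sliding-window Random Indexing (no permutations), and the function names, the toy sentences, and the default values for `dim`, `seeds` and `half_window` are all hypothetical.

```python
import math
import random
from collections import defaultdict


def index_vector(term, dim=200, seeds=10):
    """Sparse random index vector: `seeds` positions set to +1 or -1."""
    rng = random.Random(term)  # seeded by the term, so the vector is reproducible
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), seeds):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def add_into(target, source):
    """Add `source` to `target` element by element, in place."""
    for i, x in enumerate(source):
        target[i] += x


def term_vectors(sentences, dim=200, seeds=10, half_window=2):
    """Semantic term vector = sum of the index vectors of neighbouring terms."""
    vectors = defaultdict(lambda: [0.0] * dim)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo = max(0, i - half_window)
            hi = min(len(tokens), i + half_window + 1)
            for j in range(lo, hi):
                if j != i:
                    add_into(vectors[term], index_vector(tokens[j], dim, seeds))
    return dict(vectors)


def sentence_vector(tokens, vectors, dim=200):
    """Semantic sentence vector = sum of the known term vectors."""
    total = [0.0] * dim
    for tok in tokens:
        if tok in vectors:
            add_into(total, vectors[tok])
    return total


def dot(u, v):
    """Kernel value as the dot product of two semantic vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Length-normalized variant of the dot-product kernel."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


# Toy unlabeled corpus (hypothetical sentences)
corpus = [
    ["patient", "took", "aspirin", "daily"],
    ["patient", "took", "ibuprofen", "daily"],
]
space = term_vectors(corpus)
```

In this toy corpus "aspirin" and "ibuprofen" occur in identical contexts, so their term vectors coincide; with real data the similarity is graded rather than exact.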
Approach
Labeled data: 100 annotated clinical documents
Unlabeled data: 1000 unlabeled clinical notes and MEDLINE abstracts
Design kernel(s) (similarity metric(s)) using the unlabeled data
Implement the kernel algorithm using the labeled data
Advantage: Use of unlabeled data that holds knowledge of how words and sentences are related to each other.
Finding optimal parameters
Parameters that need to be tested to find the optimal settings:
Dimensions in reduced space
Seed length
Half-window size
Threshold for term-term similarity
Threshold for sentence-sentence similarity
Number of similar sentences to consider
The first three parameters are specific to the model; the last three are universal, as they belong to SimFind
Finding optimal models
Different possible paradigmatic models:
Sahlgren's permutation based order vector model (2008)
Sahlgren's directional vector model (2008)
Hyperspace Analog to Language (HAL) (1996)
Jones' convolution based BEAGLE model (2007)
Cohen's Reflective Random Indexing (2010)
Sahlgren's models are computationally scalable compared to Jones' BEAGLE
Sahlgren's directional model performed better than the conventionally used permutational model
HAL uses SVD, but encodes direction, not order
Kernel Algorithm: SimFind
(Architecture)
Kernel Algorithm: SimFind
(Pseudocode)
SimFind(targetToken, Line){
  List simSentences = getSimilarSentences(Line, 100);
  List goldenTokenLabel = getTokenLabels(simSentences);
  STEP1: FOREACH (goldenTokenLabel)
    IF (goldenTokenLabel has targetToken as token)
      RETURN goldenTokenLabel;
  STEP2: IF (targetToken IN STOPLIST)
    RETURN ;
  terms = 1;
  STEP3:
  terms *= 10;
  List equivTokens = getSimWords(targetToken, terms);
  FOREACH (equivToken)
    FOREACH (goldenTokenLabel)
      IF (goldenTokenLabel has equivToken as token)
        RETURN goldenTokenLabel;
  IF (simIndex > 0.5)
    goto STEP3;
  RETURN ;
  EXIT;
}
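A minimal Python reading of the pseudocode above may help make the control flow concrete. It is a sketch under assumptions, not the original implementation: the similar sentences are passed in already retrieved and labeled, `similar_words` is a hypothetical callable standing in for getSimWords, `simIndex` is interpreted as the similarity of the weakest neighbour returned so far, and `None` stands for "no label".

```python
def sim_find(target_token, similar_sentences, similar_words, stoplist, min_sim=0.5):
    """Label `target_token` using labeled tokens from similar sentences.

    similar_sentences: labeled sentences, each a list of (token, label) pairs
                       (assumed to be the top similar sentences, e.g. 100).
    similar_words:     hypothetical callable (token, n) -> up to n pairs
                       (word, similarity), sorted by decreasing similarity.
    """
    # Collect the "golden" token labels from the similar sentences.
    golden = {}
    for sentence in similar_sentences:
        for token, label in sentence:
            golden.setdefault(token, label)

    # STEP 1: the token itself is labeled in a similar sentence.
    if target_token in golden:
        return golden[target_token]

    # STEP 2: stop words get no label.
    if target_token in stoplist:
        return None

    # STEP 3: widen the set of similar words tenfold on each pass.
    terms = 1
    while True:
        terms *= 10
        neighbours = similar_words(target_token, terms)
        for word, _similarity in neighbours:
            if word in golden:
                return golden[word]
        exhausted = len(neighbours) < terms  # no further words to fetch
        if exhausted or neighbours[-1][1] <= min_sim:
            return None


# Toy data (hypothetical)
labeled = [[("aspirin", "treatment"), ("for", "none"), ("headache", "problem")]]
neighbour_table = {
    "ibuprofen": [("aspirin", 0.8), ("naproxen", 0.7)],
    "xyz": [("abc", 0.2)],
}


def toy_similar_words(token, n):
    return neighbour_table.get(token, [])[:n]


label = sim_find("ibuprofen", labeled, toy_similar_words, {"the", "of"})
```

Here "ibuprofen" never appears in the labeled sentences, but its neighbour "aspirin" does, so the sketch borrows that label.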
SimFind vs. Linear Discriminant Analysis
SimFind | Linear Discriminant Analysis
Uses Random Indexing for dimensionality reduction, which is O(N) and almost perfect | Uses Singular Value Decomposition, which is O(N^3) and perfect
Unsupervised dimensionality reduction | Supervised dimensionality reduction
Scalable to large amounts of unlabeled data | Applicable only to labeled data
Random Indexing fixes the number of dimensions a priori | Finds the significant dimensions in order, like LSA
Doesn't employ kernel trick | Employs kernel trick
Uses 2 kernels | Uses single kernel
Perils of lazy learning
SimFind is a special case of K-Nearest Neighbor, a supervised machine learning algorithm also known as a lazy learning algorithm for its short training time and long testing time
Time complexity: O(N*T), where N is the number of terms in the training set and T is the number of tokens in the input or test set. N ~ 500,000
Unfortunately, well-known applications that use large unlabeled data use K-NN. For example: PRC, MESH UP and RRI based MESH indexing
Overcoming the perils
Ando (2007) removed sentences with words that had already occurred 25 times
Vasuki and Cohen (2010) used parallel processing to minimize I/O
Observation:
elements of the kernel are computed during the execution of the kernel learning algorithm
the kernel learning algorithm only needs terms closely related to the terms in the corpus
which terms are needed for the task correlates highly with which terms are present in the corpus
Solution: Modify the kernel design to calculate the kernel matrix at once rather than postpone it to the learning step
Modified Kernel design step
Use Sahlgren's permutation model, or a better one, with optimal parameters to build a reduced-dimensional word space for MEDLINE and clinical notes
For each term in the labeled corpus, store only the terms from the word space whose cosine similarity to it reaches a threshold
Also store the cosine similarities
The second kernel, for sentence similarity, still has to be calculated during the learning step
Time complexity: O(S*T), where S is the number of sentences in the input or test set and T is the average number of tokens in a sentence; that is, O(T) per sentence
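The precomputation described on this slide could look roughly like the following. It is a sketch, assuming the word space is available as a dictionary from term to vector; the function name `build_neighbour_table`, the threshold value and the toy vectors are illustrative and not taken from the original system.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_neighbour_table(corpus_terms, word_space, threshold=0.5):
    """For each labeled-corpus term, keep only sufficiently similar terms.

    Returns {term: [(neighbour, cosine), ...]} sorted by decreasing cosine,
    so the learning step can look neighbours up instead of recomputing them.
    """
    table = {}
    for term in corpus_terms:
        if term not in word_space:
            continue  # term never seen in the unlabeled data
        scored = []
        for other, vec in word_space.items():
            if other == term:
                continue
            sim = cosine(word_space[term], vec)
            if sim >= threshold:
                scored.append((other, sim))
        table[term] = sorted(scored, key=lambda pair: -pair[1])
    return table


# Toy word space (hypothetical 2-dimensional vectors)
word_space = {
    "aspirin": [1.0, 0.0],
    "ibuprofen": [0.9, 0.1],
    "fracture": [0.0, 1.0],
}
table = build_neighbour_table(["aspirin", "warfarin"], word_space)
```

Building the table costs one pass over the word space per corpus term, but it is paid once; at test time each token needs only a dictionary lookup.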
Integrating SimFind with other features
Naive Bayes classifier trained on the noun phrases and adjective phrases in the annotated corpus
Features:
Section of the sentence
Lexical features
Grammatical correctness
Output label from SimFind for the target token
Most similar token from the corpus to the target token
Second most similar token from the corpus to the target token
Third most similar token from the corpus to the target token
Most similar token to the target token from the 100 most similar sentences to the target sentence
Second most similar token to the target token from the 100 most similar sentences to the target sentence
Third most similar token to the target token from the 100 most similar sentences to the target sentence
Moving on to extracting relations
An extension to concept extraction
Types of relations:
which treatment improves a medical condition
which treatment worsens a medical condition
which treatment causes a medical problem
which treatment is administered for a medical problem
which treatment is not administered for a medical problem
which test reveals a medical problem
which test is conducted to reveal a medical problem
which medical problem indicates another medical problem
From BioCreative experience: the better the concept extraction, the better the relation extraction
Approaches
Machine Learning: less precise because of over-fitting, but high recall
Rule-based or pattern-matching: lower recall because of limited patterns, but high precision
Solution: Increase the recall of pattern-matching by
allowing for fuzzy matching based on vector similarity
unmasking relationships hiding in the syntactic jungle using sentence simplification
Vector similarity of a sentence and a pattern
Without order, a sentence or a pattern can be represented in word space as the vector sum of the individual terms
Order can be encoded by using the permutational model of Sahlgren
For example, the vector for "hypertension was controlled on hydrochlorothiazide" would be |Π^0(hypertension) + Π^1(controlled) + Π^2(hydrochlorothiazide)|, where Π is a random permutation and || is the L2-Norm
While this appears theoretically sound, empirically the permutation model performed suboptimally for large windows
Probable reason: it increases the ratio of seed length to the number of dimensions, thus compromising the Johnson-Lindenstrauss Lemma
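The order-encoding idea can be illustrated with a short Python sketch. Two assumptions are made for simplicity and are not from the slides: the permutation is implemented as a cyclic shift (one possible choice of permutation), and each term's vector is a reproducible sparse random index vector with hypothetical `dim` and `seeds` values.

```python
import math
import random


def index_vector(term, dim=100, seeds=4):
    """Sparse random index vector for a term (reproducible per term)."""
    rng = random.Random(term)
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), seeds):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def permute(vec, k):
    """Apply the permutation k times; here, a cyclic shift by k positions."""
    k %= len(vec)
    return vec[-k:] + vec[:-k] if k else list(vec)


def encode_ordered(tokens, dim=100, seeds=4):
    """Sum the k-times permuted vector of the k-th token, then L2-normalize."""
    total = [0.0] * dim
    for k, token in enumerate(tokens):
        for i, x in enumerate(permute(index_vector(token, dim, seeds), k)):
            total[i] += x
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm else total


forward = encode_ordered(["hypertension", "controlled", "hydrochlorothiazide"])
backward = encode_ordered(["hydrochlorothiazide", "controlled", "hypertension"])
```

The same three terms in a different order give a different vector, which is the property a plain vector sum lacks.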
Concatenation as a Means to Encode Order in Word Space
Define C(a, b) to be the vector formed by concatenating the vectors corresponding to a and b
Order can be partially encoded by concatenating n-grams and summing the vectors
For n=2, the vector for "hypertension was controlled on hydrochlorothiazide" would be |C(^, hypertension) + C(hypertension, controlled) + C(controlled, hydrochlorothiazide) + C(hydrochlorothiazide, $)|, where ^ and $ are random vectors assigned to the beginning and end
A random vector can also be assigned to * in a pattern expression such as P regulates Tr so that the pattern is also expressed as a vector
Example sentences from one discharge summary
The patient has a history of exertional angina and chest pain associated with light-headedness for nine years which was noted to increase in frequency over the past year , then upon admission had light-headedness and chest pain with a dull pressure in her neck to her substernal area , with only minimal exertion such as " walking across the room " .
She underwent angiography on 5-9-92 which showed the right internal carotid artery to be patent , however there was highly significant stenosis of the left carotid artery and she was taken to the operating room later that day for a left carotid endarterectomy .
She was taken postoperatively the ICU where she was extubated on postoperative day number one and by 5-11-92 she was noted to have markedly increased use of her right side , resolving aphasia and she was transferred to the floor with the residual deficit only noted to be some left upper extremity weakness .
BioSimplify
GOAL: Create a bag of simplified sentences (BOSS) for automatic discourse analysis.
Application: information extraction on biomedical and clinical text
Existing methods: features like POS tags, parse trees and dependencies, informally known as bag-of-NLP
BOSS: standardize the representation of grammatical information in elemental chunks
BioSimplify (Architecture)
Noun Phrase Replacement
Using POS tags and noun phrase chunker
POS tags: LingPipe
Chunker: OpenNLP
Syntactic Simplification
Use any parser to produce a CFG Penn tree
Parser: McClosky retraining parser 88%
Information Extraction System
Example: PIE
[Figure: BioSimplify architecture. Boxes: Sentence, NP Replacement (with NP Chunker), Syntactic Simplification (with Parser), BOSS, Information Extraction system]
Noun Phrase Replacement
A noun phrase consists of an optional determinative, an optional premodification, a mandatory head, and an optional postmodification
Noun phrase chunkers return all the noun phrases of the smallest length, thus always excluding the postmodifications
The last word of the identified noun phrase is the head noun
Removal of the optional determinative makes the sentence ungrammatical
All tokens other than the head noun and the starting determinative or numeral (if it exists) are removed
For example, the noun phrase "the recently discovered murine glucocorticoid" is replaced with "the glucocorticoid".
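The replacement rule above is simple enough to sketch directly. The snippet assumes the chunker has already produced a base noun phrase as a list of (token, POS tag) pairs with Penn Treebank tags, and that `DT` and `CD` mark the starting determinative or numeral; the function name is hypothetical and this is not BioSimplify's actual code.

```python
def simplify_noun_phrase(tagged_np):
    """Keep the starting determiner or numeral (if any) and the head noun.

    tagged_np: list of (token, pos_tag) pairs for one base noun phrase,
               whose last token is taken to be the head noun.
    """
    if not tagged_np:
        return []
    head = tagged_np[-1][0]
    first_token, first_tag = tagged_np[0]
    # DT = determiner, CD = cardinal number (Penn Treebank tags)
    if len(tagged_np) > 1 and first_tag in {"DT", "CD"}:
        return [first_token, head]
    return [head]


noun_phrase = [
    ("the", "DT"),
    ("recently", "RB"),
    ("discovered", "VBN"),
    ("murine", "JJ"),
    ("glucocorticoid", "NN"),
]
simplified = simplify_noun_phrase(noun_phrase)  # ["the", "glucocorticoid"]
```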
Syntactic Simplification
synSimp(t), where t is the Penn tree of the given sentence:
- Initialize simpTrees, the ordered set containing the Penn trees of all simplified sentences.
- FOREACH subtree of t, traversed in depth-first order
  - perform the necessary simplifications at that node, i.e. the simplifications that need not be repeated for all the parents of this node
- Add the present tree to simpTrees
- FOREACH unprocessed tree in simpTrees
  - FOREACH subtree of t, traversed in depth-first order
    - perform the simplifications for this node
    - add new trees to simpTrees if applicable
- return the sentences represented by the trees in simpTrees
END
Example Rule
Rule: NP[NP SBAR] ~ [NP], {SBAR -WHNP + NP}
Condition:
Wh-NP in the relative clause is replaced by NP from main clause
Explanation: Relative Clause
Example: To characterize these pathways, we focused on changes in the cyclin-dependent kinase inhibitors and their binding partners that underlie the cell cycle arrest at senescence.
Result:
changes in the cyclin-dependent kinase inhibitors and their binding partners.
The cyclin-dependent kinase inhibitors and their binding partners underlie
Using BioSimplify for Information
Extraction
[Figure: Work flow for each abstract. Boxes: Corpus, Doc from corpus, Remove Annotations, BioSimplify, Simplified docs, IE system (run on original and on simplified sentences), Results for original sentences, Results for simplified sentences, Comparison of different methods]
Fuzzy pattern matching approach
Generating patterns: Reduce each sentence to the snippets with concepts from the ground truth and some interaction keywords
Manually find synonyms of the keywords: the paradigmatic vector model we built for the concept extraction task
Strict pattern matching using OpenDMAP
Vector representation of sentences by encoding order using Sahlgren's permutation model or a proposed concatenation model
Use BioSimplify to transform the focus sentence into a bag of simplified sentences and search for patterns in each sentence
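To make the idea of fuzzy matching by vector similarity concrete, here is a small sketch. It deliberately ignores word order (sentences and patterns are plain vector sums), and the term vectors, pattern names and the 0.8 threshold are made-up illustrations rather than values from this work.

```python
import math


def vector_sum(tokens, term_vectors, dim):
    """Represent a token sequence as the sum of its known term vectors."""
    total = [0.0] * dim
    for token in tokens:
        for i, x in enumerate(term_vectors.get(token, [])):
            total[i] += x
    return total


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def fuzzy_match(sentence, patterns, term_vectors, dim, threshold=0.8):
    """Return (pattern_name, score) of the best pattern above threshold, or None."""
    sentence_vec = vector_sum(sentence, term_vectors, dim)
    best = None
    for name, pattern_tokens in patterns.items():
        score = cosine(sentence_vec, vector_sum(pattern_tokens, term_vectors, dim))
        if score >= threshold and (best is None or score > best[1]):
            best = (name, score)
    return best


# Hypothetical 4-dimensional term vectors; "relieves" is close to "improves".
toy_vectors = {
    "TREATMENT": [1.0, 0.0, 0.0, 0.0],
    "PROBLEM": [0.0, 1.0, 0.0, 0.0],
    "improves": [0.0, 0.0, 1.0, 0.0],
    "relieves": [0.0, 0.0, 0.9, 0.1],
    "causes": [0.0, 0.0, 0.0, 1.0],
}
toy_patterns = {
    "treatment_improves_problem": ["TREATMENT", "improves", "PROBLEM"],
    "treatment_causes_problem": ["TREATMENT", "causes", "PROBLEM"],
}
match = fuzzy_match(["TREATMENT", "relieves", "PROBLEM"], toy_patterns, toy_vectors, dim=4)
```

A strict matcher would miss this sentence because "relieves" is not in any pattern; the vector comparison still assigns it to the "improves" pattern.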
Creating symbiotic relationship with clinicians
"People are willing to do for free what they are not willing to do for small amounts of money", Spolsky, founder of stackoverflow.com
Online servers for hospitals to use for extraction of clinical concepts and relations as part of the i2b2 hive
Hospital staff verify the output of our system and correct it if necessary
De-identified corrections retrieved regularly from i2b2 to improve the system
Broader impacts and Sustainability Plans
Proposed using unlabeled data for improving extraction of concepts and relationships using distributional semantics methods
Suggested using sentence simplification for aiding interaction extraction
Novel additions to clinical domain
Successful adaptation would have a transformative influence on the field of information extraction
To sustain the application of these methods, we will
evaluate them against other competitive systems by participating in international competitions like the i2b2/VA NLP shared tasks
create a service-oriented architecture to offer the services of our system to the world free of cost
Wider dissemination
Obtainment of useful feedback
Team: Siddhartha Jonnalagadda (PI)
Currently pursuing PhD in Biomedical Informatics
B.Tech in CSE from IIT Kharagpur (5th in GPA among the class of 2008)
Inlaks Awards of Excellence at IITs [2006]
10th rank in All India Engineering Entrance among 730,000 students [2004]
Indian National Physics Olympiad Gold medalist [2004]
Regional Mathematics Olympiad Silver medalist [2003]
National Talent Search Examination Scholarship [2002]
Relevant Publications:
S Jonnalagadda, L Tari, J Hakenberg, G Gonzalez. Towards Effective Sentence Simplification for Automatic Processing of Biomedical Text. NAACL 2009
S Jonnalagadda, P Topham, G Gonzalez. Towards Automatic Extraction of Social Networks of Organizations in PubMed Abstracts. GTBN workshop in IEEE BIBM 2009
S Jonnalagadda, G Gonzalez. Sentence Simplification Aids Protein-Protein Interaction Extraction. LBM 2009
S Jonnalagadda, P Topham, G Gonzalez. ONER: Tool for Organization Named Entity Recognition from Affiliation Strings in PubMed Abstracts. LBM 2009
S Jonnalagadda, R Leaman, T Cohen, G Gonzalez. A Distributional Semantics Approach to Simultaneous Recognition of Multiple Classes of Named Entities. CICLing 2010, LNCS 6008
J Hakenberg, R Leaman, V Nguyen, S Jonnalagadda, et al. Efficient extraction of protein-protein interactions from full-text articles. Accepted by IEEE/ACM TCBB. 2010.
S Jonnalagadda, G Gonzalez. BioSimplify: an open source sentence simplification engine to improve recall in automatic biomedical information extraction. Submitted to AMIA 2010
Team: Dr. Graciela Gonzalez (Co-PI)
Assistant Professor in Biomedical Informatics at Arizona State University
Research Interests: natural language processing, knowledge representation, and translational bioinformatics
NSF panelist
Member of Biomedical Library and Informatics Review Committee (BLIRC)
Director of Data Management and Statistics Core of the Arizona Alzheimer's Disease Center
Director of DIEGO: Data Integration and Extraction of Genomic and Clinical Ontologies
Will oversee the project
Provide other student researchers of her lab, who come from informatics and clinical backgrounds
Team: Dr. Trevor Cohen (Co-PI)
Assistant Professor, School of Health Information Sciences, University of Texas, Houston
Research Interest: Empirical distributional semantics
One of the developers of the Semantic Vectors package
Proposed a novel word space model called Reflective Random Indexing
Advise on use of distributional semantics methods
Offered a 16 GB quad-core Opteron server primarily for this project
Team: Mr. Robert Leaman (Co-PI)
Member of DIEGO lab
Bachelor of Science degree in Computer Science from Brigham Young University
Several years in industry
Ph.D. student in Computer Science at Arizona State University
Research interests: Computational Linguistics, text mining and Named Entity Recognition
Developed BANNER, an open-source biomedical NER system
Mr. Leaman and Siddhartha have been working together in the BioCreative shared task
Time Plan
Month 1: Evaluate different word space models along with their parameters to
empirically discover the best vector-based kernel for biomedical and clinical text.
Month 2: Building a preliminary system for concept extraction that uses SimFind
learning algorithm and the optimal kernel discovered above using the clinical
corpus provided by the i2b2/VA for the shared task.
Month 3: Statistically combine the outputs of SimFind with other morphological and contextual features in a machine-learning framework.
Month 4: Reduce the number of the term vectors in the distributional semantics
model for unlabeled data and achieve efficient integration of MedLine and
unlabeled clinical documents' distributional features into the system
Month 5: adapt our protein-protein interaction system for relationship extraction
on clinical text and introduce the novel features of fuzzy pattern matching and
bag-of-simplified-sentences
Month 6-8: Create the symbiotic platform with clinicians
Thanks
Site Visit Team
Members of DIEGO lab
Faculty and students of BMI
Questions?
Appendix
Novel Approach to Biomedical Text Mining: Case Study on PPI Extraction
Siddhartha Jonnalagadda
PhD Candidate
Department of Biomedical Informatics
"Although the world is full of suffering, it is full also of the overcoming of it." Helen Keller
Motivation
Information Extraction from biomedical text is still an open problem
BioCreative Protein Identification: 42.9%
BioCreative PPI Extraction: 22.1%
Exploit Discourse Analysis Approach
Application of Distributional Semantics
Need for high performance, scalable, open-source and adaptable BioNLP systems
Integrate NLP, Linguistics & Machine Learning
Complexity of Biomedical Sentences
Compared to regular English:
1. More words per sentence
2. Inconsistent use of nouns and partial words
3. Higher perplexity measures
4. Greater lexical density
5. Increased number of relative clauses and prepositional phrases
6. Specialized names (e.g. p53, c-Abl)
7. Chemical names with commas and parentheses (e.g. 1,25(OH)2D3)
8. More coordination ellipsis (e.g. alpha- and beta-catenin)
Hypothesis
1. Removing complexity in biomedical sentences helps in unmasking relationships from NLP systems
2. Random indexing based measures are effective and more efficient in comparing patterns of words or sentences than traditional machine learning approaches using non-semantic features
Thesis ONTOLOGY
Protein-Protein Interaction Extraction
Named Entity Recognition
SimFind
BANNER + SimFind
Normalization
External Features
* Affiliation
* Author
Relationship Extraction
Discourse Analysis
Bag of Simplified Sentences model ~ Shot-gun sequencing
Distributional Semantics
Word-order preserving pattern extraction
Protein-Protein Interactions
Central tenet of modern translational and genomic research
Discovery methods: mass spectrometry, immunoprecipitation, Y2-H, and recently domain-based computational techniques
Becoming increasingly important in understanding human diseases at system-wide and genomic level.
Ex: pathogenesis of Huntington's disease
Strong functional correlation with genes
Create positive bias for genome-wide association analyses and reduce computation burden
Drug discovery, Disease prognosis, Genetic Epidemiology,
PPI Resources
Numerous publicly available databases, mostly human-curated
  Ex: HPRD, BioGrid, BIND, MINT, DIP, Reactome, UniHi, HPID, IntAct, STRING, GeneNetwork
Manual curation, despite years of effort, has only made a small dent (around 7%)
Text mining systems to automatically extract PPIs are available
  GENIES, BioRAT, GeneWays, MedScan, YAPPIE, AKANE
Online tools for biologists
  PIE, SPIES, Whatizit, RelEx, PolySearch, PubGENE, CBioC
Why another Novel Approach?
F-score of the best system in BioCreative II (2007) is 30%
F-score of the best system in BioCreative II.5 (2009) is 22% (different test set and online competition)
Reason: Many systems have 80-90% F-score on a corpus of less than 10k sentences, but that doesn't scale to random documents from (say) PubMed Central
  Humans are linguistically creative and paraphrase concepts in different ways by varying both vocabulary and syntax
  Distributional semantics addresses the vocabulary issue
  Sentence simplification based discourse analysis takes care of the syntax issue
Thesis ONTOLOGY
Protein-Protein Interaction Extraction
Named Entity Recognition
SimFind
BANNER + SimFind
Named Entity Recognition
Task:
Locate names in natural language text
Specify their type
Example entities:
Newswire: people, organization, location
Biomedical: gene, protein, species, disease, drug
Motivation: e.g. relationship extraction; not possible to perform manually
Biomedical Named Entity Recognition
Examples from GENIA gold standard:
IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.

Lexicon                  Semantics
IL-2 gene expression     Other name
IL-2 gene                DNA domain or region
NF-kappa B activation    Other name
NF-kappa B               Protein molecule
CD28                     Protein molecule
5-lipoxygenase           Protein molecule

Multi-labeling problem
Semantics in Biomedical NER
Early results: dictionary based (Settles 2004)
  Usually does not help (!)
  Dictionary gives a binary true / false
Alternating Structure Optimization (Ando 2007)
  Use 5 million Medline abstracts as unlabeled data
  Good performance
  Too computationally intensive
Word Clustering (Finkel 2009)
  Improvement not reported
Distributional Semantics
Main types: probabilistic & geometric
Geometric methods represent terms as a vector in an N-dimensional space
  LSA uses term-document matrix
  HAL uses term-term matrix
  Schütze's Wordspace uses term - four-grams of words
Large number of dimensions implies high computational & storage cost
Need dimensionality reduction
  Example: LSA uses SVD
Random Indexing
Geometric method with reduced dimensionality
Generates the reduced matrix directly
Johnson-Lindenstrauss Lemma: distance between points in vector space will be approximately preserved when projected into a reduced-dimensional subspace of sufficient dimensionality
Computationally efficient
  O(n) in the size of the corpus
  LSA (SVD) is O(n^3)
  Allows efficient incremental updates
Accuracy comparable to LSA
  e.g. performs as well as SVD methods on the TOEFL synonym test
Random Indexing
Each context is assigned a vector
  High dimensional: n ≈ 1000 (not that high)
  Sparse: about 1% of entries assigned {+1, -1}
  Randomly generated; zero-sum
Large number of possible permutations implies vectors will be close to orthogonal
Semantic term vectors are then just the linear sum of the term vectors in the sliding window context where they occur
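The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Semantic Vectors implementation: the function names, the toy corpus, the window of 2 tokens on each side and the per-token seeding are all hypothetical choices made for the example.

```python
import math
import random
from collections import defaultdict

def elemental_vector(token, dim=1000, nonzero=10, seed=0):
    """Sparse ternary vector: `nonzero` random positions, half +1 and half -1 (zero-sum)."""
    rng = random.Random(f"{seed}:{token}")  # assumption: seed by token so vectors are reproducible
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec

def train(sentences, dim=1000, window=2, seed=0):
    """Semantic vector of a term = sum of the elemental vectors of its window neighbours."""
    semantic = defaultdict(lambda: [0] * dim)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    elem = elemental_vector(tokens[j], dim, seed=seed)
                    semantic[term] = [s + e for s, e in zip(semantic[term], elem)]
    return dict(semantic)

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy corpus (hypothetical): "systolic" and "diastolic" occur in identical contexts.
corpus = [
    ["blood", "pressure", "systolic"],
    ["blood", "pressure", "diastolic"],
    ["gene", "expression", "il-2"],
]
vectors = train(corpus)
```

Because each sentence is processed once and each update touches one fixed-size vector, the cost grows linearly with corpus size, and new text can be added without recomputing earlier vectors.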
Random Indexing: Visualization
Semantic term vector for expression = sum(elemental vectors)
[Figure: the elemental vectors of the text tokens il-2, gene, expression, and, nf are summed (SUM) to give the semantic term vector]
Term-Term Similarity Example
Staphylococcus         Antidepressants         Pressure
0.61: aureus           0.41: tricyclic         0.34: blood
0.32: methicillin      0.33: antidepressant    0.33: systolic
0.29: epidermidis      0.18: reuptake          0.28: pressures
0.23: coagulase        0.17: tcas              0.28: mmhg
0.21: mrsa             0.15: tricyclics        0.26: diastolic
0.18: staphylococci    0.14: ssris             0.25: hg

Source: T. Cohen and D. Widdows, "Empirical distributional semantics: Methods and biomedical applications," Journal of Biomedical Informatics, vol. 42, 2009, pp. 390-405.
Encoding Word Order
Sequential structure of language is often important
  "Migraines cause nausea" vs. "nausea causes migraines"
Can be captured in RI using a permutation operation
  Creates a new orthogonal vector
  Reversible; can recreate original vector
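A minimal sketch of the permutation idea, assuming a cyclic shift as the permutation (one simple reversible choice; the slides do not say which permutation is used). The function names and the tiny 4-dimensional elemental vectors are hypothetical and chosen only so the result can be checked by hand.

```python
def permute(vec, offset):
    """Cyclically shift vec by `offset` positions; a negative offset undoes a positive one."""
    k = offset % len(vec)
    return vec[-k:] + vec[:-k] if k else list(vec)

def order_aware_context(tokens, i, elemental, window=2):
    """Sum the neighbours' elemental vectors, each permuted by its offset from position i."""
    dim = len(next(iter(elemental.values())))
    total = [0] * dim
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            shifted = permute(elemental[tokens[j]], j - i)
            total = [t + s for t, s in zip(total, shifted)]
    return total

# Hand-made elemental vectors (hypothetical, far smaller than in practice).
elemental = {"migraines": [1, 0, 0, 0], "nausea": [0, 1, 0, 0], "cause": [0, 0, 1, 0]}

forward = order_aware_context(["migraines", "cause", "nausea"], 1, elemental)
backward = order_aware_context(["nausea", "cause", "migraines"], 1, elemental)
```

Without the permutation both word orders would give the same sum for "cause"; with it, the two context vectors differ, which is the property the slide relies on.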
RI + Word Order: Visualization
Semantic term vector for expression = sum(permuted vectors)
[Figure: the elemental vectors of the text tokens il-2, gene, expression, and, nf are each permuted according to position offsets -2 through +2, and the permuted vectors are summed (SUM)]
SimFind Algorithm
SimFind(targetToken, Line) {
    List simSentences = getSimilarSentences(Line, 100);
    List goldenTokenLabels = getTokenLabels(simSentences);
  STEP1:
    FOREACH (goldenTokenLabel)
        IF (goldenTokenLabel has targetToken as token)
            return goldenTokenLabel;
  STEP2:
    IF (targetToken IN STOPLIST) return none;
    terms = 1;
  STEP3:
    terms *= 10;
    equivTokens = getSimWords(targetToken, terms);
    FOREACH (equivToken IN equivTokens)
        FOREACH (goldenTokenLabel)
            IF (goldenTokenLabel has equivToken as token) return goldenTokenLabel;
    IF (simIndex > 0.5)
        goto STEP3;
    return none;
}
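The control flow of the pseudocode can be sketched in Python as follows. This is a hedged reading, not the author's implementation: `golden_labels` stands in for the token labels of the most similar sentences, `get_sim_words` is a hypothetical callable returning (token, similarity) pairs in decreasing similarity, `simIndex` is assumed to be the similarity of the weakest neighbour retrieved so far, and `max_terms` is an added cap so the loop always terminates.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

def sim_find(
    target: str,
    golden_labels: Dict[str, str],
    get_sim_words: Callable[[str, int], List[Tuple[str, float]]],
    stoplist: FrozenSet[str] = frozenset(),
    min_sim: float = 0.5,
    max_terms: int = 1000,
) -> Optional[str]:
    # STEP 1: the token itself is labelled in one of the similar sentences.
    if target in golden_labels:
        return golden_labels[target]
    # STEP 2: stop words never receive a label.
    if target in stoplist:
        return None
    # STEP 3: widen the neighbourhood tenfold each round (10, 100, ...).
    terms = 1
    while terms < max_terms:
        terms *= 10
        neighbours = get_sim_words(target, terms)
        for token, _score in neighbours:
            if token in golden_labels:
                return golden_labels[token]
        # Assumption: keep widening only while the weakest neighbour is still similar enough.
        if not neighbours or neighbours[-1][1] <= min_sim:
            break
    return None

# Toy neighbour function (hypothetical similarities).
def toy_sim_words(token: str, n: int) -> List[Tuple[str, float]]:
    ranked = [("dexamethasone", 0.9), ("tpa", 0.8), ("ald", 0.7), ("zzz", 0.1)]
    return ranked[:n]

labels = {"ald": "other_organic_compound"}
result = sim_find("hu", labels, toy_sim_words)
```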
System Architecture
[Figure: system architecture. Components: GENIA (training with annotation, testing without annotation, testing with annotation), Lucene tokenizer, Apache Lucene / Semantic Vectors, random index vectors, stop word list, SimFind]
Sample Similar Sentences
SENTENCE: This activation was due to the translocation of p65 and c-Rel NF.kappa B proteins from cytoplasmic stores to the nucleus, where they bound the kappa B sequence of the IL-2R alpha promoter either as p50.p65 or as p50.c-Rel heterodimers.
1. The active nuclear form of the NF-kappa B transcription factor complex is composed of two DNA binding subunits, NF-kappa B p65 and NF-kappa B p50, both of which share extensive N-terminal sequence homology with the v-rel oncogene product.
2. Transcriptional activation of the human TF gene in monocytic cells exposed to bacterial lipopolysaccharide (LPS) is mediated by binding of c-Rel/p65 heterodimers to a kappa B site in the TF promoter.
3. In contrast to induction of STATs by cytokines, the IRF-1 GAS-binding complex activated by CD40, TNF-alpha, or EBV contains Rel proteins, specifically p50 and p65.
Sample Similar Tokens
SENTENCE: These results strongly suggest that HU induces both transcriptional and post-transcription regulation of c-jun during erythroid differentiation.
RESULT: HU is assigned label other_organic_compound

Rank  Token          Label(s)                             Present in N-most similar sentences?
1     dexamethasone  lipid                                No
2     tpa            other_name & other_organic_compound  No
3     jun            DNA_domain_or_region                 No
4     ap-1           DNA_domain_or_region                 No
5     ald            other_organic_compound               Yes
Sample SimFind Output

Token       Label
il-2        other_name & protein_molecule
gene        other_name & DNA_domain_or_region
expression  other_name
and         none
nf          protein_complex
kappa       other_organic_compound & protein_molecule
b           other_organic_compound & protein_molecule
activation  none
through     none
cd28        protein_molecule
requires    none
reactive    inorganic
oxygen      inorganic
...
Details
Lucene tokenizer
Sliding window model
  Dimensionality: 200
  Window size: 11
No stop-words for the vector space model
Stop words for SimFind: 421 derived from Brown corpus
IO tagging scheme
Implemented using open source Semantic Vectors package
GENIA Corpus
400,000 words from 2000 PubMed abstracts
100,000 annotations
47 entity types, hierarchical
17% of the entities are embedded in another entity
Previous uses of this corpus recognized
GENIA Ontology

Root
  source
    natural
      organism
        multi-cell
        mono_cell
        virus
      body_part
      tissue
      cell_type
    artificial
      cell_line
      other_artificial_source
  substance
    compound
      organic
        amino_acid
          protein
            protein_molecule
            protein_family_or_group
            protein_domain_or_region
            protein_substructure
            protein_subunit
            protein_complex
            protein_N/A
          peptide
          amino_acid_monomer
        nucleic_acid
          DNA
            DNA_molecule
            DNA_family_or_group
            DNA_domain_or_region
            DNA_substructure
            DNA_N/A
          RNA
            RNA_molecule
            RNA_family_or_group
            RNA_domain_or_region
            RNA_substructure
            RNA_N/A
          polynucleotide
          nucleotide
        lipid
        carbohydrate
        other_organic_compound
      inorganic
    atom
  Other_name
Example from GENIA gold standard
IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.

Lexicon                  Semantics
IL-2 gene expression     Other name
IL-2 gene                DNA domain or region
NF-kappa B activation    Other name
NF-kappa B               Protein molecule
CD28                     Protein molecule
5-lipoxygenase           Protein molecule

Multi-labeling problem
Evaluation
Used 5 x 2 cross validation on the GENIA corpus
Tokens may have more than one label due to embedded entities
Precision, recall and F-score calculated with respect to fragment match
  Must find correct node or child, no partial credit for ancestors or cousins
Overall micro-averaged F-score 67.3%
More than half of the entities have an F-score greater than 50.0%
RESULTS
Entity            Precision (%)  Recall (%)  F-score  Random F-score
Bio-entity        78.9           82.5        80.7     26.22
Substance         77.0           79.6        78.3     20.92
Organic compound  77.0           79.5        78.2     20.82
Compound          77.0           79.5        78.2     20.82
Amino acid        69.4           71.1        70.3     13.69
Protein           69.2           71.0        70.1     13.37
Lipid             66.1           67.0        66.5     0.66
Virus             65.6           67.3        66.4     0.86
Source            61.4           66.2        63.7     5.62
Atom              62.0           60.2        61.1     0.10
Nucleotide        57.0           64.4        60.5     0.05
Organism          59.6           58.8        59.2     1.23
Carbohydrate      63.2           45.7        53.1     0.05
DNA               48.3           52.6        50.4     5.31
Cell type         42.7           50.7        46.3     2.14
Cell line         44.0           44.9        44.5     1.87
RNA               47.0           41.2        43.9     0.61
Body part         39.6           45.0        42.1     0.10
Peptide           41.9           32.7        36.7     0.15
Polynucleotide    44.9           27.0        33.7     0.10
Tissue            22.8           23.7        23.3     0.20
Overall Score     66.3           68.4        67.3
Future Work in NER
Integrate into BANNER
  Corpus: GeneTag
  If time permits: inter-corpus evaluation
The best performing protein NER system, Ando (2007), uses 5 million PubMed abstracts as unlabeled data
  Improvement because of using semantic features from unlabeled data: 2%
Hopefully, we will make it too!
Thesis ONTOLOGY
Protein-Protein Interaction Extraction
Normalization
External Features
* Affiliation
* Author
Protein Normalization
It is useful (though not essential) to identify the interacting proteins with a standard id
  Required in BioCreative
Extending our motto of proposing novel approaches to counter the creative human authors, I plan on using two additional external features along with our lab's system:
  Author names of the closest PubMed abstract
  Geopolitical information of the first author
The idea is that words in papers from similar people (same last name) and similar places have the same sense
The closest PubMed abstract is found using a distributional semantics based index
Finding Geopolitical Information
Use affiliation sentence in the closest PubMed abstract
Different features available
  Country
  City
  State
  Address, Email, URL
  Less useful information like names of buildings
  Organization name and Sub-organization names
The last ones conform least to a general pattern because of the idiosyncrasies of the varied people responsible for naming an organization
  Perhaps needs normalization too
Multiple Layers, Multiple Rules,
Multiple Dictionaries
Neti, Neti: Elimination of untruth till one lands up at the feet of absolute truth
If finding an organization is difficult, let us find what is not an organization
Easier things first.
Each subtask: multiple rules and multiple dictionaries
Country Detection
1. Exact or approximate (hidden within a phrase) search for the country name with or without diacritics
2. Exact search for names of important cities
3. Exact search for region names
4. Exact search for city names
5. Email aliases
6. Approximate search for important city names
7. Approximate search for region names
8. Approximate search for city names
Organization Detection
Most of the phrases already identified as not Organization
Bootstrap acronym and Replace
Check O-Key
Check Person Name or Place Name
Person Name: Bootstrap
Place Name: Dictionary
And many more subtle rules
Error Analysis* OCH data
Total Number of Countries 4910
Number of Countries Detected 4758
  True Positives 4746
  False Positives 12
  False Positives after correction 1
Number of Countries Not Detected 152
  True Negatives 51
  False Negatives 102
  False Negatives after correction 23 (all of them have some suggestions)
*by 3 annotators: Divya, George and Siddhartha
Results Summary for Country Detection
Precision, the percentage of results returned that are correct = TP/(TP+FP) = 99.8%
  Precision after corrections = 99.98%
Recall, the fraction of (all) correct results returned = TP/(TP+FN) = 97.9%
  Recall after corrections = 99.54%
F-measure, the harmonic mean of precision and recall = 2PR/(P+R) = 98.8%
  F-measure after corrections = 99.76%
For the second best known system (Yu, et al.):
  Precision = 94.0%; Recall = 92.1%; F-measure = 93.0%
Analysis* using Staph Aureus data
Organizations:
  Number of Affiliation sentences = 4000
  True Positives = 3989
  False Positives = 0
  False Negatives = 11
  Precision = 100%; Recall = 97.5%; F-measure = 98.7%
States
  True Positives = 3528
  False Positives = 470
  False Negatives = 2
  Precision = 88.2%; Recall = 99.9%; F-measure = 93.7%
Cities
  True Positives = 3611
  False Positives = 2
  False Negatives = 387
  Precision = 99.9%; Recall = 90.3%; F-measure = 94.9%
For the second best known system (Yu, et al.):
  Precision = 86.8%; Recall = 91.3%; F-measure = 89.0%
*Done by: Divya
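The precision, recall and F-measure formulas used on these slides can be checked with a short helper. The function name is illustrative; the counts plugged in are the States and Cities rows reported above.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Counts taken from the States and Cities rows of the analysis.
states = precision_recall_f(tp=3528, fp=470, fn=2)
cities = precision_recall_f(tp=3611, fp=2, fn=387)

states_pct = [round(100 * x, 1) for x in states]   # [88.2, 99.9, 93.7]
cities_pct = [round(100 * x, 1) for x in cities]   # [99.9, 90.3, 94.9]
```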
Organization Normalization
Two types of Named Entities
  Described entities: those which uniquely identify a real-world organization given the GPE
    All organizations containing a person name, a place name, or a directional modifier
    Ex: Jerome Lipper Center for Multiple Myeloma, University of Texas, and North Western University
  Descriptor entities, also sub-organizations: those which don't uniquely identify a real-world organization unless in the presence of a Described entity, and whose primary role is to give more specific information about a Described entity
    Ex: School of Informatics and Dept. of Biomedical Informatics
We are interested only in Described entities, also called organizations
Ambiguities
No polysemy at entity class. Single entity.
Polysemy within entity class:
  Mayo Clinic, USA
  Mayo Clinic, Rochester, USA
Synonymy at Word level because of NSWs (non-standard words)
  Formatting errors mainly because of OCR in MARS
  Spelling mistakes
  (Abbreviations taken care of at NER stage)
Synonymy at Inter-Word level due to lack of consensus in the choice of words while referring to an organization
Example of Synonymy
Washington University School of Medicine
Washington University School of Medicine and St. Louis Children's Hospital
School of Medicine
Washington University School of Medicine at Barnes-Jewish Hospital
Barnes-Jewish Hospital at Washington University School of Medicine
Washington University
Washington University School of Medicien*
Washington University School of Medicine and Metropolitan St. Louis Psychiatric Center
Barnes Retina Institute and Washington University School of Meidcine*
Division of Gastroenterology Washington University School of Medicine St. Louis
*ibid
Common Approach
Compare against a list or dictionary of organizations
  Used in gene normalization
    Map the Entrez Gene identifiers for gene names mentioned in PubMed/MEDLINE abstracts
  Unlike genes, many organizations get renamed and some become defunct
  No community interest in maintaining a database
Our Approach:
  Automatically build a database of organization clusters, OrgDB, from 100,000 randomly selected affiliation sentences from PubMed published between the years 1998 and 2008
Clustering
Entries in OrgDB
  Centroid string
    has the least sum of distances in DIST
  List of all organizations in each entry or cluster
  DIST matrix containing inter-component distance
  PubMed IDs of the components
  GPE (city, state and country) of the cluster
An organization is added if it is similar according to the String Similarity Metric
  And all entries in the cluster are updated
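Picking the centroid string amounts to choosing the cluster member whose summed distance to all other members is smallest. The sketch below assumes a stand-in distance (size of the symmetric difference of the word sets) purely for illustration; in OrgDB the distances would come from the string similarity metric described later, and all names here are hypothetical.

```python
def word_set_distance(a, b):
    """Stand-in distance (assumption): number of words that occur in only one of the names."""
    return len(set(a.lower().split()) ^ set(b.lower().split()))

def centroid(names, dist=word_set_distance):
    """Return the name with the least sum of distances to every name in the cluster."""
    return min(names, key=lambda a: sum(dist(a, b) for b in names))

cluster = [
    "Duke University",
    "Duke University Medical Center",
    "Duke University Medical Center Durham",
]
best = centroid(cluster)
```

With these three names the summed distances are 5, 3 and 4, so the middle name is chosen.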
Two types of Sequence Alignment
Global Sequence Alignment
  Implemented by Needleman-Wunsch Algorithm using dynamic programming
Local Sequence Alignment
  Implemented by Smith-Waterman Algorithm
  Also uses dynamic programming
Global Seq. Alignment is too strict
Local Seq. Alignment is inaccurate
Example
Limitation of Global Alignment
Limitation of Local Alignment
Local Learning
Use local information from the training data to further enhance the value of the training set
Have a Tight String Similarity Metric
Understand the data by finding connected components
Precise Computation
We call it Recalculation through self-training
Inspired by the Charniak-McClosky parser, Brown University. Currently the best.
Tight String Similarity (TSS) Metric
Levenshtein distance between the two organization names
  NOT at the character level
  BUT at the word level
  AFTER removing stop words
Criteria for two words a & b to be same:
Parameters for Levenshtein Distance
  Penalty of gap = length of the word
  Penalty of mismatch = sum of the lengths of the words
Sentences are the same if the distance between them is not more than 4
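A word-level Levenshtein distance with these penalties can be sketched as below. Two parts are assumptions and labeled as such: the stop word list is a small hypothetical one, and since the slide's criterion for two words being "the same" did not survive, `same_word` uses case-insensitive equality as a placeholder.

```python
STOP_WORDS = frozenset({"of", "the", "and", "at", "for", "in"})  # hypothetical list

def same_word(a, b):
    """Placeholder criterion (assumption): case-insensitive equality."""
    return a.lower() == b.lower()

def tss_distance(name_a, name_b):
    """Word-level edit distance: gap costs len(word), mismatch costs len(a) + len(b)."""
    a = [w for w in name_a.split() if w.lower() not in STOP_WORDS]
    b = [w for w in name_b.split() if w.lower() not in STOP_WORDS]
    prev = [0]
    for wb in b:                      # first row: insert every word of b
        prev.append(prev[-1] + len(wb))
    for wa in a:
        cur = [prev[0] + len(wa)]     # first column: delete every word of a
        for j, wb in enumerate(b, 1):
            substitute = prev[j - 1] + (0 if same_word(wa, wb) else len(wa) + len(wb))
            cur.append(min(substitute, prev[j] + len(wa), cur[j - 1] + len(wb)))
        prev = cur
    return prev[-1]

def same_organization(name_a, name_b, threshold=4):
    """Names count as the same when the distance is not more than the threshold."""
    return tss_distance(name_a, name_b) <= threshold
```

Under these penalties a whole extra word such as "Medical" already costs more than the threshold, which is what makes the metric tight.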
Recalculation
TSS addresses synonymy caused by NSWs
Synonymy still exists among clusters in OrgDB
  Due to the lack of consensus in the choice of words
  Ex: "The David Geffen School of Medicine at The University Of California" and "DG School of Medicine at The University Of California at Los Angeles"
Recalculate: synonyms of the current cluster
  Equivalent to finding the connected component in the corresponding graph
Seeing it as a Graph
OrgDB is equivalent to an undirected graph, OrgG
  Vertices: clusters
  Edges: between any two clusters
    One of which almost contains another, like David Geffen School of Medicine and DG School of Medicine
    Both have same GPE
"Almost contains" if Extended Smith-Waterman Score (ESS) is more than 0.90
Finding Connected Component
Initialization: Add the vertex (cluster) we are concerned with
Propagation: Iteratively visit each unvisited vertex (cluster) closest to the root
  add all the vertices (clusters) that are adjacent to it and not already in the connected component
Pruning: From depth 3, add only those vertices:
  which have an organization that collaborated in an article with an organization in one of the vertices (clusters) already in the connected component
  Prevents errors
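The three steps above describe a breadth-first traversal with a pruning test. The following is a sketch under assumptions rather than the actual implementation: the root is taken to be at depth 0 (so pruning starts three edges away), `adjacent` is a hypothetical adjacency dictionary over cluster ids, and `collaborated` is a hypothetical predicate for co-authorship between two clusters.

```python
from collections import deque

def connected_component(root, adjacent, collaborated, prune_depth=3):
    """Breadth-first expansion from `root`; from `prune_depth` on, a vertex is added
    only if it collaborated with some vertex already in the component."""
    component = [root]
    depth = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()           # unvisited vertex closest to the root
        for u in adjacent.get(v, ()):
            if u in depth:
                continue
            d = depth[v] + 1
            if d >= prune_depth and not any(collaborated(u, m) for m in component):
                continue              # pruned: no collaboration evidence
            depth[u] = d
            component.append(u)
            queue.append(u)
    return component

# Toy graph (hypothetical ids): VA and X are both three edges away from O1.
graph = {"O1": ["O2"], "O2": ["O1", "O6"], "O6": ["O2", "VA", "X"]}
pairs = {frozenset({"X", "O2"})}
cc = connected_component("O1", graph, lambda a, b: frozenset({a, b}) in pairs)
```

Here X is kept because it collaborated with O2, while VA is left out.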
An Example
Input:
  PubMed ID: 16849888
  Affiliation sentence: Duke University Medical Center and Duke Clinical Research Institute, Durham, NC 27710, USA.
NER:
  Organization: Duke University Medical Center and Duke Clinical Research Institute (O1)
  Country: USA
  State: North Carolina
  Zip Code: 27710
  City: DURHAM
Example Continued
Adding to OrgDB or OrgG
  No cluster in OrgDB is close to O1
  Add new cluster for O1.
Recalculation Step 1
  Add O1 to the connected component, CC
Example Continued
Recalculation Step 2
  Organizations adjacent to O1 are:
    Duke Clinical Research Institute (O2)
    Duke University Medicalcenter (O3)
    Duke University Medical Center (O4)
    Duke University (O5)
  Add these to CC
Continued
Recalculation Step 3
  Expand all the nodes at level 2. For example, O2.
  The organization adjacent to O2 is Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute (O6). Add this to CC
  Repeat it for O3, O4 and O5.
  We get 14 more organizations in CC: O7 through O20
Recalculation Step 4
Consider expanding O19 to:
  Durham Veterans Affairs Medical Center
  Veterans Affairs Medical Center. An examination of
These two didn't collaborate (in any of the 100,000 publications) with any organization in the CC.
This justifies our not adding these organizations to the CC.
Step 4 is continued for the rest of the organizations
Finally!
Connected component gave us the set of all the synonyms of the organization
Depending on the objectives of normalization, the criterion to choose varies
We picked the centroid string of the cluster with the largest number of publications as the normalized name
  Going by this criterion: "Duke University Medical Center" becomes the normalized name for "Duke University Medical Center and Duke Clinical Research Institute"
Analysis
Test set: obtained 4135 articles related to a study on Antiangiogenesis indexed in PubMed between 2004 and 2008
The normalization process identified each article with a unique standard organization
182 unique organizations were identified (13.8 articles per organization)
Overall 13 errors; 5 were caused only by NER, which means the Normalization process alone has a precision of 99.5% with 100% recall
Magic Mappings
Discovering a richer set of synonyms than naïvetés
(Digression) Top-10 organizations in terms of the number of publications
(Digression) Top-10 most influential organizations
Work Left
Normalization is done only for USA; needs to be extended globally
Incorporating these features into the existing system should be easy
Analyze the improvement in performance
  Corpus: BioCreative III training set and/or test set
Thesis ONTOLOGY
Protein-Protein Interaction Extraction
Relationship Extraction
Discourse Analysis
Bag of Simplified Sentences model ~ Shot-gun sequencing
Distributional Semantics
Word-order preserving pattern extraction
Discourse Analysis
"Language is not merely a bag of words but a tool with particular properties ... The linguist's work is precisely to discover these properties, whether for descriptive analysis or for the synthesis of quasi-linguistic systems." (Harris, 1954)
For this, DAs break the sentence into simpler clauses.
  Even Quirk's simple sentence can still be a complex clause
  "We have identified a new TNF-related ligand, designated human GITR ligand (hGITRL), and its human receptor (hGITR), an ortholog of the recently discovered murine glucocorticoid-induced TNFR-related (mGITR) protein."
DAs critically analyze discourse using the integration tool
Discourse Analysis Tools
Deixis Tool
  ask how deictics are being used to tie what is said to context and to make assumptions about what authors already know or can figure out
  Pronoun Resolution
  Appositives
Fill in Tool
  Based on what was said and the context in which it was said, what needs to be filled in here to achieve clarity?
  Distributional semantics
Discourse Analysis Tools
Integration Tool
  Find how clauses were integrated or packaged into main, subordinate, and embedded (relative) clauses
  What was missing and what got added?
Example: It has been shown that LIGHT triggers apoptosis of various tumor cells including HT29 cells that express both lymphotoxin beta receptor (LTbetaR) and HVEM / TR2 receptors.
  LIGHT triggers apoptosis of various tumor cells.
  LIGHT triggers apoptosis of HT29 cells.
  HT29 cells express both lymphotoxin beta receptor and HVEM / TR2 receptors.
Halliday's Systemic Functional Grammar
Three ways clauses expand to sentences
  Elaborating its existing structure
    Example: Relative clause
  Extending it by addition or replacement
    Example: Coordination
  Enhancing its environment
    Example: Cause-conditional
We will be using these guidelines to design rules for creating simpler sentences
BioSimplify
GOAL: Create bag of simplified sentences (BOSS) for automatic discourse analysis.
  Application: information extraction on biomedical text
Existing methods: features like POS tags, parse trees and dependencies, informally known as bag-of-NLP
BOSS: standardize the representation of grammatical information in elemental chunks
Sentence Simplification: Motivations
Improve human readability
Shorter
Grammatical
Cohesive
Information-preserving
Text summarization
Shorter
Preserve only important
Improve parser performance or Relationship Extraction
Shorter
Grammatical
Information-preserving
Architecture
Noun Phrase Replacement
Using POS tags and noun phrase chunker
POS tags: LingPipe
Chunker: OpenNLP
Syntactic Simplification
Use any parser to produce a CFG Penn tree
Parser: McClosky retraining parser (88% F-measure)
Information Extraction System
Example: PIE
[Figure: architecture diagram. Boxes: Sentence, NP Replacement, Syntactic Simplification, BOSS, Information Extraction system; supporting components: NP Chunker, Parser]
Noun Phrase Replacement
A noun phrase consists of an optional determinative, an optional premodification, a mandatory head, and an optional postmodification
Noun phrase chunkers return all the noun phrases of the smallest length, thus always excluding the postmodifications
The last word of the identified noun phrase is the head noun
Removal of the optional determinative makes the sentence ungrammatical
All tokens other than the head noun and the starting determinative or numeral (if one exists) are removed
For example, the noun phrase "the recently discovered murine glucocorticoid" is replaced with "the glucocorticoid".
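The replacement step described above can be sketched in a few lines. This is only an illustration, not the BioSimplify code: the function name `replace_noun_phrase` is hypothetical, the input is assumed to be one chunk already produced by an NP chunker as (token, POS tag) pairs, and the tags are assumed to follow the Penn Treebank convention (DT for determiners, CD for numerals).

```python
# Hypothetical helper, not part of BioSimplify: reduces one base noun phrase.
# Assumes Penn Treebank POS tags (DT = determiner, CD = cardinal number).
LEADING_TAGS = {"DT", "CD"}


def replace_noun_phrase(tagged_np):
    """Keep the starting determiner or numeral (if any) and the head noun.

    tagged_np: list of (token, pos_tag) pairs for one chunk from an NP chunker.
    """
    if not tagged_np:
        return []
    head = tagged_np[-1][0]  # the last word of the chunk is taken as the head noun
    first_token, first_tag = tagged_np[0]
    kept = []
    if len(tagged_np) > 1 and first_tag in LEADING_TAGS:
        kept.append(first_token)  # dropping it would make the sentence ungrammatical
    kept.append(head)
    return kept


np_chunk = [("the", "DT"), ("recently", "RB"), ("discovered", "VBN"),
            ("murine", "JJ"), ("glucocorticoid", "NN")]
print(" ".join(replace_noun_phrase(np_chunk)))  # the glucocorticoid
```

Because chunkers return the shortest noun phrases, postmodifications never reach this function, so taking the last token as the head is consistent with the slide.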
Syntactic Simplification
synSimp(t), where t is the Penn tree of the given sentence:
- Initialize simpTrees, the ordered set containing the Penn trees of all simplified sentences.
- FOREACH subtree of t, traversed in depth-first order
    - perform the necessary simplifications at that node, i.e. the simplifications that
      needn't be repeated for all the parents of this node
- Add the present tree to simpTrees
- FOREACH unprocessed tree in simpTrees
    - FOREACH subtree of that tree, traversed in depth-first order
        - perform the simplifications for this node
        - add new trees to simpTrees if applicable
- return the sentences represented by the trees in simpTrees
END
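The worklist structure of synSimp can be illustrated with a small runnable sketch. Several things here are assumptions made for the illustration: trees are nested tuples of the form (label, child, ...) with plain strings as tokens, each rule is a function that maps a node to a list of new trees, the in-place first pass of the pseudocode is omitted, and only the S ~ {S} rule is implemented. The example sentence is a shortened toy version, not the parser's actual output.

```python
def leaves(tree):
    """Return the tokens of a tree in order."""
    if isinstance(tree, str):
        return [tree]
    return [tok for child in tree[1:] for tok in leaves(child)]


def subtrees(tree):
    """Yield every non-leaf subtree in depth-first (pre-order) order."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)


def rule_simple_sentence(node):
    """S ~ {S}: an S node with both an NP and a VP child is a sentence of its own."""
    if node[0] != "S":
        return []
    labels = {child[0] for child in node[1:] if not isinstance(child, str)}
    return [node] if {"NP", "VP"} <= labels else []


def syn_simp(tree, rules):
    """Apply the rules to every node of every tree until no new tree appears."""
    simp_trees = [tree]  # ordered, duplicate-free collection of trees
    position = 0         # index of the next unprocessed tree
    while position < len(simp_trees):
        current = simp_trees[position]
        position += 1
        for node in subtrees(current):
            for rule in rules:
                for new_tree in rule(node):
                    if new_tree not in simp_trees:
                        simp_trees.append(new_tree)
    return [" ".join(leaves(t)) for t in simp_trees]


# Toy tree (hypothetical, hand-built): a main clause with an embedded clause.
toy = ("S",
       ("NP", "E2F", "complexes"),
       ("VP", "are", "prevented",
        ("SBAR", "when",
         ("S", ("NP", "p21"), ("VP", "is", "induced")))))

for sentence in syn_simp(toy, [rule_simple_sentence]):
    print(sentence)
```

Keeping the rules as plain functions means further rules can be added to the list without touching the traversal.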
Rules
Rule: S ~ {S}
Condition:
S contains NP
S contains VP
Explanation: Adds all simple sentences into bag
Example: In differentiating C2C12 cells, E2F complexes
switch and DNA synthesis in response to serum are
prevented when MyoD DNA binding activity and the cdks
inhibitor MyoD downstream effector p21 are induced.
Result: MyoD DNA binding activity and the cdks inhibitor
MyoD downstream effector p21 are induced.
Rules
Rule: NP[NP1 VP1*] ~ [NP1] {NP1 "can be" VP1}
Condition:
VP1 starts with a gerund, present participle or past participle
Explanation: Postmodification by verb phrase
Example:
The cloning of members of these gene families and the identification of the
protein-interaction motifs found within their gene products has initiated the
molecular identity of factors (TRADD, FADD/MORT, RIP, FLICE/MACH, and
TRAFs) associated with both of the p60 and p80 forms of the TNF receptor
and with other members of the TNF receptor superfamily.
Result: The cloning of members of these gene families and the identification of the
protein-interaction motifs has initiated the molecular identity of factors
The protein-interaction motifs can be found within their gene products.
Rules
Rule: NP[NP1 ADJP1] ~ [NP1] {NP1 "can be" ADJP1}
Explanation: Postmodification by adjective phrase
Example:
Src homology domain-2 (SH2)/SH3 domain-containing
adapters such as Grb2, Crk, and Crk-L, which
interact with guanine nucleotide exchange factors specific
for the Ras family.
Result:
interact with guanine nucleotide exchange factors.
Guanine nucleotide exchange factors can be specific for
the Ras family.
Rules
Rule: NP[NP1 PRN] ~ [NP1] [PRN - LRB - RRB]
Explanation:
Add two sentences: one with the abbreviation removed, the other with the
NP replaced by the abbreviation
Example: Coexpression of the alpha and betaL subunits of the human interferon
alpha (IFNalpha) receptor is required for the induction of an antiviral
state by human IFNalpha.
Result:
Coexpression of the alpha and betaL subunits of the human interferon
alpha receptor is
Coexpression of the alpha and betaL subunits of the human IFNalpha
is
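A minimal sketch of the parenthetical rule follows, under stated assumptions: trees are nested tuples (label, child, ...) with strings as tokens, brackets appear as the Penn tokens -LRB- and -RRB-, and the helper names are hypothetical. It is applied to a small hand-built noun phrase, so it illustrates the two-variant idea without claiming to reproduce the slide's exact output.

```python
def leaves(tree):
    """Return the tokens of a tree in order."""
    if isinstance(tree, str):
        return [tree]
    return [tok for child in tree[1:] for tok in leaves(child)]


def subtrees(tree):
    """Yield every non-leaf subtree in depth-first (pre-order) order."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)


def apply_prn_rule(node):
    """NP[NP1 PRN] ~ [NP1] [PRN - LRB - RRB]: return both replacements, or None."""
    if node[0] != "NP" or len(node) != 3:
        return None
    np1, prn = node[1], node[2]
    if isinstance(np1, str) or np1[0] != "NP":
        return None
    if isinstance(prn, str) or prn[0] != "PRN":
        return None
    inner = tuple(c for c in prn[1:] if c not in ("-LRB-", "-RRB-"))
    return np1, ("NP",) + inner


def substitute(tree, target, replacement):
    """Rebuild the tree with the node `target` (matched by identity) replaced."""
    if tree is target:
        return replacement
    if isinstance(tree, str):
        return tree
    return (tree[0],) + tuple(substitute(c, target, replacement) for c in tree[1:])


def prn_variants(tree):
    """Return the two variants for the first matching node, else the tree itself."""
    for node in subtrees(tree):
        result = apply_prn_rule(node)
        if result:
            return [substitute(tree, node, r) for r in result]
    return [tree]


# Hand-built toy phrase (hypothetical bracketing).
toy = ("NP", "the", "human",
       ("NP", ("NP", "interferon", "alpha"),
        ("PRN", "-LRB-", ("NP", "IFNalpha"), "-RRB-")),
       "receptor")

for variant in prn_variants(toy):
    print(" ".join(leaves(variant)))
```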
Rules
Rule: NP[NP1 PP] ~ [NP1]
Explanation: Postmodification by prepositional phrase
Example:
To explore the role of the different domains of the betaL
subunit in IFNalpha signaling, we coexpressed wild-type
alpha subunit and truncated forms of the betaL chain in
L-929 cells.
Result:
To explore the role in IFNalpha signaling, we coexpressed
wild-type alpha subunit and truncated forms of the betaL
chain in L-929 cells.
Rules
Rule: VP[MD VP1 , S*] ~ [MD VP1]
Condition:
S contains VP and not NP
Explanation: Postmodification by verb phrase
Example: T lymphocytes can be activated normally in response to either
stimulus, demonstrating that the effects of the inactive CaMKIV on
activation are reversible.
Result: T lymphocytes can be activated normally in response to either
stimulus.
Rules
Rule: NP[NP : S*] ~ [S*]
Condition:
S contains VP or NP
Explanation: Section indicator
Example: OBJECTIVE: To investigate the relationship between the expression of
Th1/Th2 type cytokines and the effect of interferon-alpha therapy.
Result:
To investigate the relationship between the expression of Th1/Th2 type cytokines and the effect of interferon-alpha therapy.
Rules
Rule: S[S1 , NP VP] ~ [NP VP]
Condition:
S1 doesn't contain both NP and VP
Explanation: Content clause
Example: To characterize these pathways, we focused on changes in the cyclin-
dependent kinase inhibitors and their binding partners that underlie
the cell cycle arrest at senescence.
Result:
We focused on changes in the cyclin-dependent kinase inhibitors and
their binding partners that underlie the cell cycle arrest at senescence.
Rules
Rule: NP[NP SBAR] ~ [NP], {SBAR -WHNP + NP}
Condition:
Wh-NP in the relative clause is replaced by NP from main clause
Explanation: Relative Clause
Example: To characterize these pathways, we focused on
changes in the cyclin-dependent kinase inhibitors and their
binding partners that underlie the cell cycle arrest at
senescence.
Result: changes in the cyclin-dependent kinase inhibitors and their binding
partners.
The cyclin-dependent kinase inhibitors and their binding partners
underlie
Rules
Rule: VP*, SBAR+ ~ -, SBAR
Explanation: Relative Clause
Example: As [Ca2+]o increased, [Ca2+]i rapidly increased, as
monitored by fluorometry.
Result: As [Ca2+]o increased, [Ca2+]i rapidly increased.
Rules
Rule: VP[VP1 , CC VP2] ~ [VP1] [VP2]
Explanation: Coordination of verb phrases
Example:
These mechanisms must be understood in order to
prevent, or combat, the emergence of a virulent,
multidrug-resistant form of the bacillus that would be
uncontrollable by means of today's treatment strategies.
Result:
These mechanisms must be understood in order to
prevent, the emergence of a virulent, multidrug
These mechanisms must be understood in order to combat
, the
Rules
Rule: VP[... , PP] ~ }, PP{
Explanation: Postmodification by prepositional phrase
Terminal prepositional phrase and preceding comma are removed
from verb phrase
Example: Because cell lines can lose their differentiated phenotype in culture
across passages, documentation of gene expression must be
determined for passage populations, for us to have knowledge of cell
behavior in vitro.
Result: Because cell lines can lose their differentiated phenotype in culture
across passages, documentation of gene expression must be
determined for passage populations.
BioSimplify Parser
The rules are perfect, so BioSimplify is as perfect as the Penn trees are.
The Penn trees are as perfect as the performance of parsers in the biomedical domain
Increasing parser performance (F-measure)
Stanford (lexicalised) 72.5% (2003)
Link Grammar ~70% (2006)
Charniak-Lease 81% (2005)
Charniak-McClosky 84% (2008)
Charniak-McClosky 88% (2009)
Calculated Penn tree databases also available
ASU BMI (& CSE)'s PTDB parse tree database
NLP web service provided by NCIBI
Analysis
Worst-case Time Complexity O(n^2 * R), where
n = number of tokens in the sentence
R = number of simplification rules
Average Time Complexity O(n log(n) * R)
Better than our prototype system (2009) based on the Link Grammar parser
Time complexity O(n^3 * R)
Link Grammar is getting behind in the race
The dependencies produced aren't standard like the Penn trees
BioSimplify accuracy: precision of 90%, recall of 99% and F-score of 95%
Test Set: 404 sentences from AIMed
Old Model
Preprocessing
Removal of sentence indicators
Removal of phrases in parentheses
Partial resolution of coordination ellipsis
Gene Entity Replacement
Noun Phrase Replacement
Syntactic transformation
Grammatical correctness using GRAM vector
Old Model: GRAM vector
Every sentence can be uniquely associated with the 2-tuple of null count and disjunct cost (n, d)
A null count (which represents unwanted words) needs more attention than the disjunct cost (which represents less likely words)
We define the 2-tuple (n1, d1) to be greater than (n2, d2) if and only if n1 is greater than n2, or n1 is equal to n2 and d1 is greater than d2.
The grammatical correctness of a collection of sentences is measured by the 2-tuple of the sum of the null counts of the individual sentences and the sum of the disjunct costs of the individual sentences, respectively.
Since null counts and disjunct costs are typically less than 10 (i.e., one-digit numbers), for the purpose of easy comparison and for capturing the 2-tuples in one dimension, we define a new cost vector GRAM, which is equal to 10*UNUSED + DIS, where UNUSED is the null count and DIS is the disjunct cost.
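The GRAM definition can be written out as a short sketch. The function names and the sample (null count, disjunct cost) pairs are hypothetical; the formula 10*UNUSED + DIS and the ordering on 2-tuples are taken from the slide. Python compares tuples lexicographically, which matches the ordering defined above.

```python
def gram(null_count, disjunct_cost):
    """Collapse the 2-tuple (null count, disjunct cost) into GRAM = 10*UNUSED + DIS."""
    return 10 * null_count + disjunct_cost


def collection_gram(pairs):
    """GRAM of a collection: sum the null counts and disjunct costs, then collapse.

    pairs: list of (null_count, disjunct_cost), one pair per sentence.
    """
    total_null = sum(n for n, _ in pairs)
    total_dis = sum(d for _, d in pairs)
    return gram(total_null, total_dis)


# Hypothetical values for two candidate simplifications of the same sentence.
candidate_a = [(0, 2), (0, 1)]  # no unused words, small disjunct costs
candidate_b = [(1, 0), (0, 0)]  # one unused word

print(collection_gram(candidate_a))  # 3
print(collection_gram(candidate_b))  # 10

# With one-digit values the scalar agrees with the tuple ordering.
print((1, 0) > (0, 9), gram(1, 0) > gram(0, 9))  # True True
```

The scalar is a convenience that relies on the one-digit assumption: once a summed disjunct cost reaches 10 or more, GRAM and the tuple ordering can disagree, for example (0, 12) versus (1, 0).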
Old Model: Overview of Rules
Rules for prefix subordination, infix subordination and if-then coordination (details in Siddharthan, 2003)
These rules were also adapted recently by SimText (Ong, et al., 2008), a text simplification system for improving the readability of medical literature, but without a mechanism to judge the grammatical correctness.
Differences
 | New Version | Old Version
Time Complexity | O(n log(n) * R) | O(n^3 * R)
Dependencies | PTB format | Non-standard LG linkages
NP chunking | LingPipe + OpenNLP | Stanford
Parser | Charniak-McClosky | Link Grammar
Domain Adaptability | Yes | No
Protein Replacement | |
Scalability | Yes | No
Customizability | Yes | No
Model | Bag of simplified sentences | Minimal number of information-preserving, maximally simple sentences
PPI Extraction Experiment
[Figure: Work Flow for Each Abstract. Flowchart nodes: Abstract from AIMed, Remove Annotations, bioSimplify, Simplified Abstract, PIE (twice), Results for original sentences, Results for simplified sentences, AIMed, Comparison of different methods]
Results
 | Precision | Recall | F-score
Original sentences | 46 | 58 | 51
Sentences simplified by older BioSimplify | 51 | 64 | 57
Sentences simplified by current BioSimplify | 46 | 82 | 60
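For reference, the relation between the three columns can be checked with a short sketch. It assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall), which the slides do not state explicitly. The inputs are the rounded table values.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table, in percent.
rows = {
    "Original sentences": (46, 58),
    "Sentences simplified by older BioSimplify": (51, 64),
    "Sentences simplified by current BioSimplify": (46, 82),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: {f_score(precision, recall):.1f}")
```

Recomputing from the rounded entries gives about 51.3, 56.8 and 58.9. The first two agree with the table after rounding. The last is about one point below the reported 60, which may reflect rounding of the inputs or a different averaging scheme.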