Extracting Medical Attributes and Finding Relations
Sanghamitra Deb Accenture Technology Laboratory
[Figure: Personalized Medicine, linked to drugs, side effects, ethnicity, dosages, diseases, age group, compounds, gender, and interactions]
FDA Drug Labels
It is indicated for treating respiratory disorder caused due to allergy.
For the relief of symptoms of depression.
Evidence supporting efficacy of carbamazepine as an anticonvulsant was derived from active drug-controlled studies that enrolled patients with the following seizure types:
LOTEMAX is a corticosteroid indicated for the treatment of post-operative inflammation and pain following ocular surgery.
FDA Drug Labels: Examples
We present a case of a 10-year-old boy who had severe relapsing pancreatitis three times in two months within 3 weeks after starting treatment with methylphenidate ( ritalin ) due to attention deficit hyperactivity disorder (adhd). The boy was generally healthy except for that he was newly diagnosed with adhd and started the use of methylphenidate ( ritalin ) for the past three weeks at a dose, of 30 mg daily. We believe that the number of persons suffering from pancreatitis due to the use of ritalin is more than this published case. Physicians must pay attention regarding this possible complication and it should be taken into consideration in every patient with abdominal pain who started consuming ritalin.
Meta Data
Dosage single dose: 240 ml
Drug methylphenidate
# of vol 30mg
Clinical Trials: Meta Data
Drug | Adverse Effects
Ritalin | pancreatitis, abdominal pain
Tylenol | nausea, upper stomach pain, itching, loss of appetite
Aspirin | rash, gastrointestinal ulcerations, abdominal pain, upset stomach, heartburn
Clinical Trials: Side Effects
Drug-Disease
• Off-Label Drug Uses
• Database completion
• Design of clinical trials
Relationship between metadata
• How does heart disease correlate with gender and age?
• Which universities have the most successful clinical trials for breast cancer?
• How are genes and phenotypes related?
• What dosage of Ritalin was most effective in treating ADHD with the fewest side effects?
Problems it Solves
8000 drug-disease treatment relationships from UMLS data
drug_name: 'metipred|methylprednisolone|methylprednisolone preparation|methylprednisolonum|6alpha-methylprednisolone|6-alpha-methylprednisolone preparation|methylprednisolone|pregna-1,4-diene-3,20-dione, 11,17,21-trihydroxy-6-methyl-, (6alpha,11beta)-|(6alpha,11beta)-11,17,21-trihydroxy-6-methylpregna-1,4-diene-3,20-dione|methylprednisolone|meprdl|methylprednisolone|6-methylprednisolone|6 methylprednisolone'
disease_name: 'respiratory distress syndrome, acute|pulmonary capillary leak syndrome|wet lung syndrome|acute respiratory distress syndrome|shock lung|adult respiratory distress syndrome|shock lung|human ards|adult respiratory distress syndrome|wet lung|ards - adult respiratory distress syndrome|acquired respiratory distress syndrome|adult rds|ards|adult respiratory syndrome|a.r.d.s.|danang lung|danang lung|respiratory distress syndrome|adult respiratory distress syndrome, ards|shock lung|respiratory distress syndrome, adult|adult respiratory distress syndrome|vietnam lung|rds|lung, shock|adult hyaline membrane disease|ards, human|adult respiratory distress syndrome|adult hyaline membrane disease|ardss, human|a r d s|adult rds|congestive atelectasis|ards|respiratory distress syndrome|respiratory distress syndrome, adult|adult respiratory distress syndrome'
Training Data
Extract sentences that contain the specific attribute.
POS tag and extract unigrams, bigrams and trigrams centered on nouns.
Extract features: words around nouns (bag of words/word vectors), position of the noun.
Train a machine learning model to predict which unigrams, bigrams or trigrams satisfy the specific relationship, for example the drug-disease treatment relationship.
Map training data to create a balanced positive and negative training set.
Course of Action
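The candidate-extraction step above can be sketched roughly as follows. This is only an illustration, not the code used in the talk: it assumes the tokens are already POS-tagged with Penn Treebank tags (the tags below were written by hand), and it interprets "centered on nouns" as n-grams that end on a noun and are preceded only by adjectives or nouns. The function names are hypothetical.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (token, Penn Treebank POS tag)


def is_noun(tag: str) -> bool:
    return tag.startswith("NN")


def is_modifier(tag: str) -> bool:
    # Assumption: only adjectives and nouns may precede the head noun.
    return tag.startswith("JJ") or tag.startswith("NN")


def noun_candidates(tagged: Tagged, max_n: int = 3) -> List[str]:
    """Return unigrams, bigrams and trigrams whose last token is a noun."""
    candidates = []
    for end, (_, tag) in enumerate(tagged):
        if not is_noun(tag):
            continue
        for n in range(1, max_n + 1):
            start = end - n + 1
            if start < 0:
                break
            window = tagged[start:end + 1]
            # Stop growing the n-gram once a non-modifier is reached.
            if not all(is_modifier(t) for _, t in window[:-1]):
                break
            candidates.append(" ".join(tok for tok, _ in window))
    return candidates


# Hand-tagged fragment of the lemmatized sentence shown on the slides.
tagged_sentence = [
    ("maintenance", "NN"), ("therapy", "NN"), ("reduce", "VB"),
    ("the", "DT"), ("frequency", "NN"), ("of", "IN"),
    ("manic", "JJ"), ("episode", "NN"),
]
candidates = noun_candidates(tagged_sentence)
```

In a real pipeline the tags would come from a POS tagger or parser rather than being typed in by hand.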
Creating Labelled Data
lemmatized_sentence: ['maintenance', 'therapy', 'reduce', 'the', 'frequency', 'of', 'manic', 'episode', 'and', 'diminish', 'the', 'intensity', 'of', 'those', 'episode', 'which', 'may', 'occur', '.']
Several Candidates
Typically one of them is the disease that the drug treats. For every drug we create training data. One line of text produces 5 lines of training data with one true positive.
Balancing the Training Data
Since the training data contains a higher percentage of zeros than ones, it is important to balance it before modeling, i.e., to build the model I choose an equal number of zeros and ones.
Candidate | Target | Rule-prediction
maintenance | 0 | 1
therapy | 0 | 1
manic episode | 1 | 1
intensity | 0 | 1
episode | 0 | 1
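A minimal sketch of the labelling and balancing steps, under stated assumptions: candidates are labelled 1 when they appear in a set of known disease synonyms (a tiny hypothetical stand-in for the UMLS-derived list), and balancing is done by randomly downsampling the larger class. The helper names are illustrative, not taken from the talk.

```python
import random
from typing import Dict, List, Set


def label_candidates(candidates: List[str], disease_synonyms: Set[str]) -> List[Dict]:
    """Mark each candidate 1 if it is a known disease synonym, else 0."""
    return [{"candidate": c, "target": int(c in disease_synonyms)} for c in candidates]


def balance(rows: List[Dict], seed: int = 0) -> List[Dict]:
    """Downsample so that zeros and ones are equally represented."""
    rng = random.Random(seed)
    positives = [r for r in rows if r["target"] == 1]
    negatives = [r for r in rows if r["target"] == 0]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


# The five candidates from the slide; the synonym set is a hypothetical stand-in.
candidates = ["maintenance", "therapy", "manic episode", "intensity", "episode"]
rows = label_candidates(candidates, {"manic episode", "bipolar disorder"})
balanced = balance(rows)
```

Downsampling discards data; with many drugs the balanced set is built from the pooled candidates, so the loss per sentence matters less.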
Feature Extraction: Word Vectors, Disease Combinations
adhd + manic episode = bipolar disorder
respiratory disorder + allergy = common cold
coronary artery + heart disease = angina pectoris
high blood pressure + lipid = diabetes_management
Extract Features: Initialize vocabulary with pre-trained vectors
gensim: Train word2vec on a medical corpus with unigrams, bigrams and trigrams
Produce word vectors
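The "disease combination" idea (adding two word vectors and looking up the nearest term) can be illustrated with a small pure-Python sketch. The three-dimensional vectors below are invented toy values chosen only to make the arithmetic visible; real vectors would come from a word2vec model trained on a medical corpus, and the helper names are hypothetical.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def cosine(a: Vector, b: Vector) -> float:
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def nearest(query: Vector, vectors: Dict[str, Vector], exclude: Set[str]) -> str:
    """Return the term whose vector has the highest cosine similarity to query."""
    return max(
        (term for term in vectors if term not in exclude),
        key=lambda term: cosine(query, vectors[term]),
    )


# Toy vectors, NOT trained embeddings.
toy_vectors = {
    "adhd": [1.0, 0.0, 0.2],
    "manic_episode": [0.0, 1.0, 0.2],
    "bipolar_disorder": [0.7, 0.7, 0.3],
    "common_cold": [0.0, 0.1, 1.0],
}
query = add(toy_vectors["adhd"], toy_vectors["manic_episode"])
combined = nearest(query, toy_vectors, exclude={"adhd", "manic_episode"})
```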
Pure Python stack
pandas
scikit-learn
gensim
stanford-nlp-parser
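from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# ItemSelector is assumed to be a user-defined transformer that selects one column;
# ML_library is the slide's placeholder for the chosen classifier class.
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for getting the position of the disease candidate
            ('position', Pipeline([
                ('selector', ItemSelector(column='candidate')),
                ('vect', DictVectorizer()),
            ])),
            # Pipeline for getting words around candidates
            ('words_around', Pipeline([
                ('selector', ItemSelector(column='words_around')),
                ('count', CountVectorizer()),
            ])),
        ])),
    ('clf', ML_library(penalty='l1')),
])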
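For readers who want to run the pipeline, here is one possible self-contained version. Several things in it are assumptions rather than the talk's actual code: `ItemSelector` is written here as a small custom transformer over a list of dicts, `LogisticRegression` with the `liblinear` solver stands in for the `ML_library` placeholder, and the four training rows are invented toy data.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(TransformerMixin, BaseEstimator):
    """Select one field from each row (rows are plain dicts)."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [row[self.column] for row in X]


pipeline = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        # Position of the disease candidate, stored as a small dict per row
        ("position", Pipeline([
            ("selector", ItemSelector(column="candidate")),
            ("vect", DictVectorizer()),
        ])),
        # Bag of words around the candidate
        ("words_around", Pipeline([
            ("selector", ItemSelector(column="words_around")),
            ("count", CountVectorizer()),
        ])),
    ])),
    # Assumption: L1-penalized logistic regression as the classifier
    ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
])

# Invented toy rows: candidate position plus the surrounding words.
rows = [
    {"candidate": {"position": 0}, "words_around": "therapy reduce frequency"},
    {"candidate": {"position": 6}, "words_around": "frequency of manic episode"},
    {"candidate": {"position": 11}, "words_around": "diminish the intensity of"},
    {"candidate": {"position": 9}, "words_around": "treatment of wound sepsis"},
]
labels = [0, 1, 0, 1]

pipeline.fit(rows, labels)
predictions = pipeline.predict(rows)
```

With only four rows the fitted model is meaningless; the point is the shape of the inputs each branch of the `FeatureUnion` expects.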
Data Cleaning and Tokenization
Machine Learning Workflow: Pure Python stack
pandas
scikit-learn
gensim
stanford-nlp-parser
Feature Extraction/Candidate Selection
Create Labelled Data
ML: Logistic Regression, …
Hyperparameter Tuning
Calculate Metrics: precision, recall, ROC curve, etc
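As a small worked example of the metrics step, the sketch below computes precision, recall and F1 by hand for the five-candidate table shown earlier in the deck (Target versus Rule-prediction). The function name is hypothetical; in practice a metrics library would normally be used.

```python
from typing import List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Binary precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# maintenance, therapy, manic episode, intensity, episode
target = [0, 0, 1, 0, 0]
rule_prediction = [1, 1, 1, 1, 1]
precision, recall, f1 = precision_recall_f1(target, rule_prediction)
```

Here the rule marks every candidate as positive, so recall is perfect while precision is only one in five, which is the gap the trained model is meant to close.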
Results: Examples
drug-name | disease candidate | Candidates | ML
Lithium Carbonate | bipolar disorder | 1 | 1
Lithium Carbonate | individual | 1 | 0
Lithium Carbonate | maintenance | 1 | 0
Lithium Carbonate | manic episode | 1 | 1
Drug | Candidate | Target | Predict
Silver Sulfadiazine | third degree | 0 | 0
Silver Sulfadiazine | sepsis | 0 | 1
Silver Sulfadiazine | burn | 0 | 1
Silver Sulfadiazine | cream | 0 | 0
Drug | Candidate | Target | Predict
Diltiazem Hydrochloride | spasm | 1 | 0
Diltiazem Hydrochloride | coronary artery | 1 | 0
Diltiazem Hydrochloride | stable angina | 0 | 0
Diltiazem Hydrochloride | angina | 0 | 0
'silver sulfadiazine cream usp 1 % be a topical antimicrobial drug indicate as a adjunct for the prevention and treatment of wound sepsis in patient with second and third degree burn .'
['Diltiazem', 'hydrochloride', 'tablet', 'USP', 'be', 'indicate', 'for', 'the', 'management', 'of', 'chronic', 'stable', 'angina', 'and', 'angina', 'due', 'to', 'coronary', 'artery', 'spasm', '.']
Cases where it does not work
Exploring Modeling Technique
Method | Precision | Recall | F1 | ROC Curve
Logistic Regression | 0.95 | 0.95 | 0.95 | 0.92
LR + word2vec | 0.94 | 0.94 | 0.94 | 0.9
SVM | 0.96 | 0.95 | 0.95 | 0.92
Random Forest | 0.96 | 0.96 | 0.96 | 0.9
Clinical Trials Data
We present a case of a 10-year-old boy who had severe relapsing pancreatitis three times in two months within 3 weeks after starting treatment with methylphenidate ( ritalin ) due to attention deficit hyperactivity disorder (adhd).
The boy was generally healthy except for that he was newly diagnosed with adhd and started the use of methylphenidate ( ritalin ) for the past three weeks at a dose, of 30 mg daily.
We believe that the number of persons suffering from pancreatitis due to the use of ritalin is more than this published case.
Physicians must pay attention regarding this possible complication and it should be taken into consideration in every patient with abdominal pain who started consuming ritalin.
Clinical Trials Data: Labelled Data
Data | Dosage | Drug | Treats Disease | Side Effects | Age | Gender | Ethnicity | Duration
10-year-old | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
pancreatitis-ritalin | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
adhd-ritalin | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
ritalin | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
30 mg | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
past three weeks | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
boy | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0
Clinical Trials Data: Labelled Data Exist
Creating Labeled Data
Hand-label ~100 examples that contain the specific attribute
Extract candidates: POS tag and extract unigrams, bigrams and trigrams centered on nouns
Generate rules: automatic creation of labels that satisfy the 100 hand-labelled examples
This process will create a smaller sample (say 5-10%) of data which can be further crowdsourced for a 100% accurate gold sample
Rule-based model: with 95% accuracy
Iterate: repeat the process a few times
Example of rules:
Dosage: (1) Sentence contains numbers (2) Distance between numbers and "mg", "milligrams" < 5 characters (3) Contains the word "dose"
Age: (1) Sentence contains numbers (2) Contains the word "age", "year-old" within 5 words of the candidate
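The example rules above could be encoded along these lines. This is a sketch under assumptions: the slides do not say whether the numbered conditions are combined with AND or OR (AND is assumed here), whitespace tokenization is used for the 5-word window, and candidates are assumed to be single tokens. Function names are hypothetical.

```python
import re
from typing import Dict

NUMBER = re.compile(r"\d+(?:\.\d+)?")
# A number followed by "mg"/"milligrams" with fewer than 5 characters in between.
NUMBER_NEAR_UNIT = re.compile(r"\d+(?:\.\d+)?.{0,4}?(?:mg|milligrams)\b", re.IGNORECASE)
DOSE_WORD = re.compile(r"\bdose\b", re.IGNORECASE)


def dosage_rules(sentence: str) -> Dict[str, bool]:
    return {
        "has_number": bool(NUMBER.search(sentence)),
        "number_near_unit": bool(NUMBER_NEAR_UNIT.search(sentence)),
        "has_dose_word": bool(DOSE_WORD.search(sentence)),
    }


def is_dosage(sentence: str) -> bool:
    # Assumption: all three conditions must hold.
    return all(dosage_rules(sentence).values())


def age_rules(sentence: str, candidate: str, window: int = 5) -> Dict[str, bool]:
    tokens = sentence.lower().split()
    cand = candidate.lower()
    positions = [i for i, tok in enumerate(tokens) if cand in tok]

    def is_age_cue(tok: str) -> bool:
        return tok.strip(".,;()") == "age" or "year-old" in tok

    cue_near = any(
        is_age_cue(tokens[j])
        for i in positions
        for j in range(max(0, i - window), min(len(tokens), i + window + 1))
    )
    return {"has_number": bool(NUMBER.search(sentence)), "age_cue_near_candidate": cue_near}


def is_age(sentence: str, candidate: str) -> bool:
    return all(age_rules(sentence, candidate).values())


dose_sentence = ("started the use of methylphenidate ( ritalin ) for the past "
                 "three weeks at a dose, of 30 mg daily")
age_sentence = "We present a case of a 10-year-old boy who had severe relapsing pancreatitis"
```

Returning the individual rule outcomes as a dict makes it easier to see which condition failed when iterating on the rules against the hand-labelled examples.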
Deepdive: Extracting relationships between entities
PDFs, text files, semi-structured JSON; example: journals available at PubMed and clinicaltrials.gov
Provide examples of data that need to be extracted
Structured data
Deepdive: Prototyping with ddlite
https://github.com/HazyResearch/ddlite
Mind Tagger
Show ipython notebook
• NLP relationship extraction with ML techniques is very successful in the presence of gold labeled data.
• It is very important to invest time and resources in harvesting good training data.
• There is an enormous amount of data in pharma (clinical trials, laboratory notes, doctors' notes, drug manufacturing documents, …). In order to pursue personalized medicine it is important to centralize this data and make joint inferences across all data sets.
Final Remarks
Thank You: We are hiring …
blog: https://medium.com/@sangha_deb, @sangha_deb, sanghamitra.a.deb@accenture.com