
Deepr: A Convolutional Net for Medical Records
Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, Svetha Venkatesh
arXiv:1607.07519v1 [stat.ML] 26 Jul 2016

Abstract—Feature engineering remains a major bottleneck when creating predictive systems from electronic medical records. At present, an important missing element is detecting predictive regular clinical motifs from irregular episodic records. We present Deepr (short for Deep record), a new end-to-end deep learning system that learns to extract features from medical records and predicts future risk automatically. Deepr transforms a record into a sequence of discrete elements separated by coded time gaps and hospital transfers. On top of the sequence is a convolutional neural net that detects and combines predictive local clinical motifs to stratify the risk. Deepr permits transparent inspection and visualization of its inner working. We validate Deepr on hospital data to predict unplanned readmission after discharge. Deepr achieves superior accuracy compared to traditional techniques, detects meaningful clinical motifs, and uncovers the underlying structure of the disease and intervention space.

I. INTRODUCTION

A major theme in modern medicine is prospective healthcare, which refers to the capability to estimate the future medical risks for individuals. These risks can include readmission after discharge, the onset of specific diseases, and worsening from a condition [42]. Such capability would facilitate timely prevention or intervention for maximum health impact, and provide a major step toward personalized medicine. An important data resource in aiding this process is electronic medical records [20]. Electronic medical records (EMRs) contain a wealth of patient information over time. Central to EMR-driven risk prediction is patient representation, also known as feature engineering. Representing an EMR amounts to extracting relevant historical signals to form a feature vector.

However, feature extraction in EMR is challenging [44]. An EMR typically consists of a sequence of time-stamped visit episodes, each of which has a subset of coded diagnoses, a subset of procedures, lab tests and textual narratives. The data is irregular at the patient level. EMR is episodic – events are only recorded when patients visit clinics, and the time gap between two visits is largely random. Representing irregular timing poses a major challenge. EMR varies greatly in length – young patients usually have just one visit for an acute condition, but old patients with chronic conditions may have hundreds of visits. At the same time, the data is regular at the local episode level. Diseases tend to form clusters (comorbidity) [41] and the disease progression may be dictated by the underlying biological processes [49]. Likewise, treatments may follow a certain protocol or best practice guideline [17], and there are well-defined disease-treatment interactions [39]. These regularities can be thought of as clinical motifs. Thus an effective EMR representation should be able to identify regular clinical motifs out of irregular data.

Existing EMR-driven predictive work often relies on high-dimensional sparse feature representations, where features are engineered to capture certain regularities of the data [11], [20].

This feature engineering practice is effort intensive and non-adaptive to varying medical records systems. Automated feature representation based on bag-of-words (BoW) is scalable, but it breaks collocation relations between words and ignores the temporal nature of the EMR, thus it fails to properly address the aforementioned challenges.

In this work we present a new prediction framework called Deepr that does not require manual feature engineering. The technology is based on deep learning, a revolutionary new approach that aims to build a multilayered neural learning system like a brain [25]. When fed with a large amount of raw data, the system learns to recognize patterns with little help from domain experts. Deep learning now powers speech recognition in Google Voice, self-driving cars at Google and Baidu, the question answering system at IBM (Watson), and smart assistants at Facebook. It already has a great impact on hundreds of millions (if not billions) of people. But healthcare has largely been ignored. We hypothesize that a key to applying deep learning in healthcare is patient representation, which requires proper handling of the irregular nature of episodes mentioned above [37]. Deepr fills the gap by offering an end-to-end technology that learns to represent patients from scratch. It reads medical records, learns the local patterns, adapts to irregular timing, and predicts personalized risk.

The architecture of Deepr is multilayered and is inspired by recent convolutional neural nets (CNNs) in natural language processing [9], [21], [25], [30], [51]. The most crucial operation occurs at the bottom level, where Deepr transforms an EMR into a “sentence” of multiple phrases separated by special “words” that represent time gaps. Each phrase is a visit episode. As with syntactical grammars and collocation patterns in NLP, there might exist “health grammars” and “clinical patterns” in healthcare. Health grammars refer to latent biological and environmental laws that dictate the global evolution of one's health over time, e.g., a probable progression from “diabetes type II” to “renal failure”. To handle irregular timing, time gaps and transfers are treated as special words. With this representation, an EMR is transformed into a sentence of variable length that retains all important events. The other layers of Deepr constitute a CNN, which is similar to those in [9], [21], [51]. First, words are embedded into a continuous vector space. Next, the words in the sentence are passed through a convolution operation which detects local motifs. Local motifs are then pooled to form a global feature vector, which is passed into a classifier, which predicts the future risk. All components are learned at the same time from data: the data signals are passed from the input to the output, and the training signals are propagated back from the labels to the motif detectors. Hence Deepr is end-to-end.

We validate Deepr on a large database of 300K patients collected from a hospital chain in Australia. We focus on predicting unplanned readmission within 6 months after discharge. Compared to the existing bag-of-words representation, Deepr demonstrates a superior accuracy as well as the capacity to learn predictive clinical motifs and to uncover the underlying structure of the space of diseases and interventions.

To summarize, we claim the following contributions:

• A novel representation of irregular-time EMR as a sentence with time gaps and transfers as special words.

• A novel deep learning architecture called Deepr that (i) uncovers the structure of the disease/treatment space, (ii) discovers clinical motifs, (iii) predicts future risk and (iv) explains the prediction by identifying motifs with strong responses in each record. The system is end-to-end, and its inner working can be inspected and visualized, allowing interpretability and transparency.

• An evaluation of these claimed capabilities on a large-scale dataset of 300K patients.

II. BACKGROUND

a) Medical records: An electronic medical record (EMR) contains information about patient demographics and a sequence of hospital visits for a patient. Admission information may include admission time, discharge time, lab tests, diagnoses, procedures, medications and clinical narratives. Diagnoses, procedures and medications are discrete entities. For example, diagnoses may be represented using the ICD-10 coding scheme (http://apps.who.int/classifications/icd10/browse/2016/en); in ICD-10, E10 refers to Type 1 diabetes mellitus and E11 to Type 2 diabetes mellitus. The procedures are typically coded in CPT (Current Procedural Terminology) or ICHI (International Classification of Health Interventions) schemes (http://www.who.int/classifications/ichi/en/). One of the most important secondary uses of EMR is building predictive models [20], [31], [44], [46].

Most existing prediction methods on EMRs either rely on manual feature engineering [31] or simplistic extraction [44]. They either ignore long-term dependencies or do not adequately capture variable length [2], [31], [44]. Neither are they able to model temporal irregularity [18], [29], [44], [49]. Capturing disease progression has been of great interest [19], [29], and much effort has been spent on Markov models [14], [18], [49]. As Markov processes are memoryless, Markov models forget severe conditions of the past when they see an admission due to a common cold. This is undesirable. A proper modeling, therefore, must be non-Markovian and able to capture long-term dependencies.

b) Deep learning: Deep learning is an approach in machine learning, aiming at producing end-to-end systems that learn from raw data and perform desired tasks without manual feature engineering. The current wave of deep learning was initiated by the seminal work of [15] in 2006, but deep learning has been developed for decades [40]. Over the past few years, deep learning has broken records in cognitive domains such as vision, speech and natural languages [25]. Current deep learning is mostly based on multilayered neural networks [40]. All the networks share a common unit – the neuron – which is a simple computational device that applies a nonlinear transform to a linear function of inputs, i.e., f(x) = σ(b + Σ_i w_i x_i). Almost all networks thus far are trained using back-propagation [50], thus enabling end-to-end learning.

There are three main deep neural architectures in practice: feedforward, recurrent and convolutional. Feedforward nets (FFN) pass unstructured information from one end to the other, usually from an input to an output, hence they act as a universal function approximator [16]. Recurrent nets (RNN) model dynamics over time (and space) using self-replicated units. They maintain some degree of memory, and thus have the potential to capture long-term dependencies. RNNs are powerful computational machines – they can approximate any program [27]. Convolutional nets (CNN) exploit repeated local motifs across time and space, and thus are translation-invariant – a capacity often seen in the human visual cortex [24]. Local motifs are small pieces of data, usually of pre-defined sizes, e.g., a patch of pixels or an n-gram of words. A CNN is often equipped with pooling operations to reduce the resolution and enlarge the motifs.

III. Deepr: A DEEP NET FOR MEDICAL RECORDS

In this section, we describe our deep neural net named Deepr (short for Deep net for medical Record) for representing Electronic Medical Records (EMR) and predicting the future risk.

A. Deepr Overview

Deepr is a multilayered architecture based on convolutional neural nets (CNNs). The information flow is summarized in Fig. 1. At the bottom level, Deepr sequences the EMR into a “sentence”, or equivalently, a sequence of “words”. Each word represents a discrete object or event such as a diagnosis, a procedure, or any derived object such as a time interval or hospital transfer. The next layer embeds words into a Euclidean space. On top of the embedding layer is a CNN that reads a small chunk of words in a sliding window to identify local motifs. The local motifs are transformed by a Rectified Linear Unit (ReLU), which is a nonlinear function. All the transformed motifs are then max-pooled across the sentence to derive an EMR-level feature vector. Finally, a linear classifier is placed at the top layer for prediction. The entire architecture of Deepr can be summarized as a function f(r) for record r:

f(r) ← Class(Pool{ReLU(Conv[Embed{Seq(r)}])})    (1)

The CNN plays a crucial role as it detects clinical motifs that are predictive. Clinical motifs are co-occurrences of diseases (also known as comorbidity), disease progressions, patterns of disease/treatment, and patterns of collocating treatments [21]. However, as the CNN is supervised, it requires labels, which may not always be available (e.g., for new patients with short histories). A possible enhancement is to pretrain the embedding layer with a powerful tool known as word2vec [34]. As word2vec is unsupervised and relies on local collocation patterns, clinical motifs can be pre-detected, and then further refined through the CNN with supervising signals.
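To make the composition in Eq. (1) concrete, the following is a minimal illustrative sketch of the pipeline using the Keras API; it is not the original implementation, and the vocabulary size, sequence length, embedding dimension and number of filters are placeholder values.

```python
import tensorflow as tf

# Illustrative sizes only (not the paper's settings).
vocab_size, embed_dim, n_filters, max_len = 5000, 100, 100, 100

# Eq. (1): Class(Pool{ReLU(Conv[Embed{Seq(r)}])})
model = tf.keras.Sequential([
    # Embed: map each word (diagnosis, procedure, time-gap or transfer token)
    # of the sequenced record to a continuous vector.
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    # Conv + ReLU: detect local clinical motifs in a sliding window of words.
    tf.keras.layers.Conv1D(n_filters, kernel_size=3, activation="relu"),
    # Pool: element-wise max over positions gives one record-level vector.
    tf.keras.layers.GlobalMaxPooling1D(),
    # Class: a logistic classifier predicts the future risk.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.build(input_shape=(None, max_len))
model.summary()
```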

Figure 1. Overview of Deepr for predicting future risk from a medical record. The top-left box depicts an example medical record with multiple visits, each of which has multiple coded objects (diagnoses & procedures). The future risk is unknown (question mark (?)). Steps, from left to right: (1) the medical record is sequenced into phrases separated by coded time-gaps/transfers; then, from bottom to top: (2) words are embedded into continuous vectors, (3) local word vectors are convolved to detect local motifs, (4) max-pooling derives the record-level vector, (5) a classifier is applied to predict an output, which is a future event. Best viewed in color.

B. Sequencing EMR

This task refers to transforming an EMR into a sentence, which is essentially a sequence of words. We present here how the words are defined and arranged in the sentence.

Recall that an EMR is a sequence of time-stamped visit episodes. Each episode may contain many pieces of information, but for the purpose of this work, we focus mainly on diagnoses and treatments (which involve clinical procedures and medications). For simplicity, we do not assume perfect timing of each piece, and thus an episode is a finite set of discrete words (diagnoses and treatments). The episode is then sequenced into a phrase. The order of the elements in the phrase may follow the pre-defined ordering of the EMR system, for example, the primary diagnosis is placed first, followed by secondary diagnoses, followed by procedures. In the absence of this information, we may randomize the elements.

Within an episode there are occasionally one or more transfers between care providers, for example, between separate departments of the same hospital, or between hospitals. In these cases, an admission is a phrase, and an episode is a set of phrases separated by a transfer event. We create a special word TRANSFER for this event. Between two consecutive episodes there is a time gap, whose duration is generally randomly distributed. We discretize the time gap into five intervals, measured in months: (0-1], (1-3], (3-6], (6-12], 12+. Each interval is assigned a unique identifier, which is treated as a word. For example, 0-1m is the word for the (0-1] interval gap. With these treatments, an EMR is a sentence of phrases separated by words for transfers or time gaps. The phrases are ordered by their natural time-stamps. For robustness, infrequent words are coded as RAREWORD.

The following is an example of a sentence, where diagnoses are in ICD-10 format (a character followed by digits), and procedures are in digits:

1910 Z83 911 1008 D12 K31 1-3m R94 RAREWORD H53 Y83 M62 Y92 E87 T81 RAREWORD RAREWORD 1893 D12 S14 738 1910 1916 Z83 0-1m T91 RAREWORD Y83 Y92 K91 M10 E86 6-12m K31 1008 1910 Z13 Z83.

Here the phrases are: [1910 Z83 911 1008 D12], [R94 RAREWORD H53 Y83 M62 Y92 E87 T81 RAREWORD RAREWORD 1893 D12 S14 738 1910 1916 Z83], [RAREWORD Y83 Y92 K91 M10 E86], and [K31 1008 1910 Z13 Z83]. The time separators are: [1-3m], [0-1m], and [6-12m]. Note that within each phrase, the ordering of words has been randomized.
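The sequencing step can be summarized in a short sketch. It assumes each visit has already been reduced to a list of diagnosis/procedure codes with an admission date; the interval names follow the five bins above, while transfer handling and rare-word replacement are omitted for brevity.

```python
from datetime import date

GAP_BINS = [(1, "0-1m"), (3, "1-3m"), (6, "3-6m"), (12, "6-12m")]

def gap_token(months):
    """Map a time gap (in months) to one of the five coded intervals."""
    for upper, token in GAP_BINS:
        if months <= upper:
            return token
    return "12+"

def sequence_emr(episodes):
    """Turn a list of (admission_date, [codes]) episodes into one 'sentence'."""
    words, prev = [], None
    for adm_date, codes in sorted(episodes, key=lambda e: e[0]):
        if prev is not None:
            words.append(gap_token((adm_date - prev).days / 30.0))
        words.extend(codes)
        prev = adm_date
    return words

# Two visits roughly two months apart:
record = [(date(2014, 1, 5), ["E11", "1910"]), (date(2014, 3, 10), ["K31", "1008"])]
print(sequence_emr(record))   # ['E11', '1910', '1-3m', 'K31', '1008']
```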

C. Convolutional Net

c) Embedding: The first step when applying convolutional nets to a sentence is to represent discrete words as continuous vectors. One way is to use so-called one-hot coding, that is, each word is a binary vector of all zeros except for just one position, indexed by the word. However, this representation creates a high-dimensional vector, which may lead to overfitting and expensive computation. Alternatively, we can use word embedding, which refers to assigning a dense continuous vector to a discrete word. For example, the second word [Z83] in the example above may be assigned a 3D vector such as (0.1, -2.3, 0.5). In practice, we maintain a look-up table indexed by words, i.e., E(w) ∈ R^m is the vector for word w. The embedding table E is learnable. Applying word embedding to the sentence yields a sequence of vectors, where the vector at position t is x_t = E(w_t).

d) Convolution: On top of the word embedding layer is a convolutional layer. Each convolution operation reads a sliding window of size 2d + 1 and produces p filter responses as follows:

z_t = ReLU(b + Σ_{j=-d}^{d} W_j x_{t+j})    (2)

where z_t ∈ R^p is the filter response vector at position t, W_j ∈ R^{p×m} is the convolution kernel at relative position j (hence W ∈ R^{p×m×(2d+1)}), b is a bias, and ReLU(x) = max{0, x} (element-wise). When it is clear from the context, we use “filter” to refer to the learnable device that detects motifs, which are manifestations of filters in real data. The rectified linear function enhances strong signals and eliminates weak ones. The bias b and the kernel tensor W are learnable.

e) Pooling: Once the local filter responses are computed by the convolutional layer, we need to pool all the responses to derive a global sentence-level vector. We apply here the max-pooling operator:

z = max_t {z_t}    (3)

where the max is element-wise. Thus the pooled vector z lives in the same space R^p as the filter responses {z_t}. Like the rectifier used in Eq. (2), this max-pooling further enhances strong signals across the words in the sentence.
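For concreteness, a minimal NumPy sketch of Eqs. (2) and (3) applied to one embedded sentence is given below; all dimensions are illustrative.

```python
import numpy as np

def conv_relu_maxpool(x, W, b):
    """x: (T, m) embedded words; W: (2d+1, p, m) kernel; b: (p,) bias.
    Returns the pooled record-level vector z of dimension p."""
    T, _ = x.shape
    d = W.shape[0] // 2
    responses = []
    for t in range(d, T - d):
        # Eq. (2): z_t = ReLU(b + sum_j W_j x_{t+j}) over the sliding window
        z_t = b + sum(W[j + d] @ x[t + j] for j in range(-d, d + 1))
        responses.append(np.maximum(z_t, 0.0))
    # Eq. (3): element-wise max over all positions
    return np.max(np.stack(responses), axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 100))      # 20 words, embedding dimension m = 100
W = rng.normal(size=(3, 100, 100))  # size-3 motifs, p = 100 filters
b = np.zeros(100)
print(conv_relu_maxpool(x, W, b).shape)   # (100,)
```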

f) Classifier: The final layer of Deepr is a classifier that takes the pooled information and predicts the outcome: f(r) = classifier(z(r)) for record r. The main requirement is that the classifier must allow gradients to propagate down to the lower layers. Examples include a linear classifier (e.g., logistic regression) or a non-linear parametric classifier (e.g., a neural network).

D. Training

Deepr has multiple trainable parameters: the embedding matrix, biases, convolution kernels, and classifier-specific parameters. As the number of trainable parameters is often large, regularizers such as weight shrinkage (e.g., via the ℓ2 norm) or dropout [43] are necessary. For training we also need to specify a loss function, which depends on the nature of the classifier. For example, for a binary outcome (e.g., readmission), a logistic classifier is usually trained with a cross-entropy loss. Training starts with (random) initialization of parameters, which are then refined through back-propagation and stochastic gradient descent (SGD). This requires gradients with respect to the trainable parameters. Gradient computation is often tedious and error-prone, but it is now fully automated in modern deep learning frameworks such as Theano [3] and TensorFlow [1]. For SGD, parameters are updated after every mini-batch of records (or sentences). Training is stopped after a pre-defined number of epochs (iterations), or on convergence.
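Continuing the illustrative Keras sketch from Sec. III-A (reusing its model, vocab_size and max_len), mini-batch SGD training with a cross-entropy loss reduces to a single call; the arrays below are random placeholders standing in for the sequenced, integer-coded records and their binary outcome labels.

```python
import numpy as np

# Placeholder data: integer-coded sentences padded/trimmed to max_len,
# and binary labels (e.g., unplanned readmission yes/no).
X = np.random.randint(0, vocab_size, size=(1000, max_len))
y = np.random.randint(0, 2, size=(1000,))

# Mini-batch SGD with back-propagated gradients, handled by the framework.
model.fit(X, y, batch_size=64, epochs=10, validation_split=0.1)
```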

g) Pretraining with word2vec: As mentioned in Sec. III-A, the embedding matrix can be pretrained using word2vec. Here we do not need labels, and thus we can exploit a large set of unlabeled data.
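As a sketch, this pretraining step could use the gensim implementation of word2vec (gensim ≥ 4 API assumed); the two toy sentences stand in for the full set of sequenced, possibly unlabeled records.

```python
from gensim.models import Word2Vec

# Sequenced EMRs: lists of code / time-gap tokens; no labels are needed.
sentences = [
    ["E11", "1910", "1-3m", "K31", "1008"],
    ["Z85", "1089", "Z08", "12+", "Z85", "1089"],
]

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
# w2v.wv["E11"] is a 100-dimensional vector that can initialise the embedding
# table E before supervised fine-tuning with the CNN.
print(w2v.wv["E11"].shape)
```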

E. Model Inspection and Visualization

Deepr facilitates intuitive model inspection and visualization for better understanding:

h) Identifying motif responses in a sequence: For each motif detector, the motif response at position t (e.g., z_t ∈ R^p) can be used to identify and visualize strong motifs. For size-3 motifs, the response weight for a size-3 sub-sequence (x_{t-1}, x_t, x_{t+1}) of a sequence x is the term Σ_{j=-d}^{d} W_j x_{t+j} in Eq. (2), which is the dot product of the sub-sequence and the kernel W.

i) Identifying frequent and strong motifs: Motifs with large responses in sequences are collected. From this collection, we keep frequent motifs representative of each outcome class.

j) Computing word similarity: Through the embedding x_w = E(w), word similarity can be computed easily, e.g., through the cosine S(w, v) = x_w^T x_v / (‖x_w‖ ‖x_v‖).

k) Visualization of similar patients: Patient vectors from Eq. (3) can be used to compute patient similarity. This enables retrieving patients who have similar history and similar future risk likelihood. This is unlike existing methods that compute only similar history, which does not necessarily guarantee a similar future. Further, the similarity is not heuristic, and it does not require a heuristic combination of multiple data types (such as diseases and interventions). Fig. 2, for example, shows the distribution of positive and negative classes, in which patient vectors are projected onto 2D using t-SNE [47]. Patients who have similar history and future will stay close together.

l) Visualization in disease/intervention space: Since words are embedded into vectors, visualization in 2D is through dimensionality reduction tools such as PCA or t-SNE [47].
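Either the word embedding matrix or the pooled record vectors from Eq. (3) can be projected this way; below is a minimal sketch with scikit-learn's t-SNE, using a random placeholder matrix.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder: one 100-dimensional vector per word (or per patient record).
vectors = np.random.normal(size=(500, 100))

coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.show()
```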

IV. IMPLEMENTATION

In this section, we document implementation details of Deepr on a typical EMR system. For ease of exposition, we assume that diseases are coded in ICD-10 format, but other versions are also applicable with minimal changes.

A. Data and Evaluation

Data was collected from a large private hospital chain in Australia in the period July 2011 – December 2015. The data is coded according to the Australian Coding Standard (ACS). The ACS dictates that diagnosis coding is based on ICD-10-AM (https://www.accd.net.au/Icd10.aspx), an Australian adaptation of WHO's ICD-10 system. Likewise, procedure coding follows ACHI (Australian Classification of Health Interventions). The data consists of 590,546 records (300K unique patients), each corresponding to an admission (defined by an admission time and a discharge time).

The data subset for testing Deepr was selected as follows. First we identified 4,993 patients who had at least one unplanned readmission within 6 months of a discharge, regardless of the admitting diagnosis. This constituted the risk group. For each risk case, we then randomly picked a control case from the remaining patients. For each risk/control group, we used 830 patients for model tuning, 830 for testing and the rest for training. A discharge (except for the last one in the risk group) is randomly selected as the prediction point, from which the future risk will be predicted. See also Fig. 1 for a graphical illustration.
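A sketch of this cohort construction is shown below; it assumes patient identifiers and a precomputed set of risk patients (those with an unplanned readmission within 6 months of a discharge), and the random choice of prediction point is omitted. The function names and split sizes simply restate the protocol above.

```python
import random

def split_group(ids, rng, n_tune=830, n_test=830):
    """Shuffle one group and split it into (train, tune, test)."""
    ids = list(ids)
    rng.shuffle(ids)
    return ids[n_tune + n_test:], ids[:n_tune], ids[n_tune:n_tune + n_test]

def build_cohort(all_patients, risk_ids, seed=0):
    """One random control per risk case, then per-group train/tune/test splits."""
    rng = random.Random(seed)
    controls = rng.sample([p for p in all_patients if p not in risk_ids], len(risk_ids))
    r_train, r_tune, r_test = split_group(risk_ids, rng)
    c_train, c_tune, c_test = split_group(controls, rng)
    label = lambda ids, y: [(p, y) for p in ids]
    return (label(r_train, 1) + label(c_train, 0),
            label(r_tune, 1) + label(c_tune, 0),
            label(r_test, 1) + label(c_test, 0))
```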



B. Implementation Details of Deepr

m) Episode definition: Deepr assumes that episodes are well-defined with an admission time and a discharge time. However, this is not always the case due to intra-hospital or inter-hospital transfers. Our implementation links two admissions into an episode if they are separated by less than 12 hours, or by 12-24 hours but with a documented transfer.

n) Words: For robustness, only level-3 ICD-10-AM codes are used. For example, F20.0 (paranoid schizophrenia) would be converted into F20 (schizophrenia). Similarly, the procedures are converted into procedure blocks. Rare words are those occurring less than 100 times in the database.
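A sketch of this normalisation step follows; the 100-occurrence threshold comes from the text, while the mapping of procedure codes to procedure blocks is left out for brevity.

```python
from collections import Counter

def to_level3(code):
    """Truncate an ICD-10-AM diagnosis code to level 3, e.g. 'F20.0' -> 'F20'.
    Purely numeric procedure codes are returned unchanged."""
    return code if code.isdigit() else code.split(".")[0][:3]

def replace_rare(sentences, min_count=100):
    """Map codes seen fewer than min_count times in the database to RAREWORD."""
    counts = Counter(w for s in sentences for w in s)
    return [[w if counts[w] >= min_count else "RAREWORD" for w in s] for s in sentences]

# Example usage on two toy sentences:
sentences = [[to_level3(w) for w in s] for s in [["F20.0", "1910"], ["E11", "1910"]]]
print(replace_rare(sentences, min_count=2))
# -> [['RAREWORD', '1910'], ['RAREWORD', '1910']]
```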

o) Word order randomization: For motif detection, randomization is necessary to generate many potential motifs. We also test a special case where words in a phrase are ordered starting with the primary diagnosis, followed by the other secondary diagnoses, then by procedures in their natural ordering as defined by the EMR system.

p) Sentence length: For the CNN, the sentences are trimmed to keep the last min(100, len(sentence)) words. This is to avoid the effects of some patients who have very long sentences, which would severely skew the data distribution. In a typical EMR, this is equivalent to accounting for up to 10 visits per patient, which covers more than 95% of patients.

q) Hyper-parameter tuning: Deepr has a number of hyper-parameters pre-specified by model users: embedding dimension m, kernel window size 2d + 1, motif size, number of motifs n per size, number of epochs, mini-batch size, and other classifier-specific settings. Some hyper-parameters can be found through grid search, which finds the best configuration with respect to the accuracy on the development set.

We searched for the best parameters using the training and development data. Then we used the model with the best parameters to predict the unseen test data. The best parameter settings were m = 100, d = 1, motif sizes of 3, 4 and 5, n = 100, number of epochs = 10, mini-batch size = 64, and ℓ2 regularization λ = 1.0.
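A sketch of such a grid search is given below; train_and_eval is a hypothetical stand-in for training Deepr with one configuration and returning its accuracy on the development set, and the grid values are illustrative.

```python
from itertools import product

def train_and_eval(cfg):
    """Hypothetical stand-in: train Deepr with cfg, return dev-set accuracy."""
    return 0.0  # placeholder

grid = {
    "embed_dim": [50, 100],
    "motif_size": [3, 4, 5],
    "n_motifs": [50, 100],
    "l2": [0.1, 1.0],
}

best_acc, best_cfg = -1.0, None
for values in product(*grid.values()):
    cfg = dict(zip(grid, values))
    acc = train_and_eval(cfg)          # accuracy on the development set
    if acc > best_acc:
        best_acc, best_cfg = acc, cfg
print(best_cfg, best_acc)
```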

C. Baselines

We implemented the bag-of-words representation with regularized logistic regression (BoW+LR). LR has a parameter C that helps control overfitting. We searched for the best parameter C using the development data. We used the model with the best parameter to predict the unseen test data. We found the best parameter to be C = 0.1, which is equivalent to a Gaussian prior with mean 0 and standard deviation 0.333.
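A sketch of the BoW+LR baseline with scikit-learn follows; the two toy documents stand in for the sequenced records joined into strings, and C = 0.1 mirrors the value reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each record is its sequenced sentence joined into one whitespace-separated string.
train_docs = ["E11 1910 1-3m K31 1008", "Z85 1089 Z08 12+ Z85 1089"]
train_labels = [1, 0]

bow_lr = make_pipeline(
    CountVectorizer(token_pattern=r"[^ ]+"),      # bag-of-words counts over codes
    LogisticRegression(C=0.1, max_iter=1000),     # L2-regularised logistic regression
)
bow_lr.fit(train_docs, train_labels)
print(bow_lr.predict(["Z85 1089 Z08"]))
```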

V. RESULTS

A. Risk Prediction

We predict unplanned readmission within 6 months after a random index discharge. Table I reports the prediction accuracy for all methods, when trained on data with and without coded time-gaps. Time-gap coding improves the BoW-based prediction, suggesting the importance of proper sequential handling. However, time-gaps do not affect the accuracy of Deepr.

Table I
Accuracy on 6-month unplanned readmission prediction following a random index discharge, with and without time-gaps. “Rand init” refers to random initialization of the embedding matrix; “word2vec init” refers to pretraining the embedding matrix using the word2vec algorithm [34].

Method                  W/o time    With time
BoW + LR                0.727       0.741
Deepr (rand init)       0.754       0.753
Deepr (word2vec init)   0.750       0.756

[Figure 3 is a 2D scatter plot of the disease-code embeddings; visible clusters include pregnancy-related, birth-related, injury, sport-related, musculoskeletal, heart-related, respiratory, blood-related, nervous system, genitourinary, digestive and mental health conditions.]

Figure 3. Distribution in the disease space, projected into 2D using t-SNE. Distribution of interventions is omitted for clarity. Best viewed in color.

This might be due to the convolution, rectification and max-pooling operations (see Sec. III-C), which pick the most powerful convoluted signals in the sequence. The use of word2vec to initialize the embedding matrix also has little contribution toward the accuracy. This could be because word2vec looks only for local collocations in both directions (past and future), whereas the prediction in Deepr is more global and has a longer time horizon, only in the future direction. In either case, with or without word2vec, Deepr is superior to the baseline BoW+LR.

Fig. 2 shows how Deepr groups similar patients and creates a more linear decision boundary, while BoW+LR scatters the patient distribution and has a more complicated decision boundary. Recall that Deepr creates the feature vectors using element-wise max-pooling over all the motif responses, as in Eq. (3). This demonstrates that the motifs, not just individual words, are important for computing similarity between patients. This also suggests that, given a new patient, Deepr is better at querying similar patients in the database when an estimate of future risk is needed.

Figure 2. 2D projections of the classification on the unseen test set for the two methods, BoW+LR and Deepr. White points on a blue background are the negative class; black points on a yellow region are the positive class. The figure shows that Deepr groups similar patients and creates a more linear decision boundary, while BoW+LR scatters the patient distribution and has a more complicated decision boundary. The decision boundary is approximated by an exhaustive contouring method: fine lattice points of the background grid are assigned the predicted label of their nearest data point, and the boundary is then computed by the contouring algorithm. Best viewed in color.

B. Disease/Procedure Semantics

Recall that Deepr first embeds words into a vector space. This offers a simple but powerful way to uncover and visualize the underlying structure of the word space (see Sec. III-E). Fig. 3 plots the distribution of diseases in 2D. Deepr discovers disease clusters which partly correspond to nodes in the ICD-10 hierarchy. Apart from pregnancy, childbirth issues and injuries, the conditions are not totally separated, suggesting complex dependencies in the disease space. The main block of the disease space has conditions related to the heart, blood, metabolic system, respiratory system, nervous system and mental health. A closer examination of the conditions most similar to a given disease is provided in Table II. For example, most similar to cesarean-section delivery are conditions related to pregnancy complications (disproportion, failed induction of labor, or diabetes) and the corresponding delivery procedures (cesarean section, manipulation of fetal presentation, forceps).

We note in passing that we also obtained a similar visualization using only word2vec as in [34], which is known to detect hidden semantic relationships between words. Deepr trained on the embedding matrix initialized by word2vec did not significantly change the relative positions of words. This suggests that Deepr also captures the semantic relationship between words.

C. Filter Responses and Motifs

While the semantics in the previous sub-section reveal the global relative relations between diseases and procedures, they do not explain local interactions (e.g., motifs). Here we compute the local filter responses per sentence, and from there, a collection of strong and frequent motifs is derived.

Table III shows some sentences with strong responses for Filters 1 and 4, for both the risk and no-risk classes. It can be seen that the sub-sequences Z85.1163.1910 and 1066.1067.I21 respond strongly for the positive class and contribute to the classification result.

[Table III body: for each of Filters 1 and 4 and for both classes (readmit / no-risk), the table lists the (sub-)sentences with the strongest responses, together with the responding filter, its size and its response weight. Examples include “Z08 . Z85 . 1163 . 1910 . 1089” (filter 0, size 3, weight 0.91, readmit) and “... 905 . 1066 . 1067 . I21 . N17 ...” (filter 3, size 3, weight 1.71, readmit).]

Table III
Some sentences with strong responses for Filters 1 and 4. A code starting with a letter is a diagnosis, a code with all digits is a procedure, and a code ending with “m” is a time-gap. The heights of the codes are proportional to their response weights. The sub-sequences Z85.1163.1910 and 1066.1067.I21 respond strongly to the positive class.

The first sub-sequence is about cancer history (Z85), a biopsy procedure (1163) and cerebral anesthesia (1910). The other sub-sequence is about heart attack (I21) and kidney-related procedures (1066 and 1067).

From strong and frequent filter responses in all sentences, we derive the list of motifs. Table IV lists the motifs with the largest weights and highest frequency of occurrence for code chapters E, I and O. The first motif of Filter 45 shows the pattern that a treatment removing toxic substances from the blood co-occurs with care involving dialysis and readmission within 1 month. The second motif in the same row discovers the pattern that type-I diabetes patients receive education about the information and management of diabetes. The third motif in the same row shows type-II diabetes patients readmitting within 1-3 months. Filter 26 demonstrates the co-occurrence of diseases related to diabetes. The three motifs show that type-II diabetes patients can have complications such as heart failure, vitamin D deficiency and kidney failure. Filters 10 and 35 show diseases and treatments related to the circulatory system, whereas pregnancy- and birth-related motifs are shown in Filters 2 and 33 in the last two rows.


Table II
Retrieving the top 5 similar diagnoses and procedures.

Single delivery by cesarean section
  Diagnoses: Maternal care for disproportion; Placenta praevia; Complications of puerperium; Failed induction of labor; Diabetes mellitus in pregnancy
  Procedures: Cesarean section; Medical or surgical induction of labour; Manipulation of fetal presentation; Other procedures associated with delivery; Forceps delivery

Type 2 diabetes mellitus
  Diagnoses: Personal history of medical treatment; Presence of cardiac/vascular implants; Personal history of certain other diseases; Unspecified diabetes mellitus; Problems related to lifestyle
  Procedures: Cerebral anesthesia; Other digital subtraction angiography; Examination procedures on uterus; Medical or surgical induction of labour; Coronary angiography

Atrial fibrillation and flutter
  Diagnoses: Paroxysmal tachycardia; Unspecified kidney failure; Cardiomyopathy; Shock, not elsewhere classified; Other conduction disorders
  Procedures: Insertion or removal procedures on aorta; Electrophysiological studies [EPS]; Other procedures on atrium; Coronary artery bypass - other graft; Coronary artery bypass - saphenous vein

Table IV
Retrieving 3 motifs for each of the 6 filters with the largest weights and highest frequency, for code chapters O, I and E.

Filter 45:
  0-1m 1060 Z49 = Time-gap; Haemoperfusion; Care involving dialysis
  1916 E10 Z45 = Allied health intervention, diabetes education; Type 1 diabetes mellitus; Adjustment and management of drug delivery or implanted device
  1-3m E11 Z45 = Time-gap; Type 2 diabetes mellitus; Adjustment and management of drug delivery or implanted device

Filter 26:
  E11 I48 I50 = Type 2 diabetes mellitus; Atrial fibrillation and flutter; Heart failure
  E11 E55 I48 = Type 2 diabetes mellitus; Vitamin D deficiency; Atrial fibrillation and flutter
  E11 I50 N17 = Type 2 diabetes mellitus; Heart failure; Acute kidney failure

Filter 10:
  1893 I48 K35 = Exchange transfusion; Atrial fibrillation and flutter; Acute appendicitis
  1005 A41 I48 = Panendoscopy to ileum with administration of tattooing agent; Other sepsis; Atrial fibrillation and flutter
  1-3m I48 Z45 = Time-gap; Atrial fibrillation and flutter; Adjustment and management of drug delivery or implanted device

Filter 35:
  1909 727 I83 = Intravenous regional anesthesia; Interruption of sapheno-femoral and sapheno-popliteal junction varicose veins; Varicose veins of lower extremities
  1620 I83 L57 = Excision of lesion(s) of skin and subcutaneous tissue of foot; Varicose veins of lower extremities; Skin changes due to chronic exposure to nonionising radiation
  1910 768 I83 = Sedation; Transcatheter embolisation of other blood vessels; Varicose veins of lower extremities

Filter 2:
  D68 O80 Z37 = Other coagulation defects; Single spontaneous delivery; Outcome of delivery
  1344 O75 O80 = Other suture of current obstetric laceration or rupture without perineal involvement; Other complications of labor and delivery; Single spontaneous delivery
  1344 O75 O82 = Other suture of current obstetric laceration or rupture without perineal involvement; Other complications of labour and delivery; Single delivery by caesarean section

Filter 33:
  1333 1340 O09 = Neuraxial block during labour and delivery procedure; Emergency lower segment caesarean section; Duration of pregnancy
  1340 O14 Z37 = Emergency lower segment caesarean section; Gestational [pregnancy-induced] hypertension with significant proteinuria; Outcome of delivery
  1340 3-6m O34 = Emergency lower segment caesarean section; Time-gap; Maternal care for known or suspected abnormality of pelvic organs

VI. DISCUSSION

We have presented Deepr, a new deep learning architecture that provides end-to-end predictive analytics for healthcare services. Deepr reads directly from raw medical records and predicts future outcomes. This departs from traditional machine learning, which relies on expensive manual feature extraction. Deepr learns to extract meaningful features by itself, without expert supervision. This translates to uncovering predictive local motifs in the space of diseases and interventions. These capacities are not seen in existing methods.

r) Significance: Deepr contributes to the growing literature of predictive medicine in multiple ways. First, it is able to uncover the underlying space of diseases and interventions, showing the relationships between them. The largest disease cluster in Fig. 3 suggests that diseases may interact in a complex way, and current representations of disease hierarchies such as those in ICD-10 may not reflect the true nature of medical disorders. Second, Deepr detects predictive motifs of comorbidity, care patterns and disease progression. The motifs suggest a new look into the complex interactions between diseases and between diseases and care. Third, similar patients can be retrieved not just using past history, but from the likelihood of future risks as well. This would, for example, help to quickly identify an effective treatment regime based on similar patients who responded well to the treatment, or to alert the care team of a potential risk based on similar patients who had experienced it before. Finally, Deepr predicts the future risk for a patient and explains why (by means of motif responses), which is the core of modern prospective healthcare.

With these capabilities, Deepr can enable targeted monitoring, treatments and care packaging. This is highly important for chronic disease management, which requires ongoing care and evaluation. For health services, a high predictive accuracy of risk will lead to better prioritization and allocation of resources. For patients, accurate risk estimation is an important step toward personalized care. Patients and families will be prompted to become more aware of the conditions and risks, leading to proactive health management and help seeking. Deepr is generic and can be implemented on existing EMR systems, enabling innovative healthcare practices with better efficiency and outcomes. For example, doctors, when seeing a patient, may consult the machine for a second opinion with transparent, evidence-based reasoning. Because the machine does not miss any piece of information in the database, important signals are less likely to be overlooked.

s) Comparison to recent work on medical records: Deep learning in healthcare has recently attracted great interest. The most popular application is medical imaging using CNNs [8], motivated by the recent successes in cognitive vision [12], [22], [25]. However, there has been limited work on non-cognitive modalities. On time-series data (e.g., ICU measurements), the main difficulty is handling missing data, with recent work in [4], [23], [28], [38]. In [23], time-series are modeled using autoencoders (an unsupervised feedforward net) to discover meaningful phenotypes. In [4], [28], recurrent nets are used, and in [38], a convolutional net is employed. Deepr can be applied to these data after a discretization of continuous signals into discrete words (e.g., through cut-points).
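A minimal sketch of such a discretization, assuming quantile-based cut-points estimated on training data (the test name, bin count and values below are illustrative, not part of Deepr):

import numpy as np

def fit_cutpoints(values, n_bins=5):
    """Estimate quantile cut-points for one lab test from training data."""
    interior_quantiles = np.linspace(0, 100, n_bins + 1)[1:-1]
    return np.percentile(values, interior_quantiles)

def to_word(test_name, value, cutpoints):
    """Map a continuous measurement to a discrete word, e.g. 'GLUCOSE_BIN3'."""
    bin_index = int(np.searchsorted(cutpoints, value))   # 0 .. n_bins-1
    return "{}_BIN{}".format(test_name, bin_index)

# Illustrative usage: discretized readings can then join the same
# vocabulary as diagnosis and procedure codes in the input sequence.
train_glucose = np.array([4.2, 5.1, 5.6, 6.3, 7.8, 9.4, 11.0])
cutpoints = fit_cutpoints(train_glucose, n_bins=5)
print(to_word("GLUCOSE", 8.1, cutpoints))   # -> 'GLUCOSE_BIN3'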

On routine medical records, Deepr is the only method that employs convolutional nets, but there exist alternative architectures. Feedforward nets have been used [26], [10], [35]. Recurrent neural networks (RNNs) on medical records include Doctor AI [6] and DeepCare [37]. Doctor AI is an RNN adapted for medical events, where both next events and time-gaps are predicted. DeepCare is a sophisticated model that represents time-gaps using a parametric model. Similar to our observation, the authors of DeepCare also noticed an interesting analogy between natural languages and EMR, where an EMR is similar to a sentence, and diagnoses and interventions play the role of nouns and modifiers. While DeepCare is powerful on long records, it is less effective on short records, e.g., those with only one or two admissions. Deepr, on the other hand, does not suffer from this limitation. Stochastic deep neural nets such as deep Boltzmann machines are used in [32]. Deep non-neural nets have also been suggested in [13]. These methods are likely to be expensive to train and to use for prediction.

Embedding of medical concepts has been proposed in contemporary work [5], [7], [37], [45]. In [7], medical concepts are embedded using word2vec [33], ignoring time gaps. The Med2Vec in [5] extends word2vec to embed visits. Both word2vec and Med2Vec model local collocations, but do not explicitly model motifs (with precise relative positions). In [45], a global model known as eNRBM embeds patients into vectors via regularized nonnegative restricted Boltzmann machines [36]. Local motifs are not modeled, and variable record length and time gaps are not properly handled. Discovering local motifs by means of convolutions has been suggested in [48] through matrix factorization; however, that work does not do prediction.

t) Limitations and future work: There is room for future work. First, long-term dependencies are captured only through a max-pooling operation. This is rather simplistic given the complex dynamics between care processes and disease processes [37]. A better model should pool information in a time-sensitive way, e.g., recent events matter more than distant ones (a simple form of such pooling is sketched below). At present, Deepr works exclusively on recorded events such as diagnoses and interventions. Integration with clinical narratives would be highly useful because rich information is buried in unstructured text; this can be done within the same framework as Deepr because of the sequential nature of text. Our evaluation has been limited to a common risk known as unplanned readmission. However, Deepr is not limited to any specific type of future risk; it can equally be applied to predicting the onset or progression of a disease.
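As an illustration of the kind of time-sensitive pooling suggested above, the sketch below applies an exponential recency weight to the motif responses before max-pooling. The decay rate is an arbitrary assumption, and none of this is part of the current Deepr model:

import numpy as np

def time_decayed_max_pool(responses, decay=0.05):
    """Max-pool motif responses with an exponential recency weight.

    responses: array of shape (seq_len, n_filters) holding the convolution
               outputs, ordered from oldest to most recent position.
    Recent positions keep a weight near 1; distant positions are
    down-weighted, so an old but strong motif no longer dominates by default.
    """
    seq_len = responses.shape[0]
    ages = np.arange(seq_len)[::-1]          # 0 for the most recent position
    weights = np.exp(-decay * ages)          # exponential recency weighting
    return (responses * weights[:, None]).max(axis=0)

# Illustrative usage: 10 sequence positions, 3 filters.
responses = np.random.rand(10, 3)
pooled = time_decayed_max_pool(responses, decay=0.1)   # shape (3,)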

REFERENCES

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Ognjen Arandjelovic. Discovering hospital admission patterns using models learnt from electronic hospital records. Bioinformatics, page btv508, 2015.

[3] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf., pages 1–7, 2010.

[4] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. arXiv preprint arXiv:1606.01865, 2016.

[5] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, and Jimeng Sun. Multi-layer representation learning for medical concepts. KDD, 2016.

[6] Edward Choi, Mohammad Taha Bahadori, and Jimeng Sun. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. arXiv preprint arXiv:1511.05942, 2015.

[7] Youngduck Choi. Learning low-dimensional representations of medical concepts. Proceedings of the AMIA Summit on Clinical Research Informatics (CRI), 2016.

[8] Dan C Ciresan, Alessandro Giusti, Luca M Gambardella, and Jürgen Schmidhuber. Mitosis detection in breast cancer histology images with deep neural networks. In International Conference on Medical Image Computing and Computer-assisted Intervention, pages 411–418. Springer, 2013.

[9] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.

[10] Joseph Futoma, Jonathan Morris, and Joseph Lucas. A comparison of models for predicting early hospital readmissions. Journal of Biomedical Informatics, 56:229–238, 2015.

[11] Danning He, Simon C Mathews, Anthony N Kalloo, and Susan Hutfless. Mining high-dimensional administrative claims data to predict early hospital readmissions. Journal of the American Medical Informatics Association, 21(2):272–279, 2014.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[13] Ricardo Henao, James T Lu, Joseph E Lucas, Jeffrey Ferranti, and Lawrence Carin. Electronic Health Record Analysis via Deep Poisson Factor Models. JMLR, 2016.

[14] Rui Henriques, Cláudia Antunes, and Sara C Madeira. Generative modeling of repositories of health records for predictive tasks. Data Mining and Knowledge Discovery, pages 1–34, 2014.

[15] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[16] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[17] Zhengxing Huang, Xudong Lu, and Huilong Duan. Latent treatment pattern discovery for clinical processes. Journal of Medical Systems, 37(2):1–10, 2013.

[18] Christopher H Jackson, Linda D Sharples, Simon G Thompson, Stephen W Duffy, and Elisabeth Couto. Multistate Markov models for disease progression with classification error. Journal of the Royal Statistical Society: Series D (The Statistician), 52(2):193–209, 2003.

[19] Anders Boeck Jensen, Pope L Moseley, Tudor I Oprea, Sabrina Gade Ellesøe, Robert Eriksson, Henriette Schmock, Peter Bjødstrup Jensen, Lars Juhl Jensen, and Søren Brunak. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nature Communications, 5, 2014.

[20] Peter B Jensen, Lars J Jensen, and Søren Brunak. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395–405, 2012.

[21] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[22] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.

[23] Thomas A Lasko, Joshua C Denny, and Mia A Levy. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PLoS ONE, 8(6):e66341, 2013.

[24] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.

[25] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[26] Zhaohui Liang, Gang Zhang, Jimmy Xiangji Huang, and Qinmin Vivian Hu. Deep learning for healthcare decision making with EMRs. In Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on, pages 556–559. IEEE, 2014.

[27] Tsungnan Lin, Bill G Horne, Peter Tino, and C Lee Giles. Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338, 1996.

[28] Zachary C Lipton, David C Kale, and Randall Wetzel. Directly Modeling Missing Data in Sequences with RNNs: Improved Classification of Clinical Time Series. arXiv preprint arXiv:1606.04130, 2016.

[29] Chuanren Liu, Fei Wang, Jianying Hu, and Hui Xiong. Temporal phenotyping from longitudinal electronic health records: A graph based framework. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 705–714. ACM, 2015.

[30] Christopher D Manning. Computational linguistics and deep learning. Computational Linguistics, 2015.

[31] Jason Scott Mathias, Ankit Agrawal, Joe Feinglass, Andrew J Cooper, David William Baker, and Alok Choudhary. Development of a 5 year life expectancy index in older adults using predictive mining of electronic health record data. Journal of the American Medical Informatics Association, 20(e1):e118–e124, 2013.

[32] Saeed Mehrabi, Sunghwan Sohn, Dingheng Li, Joshua J Pankratz, Terry Therneau, Jennifer L St Sauver, Hongfang Liu, and Mathew Palakal. Temporal pattern and association discovery of diagnosis codes using deep learning. In Healthcare Informatics (ICHI), 2015 International Conference on, pages 408–416. IEEE, 2015.

[33] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. word2vec, 2014.

[34] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[35] Riccardo Miotto, Li Li, Brian A Kidd, and Joel T Dudley. Deep Patient: An unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, 6, 2016.

[36] T.D. Nguyen, T. Tran, D. Phung, and S. Venkatesh. Learning Parts-based Representations with Nonnegative Restricted Boltzmann Machine. In Proc. of 5th Asian Conference on Machine Learning (ACML), Canberra, Australia, Nov 2013.

[37] Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh. DeepCare: A Deep Dynamic Memory Model for Predictive Medicine. arXiv preprint arXiv:1602.00357, 2016.

[38] Narges Razavian and David Sontag. Temporal convolutional neural networks for diagnosis from lab tests. arXiv preprint arXiv:1511.07938, 2015.

[39] Patrick Royston and Willi Sauerbrei. Interactions between treatment and continuous covariates: a step toward individualizing therapy. Journal of Clinical Oncology, 26(9):1397–1399, 2008.

[40] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

[41] Mansour TA Sharabiani, Paul Aylin, and Alex Bottle. Systematic review of comorbidity indices for administrative data. Medical Care, 50(12):1109–1118, 2012.

[42] Ralph Snyderman and R Sanders Williams. Prospective medicine: the next health care transformation. Academic Medicine, 78(11):1079–1084, 2003.

[43] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[44] Truyen Tran, Wei Luo, Dinh Phung, Sunil Gupta, Santu Rana, Richard L Kennedy, Ann Larkins, and Svetha Venkatesh. A framework for feature extraction from hospital medical data with applications in risk prediction. BMC Bioinformatics, 15(1):6596, 2014.

[45] Truyen Tran, Tu Dinh Nguyen, Dinh Phung, and Svetha Venkatesh. Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM). Journal of Biomedical Informatics, 54:96–105, 2015.

[46] Truyen Tran, Dinh Phung, Wei Luo, and Svetha Venkatesh. Stabilized sparse ordinal regression for medical risk stratification. Knowledge and Information Systems, 2014. DOI: 10.1007/s10115-014-0740-4.

[47] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

[48] F. Wang, N. Lee, J. Hu, J. Sun, and S. Ebadollahi. Towards heterogeneous temporal clinical event pattern discovery: a convolutional approach. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 453–461. ACM, 2012.

[49] Xiang Wang, David Sontag, and Fei Wang. Unsupervised learning of disease progression models. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 85–94. ACM, 2014.

[50] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

[51] Xiang Zhang and Yann LeCun. Text understanding from scratch. arXiv preprint arXiv:1502.01710, 2015.