
Predicting Semantic Descriptions from Medical Images with Convolutional Neural Networks

Thomas Schlegl1⋆, Sebastian Waldstein2, Wolf-Dieter Vogl1,2, Ursula Schmidt-Erfurth2, and Georg Langs1

1 Computational Imaging Research Lab, Department of Biomedical Imaging and Image-guided Therapy, Medical University Vienna, Austria
[email protected], [email protected]

2 Christian Doppler Laboratory for Ophthalmic Image Analysis, Vienna Reading Center, Department of Ophthalmology and Optometry, Medical University Vienna, Austria

Abstract. Learning representative computational models from medical imaging data requires large training data sets. Often, voxel-level annotation is unfeasible for sufficient amounts of data. An alternative to manual annotation is to use the enormous amount of knowledge encoded in imaging data and corresponding reports generated during clinical routine. Weakly supervised learning approaches can link volume-level labels to image content but suffer from the typical label distributions in medical imaging data where only a small part consists of clinically relevant abnormal structures. In this paper we propose to use a semantic representation of clinical reports as a learning target that is predicted from imaging data by a convolutional neural network. We demonstrate how we can learn accurate voxel-level classifiers based on weak volume-level semantic descriptions on a set of 157 optical coherence tomography (OCT) volumes. We specifically show how semantic information increases classification accuracy for intraretinal cystoid fluid (IRC), subretinal fluid (SRF) and normal retinal tissue, and how the learning algorithm links semantic concepts to image content and geometry.

1 Introduction

Medical image analysis extracts diagnostically relevant information such as position and segmentations of abnormalities, or the quantitative characteristics of appearance markers from imaging data. To this end, algorithms are typically trained on annotated data. While the detection of subtle disease characteristics requires large training data sets, annotation does not scale well, since it is costly, time consuming and error-prone. An alternative for learning classifiers or predictors from very large data sets is to rely on existing data generated during clinical routine, such as imaging data, and corresponding reports.

⋆ This work has received funding from the European Union FP7 (KHRESMOI FP7-257528, VISCERAL FP7-318068) and the Austrian Federal Ministry of Science, Research and Economy.

Published in Proceedings of IPMI’15, 2015

We propose to learn the link between semantic information in textual clinical reports and imaging data, by training convolutional neural networks that predict semantic descriptions from images without additional annotation. Experiments show that the inclusion of semantic representations has advantages over standard multiple-instance learning with independent labels. It increases the classification accuracy of pathologies, and learns semantic concepts such as spatial position.

Learning from Medical Imaging Data Typical clinical imaging departments generate hundreds of thousands of image volumes per year that are assessed by clinical experts. The image and textual information comprise a rich source of knowledge about epidemiology and imaging markers that are a crucial reference during clinical routine and treatment guidance. They promise to serve as basis for the detection of imaging biomarkers, co-morbidities, and subtle signatures that are relevant for treatment decisions. Training reliable classifiers or segmentation algorithms is crucial in their processing, but requires annotated training data. While supervised learning based on annotated imaging data yields accurate classification results, annotation becomes unfeasible for large data. At the same time, expert reports created during clinical routine offer detailed descriptions of observations in the imaging data, and could fill this gap. They are currently largely unexploited.

Contribution In this paper we propose a method to use the semantic content of textual reports linked to the image data instead of voxel-wise annotations for the training of an image classifier. We evaluate if the semantic information in medical reports can improve weakly supervised learning of abnormality detectors over standard multiple-instance learning with independent labels. Since medical reports not only list observations (pathologies) but also semantic concepts of their locations, we learn the relationship between these semantic terms and specific local entities in the imaging data, together with their location. The algorithm has to learn a mapping from image location to semantic location information encoded in a semantic target vector. The benefits of this algorithm are two-fold. First, we can estimate semantic descriptions from imaging data; second, we learn an accurate voxel-wise classifier without the need for voxel-wise annotations in the training data.

Related Work Weakly supervised learning approaches use binary class labels that indicate the presence or absence of an object of the corresponding class in an image or volume. Multiple-instance learning [1] is a form of weakly supervised learning to solve this problem, and views images as labeled bags of instances. A positive class label is assigned to the bag of examples if at least one example belongs to the class. The negative class label is assigned to all examples of the bag if no example belongs to the class. The corresponding situation in medical imaging consists of information that somewhere in the image there is a certain abnormality. Weakly supervised approaches learn from these weak or noisy labels. Examples are the Diverse Density Framework by Maron and Lozano-Perez [2], mappings among images and captions [3] using algorithms, such as Random Forests [4] or Support Vector Machines [5]. A recently published work [6] presents a multi-fold multiple-instance learning approach for weakly supervised object category localization. The work of Verbeek et al. [7] is another weakly supervised learning example, wherein semantic segmentation models are learned using image-wide class labels. While these methods yield lower classification accuracy compared to voxel-wise training set labels, they have proven useful in computer vision on natural images. Unfortunately, this does not translate directly to medical imaging data. For example, a large part of data such as retinal spectral-domain optical coherence tomography (SD-OCT) images show normal tissue. While abnormalities typically cover only a tiny fraction of the volume, they are the focus of diagnostic attention and observations encoded in the textual report. Furthermore, abnormality appearance can be modulated by location. This puts standard multiple-instance learning at a disadvantage. Our work differs from these weakly supervised learning approaches in two aspects: (i) We do not extract local or global image descriptors but the visual input representation is learned by our network and adapts to imaging and tissue characteristics. (ii) We do not only use class labels depicting the global presence or absence of objects in the entire image but use semantic information from clinical reports.

We use convolutional neural networks (CNN) to learn representations and classifiers. They were introduced in 1980 by Fukushima [8], and have been used to solve various classification problems (cf. [9,10,11]). They can automatically learn translation invariant visual input representations, enabling the adaptation of visual feature extractors to data, instead of manual feature engineering. The application of CNN in the domain of medical image analysis ranges from manifold learning in the frequency domain of 3D brain magnetic resonance (MR) imaging data [12] to domain adaptation via unsupervised pre-training of CNN to improve lung tissue classification accuracy on computed tomography (CT) imaging data [13]. A weakly supervised approach using CNN performed on natural images was presented in [14]. Our work differs from the aforementioned approaches as: (i) we do not perform supervised (pre-)training, (ii) we use medical images and (iii) our classifier does not only predict global image labels indicating the presence or absence of object classes in an image but also corresponding location information.

2 Weakly Supervised Learning of Semantic Descriptions

A CNN is a hierarchically structured feed-forward neural network comprising one or more pairs of convolution layers and succeeding max-pooling layers (cf. [10,11,13]). The stack of convolution and max-pooling layer pairs is typically followed by one or more fully-connected layers and a terminal classification layer. We can train more than one stack of pairs of convolution and max-pooling layers feeding into the first fully-connected layer. This enables training CNNs based on multiple scales. We use a CNN to perform voxel-wise classification on visual inputs and corresponding quantitative spatial location information. Figure 1 shows the architecture of the CNN. In the following we explain the specific representation of visual inputs and semantic targets.

Fig. 1. Multi-scale CNN architecture used in our experiments. One stack of three pairs of convolution (Conv) and max-pooling layers (MP) uses input image patches of size 35 × 35. The second stack comprises two pairs of convolution and max-pooling layers and uses input image patches of size 71 × 71 (centered at the same position). The resulting outputs $Output_k$ of the $k$ pairs of convolution $Conv_k$ and max-pooling $MP_k$ layers are the inputs for the succeeding layers, with corresponding convolution filter sizes $fs$ and sizes of the pooling regions $ps$. Our CNN also comprises two fully-connected layers (FC) and a terminal classification layer (CL). The outputs of both stacks are connected densely with all neurons of the first fully-connected layer. The location parameters are fed jointly with the activations of the second fully-connected layer into the classification layer.
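For illustration, the following is a minimal sketch of a multi-scale CNN of this kind, written in PyTorch rather than the Theano implementation used in our experiments (see Section 3). The two input scales, the 2048- and 64-unit fully-connected layers (Section 3.1) and the late injection of the four location parameters follow the text and Figure 1; the filter sizes, pooling sizes and channel counts are illustrative assumptions.

```python
# Hedged architecture sketch, not the authors' implementation. Filter sizes,
# pooling sizes and channel counts are assumed; scales, FC sizes and the late
# injection of location parameters follow the text.
import torch
import torch.nn as nn

class MultiScaleCNN(nn.Module):
    def __init__(self, n_targets=12, n_loc=4):
        super().__init__()
        # Stack 1: three convolution/max-pooling pairs on 35 x 35 patches.
        self.stack1 = nn.Sequential(
            nn.Conv2d(1, 16, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3), nn.ReLU(), nn.MaxPool2d(2))
        # Stack 2: two convolution/max-pooling pairs on 71 x 71 patches.
        self.stack2 = nn.Sequential(
            nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(3),
            nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(3))
        # Flattened feature sizes, inferred once from dummy inputs.
        n1 = self.stack1(torch.zeros(1, 1, 35, 35)).numel()
        n2 = self.stack2(torch.zeros(1, 1, 71, 71)).numel()
        self.fc1 = nn.Linear(n1 + n2, 2048)   # first fully-connected layer
        self.fc2 = nn.Linear(2048, 64)        # second fully-connected layer
        # 3D coordinates + fovea distance enter below the classification layer.
        self.cls = nn.Linear(64 + n_loc, n_targets)

    def forward(self, x_small, x_large, loc):
        h1 = self.stack1(x_small).flatten(1)
        h2 = self.stack2(x_large).flatten(1)
        h = torch.relu(self.fc1(torch.cat([h1, h2], dim=1)))
        h = torch.relu(self.fc2(h))
        # Sigmoid outputs act as probabilities for the binary semantic targets.
        return torch.sigmoid(self.cls(torch.cat([h, loc], dim=1)))
```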

2.1 Representing Inputs and Targets

The overall data comprises $M$ tuples of medical imaging data, corresponding clinical reports and voxel-wise ground-truth class labels $\langle I_m, T_m, L_m \rangle$, with $m = 1, 2, \dots, M$, where $I_m \in \mathbb{R}^{n \times n}$ is an intensity image (e.g., a slice of an SD-OCT volume scan of the retina) of size $n \times n$, $L_m \in \{1, \dots, K+1\}^{n \times n}$ is an array of the same size containing the corresponding ground-truth class labels and $T_m$ is the corresponding textual report. During training we are only given $\langle I_m, T_m \rangle$ and train a classifier to predict $L_m$ from $I_m$ on new testing data. In this paper we propose a weakly supervised learning approach using semantic descriptions, where the voxel-level ground-truth class labels $L_m$ are not used for training but only for evaluation of the voxel-wise prediction accuracy.

Visual and Coordinate Input Information To capture visual information at different levels of detail we extract small square-shaped image patches $x^m_i \in \mathbb{R}^{\alpha \times \alpha}$ of size $\alpha$ and larger square-shaped image patches $\hat{x}^m_i \in \mathbb{R}^{\beta \times \beta}$ of size $\beta$ with $\alpha < \beta < n$, centered at the same spatial position $c^m_i$ in volume $I_m$, where $i$ is the index of the centroid of the image patch. For each image patch, we provide two additional quantitative location parameters to the network: (i) the 3D spatial coordinates $c^m_i \in \Omega \subset \mathbb{R}^3$ of the centroid $i$ of the image patches and (ii) the Euclidean distance $d^m_i \in \Omega \subset \mathbb{R}$ of the patch center $i$ to a given reference structure (in our case: the fovea) within the volume. We do not need to integrate these location parameters in the deep feature representation computation but inject them below the classification layer by concatenating the location parameters and activations of the fully-connected layer representing visual information (see Figure 1). The same input information is provided for all experiments.
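As an illustration, the following numpy sketch assembles one such input tuple: two co-centred in-plane patches plus the normalized coordinates and the distance to the fovea. The array layout, variable names and the computation of the distance in normalized coordinates are assumptions, not taken from the paper.

```python
# Illustrative sketch of the per-voxel input tuple <x, x_hat, c, d>; the
# (Z, X, Y) layout and the normalized-coordinate distance are assumptions.
import numpy as np

def extract_input(volume, coords_norm, fovea_norm, z, x, y, small=35, large=71):
    """volume: (Z, X, Y) intensity array; coords_norm: (Z, X, Y, 3) normalized
    coordinates within the retina; fovea_norm: (3,) normalized fovea position;
    (z, x, y): index of the patch centre voxel."""
    hs, hl = small // 2, large // 2
    # Two square in-plane (z/x) patches centred at the same position.
    x_small = volume[z - hs:z + hs + 1, x - hs:x + hs + 1, y]
    x_large = volume[z - hl:z + hl + 1, x - hl:x + hl + 1, y]
    c = coords_norm[z, x, y]              # 3D location parameters c_i
    d = np.linalg.norm(c - fovea_norm)    # Euclidean distance d_i to the fovea
    return x_small, x_large, np.append(c, d)
```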

Semantic Target Labels We assume that objects (e.g. pathology) are reported together with a textual description of their approximate spatial location. Thus a report $T_m$ consists of $K$ pairs of text snippets $\langle t^{m,k}_P, t^{m,k}_{Loc} \rangle$, with $k = 1, 2, \dots, K$, where $t^{m,k}_P \in \mathcal{P}$ describes the occurrence of a specific object class term and $t^{m,k}_{Loc} \in \mathcal{L}$ represents the semantic description of its spatial locations. These spatial locations can be both abstract subregions (e.g., centrally located) of the volume or concrete anatomical structures. Note that $t^{m,k}_{Loc}$ does not contain quantitative values, and we do not know the link between these descriptions and image coordinate information. This semantic information can come in $\Gamma$ orthogonal semantic groups (e.g., in (1) the lowest layer and (2) close to the fovea). That is, different groups represent different location concepts found in clinical reports. The extraction of these pairs from the textual document is based on semantic parsing [15] and is not subject of this paper. We decompose the textual report $T_m$ into the corresponding semantic target label $s^m \in \{0, 1\}^{K \cdot \sum_{\gamma} n_\gamma}$, with $\gamma = 1, 2, \dots, \Gamma$, where $K$ is the number of different object classes which should be classified (e.g. cyst), and $n_\gamma$ is the number of nominal region classes in one semantic group $\gamma$ of descriptions (e.g., $n_\gamma = 3$ for upper vs. central vs. lower layer, $n_\gamma = 2$ for close vs. far from reference structure). I.e., let us assume we have two groups; then $s^m$ is a $K$-fold concatenation of pairs of a binary layer group $g^k_1 \in \{0, 1\}^{n_1}$ with $n_1$ bits representing different layer classes and a binary reference location group $g^k_2 \in \{0, 1\}^{n_2}$ with $n_2$ bits representing relative locations to a reference structure. For all object classes, all bits of the layer group and all bits of the reference location group are set to 1 if they are mentioned mutually with the respective object class in the textual report. All bits of the corresponding layer group and all bits of the corresponding reference location group are set to 0 where the respective object class is not mentioned in the report. The vector $s^m$ of semantic target labels is assigned to all input tuples $\langle x^m_i, \hat{x}^m_i, c^m_i, d^m_i \rangle$ extracted from the corresponding volume $I_m$. Figure 2a shows an example of a semantic target label representation comprising two object classes. According to this binary representation the first object is mentioned mutually with layer classes 1, 2 and 3 and with reference location class 1 in the textual report. Figure 2c shows the corresponding volume information.
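To make the encoding concrete, the sketch below builds such a binary target vector for $K = 2$ object classes and two semantic groups with $n_1 = 4$ and $n_2 = 2$ bits, matching Figure 2a. The parsed-report format and the vocabulary strings are hypothetical stand-ins for the output of the semantic parser.

```python
# Sketch of the semantic target encoding for K = 2 classes and two groups
# (4 layer bits, 2 reference-location bits). Parsed-report format and
# vocabulary strings are hypothetical; only the bit layout follows the text.
import numpy as np

LAYERS = ["layer1", "layer2", "layer3", "layer4"]   # n_1 = 4
REF_LOC = ["close", "distant"]                      # n_2 = 2
CLASSES = ["IRC", "SRF"]                            # K = 2

def encode_report(report):
    """report: dict mapping an object class to (layer terms, reference terms)."""
    s = []
    for cls in CLASSES:
        layer_bits = np.zeros(len(LAYERS))
        ref_bits = np.zeros(len(REF_LOC))
        if cls in report:                        # class mentioned in the report
            layers, refs = report[cls]
            for term in layers:
                layer_bits[LAYERS.index(term)] = 1
            for term in refs:
                ref_bits[REF_LOC.index(term)] = 1
        s.extend(layer_bits)                     # bits stay 0 for unmentioned classes
        s.extend(ref_bits)
    return np.array(s)

# Example as in Figure 2a: the first object class occurs in layers 1-3 and
# close to the reference structure; the second class is not mentioned.
s_m = encode_report({"IRC": (["layer1", "layer2", "layer3"], ["close"])})
# s_m -> [1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
```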

Voxel-wise Ground Truth Labels To evaluate the algorithm, we use the ground-truth class label $l_i \in \{1, \dots, K+1\}$ from $L_m$ at the center position $c^m_i$ of the patches for every multi-scale image patch pair $\langle x^m_i, \hat{x}^m_i \rangle$. Labels include the reported observations $t^{m,k}_P$ and a healthy background label. $l_i$ is assigned to the whole multi-scale image patch pair $\langle x^m_i, \hat{x}^m_i \rangle$ centered at voxel position $i$.

Fig. 2. (a) Example of a semantic target label comprising two object classes ($K = 2$), each of which comprises a layer group $g^k_1$ with 4 bits (layer 1, 2, 3, or 4) and a reference location group $g^k_2$ with 2 bits (close or distant). (b) Prediction of a semantic description $\hat{s}^m_i$ that would lead to a corresponding object class label prediction $\hat{l}_i = 1$. (c) Visualization of the volume information which could lead to the given semantic target label shown in (a). (Best viewed in color)

2.2 Training to Predict Semantic Descriptors

We train a CNN to predict the semantic description from the imaging data and the corresponding location information provided for the patch center voxels. We use tuples $\langle x^m_i, \hat{x}^m_i, c^m_i, d^m_i, s^m_i \rangle$ for weakly supervised training of our model. The training objective is to learn the mapping

$$f : \langle x^m_i, \hat{x}^m_i, c^m_i, d^m_i \rangle \longmapsto s^m_i \qquad (1)$$

from multi-scale image patch pairs $\langle x^m_i, \hat{x}^m_i \rangle$, 3D spatial coordinates $c^m_i$ and a distance value $d^m_i$ to corresponding location specific noisy semantic targets $s^m_i$ in a weakly supervised fashion. During testing we apply the mapping to new image patches in the test set. During classification, an unseen tuple $\langle x^m_i, \hat{x}^m_i, c^m_i, d^m_i \rangle$ of multi-scale image patch pairs $\langle x^m_i, \hat{x}^m_i \rangle$ centered at voxel position $i$ and corresponding location parameters $c^m_i$ and $d^m_i$ causes activations of the classification layer, which are the predictions of the semantic descriptions $\hat{s}^m_i$. During training, all model parameters $\theta$ (weights and bias terms) of the whole model are optimized by minimizing the mean squared error between the actual volume-level semantic target labels $s^m_i$ and the voxel-level probabilities of semantic descriptions $\hat{s}^m_i$ predicted by the model.
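A minimal training-step sketch for this objective is given below, assuming the MultiScaleCNN sketch from above is in scope; the optimizer and learning rate are assumptions, since the text only specifies the mean squared error objective.

```python
# Minimal sketch of one weakly supervised training step: mean squared error
# between the volume-level semantic target and the network prediction.
# Optimizer and learning rate are assumptions; MultiScaleCNN is the sketch above.
import torch

model = MultiScaleCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

def train_step(x_small, x_large, loc, s_target):
    optimizer.zero_grad()
    s_pred = model(x_small, x_large, loc)   # predicted semantic description
    loss = criterion(s_pred, s_target)      # noisy, volume-level target
    loss.backward()
    optimizer.step()
    return loss.item()
```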

2.3 Evaluation of Local Image Content Classification

We want to know if the proposed approach learns a link between semantic concepts and image content and location. To this end, we perform weakly supervised learning as described above. During testing, we apply the trained CNN to new data. We transform the voxel-wise predictions of semantic descriptions into voxel-level class labels and compare these labels with ground-truth labels on the testing data. Specifically we are interested in the increase of accuracy caused by the inclusion of semantic information into the training procedure. An object class may occur simultaneously in a number of layer classes within the layer group and in a number of reference location classes within the reference location group. But if an object class is present in an image, then this occurrence has to be reflected by the predictions of both groups. So, based on the predictions of the semantic descriptions $\hat{s}^m_i$ we compute the mean activation $a^k_i$ of class $k$ over the maximum activation within each semantic location group $g^k_\gamma$:

$$a^k_i = \frac{1}{\Gamma} \sum_{\gamma=1}^{\Gamma} \max(g^k_\gamma) \qquad (2)$$

Now we can compute location-adjusted predictions $\hat{l}_i$ for the class $k$ having the highest mean activation:

$$\hat{l}_i = \begin{cases} 0 & \text{if } a^k_i < 0.5 \;\; \forall a^k_i,\ k = 1, 2, \dots, K \\ \operatorname{argmax}_k (a^k_i) & \text{otherwise} \end{cases} \qquad (3)$$

If the mean activations of all classes are less than 0.5, the label 0 (background class) is assigned to the corresponding patch center. The resulting class label predictions $\hat{l}_i$ are assigned to the voxels in the center of the image patches. Based on these class label predictions $\hat{l}_i$ we now can measure the performance of our model in terms of misclassification errors on object class labels. Figure 2b shows an example of a prediction $\hat{s}^m_i$ of a semantic description comprising two object classes and two semantic groups. The corresponding object class-wise mean activations would evaluate to $a^1_i = 0.8$ and $a^2_i = 0.2$. According to equation (3) that would lead to the object class label prediction $\hat{l}_i = 1$.
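The following numpy sketch implements equations (2) and (3) for the 12-bit encoding used in our experiments; the activation values in the example are hypothetical and merely chosen to reproduce the mean activations of the Figure 2b example.

```python
# Sketch of equations (2) and (3): per-class mean over the per-group maximum
# activations, followed by the location-adjusted label decision. Group sizes
# follow the 12-bit encoding (2 classes x (4 layer bits + 2 reference bits)).
import numpy as np

def decide_label(s_pred, group_sizes=(4, 2), n_classes=2):
    """s_pred: predicted semantic description for one patch centre voxel."""
    a = []
    for cls_vec in np.split(np.asarray(s_pred), n_classes):
        groups = np.split(cls_vec, np.cumsum(group_sizes)[:-1])
        a.append(np.mean([g.max() for g in groups]))     # equation (2)
    a = np.array(a)
    if np.all(a < 0.5):                                   # equation (3)
        return 0                                          # background class
    return int(np.argmax(a)) + 1

# Hypothetical activations chosen so that a_1 = 0.8 and a_2 = 0.2, as in the
# Figure 2b example; the decision is label 1.
s_pred = [0.9, 0.3, 0.2, 0.1, 0.7, 0.2,    # class 1: group maxima 0.9 and 0.7
          0.1, 0.3, 0.05, 0.1, 0.1, 0.1]   # class 2: group maxima 0.3 and 0.1
print(decide_label(s_pred))                # -> 1
```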

3 Experiments

Data, Data Selection and Preprocessing We evaluate the method on 157 clinical high resolution SD-OCT volumes of the retina with resolutions of 512 × 128 × 1024 voxels (voxel dimensions 12 × 47 × 2 µm). The OCT data we use is not generated instantly but single slices (in the z/x-plane) are acquired sequentially to form the volume. Due to relatively strong anisotropy of the imaging data, we work with 2D in-plane (z/x) patches. From these volumes we extract pairs of 2D image patches (see Figure 3a and Figure 3b) with scales 35 × 35 and 71 × 71 for 300,000 positions. The positions of the patch centers within an image slice as well as the slice number within a volume are sampled randomly. A fast patch based image denoising related to non-local-means is applied as preprocessing step. Additionally, the intensity values of the image patches are normalized by transforming the data to zero-mean and unit variance. The human retina can be subdivided into different layers. We use an implementation of an automatic layer segmentation algorithm following [16] and based on the top and bottom layer we compute a retina mask. The voxel positions within this mask are normalized into the range [0, 1], where the voxels at the top and bottom layer (z-axis), the voxels in the first and last column of the image (x-axis) and the voxels in the first and last slice of the volume (y-axis) are assigned to the marginals 0 and 1 respectively (see Figure 3c). These normalized 3D coordinates are used as location specific inputs. In every SD-OCT volume the position of the fovea is also annotated. We use the annotated position of the fovea as reference structure and provide the Euclidean distance of every image patch center as additional location specific input.

Fig. 3. (a) Visual inputs with two different scales. Patches in the same column share the centroid position. (b) Patch extraction at random positions from an SD-OCT intensity image: patches of size 35 × 35 (red), patches of size 71 × 71 (green) and corresponding patch centers (yellow). (c) Normalized coordinates of voxels within the retina. (Best viewed in color)
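A possible implementation of the coordinate normalization described above is sketched below; representing the top and bottom retinal surfaces as per-column z-indices is an assumption about the output of the layer segmentation.

```python
# Sketch of the coordinate normalization: z is scaled between the top and
# bottom retinal layer per (x, y) column, x and y over the image extent.
# Representing the layer surfaces as per-column z-indices is an assumption.
import numpy as np

def normalize_coords(shape, top_z, bottom_z):
    """shape: (Z, X, Y) volume shape; top_z, bottom_z: (X, Y) arrays holding
    the z-index of the top and bottom retinal layer for every column."""
    Z, X, Y = shape
    zz, xx, yy = np.meshgrid(np.arange(Z), np.arange(X), np.arange(Y), indexing="ij")
    z_norm = (zz - top_z) / np.maximum(bottom_z - top_z, 1)  # 0 at top, 1 at bottom layer
    x_norm = xx / (X - 1)                                    # 0 at first, 1 at last column
    y_norm = yy / (Y - 1)                                    # 0 at first, 1 at last slice
    return np.stack([np.clip(z_norm, 0.0, 1.0), x_norm, y_norm], axis=-1)
```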

Evaluation. For the purposes of evaluation of voxel-wise classification performance we extract ground-truth class labels at the patch center positions from the corresponding volume with voxel-wise annotations. These labels are assigned to the whole corresponding image patches. In our data, 73.43% of the patches are labeled as healthy tissue, 8.63% are labeled as IRC and 17.94% are labeled as SRF. Pairs of patches sampled at different positions within the same volume may partially overlap. We split the image patches on a patient basis into training and test set to perform 4-fold cross-validation, so that no patient appears in both the training and the test set.
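One way to realize such a patient-wise split is scikit-learn's GroupKFold, sketched below with dummy arrays standing in for the extracted patches, labels and patient identifiers.

```python
# Sketch of a patient-wise 4-fold split with scikit-learn's GroupKFold; the
# dummy arrays are placeholders, not the actual patch data.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(1000, 4)                     # placeholder patch features
y = np.random.randint(0, 3, size=1000)          # healthy / IRC / SRF labels
patient_ids = np.random.randint(0, 157, size=1000)

for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=patient_ids):
    # No patient contributes patches to both the training and the test fold.
    assert set(patient_ids[train_idx]).isdisjoint(set(patient_ids[test_idx]))
```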

We train a classifier to perform 3-class classification between IRC, SRF andnormal retinal tissue. Normal retinal tissue is handled as background class. Wecompare three approaches:


(1) Naïve weakly supervised learning: We perform weakly supervised learning that links volume-level class labels to image content. We only use the information about which object class (pathology) is present in the volume. This results in a 3-class multiple-instance classification problem. The volume-level target label is assigned to all image patches of the corresponding volume.

(2) Learning semantic descriptions: We evaluate the performance of our proposed learning strategy. We use the volume-level semantic representation of the reported pathologies. We use semantic target labels encoding two pathologies (IRC and SRF), each of which comprises four bits for the layer group and two bits for the reference structure group, resulting in a 12 bit semantic target vector (see Figure 2). The semantic equivalent in a textual report for the four classes in the layer group would be "ganglion cell complex" (top layer of the retina), "inner nuclear and plexiform layers", "outer nuclear and plexiform layers" and "photoreceptor layers" (bottom layers of the retina). The semantic equivalents for the two classes in the reference location group are "foveal" (in the vicinity of the fovea) and "extrafoveal" (at a distance from the fovea). This volume-level semantic representation is assigned to all image patches of the corresponding volume.

(3) Supervised learning: We perform fully supervised learning using the voxel-level annotations of class labels. We evaluate what classification accuracy can be obtained when the maximum information at every single voxel position, namely voxel-wise class labels, is available.

All experiments are performed using Python 2.6 with the Theano [17] library and run on a graphics processing unit (GPU) using CUDA 5.5.

3.1 Model Parameters

For every approach, training of the CNN is performed for 200 epochs. We choose a multi-scale CNN architecture with two parallel stacks of pairs of convolution and max-pooling layers. These stacks take as input image patches of size 35 × 35 and 71 × 71 and comprise 3 and 2 pairs of convolution and max-pooling layers, respectively (see Figure 1). The outputs of the max-pooling layers on top of both stacks are concatenated and fed into a fully-connected layer with 2048 neurons. This layer is followed by a second fully-connected layer with 64 neurons. The activations of this layer are concatenated with the spatial location parameters of the patch centers and fed into the terminal classification layer. All layer parameters are learned during classifier training. The architecture of our multi-scale CNN and the detailed model parameters are shown in Figure 1. The model parameters were found empirically in preceding experiments using OCT data that differs in visual appearance from the data used in our presented experiments. The model parameters were tuned in these preceding experiments solely on the supervised training task to be a good trade-off between attainable classification accuracy and runtime efficiency. Thereafter they were fixed and used in all of our presented experiments to ensure comparability between the results of the different experiments.


Table 1. Confusion matrix of classification results and corresponding class-wise accuracies on (a) the naïve weakly supervised learning approach, (b) the weakly supervised learning approach using semantic descriptions and (c) the supervised learning approach.

                          prediction
                  healthy     IRC      SRF   accuracy
(a)  healthy       144329    4587    70994     0.6563
     IRC            10391    5653     9718     0.2194
     SRF             4978     231    48511     0.9030
(b)  healthy       173121   10603    36186     0.7872
     IRC             2230   22102     1430     0.8579
     SRF             2963    1285    49472     0.9209
(c)  healthy       214848    2303     2759     0.9770
     IRC             2222   23086      454     0.8961
     SRF             3670     638    49412     0.9198

3.2 Classification Results

Experiment (1) The naïve weakly supervised learning approach represents the most restricted learning approach and serves as reference scenario. Classification results are shown in Table 1a. This approach yields a classification accuracy over all three classes of 66.30%. Only 21.94% of samples showing IRC are classified correctly, while the SRF class is classified relatively accurately (90.30% of all patches showing SRF are correctly classified).

Experiment (2) The classification results of our proposed weakly supervised learning approach using semantic descriptions are shown in Table 1b. This approach yields a classification accuracy over all three classes of 81.73%, with lower accuracy for the healthy class (78.72%) compared to SRF (92.09% accuracy), which is also the best performance on SRF over all three approaches.

Experiment (3) As expected, the supervised learning approach performs best. This approach yields an overall classification accuracy over all three classes of 95.98%. Classification results are shown in Table 1c. While it has most difficulties with IRC (89.61% accuracy), it still obtains the highest accuracy for IRC over all three approaches. This approach also performs best for the healthy class (97.70% accuracy). Figure 4 shows a comparison of voxel-wise classification results obtained by the different approaches. For each of the three training approaches the computation of the voxel-wise map took below 10 seconds.
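As a quick sanity check of how the class-wise accuracies in Table 1 relate to the confusion matrices, the diagonal of each matrix divided by its row sums reproduces the reported values, shown here for variant (a).

```python
# Class-wise accuracies in Table 1 are the confusion-matrix diagonal divided
# by the row sums, shown here for the naive weakly supervised variant (a).
import numpy as np

cm_a = np.array([[144329,  4587, 70994],    # true healthy
                 [ 10391,  5653,  9718],    # true IRC
                 [  4978,   231, 48511]])   # true SRF
print(cm_a.diagonal() / cm_a.sum(axis=1))   # -> [0.6563 0.2194 0.9030] (approx.)
```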

4 Discussion

We propose a weakly supervised learning method using semantic descriptions to improve classification performance when no voxel-wise annotations but only textual descriptions linked to image data are available. A CNN learns optimal multi-scale visual representations and integrates them with location specific inputs to perform multi-class classification. We evaluated the accuracy of the proposed approach on clinical SD-OCT data of the retina and compared its performance on class label prediction accuracy with naïve weakly supervised learning and with fully supervised learning. Experiments demonstrate that based on volume-level semantic target labels the model learns voxel-level predictions of object classes. Including semantic information substantially improves classification performance. In addition to capturing the structure in intensity image patches and building a pathology specific model, the algorithm learns a mapping from 3D spatial coordinates and Euclidean distances to the fovea to semantic description classes found in reports. That is, the CNN learns diverse abstract concepts of "location". The learning approach can be applied to automatic classification and segmentation on medical imaging data for which the corresponding report holds semantic descriptions in the form of pairs of pathology and anatomical location.

Fig. 4. (a) Intensity image of a single slice (zx-view) of a clinical SD-OCT scan of the retina. (b) Voxel-wise ground-truth annotations of IRC (red) and SRF (blue). Automatic segmentation results corresponding to voxel-wise class label predictions obtained with (c) the naïve weakly supervised learning approach, (d) our weakly supervised learning approach using semantic descriptions and (e) the supervised learning approach. On the class label predictions (c-e) no post-processing was performed. (Best viewed in color)

Exploiting semantics over naïve weakly supervised learning has several benefits. The latter performs poorly for classes occurring only in few volumes or in a vanishingly low amount of voxels. While this is a characteristic of many diagnostically relevant structures in medical imaging, weakly supervised learning performed particularly poorly on them (e.g., IRC), while exhibiting the strongest bias towards the SRF class in our experiments. This can be explained by the fact that many volumes show SRF, resulting in a large amount of patches having the (false) noisy SRF class label. Our approach achieves higher classification accuracy on class labels over all three classes by approximately 15%. Results indicate that semantic descriptions which provide class occurrences and corresponding abstract location information provide a rich source to improve classification tasks in medical image analysis where no voxel-wise annotations are available. This is important, because in many cases medical images have associated textual descriptions generated and used during clinical routine. The proposed approach enables the use of these data on a scale for which annotation would not be possible.


References

1. Dietterich, T.G., Lathrop, R.H., Lozano-Perez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence 89(1) (1997) 31–71

2. Maron, O., Lozano-Perez, T.: A framework for multiple-instance learning. Advances in Neural Information Processing Systems (1998) 570–576

3. Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep Boltzmann machines. In: Advances in Neural Information Processing Systems 25. (2012) 2231–2239

4. Leistner, C., Saffari, A., Santner, J., Bischof, H.: Semi-supervised random forests. In: 12th International Conference on Computer Vision, IEEE (2009) 506–513

5. Zhou, Z.H., Zhang, M.L.: Multi-instance multi-label learning with application to scene classification. Advances in NIPS 19 (2007) 1609–1616

6. Cinbis, R.G., Verbeek, J., Schmid, C.: Multi-fold MIL training for weakly supervised object localization. In: Conference on Computer Vision and Pattern Recognition, IEEE (2014)

7. Verbeek, J., Triggs, B.: Region classification with Markov field aspect models. In: Conference on Computer Vision and Pattern Recognition, IEEE (2007) 1–8

8. Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36(4) (1980) 193–202

9. Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the ACM 54(10) (2011) 95–103

10. Ciresan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image classification. In: Conference on Computer Vision and Pattern Recognition, IEEE (2012) 3642–3649

11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. Volume 1. (2012) 4

12. Brosch, T., Tam, R.: Manifold learning of brain MRIs by deep learning. Medical Image Computing and Computer-Assisted Intervention (2013) 633–640

13. Schlegl, T., Ofner, J., Langs, G.: Unsupervised pre-training across domains improves lung tissue classification. In: Medical Computer Vision: Large Data in Medical Imaging. Springer (2014)

14. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Weakly supervised object recognition with convolutional neural networks. Technical Report HAL-01015140, INRIA (2014)

15. Pradhan, S., Ward, W., Hacioglu, K., Martin, J., Jurafsky, D.: Shallow semantic parsing using support vector machines. In: Proceedings of HLT/NAACL. (2004) 233–240

16. Garvin, M.K., Abramoff, M.D., Wu, X., Russell, S.R., Burns, T.L., Sonka, M.: Automated 3-D intraretinal layer segmentation of macular spectral-domain optical coherence tomography images. Transactions on Medical Imaging, IEEE 28(9) (2009) 1436–1447

17. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: A CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy). Volume 4. (2010)