Automatic L3 slice detection in 3D CT images using fully-convolutional networks

Fahdi Kanavati, Shah Islam, Eric O. Aboagye, and Andrea Rockall

Comprehensive Cancer Imaging Centre, Hammersmith Hospital, Imperial College London, UK

Abstract

The analysis of single CT slices extracted at the third lumbar vertebra (L3) has garnered significant clinical interest in the past few years, in particular in regards to quantifying sarcopenia (muscle loss). In this paper, we propose an efficient method to automatically detect the L3 slice in 3D CT images. Our method works with images with a variety of fields of view, occlusions, and slice thicknesses. 3D CT images are first converted into 2D via Maximal Intensity Projection (MIP), reducing the dimensionality of the problem. The MIP images are then used as input to a 2D fully-convolutional network to predict the L3 slice location in the form of 2D confidence maps. In addition, we propose a variant architecture with fewer parameters that produces a 1D confidence map and offers slightly faster prediction without loss of accuracy. Quantitative evaluation of our method on a dataset of 1006 3D CT images yields a median error of 1mm, similar to the inter-rater median error of 1mm obtained from two annotators, demonstrating that our method detects the L3 slice efficiently and accurately. Code and dataset will be made available at https://github.com/fk128/ct-slice-detection.

1 Introduction

The analysis of CT slices extracted at the third lumbar vertebra (L3) has garnered significant clinical interest in the past few years, in particular in regards to computing a sarcopenia measure [SWMN16]. Sarcopenia refers to loss of muscle mass and is computed as the total area of the skeletal muscle mass divided by the square of the patient's height. Sarcopenia is particularly relevant in oncology, where severe muscle loss in adult patients is typically found to be associated with poor outcome [TVVR+15, MCB+12, KBMB16]. L3 is taken as a standard landmark by a majority of medical researchers for sarcopenia measurement [SWMN16], as muscle and adipose fat areas at L3 and L4 have been found to be most linearly correlated to their whole-body counterparts [KBMB16].

The main motivation for automating the whole process of computing a sarcopenia measure is to provide it as prognostic information to clinicians in cancer populations alongside the CT images. Extracting the measurement directly from CT images is convenient, as CT is frequently obtained as part of cancer staging and disease assessment. The current workflow for computing a sarcopenia measurement is as follows: the L3 slice is first extracted manually, which involves scrolling through the 3D image slice by slice until the L3 slice is found. Semi-automated segmentation software (e.g. Slice-O-Matic or ImageJ) that involves manual refinement is then used to segment the skeletal muscle and adipose fat tissue. This process takes 5 to 10 minutes per image, and it becomes time-consuming to run on large datasets.

In this paper we solely address the problem of automatic slice detection, as skeletal muscle segmentation has already been addressed in the literature using fully-convolutional networks [LTT+17, LHC+15]. Slice detection in [BCH+17] is formulated as a regression problem, where a VGG architecture with a single-output fully-connected layer is used to predict the slice location. The main advantage of the approach is that it operates on 2D images instead of 3D; it does so by converting the 3D CT images to 2D via Maximal Intensity Projection (MIP). This involves projecting the maximal intensity pixel value along the direction perpendicular to the frontal plane. The MIP image representation still contains enough information for locating the L3 vertebra. This greatly reduces the dimensionality of the problem and allows feeding the CNN images that have more context, as opposed to a 3D volume, where the input size is limited by memory capacity.

As vertebrae are similar in appearance, context is an important feature in discriminating between them. A dense layer with a single output is attached to the last convolutional layer output of a pre-trained VGG network. The weights are then fine-tuned. The method only trains on image crops of fixed size ([100, 512, 3]). However, there are a few drawbacks: (1) in order to detect the L3 slice, a sliding window approach is adopted; this means that convolutions from overlapping image areas are repeatedly recomputed as the window slides over the whole image, which increases prediction time; (2) the method has no probabilistic output, as it only outputs a single number that indicates the predicted slice location; (3) only image crops that contain the L3 vertebra are used for training, while image crops that do not contain it are excluded. With this approach, a special elimination process is used at test time that relies on the assumption that image crops that do not contain the target vertebra will produce random outputs. (2) and (3) could potentially be solved by adding another output to the network that serves as an indicator of whether the L3 slice is present or absent. This would allow training on negative examples and produce a probabilistic output; however, the sliding window approach would still be computationally inefficient. In this paper, we propose an efficient method to detect the L3 slice (or potentially any other slice) based on fully-convolutional networks (FCNN) that output full-resolution confidence maps, either in 2D or 1D. This is motivated by previous works [TGJ+15, PCZ15, PŠBU16, CSWS17] that use FCNNs to predict confidence maps for landmark localisation.

In summary, we make the following contributions: (1) an efficient method to automatically detect the L3 slice (or potentially any other slice) without using a sliding window approach, thanks to a fully-convolutional network formulation for confidence map prediction; (2) a novel variant architecture, based on a simple modification of the UNet [RFB15] architecture, that produces a 1D confidence map output instead of the standard 2D output, which is suited to our particular use case; (3) a large annotated dataset of 1070 images compiled from multiple public sources; and (4) the dataset and code for reproducing our method will be made publicly available online at https://github.com/fk128/ct-slice-detection.

2 Related Work

In computer vision, pose estimation methods such as OpenPose [CSWS17] and others [TGJ+15, PCZ15] rely on FCNNs to regress confidence maps of multiple landmarks, some of which achieve state-of-the-art performance. For medical applications, [PŠBU16] investigated the applicability of several configurations of FCNNs for multiple landmark localisation in hand X-ray images. Notably, a UNet-based architecture [RFB15], defined by the presence of a down-sampling and an up-sampling path linked via skip connections, achieves the best accuracy overall for the localisation of landmarks on 2D hand X-ray images, compared to a network with a down-sampling path only. The up-sampling path helps improve localisation by integrating information from higher-resolution feature maps. On the other hand, a network with only a down-sampling path outputs confidence maps of a lower resolution due to the max-pooling operations, resulting in a loss of localisation precision. To overcome the loss in localisation precision from max-pooling, [TGJ+15] use a 2-stage network, where the 1st stage network performs coarse localisation of a given landmark; the output of this network, along with cropped volumes of the feature maps from the previous layers, is used as input to a 2nd stage network to refine the location of the landmark. This is analogous in principle to the approach adopted with a UNet-like architecture with skip connections, except that cropping is involved. Multi-stage approaches such as [TGJ+15, CSWS17] are especially helpful when the goal is to predict multiple landmarks simultaneously. In effect, intermediate feature and confidence maps from the previous stage help improve localisation accuracy by integrating a larger context and using the intermediate confidence maps to resolve ambiguity between landmarks of similar appearance (e.g. distinguishing between left and right arm joints).

Other works have investigated the detection of multiple vertebrae simultaneously in 3D images. The problem is typically posed as a vertebrae detection [GZK+13, CSQ+15, YXX+17b, YXX+17a] and/or segmentation task [GRCD11, RRA13, KIL+15]. For machine learning-based methods, this requires ground truth annotations or segmentations to be available for all the vertebrae that appear in the images of the training set; a drawback of this is an increase in ground truth annotation time. In [GZK+13], a random classification forest is used to localise vertebra centroids in 3D CT images. The sparse centroid annotations are transformed into dense probabilistic labels for training. Once probabilistic outputs have been predicted, a false positive removal strategy, which takes into account the global shape of the spine, is used to remove spurious outputs and improve the predictions.

[CSQ+15] use a hybrid approach that initially coarsely localises potential vertebrae candidates, then uses a CNN to output the final predictions based on the initial candidates. [SSL+15] extract feature vectors around voxel locations in an image and feed the feature vectors to a deep neural network to predict the presence or absence of a vertebra. [YXX+17a] make use of a 3D CNN for vertebrae localisation. While such approaches are beneficial if the goal is 3D vertebrae localisation, for our intended application of single slice detection, which entails detecting a single vertebra, we find that working with 2D images provides good localisation accuracy and efficiency.

3 Methods

The system takes as input a 3D CT volume. The volume is converted into a 2D image via Maximal Intensity Projection (MIP) and is further post-processed. The 2D MIP image is used as input to the network. Depending on the network, the output is a 1D or 2D confidence map. The location of maximal probability in the confidence map is used as the prediction for the location of L3, allowing the extraction of the transverse slice from the CT volume. In this section we describe each step of the pipeline in detail.
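To make the pipeline concrete, the prediction step could be sketched as follows, assuming a trained Keras model with a 2D confidence-map output; function and argument names are ours, and the mapping back to a slice index depends on the volume's orientation and origin:

```python
import numpy as np

def predict_l3_slice(model, mip_image, mip_spacing_mm, ct_slice_thickness_mm):
    """Locate the L3 slice from a pre-processed frontal MIP image.

    Assumes `model` is a trained L3UNet-2D-like Keras model whose output is a
    confidence map with the same height/width as the input (names are ours).
    """
    # Add the batch and channel dimensions expected by a 2D Keras model.
    x = mip_image[np.newaxis, ..., np.newaxis].astype(np.float32)
    confidence = model.predict(x)[0, ..., 0]  # (height, width) confidence map

    # The row of maximal confidence gives the L3 position along the y-axis of
    # the 1x1 mm MIP image.
    row = np.unravel_index(np.argmax(confidence), confidence.shape)[0]
    y_mm = row * mip_spacing_mm

    # Convert the physical position back into a slice index of the CT volume
    # (simplest case: slices ordered top to bottom, origin at the first slice).
    slice_index = int(round(y_mm / ct_slice_thickness_mm))
    return slice_index, confidence
```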

3.1 Pre-processing

The input 3D CT volumes are first converted into 2D Maximal Intensity Projection (MIP) images along the frontal and sagittal views, similarly to [BCH+17]; however, we compute a restricted MIP for the sagittal view in order to eliminate the outer edges of the pelvis and to obtain a clear view of the sacrum vertebrae, which is an essential reference point for determining the position of L3 if the annotator counts the vertebrae bottom-up. As the spine tends to be situated in the middle of the image in the majority of cases, we only compute the restricted sagittal MIP using the range [-20, 20] from the centre of the image; for the rare cases where the spine is not at the centre of the image, it would be possible to use a more elaborate image processing technique to detect the centreline of the spine and centre it. As CT images tend to have different slice thicknesses, we normalise the pixel size of the resulting MIPs to 1×1mm² to allow consistent input to the algorithm.

Finally, we threshold the images between 100 HU and 1500 HU in order to eliminate the majority of soft tissue at the lower end of HU and minimise the effect of metal implants and artifacts above 1500 HU. The images are then mapped to 8-bit ([−127, 127]). Figure 1 shows an example of MIP images obtained from a 3D CT volume.
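A minimal sketch of this pre-processing, assuming the CT volume is a NumPy array in Hounsfield units ordered (slices, rows, columns) with rows along the anterior-posterior direction; function and argument names are ours:

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_ct(volume_hu, slice_thickness_mm, pixel_spacing_mm, half_width_mm=20):
    """Build frontal and restricted sagittal MIPs, normalised to 1x1 mm pixels."""
    # Maximal intensity projection along the anterior-posterior axis (frontal view).
    frontal = volume_hu.max(axis=1)

    # Restricted sagittal MIP: only project a +/- 20 mm band around the image
    # centre along the left-right axis, to keep the sacrum clearly visible.
    centre = volume_hu.shape[2] // 2
    half_width = int(round(half_width_mm / pixel_spacing_mm))
    band = volume_hu[:, :, max(0, centre - half_width):centre + half_width]
    sagittal = band.max(axis=2)

    def to_8bit_mip(mip, row_spacing, col_spacing):
        # Resample so that one pixel corresponds to 1x1 mm.
        mip = zoom(mip, (row_spacing, col_spacing), order=1)
        # Keep bone, suppress most soft tissue and metal artefacts, map to 8 bit.
        mip = np.clip(mip, 100, 1500)
        return ((mip - 100) / (1500 - 100) * 254 - 127).astype(np.int8)

    return (to_8bit_mip(frontal, slice_thickness_mm, pixel_spacing_mm),
            to_8bit_mip(sagittal, slice_thickness_mm, pixel_spacing_mm))
```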

3.2 Model Architecture

We investigate the use of a FCNN based on the UNet architecture [RFB15] with a 2D confidence map output, and we propose a variant architecture that allows for a 1D confidence map output.

2D output. The FCNN is based on a UNet-like architecture and consists of multiple down-sampling and up-sampling blocks, with the latter mirroring the former. Each block consists of 1 or 2 convolutional units, where each unit is a sequence of 3x3 convolution, batch normalisation, and Leaky ReLU (α = 0.05) activation. Each block in the up-sampling path ends with an additional 1x1 convolution with the same number of channels as the corresponding block. 2x2 max-pooling is performed at the end of each block in the down-sampling path, with a 4x4 max-pooling performed at the last layer of the down-sampling path. Skip connections, which consist of the concatenation of the output of the up-sampling blocks with the outputs from the down-sampling blocks at the same level, are used to link the down-sampling path with the up-sampling path. In the up-sampling path, we use spatial dropout at each block with p = 0.25. At the output of the network we attach a 1x1 convolutional layer with a sigmoid activation function and a 1-channel output for the confidence map prediction. The resulting 2D confidence map prediction has the same dimensions as the input image. We refer to this network as L3UNet-2D. Figure 2a illustrates the architecture. The network has 8,493,537 parameters.
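A compact Keras sketch in the spirit of L3UNet-2D; the channel widths and the number of blocks are not fully specified in the text, so the values below are placeholders and the parameter count will not match the published 8,493,537:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_unit(x, channels):
    # convolution -> batch normalisation -> Leaky ReLU, as in each conv unit.
    x = layers.Conv2D(channels, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(alpha=0.05)(x)

def l3unet_2d(channels=(32, 64, 128, 256)):
    inputs = layers.Input(shape=(None, None, 1))
    x, skips = inputs, []

    # Down-sampling path: conv units followed by max-pooling (4x4 at the last level).
    for i, c in enumerate(channels):
        x = conv_unit(conv_unit(x, c), c)
        skips.append(x)
        pool = 4 if i == len(channels) - 1 else 2
        x = layers.MaxPooling2D(pool)(x)
    x = conv_unit(x, channels[-1])  # bottleneck

    # Up-sampling path mirrors the down-sampling path, with skip connections,
    # spatial dropout and an extra 1x1 convolution per block.
    for i, c in reversed(list(enumerate(channels))):
        pool = 4 if i == len(channels) - 1 else 2
        x = layers.UpSampling2D(pool)(x)
        x = layers.Concatenate()([x, skips[i]])
        x = conv_unit(conv_unit(x, c), c)
        x = layers.SpatialDropout2D(0.25)(x)
        x = layers.Conv2D(c, 1, padding="same")(x)

    # 1-channel sigmoid output: a 2D confidence map at the input resolution.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)
```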

Figure 1: Maximal intensity projection (MIP) images. (a) A stack of CT slices. (b) Frontal MIP image obtained by performing a maximal intensity projection along the direction perpendicular to the frontal plane (blue in (a)). (c) Sagittal MIP image obtained by performing a maximal intensity projection along the direction perpendicular to the sagittal plane (green in (a)). (d) Restricted sagittal MIP image computed only on the restricted range (in red) of the CT in (a).

1D output. The FCNN is similarly based on the UNet architecture; the down-sampling path is the same as that of L3UNet-2D. The main difference is that we apply global horizontal max-pooling along the up-sampling path. The resulting up-sampling path consists of 1D convolutions, and we employ dropout with p = 0.25. The output is 1D and has the same size as the height of the image. We refer to this network as L3UNet-1D. Figure 2b illustrates the architecture of the network. The network has 6,189,025 parameters.
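The key change relative to L3UNet-2D is collapsing the width axis so that the decoder operates on 1D feature maps. One way to express the global horizontal max-pooling and the 1D conv units in Keras (a sketch, not the exact published layers):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Global horizontal max-pooling: reduce a (batch, height, width, channels)
# feature map to (batch, height, channels) by taking the maximum over width.
global_h_max_pool = layers.Lambda(lambda t: tf.reduce_max(t, axis=2))

# Skip connections from the 2D encoder would be collapsed the same way before
# concatenation, and Conv1D/UpSampling1D replace their 2D counterparts.
def conv_unit_1d(x, channels):
    x = layers.Conv1D(channels, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(alpha=0.05)(x)
```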

3.3 Augmentation

Augmentation is a straightforward approach that typically helps improve generalisation performance. Image transformations are applied to the input images to create artificial variants. We use a set of image transformations such as: horizontal flipping, scaling [0.8, 1.2], intensity offsets [-70, 70], piece-wise affine deformation, region drop-outs and over-exposures (to simulate occlusions), and vertical image sub-sampling (to simulate different slice thicknesses). In MIP images, occlusions can show up due to the presence of metal implants, bowel content, or contrast agents; region drop-outs and over-exposures can help make the algorithm less susceptible to such occlusions. To simulate images with a variety of slice thicknesses (up to 7mm), an image is down-sampled along the vertical axis and then up-sampled back to its original size using linear interpolation.
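A sketch of an imgaug pipeline covering most of the listed transformations; the magnitudes of the piece-wise affine deformation and region drop-outs are not given in the text, so the values below are placeholders, and the slice-thickness simulation is written as a separate helper:

```python
import numpy as np
import imgaug.augmenters as iaa

augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                          # horizontal flipping
    iaa.Affine(scale=(0.8, 1.2)),             # scaling
    iaa.Add((-70, 70)),                       # intensity offsets
    iaa.PiecewiseAffine(scale=(0.01, 0.03)),  # piece-wise affine deformation
    iaa.Sometimes(0.3, iaa.CoarseDropout(0.05, size_percent=0.1)),  # region drop-outs
])

def simulate_slice_thickness(mip, max_thickness_mm=7):
    """Down-sample then up-sample along the vertical axis to mimic thicker slices."""
    thickness = np.random.uniform(1, max_thickness_mm)
    coarse_height = max(1, int(round(mip.shape[0] / thickness)))
    down = iaa.Resize({"height": coarse_height, "width": mip.shape[1]},
                      interpolation="linear")
    up = iaa.Resize({"height": mip.shape[0], "width": mip.shape[1]},
                    interpolation="linear")
    return up.augment_image(down.augment_image(mip))
```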

Figure 2: Network architectures for (a) L3UNet-2D and (b) L3UNet-1D. Each conv block consists of a repetition of convolutional units. A conv unit consists of a sequence of convolution, batch normalisation, and ReLU activation. The figure indicates each block's parameters as follows: Conv<dim> block x<number of units> - <kernel size> <number of layers>. Arrows indicate skip connections. Global h. max-pool refers to global horizontal max-pooling.

3.4 Localisation

We convert point-wise annotations into confidence maps H_i for each MIP image i. The only available annotation is the position of the slice along the y-axis. In the frontal MIP images, the spine is expected to be located within a fixed range along the x-axis in most cases, and we make this assumption when we generate the ground truth 2D confidence maps for the frontal MIP images. Given y_i, the ground truth coordinate of the L3 slice for image i along the y-axis, the value of the 2D confidence map at any coordinates (x, y) is defined as

H_i(x, y) = A \times (f_i \ast g_\sigma)(x, y),    (1)

where f_i is the step function

f_i(x, y) = \begin{cases} 1 & x_0 - v \le x \le x_0 + v, \; y = y_i \\ 0 & \text{otherwise,} \end{cases}    (2)

g_\sigma is a Gaussian filter function, x_0 is the x coordinate of the image centre, v is an offset, and A is the max norm ||H_i(x, y)||_\infty. Figure 3 shows examples of the generated 2D confidence maps overlaid on top of the images. In the 1D case, f_i is reduced to an indicator function

f_i(y) = \begin{cases} 1 & y = y_i \\ 0 & \text{otherwise.} \end{cases}    (3)
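A minimal sketch of how a ground-truth 2D map following Eqs. (1)-(2) can be generated with SciPy; the function name is ours, and dividing by the maximum is our reading of the normalisation constant A:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_confidence_map_2d(height, width, y_i, sigma=3.0, v=50):
    """Ground-truth 2D confidence map for an L3 annotation at row y_i."""
    x0 = width // 2  # the spine is assumed to lie close to the image centre
    f = np.zeros((height, width), dtype=np.float32)
    f[y_i, max(0, x0 - v):min(width, x0 + v + 1)] = 1.0  # step function of Eq. (2)

    h = gaussian_filter(f, sigma)  # f_i * g_sigma
    return h / h.max()             # scale so the maximum value is 1
```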

Figure 3: Examples of different images with the ground-truth confidence maps (σ = 3mm, v = 50) overlaid on top of the images. As the only available annotation is the y coordinate, we generate an asymmetric Gaussian map centred in the middle of the image to account for offsets of the spine in the x coordinate. The dataset comprises images with a variety of fields of view, slice thicknesses, and artefacts.

For the objective function, we use the L2 loss between the predicted confidence map P and the ground-truth map H, averaged over a training batch:

L = \frac{1}{N} \sum_{i=1}^{N} \sum_{x} || P_i(x) - H_i(x) ||^2,    (4)

where N is the batch size.
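Eq. (4) maps directly onto a custom Keras loss; a minimal sketch for a 2D output of shape (batch, height, width, 1):

```python
import tensorflow as tf

def confidence_map_l2_loss(y_true, y_pred):
    # Squared differences summed over each confidence map (the inner sums of Eq. 4)...
    per_image = tf.reduce_sum(tf.square(y_pred - y_true), axis=[1, 2, 3])
    # ...then averaged over the training batch.
    return tf.reduce_mean(per_image)
```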

4 Experiments and Results

We implement our architecture using Keras [C+15] with the TensorFlow backend. Data augmentation is performed on the fly by the data input generator using the imgaug library¹. We compare our proposed method with the sliding window regression approach described in [BCH+17], as well as with a modified version that produces two outputs: the y coordinate of the slice, and a Boolean that indicates the presence or absence of the slice (to allow training on images that do not contain a view of the L3 vertebra).

4.1 Dataset

We collected a diverse dataset consisting of 1070 CT images from multiple publicly available datasets. Three sets were obtained from the Cancer Imaging Archive (TCIA)²: head and neck³, ovarian⁴, and colon; a liver tumour dataset was obtained from the LiTS segmentation challenge⁵; and an ovarian cancer dataset was obtained from Hammersmith Hospital (HH), London. Figure 4 shows the distribution of slice thicknesses and image heights in the combined dataset.

All 1070 3D CT images were pre-processed, with each 3D image resulting in a set consisting of a frontal image and a restricted sagittal image. The annotations were carried out on images normalised to 1x1mm. The MIP images were annotated by 2 annotators: a radiologist with 7 years of experience and an annotator with 5 years of experience working with CT images. For each image set, the annotator was presented with the frontal and restricted sagittal MIPs side by side, and the annotator clicked on the location of the L3 slice. The main landmark was chosen as the middle of the pedicle, lining up with the top edge of the transverse process.

¹ https://github.com/aleju/imgaug
² http://www.cancerimagingarchive.net/
³ http://doi.org/10.7937/K9/TCIA.2017.umz8dv6s
⁴ http://dx.doi.org/10.7937/K9/TCIA.2016.NDO1MDFQ
⁵ https://competitions.codalab.org/competitions/17094

Figure 4: Distribution of slice thicknesses (a) and image heights (b) in the dataset.

                                                  mean    std   median     max
Error between A and B (mm)                        1.90   1.76     1.00    9.00
Error between A and B (slice)                     1.94   2.36     0.80   11.43
Error between A/B and mean of A and B (mm)        1.14   0.97     1.00    5.00
Error between A/B and mean of A and B (slice)     0.97   1.18     0.40    5.71

Table 1: Error, computed as absolute difference, between annotators A and B. Errors are reported in mm and in number of slices. As the mean slice location from both annotators is later taken as ground truth for training, we also report the error between any given annotator and this ground truth.

Only the position along the y-axis was recorded. It took about 2-4 seconds to manually annotate a single image. 9 images had disagreements between the annotators, and 57 out of 1070 were ambiguous, as there was uncertainty in assigning the location of the L3 vertebra. Further inspection by a senior radiologist with 18 years of experience as a consultant revealed that the majority of ambiguous cases consisted of patients with congenital vertebral anomalies, the principal anomaly being a transitional vertebra. Transitional vertebrae are ones that exhibit ambiguous characteristics; they are relatively common in the population (15-35%) [CCJL+11, UUC+13], and they occur at the junction between spinal segments with various degrees of apparent transition: the atlanto-occipital junction, the cervicothoracic junction (with a cervical rib from C7), the thoracolumbar junction (with a lumbar rib at L1 or a 13th rib from T13), and the lumbosacral junction (commonly referred to as a lumbosacral transitional vertebra (LSTV)). Inaccurate identification of the correct level due to LSTV has led to procedures being carried out at the wrong vertebral level [KW10]. Correct identification of L3 in ambiguous cases can only be resolved if the image contains a view of the whole spine [Bro07, CCJL+11]. Results reported in Table 1 correspond to the errors where annotations from both annotators fell within the same vertebra for each image. The error in slices was computed by dividing the error in mm of a given image by the slice thickness, without rounding. Transitional vertebra cases (57) were excluded, leaving 1006 images for the training process; however, we still evaluated the detection algorithm on the transitional cases to verify the output, as such cases are expected to be encountered in real-world scenarios. The average (rounded down) L3 slice location from the 2 annotators is used as ground truth for training.

4.2 Results

For L3UNet-1D and 2D, we trained using image crops of size [256, 384], with crops randomly sampled along the y-axis and centred along the x-axis. We used a batch size of 5 for L3UNet-2D and 8 for L3UNet-1D. We set σ = 1.5 for generating the confidence maps. During training we found it helpful to start with a larger σ = 10 and linearly reduce it to 1.5 as training progresses. The networks were trained for 50 epochs (sufficient for no further improvement to be observed on a small validation subset of the training set) using the Adam optimiser with a learning rate of 1e-3. For L3UNet-2D we only used the frontal MIP image as input. During testing, the whole MIP image was provided as input to the network. The images were padded as necessary to ensure that the height and width were divisible by the amount of max-pooling in the network.
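Reusing the sketches above, the training configuration can be wired up roughly as follows; the crop sampling and σ schedule are simplified and the helper names are ours:

```python
import numpy as np
import tensorflow as tf

model = l3unet_2d()  # as sketched in Section 3.2
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss=confidence_map_l2_loss)

def random_crop(mip, y_l3, crop_h=256, crop_w=384):
    """Random crop along y, centred along x; returns the crop and the shifted target row."""
    top = np.random.randint(0, max(1, mip.shape[0] - crop_h))
    left = max(0, mip.shape[1] // 2 - crop_w // 2)
    crop = mip[top:top + crop_h, left:left + crop_w]
    # The target may fall outside the crop, in which case the map is all zeros.
    return crop, y_l3 - top

def sigma_schedule(epoch, total_epochs=50, start=10.0, end=1.5):
    """Linearly decrease the Gaussian width of the target maps during training."""
    return start + (end - start) * epoch / (total_epochs - 1)
```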

                                                      mean    std   median     max   >10
L3UNet-2D - frontal                 error (mm)        2.09   4.56     1.00   51.00    19
                                    error (slice)     1.43   3.52     0.80   51.00    12
L3UNet-1D - frontal                 error (mm)        2.12   4.56     1.00   38.00    22
                                    error (slice)     1.53   4.22     0.67   45.71    15
L3UNet-1D - sagittal                error (mm)        1.99   5.41     1.00   52.00    28
                                    error (slice)     1.41   5.02     0.50   65.00    23
VGG16 Regression [BCH+17] - frontal error (mm)       13.78   8.57    12.00   48.00   591
                                    error (slice)    10.26   9.92     6.12   60.00   360
VGG16 Regression with dual output - frontal
                                    error (mm)        6.94   5.90     6.00   62.00   191
                                    error (slice)     5.54   6.29     3.20   40.00   180

Table 2: 3-fold cross-validation results using our method compared to the method from [BCH+17] on our dataset. The error in slices is computed as the error in mm divided by the slice thickness for a given image, without rounding. We also report the number of outlier images that have an error greater than 10.

Figure 5: Examples of prediction output for (a), (b) L3UNet-2D frontal, (c) L3UNet-1D frontal, and (d) L3UNet-1D sagittal. The predicted confidence map is overlaid on the images. For the 1D case, the confidence map is stretched out along the x-axis for visualisation purposes. The red line corresponds to the prediction, while the green line is the ground truth.

For the sliding window VGG16 regression, we attempted to reproduce the methodology described in [BCH+17] using their suggested parameters; however, not all parameters needed for training were reported. To fine-tune the network, we used a smaller learning rate of 1e-5 to avoid quickly destroying the learned weights of VGG16, and we used a batch size of 12. We used crop windows of size [100, 512], as recommended, sampled from frontal MIP image regions that contain the L3 vertebra. In addition, we trained a modified version of the VGG16 regression with a dual output: a y coordinate and a Boolean that indicates the presence or absence of the L3 vertebra. This allowed training on positive and negative image crops. We report the results of 3-fold cross validation in Table 2.
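A sketch of how the dual-output baseline can be set up on a pre-trained VGG16 trunk (our reproduction, not the original authors' code; in practice the coordinate loss would also need to be masked for crops where L3 is absent):

```python
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(100, 512, 3))
x = layers.Flatten()(base.output)

y_coord = layers.Dense(1, name="y_coord")(x)                          # regressed slice location
presence = layers.Dense(1, activation="sigmoid", name="presence")(x)  # is L3 in this crop?

model = tf.keras.Model(base.input, [y_coord, presence])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss={"y_coord": "mean_squared_error",
                    "presence": "binary_crossentropy"})
```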

Training and testing were carried out on a workstation with 2 Nvidia TitanX GPUs. Once the MIP images were obtained, the average testing times for an image with an average height of 440mm were: L3UNet-1D 0.06s, L3UNet-2D 0.17s, and sliding window VGG16 2.98s.

Finally, we applied our trained L3UNets to the excluded set of transitional vertebra cases. Outputs in all cases consisted of one of the two adjacent true candidate vertebrae or, in some cases, both. Figure 6 shows a sample output on a transitional vertebra case.

4.3 Discussion

Results show that FCNNs with confidence map outputs achieve state-of-the-art L3 localisation performance, with a median error using our proposed L3UNet models of 1mm, similar to the human annotator error. We note that our proposed approach of 1D confidence map outputs achieves slightly better results than the 2D confidence map output, despite the global max-pooling applied along the up-sampling path.

Figure 6: Output on a transitional vertebra case by L3UNet-2D. The confidence map indicates two potential locations for L3.

As the only available annotation is the slice location along the y-axis, it is straightforward to apply L3UNet-1D to either the frontal or sagittal MIP images. However, with the 2D output, we only make use of the frontal MIP image, as we make the assumption that the spine is located in the centre of the image. In the sagittal MIP images, the spine is not always located at the same position. Although it would have been possible to detect the spine with some image processing, we have not done so; therefore, we only use the frontal image as input for L3UNet-2D.

There remains a small subset of images where the prediction is off, at most by one vertebra (outliers are images with an error of more than 10mm). An inspection of the outlier images revealed that roughly half of them have an apparent reason that might explain why the network made an incorrect prediction. Figure 7 shows such examples of outliers. In addition, a few outliers had probabilities of less than 50%, so it would potentially be possible to detect them. Nonetheless, the other half of the outliers had confident, incorrect prediction outputs with no apparent reason. It is possible, though, that despite the best efforts of the annotators some transitional vertebrae cases slipped past them into the training set, given that such cases have a prevalence of 15-35% in the general population, while only 57 out of 1070 images (about 5%) were identified as such in our dataset. With transitional vertebrae cases, it is not possible to determine the correct L3 level without a full view of the spine, which would allow counting from the cervical segment. We tested our network on the excluded set of transitional vertebrae images; the output confidence maps always resulted in predictions for one or the other potential L3 candidate, and occasionally both. In a practical application where the goal is performing a sarcopenia measurement, it would be worth investigating the amount of tolerable offset from L3 that would result in a different prognostic output; this is of course also subject to accurate segmentation of the muscle area.

The results that we obtained using the sliding window approach of VGG16 regression are not within the same range as those reported in [BCH+17], despite attempts to improve the output. This could be due to three things: (1) our dataset is more complex than the one used in [BCH+17] and contains finer slice thicknesses (minimum of 0.8mm vs 2mm); however, their dataset is not publicly available, so there is no means to verify this. (2) Based on previously reported results in the literature ([PŠBU16, TGJ+15]) on the use of networks with one down-sampling path, there is always a reduction in localisation accuracy due to max-pooling; this is the case with VGG16, so the obtained results are somewhat to be expected. In addition, [BCH+17] report that vertical max-pooling distorts the target position when using a custom network trained from scratch, but this does not seem to be reflected in the results reported for VGG16. (3) An implementation error of their slice detection method on our part, despite our best efforts at reproduction. Nonetheless, disregarding the accuracy results, one clear advantage of our proposed method is the inference speed, where a 50x speed-up can be obtained using L3UNet-1D. Using either L3UNet-1D or 2D, a prediction can be made in 0.06s or 0.17s, respectively, for an image of average height 440mm.

Figure 7: Examples of incorrect identification of outlier cases. (a) Incorrect identification of the L3 level by L3UNet-2D, potentially due to the low contrast of the spine, rendering the transverse process almost invisible. The output confidence map is low. (b) Incorrect identification of the L3 level by L3UNet-1D with sagittal image input. The spine is curved, resulting in an apparent merging of L5 with the sacrum.

5 Conclusion

We have proposed an efficient method for the automatic detection of the L3 slice within a 3D CT volume, with the principal goal of using our method as part of an automatic sarcopenia measurement tool. The method is based on a fully-convolutional UNet-like architecture with no restriction on the image input size. A 3D CT image is first converted into a 2D MIP image, reducing the dimensionality of the problem. MIP images and associated confidence maps are then used to train a FCNN. We have in addition proposed a novel architecture variant, consisting of a simple modification of the UNet architecture, that is suited to our particular use case of predicting a confidence map along a single axis. Our method achieves state-of-the-art results with a median error (in mm) that is comparable to the inter-annotator median error. Future work could look at the potential of using our method for multiple vertebrae detection, instead of a single vertebra, and comparing that with 3D methods. 3D coordinates of a vertebra could be obtained using the frontal and sagittal MIP images and the transverse slices: the algorithm could first detect the coordinates of the vertebrae along the sagittal and frontal planes, and then along the transverse planes.

Acknowledgements

This work was supported by a United Kingdom NIHR Biomedical Research Centre award to Imperial College London. We acknowledge programmatic support from Imperial College Experimental Cancer Medicines Centres and the United Kingdom Medical Research Council (MR/N020782/1).

References

[BCH+17] Soufiane Belharbi, Clément Chatelain, Romain Hérault, Sébastien Adam, Sébastien Thureau, Mathieu Chastan, and Romain Modzelewski. Spotting L3 slice in CT scans using deep convolutional network and transfer learning. Computers in Biology and Medicine, 87:95–103, 2017.

[Bro07] Johannes L Bron. The clinical significance of lumbosacral transitional anomalies. Acta Orthopaedica Belgica, 73(6):687, 2007.

[C+15] François Chollet et al. Keras. https://keras.io, 2015.

[CCJL+11] John A Carrino, Paul D Campbell Jr, Dennis C Lin, William B Morrison, Mark E Schweitzer, Adam E Flanders, John Eng, and Alexander R Vaccaro. Effect of spinal segment variants on numbering vertebral levels at lumbar MR imaging. Radiology, 259(1):196–202, 2011.

[CSQ+15] Hao Chen, Chiyao Shen, Jing Qin, Dong Ni, Lin Shi, Jack CY Cheng, and Pheng-Ann Heng. Automatic localization and identification of vertebrae in spine CT via a joint learning model with deep neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 515–522. Springer, 2015.

[CSWS17] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.

[GRCD11] Subarna Ghosh, Raja S Alomari, Vipin Chaudhary, and Gurmeet Dhillon. Automatic lumbar vertebra segmentation from clinical CT for wedge compression fracture diagnosis. In Medical Imaging 2011: Computer-Aided Diagnosis, volume 7963, page 796303. International Society for Optics and Photonics, 2011.

[GZK+13] Ben Glocker, Darko Zikic, Ender Konukoglu, David R Haynor, and Antonio Criminisi. Vertebrae localization in pathological spine CT via dense classification from sparse annotations. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 262–270. Springer, 2013.

[KBMB16] Seyyed Mohammad Reza Kazemi-Bajestani, Vera C Mazurak, and Vickie Baracos. Computed tomography-defined muscle and fat wasting are associated with cancer clinical outcomes. In Seminars in Cell & Developmental Biology, volume 54, pages 2–10. Elsevier, 2016.

[KIL+15] Robert Korez, Bulat Ibragimov, Boštjan Likar, Franjo Pernuš, and Tomaž Vrtovec. A framework for automated spine and vertebrae interpolation-based detection and model-based segmentation. IEEE Transactions on Medical Imaging, 34(8):1649–1662, 2015.

[KW10] GP Konin and DM Walz. Lumbosacral transitional vertebrae: classification, imaging findings, and clinical relevance. American Journal of Neuroradiology, 31(10):1778–1786, 2010.

[LHC+15] Julien Lerouge, Romain Herault, Clément Chatelain, Fabrice Jardin, and Romain Modzelewski. IODA: an input/output deep architecture for image labeling. Pattern Recognition, 48(9):2847–2858, 2015.

[LTT+17] Hyunkwang Lee, Fabian M Troschel, Shahein Tajmir, Georg Fuchs, Julia Mario, Florian J Fintelmann, and Synho Do. Pixel-level deep segmentation: artificial intelligence quantifies muscle on computed tomography for body morphometric analysis. Journal of Digital Imaging, 30(4):487–498, 2017.

[MCB+12] Olivier Mir, Romain Coriat, Benoit Blanchet, Jean-Philippe Durand, Pascaline Boudou-Rouquette, Judith Michels, Stanislas Ropert, Michel Vidal, Stanislas Pol, Stanislas Chaussade, et al. Sarcopenia predicts early dose-limiting toxicities and pharmacokinetics of sorafenib in patients with hepatocellular carcinoma. PLoS ONE, 7(5):e37563, 2012.

[PCZ15] Tomas Pfister, James Charles, and Andrew Zisserman. Flowing convnets for human pose estimation in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 1913–1921, 2015.

[PŠBU16] Christian Payer, Darko Štern, Horst Bischof, and Martin Urschler. Regressing heatmaps for multiple landmark localization using CNNs. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 230–238. Springer, 2016.

[RFB15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[RRA13] Abtin Rasoulian, Robert Rohling, and Purang Abolmaesumi. Lumbar spine segmentation using a statistical multi-vertebrae anatomical shape+pose model. IEEE Transactions on Medical Imaging, 32(10):1890–1900, 2013.

[SSL+15] Amin Suzani, Alexander Seitel, Yuan Liu, Sidney Fels, Robert N Rohling, and Purang Abolmaesumi. Fast automatic vertebrae detection and localization in pathological CT scans - a deep learning approach. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 678–686. Springer, 2015.

[SWMN16] Shlomit Strulov Shachar, Grant R Williams, Hyman B Muss, and Tomohiro F Nishijima. Prognostic value of sarcopenia in adults with solid tumours: a meta-analysis and systematic review. European Journal of Cancer, 57:58–67, 2016.

[TGJ+15] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015.

[TVVR+15] Juul JW Tegels, Jeroen LA Van Vugt, Kostan W Reisinger, Karel WE Hulsewé, Anton GM Hoofwijk, Joep PM Derikx, and Jan HMB Stoot. Sarcopenia is highly prevalent in patients undergoing surgery for gastric cancer but not associated with worse outcomes. Journal of Surgical Oncology, 112(4):403–407, 2015.

[UUC+13] Demet Uçar, Bekir Yavuz Uçar, Yahya Coşar, Kurtuluş Emrem, Gürkan Gümüşsuyu, Serhat Mutlu, Burcu Mutlu, Mehmet Akif Çaçan, Yılmaz Mertsoy, and Hatice Gümüş. Retrospective cohort study of the prevalence of lumbosacral transitional vertebra in a wide and well-represented population. Arthritis, 2013, 2013.

[YXX+17a] Dong Yang, Tao Xiong, Daguang Xu, Qiangui Huang, David Liu, S Kevin Zhou, Zhoubing Xu, JinHyeong Park, Mingqing Chen, Trac D Tran, et al. Automatic vertebra labeling in large-scale 3D CT using deep image-to-image network with message passing and sparsity regularization. In International Conference on Information Processing in Medical Imaging, pages 633–644. Springer, 2017.

[YXX+17b] Dong Yang, Tao Xiong, Daguang Xu, S Kevin Zhou, Zhoubing Xu, Mingqing Chen, JinHyeong Park, Sasa Grbic, Trac D Tran, Sang Peter Chin, et al. Deep image-to-image recurrent network with shape basis learning for automatic vertebra labeling in large-scale 3D CT volumes. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 498–506. Springer, 2017.
