A Comprehensive and Publicly Available Head CT Segmentation Tool and Atlas

Jason C. Cai, MD, Mayo Clinic Rochester; Kenneth A. Philbrick, PhD; Zeynettin Akkus, PhD; Arunnit Boonrod, MD; Safa Hoodeshenas, MD; Bradley J. Erickson, MD, PhD, CIIP

Background/Problem Being Solved

CT is a cornerstone of neuroimaging, and its use has increased steadily [1]. Given the large volume of examinations, fully automated algorithms can augment clinical workflow and improve diagnostic accuracy. Several studies have demonstrated the clinical utility of CT structural imaging biomarkers in predicting neurological disease and patient outcomes [2–8]. However, the inability to rapidly and accurately segment neuroanatomy has impeded the routine clinical use of such biomarkers, as well as the discovery of new associations and normative population metrics for these biomarkers. To address this need, we developed a deep learning model that segments 16 intracranial structures from non-contrast CT.

Interventions

The primary dataset contained 62 normal non-contrast head CT examinations. One observer annotated 16 structures on axial images. The dataset was split into 40 examinations for training, 10 for validation, and 12 for testing. From each test volume, three observers additionally annotated the same five slices representing all 16 structures. From these annotations, a set of multi-rater consensus labels (termed Ground Truth Masks [GTM]) was created using STAPLE [9]. We additionally curated two secondary test datasets to assess model generalizability: the first contained 12 volumes demonstrating idiopathic normal pressure hydrocephalus (iNPH); the second contained 30 normal volumes from the RSNA 2019 Hemorrhage Detection Challenge [10]. One observer annotated all slices from each iNPH scan and five slices from each RSNA scan (capturing all 16 structures). We used RIL-Contour to annotate the datasets [11]. We modified a 2D U-Net [12] to perform segmentation. To address class imbalance, we weighted the cross-entropy loss based on class prevalence. The optimal weighting scheme was selected using the validation dataset, and the final model was evaluated on the primary and secondary test datasets. Additionally, all observers and the model were evaluated against the GTMs. For statistical analysis, we used categorical linear regression with a significance threshold of p < 0.05.

Outcome

The overall Dice coefficient on the primary test dataset was 0.83 (range: 0.74-0.94). Between the iNPH and primary test datasets, the model performed equally well or better in 13/16 structures (both datasets contained 12 fully annotated volumes). Between the RSNA dataset and the GTMs, the model performed equally well in 15/16 structures (both contained five annotated slices per volume). Using the GTMs as a reference, the model performed as well as the observers in 25/48 comparisons (3 observers, 16 structures per observer). The mean difference in Dice coefficient between the model and the average observer was 0.05.
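The Ground Truth Masks described above were produced with STAPLE [9], which estimates each rater's performance level and weights votes accordingly; SimpleITK provides an implementation of that algorithm. As a rough illustration of the idea of multi-rater consensus labels only, the sketch below takes a per-pixel majority vote across the three observers' label maps. The function name and the 17-class label map (background plus 16 structures) are assumptions, and majority voting is a simplified stand-in for STAPLE, not the authors' method.

    import numpy as np

    def majority_vote(label_maps, num_classes=17, undecided_label=0):
        """Per-pixel majority vote across raters (simplified stand-in for STAPLE).

        label_maps: list of integer arrays of identical shape, one per observer.
        Pixels without a strict majority fall back to `undecided_label`.
        """
        stacked = np.stack(label_maps)                       # (raters, H, W)
        votes = np.zeros((num_classes,) + stacked.shape[1:], dtype=np.int32)
        for k in range(num_classes):
            votes[k] = (stacked == k).sum(axis=0)            # votes per class
        consensus = votes.argmax(axis=0).astype(np.int32)    # most-voted label
        strict = votes.max(axis=0) > stacked.shape[0] // 2   # e.g. >=2 of 3 raters
        consensus[~strict] = undecided_label
        return consensus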

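The segmentation model is a modified 2D U-Net [12]. The Keras sketch below shows only the generic encoder-decoder pattern with skip connections; the depth, channel widths, input size, and 17-class softmax output are illustrative assumptions rather than the authors' modifications.

    import tensorflow as tf
    from tensorflow.keras import layers

    def conv_block(x, filters):
        # Two 3x3 convolutions, as in the original U-Net stages.
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        return x

    def build_unet(input_shape=(512, 512, 1), num_classes=17, base_filters=16):
        inputs = layers.Input(shape=input_shape)

        # Encoder: convolutions followed by max pooling, keeping skip connections.
        skips, x = [], inputs
        for depth in range(4):
            x = conv_block(x, base_filters * 2 ** depth)
            skips.append(x)
            x = layers.MaxPooling2D(2)(x)

        x = conv_block(x, base_filters * 16)  # bottleneck

        # Decoder: upsample, concatenate the matching skip, convolve.
        for depth in reversed(range(4)):
            x = layers.Conv2DTranspose(base_filters * 2 ** depth, 2, strides=2,
                                       padding="same")(x)
            x = layers.Concatenate()([x, skips[depth]])
            x = conv_block(x, base_filters * 2 ** depth)

        outputs = layers.Conv2D(num_classes, 1, activation="softmax")(x)
        return tf.keras.Model(inputs, outputs)

With a per-pixel softmax over 17 classes, this kind of network pairs naturally with the prevalence-weighted loss sketched next.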

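The abstract notes that the cross-entropy loss was weighted by class prevalence and that the weighting scheme was tuned on the validation set, without stating the scheme itself. One common candidate, sketched here under that assumption, is inverse-frequency weighting applied per pixel (the helper names are hypothetical).

    import numpy as np
    import tensorflow as tf

    def prevalence_weights(label_volumes, num_classes=17, smoothing=1e-6):
        # Per-class weights inversely proportional to voxel prevalence
        # (background plus 16 structures -> 17 classes is an assumption).
        counts = np.zeros(num_classes, dtype=np.float64)
        for vol in label_volumes:
            counts += np.bincount(vol.ravel(), minlength=num_classes)
        freq = counts / counts.sum()
        weights = 1.0 / (freq + smoothing)
        return weights / weights.mean()   # normalize so the average weight is 1

    def weighted_sparse_ce(class_weights):
        w = tf.constant(class_weights, dtype=tf.float32)
        def loss(y_true, y_pred):
            # y_true: (batch, H, W) integer labels; y_pred: (batch, H, W, classes) softmax.
            ce = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
            pixel_w = tf.gather(w, tf.cast(y_true, tf.int32))
            return tf.reduce_mean(ce * pixel_w)
        return loss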

Figure 1: Sample images from the primary and secondary (iNPH) test datasets.

Figure 2: Box-and-whisker plot comparing the model's predictions with the observers' annotations, using the Ground Truth Masks as a reference. Red asterisk: the observer had higher Dice coefficients than the model (p < 0.05). Green asterisk: the observer had lower Dice coefficients than the model (p < 0.05). Blue asterisk: the observer had higher Dice coefficients than the model (p < 0.05); however, the Ground Truth Masks contain a large number of slices that include the boundary between the temporal and parietal lobes, which is defined posteriorly by an arbitrary straight line in the sagittal plane. The overall Dice coefficients for the parietal and temporal lobes were higher when the model was evaluated on full volumes from the primary test dataset (see Figure 3).
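The observer-versus-model comparisons in Figure 2, like the Outcome results above, rest on per-structure Dice coefficients. A minimal illustration of how such scores can be computed from two integer label maps follows (the structure indices 1-16 are assumed):

    import numpy as np

    def dice_per_structure(pred, truth, labels=range(1, 17)):
        """Dice coefficient for each labeled structure in two integer label maps."""
        scores = {}
        for k in labels:
            p, t = pred == k, truth == k
            denom = p.sum() + t.sum()
            scores[k] = 2.0 * np.logical_and(p, t).sum() / denom if denom else float("nan")
        return scores

Per-structure scores like these can then be summarized across volumes or structures to yield figures such as the overall Dice of 0.83 reported above.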


Figure 3: Box-and-whisker plot comparing the model's performance between the primary and secondary test datasets. The model could not consistently identify the central sulcus in iNPH patients because ventricular enlargement severely distorted its appearance. Two volumes were excluded because the central sulcus could not be identified manually either.

Conclusion

Automated segmentation of CT neuroanatomy is feasible with a high degree of accuracy. The model generalized to external scans as well as to scans demonstrating iNPH. With further optimization, the algorithm can extract quantitative information and spatial context from head CT, which can be used to localize disease, guide treatment, and accelerate the discovery of new imaging biomarkers.

Statement of Impact

Our model is available from https://jasonccai.github.io/HeadCTSegmentation/. Developers can utilize transfer learning to further optimize it for their specific needs, as sketched below.
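A minimal sketch of that transfer-learning workflow, assuming the released model loads as a standard Keras model; the file name, frozen-layer split, and optimizer settings are illustrative assumptions rather than instructions from the repository.

    import tensorflow as tf

    # Hypothetical file name; check https://jasonccai.github.io/HeadCTSegmentation/
    # for how the released weights are actually packaged.
    base = tf.keras.models.load_model("headct_unet.h5", compile=False)

    # Freeze most of the network and fine-tune only the final layers on new labels.
    for layer in base.layers[:-4]:        # split point is illustrative only
        layer.trainable = False

    base.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                 loss="sparse_categorical_crossentropy")
    # base.fit(new_slices, new_labels, epochs=10, validation_data=(val_slices, val_labels))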

Keywords

Head CT, idiopathic normal pressure hydrocephalus, segmentation, deep learning, convolutional neural network, U-Net

References

1. Rosman DA, Duszak R, Wang W, Hughes DR, Rosenkrantz AB. Changing Utilization of Noninvasive Diagnostic Imaging Over 2 Decades: An Examination Family–Focused Analysis of Medicare Claims Using the Neiman Imaging Types of Service Categorization System. Am J Roentgenol. 2017;210(2):364-368. doi:10.2214/AJR.17.18214

2. Frisoni GB, Geroldi C, Beltramello A, et al. Radial Width of the Temporal Horn: A Sensitive Measure in Alzheimer Disease. Am J Neuroradiol. 2002;23(1):35. http://www.ajnr.org/content/23/1/35.abstract

3. Diprose WK, Diprose JP, Wang MTM, Tarr GP, McFetridge A, Barber PA. Automated Measurement of Cerebral Atrophy and Outcome in Endovascular Thrombectomy. Stroke. Published online ahead of print. doi:10.1161/STROKEAHA.119.027120

4. Anderson RC, Grant JJ, de la Paz R, Frucht S, Goodman RR. Volumetric measurements in the detection of reduced ventricular volume in patients with normal-pressure hydrocephalus whose clinical condition improved after ventriculoperitoneal shunt placement. J Neurosurg.

5. Toma AK, Holl E, Kitchen ND, Watkins LD. Evans’ Index Revisited: The Need for an Alternative in Normal Pressure Hydrocephalus. Neurosurgery. 2011;68(4):939-944. doi:10.1227/NEU.0b013e318208f5e0

6. Relkin N, Marmarou A, Klinge P, Bergsneider M, Black PM. Diagnosing Idiopathic Normal-pressure Hydrocephalus. Neurosurgery. 2005;57(suppl_3):S2-4-S2-16. doi:10.1227/01.NEU.0000168185.29659.C5

7. Kauw F, Bennink E, de Jong HWAM, et al. Intracranial Cerebrospinal Fluid Volume as a Predictor of Malignant Middle Cerebral Artery Infarction. Stroke. 2019;50(6):1437-1443. doi:10.1161/STROKEAHA.119.024882


8. Takahashi N, Shinohara Y, Kinoshita T, et al. Computerized Identification of Early Ischemic Changes in Acute Stroke in Noncontrast CT Using Deep Learning. Vol 10950. SPIE; 2019. https://doi.org/10.1117/12.2507351

9. Warfield SK, Zou KH, Wells WM. Simultaneous truth and performance level estimation (STAPLE): An algorithm for the validation of image segmentation. IEEE Trans Med Imaging. 2004;23(7):903-921. doi:10.1109/TMI.2004.828354

10. RSNA AI Image Challenge. Accessed April 4, 2020. https://www.rsna.org/en/education/ai-resources-and-training/ai-image-challenge

11. Philbrick KA, Weston AD, Akkus Z, et al. RIL-Contour: a Medical Imaging Dataset Annotation Tool for and with Deep Learning. J Digit Imaging. 2019;32(4):571-581. doi:10.1007/s10278-019-00232-0

12. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597. Published online 2015. https://ui.adsabs.harvard.edu/abs/2015arXiv150504597R