

Original Investigation | Ophthalmology

Automated Explainable Multidimensional Deep Learning Platform of Retinal Images for Retinopathy of Prematurity Screening

Ji Wang, PhD; Jie Ji, MEng; Mingzhi Zhang, MD; Jian-Wei Lin, MEng; Guihua Zhang, MD, PhD; Weifen Gong, MD; Ling-Ping Cen, MD, PhD; Yamei Lu, MD; Xuelin Huang, MD; Dingguo Huang, MD; Taiping Li, MD; Tsz Kin Ng, PhD; Chi Pui Pang, DPhil

Abstract

IMPORTANCE A retinopathy of prematurity (ROP) diagnosis currently relies on indirect ophthalmoscopy assessed by experienced ophthalmologists. A deep learning algorithm based on retinal images may facilitate early detection and timely treatment of ROP to improve visual outcomes.

OBJECTIVE To develop a retinal image–based, multidimensional, automated, deep learning platform for ROP screening and validate its performance accuracy.

DESIGN, SETTING, AND PARTICIPANTS A total of 14 108 eyes of 8652 preterm infants who received ROP screening from 4 centers from November 4, 2010, to November 14, 2019, were included, and a total of 52 249 retinal images were randomly split into training, validation, and test sets. Four main dimensional independent classifiers were developed, including image quality, any stage of ROP, intraocular hemorrhage, and preplus/plus disease. Referral-warranted ROP was automatically generated by integrating the results of 4 classifiers at the image, eye, and patient levels. DeepSHAP, a method based on DeepLIFT and Shapley values (solution concepts in cooperative game theory), was adopted as the heat map technology to explain the predictions. The performance of the platform was further validated as compared with that of the experienced ROP experts. Data were analyzed from February 12, 2020, to June 24, 2020.

EXPOSURE A deep learning algorithm.

MAIN OUTCOMES AND MEASURES The performance of each classifier included true negative, false positive, false negative, true positive, F1 score, sensitivity, specificity, receiver operating characteristic, area under curve (AUC), and Cohen unweighted κ.

RESULTS A total of 14 108 eyes of 8652 preterm infants (mean [SD] gestational age, 32.9 [3.1] weeks; 4818 boys [60.4%] of 7973 with known sex) received ROP screening. The performance of all classifiers achieved an F1 score of 0.718 to 0.981, a sensitivity of 0.918 to 0.982, a specificity of 0.949 to 0.992, and an AUC of 0.983 to 0.998, whereas that of the referral system achieved an F1 score of 0.898 to 0.956, a sensitivity of 0.981 to 0.986, a specificity of 0.939 to 0.974, and an AUC of 0.9901 to 0.9956. Fine-grained and class-discriminative heat maps were generated by DeepSHAP in real time. The platform achieved a Cohen unweighted κ of 0.86 to 0.98 compared with a Cohen κ of 0.93 to 0.98 by the ROP experts.

CONCLUSIONS AND RELEVANCE In this diagnostic study, an automated ROP screening platform was able to identify and classify multidimensional pathologic lesions in the retinal images. This platform may be able to assist routine ROP screening in general and children's hospitals.

JAMA Network Open. 2021;4(5):e218758. doi:10.1001/jamanetworkopen.2021.8758

Key Points

Question Can deep learning algorithms achieve a performance comparable with that of ophthalmologists on multidimensional identification of retinopathy of prematurity (ROP) using wide-field retinal images?

Findings In this diagnostic study of 14 108 eyes of 8652 preterm infants, a deep learning–based ROP screening platform could identify retinal images using 5 classifiers, including image quality, stages of ROP, intraocular hemorrhage, preplus/plus disease, and posterior retina. The platform achieved an area under the curve of 0.983 to 0.998, and the referral system achieved an area under the curve of 0.9901 to 0.9956; the platform achieved a Cohen κ of 0.86 to 0.98 compared with 0.93 to 0.98 by the ROP experts.

Meaning Results suggest that a deep learning platform could identify and classify multidimensional ROP pathological lesions in retinal images with high accuracy and could be suitable for routine ROP screening in general and children's hospitals.

+ Invited Commentary

+ Supplemental content

Author affiliations and article information are listed at the end of this article.

Open Access. This is an open access article distributed under the terms of the CC-BY License.



Introduction

Retinopathy of prematurity (ROP) is a leading cause of visual impairment and irreversible blindness of children worldwide, mainly affecting preterm infants with extremely low birth weight and those who are small for gestational age. Approximately 1.2% (184 700 of 14 900 000) of preterm infants worldwide have been estimated to have ROP, among which 30 000 preterm infants have permanent visual impairment.1 Poor visual outcomes from ROP can be largely avoided if ROP is detected early and treated appropriately.2

In clinical scenarios, 3 ROP-related features (the stages of ROP and preplus/plus disease [considered specific features] and intraocular hemorrhage [considered a risk-indicative feature])3-5 have been adopted in ROP detection among preterm infants. According to the International Classification of Retinopathy of Prematurity, stages 1 to 5 of ROP are defined as abnormal response of immature vasculature in the retina, with increasing severity from stage 1 to 5. Preplus/plus disease is a continuum of abnormal changes with dilatation and tortuosity of posterior pole retinal vessels, indicating the need for intensive observation or treatment.6,7 In addition, intraocular hemorrhage is reported as a frequent predictor of the presence of ROP and poor outcomes in preterm infants.8,9

The standard method for ROP diagnosis relies on indirect ophthalmoscopy, which requires assessments performed by experienced ophthalmologists. In remote areas and places where ROP expertise is not readily available, a delayed or missed diagnosis of ROP can lead to vision loss.10 The development of an automated ROP screening platform that can meet the diagnostic criteria should facilitate timely treatment for patients.

Deep learning (DL) algorithms, especially convolutional neural networks (CNNs), have been widely applied in medical image analysis for different diseases, including glaucoma, intracranial hemorrhage, and lung cancers.11,12 Image-based automated ROP screening systems using deep CNNs have also been developed. Specifically, Hu et al13 focused on the stages of ROP detection at the image level and used the Guided Backpropagation14 algorithm to visualize the lesion based on a data set of 5511 retinal images; Brown et al15 developed a classifier for the differentiation of normal from the preplus and plus diseases using images of the posterior retina; and Wang et al16 developed a model of 2 deep CNN networks for classifying ROP into gradations of normal, minor, and severe. Their model adopted the multi-instance learning17 method and only generated eye-level results. However, these studies focused only on a single-dimensional classifier. Herein, we aimed to develop an automated multidimensional platform for ROP detection and screening using retinal images.

In this study, to address the previously mentioned problems, we developed an automated classification system covering 4 independent main classifiers (image quality, any stage of ROP, intraocular hemorrhage, and preplus/plus disease) and 1 auxiliary parameter (the posterior retina). We also developed an algorithm for the referral recommendation by integrating different outcomes from multiple dimensional analyses. The performance of our automated platform was further validated and compared with that of the ROP experts. This cloud-based platform was opened for external validation.

Methods

This diagnostic study, conducted from September 1, 2018, to June 24, 2020, was performed in compliance with the Declaration of Helsinki18 and approved by the Human Medical Ethics Committee of Joint Shantou International Eye Center of Shantou University and the Chinese University of Hong Kong. Written informed consent was waived because the retinal images used for platform development were deidentified for personal information. This study followed the Standards for Reporting of Diagnostic Accuracy (STARD) reporting guideline.


Data Sets
Retinal images of infants taken by corneal contact retinal cameras RetCam II or III (Clarity Medical Systems) for ROP screening were collected from 4 centers in southern China: Joint Shantou International Eye Center of Shantou University and the Chinese University of Hong Kong (JSIEC), Guangdong Women and Children Hospital in Yuexiu branch (Yuexiu) and Panyu branch (Panyu), and the Sixth Affiliated Hospital of Guangzhou Medical University and Qingyuan People's Hospital (Qingyuan) (eFigure 1 in the Supplement). All retinal images, including those of the normal fundus or those displaying any ROP, were included. Exclusion criteria comprised (1) nonfundus photos or fundus photos taken by imaging devices other than RetCam; (2) infants with other ocular diseases, eg, congenital cataract, retinoblastoma, or persistent hyperplastic primary vitreous; and (3) any images with disagreeing labels.

Image Labeling and Grading
Binary classification was used to categorize each of the 5 dimensions related to ROP retinal image diagnosis; this system met the requirements of the screening application, including (1) image quality (defined as gradable or ungradable; ungradable images were defined as those of poor quality with significant blur, darkness, defocus, poor exposure, or numerous artifacts that could not be identified; the remaining were classified as gradable), (2) any stage of ROP (defined as any stage or nonstage; any stage was assigned to images with any stage of ROP identified, whereas nonstage was assigned to those without any stage of ROP), (3) intraocular hemorrhage (defined as hemorrhage or nonhemorrhage; hemorrhage was assigned to the images with any identifiable hemorrhage), (4) preplus/plus disease (defined as preplus/plus or non–preplus/plus; preplus/plus described a spectrum of posterior retinal vessel abnormalities including venous dilation and arteriolar tortuosity, whereas non–preplus/plus described normal vessels in the posterior retina), and (5) posterior retina (defined as posterior or nonposterior). In order to accurately identify preplus/plus disease, the region of the posterior retina had to be defined. In this study, the posterior retina was defined as a circular area centered at the optic disc with a radius 3 times the diameter of the optic disc. Any portion of the images within this predefined area was classified as within the posterior pole.

The ground truth (criterion-standard) labels were determined by a group of ophthalmologists. The graders were trained according to our previously published protocol.19 Briefly, 2 trained junior ophthalmologists labeled independently, and the images with disagreeing labels were submitted to a senior ophthalmologist. If the decision was still uncertain, the label would be determined by an experienced ROP expert (G.Z.). Finally, the images with agreeing labels were kept for the automated system training, validation, and test data sets. Images with disagreeing labels were excluded. In addition, the optic disc and blood vessels for preplus/plus disease identification were labeled by an experienced grader.

Classifiers and Platform Development
The pipeline of our system is shown in eFigure 2 in the Supplement. An image was first evaluated for image quality. If it was predicted as ungradable, a recommendation to rephotograph was given. If the image was predicted as gradable, it entered the main pipeline. The main structure of the system was a multilabel classification20 and a postprocessing method that aggregated single-dimensional results to image-level results and image-level results to eye-level and patient-level results using max pooling. On a single image, ROP diagnosis could be viewed as a multilabel classification in that 1 image can present multiple features (stages of ROP, hemorrhage, and preplus/plus disease) simultaneously. This classification system was implemented using multiple independent classifiers based on binary relevance. Multilabel classification was implemented using multiple independent classifiers instead of 1 classifier because preplus/plus disease classification was implemented using an independent and complex pipeline. Taking the model ensemble into consideration, every classification task was implemented using a set of different neural networks. Dynamic data resampling and cost-sensitive learning were used simultaneously to resolve the class imbalance.
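To illustrate the binary-relevance and max-pooling design described above, the following minimal Python sketch thresholds each dimension independently at the image level and aggregates upward to the eye and patient levels; the probabilities, threshold, and eye groupings are hypothetical stand-ins, not the platform's actual code or values.

```python
import numpy as np

# Hypothetical per-image probabilities from 3 independent binary classifiers
# (binary relevance: one score per dimension, no shared softmax).
# Columns: any stage of ROP, intraocular hemorrhage, preplus/plus disease.
image_probs = np.array([
    [0.03, 0.91, 0.10],   # image 1, right eye
    [0.85, 0.12, 0.07],   # image 2, right eye
    [0.02, 0.05, 0.04],   # image 1, left eye
])
eye_of_image = np.array([0, 0, 1])   # eye index of each image

image_preds = image_probs >= 0.5     # image-level decisions per dimension

# Eye-level results: max pooling over the images of each eye.
eye_preds = np.stack([image_preds[eye_of_image == e].max(axis=0)
                      for e in np.unique(eye_of_image)])

# Patient-level result: max pooling over the patient's eyes, so any
# positive finding propagates upward.
patient_pred = eye_preds.max(axis=0)
print(eye_preds, patient_pred)
```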


Model ensemble and test time image augmentation were used to improve accuracy and make predictions robust to small perturbations.21 Label smoothing22 was used to calibrate the predicted probabilities. Because preplus/plus diseases have fewer positive samples than other classifiers and the presence of preplus/plus disease is attributed only to the blood vessels in the posterior pole, preplus/plus classification was considered to be a fine-grained classification and was implemented using an independent pipeline. An input image first needed to be judged on whether it was, in fact, a posterior image. The nonposterior image was regarded as non–preplus/plus. If the image belonged to a posterior image, the blood vessels were extracted using a patch-based DL technique called Res-UNet.23,24 Moreover, the optic disc was detected using a Mask-RCNN,25 and the posterior regions were calculated based on the optic disc. Afterward, the blood vessels in the posterior pole area were cropped and input into the final classifier. A set of neural networks was used to classify whether the image was preplus/plus disease or not. More details are available in eMethods 1 in the Supplement.
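The geometric step of this pipeline follows directly from the posterior-retina definition given earlier (a circle centered at the optic disc with a radius 3 times the disc diameter). Below is a minimal sketch of that step only, with fake stand-ins for the Res-UNet vessel mask and the Mask-RCNN disc detection; it is illustrative, not the authors' implementation.

```python
import numpy as np

def posterior_mask(shape, disc_center, disc_diameter):
    """Boolean mask of the posterior retina: a circle centered at the
    optic disc with radius 3x the disc diameter (the paper's definition)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = disc_center
    radius = 3.0 * disc_diameter
    return (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2

# Stand-ins for upstream model outputs (illustrative values only):
vessel_map = np.random.rand(480, 640) > 0.97    # would come from Res-UNet
disc_center, disc_diameter = (240, 320), 60.0   # would come from Mask-RCNN

mask = posterior_mask(vessel_map.shape, disc_center, disc_diameter)
posterior_vessels = vessel_map & mask   # cropped posterior vessels fed to
                                        # the final preplus/plus classifier
```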

Finally, the image-level referral decision was automatically generated by integrating the results of multiple classifiers, and the eye-level and patient-level referral decisions were generated by integrating multiple image-level results. Details of the methods, especially that of algorithm development, are shown in eMethods 2 and 3 in the Supplement.

Statistical Analysis
The data set was randomly split into training, validation, and test data sets with a ratio of 75:10:15 by a patient-based split policy in order to ensure all images of a patient were allocated into the same sub–data set of any classifier. The test set was also used to evaluate the performance of the automated referral decision. The performance of each classifier was evaluated by true negative (TN), false positive (FP), false negative (FN), true positive (TP), F1 score, sensitivity, and specificity. The receiver operating characteristic (ROC) analysis and area under curve (AUC) with 95% CIs were also calculated. Two-sided 95% CIs with the DeLong method for AUC were calculated using the open-source package pROC, version 1.14.0 (Xavier Robin). Data were analyzed from July 15, 2019, to June 24, 2020.
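As a sketch of what such a patient-based 75:10:15 split can look like (the data, patient IDs, and seeds below are fabricated; the authors' exact procedure is in the Supplement), grouping by patient ID guarantees that all images of one infant land in the same subset:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_images = 1000
patient_ids = rng.integers(0, 200, size=n_images)  # fake patient IDs
X = rng.random((n_images, 8))                       # fake per-image features

# First carve out ~15% of patients' images as the test set ...
gss_test = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
trainval_idx, test_idx = next(gss_test.split(X, groups=patient_ids))

# ... then ~10% of the whole set (10/85 of the remainder) for validation.
gss_val = GroupShuffleSplit(n_splits=1, test_size=0.10 / 0.85, random_state=0)
train_rel, val_rel = next(gss_val.split(X[trainval_idx],
                                        groups=patient_ids[trainval_idx]))
train_idx, val_idx = trainval_idx[train_rel], trainval_idx[val_rel]
```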

The comparison was carried out between our platform, JSIEC Platform for Retinopathy of Prematurity (J-PROP), and 3 experienced ROP experts (W.G., D.G., and T.L.) from JSIEC on 200 retinal images extracted randomly from the test set. Three ROP-related features were identified, and a diagnosis of referral-warranted (RW) ROP was generated automatically by J-PROP via integration of the feature identification results or generated manually from the results of the ROP experts. A criterion-standard diagnosis originated from the ground truth labels. The Cohen unweighted κ was calculated and displayed in the interobserver heat map with a conventional scale where 0.2 or less was considered to be slight agreement, 0.21 to 0.40 was labeled as fair, 0.41 to 0.60 was labeled as moderate, 0.61 to 0.80 was labeled as strong, and 0.80 to 1.0 was considered to be near-complete agreement. The indexes of TN, FP, FN, TP, F1 score, sensitivity, and specificity were also calculated.
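For reference, the agreement statistic itself is a one-liner in scikit-learn: cohen_kappa_score computes the unweighted κ used here when no weighting scheme is passed (the labels below are made up for illustration).

```python
from sklearn.metrics import cohen_kappa_score

# Toy binary labels standing in for criterion-standard vs platform reads.
ground_truth  = [1, 0, 0, 1, 1, 0, 1, 0]
platform_read = [1, 0, 0, 1, 0, 0, 1, 0]

kappa = cohen_kappa_score(ground_truth, platform_read)  # unweighted by default
print(f"Cohen kappa = {kappa:.2f}")
```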

Results

Of 55 490 retinal images, 3241 (5.8%) were discarded because they were nonfundus photos, fundus photos imaged by devices other than RetCam, not ROP but other ocular diseases such as congenital cataract and retinoblastoma, or images without agreed-upon labeling. A total of 52 249 retinal images from 14 108 eyes of 8652 preterm infants (mean [SD] gestational age, 32.9 [3.1] weeks; 4818 boys [60.4%] of 7973 with known sex) were annotated and included as the ground truth data set (Table 1). With the available data, the mean (SD) birth weight was 1925 (774) g. The data set was randomly split into training (n = 39 029), validation (n = 5140), and test (n = 8080) data sets with a ratio of 75:10:15 by a patient-based split policy. The demographic characteristics of the patients are shown in Table 1, and the data set distribution is listed in eTable 1 in the Supplement.


System Performance
The performance of 5 independent classifiers was validated and tested. In the test set, all classifiers achieved an F1 score of 0.718 to 0.981, a sensitivity of 0.918 to 0.982, a specificity of 0.949 to 0.992, and an AUC of 0.9827 to 0.9981 (Table 2, Figure 1, and eFigure 3 in the Supplement). For the ROP-related features, any stage of ROP achieved an F1 score of 0.946, a sensitivity of 0.982, a specificity of 0.985, and an AUC of 0.9981 (95% CI, 0.9974-0.9989), whereas hemorrhage achieved 0.961, 0.972, 0.992, and 0.9977 (95% CI, 0.9963-0.9991), respectively. The performance of preplus/plus disease achieved an F1 score of 0.718, a sensitivity of 0.918, a specificity of 0.970, and an AUC of 0.9827 (95% CI, 0.9706-0.9948).

In the test data set, the performance of RW ROP detection at the image level achieved an F1 score of 0.956, a sensitivity of 0.981, a specificity of 0.974, and an AUC of 0.9956 (95% CI, 0.9942-0.9970). Eye-level and patient-level F1 scores decreased to 0.915 and 0.898, respectively, whereas the outcomes of other indexes were similar to those of the image level (Table 2 and Figure 1). The performance of RW ROP detection ignoring the hemorrhage dimension was also analyzed (Table 2). In addition, the performance on 2 subsets, the RetCam II and RetCam III sets, was analyzed separately (eTables 2 and 3 in the Supplement). For comparing the performance between the single model and the ensemble model, the AUC of the models was calculated, and the ensemble model was more accurate than the single model (for identifying stage, the ensemble model achieved an AUC of 0.9981 [95% CI, 0.9974-0.9989] compared with 0.9968-0.9971 by a single model; for identifying hemorrhage, the ensemble model achieved an AUC of 0.9977 [95% CI, 0.9963-0.9991] compared with 0.9940-0.9969 by a single model; for identifying preplus/plus disease, the ensemble model achieved an AUC of 0.9827 [95% CI, 0.9706-0.9948] compared with 0.9712-0.9809 by a single model) (eTable 4 in the Supplement).

Visualization and Explainability
The features extracted by neural networks just before the classification header were visualized using t-Distributed Stochastic Neighbor Embedding, which is a technique for dimensionality reduction (eFigure 4 in the Supplement). DeepSHAP31 was adopted as a heat map technique to provide explainability, and extensive experiments were carried out to compare the heat maps generated by different techniques, including DeepSHAP, Class Activation Mapping (CAM),26,27 Saliency Maps,28 Guided Backpropagation, Integrated Gradients,29 Layer-wise Relevance Propagation (LRP)-Epsilon, and LRP-Z30 (Figure 2 and eFigure 5 in the Supplement).
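For readers who want to reproduce the general flavor of the DeepSHAP heat maps, here is a minimal sketch using the open-source shap package (reference 31) with a toy stand-in CNN; the platform's actual networks, preprocessing, and rendering are not public and are not reproduced here.

```python
import torch
import torch.nn as nn
import shap  # open-source SHAP package (Lundberg & Lee, reference 31)

# Toy stand-in CNN with 2 outputs, e.g., nonstage vs any stage of ROP.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),
)
model.eval()

images = torch.rand(20, 3, 64, 64)   # fake preprocessed retinal images
background = images[:16]             # reference distribution for DeepSHAP

explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(images[16:])  # one array per class

# Each attribution array matches the input shape, giving signed per-pixel
# contributions toward a class -- the basis of the class-discriminative
# heat maps shown in Figure 2.
```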

Table 1. Demographic Details of Patientsa From 4 Centers

Values are No./total No. (%) unless otherwise indicated.

| Characteristic | Total (RetCamb II & III) | JSIEC (RetCam II & III) | JSIEC (RetCam II) | JSIEC (RetCam III) | Yuexiu (RetCam III) | Panyu (RetCam III) | Qingyuan (RetCam III) |
| Patients, No. | 8652 | 3714 | 2533 | 1181 | 2155 | 1251 | 1532 |
| Images, No. | 52 249 | 15 382 | 9225 | 6157 | 11 269 | 14 708 | 10 890 |
| Eyes, No. | 14 108 | 5907 | 3964 | 1943 | 3521 | 2225 | 2455 |
| Patient sex known | 7973/8652 (92.2) | 3714/3714 (100) | 2533/2533 (100) | 1181/1181 (100) | 2058/2155 (95.5) | 766/1251 (61.2) | 1435/1532 (93.7) |
| Boys | 4818/7973 (60.4) | 2287/3714 (61.6) | 1607/2533 (63.4) | 680/1181 (57.6) | 1295/2058 (62.9) | 446/766 (58.2) | 790/1435 (55.1) |
| Patients with GA available | 6364/8652 (73.6) | 3545/3714 (95.4) | 2399/2533 (94.7) | 1146/1181 (97.0) | 2056/2155 (95.4) | 763/1251 (61.0) | 0 |
| GA, mean (SD), wk | 32.9 (3.05) | 32.7 (2.60) | 32.72 (2.54) | 32.62 (2.71) | 32.8 (3.40) | 34.0 (3.67) | NA |
| Patients with BW available | 6363/8652 (73.5) | 3546/3714 (95.5) | 2400/2533 (94.7) | 1146/1181 (97.0) | 2056/2155 (95.4) | 761/1251 (60.8) | 0 |
| BW, mean (SD), g | 1925 (774) | 1885 (677) | 1889 (742) | 1879 (516) | 1922 (931) | 2118 (694) | NA |

Abbreviations: BW, birth weight; GA, gestational age; JSIEC, Joint Shantou International Eye Center; NA, not available.

a A total of 566 of 6364 infants (8.89%) were term infants with a GA of 37 weeks or older. The data of postmenstrual age were only available from 2457 infants of JSIEC, and the mean (SD) postmenstrual age was 40.6 (3.07) weeks. Images were taken from 1 visit of each infant randomly. The average (range) number of images per eye was 3 (1-11 images).

b RetCam II and III are corneal contact retinal cameras used to photograph the retina. They are manufactured by Clarity Medical Systems.


Human-Platform Comparison
Our platform, J-PROP, was further compared with the ROP experts (Table 3). For the detection of intraocular hemorrhage, preplus/plus disease, and image-level RW ROP, J-PROP achieved a sensitivity of 1.000, whereas the experts achieved an average sensitivity of 0.958 to 1.000. The confusion matrix of the agreements among the ROP experts and criterion-standard diagnosis is shown in eFigure 6 in the Supplement. J-PROP achieved a Cohen κ of 0.93 for any stage of ROP, 0.97 for intraocular hemorrhage, 0.86 for preplus/plus disease, and 0.98 for RW ROP, whereas the ROP experts achieved a mean Cohen κ of 0.93 (range, 0.87-1.00), 0.93 (range, 0.91-0.95), 0.98 (range, 0.95-1.00), and 0.95 (range, 0.93-0.99) for the 4 classifiers, respectively.

Misclassification Analysis
In the test set, the images misclassified by any independent classifier were analyzed case by case for possible reasons. Poor contrast and artifacts were the most common reasons for misclassification in the stage and hemorrhage classifiers, whereas atypical vessel morphology was the main reason for FN predictions in preplus/plus disease. Some errors were due to incorrect annotations by the junior ophthalmologists rather than misclassification by J-PROP (eTable 5, eFigure 7, and eFigure 8 in the Supplement).

After full validation and testing, the neural network models were deployed to the production environment, and our cloud-based platform, J-PROP, was built and is openly accessible (eFigures 9 and 10 in the Supplement).

Table 2. Performance of 5 Classifiers of the Platform

| Classifier | Data set | TN | FP | FN | TP | F1 | Sensitivity | Specificity | AUC (95% CI) |
| Image quality | Training | 2771 | 23 | 849 | 35 386 | 0.988 | 0.977 | 0.992 | 0.9976 (0.9973-0.9979) |
| | Validation | 306 | 21 | 156 | 4657 | 0.981 | 0.968 | 0.936 | 0.9899 (0.9874-0.9924) |
| | Test | 558 | 30 | 253 | 7239 | 0.981 | 0.966 | 0.949 | 0.9922 (0.9906-0.9938) |
| Stage | Training | 31 076 | 334 | 2 | 4823 | 0.966 | 1.000 | 0.989 | 0.9997 (0.9996-0.9998) |
| | Validation | 4253 | 67 | 14 | 479 | 0.922 | 0.972 | 0.984 | 0.9977 (0.9959-0.9994) |
| | Test | 6349 | 98 | 19 | 1026 | 0.946 | 0.982 | 0.985 | 0.9981 (0.9974-0.9989) |
| Hemorrhage | Training | 30 800 | 167 | 11 | 5257 | 0.983 | 0.998 | 0.995 | 0.9999 (0.9999-0.9999) |
| | Validation | 4123 | 40 | 8 | 642 | 0.964 | 0.988 | 0.990 | 0.9982 (0.9957-1.0000) |
| | Test | 6389 | 54 | 29 | 1020 | 0.961 | 0.972 | 0.992 | 0.9977 (0.9963-0.9991) |
| Posterior | Training | 27 203 | 861 | 488 | 7683 | 0.919 | 0.940 | 0.969 | 0.9936 (0.9931-0.9941) |
| | Validation | 3558 | 153 | 70 | 1032 | 0.902 | 0.936 | 0.959 | 0.9908 (0.9890-0.9927) |
| | Test | 5629 | 220 | 127 | 1516 | 0.897 | 0.923 | 0.962 | 0.9901 (0.9884-0.9918) |
| Preplus/plus | Training | 12 701 | 192 | 0 | 631 | 0.868 | 1.000 | 0.985 | 0.9993 (0.9991-0.9996) |
| | Validation | 1707 | 27 | 12 | 120 | 0.860 | 0.909 | 0.984 | 0.9882 (0.9753-1.0000) |
| | Test | 2518 | 78 | 10 | 112 | 0.718 | 0.918 | 0.970 | 0.9827 (0.9706-0.9948) |
| RW, images | Test | 5295 | 144 | 40 | 2013 | 0.956 | 0.981 | 0.974 | 0.9956 (0.9942-0.9970) |
| RW, eyes | Test | 1694 | 83 | 7 | 482 | 0.915 | 0.986 | 0.953 | 0.9938 (0.9898-0.9977) |
| RW, patients | Test | 1047 | 68 | 6 | 327 | 0.898 | 0.982 | 0.939 | 0.9901 (0.9835-0.9966) |
| RW without hemorrhagea, images | Test | 6193 | 156 | 28 | 1115 | 0.924 | 0.976 | 0.975 | 0.9958 (0.9943-0.9973) |
| RW without hemorrhagea, eyes | Test | 1902 | 75 | 4 | 285 | 0.878 | 0.986 | 0.962 | 0.9959 (0.9927-0.9991) |
| RW without hemorrhagea, patients | Test | 1183 | 61 | 3 | 201 | 0.863 | 0.985 | 0.951 | 0.9937 (0.9884-0.9991) |

Abbreviations: AUC, area under curve; F1, F1 score; FN, false negative; FP, false positive; ROP, retinopathy of prematurity; RW, referral warranted; TN, true negative; TP, true positive.

a Ignoring the hemorrhage dimension, 3 levels of RW ROP were regenerated based on the results of stage and preplus/plus classifiers in the test set.


Discussion

Results from this study suggest that (1) we developed a cloud-based DL platform integrating multidimensional classification and multilevel referral strategies that has the potential to meet clinical needs; (2) preplus/plus disease classification was implemented using an independent pipeline; and (3) DeepSHAP, which is a combination of DeepLIFT32 and Shapley values, could be adopted to generate fine-grained and class-discriminative heat maps in real time. Collectively, our automated ROP screening system, J-PROP, covering 4 main dimensions (image quality, any stage of ROP, intraocular hemorrhage, and preplus/plus disease), was not only associated with high accuracy in both single-dimensional classification and image-, eye-, and patient-level referral decisions but also generated fine-grained and class-discriminative heat maps for explainability. The J-PROP platform is an open platform and appears to be promising for ROP screening.

From the point of view of a single image, ROP diagnosis can be viewed as a multilabel classification,20 wherein 1 image can belong to multiple classes simultaneously. This classification was implemented using multiple independent classifiers based on binary relevance (neglecting class dependence). With multiple images, ROP diagnosis is a multiple-instance learning problem17 wherein the images are instances and the eyes or patients are labeled as "bags." Our study adopted a multimodal learning with decision-level fusion33 method, which differed from a previous study16 that used the standard multi-instance learning method.

Figure 1. The Receiver Operating Characteristic (ROC) Curves for System Performance

[Four ROC panels plotting true positive rate against false positive rate, each with an inset zoom of the low false positive range. A, Detection of stage: training AUC = 0.9997, validation AUC = 0.9977, test AUC = 0.9981. B, Detection of hemorrhage: training AUC = 0.9999, validation AUC = 0.9982, test AUC = 0.9977. C, Detection of preplus/plus: training AUC = 0.9993, validation AUC = 0.9882, test AUC = 0.9827. D, Detection of RW ROP: image level AUC = 0.9956, eye level AUC = 0.9938, patient level AUC = 0.9901.]

The ROC curves for detecting any stage of retinopathy of prematurity (ROP) (A), intraocular hemorrhage (B), and preplus/plus disease (C). The area under curve (AUC) of the training, validation, and test sets from each of the classifiers is shown. The ROC for referral-warranted ROP (RW ROP) (D) was obtained by aggregating the results of 4 classifiers (stage, hemorrhage, posterior, and preplus/plus). Any positive findings of ROP-related features would result in the RW, and the AUC at the image, eye, and patient levels is shown.


In the previous study's method,16 the training instances come in bags, with all examples in a bag sharing the same label. A neural network has multiple inputs and a single output (Softmax was considered as a single output). Predictions were made only at the bag level. In contrast, a single-image classification with a postprocessing method was adopted in this study. The training instances were considered as singletons instead of bags. Predictions were made at the instance level, and the postprocessing method was used to generate the bag-level results by aggregating (max pooling) the instance-level results. The label of a single image can be given based on this image alone. As we labeled every image, J-PROP can fully use the samples, and neural networks can be trained quickly. On the contrary, for the traditional multiple-instance method, likely only 1 of multiple images was used to train the feature extractor during 1 backpropagation. In addition to the eye-level results, J-PROP has the potential to provide image-specific results. The explainable heat map of an image would not interfere with that of other images.

Preplus/plus disease classification is challenging and easily confused with the ROP stage or hemorrhage. For the preplus/plus disease classification, there are fewer positive samples than negative samples. Most of the preplus/plus disease images simultaneously contain the features of any stage of ROP or intraocular hemorrhages. Because preplus/plus disease classification is essentially related to the blood vessels in the posterior pole of the retina, it was considered to be a fine-grained classification and was implemented using an independent pipeline.

Figure 2. Visualization of Stage and Hemorrhage in Heat Maps

[Three image panels: A, ROP stage and hemorrhage; B, ROP stage and reflection; C, retinal hemorrhages and artifacts. Each row shows the original image, the preprocessed image, the CAM heat map, and the DeepSHAP heat map.]

The first and second columns indicate the original and preprocessed retinal images, respectively. The third and fourth columns are heat maps generated by Class Activation Mapping (CAM) and DeepSHAP, respectively. A, The original image presents both the stage of ROP and retinal hemorrhage on the peripheral retina. The upper row contains heat maps showing the stage of ROP, whereas the lower row contains heat maps showing retinal hemorrhages. B, The image presents both the stage of ROP and a reflection. Though the lesion shares a similar morphology with its reflection, it is successfully recognized. C, The image shows retinal hemorrhages and many artifacts; however, the hemorrhage area is highlighted by the heat map. DeepSHAP shows a more fine-grained heat map than CAM on each feature.


This pipeline included blood vessel segmentation, optic disc detection, selection of the blood vessels of the posterior pole region, and preplus/plus disease classification. This design was based on domain knowledge, and the core idea aimed to allow the inputs of the preplus/plus disease classifier to contain only the region of interest, removing irrelevant features as much as possible.

The reasons for the false predictions on ROP-related features in the test set are shown in eTable 5 in the Supplement. Lesions with poor contrast and artifacts were the 2 common factors that interfered with the recognition of the staging of ROP and hemorrhage detection, leading to FN and FP predictions. For preplus/plus disease, atypical morphology was the common cause of misclassification. Notably, various proportions of FP and FN were caused by incorrect annotations, which may have been due to the poor contrast, artifacts, and atypical morphologies commonly found in DL platforms. Although J-PROP was not affected by artifacts in most cases, artifacts and atypical morphologies could still result in false predictions. In the future, we will work to continuously improve the generalization ability by adding more specific samples, such as images with different kinds of artifacts and with atypical morphologies. We hope to adopt a hard negative mining technique so as to pay more attention to the existing difficult samples.

Limitations
There are several limitations in this study. First, according to the International Classification of Retinopathy of Prematurity, there are 5 stages of ROP (stage 1-5) and 3 levels of plus disease (normal, preplus, and plus).

Table 3. Performance Comparison Between Human Experts and J-PROP

| ROP-related feature | Reader | TN | FP | FN | TP | F1 | Sensitivity | Specificity | AUC (95% CI) |
| Stage of ROP | Expert 1 | 152 | 0 | 0 | 48 | 1.000 | 1.000 | 1.000 | NA |
| | Expert 2 | 152 | 0 | 9 | 39 | 0.897 | 0.812 | 1.000 | NA |
| | Expert 3 | 148 | 4 | 3 | 45 | 0.928 | 0.938 | 0.974 | NA |
| | Experts average | NA | NA | NA | NA | 0.943 | 0.917 | 0.991 | NA |
| | J-PROP | 148 | 4 | 1 | 47 | 0.949 | 0.979 | 0.974 | 0.9980 (0.9950-1.0000) |
| Intraocular hemorrhage | Expert 1 | 156 | 4 | 1 | 39 | 0.940 | 0.975 | 0.975 | NA |
| | Expert 2 | 160 | 0 | 3 | 37 | 0.961 | 0.925 | 1.000 | NA |
| | Expert 3 | 155 | 5 | 1 | 39 | 0.929 | 0.975 | 0.969 | NA |
| | Experts average | NA | NA | NA | NA | 0.943 | 0.958 | 0.981 | NA |
| | J-PROP | 158 | 2 | 0 | 40 | 0.976 | 1.000 | 0.988 | 1.0000 (1.0000-1.0000) |
| Preplus/plus disease | Expert 1 | 190 | 0 | 0 | 10 | 1.000 | 1.000 | 1.000 | NA |
| | Expert 2 | 190 | 0 | 0 | 10 | 1.000 | 1.000 | 1.000 | NA |
| | Expert 3 | 189 | 1 | 0 | 10 | 0.952 | 1.000 | 0.995 | NA |
| | Experts average | NA | NA | NA | NA | 0.984 | 1.000 | 0.998 | NA |
| | J-PROP | 187 | 3 | 0 | 10 | 0.870 | 1.000 | 0.984 | 1.0000 (1.0000-1.0000) |
| RW ROP, image level | Expert 1 | 117 | 0 | 1 | 82 | 0.994 | 0.988 | 1.000 | NA |
| | Expert 2 | 117 | 0 | 6 | 77 | 0.962 | 0.928 | 1.000 | NA |
| | Expert 3 | 114 | 3 | 4 | 79 | 0.958 | 0.952 | 0.974 | NA |
| | Experts average | NA | NA | NA | NA | 0.971 | 0.956 | 0.991 | NA |
| | J-PROP | 115 | 2 | 0 | 83 | 0.988 | 1.000 | 0.983 | 0.9999 (0.9996-1.0000) |
| RW ROP, image level, without hemorrhagea | Expert 1 | 147 | 0 | 0 | 53 | 1.000 | 1.000 | 1.000 | NA |
| | Expert 2 | 147 | 0 | 5 | 48 | 0.950 | 0.906 | 1.000 | NA |
| | Expert 3 | 147 | 0 | 3 | 50 | 0.971 | 0.943 | 1.000 | NA |
| | Experts average | NA | NA | NA | NA | 0.974 | 0.950 | 1.000 | NA |
| | J-PROP | 144 | 3 | 0 | 53 | 0.972 | 1.000 | 0.980 | 1.0000 (1.0000-1.0000) |

Abbreviations: AUC, area under curve; F1, F1 score; FN, false negative; FP, false positive; J-PROP, Joint Shantou International Eye Center Platform for Retinopathy of Prematurity; NA, not available; ROP, retinopathy of prematurity; RW, referral warranted; TN, true negative; TP, true positive.

a The results of referral-warranted ROP here were generated by ignoring the hemorrhage dimension.


However, in this study, each class was divided into only 2 levels (ie, any stage or nonstage and preplus/plus or non–preplus/plus). Second, posterior pole zones I to III proposed by the International Classification of Retinopathy of Prematurity represent an important parameter affecting the severity of ROP. We did not include the zone factor in our classification. Third, in some cases, DeepSHAP heat maps are fragile, as are many other methods, and do not meet the sensitivity and implementation invariance29 at the same time. Fourth, our work did not include a consideration of cost and staff training to acquire the images. We focused on the development and validation of a cloud-based ROP platform. In the future, we hope to design an edge and cloud platform and to complete the protocol on running-cost use and imaging technician selection and training. Finally, in addition to the local explanations, global understanding of neural networks is needed. Future studies should focus on the following questions: what patterns learned by the neural networks could represent the stages of ROP, and how are the features extracted by the neural network matched to these patterns? Even though the global interpretability of neural networks is an open question, future studies should attempt to understand neural networks in detail by using global interpretability methods, such as activation maximization and filter visualization.

Conclusions

This diagnostic study developed a cloud-based DL platform integrating a multidimensional classification and multilevel referral strategy for ROP screening and referral recommendation. Results suggest that the referral decision could be automatically generated at the image, eye, and patient levels. Our platform, J-PROP, has the potential to be applied in neonatal intensive care units, children's hospitals, and rural primary health care centers for routine ROP screening. It may be useful in remote areas lacking ROP expertise.

ARTICLE INFORMATION
Accepted for Publication: February 17, 2021.

Published: May 5, 2021. doi:10.1001/jamanetworkopen.2021.8758

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2021 Wang J et al. JAMA Network Open.

Corresponding Author: Mingzhi Zhang, MD, Joint Shantou International Eye Center of Shantou University, The Chinese University of Hong Kong, North Dongxia Road, Shantou, Guangdong, China 515041 ([email protected]).

Author Affiliations: Joint Shantou International Eye Center of Shantou University, the Chinese University of Hong Kong, Shantou, Guangdong, China (Wang, M. Zhang, Lin, G. Zhang, Gong, Cen, D. Huang, Li, Ng, Pang); Network and Information Center, Shantou University, Shantou, Guangdong, China (Ji); XuanShi Med Tech (Shanghai) Company Limited, Shanghai, China (Ji); Department of Ophthalmology, The Sixth Affiliated Hospital of Guangzhou Medical University, Qingyuan People's Hospital, Qingyuan, Guangdong, China (Lu); Department of Ophthalmology, Guangdong Women and Children Hospital, Guangzhou, Guangdong, China (X. Huang); Shantou University Medical College, Shantou, Guangdong, China (Ng); Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong, China (Ng, Pang).

Author Contributions: Dr M. Zhang had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Dr Wang, Dr M. Zhang, and Mr Ji contributed equally to this work.

Concept and design: Wang, Ji, M. Zhang, Pang.

Acquisition, analysis, or interpretation of data: Wang, Ji, M. Zhang, Lin, G. Zhang, Gong, Cen, Lu, X. Huang, D. Huang, Li, Ng.

Drafting of the manuscript: Wang, Ji.

Critical revision of the manuscript for important intellectual content: Cen, Wang, Ji, M. Zhang, Lin, G. Zhang, Gong, Lu, X. Huang, D. Huang, Li, Ng, Pang.

Statistical analysis: Wang, Ji, Lin.


Obtained funding: M. Zhang, Cen.

Administrative, technical, or material support: Ji, M. Zhang, Lin, Gong, Cen, Lu, X. Huang, D. Huang, Li.

Supervision: M. Zhang, Ng, Pang.

Conflict of Interest Disclosures: None reported.

Funding/Support: This work was supported in part by the Science and Technology Innovation Strategy Special Fund Project of Guangdong Province (project code [2018]157-46; Dr M. Zhang), grant 002-18120304 from the Grant for Key Disciplinary Project of Clinical Medicine under the Guangdong High-level University Development Program, China (Dr M. Zhang), and grant 2020LKSFG16B from the Li Ka Shing Foundation cross-disciplinary research grants (Drs M. Zhang and Cen).

Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

REFERENCES

1. Blencowe H, Lawn JE, Vazquez T, Fielder A, Gilbert C. Preterm-associated visual impairment and estimates of retinopathy of prematurity at regional and global levels for 2010. Pediatr Res. 2013;74(suppl 1):35-49. doi:10.1038/pr.2013.205

2. Solebo AL, Teoh L, Rahi J. Epidemiology of blindness in children. Arch Dis Child. 2017;102(9):853-857. doi:10.1136/archdischild-2016-310532

3. International Committee for the Classification of Retinopathy of Prematurity. The international classification of retinopathy of prematurity revisited. Arch Ophthalmol. 2005;123(7):991-999. doi:10.1001/archopht.123.7.991

4. Hutcheson KA, Nguyen ATQ, Preslan MW, Ellish NJ, Steidl SM. Vitreous hemorrhage in patients with high-risk retinopathy of prematurity. Am J Ophthalmol. 2003;136(2):258-263. doi:10.1016/S0002-9394(03)00190-9

5. Daniel E, Ying GS, Siatkowski RM, Pan W, Smith E, Quinn GE; e-ROP Cooperative Group. Intraocular hemorrhages and retinopathy of prematurity in the Telemedicine Approaches to Evaluating Acute-Phase Retinopathy of Prematurity (e-ROP) study. Ophthalmology. 2017;124(3):374-381. doi:10.1016/j.ophtha.2016.10.040

6. Wallace DK, Freedman SF, Hartnett ME, Quinn GE. Predictive value of pre-plus disease in retinopathy of prematurity. Arch Ophthalmol. 2011;129(5):591-596. doi:10.1001/archophthalmol.2011.63

7. Adams GG, Bunce C, Xing W, et al. Treatment trends for retinopathy of prematurity in the UK: active surveillance study of infants at risk. BMJ Open. 2017;7(3):e013366. doi:10.1136/bmjopen-2016-013366

8. Watts P, Maguire S, Kwok T, et al. Newborn retinal hemorrhages: a systematic review. J AAPOS. 2013;17(1):70-78. doi:10.1016/j.jaapos.2012.07.012

9. Quinn GE, Vinekar A. The role of retinal photography and telemedicine in ROP screening. Semin Perinatol. 2019;43(6):367-374. doi:10.1053/j.semperi.2019.05.010

10. Vander JF, Handa J, McNamara JA, et al. Early treatment of posterior retinopathy of prematurity: a controlled trial. Ophthalmology. 1997;104(11):1731-1735. doi:10.1016/S0161-6420(97)30034-7

11. Medeiros FA, Jammal AA, Thompson AC. From machine to machine: an OCT-trained deep learning algorithm for objective quantification of glaucomatous damage in fundus photographs. Ophthalmology. 2019;126(4):513-521. doi:10.1016/j.ophtha.2018.12.033

12. Coudray N, Ocampo PS, Sakellaropoulos T, et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat Med. 2018;24(10):1559-1567. doi:10.1038/s41591-018-0177-5

13. Hu J, Chen Y, Zhong J, Ju R, Yi Z. Automated analysis for retinopathy of prematurity by deep neural networks. IEEE Trans Med Imaging. 2019;38(1):269-279. doi:10.1109/TMI.2018.2863562

14. Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M. Striving for simplicity: the all convolutional net. Accessed May 15, 2020. https://arxiv.org/abs/1412.6806

15. Brown JM, Campbell JP, Beers A, et al; Imaging and Informatics in Retinopathy of Prematurity (i-ROP) Research Consortium. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136(7):803-810. doi:10.1001/jamaophthalmol.2018.1934

16. Wang J, Ju R, Chen Y, et al. Automated retinopathy of prematurity screening using deep neural networks. EBioMedicine. 2018;35:361-368. doi:10.1016/j.ebiom.2018.08.033

17. Carbonneau M-A, Cheplygina V, Granger E, Gagnon G. Multiple instance learning: a survey of problem characteristics and applications. Pattern Recognition. 2018;77:329-353. doi:10.1016/j.patcog.2017.10.009


18. World Medical Association. World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA. 2013;310(20):2191-2194. doi:10.1001/jama.2013.281053

19. Wang J, Zhang G, Lin J, Ji J, Qiu K, Zhang M. Application of standardized manual labeling on identification of retinopathy of prematurity images in deep learning. Chin J Exp Ophthalmol. 2019;37(8):653-657.

20. Zhang M-L, Zhou Z-H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering. 2014;26(8):1819-1837. doi:10.1109/TKDE.2013.39

21. Engstrom L, Tran B, Tsipras D, Schmidt L, Madry A. Exploring the landscape of spatial robustness. Accessed May 10, 2020. https://arxiv.org/abs/1712.02779

22. Müller R, Kornblith S, Hinton G. When does label smoothing help? Accessed May 17, 2020. https://arxiv.org/abs/1906.02629

23. Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. Accessed May 10, 2020. https://arxiv.org/abs/1505.04597

24. Zhang Z, Liu Q, Wang Y. Road extraction by deep residual U-Net. IEEE Geoscience and Remote Sensing Letters. 2018;15:749-753. doi:10.1109/LGRS.2018.2802944

25. He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. Accessed April 13, 2020. https://arxiv.org/abs/1703.06870

26. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. Accessed February 27, 2020. https://arxiv.org/abs/1512.04150

27. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. Accessed April 5, 2020. https://arxiv.org/abs/1610.02391

28. Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. Accessed May 16, 2020. https://arxiv.org/abs/1312.6034

29. Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. Accessed May 30, 2020. https://arxiv.org/abs/1703.01365

30. Bach S, Binder A, Montavon G, Klauschen F, Müller K-R, Samek W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One. 2015;10(7):e0130140. doi:10.1371/journal.pone.0130140

31. Lundberg S, Lee S-I. A unified approach to interpreting model predictions. Accessed January 3, 2020. https://arxiv.org/abs/1705.07874

32. Shrikumar A, Greenside P, Kundaje A. Learning important features through propagating activation differences. Accessed November 11, 2019. https://arxiv.org/abs/1704.02685

33. Ramachandram D, Taylor GW. Deep multimodal learning: a survey on recent advances and trends. IEEE Signal Processing Magazine. 2017;34(6):96-108. doi:10.1109/MSP.2017.2738401

SUPPLEMENT.
eFigure 1. Workflow of Retinal Images Annotation and Split Process
eFigure 2. The Pipeline of Image-Based Automated ROP Screening System
eFigure 3. The Receiver Operating Characteristic (ROC) Curves for System Performance
eFigure 4. The T-Distributed Stochastic Neighbor Embedding (t-SNE) of 5 Classifiers
eFigure 5. Visualization of 7 Mainstream Heat Maps
eFigure 6. Interobserver Comparison Heat Maps
eFigure 7. Representative Images and Reasons With False Negative Predictions Generated by Platform on 3 ROP-Related Features
eFigure 8. Representative Images and Reasons With False Positive Predictions Generated by Platform on 3 ROP-Related Features
eFigure 9. The Topology Structure of the Platform
eFigure 10. Screenshots of Application Procedures on Cloud-Based ROP Screening Platform
eTable 1. Dataset Distribution of 5 Dimensions
eTable 2. Performance of 5 Classifiers Based on Image Set of RetCam II
eTable 3. Performance of 5 Classifiers Based on Image Set of RetCam III
eTable 4. The Performance Comparison of Each Classifier Between Single Model and Model Ensemble in the Test Set
eTable 5. The Reasons of Misclassification on ROP-Related Features in the Test Set
eMethods 1. Dataset Development
eMethods 2. Deep Learning Algorithm Development
eMethods 3. Deployment and Code/Data Availability
eReferences
