November 18, 2009 1
CAD Panel Meeting
Statistical Issues in CADe Evaluations
Thomas E. Gwise, Ph.D.
Mathematical Statistician / Acting Team Leader
Division of Biostatistics
Office of Surveillance and Biometrics
November 18, 2009
Outline
• Statistical concepts
• Reader studies for CADe evaluation
  • Prospective and retrospective
  • Retrospective study design examples
  • Complications in retrospective studies
  • Choice of endpoints
  • Choice of controls
• Standalone studies
  • Compared to reader studies
  • Re-use of data
Statistical Evaluation of Diagnostic Tests
• Two dimensions are considered when evaluating diagnostic test performance:
  • How well can the test detect diseased cases?
    • Sensitivity: fraction of diseased patients who test positive
  • How well can the test correctly identify the non-diseased cases?
    • Specificity: fraction of non-diseased patients who test negative
• Sensitivity and specificity are not comparable if estimated in separate studies
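As a minimal illustration of the two definitions (all counts below are invented, not from any study), sensitivity and specificity can be read off a 2x2 table of test result versus true disease status:

```python
# Illustrative sketch with made-up counts: test result vs. true status.
tp, fn = 45, 5    # diseased patients: test positive / test negative
tn, fp = 80, 20   # non-diseased patients: test negative / test positive

sensitivity = tp / (tp + fn)   # fraction of diseased who test positive
specificity = tn / (tn + fp)   # fraction of non-diseased who test negative

print(sensitivity, specificity)  # 0.9 0.8
```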
ROC Curves
ROC curves are plots of Se/Sp considering all possible cutoffs.
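A rough sketch of how such a curve arises, using invented scores: sweep the cutoff over every observed value and record the (FPF, TPF) pair at each cutoff.

```python
# Sketch with made-up data: each distinct score serves as a cutoff,
# with "test positive" meaning score >= cutoff.
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   0,   1,   1]   # 1 = diseased

def roc_points(scores, labels):
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pts = [(0.0, 0.0)]  # strictest cutoff: nothing called positive
    for c in sorted(set(scores), reverse=True):
        tpf = sum(1 for s, y in zip(scores, labels) if y and s >= c) / n_pos
        fpf = sum(1 for s, y in zip(scores, labels) if not y and s >= c) / n_neg
        pts.append((fpf, tpf))
    return pts

for fpf, tpf in roc_points(scores, labels):
    print(fpf, tpf)  # ends at (1.0, 1.0): everything called positive
```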
Statistical Evaluation of Diagnostic Tests
Does the test add value?
Example: Is a diagnostic test for bone mineral density better than just using a person’s age in diagnosing osteoporosis?
Example: Does use of a CADe device improve diagnostic performance of readers?
Examples of improvement:
• Sensitivity and specificity both better
• ROC plot (or area) better
• Improved reading time, same performance
Intended Use
• The vast majority of submissions for CADe devices to date have been for devices labeled as second readers, aids to physicians.
• The user is directed to completely evaluate images as practice dictates before initiating CADe.
• As such, it is expected that using the device in accordance with the label will improve the performance of the physician.
Prospective Study
• If study conduct matches intended use, it is generally believed that a good way to test for a change in performance is a multi-center, prospective, randomized clinical trial, e.g.:
  • Randomize patients to the respective experimental conditions: unassisted image reading; CADe-assisted image reading.
  • Manage patients according to the evaluations, as in routine clinical practice.
  • Follow up patients to determine true disease state.
  • Analyze results and compare performance under the two experimental conditions.
Prospective Studies: Pros
• Study conduct matches indications for use (routine clinical practice, where reader decisions affect patient management).
• Estimate of performance under intended-use conditions
Drawbacks to Prospective Randomized Trials
• For intended use populations where disease prevalence is low, a prospective study as described would require a long study duration and a large enrollment to obtain enough disease cases to compare the performance of the two modalities.
• Risk to participants, if patient management will depend on readings in the study (IDE may be required)
Possible Proxies for Dx Performance in Population
• Retrospective Reader Studies
• Standalone Studies• (bench testing without reader)
Retrospective Reader Studies
• Reader evaluations are made off-line on a retrospective data set of images for which the disease status of the patients has been established according to ground-truthing rules.
• Multi-reader Multi-case (MRMC) designs: multiple readers read some or all images
• Sample is enriched with disease cases.
Retrospective Reader Studies: Pros
• No significant risk, because reader results are not used to manage patients (IDE not required)
• Very efficient. Relatively small sample size can result in precise estimates of sensitivity, specificity, ROC curve, and CADe effect on these endpoints.
Retrospective Reader Studies: Cons
• Reading behavior may not be the same as in routine clinical practice because:
  • Readers know their readings do not matter to the patient.
  • Readers may detect enrichment, which could affect their reading behavior.
• Enrichment causes spectrum bias
  • Example: enriching with challenging cases results in:
    • downward bias in reader performance
    • upward bias in the CADe effect on the reader
• A small number of readers may not generalize
Complications In Retrospective Reader Studies
• Reader variability issues
• Enrichment related biases
• Choice of controls
• Assumptions
Reader Variability
• 108 US mammographers
• read a common set of 79 mammograms
• provided a rating of suspicion of disease using the Breast Imaging Reporting and Data System (BI-RADS) rating scale of 1–5, where 5 is the highest level of suspicion of cancer

Data from Beam et al., Variability in the interpretation of screening mammograms by US radiologists, Arch Intern Med 1996;156:209-213, as in Wagner et al., Assessment of Medical Imaging and Computer-assist Systems: Lessons from Recent Experience, Acad Radiol 2002;9:1264–1277
[Figure: scatter of TPF vs. FPF (Sensitivity vs. 1 − Specificity) for the 108 US radiologists in the study by Beam et al.]
Number of Readers
• Companies have submitted studies with 5 to 20 readers.
• Reader sample should represent intended use population of readers.
• A small number of readers may not be representative of the reader population.
Enrichment
• The process of supplementing the image sample with disease-positive images.
• Performance estimates obtained with enriched study samples will likely be different from performance in the intended use population.
• One may infer that differences in performance between modalities are qualitatively applicable to the intended use population if the spectrum of disease is properly represented.
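One purely arithmetic facet of this: sensitivity and specificity do not depend on prevalence, but predictive values do, so absolute numbers read off an enriched sample do not transfer to a low-prevalence screening population. A sketch with assumed values:

```python
# Assumed (illustrative, not from the talk): Se = Sp = 0.90.
se, sp = 0.90, 0.90

def ppv(se, sp, prev):
    # P(disease | positive test) via Bayes' rule
    return se * prev / (se * prev + (1 - sp) * (1 - prev))

print(round(ppv(se, sp, 0.50), 3))   # enriched sample, 50% diseased
print(round(ppv(se, sp, 0.005), 3))  # screening population, 0.5% prevalence
```

The same Se/Sp pair implies a very different fraction of true positives among test-positive cases in the two settings.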
Enrichment (Spectrum Effect)
• Different case mixes of lesion types will likely result in different performance estimates (spectrum effect)
• For example: in mammography, a CADe may have more difficulty detecting some masses than microcalcifications. A sample in which the proportion of microcalcifications to masses is large will give higher performance estimates than a sample in which that proportion is smaller.
[Figure: score distributions for Disease (−) and Disease (+) cases.]
Enrichment (Easy Cases)
• Consider a sample of images enriched with a large proportion of disease-positive cases easily detected by readers and CADes.
  • Performance estimates for both modalities will likely be high.
  • It may be difficult to detect a difference in performance between the two modalities.
[Figure (simulated data): ROC curves for reader alone (red) and reader with CADe.]
Enrichment (Challenging Cases)
• Stress Test: A study in which a sample of images is enriched with a large proportion of positive cases considered to be difficult to detect by readers and CADes.
• Goal: to show that the device can add value in cases that are difficult for readers.
• Performance results obtained from studies on enriched samples cannot be easily generalized across studies.
[Figure (simulated data): ROC curves for reader and reader with CADe.]
Enrichment (Context Bias)
• Readers in a study environment will become aware of the enrichment and could change their reading behavior in response.
• Investigators attempt to mitigate this context bias by estimating relative performance.
Egglin et al., Context Bias: A Problem in Diagnostic Radiology, JAMA 1996;276:1752-1755
Background for Questions on Endpoints
• Contrast endpoints at specific thresholds (Se/Sp) with aggregating endpoints (ROC)
ROC Curves and Decision Variable Models
• ROC curves show how well a test separates disease test scores from non-disease test scores.
• Assume that a decision variable can model a reader’s decision process
  • Example: Probability of Malignancy (POM)
  • Readers are instructed to rate an image with respect to the probability that it is malignant
• Ratings simulated for 25 healthy and 25 diseased images
[Figure: Gaussian score distributions for Disease (−) and Disease (+) cases.]
ROC Curves Depend on Relative Ranking
• ROC curves are invariant to monotone transformations
• Relative ranking is the key
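This invariance is easy to check numerically. A sketch with simulated Gaussian scores (an assumed model, in the spirit of the decision-variable illustration above):

```python
import math
import random

random.seed(0)
neg = [random.gauss(0.0, 1.0) for _ in range(200)]  # non-diseased scores
pos = [random.gauss(1.0, 1.0) for _ in range(200)]  # diseased scores

def auc(neg, pos):
    # empirical AUC = P(diseased score > non-diseased score),
    # counting ties as one half
    wins = sum(1.0 * (p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

a_raw = auc(neg, pos)
a_exp = auc([math.exp(x) for x in neg], [math.exp(x) for x in pos])
print(a_raw == a_exp)  # True: exp() is monotone, so rankings are unchanged
```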
[Figure: Gaussian decision-variable distributions.]
Complication
• A very large fraction of responses for certain detection tasks fall in the extreme ranges of the scale. (Gur et al.)
• A similar pattern is not uncommon in reader study results submitted to FDA.

Gur et al., “Binary” and “Non-Binary” Detection Tasks: Are Current Performance Measures Optimal? Acad Radiol 2007;14:871-876
[Figure: rating distributions for Disease (−) and Disease (+) cases.]
• Certain tasks that are binary in nature are better represented by a binary endpoint, both conceptually and statistically.
• In simulations, Gur et al. showed that a binary task is evaluated with less bias and variability if a binary scale rather than a continuous scale is used.
• For a task that is essentially binary, such as detecting microcalcifications, how rigorous can we expect relative rankings to be?

Gur et al., “Binary” and “Non-Binary” Detection Tasks: Are Current Performance Measures Optimal? Acad Radiol 2007;14:871-876
ROC Based Endpoints
• Good for comparing tests over all possible cutoffs
• Use information efficiently
• Following slides discuss details associated with ROC analyses
[Figure: ROC curves for the control modality and the CADe modality. The difference between the AUCs is the average difference in Se over all Sp.]
Comparable AUCs?
Depends on clinical context
Is all of the difference in AUC clinically relevant?
[Figure: ROC curves for the control modality and the CADe modality, with a context-dependent bound marking the clinically relevant region.]
• Possible to weight regions according to clinical relevance?
• Partial AUC?
• Use other device-specific criteria?
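As a sketch of what a standardized partial AUC computes, assume a binormal ROC model Se(fpr) = Φ(a + b·Φ⁻¹(fpr)); the parameters a and b below are invented for illustration.

```python
from statistics import NormalDist

nd = NormalDist()
a, b = 1.2, 1.0   # assumed binormal ROC parameters (illustrative)

def se_at(fpr):
    # binormal model: Se as a function of the false positive fraction
    return nd.cdf(a + b * nd.inv_cdf(fpr))

def pauc_std(lo, hi, steps=2000):
    # trapezoidal integral of Se over [lo, hi], divided by the interval
    # width so the result is comparable to a sensitivity (0 to 1)
    h = (hi - lo) / steps
    total = 0.5 * (se_at(lo) + se_at(hi))
    total += sum(se_at(lo + i * h) for i in range(1, steps))
    return total * h / (hi - lo)

p = pauc_std(0.1, 0.2)
print(round(p, 3))  # average Se over the chosen FPR band
```

Dividing by the interval width is one common standardization; it makes the quantity interpretable as an average sensitivity over the band.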
Thresholds (Se/Sp)
• Intuitive
  • Binary, similar to practice => work up or no?
  • Obviate adapting readers to unfamiliar rating scales
• Mimic reality
  • Same framework as Post-Market Information (spectrum bias still an issue)
Example: “Keep All Positives from Unaided Read” Rule
• Several 2nd reader CADe device labels require or imply that positive findings on the initial unaided read should not be negated by the CADe-aided read.
Endpoints Specific to Intended Use(“Keep All Positives from Unaided Read” rule)
• “Therefore, the radiologist’s work-up decision should not be altered if the system fails to mark an area that the radiologist has detected on the initial film review and has already decided requires further work-up. Nor should the decision be affected if the system marks an area that the radiologist decides is not suspicious enough to warrant further work-up, whether the area is detected by the radiologist on initial film review or only after being marked by the system.”
• From SecondLook label
• The radiologist should base interpretation only upon the original images and not depend on the CAD markers for interpretation.
• The device is a detection aid, not an interpretative aid. The CAD markers should be activated only after the first reading.
• The device does not identify all areas that are suspicious for cancer. Some lesions are not marked by the device, and a user should not be dissuaded from working up a finding if the device fails to mark that site.
• From R2 label
Applying “Keep All Positives from Unaided Read” Rule
• Se: un-aided to CADe-aided change must be NON-NEGATIVE
• Sp: un-aided to CADe-aided change must be NON-POSITIVE; bound the increase of FPF
[Figure: ROC space showing the unaided reader's operating point (Se, 1 − Sp) and the success* region for the aided read.]
*Biggerstaff, “Comparing diagnostic tests: a simple graphic using likelihood ratios,” Stat Med, 2000.
Image Sample Required for Comparing the Same Two ROC Curves Using Different Accuracy Measures
• Compare sample size needs for various measures
• Context: two specified ROC curves
  • Detectable change in AUC
  • Corresponding detectable change at given false positive rates (FPRs) or over given FPR intervals
Zhou, Obuchowski and McClish 2002, Statistical Methods in Diagnostic Medicine, Wiley & Sons, Inc. NY
Detectable Changes
[Figure: ROC plot marking Se at FPR = 0.2 and the partial AUC over 0.1 < FPR < 0.2, standardized by the interval width (0.2 − 0.1).]
Sample Size Efficiency

Measure of Accuracy                          Detectable Change   Ntotal (n+ = n−)
ROC AUC                                      0.100               278
Se (FPR = 0.01)                              0.108               930
Se (FPR = 0.10)                              0.201               482
Se (FPR = 0.20)                              0.276               382
PAUC(FPR < 0.1) / (FPR2 − FPR1)              0.167               722
PAUC(FPR < 0.2) / (FPR2 − FPR1)              0.182               522
PAUC(0.1 < FPR < 0.2) / (FPR2 − FPR1)        0.198               384

Adapted from Table 6.8, Zhou, Obuchowski and McClish 2002, Statistical Methods in Diagnostic Medicine, Wiley & Sons, Inc. NY
Not Uncommon Problem
• AUC difficult to interpret
• Post hoc PAUC as a rescue?
• Choosing a bound has type I error implications
• N for AUC may be too small to get useful PAUC or Se/Sp estimates
• Inadequate information => failed study
Endpoint Summary
• Sensitivity and specificity are more relevant than ROC AUC to the dichotomous decisions made in image reading.
• Drawbacks to using ROC analysis:
  • AUCs not always easy to interpret
    • Crossing curves
    • Comparable FPF regions
  • Is reader scoring representative of practice?
Endpoint Summary
• “So my comment is about CADe. I want to point out that the ROC, which is, of course, a wonderful device for assessing the process is not perfectly relevant from the clinical setting. The clinical setting, there is a particular algorithm cut-point and decision are dichotomous. And so one had ought to focus on specific points on the ROC curve. And it seems to me that it is essential that your– that companies show that they have improved sensitivity, which to me means statistical significance or Bayesian probability that the sensitivity is improved. This is a very low hurdle.”—D Berry, March 2008 Radiological Devices Advisory Panel meeting.
• Pepe, M.S., Urban, N., Rutter, C. and Longton, G. (1997) Design of a study to improve accuracy in reading mammograms, J Clin Epidemiol 50:1327-1338.
• Van Belle, G. 2002, Statistical Rules of Thumb, Wiley & Sons, Inc. NY (p 100)
Control Arm Discussion
• It is assumed that effectiveness or clinical utility can be shown by comparing unaided image reading to CADe-aided image reading.
• We formulated several questions for the panel concerning control arms for 510(k) review (substantial equivalence). The next slides provide some background.
Example Non-Inferiority Test
[Figure: number line with −δ and 0 marked; the difference (reader performance with new CADe) − (reader performance with predicate CADe) is compared against the margin.]
Success: the confidence interval for the difference in improvements lies entirely above a preset limit (−δ).
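A sketch of this comparison with invented numbers (the improvements, standard errors, and margin below are all assumptions, and the independence of the two estimates is itself an assumption that would not hold if both were measured on the same readers and images):

```python
from statistics import NormalDist

# improvement = (aided performance) - (unaided performance) in each study;
# noninferiority holds if the lower 95% CI bound of
# (improvement_new - improvement_predicate) exceeds -delta.
z = NormalDist().inv_cdf(0.975)        # ~1.96 for a two-sided 95% CI

impr_new, se_new = 0.06, 0.010         # assumed AUC gain, new CADe
impr_pred, se_pred = 0.05, 0.010       # assumed AUC gain, predicate
delta = 0.02                           # preset noninferiority margin

diff = impr_new - impr_pred
se_diff = (se_new**2 + se_pred**2) ** 0.5  # assumes independent estimates
lower = diff - z * se_diff

print(lower > -delta)  # True here: noninferiority is concluded
```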
Study Design #1
• Readers read a common set of images under three modalities:
  • Unaided reading
  • CADe-aided reading with the study device
  • CADe-aided reading with the predicate
  • Note: CADe-aided reading performed according to the label
• Randomize image order
• Washout periods between modalities
• Compare performance results
  • Unaided reading comparisons ensure clinical utility
  • A non-inferiority delta can be defined
Study Design #2
• Un-aided reading vs. CADe-aided reading:
  • Unaided reading
  • CADe-aided reading with the study device
• Randomize image order
• Washout periods between modalities
• Compare performance results to recorded predicate performance (label, prior study)
CADe SE Study Example
• Assume study design #2 from the previous slide
• Case mix:
  • Predicate study: difficult to detect
  • New device study: easy to detect
• Readers:
  • Predicate study: experienced specialists
  • New device study: minimally experienced
• Changes in performance (with CADe − without CADe) are similar in the two studies.
Changes In Performance Are Not Comparable Across Studies
• In design #2, the comparison across studies is confounded by spectrum bias and reader differences.
• Using such a design, comparing changes across enriched studies effectively reduces the question to whether or not the CADe device offers any increase in performance over unaided reading.
• With respect to performance, comparing across enriched studies invites imprecise or erroneous SE and NSE conclusions due to confounding (case mix, reader differences, others).
Example Non-Inferiority Test
[Figure: the same non-inferiority diagram, with the improvements for the new and predicate CADe devices estimated in the same study.]
• Given there is an improvement with CADe-aided reading over the reader alone
• Compared in the same study
• Success: the confidence interval for the difference in improvements lies entirely above a preset limit (−δ)
Standalone Studies
• Cannot show clinical utility because no reader is involved
• Standalone studies may be useful for comparing a CADe device to a previous version or for investigating the performance of the device without the reader.
• Example: studying a sample large enough to characterize all important strata (diseased and non-diseased cases) can provide useful label information.
Enriched Standalone Studies
• Suffer the same complications as reader studies with respect to sample enrichment:
  • Results are not generalizable across studies.
• Performance estimates apply only to the sample:
  • Samples are not simple random samples of the population.
  • They do not represent standalone performance in the population.
Reuse of Test Data (Standalone Studies)
• Some companies have proposed re-using test data in evaluating updated versions of CADes.
Multiplicity
• Multiple tests on the same data set will inflate type I error
• Sponsors must account for multiplicity
  • Example: Bonferroni correction
• Practical problem: choosing α for a “reuse” test if α = 0.05 was used for the first of several tests
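A trivial sketch of the accounting problem: under a Bonferroni split, the total α budget must be divided across all planned looks at the data, so spending the full 0.05 on the first test leaves nothing for later reuse (the number of looks below is an assumption for illustration).

```python
alpha_total = 0.05     # familywise type I error budget
k = 5                  # assumed number of planned looks at the data

# Bonferroni: split the budget evenly so the familywise error
# rate stays at or below alpha_total
alpha_per_test = alpha_total / k
print(alpha_per_test)  # each test run at the 0.01 level

# The practical problem: if the first test already used the full
# 0.05, no budget remains for later "reuse" tests.
remaining = alpha_total - 0.05
print(remaining)       # 0.0
```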
“Teaching to the Test”
• Each upgrade iteration on the same data could be considered training.
• Testing on training data => unreliable results (“teaching to the test”, “fitting to the noise”)
• This is in addition to multiplicity problems
• Difficult to quantify this bias.
Example of Overfitting
• Randomly generate a data set of 20 profiles having 6000 features each
• Arbitrarily assign each member to one of two classes
• Develop and evaluate classifiers using 3 processes:
  • Resubstitution method (teaching to the test): 1) build predictor on full data set; 2) reapply predictor to each specimen
  • Partial cross-validation: 1) leave one out; 2) build classifier on remaining data; 3) classify the left-out point; 4) repeat (total 20)
  • Nearly unbiased cross-validation

This example from Simon et al. illustrates the problems of overfitting in the context of developing algorithms for class prediction with gene expression data. The large number of features within relatively small samples makes this a good parallel to the situation faced by CADe developers.

Simon, Richard, Radmacher, D., Dobbin, K., McShane, L.M. Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification. Journal of the National Cancer Institute, Vol. 95, No. 1, January 1, 2003
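A small-scale simulation in the spirit of the Simon et al. example may make the effect concrete. The problem size is reduced here (300 features instead of 6000, an assumption to keep the sketch fast), and a simple nearest-centroid rule stands in for their classifiers:

```python
import random

# Random "profiles" with random class labels: there is nothing to learn,
# so an honest error estimate should be near 50%.
random.seed(1)
n, p, k = 20, 300, 10          # samples, features, features selected
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]  # arbitrary class assignment

def top_features(rows, labels):
    # rank features by absolute difference of class means
    scored = []
    for j in range(p):
        m0 = [r[j] for r, c in zip(rows, labels) if c == 0]
        m1 = [r[j] for r, c in zip(rows, labels) if c == 1]
        scored.append((abs(sum(m1) / len(m1) - sum(m0) / len(m0)), j))
    return [j for _, j in sorted(scored, reverse=True)[:k]]

def classify(x, rows, labels, feats):
    # assign to the nearer class centroid on the selected features
    def sqdist(c):
        members = [r for r, lab in zip(rows, labels) if lab == c]
        return sum((x[j] - sum(m[j] for m in members) / len(members)) ** 2
                   for j in feats)
    return 0 if sqdist(0) < sqdist(1) else 1

# Resubstitution ("teaching to the test"): select, fit, and test on all data
feats = top_features(X, y)
resub_err = sum(classify(x, X, y, feats) != c for x, c in zip(X, y)) / n

# Honest leave-one-out CV: feature selection redone inside every fold
cv_errors = 0
for i in range(n):
    Xtr, ytr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
    f = top_features(Xtr, ytr)
    cv_errors += classify(X[i], Xtr, ytr, f) != y[i]
cv_err = cv_errors / n

# Resubstitution reports an optimistically low error on pure noise.
print(resub_err, cv_err)
```

The key design point, as in Simon et al., is that feature selection must happen inside each cross-validation fold; selecting features once on all data and then cross-validating only the fit reproduces much of the bias.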
• Random data: expect ~½ to be misclassified
• Resubstitution method (teaching to the test): 98.2% of data sets had zero misclassifications
Added Review Questions
• Any variation of reusing data would raise many difficult review issues:
  • Data integrity / access controls: who has access to test data? When?
  • Theoretical basis for procedures: published method? Assumptions verifiable?
  • Selection bias: how were images chosen?
  • Type I error control
Using Only Standalone Data
• A change in marker style can affect reader behavior (Krupinski et al. 1992; Gilbert et al. 2008)
• Changes in prevalence affect reader behavior (Egglin et al. 1996)
• Deduce that changes in CADe mark placement or frequency could impact reader behavior.
• A change to the algorithm is a change to the device, and the device acts on the reader's diagnosis. It is difficult to know a priori which changes to an algorithm will produce a change in diagnostic performance.
Reader Studies Compared to Standalone Studies
• Reader studies investigate reader-device interaction.
• Standalone studies investigate only device performance.
Summary
• Endpoints for reader studies
  • Binary endpoint more relevant to the study question
  • Sample size for the appropriate endpoint
• Control arms for 510(k) reader studies
  • Is any improvement over un-aided reading adequate?
• Reuse of data
  • Teaching to the test
• Evaluating CADes without readers
  • Does not show clinical utility
  • Does not investigate the device under its intended use
Thank You