November 18, 2009 1
CAD Panel Meeting
Statistical Issues in CADe Evaluations
Thomas E. Gwise, Ph.D.
Mathematical Statistician / Acting Team Leader
Division of Biostatistics
Office of Surveillance and Biometrics
November 18, 2009
Outline
• Statistical concepts
• Reader studies for CADe evaluation
  • Prospective and retrospective
  • Retrospective study design examples
  • Complications in retrospective studies
  • Choice of endpoints
  • Choice of controls
• Standalone studies
  • Compared to reader studies
  • Re-use of data
Statistical Evaluation of Diagnostic Tests
• Two dimensions are considered when evaluating diagnostic test performance:
  • How well can the test detect diseased cases?
    • Sensitivity: fraction of diseased patients who test positive
  • How well can the test correctly identify the non-diseased cases?
    • Specificity: fraction of non-diseased patients who test negative
• Sensitivity and specificity are not comparable if estimated in separate studies
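As a minimal illustration of the two definitions (all counts below are invented, not from any study), sensitivity and specificity can be read off a 2x2 table of test result versus true disease status:

```python
# Illustrative sketch with made-up counts: test result vs. true status.
tp, fn = 45, 5    # diseased patients: test positive / test negative
tn, fp = 80, 20   # non-diseased patients: test negative / test positive

sensitivity = tp / (tp + fn)   # fraction of diseased who test positive
specificity = tn / (tn + fp)   # fraction of non-diseased who test negative

print(sensitivity, specificity)  # 0.9 0.8
```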
ROC Curves
ROC curves are plots of Se/Sp considering all possible cutoffs.
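A rough sketch of how such a curve arises, using invented scores: sweep the cutoff over every observed value and record the (FPF, TPF) pair at each cutoff.

```python
# Sketch with made-up data: each distinct score serves as a cutoff,
# with "test positive" meaning score >= cutoff.
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   0,   1,   1]   # 1 = diseased

def roc_points(scores, labels):
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pts = [(0.0, 0.0)]  # strictest cutoff: nothing called positive
    for c in sorted(set(scores), reverse=True):
        tpf = sum(1 for s, y in zip(scores, labels) if y and s >= c) / n_pos
        fpf = sum(1 for s, y in zip(scores, labels) if not y and s >= c) / n_neg
        pts.append((fpf, tpf))
    return pts

for fpf, tpf in roc_points(scores, labels):
    print(fpf, tpf)  # ends at (1.0, 1.0): everything called positive
```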
Statistical Evaluation of Diagnostic Tests
Does the test add value?
Example: Is a diagnostic test for bone mineral density better than just using a person’s age in diagnosing osteoporosis?
Example: Does use of a CADe device improve diagnostic performance of readers?
Examples of improvement:
• Sensitivity and specificity both better
• ROC plot (or area) better
• Improved reading time, same performance
Intended Use
• The vast majority of submissions for CADe devices to date have been for devices labeled as second readers, aids to physicians.
• The user is directed to completely evaluate images as practice dictates before initiating CADe.
• As such, it is expected that using the device in accordance with the label will improve the performance of the physician.
Prospective Study
• If study conduct matches intended use, it is generally believed that a good way to test for a change in performance is a multi-center, prospective, randomized clinical trial, e.g.:
  • Randomize patients to the respective experimental conditions: unassisted image reading; CADe-assisted image reading.
  • Manage patients according to the evaluations, as in routine clinical practice.
  • Follow up patients to determine true disease state.
  • Analyze results and compare performance under the two experimental conditions.
Prospective Studies: Pros
• Study conduct matches indications for use (routine clinical practice, where reader decisions affect patient management).
• Estimate of performance under intended-use conditions
Drawbacks to Prospective Randomized Trials
• For intended use populations where disease prevalence is low, a prospective study as described would require a long study duration and a large enrollment to obtain enough disease cases to compare the performance of the two modalities.
• Risk to participants, if patient management will depend on readings in the study (IDE may be required)
Possible Proxies for Dx Performance in Population
• Retrospective Reader Studies
• Standalone Studies• (bench testing without reader)
Retrospective Reader Studies
• Reader evaluations are made off-line on a retrospective data set of images for which the disease status of the patients has been established according to ground-truthing rules.
• Multi-reader Multi-case (MRMC) designs: multiple readers read some or all images
• Sample is enriched with disease cases.
Retrospective Reader Studies: Pros
• No significant risk, because reader results are not used to manage patients (IDE not required)
• Very efficient. Relatively small sample size can result in precise estimates of sensitivity, specificity, ROC curve, and CADe effect on these endpoints.
Retrospective Reader Studies: Cons
• Reading behavior may not be the same as in routine clinical practice because:
  • Readers know their readings do not matter to the patient.
  • Readers may detect enrichment, which could affect their reading behavior.
• Enrichment causes spectrum bias
  • Example: enriching with challenging cases results in:
    • downward bias in reader performance
    • upward bias in the CADe effect on the reader
• A small number of readers may not generalize
Complications In Retrospective Reader Studies
• Reader variability issues
• Enrichment related biases
• Choice of controls
• Assumptions
Reader Variability
• 108 US mammographers
• read a common set of 79 mammograms
• provided a rating of suspicion of disease using the Breast Imaging Reporting and Data System (BI-RADS) rating scale of 1–5, where 5 is the highest level of suspicion of cancer

Data from Beam et al., Variability in the interpretation of screening mammograms by US radiologists, Arch Intern Med 1996;156:209-213, as in Wagner et al., Assessment of Medical Imaging and Computer-assist Systems: Lessons from Recent Experience, Acad Radiol 2002;9:1264–1277
[Figure: scatter of TPF vs. FPF (Sensitivity vs. 1 − Specificity) for the 108 US radiologists in the study by Beam et al.]
Number of Readers
• Companies have submitted studies with 5 to 20 readers.
• Reader sample should represent intended use population of readers.
• A small number of readers may not be representative of the reader population.
Enrichment
• The process of supplementing the image sample with disease-positive images.
• Performance estimates obtained with enriched study samples will likely be different from performance in the intended use population.
• One may infer that differences in performance between modalities are qualitatively applicable to the intended use population if the spectrum of disease is properly represented.
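One purely arithmetic facet of this: sensitivity and specificity do not depend on prevalence, but predictive values do, so absolute numbers read off an enriched sample do not transfer to a low-prevalence screening population. A sketch with assumed values:

```python
# Assumed (illustrative, not from the talk): Se = Sp = 0.90.
se, sp = 0.90, 0.90

def ppv(se, sp, prev):
    # P(disease | positive test) via Bayes' rule
    return se * prev / (se * prev + (1 - sp) * (1 - prev))

print(round(ppv(se, sp, 0.50), 3))   # enriched sample, 50% diseased
print(round(ppv(se, sp, 0.005), 3))  # screening population, 0.5% prevalence
```

The same Se/Sp pair implies a very different fraction of true positives among test-positive cases in the two settings.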
Enrichment (Spectrum Effect)
• Different case mixes of lesion types will likely result in different performance estimates (spectrum effect)
• For example: in mammography, a CADe may have more difficulty detecting some masses than microcalcifications. A sample in which the proportion of microcalcifications to masses is large will give higher performance estimates than a sample in which that proportion is smaller.
[Figure: score distributions for Disease (−) and Disease (+) cases.]
Enrichment (Easy Cases)
• Consider a sample of images enriched with a large proportion of disease-positive cases easily detected by readers and CADes.
  • Performance estimates for both modalities will likely be high.
  • It may be difficult to detect a difference in performance between the two modalities.
[Figure (simulated data): ROC curves for reader alone (red) and reader with CADe.]
Enrichment (Challenging Cases)
• Stress Test: A study in which a sample of images is enriched with a large proportion of positive cases considered to be difficult to detect by readers and CADes.
• Goal: to show that the device can add value in cases that are difficult for readers.
• Performance results obtained from studies on enriched samples cannot be easily generalized across studies.
[Figure (simulated data): ROC curves for reader and reader with CADe.]
Enrichment (Context Bias)
• Readers in a study environment will become aware of the enrichment and could change their reading behavior in response.
• Investigators attempt to mitigate this context bias by estimating relative performance.
Egglin et al., Context Bias: A Problem in Diagnostic Radiology, JAMA 1996;276:1752-1755
Background for Questions on Endpoints
• Contrast endpoints at specific thresholds (Se/Sp) with aggregating endpoints (ROC)
ROC Curves and Decision Variable Models
• ROC curves show how well a test separates disease test scores from non-disease test scores.
• Assume that a decision variable can model a reader’s decision process
  • Example: Probability of Malignancy (POM)
  • Readers are instructed to rate an image with respect to the probability that it is malignant
• Ratings simulated for 25 healthy and 25 diseased images
[Figure: Gaussian score distributions for Disease (−) and Disease (+) cases.]
ROC Curves Depend on Relative Ranking
• ROC curves are invariant to monotone transformations
• Relative ranking is the key
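This invariance is easy to check numerically. A sketch with simulated Gaussian scores (an assumed model, in the spirit of the decision-variable illustration above):

```python
import math
import random

random.seed(0)
neg = [random.gauss(0.0, 1.0) for _ in range(200)]  # non-diseased scores
pos = [random.gauss(1.0, 1.0) for _ in range(200)]  # diseased scores

def auc(neg, pos):
    # empirical AUC = P(diseased score > non-diseased score),
    # counting ties as one half
    wins = sum(1.0 * (p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

a_raw = auc(neg, pos)
a_exp = auc([math.exp(x) for x in neg], [math.exp(x) for x in pos])
print(a_raw == a_exp)  # True: exp() is monotone, so rankings are unchanged
```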
[Figure: Gaussian decision-variable distributions.]
Complication
• A very large fraction of responses for certain detection tasks fall in the extreme ranges of the scale. (Gur et al.)
• A similar pattern is not uncommon in reader study results submitted to FDA.

Gur et al., “Binary” and “Non-Binary” Detection Tasks: Are Current Performance Measures Optimal? Acad Radiol 2007;14:871-876
[Figure: rating distributions for Disease (−) and Disease (+) cases.]
• Certain tasks that are binary in nature are better represented by a binary endpoint, both conceptually and statistically.
• In simulations, Gur et al. showed that a binary task is evaluated with less bias and variability if a binary scale rather than a continuous scale is used.
• For a task that is essentially binary, such as detecting microcalcifications, how rigorous can we expect relative rankings to be?

Gur et al., “Binary” and “Non-Binary” Detection Tasks: Are Current Performance Measures Optimal? Acad Radiol 2007;14:871-876
ROC Based Endpoints
• Good for comparing tests over all possible cutoffs
• Use information efficiently
• Following slides discuss details associated with ROC analyses
[Figure: ROC curves for the control modality and the CADe modality. The difference between the AUCs is the average difference in Se over all Sp.]
Comparable AUCs?
Depends on clinical context
Is all of the difference in AUC clinically relevant?
[Figure: ROC curves for the control modality and the CADe modality, with a context-dependent bound marking the clinically relevant region.]
• Possible to weight regions according to clinical relevance?
• Partial AUC?
• Use other device-specific criteria?
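As a sketch of what a standardized partial AUC computes, assume a binormal ROC model Se(fpr) = Φ(a + b·Φ⁻¹(fpr)); the parameters a and b below are invented for illustration.

```python
from statistics import NormalDist

nd = NormalDist()
a, b = 1.2, 1.0   # assumed binormal ROC parameters (illustrative)

def se_at(fpr):
    # binormal model: Se as a function of the false positive fraction
    return nd.cdf(a + b * nd.inv_cdf(fpr))

def pauc_std(lo, hi, steps=2000):
    # trapezoidal integral of Se over [lo, hi], divided by the interval
    # width so the result is comparable to a sensitivity (0 to 1)
    h = (hi - lo) / steps
    total = 0.5 * (se_at(lo) + se_at(hi))
    total += sum(se_at(lo + i * h) for i in range(1, steps))
    return total * h / (hi - lo)

p = pauc_std(0.1, 0.2)
print(round(p, 3))  # average Se over the chosen FPR band
```

Dividing by the interval width is one common standardization; it makes the quantity interpretable as an average sensitivity over the band.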
Thresholds (Se/Sp)
• Intuitive
  • Binary, similar to practice => work up or no?
  • Obviate adapting readers to unfamiliar rating scales
• Mimic reality
  • Same framework as Post-Market Information (spectrum bias still an issue)
Example: “Keep All Positives from Unaided Read” Rule
• Several 2nd reader CADe device labels require or imply that positive findings on the initial unaided read should not be negated by the CADe-aided read.
Endpoints Specific to Intended Use(“Keep All Positives from Unaided Read” rule)
• “Therefore, the radiologist’s work-up decision should not be altered if the system fails to mark an area that the radiologist has detected on the initial film review and has already decided requires further work-up. Nor should the decision be affected if the system marks an area that the radiologist decides is not suspicious enough to warrant further work-up, whether the area is detected by the radiologist on initial film review or only after being marked by the system.”
• From SecondLook label
• The radiologist should base interpretation only upon the original images and not depend on the CAD markers for interpretation.
• The device is a detection aid, not an interpretative aid. The CAD markers should be activated only after the first reading.
• The device does not identify all areas that are suspicious for cancer. Some lesions are not marked by the device, and a user should not be dissuaded from working up a finding if the device fails to mark that site.
• From R2 label
Applying “Keep All Positives from Unaided Read” Rule
• Se: un-aided to CADe-aided change must be NON-NEGATIVE
• Sp: un-aided to CADe-aided change must be NON-POSITIVE; bound the increase of FPF
[Figure: ROC space showing the unaided reader's operating point (Se, 1 − Sp) and the success* region for the aided read.]
*Biggerstaff, “Comparing diagnostic tests: a simple graphic using likelihood ratios,” Stat Med, 2000.
Image Sample Required for Comparing the Same Two ROC Curves Using Different Accuracy Measures
• Compare sample size needs for various measures
• Context: two specified ROC curves
  • Detectable change in AUC
  • Corresponding detectable change at given false positive rates (FPRs) or over given FPR intervals
Zhou, Obuchowski and McClish 2002, Statistical Methods in Diagnostic Medicine, Wiley & Sons, Inc. NY
Detectable Changes
[Figure: ROC plot marking Se at FPR = 0.2 and the partial AUC over 0.1 < FPR < 0.2, standardized by the interval width (0.2 − 0.1).]
Sample Size Efficiency

Measure of Accuracy                          Detectable Change   Ntotal (n+ = n−)
ROC AUC                                      0.100               278
Se (FPR = 0.01)                              0.108               930
Se (FPR = 0.10)                              0.201               482
Se (FPR = 0.20)                              0.276               382
PAUC(FPR < 0.1) / (FPR2 − FPR1)              0.167               722
PAUC(FPR < 0.2) / (FPR2 − FPR1)              0.182               522
PAUC(0.1 < FPR < 0.2) / (FPR2 − FPR1)        0.198               384

Adapted from Table 6.8, Zhou, Obuchowski and McClish 2002, Statistical Methods in Diagnostic Medicine, Wiley & Sons, Inc. NY
Not Uncommon Problem
• AUC difficult to interpret
• Post hoc PAUC as a rescue?
• Choosing a bound has type I error implications
• N for AUC may be too small to get useful PAUC or Se/Sp estimates
• Inadequate information => failed study
Endpoint Summary
• Sensitivity and specificity are more relevant than ROC AUC to the dichotomous decisions made in image reading.
• Drawbacks to using ROC analysis:
  • AUCs not always easy to interpret
    • Crossing curves
    • Comparable FPF regions
  • Is reader scoring representative of practice?
Endpoint Summary
• “So my comment is about CADe. I want to point out that the ROC, which is, of course, a wonderful device for assessing the process is not perfectly relevant from the clinical setting. The clinical setting, there is a particular algorithm cut-point and decision are dichotomous. And so one had ought to focus on specific points on the ROC curve. And it seems to me that it is essential that your– that companies show that they have improved sensitivity, which to me means statistical significance or Bayesian probability that the sensitivity is improved. This is a very low hurdle.”—D Berry, March 2008 Radiological Devices Advisory Panel meeting.
• Pepe, M.S., Urban, N., Rutter, C. and Longton, G. (1997) Design of a study to improve accuracy in reading mammograms, J Clin Epidemiol 50:1327-1338.
• Van Belle, G. 2002, Statistical Rules of Thumb, Wiley & Sons, Inc. NY (p 100)
Control Arm Discussion
• It is assumed that effectiveness or clinical utility can be shown by comparing unaided image reading to CADe-aided image reading.
• We formulated several questions for the panel concerning control arms for 510(k) review (substantial equivalence). The next slides provide some background.
Example Non-Inferiority Test
[Figure: number line with −δ and 0 marked; the difference (reader performance with new CADe) − (reader performance with predicate CADe) is compared against the margin.]
Success: the confidence interval for the difference in improvements lies entirely above a preset limit (−δ).
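A sketch of this comparison with invented numbers (the improvements, standard errors, and margin below are all assumptions, and the independence of the two estimates is itself an assumption that would not hold if both were measured on the same readers and images):

```python
from statistics import NormalDist

# improvement = (aided performance) - (unaided performance) in each study;
# noninferiority holds if the lower 95% CI bound of
# (improvement_new - improvement_predicate) exceeds -delta.
z = NormalDist().inv_cdf(0.975)        # ~1.96 for a two-sided 95% CI

impr_new, se_new = 0.06, 0.010         # assumed AUC gain, new CADe
impr_pred, se_pred = 0.05, 0.010       # assumed AUC gain, predicate
delta = 0.02                           # preset noninferiority margin

diff = impr_new - impr_pred
se_diff = (se_new**2 + se_pred**2) ** 0.5  # assumes independent estimates
lower = diff - z * se_diff

print(lower > -delta)  # True here: noninferiority is concluded
```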
Study Design #1
• Readers read a common set of images under three modalities:
  • Unaided reading
  • CADe-aided reading with the study device
  • CADe-aided reading with the predicate
  • Note: CADe-aided reading performed according to the label
• Randomize image order
• Washout periods between modalities
• Compare performance results
  • Unaided reading comparisons ensure clinical utility
  • A non-inferiority delta can be defined
Study Design #2
• Un-aided reading vs. CADe-aided reading:
  • Unaided reading
  • CADe-aided reading with the study device
• Randomize image order
• Washout periods between modalities
• Compare performance results to recorded predicate performance (label, prior study)
CADe SE Study Example
• Assume study design #2 from the previous slide
• Case mix:
  • Predicate study: difficult to detect
  • New device study: easy to detect
• Readers:
  • Predicate study: experienced specialists
  • New device study: minimally experienced
• Changes in performance (with CADe − without CADe) are similar in the two studies.
Changes In Performance Are Not Comparable Across Studies
• In design #2, the comparison across studies is confounded by spectrum bias and reader differences.
• Using such a design, comparing changes across enriched studies effectively reduces the question to whether or not the CADe device offers any increase in performance over unaided reading.
• With respect to performance, comparing across enriched studies invites imprecise or erroneous SE and NSE conclusions due to confounding (case mix, reader differences, others).
Example Non-Inferiority Test
[Figure: the same non-inferiority diagram, with the improvements for the new and predicate CADe devices estimated in the same study.]
• Given there is an improvement with CADe-aided reading over the reader alone
• Compared in the same study
• Success: the confidence interval for the difference in improvements lies entirely above a preset limit (−δ)
Standalone Studies
• Cannot show clinical utility because no reader is involved
• Standalone studies may be useful for comparing a CADe device to a previous version or for investigating the performance of the device without the reader.
• Example: studying a sample large enough to characterize all important strata (diseased and non-diseased cases) can provide useful label information.
Enriched Standalone Studies
• Suffer the same complications as reader studies with respect to sample enrichment:
  • Results are not generalizable across studies.
• Performance estimates apply only to the sample:
  • Samples are not simple random samples of the population.
  • They do not represent standalone performance in the population.
Reuse of Test Data (Standalone Studies)
• Some companies have proposed re-using test data in evaluating updated versions of CADes.
Multiplicity
• Multiple tests on the same data set will inflate type I error
• Sponsors must account for multiplicity
  • Example: Bonferroni correction
• Practical problem: choosing α for a “reuse” test if α = 0.05 was used for the first of several tests
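A trivial sketch of the accounting problem: under a Bonferroni split, the total α budget must be divided across all planned looks at the data, so spending the full 0.05 on the first test leaves nothing for later reuse (the number of looks below is an assumption for illustration).

```python
alpha_total = 0.05     # familywise type I error budget
k = 5                  # assumed number of planned looks at the data

# Bonferroni: split the budget evenly so the familywise error
# rate stays at or below alpha_total
alpha_per_test = alpha_total / k
print(alpha_per_test)  # each test run at the 0.01 level

# The practical problem: if the first test already used the full
# 0.05, no budget remains for later "reuse" tests.
remaining = alpha_total - 0.05
print(remaining)       # 0.0
```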
“Teaching to the Test”
• Each upgrade iteration on the same data could be considered training.
• Testing on training data => unreliable results (“teaching to the test”, “fitting to the noise”)
• This is in addition to multiplicity problems
• Difficult to quantify this bias.
Example of Overfitting
• Randomly generate a data set of 20 profiles having 6000 features each
• Arbitrarily assign each member to one of two classes
• Develop and evaluate classifiers using 3 processes:
  • Resubstitution method (teaching to the test): 1) build predictor on full data set; 2) reapply predictor to each specimen
  • Partial cross-validation: 1) leave one out; 2) build classifier on remaining data; 3) classify the left-out point; 4) repeat (total 20)
  • Nearly unbiased cross-validation

This example from Simon et al. illustrates the problems of overfitting in the context of developing algorithms for class prediction with gene expression data. The large number of features within relatively small samples makes this a good parallel to the situation faced by CADe developers.

Simon, Richard, Radmacher, D., Dobbin, K., McShane, L.M. Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification. Journal of the National Cancer Institute, Vol. 95, No. 1, January 1, 2003
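A small-scale simulation in the spirit of the Simon et al. example may make the effect concrete. The problem size is reduced here (300 features instead of 6000, an assumption to keep the sketch fast), and a simple nearest-centroid rule stands in for their classifiers:

```python
import random

# Random "profiles" with random class labels: there is nothing to learn,
# so an honest error estimate should be near 50%.
random.seed(1)
n, p, k = 20, 300, 10          # samples, features, features selected
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]  # arbitrary class assignment

def top_features(rows, labels):
    # rank features by absolute difference of class means
    scored = []
    for j in range(p):
        m0 = [r[j] for r, c in zip(rows, labels) if c == 0]
        m1 = [r[j] for r, c in zip(rows, labels) if c == 1]
        scored.append((abs(sum(m1) / len(m1) - sum(m0) / len(m0)), j))
    return [j for _, j in sorted(scored, reverse=True)[:k]]

def classify(x, rows, labels, feats):
    # assign to the nearer class centroid on the selected features
    def sqdist(c):
        members = [r for r, lab in zip(rows, labels) if lab == c]
        return sum((x[j] - sum(m[j] for m in members) / len(members)) ** 2
                   for j in feats)
    return 0 if sqdist(0) < sqdist(1) else 1

# Resubstitution ("teaching to the test"): select, fit, and test on all data
feats = top_features(X, y)
resub_err = sum(classify(x, X, y, feats) != c for x, c in zip(X, y)) / n

# Honest leave-one-out CV: feature selection redone inside every fold
cv_errors = 0
for i in range(n):
    Xtr, ytr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
    f = top_features(Xtr, ytr)
    cv_errors += classify(X[i], Xtr, ytr, f) != y[i]
cv_err = cv_errors / n

# Resubstitution reports an optimistically low error on pure noise.
print(resub_err, cv_err)
```

The key design point, as in Simon et al., is that feature selection must happen inside each cross-validation fold; selecting features once on all data and then cross-validating only the fit reproduces much of the bias.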
• Random data: expect ~½ to be misclassified
• Resubstitution method (teaching to the test): 98.2% of data sets had zero misclassifications
Added Review Questions
• Any variation of reusing data would raise many difficult review issues:
  • Data integrity / access controls: who has access to test data? When?
  • Theoretical basis for procedures: published method? Assumptions verifiable?
  • Selection bias: how were images chosen?
  • Type I error control
Using Only Standalone Data
• A change in marker style can affect reader behavior (Krupinski et al. 1992; Gilbert et al. 2008)
• Changes in prevalence affect reader behavior (Egglin et al. 1996)
• Deduce that changes in CADe mark placement or frequency could impact reader behavior.
• A change to the algorithm is a change to the device, and the device acts on the reader's diagnosis. It is difficult to know a priori which changes to an algorithm will produce a change in diagnostic performance.
Reader Studies Compared to Standalone Studies
• Reader studies investigate reader-device interaction.
• Standalone studies investigate only device performance.
Summary
• Endpoints for reader studies
  • Binary endpoint more relevant to the study question
  • Sample size for the appropriate endpoint
• Control arms for 510(k) reader studies
  • Is any improvement over un-aided reading adequate?
• Reuse of data
  • Teaching to the test
• Evaluating CADes without readers
  • Does not show clinical utility
  • Does not investigate the device under its intended use
Thank You