Confidence in neural networks: methodological issues arising from a review of safety-related applications
P.J.G. Lisboa
Computing and Mathematical Sciences
Liverpool John Moores University
Outline
Developments in commercial safety-related systems comprising artificial neural networks
Increasing demand for decision support e.g. healthcare
Where is the evidence of healthcare benefit from ANNs?
Framework for assuring confidence in neural networks
Design assurance; Risk analysis; Evidence of effectiveness
Methodological issues arising from the review
Fire alarm for office blocks
Siemens FP11
FirePrint Technology
Very high specificity
Commercial safety-related systems
Automotive industry
Tow-by-wire: NACT (http://www.mech.gla.ac.uk/~nact/nact.html)
Fuel injection: FAMIMO (http://iridia.ulb.ac.be/~famimo/)
Electronic ABS: H2C (http://www.control.lth.se/H2C/)
Lisboa, P.J.G. ‘Industrial use of safety-related artificial neural networks’ HSE CR, 2001 http://www.hse.gov.uk/crr_pdf/crr01327.pdf
Papnet
Cytology screening
FDA approved for secondary screening
Proven sensitivity
Specificity left to user
Cost-effective
In the US 44,000 - 98,000 preventable deaths attributed to medical errors (Weingart et al, BMJ 2000)
Exceeds the combined toll from motor crashes, suicides, falls, poisonings and drowning
Epidemiology of medical error
Managing error
The just-world hypothesis
Systemic approaches
(Reason, BMJ 2000)
Oncology
Critical care
Cardiology
Other
Diagnosis and staging
Prostatic cancer (2). Cervical cancer. Breast cancer. Acute leukemia.
Intracranial haemorrhage in neonates.
Acute Myocardial Infarction (2).
Appendicitis.
Outcome prediction
Response to therapy in head & neck cancer. Recurrence of breast cancer in axillary node-negative patients.
Length-of-stay in preterm neonates.
Tacrolimus blood levels. Effect of treatment in schizophrenia and depression. Rib fracture injury.
Radiology
MRI of osteosarcoma.
Perfusion scintigraphy for detection of coronary stenosis. Doppler microembolic signal counts in patients with prosthetic heart valves.
Physiological monitoring
Fetal surveillance during labour from fetal ECG.
Randomised Controlled Trials
Lisboa, P.J.G. ‘A review of evidence of health benefit from artificial neural networks in medical intervention’, Neural Networks, 15, 1, 9-37,2002.
Clinical function
Oncology
Critical care
Cardiology
Neurology
Other
Diagnosis and staging
Cervical cancer (3). Pre-cancerous breast.
Transient ischaemia. Acute ischaemia.
Embolus detection in stroke. Spontaneous EEG. Sleep EEG (2). Quantitative EEG. Ventilation mode recognition.
Referral methods for patients with third molars. Bladder outlet obstruction. Tear protein patterns. Haemodialysis. Ovulation time. Pure tone thresholds.
Outcome prediction
Multiple myeloma.
Stone growth after lithotripsy.
Radiology
Myocardial perfusion images. Detection of stenoses from Doppler u/s waveforms.
MRS of epilepsy. PET of 5-HT reuptake sites. PET in Alzheimer’s.
MRS of muscle.
Physiological monitoring
EEG in Pediatrics.
Single trial PVEP (2). Correlation of EEG and MEG. Lorazepam and sleep EEG. Evoked potentials in multiple sclerosis.
Oxygenation in infants. EGG of gastric emptying (2). Subcutaneous adipose tissue. Nonstress tests in obstetrics. Bone demineralization.
Clinical Trials
Reference | No. of subjects | Clinical function | Performance assessment | Results
Prostatic cancer
Gamito et al, 2000
4,133 Prediction of risk of lymph node spread (LNS) from age, race, PSA, PSA velocity, Gleason sum and TNM
External validation (n=660)
98% accuracy in detection of low risk of LNS with a MLP
Cervical cancer
Prismatic team, 1999
NNA Assessment as a primary screening tool for categorization of cervical smears as negative, mild, moderate or severe dyskaryosis, invasion, glandular neoplasia and borderline nuclear changes
External validation (n=21,700)
89.9% agreement across all classes was found between PAPNET and conventional primary screening. Similar sensitivity (82 cf. 83%), with PAPNET having improved specificity (77 cf. 42%) and faster processing (3.9 min. cf. 10.4 min.)
Doornewaard et al 1999
NNA Assessment as a primary screening tool for the early detection of cervical dysplasia
External validation (n=6,063)
PAPNET testing has similar diagnostic value to conventional screening of Pap smears, with AUROC 95% CIs of 78-82% for control and 77-81% for PAPNET
Mango et al, 1998
NNA Comparison of yield in re-screening of node-negative PAP smears between NNA and conventional unassisted cytology
External validation (n=10,000)
PAPNET returned a yield of 6.2% versus 0.6% for manual re-screening
Reference | No. of subjects | Clinical function | Performance assessment | Results
Neonates
Zernikow et al, 1999
2,144 Predicting length-of-stay in preterm neonates from 40 first-day-of-life items
Train/test/validation
First-day-of-life data is predictive of length-of-stay of pre-term neonates with correlation CIs of 0.85-0.90 for MLR and 0.87-0.92 for MLP
Ischaemia
Polak et al, 1997
1,367 Prediction of transient ischaemia during ambulatory Holter monitoring, from a resting 12-lead ECG. Univariate t-tests were used to inform model selection
Train/test
LDA and adaptive logic networks were superior to the MLP in predicting the likelihood of ischaemic episodes
Selker et al, 1995
3,453 Clinical indicators available within 10 minutes of emergency department care were used to predict AMI and unstable angina pectoris, in a real-time clinical setting
External validation (n=2,320)
Limiting the inputs to 8 readily available variables, AUROCs for LogR, CART and MLP were 0.887, 0.858 and 0.902, respectively. Each is a clinically useful predictor of clinical outcome
Is there evidence of clinical benefit?
Clinician performance patient outcome
Primary to secondary care referrals of patients with third molars:
                                   Sens.  Spec.  Acc.
1) Control group                   0.97   0.22   0.83
2) Paper-based clinical algorithm  0.56   0.93   0.73
3) MLP-based recommendation        0.56   0.79   0.67
Which is the best performing system?
$\frac{\text{Sensitivity}}{1 - \text{Specificity}}$: 1) 1.2  2) 8.0  3) 2.7
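The three ratios quoted for the third-molar referral systems can be checked directly from the sensitivities and specificities listed on the slide. The short sketch below does that arithmetic; the function name `positive_likelihood_ratio` is only an illustrative label and does not come from the original slides.

```python
# Sensitivity and specificity as quoted on the slide for the three referral systems.
systems = {
    "Control group": (0.97, 0.22),
    "Paper-based clinical algorithm": (0.56, 0.93),
    "MLP-based recommendation": (0.56, 0.79),
}


def positive_likelihood_ratio(sensitivity, specificity):
    """Return sensitivity / (1 - specificity), i.e. true-positive rate over false-positive rate."""
    return sensitivity / (1.0 - specificity)


# Round to one decimal place, the precision used on the slide.
ratios = {
    name: round(positive_likelihood_ratio(se, sp), 1)
    for name, (se, sp) in systems.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```

On this measure the paper-based algorithm ranks first even though the control group has the highest raw accuracy, which is the point the slide's question is driving at.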
Performance estimation
ROC framework
Boost factor:
PPV = True positives/Predicted positives
$$\frac{PPV}{1 - PPV} = \frac{\text{Sensitivity}}{1 - \text{Specificity}} \times \frac{\Pr(\text{class})}{\Pr(\sim\text{class})}$$
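As a rough illustration of how PPV depends on class prevalence as well as on sensitivity and specificity, the sketch below computes PPV through the odds form (posterior odds = sensitivity / (1 - specificity) times prior odds) and cross-checks it against the count-based definition. The cohort size, prevalence and operating point are hypothetical values chosen only for the example.

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """PPV via the odds form: posterior odds = boost factor * prior odds."""
    boost = sensitivity / (1.0 - specificity)
    prior_odds = prevalence / (1.0 - prevalence)
    posterior_odds = boost * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


def ppv_from_counts(true_positives, false_positives):
    """PPV = true positives / predicted positives."""
    return true_positives / (true_positives + false_positives)


# Hypothetical cohort: 1000 cases, 10% prevalence, sensitivity 0.80, specificity 0.90.
n_cases = 1000
prevalence = 0.10
sensitivity = 0.80
specificity = 0.90

n_positive = n_cases * prevalence                   # 100 cases in the class
n_negative = n_cases - n_positive                   # 900 cases outside the class
true_positives = sensitivity * n_positive           # 80 correctly flagged
false_positives = (1.0 - specificity) * n_negative  # 90 incorrectly flagged

ppv_odds = ppv_from_rates(sensitivity, specificity, prevalence)
ppv_counts = ppv_from_counts(true_positives, false_positives)
print(f"PPV from odds form: {ppv_odds:.3f}")
print(f"PPV from counts:    {ppv_counts:.3f}")
```

Both routes give 80/170, so under these assumed numbers fewer than half of the predicted positives are true positives despite a specificity of 0.90.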
Predictive models
Deficiencies in standard modelling methods:
(Altman & Royston, Stat. Med. 2000)
1. Overoptimistic assessment of predictive performance
2. Multiple regression using stepwise variable selection
3. EPV (events per variable) < 10 (samples/parameters)
4. Case-mix (cohort variations)
5. External evaluation (protocol changes)
• Retrospective vs. Temporal
• Prospective vs. External
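The EPV criterion in item 3 can be checked before any model is fitted. The sketch below is one way to do so, assuming the common convention that EPV counts cases of the rarer outcome per fitted parameter (the slide glosses it more loosely as samples/parameters). The helper names and the cohort figures are hypothetical.

```python
def events_per_variable(n_events, n_nonevents, n_parameters):
    """EPV under the assumed convention: rarer outcome count divided by parameter count."""
    return min(n_events, n_nonevents) / n_parameters


def mlp_parameter_count(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases of a single-hidden-layer MLP."""
    return n_inputs * n_hidden + n_hidden + n_hidden * n_outputs + n_outputs


# Hypothetical cohort: 60 events and 540 non-events, 8 candidate inputs.
n_events, n_nonevents, n_inputs = 60, 540, 8

# Logistic regression with one coefficient per input.
epv_logistic = events_per_variable(n_events, n_nonevents, n_inputs)

# MLP with 3 hidden units on the same inputs.
n_mlp_params = mlp_parameter_count(n_inputs, n_hidden=3)
epv_mlp = events_per_variable(n_events, n_nonevents, n_mlp_params)

for label, epv in [("logistic regression", epv_logistic), ("MLP", epv_mlp)]:
    flag = "below 10" if epv < 10 else "at least 10"
    print(f"{label}: EPV = {epv:.2f} ({flag})")
```

With these assumed numbers both models fall below the threshold, and the MLP does so by a wide margin because its parameter count grows with the number of hidden units.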
Continuum of inference models
[Figure: map of inference models]
Axes: numeric to numeric, numeric to symbolic, symbolic to symbolic; unsupervised, supervised
Groupings: signal processing, statistical methods, neural networks; data driven, knowledge driven
Methods shown: FFT, wavelets, independent components analysis, logistic regression, kernel methods inc. SVM, radial basis functions, multi-layer perceptron, ART, SOM/GTM, CART, rule induction, axiomatic, production rules, fuzzy logic, reinforcement learning, k-means clustering, rule extraction, control
Software life-cycle
Requirements analysis & specification
Functional specification & data requirements
Model design
Implementation & training
Test of model predictions
Integration test
System evaluation
Unit level
Integration level
Systems level
Risk analysis
Extract from the FDA guidelines
The continuum of evidence (Drug development)
Phases: Pre-clinical: Rationale; Phase I: Modelling; Phase II: Exploratory trial; Phase III: Definitive trial; Phase IV: Follow-up
Evidence: Animal models; Healthy humans; Clinical trials; RCT; Post-market surveillance
The continuum of evidence (Campbell et al, 2000)
Phases: Pre-clinical: Rationale; Phase I: Modelling; Phase II: Exploratory trial; Phase III: Definitive trial; Phase IV: Follow-up
Evidence: Methodology; Retrospective studies; Prospective studies; RCT; Post-market surveillance
Ph I: Theory
Regularisation framework
Ph II: Performance optimisation
Complexity control
Ph III: Generalisation
HAZOP/FMEA
RCT: case-control study
Clinical effectiveness
Medical Devices Directives
Essential requirements
Performance validation
Doctrine of Substantially Equivalent Products
Model evaluation
Requirement for Learned Intermediaries.
Risk assessment
Procedure for post-marketing surveillance
H & S requirements
The continuum of evidence
Methodological issues arising
Confidence: Data; Regularisation; Calibration
Transparency: Rule-extraction; Linear-in-the-parameters statistical inference; Fuzzy or rule-based supervisory models
Effectiveness: Performance estimation
Reliability: Novelty-detection; Generalisation
Performance estimation
Power calculations
Bootstrap CI: [equation garbled in the source; the surviving symbols are C_train, C_test and AUROC]
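Since the slide's own bootstrap expression is not legible here, the following is only a generic sketch of a percentile bootstrap confidence interval for the AUROC, not a reconstruction of the author's formula. It uses the Python standard library only. The synthetic scores, the seeds and the function names are all assumptions made for illustration.

```python
import random


def auroc(labels, scores):
    """Mann-Whitney estimate of P(positive score > negative score); ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample cases with replacement, recompute the AUROC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic test set: 50 negatives scored around 0 and 50 positives scored around 1.
data_rng = random.Random(1)
labels = [0] * 50 + [1] * 50
scores = [data_rng.gauss(0.0, 1.0) for _ in range(50)]
scores += [data_rng.gauss(1.0, 1.0) for _ in range(50)]

point_estimate = auroc(labels, scores)
ci_lower, ci_upper = bootstrap_auroc_ci(labels, scores)
print(f"AUROC = {point_estimate:.3f}, 95% CI = ({ci_lower:.3f}, {ci_upper:.3f})")
```

The resampling is done on the held-out test cases only; estimating optimism from training and test indices together, as the surviving symbols on the slide suggest, would need a different scheme.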
Rule extraction
Axis parallel boxes & network pruning
Rule extraction
[Figure: Surface Map and Variance Map]
Lisboa, P.J.G., Etchells, T.A and Pountney, D.C. ‘Minimal MLPs do not model the XOR logic’ Neurocomputing, Rapid communication, 48, 1-4, 1033-1037, 2002.
Axis parallel boxes & network pruning
Good practice
Embodying a safety-culture
Specification
• Statistically significant vs. clinically useful
Transparency
• Verify against clinical prior knowledge
HAZOP & FMEA
• Reliability - novelty detection?
• Maintainability - incremental learning?
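To make the novelty-detection point concrete, the sketch below flags an input as novel when any feature lies more than a chosen number of standard deviations from its training mean. This per-feature z-score test is a deliberately simple stand-in, not the method used in the work reviewed. The threshold, the function names and the data are hypothetical.

```python
import math


def fit_reference(training_rows):
    """Per-feature mean and sample standard deviation of the training data."""
    n = len(training_rows)
    n_features = len(training_rows[0])
    means = [sum(row[j] for row in training_rows) / n for j in range(n_features)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in training_rows) / (n - 1))
        for j in range(n_features)
    ]
    return means, stds


def is_novel(row, means, stds, max_z=3.0):
    """True if any feature is more than max_z standard deviations from its training mean.

    Assumes no feature is constant in the training data (all stds are non-zero).
    """
    return any(abs(x - m) / s > max_z for x, m, s in zip(row, means, stds))


# Hypothetical two-feature training set.
training_rows = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 13.0], [1.5, 9.0]]
means, stds = fit_reference(training_rows)

typical_case = [2.0, 11.0]
unusual_case = [9.0, 11.0]
print("typical case novel?", is_novel(typical_case, means, stds))
print("unusual case novel?", is_novel(unusual_case, means, stds))
```

In a deployed system a flagged case would be passed to a human or to a fallback rule rather than scored by the network, which is the reliability role the slide assigns to novelty detection.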
Summary
Assuring confidence in complex systems is not an issue specific to neural networks but applies to all inference systems
A framework can be constructed based on a life-cycle model of safety-related software:
Good practice in the design of data-based models
Need to switch between evidence-based and knowledge-based cultures for V&V (verification and validation).