14 June [email protected]
Medical Informatics: University of Ulster
Paul McCullagh
University of Ulster
10 June 2005
Ulster Institute of eHealth
www.uieh.n-i.nhs.uk
Stroke Web Interface
Features:
• Animation feedback to patient: 3D rendering of patient movement during rehabilitation
• Communication tools for patients and professionals
• Decision support
Home-based system is currently undergoing development
Application Of Multimedia To Nursing Education: A Case Study Based On The Diagnosis Of Alcohol Abuse
• Culture of Binge Drinking
• Multimedia as an Education Tool
• Interactive Learning
• Self Assessment
• Exemplars From Other Areas Including: Diabetes, Testicular Cancer, Anorexia
Diabetes Education
Type I & Type II innovative multimedia patient-centred education materials
Intelligent Consultant: Interface
Natural Language Processing
Texture Analysis and Classification Methods for the Characterisation of Ultrasonic Images of the Placenta
[Diagram: Image or Region in an Image → Feature Extraction → Classification → Labeled Image]
The Grannum Classification
Age Related Macular Disease
[Figure panels: Edge Detection, Line Thinning, Threshold segmentation]
Full Screen Image of the Application displaying an Occult lesion with edge detection, a pixel count and histogram of the region inside the box
Body Surface Potential Mapping
Observations Driving Lead System Development
• Electrophysiology associated with disease is most often localized in the heart: infarction, ischemia, accessory A-V pathways, ectopic foci, “late potentials”, conduction or repolarization abnormalities
• ECG manifestations of localized disease are localized on the body surface
• Clinical lead systems are not optimized for diagnostic information capture and often do not sample regions where diagnostic information occurs
Mining for Information In Body Surface Potential Maps
• Lead selection
  - Best leads for estimating all surface potentials
  - Best leads for diagnostic information
  - Are these the same?
• Data mining techniques
  - Wrappers
  - Sequential selection
Hearing Screening
• Normal hearing five peak response using high stimulus level (70 dB)
• Peak amplitudes reduce as stimulus level is reduced
• Only wave V remains at threshold, at about 30 dB stimulus level
Clementine data mining software used to generate neural network and decision tree models for classification of ABR waveforms
The individual models will make use of:
• time domain data
• frequency domain data
• correlation of subaverages
Wavelet Decomposition
[Figure: pre-stimulus data and post-stimulus data, 256 coefficients]
Analysis of 16 D4 Coefficients
[Figure: plot of the 16 D4 coefficients]
Ratio of the sum of absolute values: Pre-stimulus / Post-stimulus
The closer the ratio is to 0, the higher the probability of a response
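The ratio described on this slide can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a Haar wavelet (the slides do not name the wavelet family), 256-sample pre-stimulus and post-stimulus windows, and that "D4" means the level-4 detail coefficients, which gives 256 / 2^4 = 16 values. The function names and the synthetic signals are hypothetical.

```python
import math

def haar_step(signal):
    """One level of the Haar wavelet transform: (approximation, detail)."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def detail_coefficients(signal, level):
    """Detail coefficients at the requested level (D4 when level=4)."""
    approx, detail = list(signal), []
    for _ in range(level):
        approx, detail = haar_step(approx)
    return detail

def pre_post_ratio(pre, post, level=4):
    """Sum of |D4| before the stimulus divided by the sum of |D4| after it."""
    pre_sum = sum(abs(c) for c in detail_coefficients(pre, level))
    post_sum = sum(abs(c) for c in detail_coefficients(post, level))
    return pre_sum / post_sum

# Synthetic example: weak background activity before the stimulus,
# a ten times larger oscillation after it (assumed data, not ABR recordings).
pre = [0.1 * math.sin(2 * math.pi * i / 16) for i in range(256)]
post = [1.0 * math.sin(2 * math.pi * i / 16) for i in range(256)]

d4 = detail_coefficients(post, 4)
ratio = pre_post_ratio(pre, post)
```

A ratio well below 1 means the post-stimulus window carries more energy at that scale than the background, which is the situation the slide associates with a response.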
CBR for wound healing progress
Objective of research – Automated wound healing monitoring and assessment
• Determine size of whole wound
• Determine tissue types present
• Coverage of different types of tissues
• Automatically monitor healing over time
• Remove subjectivity
• Improve decision making process and care
Technologies used
• Case-based reasoning
• Feature extraction/transformation
Work to date – classification for tissue types
Method
• Take an image and overlay a grid on it
• Make a prediction for each type of tissue
• Prediction is based on the system's knowledge of previous tissue types (cases) that have been identified by professionals
Overall accuracy – 86%
Publications
• Zheng, Bradley, Patterson, Galushka, Winder, “New Protocol for leg ulcer tissue classification from colour images”, Proc. 26th Int. Conf. of Engineering in Medicine and Biology Society (EMBS 04)
• Galushka, Zheng, Patterson, Bradley, “Case-Based Tissue Classification for Monitoring Leg Ulcer Healing”, 18th IEEE Int. Symposium on Computer-Based Medical Systems (CBMS 2005)
[Diagram: Analysis, Prediction, Comparison]
Feature Selection and Classification on Type 2 Diabetic Patients’ Data
Paul McCullagh
University of Ulster
Diabetes
• World’s situation
  - Around 194 million people with diabetes according to a WHO study
  - 50% of patients are undiagnosed
• Northern Ireland
  - 49,000 diagnosed patients in NI
  - Another 25,000 are unaware of their condition
• Type 2 diabetes (NIDDM)
  - Diabetic complications
  - Blood glucose control
  - HbA1c test
Data Mining
• Large amounts of information gathered in medical databases
• Traditional manual analysis has become inadequate
• Efficient computer-based analysis is indispensable
• Noisy, incomplete and inconsistent data
Can we determine factors which influence how well the patients progress?
Are these factors under our control?
Relative Risk for the Development of
Diabetic Complications
Source: Rahman Y, Nolan J and Grimson J. E-Clinic: Re-engineering Clinical Care Process in Diabetes Management.
Healthcare Informatics Society of Ireland, 2002
North Down Primary Care Organisation
Quality of data in Primary Care Data Sets: HbA1c, Fundoscopy
[Bar chart: Percentage Recording of HbA1c For Diabetic Patients in Practices; Practice (1 to 11) against Percentage % (0 to 100)]
[Bar chart: Percentage Recording Of Fundoscopy For Diabetic Patients in Practices; Practice (1 to 11) against Percentage % (0 to 100)]
Data Set
• Ulster Community & Hospitals Trust
• 2064 type 2 patients, 20876 records
• 1148 males, 916 females
• 410 features reduced to 47 relevant features: 23 categorical, 24 numerical
• Average 7.8% missing values
[Bar chart: Distribution of Patients' Age; age groups 20-60, 60-70, 70-80, >80; bar labels 563, 637, 579, 238]
Research Goals
• Identify significant factors that influence Type 2 diabetes control
  - Weight, Smoking status or Alcohol?
  - Height, Age or Gender?
  - Time interval between two tests?
  - Cholesterol level?
• Classify individuals with poor disease control in the population
  - Distinguish patients with poor blood glucose control from patients with good blood glucose control based on physiological and examination factors
  - Predict individuals in the population with poor diabetes control status based on physiological and examination factors
• Investigate the potential of data mining techniques in a ‘real world’ medical domain and evaluate different data mining approaches
Data Mining Procedure
[Diagram: Raw Data → Pre-Processing (Clean Data, Feature Selection) → Target Data → Data Mining Schemes → Model/Patterns → Post-Processing → Knowledge]
Data Preprocessing
• Data Integration
  - Combine data from multiple sources into a single data table
• Data Transformation
  - Divide the patients into 2 categories (Better Control and Worse Control) based on a comparison of the laboratory HbA1c value with the target value
  - Better: 34.33%; Worse: 65.67%
• Data Reduction
  - Remove the attributes with more than 50% missing data
  - Keep the features recommended by the diabetic expert and the international diabetes guidelines
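The transformation and reduction steps above can be sketched in a few lines. This is an illustration under stated assumptions: the target HbA1c value of 7.0, the record layout, the attribute names and the rule "at or below target means Better" are all hypothetical, since the slides do not give them.

```python
def label_control(hba1c, target):
    """'Better' when the laboratory HbA1c meets the target, otherwise 'Worse'."""
    return "Better" if hba1c <= target else "Worse"

def drop_sparse_attributes(records, max_missing=0.5):
    """Remove attributes whose share of missing (None) values exceeds max_missing."""
    names = list(records[0].keys())
    keep = [
        name for name in names
        if sum(r[name] is None for r in records) / len(records) <= max_missing
    ]
    return [{name: r[name] for name in keep} for r in records]

# Toy records (assumed layout); None marks a missing value.
records = [
    {"hba1c": 6.5, "bmi": 27.0, "rare_test": None},
    {"hba1c": 8.2, "bmi": None, "rare_test": None},
    {"hba1c": 9.1, "bmi": 31.5, "rare_test": 1.2},
    {"hba1c": 7.4, "bmi": 29.0, "rare_test": None},
]
TARGET = 7.0  # hypothetical target value, not taken from the slides

reduced = drop_sparse_attributes(records)
labels = [label_control(r["hba1c"], TARGET) for r in reduced]
```

Here `rare_test` is missing in 75% of the records and is dropped, while `bmi` (25% missing) is kept for later imputation.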
Feature Selection
• Identify and remove irrelevant and redundant information
  - Not all attributes are actually useful
  - Noisy, irrelevant and redundant attributes
• Minimize the associated measurement costs
• Improve prediction accuracy
• Reduce the complexity
• Easier interpretation of the classification results
Feature Selection
• Information Gain: delete the less informative attributes; also adopted in ID3 and C4.5 as the splitting criterion during the tree growing procedure
  - A measure based on entropy
• Relief: estimate attributes according to how well their values distinguish among instances that are near each other
  - An instance-based attribute ranking scheme
  - Randomly sample an instance I from the data
  - Locate I’s nearest neighbour from the same and the opposite class
  - Compare them and update relevance scores for each attribute
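The entropy-based measure behind Information Gain can be written out as a short sketch. It handles categorical attributes only, and the attribute names and toy values are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on a categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data (assumed): one attribute separates the classes, the other does not.
labels = ["Worse", "Worse", "Better", "Better"]
insulin = ["yes", "yes", "no", "no"]
smoker = ["yes", "no", "yes", "no"]

gain_insulin = information_gain(insulin, labels)
gain_smoker = information_gain(smoker, labels)
```

Ranking attributes by this score and keeping the top few is one way to obtain feature subsets of different sizes.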
Top 15 predictors
1 Age
2 Diagnosis Duration
3 Insulin Treatment
4 Family History
5 Smoking
6 LabRBG
7 Diet Treatment
8 BMI
9 Glycosuria
10 Complication Type
11 BP Diastolic
12 Tablet Treatment
13 LabTriglycerides
14 General Proteinuria
15 BPSystolic
Design Experiments
• Classification Algorithms
  - Naïve Bayes – a statistical method for classification
  - IB1 – instance-based nearest neighbour algorithm
  - C4.5 – inductive learning algorithm using decision trees
• Sampling strategy: 10-fold cross validation
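As an illustration of the sampling strategy, the sketch below runs 10-fold cross validation around a plain 1-nearest-neighbour classifier in the spirit of IB1. It is not the tool used in the study: the folds are not stratified, the distance is unweighted Euclidean, and the data are synthetic.

```python
import math
import random

def ib1_predict(train, query):
    """1-nearest-neighbour: return the label of the closest training case."""
    nearest = min(train, key=lambda case: math.dist(case[0], query))
    return nearest[1]

def cross_validate(data, k=10, seed=0):
    """Accuracy of the 1-NN classifier estimated by k-fold cross validation."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k disjoint folds
    correct = 0
    for i, test_fold in enumerate(folds):
        train = [case for j, fold in enumerate(folds) if j != i for case in fold]
        correct += sum(ib1_predict(train, x) == y for x, y in test_fold)
    return correct / len(data)

# Synthetic data (assumed): two well-separated groups of 20 cases each.
data = [((i * 0.1, 0.0), "Better") for i in range(20)]
data += [((10 + i * 0.1, 0.0), "Worse") for i in range(20)]

accuracy = cross_validate(data, k=10)
```

Each case is used for testing exactly once and for training in the other nine rounds, so the accuracy is averaged over the whole data set.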
Classification Results - initial
Table 1: Classification accuracy (%) for Different Sizes of Feature Subsets (10-CV/Training and Testing)
Attribute Number | Naïve Bayes | IB1   | C4.5  | Discretized C4.5 | Average
5                | 69.36       | 69.14 | 76.36 | 75.23            | 72.52
8                | 74.60       | 70.49 | 76.12 | 75.76            | 74.24
10               | 72.47       | 71.54 | 77.21 | 77.46            | 74.67
15               | 72.92       | 70.37 | 78.73 | 78.12            | 75.04
20               | 71.48       | 69.30 | 76.42 | 76.73            | 73.48
25               | 69.24       | 67.88 | 77.52 | 77.75            | 73.10
30               | 70.53       | 67.78 | 77.43 | 77.52            | 73.32
47               | 62.35       | 63.44 | 75.38 | 76.37            | 69.39
Average          | 70.37       | 68.74 | 76.90 | 76.87            | -
CA Based on 10-CV
[Line chart: Classification Accuracy (62 to 82) against Number of Features (5, 8, 10, 15, 20, 25, 30, 47) for Naïve Bayes, IB1, C4.5, Discretized C4.5 and Average]
Sensitivity and Specificity - initial
Entries are sensitivity/specificity.
Attribute Number | Naïve Bayes | IB1         | C4.5        | Discretized C4.5
5                | 0.912/0.276 | 0.892/0.306 | 0.947/0.413 | 0.938/0.397
8                | 0.921/0.411 | 0.883/0.365 | 0.951/0.398 | 0.942/0.405
10               | 0.782/0.615 | 0.907/0.349 | 0.962/0.409 | 0.957/0.426
15               | 0.631/0.781 | 0.912/0.306 | 0.973/0.432 | 0.987/0.387
20               | 0.685/0.772 | 0.838/0.416 | 0.940/0.428 | 0.963/0.393
25               | 0.656/0.762 | 0.821/0.407 | 0.932/0.475 | 0.972/0.405
30               | 0.708/0.700 | 0.835/0.377 | 0.935/0.467 | 0.955/0.431
47               | 0.587/0.693 | 0.810/0.298 | 0.928/0.421 | 0.964/0.381
Average          | 0.735/0.625 | 0.862/0.353 | 0.946/0.430 | 0.960/0.403
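For reference, sensitivity and specificity can be computed from predictions as sketched below. The choice of "Worse" control as the positive class is an assumption (the slides do not state it), and the labels are toy data.

```python
def sensitivity_specificity(actual, predicted, positive="Worse"):
    """Return (sensitivity, specificity) for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example (assumed data): 4 Worse and 2 Better patients.
actual = ["Worse", "Worse", "Worse", "Worse", "Better", "Better"]
predicted = ["Worse", "Worse", "Worse", "Better", "Worse", "Better"]

sens, spec = sensitivity_specificity(actual, predicted)
```

A classifier with high sensitivity and low specificity under this convention finds most poorly controlled patients but also flags many well controlled ones.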
Discussion
• The C4.5 decision tree algorithm had the best performance for classification
• Discretization did not improve the performance of C4.5 significantly on our data set
• On average, the best results were achieved when the top 15 attributes were selected for prediction
• IB1 and Naïve Bayes did benefit from the reduction of the input parameters, C4.5 less so
• Naïve Bayes can classify both patient groups with a reasonable accuracy
• Most classifiers tend to perform better at identifying the bad control cases in the population
Relief Algorithms
• A feature weight-based method inspired by instance-based learning algorithms
• Key idea of original Relief
  - Estimate the quality of attributes according to how well their values distinguish among instances that are near to each other
  - Does not make the assumption that the attributes are conditionally independent
• ReliefF (Kononenko, 1994): the extension of Relief
  - Applicable to multi-class data sets
  - Tolerant to noisy and incomplete data
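The key idea of the original Relief can be sketched as below. This covers only the two-class case with numeric attributes already scaled to [0, 1]; it is not ReliefF, and the Manhattan distance, the sample count m and the toy data are assumptions.

```python
import random

def relief(X, y, m, seed=0):
    """Original two-class Relief for numeric attributes scaled to [0, 1]."""
    rng = random.Random(seed)
    n_attr = len(X[0])
    weights = [0.0] * n_attr

    def distance(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))

    for _ in range(m):
        i = rng.randrange(len(X))  # randomly sampled instance
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(X[i], X[j]))      # nearest same-class case
        miss = min(misses, key=lambda j: distance(X[i], X[j]))   # nearest other-class case
        for a in range(n_attr):
            # Reward attributes that differ on the miss, penalise those that differ on the hit.
            weights[a] += (abs(X[i][a] - X[miss][a]) - abs(X[i][a] - X[hit][a])) / m
    return weights

# Toy data (assumed): attribute 0 separates the classes, attribute 1 is noise.
X = [(0.0, 0.3), (0.1, 0.9), (0.2, 0.5), (0.8, 0.4), (0.9, 0.8), (1.0, 0.6)]
y = [0, 0, 0, 1, 1, 1]

weights = relief(X, y, m=20)
```

Because each update looks at whole neighbouring instances rather than one attribute at a time, the score does not rely on attributes being conditionally independent.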
Optimization of ReliefF
• Data transformation
  - Frequency based encoding scheme
  - Represent the categorical code of a particular variable with a numerical value derived from its relative frequency among outcomes
• Supervised Model Construction for Starter Selection
  - Generate the number of instances (m) automatically, eliminating the dependency on the selection of a “good value” for m, to improve the efficiency of the algorithm
  - Basic idea: group the “near” cases with the same class label
  - Similarity measurement: Euclidean distance function
  - Repeated until an instance with a different class label is encountered
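One possible reading of the frequency based encoding scheme is sketched below: each category is replaced by the share of its cases that have a chosen outcome. The slides do not give the exact formula, so this interpretation, the function name and the toy data are assumptions.

```python
from collections import Counter

def frequency_encode(values, outcomes, positive):
    """Map each category to the share of its cases that have the positive outcome."""
    totals = Counter(values)
    positives = Counter(v for v, o in zip(values, outcomes) if o == positive)
    mapping = {v: positives[v] / totals[v] for v in totals}
    return [mapping[v] for v in values], mapping

# Toy data (assumed): a categorical treatment attribute and the control outcome.
treatment = ["insulin", "insulin", "tablet", "tablet", "tablet", "diet"]
outcome = ["Worse", "Worse", "Worse", "Better", "Better", "Better"]

encoded, mapping = frequency_encode(treatment, outcome, positive="Worse")
```

After encoding, categorical attributes become numeric, so a Euclidean distance function can be applied to every attribute.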
Feature Selection via Supervised Model Construction
• Improve efficiency
• Retain accuracy
• Centre is a ‘good’ representation of cluster
• Scope of local region?
Experiment Design
• C4.5 as the classification algorithm
• Nine benchmark UCI data sets
  - Number of cases varies from 57 to 8,124
  - Contains a mixture of nominal and numerical attributes
• 10-fold Cross Validation
• InfoGain and ReliefF were used for comparison
Number of Selected Attributes
Data Set | Cases | After FSSMC | Reduction Rate (%)
Breast   | 699   | 45          | 93.6
Credit   | 690   | 159         | 77.0
Diabetes | 768   | 240         | 68.8
Glass    | 214   | 80          | 62.6
Heart    | 294   | 39          | 86.7
Iris     | 150   | 13          | 91.3
Labour   | 57    | 10          | 82.5
Mushroom | 8124  | 89          | 98.9
Soybean  | 683   | 109         | 84.0
C4.5: Processing Time (in sec.) / Classification Accuracy (%)
Data Set | Before FS | InfoGain  | ReliefF   | FSSMC
Breast   | 0/94.6    | 0.16/94.8 | 2.05/95.3 | 0.15/95.3
Credit   | 0/86.4    | 1.15/86.7 | 3.58/86.4 | 1.20/86.8
Diabetes | 0/74.5    | 1.26/74.1 | 2.63/75.8 | 1.34/75.8
Glass    | 0/65.4    | 0.44/69.2 | 0.56/69.6 | 0.67/69.6
Heart    | 0/76.2    | 0.22/79.9 | 0.32/80.6 | 0.41/81.2
Iris     | 0/95.3    | 0.05/95.3 | 0.10/95.3 | 0.23/95.3
Labour   | 0/73.7    | 0.22/75.4 | 0.41/75.4 | 0.51/75.4
Mushroom | 0/100     | 0.65/100  | 446/100   | 5.86/100
Soybean  | 0/92.4    | 0.34/90.2 | 5.92/92.4 | 1.73/93.2
Average  | 0/85.1    | 0.47/86.0 | 46.2/86.6 | 1.26/86.7
Discussion
1. InfoGain
  - The fastest approach
2. ReliefF
  - Long time to handle large data sets
3. FSSMC
  - Takes longer on small data sets than InfoGain and ReliefF
  - No significant classification accuracy improvement
  - Achieves the best combined results (classification accuracy and efficiency) on average
  - Overcomes the computational problem of ReliefF and preserves classification accuracy
KNN imputation: Ulster Hospital and PIMA data
[Line chart: 20 Random Simulations, Ulster Hospital; Error (0 to 18) against Missing Values (5% to 35%) for 5-NN, 10-NN, NORM, Mean imputation, EMImpute_Columns, LSImpute_Rows]
[Line chart: 20 Random Simulations, PIMA; Error (0 to 18) against Missing Values (5% to 35%) for 10-NN, NORM, Mean imputation, EMImpute_Columns, LSImpute_Rows]
Comparison of different methods using different fractions of missing values in the imputation process and different datasets.
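A k-nearest-neighbour imputation of the kind compared above can be sketched as follows. This is a simplified illustration, not the procedure used for the charts: it assumes numeric attributes on comparable scales, measures distance only over attributes present in both rows, and fills a gap with the unweighted mean of the k nearest donors.

```python
import math

def knn_impute(rows, k=5):
    """Fill None entries with the mean of that attribute over the k nearest rows."""
    def distance(a, b):
        shared = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((u - v) ** 2 for u, v in shared) / len(shared))

    completed = []
    for row in rows:
        filled = list(row)
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows that have a value for this attribute.
            donors = [r for r in rows if r is not row and r[col] is not None]
            donors.sort(key=lambda r: distance(row, r))
            nearest = donors[:k]
            filled[col] = sum(r[col] for r in nearest) / len(nearest)
        completed.append(filled)
    return completed

# Toy data (assumed): the third row is missing its last attribute.
rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 11.0],
    [0.9, 1.9, None],
    [5.0, 6.0, 50.0],
    [5.1, 6.1, 51.0],
]

completed = knn_impute(rows, k=2)
```

The missing value is filled from the two rows that resemble the incomplete one, where mean imputation would use all four available values.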
Classification System based on Supervised
Model
• Assessment of the risk of Cardiovascular Heart Diseases (CHD) in patients with diabetes type 2 is the main objective
• The k-Nearest Neighbour (kNN) classification algorithm will provide the basis for new decision support tools. It classifies patients according to their similarity with previous cases
• A knowledge-driven, weighted kNN (WkNN) method has been proposed to distinguish significant diagnostic markers
• A genetic algorithm (GA) that incorporates background knowledge will be developed to support such a feature relevance analysis task
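A weighted kNN classifier of the kind proposed here can be sketched as follows. The weighted Euclidean distance, the majority vote, the risk labels and the toy cases are assumptions for illustration; the slides do not specify these details.

```python
import math
from collections import Counter

def wknn_predict(cases, weights, query, k=3):
    """Majority vote among the k nearest cases under a feature-weighted Euclidean distance."""
    def distance(a, b):
        return math.sqrt(sum(w * (u - v) ** 2 for w, u, v in zip(weights, a, b)))

    nearest = sorted(cases, key=lambda case: distance(case[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy cases (assumed): feature 0 tracks the risk class, feature 1 does not.
cases = [
    ((0.2, 0.9), "low risk"), ((0.3, 0.1), "low risk"), ((0.1, 0.5), "low risk"),
    ((0.8, 0.2), "high risk"), ((0.9, 0.8), "high risk"), ((0.7, 0.6), "high risk"),
]
query = (0.35, 0.7)

informed = wknn_predict(cases, (1.0, 0.0), query)    # only feature 0 counts
misguided = wknn_predict(cases, (0.0, 1.0), query)   # only feature 1 counts
```

The same query receives different classifications under the two weight vectors, which is why the choice of weights matters and why a search procedure such as a GA is proposed for it.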
Background Knowledge
• User feedback • Constraints • Ontology • Annotation text
[Diagram: Patient 1, Patient 2, Patient 3, ..., Patient N; GA; WkNN; Results]
WkNN
[Table: Patient (1, 2, ..., N) by Feature 1, Feature 2, …, Feature n, weighted W1, W2, …, Wn, with Classification (e.g. 1, 0, …, 2) in column n+1]
Another problem: How can you choose the right weights?
GA
Initial Population
0.5, 0.2, 0.9, …, 0.32
0.7, 0.1, 0.8, …, 0.6
0.4, 0.5, 0.6, …, 0.3
0.38, 0.2, 0.7, …, 0.1
…
0.83, 0.34, 0.98, …, 0.61
New Population
0.3, 0.4, 0.7, …, 0.3
0.7, 0.17, 0.5, …, 0.69
0.4, 0.1, 0.8, …, 0.36
0.44, 0.2, 0.1, …, 0.89
…
0.61, 0.98, 0.34, …, 0.83
W1   | W2   | W3   | … | Wn   | No. of misclassifications | Fitness
0.3  | 0.4  | 0.9  | … | 0.3  | 9                         | lower
0.7  | 0.17 | 0.5  | … | 0.69 | 15                        | lowest
0.4  | 0.1  | 0.8  | … | 0.36 | 5                         | lower
0.44 | 0.2  | 0.1  | … | 0.89 | 2                         | high less
0.61 | 0.98 | 0.34 | … | 0.83 | 1                         | highest
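The search for weights can be sketched as a small genetic algorithm in which fewer misclassifications mean higher fitness, as in the table above. Everything beyond that idea is an assumption: the leave-one-out weighted 1-NN evaluation, keeping the fitter half of the population, one-point crossover, the mutation step and the toy cases. It does not include the background knowledge mentioned on the earlier slide.

```python
import random

def misclassifications(weights, cases):
    """Leave-one-out error count of a feature-weighted 1-nearest-neighbour classifier."""
    errors = 0
    for i, (x, label) in enumerate(cases):
        others = cases[:i] + cases[i + 1:]
        nearest = min(
            others,
            key=lambda c: sum(w * (u - v) ** 2 for w, u, v in zip(weights, x, c[0])),
        )
        errors += nearest[1] != label
    return errors

def evolve_weights(cases, n_features, pop_size=10, generations=20, seed=0):
    """Evolve a weight vector in [0, 1]^n; lower error count means fitter."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda w: misclassifications(w, cases))
        parents = population[: pop_size // 2]        # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)       # one-point crossover
            child = a[:cut] + b[cut:]
            j = rng.randrange(n_features)            # mutate one weight, clamp to [0, 1]
            child[j] = min(1.0, max(0.0, child[j] + rng.uniform(-0.2, 0.2)))
            children.append(child)
        population = parents + children
    return min(population, key=lambda w: misclassifications(w, cases))

# Toy cases (assumed): feature 0 carries the class, features 1 and 2 are noise.
cases = [
    ((0.1, 0.9, 0.2), 0), ((0.2, 0.1, 0.8), 0), ((0.3, 0.5, 0.5), 0),
    ((0.7, 0.8, 0.3), 1), ((0.8, 0.2, 0.7), 1), ((0.9, 0.5, 0.4), 1),
]

best = evolve_weights(cases, n_features=3)
best_errors = misclassifications(best, cases)
```

Keeping the fitter half unchanged means the best error count in the population never gets worse from one generation to the next.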
Semi-supervised Clustering
• Combines the benefits of supervised and unsupervised classification methods
• To make use of class labels or pairwise constraints on the data to guide the clustering process
• To allow users to guide and interact with the clustering process by providing feedback during the learning and post-processing stages
Goals
• To make clustering both more effective and meaningful
• To support the selection of relevant, optimized partitions for decision support
[Diagram: Data, Data Preprocessing, Clustering Model, Background Information, Clustering Output]
• Data: Diabetic Patients’ Records
• Data Preprocessing: Normalization, Filtration, Missing Value Estimation
• Background Information: Experts' Constraints and Feedback
• Clustering Model: Detection of Relevant Groups of Similar Data (patients) Using Different Statistical and Knowledge-Driven Optimization Criteria
• Clustering Output: Similar Groups of Data (patients) Associated with Common Characteristic (significant medical outcomes, conditions or coronary heart disease risk levels)
Initial Test
Simple Model on the Proposed Algorithm
Original Class distribution from PIMA dataset
Class 1: 2, 4, 6, 8, 11, 13
Class 2: 1, 3, 5, 7, 9, 10, 12, 14, 15
M set: (2,4) (6,8) (8,11) (1,5) (7,9)
C set: (1,2) (4,7) (6,9)
Preliminary Results
Class A: 2, 4, 6, 8
Class B: 1, 5, 7, 9, 11, 12, 13, 14, 15
Outlier: 3, 10
Case Based Reasoning
• Memory-based lazy problem-solver
  - System stores training data and waits until a new problem is received before constructing a solution
  - Differs from kNN in that case attributes can be of any type (i.e. not just numeric)
• How do CBR systems solve problems?
  - CBR systems store a set of past problem cases together with their solutions in a Case Base, e.g. a case could be a set of patient symptoms + a diagnosis based on those symptoms
  - When a new problem case is received, the system retrieves one or more similar past cases, and re-uses or adapts their solutions to solve the new case
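The retrieve and re-use steps can be sketched with a case base that mixes numeric and symbolic attributes. The similarity function (range-scaled difference for numeric attributes, exact match for the others), the attribute names and the placeholder diagnoses are all assumptions, and no adaptation step is shown.

```python
def similarity(case_a, case_b, ranges):
    """Mean per-attribute similarity: scaled difference if numeric, exact match otherwise."""
    scores = []
    for name, a in case_a.items():
        b = case_b[name]
        if name in ranges:
            scores.append(1 - abs(a - b) / ranges[name])
        else:
            scores.append(1.0 if a == b else 0.0)
    return sum(scores) / len(scores)

def retrieve(case_base, problem, ranges):
    """Return the stored case whose problem description is most similar to the new one."""
    return max(case_base, key=lambda case: similarity(case["problem"], problem, ranges))

# Toy case base (assumed): symptoms plus the diagnosis recorded for them.
case_base = [
    {"problem": {"age": 45, "thirst": "yes", "fatigue": "no"}, "solution": "diagnosis A"},
    {"problem": {"age": 70, "thirst": "no", "fatigue": "yes"}, "solution": "diagnosis B"},
]
ranges = {"age": 80}  # assumed value range used to scale the numeric attribute

new_problem = {"age": 50, "thirst": "yes", "fatigue": "yes"}
best = retrieve(case_base, new_problem, ranges)
proposed = best["solution"]  # re-use the retrieved solution without adaptation
```

Nothing is computed until `new_problem` arrives, which is what makes the approach a lazy problem-solver.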
Acknowledgements
• Medical Informatics Recognised Research Group
• NIKEL
• North – South Collaboration Team
• Roy Harper, Consultant at Ulster Hospital
Thank You For Your Attention