Page 1: 14 June 2005 pj.mccullagh@ulster.ac.uk Medical Informatics: University of Ulster Paul McCullagh University of Ulster

14 June 2005 pj.mccullagh@ulster.ac.uk

Medical Informatics: University of Ulster

Paul McCullagh

University of Ulster

Page 2

10 June 2005

Page 3

Ulster Institute of eHealth

Page 4

www.uieh.n-i.nhs.uk

Page 5

Stroke Web Interface

Features:
• Animation feedback to patient: 3D rendering of patient movement during rehabilitation
• Communication tools for patients and professionals
• Decision support

Home-based system is currently undergoing development

Page 6

Application Of Multimedia To Nursing Education: A Case Study Based On The Diagnosis Of Alcohol Abuse

• Culture of Binge Drinking
• Multimedia as an Education Tool
• Interactive Learning
• Self Assessment
• Exemplars From Other Areas Including: Diabetes, Testicular Cancer, Anorexia

Page 7

Diabetes Education

Type I & Type II innovative multimedia patient-centred education materials

Page 8

Intelligent Consultant: Interface

Natural Language Processing

Page 9

Texture Analysis and Classification Methods for the Characterisation of Ultrasonic Images of the Placenta

Image or Region in an Image → Feature Extraction → Classification → Labeled Image

Page 10

The Grannum Classification

Page 11

Age Related Macular Disease

• Edge Detection
• Line Thinning
• Threshold segmentation

Page 12

Full Screen Image of the Application displaying an Occult lesion with edge detection, a pixel count and histogram of the region inside the box

Page 13

Body Surface Potential Mapping

Page 14

Observations Driving Lead System Development

• Electrophysiology associated with disease is most often localized in the heart: infarction, ischemia, accessory A-V pathways, ectopic foci, “late potentials”, conduction or repolarization abnormalities
• ECG manifestations of localized disease are localized on the body surface
• Clinical lead systems are not optimized for diagnostic information capture; they often do not sample regions where diagnostic information occurs

Page 15

Mining for Information In Body Surface Potential Maps

Lead selection
• Best leads for estimating all surface potentials
• Best leads for diagnostic information
• Are these the same?

Data mining techniques
• Wrappers
• Sequential selection

Page 16

Hearing Screening

• Normal hearing five peak response using high stimulus level (70 dB)
• Peak amplitudes reduce as stimulus level is reduced
• Only wave V remains at threshold, at about 30 dB stimulus level

Page 17

Clementine data mining software used to generate neural network and decision tree models for classification of ABR waveforms

The individual models will make use of:

• time domain data

• frequency domain data

• correlation of subaverages

Page 18

Wavelet Decomposition

[Figure: wavelet decomposition plots of pre-stimulus data and post-stimulus data (256 coefficients)]

Analysis of 16 D4 Coefficients

[Figure: the 16 D4 coefficients for pre-stimulus and post-stimulus data]

Ratio of the sum of absolute values: Pre-stimulus / Post-stimulus

The closer the ratio is to 0, the higher the probability of a response
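The ratio above can be sketched in a few lines. This is a minimal illustration only: it assumes that "D4" means the level-4 detail coefficients (a 256-sample signal halves four times to give 16 coefficients) and it uses the Haar wavelet, which the slide does not name. The function names `haar_detail` and `abs_sum_ratio` are hypothetical.

```python
import math


def haar_detail(signal, level):
    """Return Haar detail coefficients at the given level (1 = finest).

    Assumption: the Haar wavelet stands in for whichever wavelet was
    actually used; the slide does not say.
    """
    approx = list(signal)
    detail = []
    for _ in range(level):
        if len(approx) < 2 or len(approx) % 2:
            raise ValueError("signal length must be divisible by 2**level")
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return detail


def abs_sum_ratio(pre_coeffs, post_coeffs):
    """Ratio of summed absolute coefficients: pre-stimulus over post-stimulus."""
    post_total = sum(abs(c) for c in post_coeffs)
    if post_total == 0:
        raise ValueError("post-stimulus coefficients sum to zero")
    return sum(abs(c) for c in pre_coeffs) / post_total
```

A ratio well below 1 means the post-stimulus window carries more energy at that scale than the pre-stimulus baseline, which is the slide's indicator of a response.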

Page 19

CBR for wound healing progress

Objective of research – Automated wound healing monitoring and assessment
• Determine size of whole wound
• Determine tissue types present
• Coverage of different types of tissues
• Automatically monitor healing over time
• Remove subjectivity
• Improve decision making process and care

Technologies used
• Case-based reasoning
• Feature extraction/transformation

Page 20

Work to date – classification for tissue types

Method
• Take an image and overlay it with a grid
• Make a prediction for each type of tissue
• Prediction is based on the system's knowledge of previous tissue types (cases) that have been identified by professionals

Overall accuracy – 86%

Publications
• Zheng, Bradley, Patterson, Galushka, Winder, “New Protocol for leg ulcer tissue classification from colour images”, Proc. 26th Int. Conf. of Engineering in Medicine and Biology Society (EMBS 04)
• Galushka, Zheng, Patterson, Bradley, “Case-Based Tissue Classification for Monitoring Leg Ulcer Healing”, 18th IEEE Int. Symposium on Computer-Based Medical Systems (CBMS 2005)

Page 21

[Figure labels: Analysis, Prediction, Comparison]

Page 22

14 June 2005 pj.mccullagh@ulster.ac.uk

Feature Selection and Classification on Type 2 Diabetic Patients’ Data

Paul McCullagh

University of Ulster

Page 23

Diabetes

World’s situation
• Around 194 million people with diabetes, according to a WHO study
• 50% of patients are undiagnosed

Northern Ireland
• 49,000 diagnosed patients in NI
• Another 25,000 are unaware of their condition

Type 2 diabetes (NIDDM)
• Diabetic complications
• Blood glucose control
• HbA1c test

Page 24

Data Mining

• Large amounts of information gathered in medical databases
• Traditional manual analysis has become inadequate
• Efficient computer-based analysis is indispensable
• Noisy, incomplete and inconsistent data

Can we determine factors which influence how well the patients progress?

Are these factors under our control?

Page 25

Relative Risk for the Development of Diabetic Complications

Source: Rahman Y, Nolan J and Grimson J. E-Clinic: Re-engineering Clinical Care Process in Diabetes Management. Healthcare Informatics Society of Ireland, 2002

Page 26

North Down Primary Care Organisation

Quality of data in Primary Care Data Sets
• HbA1c
• Fundoscopy

[Figure: Bar Chart Showing Percentage Recording of HbA1c For Diabetic Patients in Practices; axes: Practice (1 to 11) against Percentage % (0 to 100)]

[Figure: Bar Chart Showing Percentage Recording Of Fundoscopy For Diabetic Patients in Practices; axes: Practice (1 to 11) against Percentage % (0 to 100)]

Page 27

Data Set

• Ulster Community & Hospitals Trust
• 2064 type 2 patients, 20876 records
• 1148 males, 916 females
• 410 features reduced to 47 relevant features: 23 categorical, 24 numerical
• Average 7.8% missing values

[Figure: Distribution of Patients' Age; bar values 563, 637, 579 and 238 for age groups 20-60, 60-70, 70-80 and >80]

Page 28

Research Goals

• Identify significant factors that influence Type 2 diabetes control
  – Weight, smoking status or alcohol?
  – Height, age or gender?
  – Time interval between two tests?
  – Cholesterol level?
• Classify individuals with bad disease control in the population
  – Distinguish patients with bad blood glucose control from patients with good blood glucose control based on physiological and examination factors
  – Predict individuals in the population with poor diabetes control status based on physiological and examination factors
• Investigate the potential of data mining techniques in a ‘real world’ medical domain and evaluate different data mining approaches

Page 29

Data Mining Procedure

Raw Data → Pre-Processing → Clean Data → Feature Selection → Target Data → Data Mining Schemes → Model/Patterns → Post-Processing → Knowledge

Page 30

Data Preprocessing

• Data Integration
  – Combine data from multiple sources into a single data table
• Data Transformation
  – Divide the patients into 2 categories (Better Control and Worse Control) based on the comparison of the HbA1c value of the laboratory test with the target value
  – Better: 34.33%; Worse: 65.67%
• Data Reduction
  – Remove the attributes with more than 50% missing data
  – Keep the features recommended by the diabetic expert and the international diabetes guidelines
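Two of the preprocessing steps above lend themselves to a short sketch: dropping attributes with more than 50% missing data, and labelling a test result against a target. The slides do not give the target value or the direction of the comparison, so the rule "at or below target means Better" and the names `drop_sparse_attributes` and `control_label` are assumptions made for illustration.

```python
def drop_sparse_attributes(records, max_missing=0.5):
    """Remove attributes whose fraction of missing (None) values exceeds max_missing.

    `records` is a list of dicts, one per patient record.
    """
    if not records:
        return []
    keep = []
    for name in records[0]:
        missing = sum(1 for r in records if r.get(name) is None)
        if missing / len(records) <= max_missing:
            keep.append(name)
    return [{name: r.get(name) for name in keep} for r in records]


def control_label(hba1c, target):
    """Label one HbA1c result against a target value.

    Assumption: a value at or below the target counts as Better Control.
    """
    return "Better" if hba1c <= target else "Worse"
```

The threshold is a parameter so that the 50% cut-off used in the study can be varied when checking how sensitive the retained feature set is to it.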

Page 31

Feature Selection

• Identify and remove irrelevant and redundant information
  – Not all attributes are actually useful
  – Noisy, irrelevant and redundant attributes
• Minimize the associated measurement costs
• Improve prediction accuracy
• Reduce the complexity
• Easier interpretation of the classification results

Page 32

Feature Selection

• Information Gain: delete the less informative attributes; also adopted in ID3 and C4.5 as the splitting criterion during the tree growing procedure
  – A measure based on entropy
• Relief: estimate attributes according to how well their values distinguish among instances that are near each other
  – An instance-based attribute ranking scheme
  – Randomly sample an instance I from the data
  – Locate I's nearest neighbour from the same and the opposite class
  – Compare them and update relevance scores for each attribute
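As a small illustration of the entropy-based measure, information gain for one categorical attribute can be computed as the class entropy minus the weighted entropy after splitting on the attribute. This is a generic textbook sketch, not the tool used in the study; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

Ranking attributes by this score and keeping the top few is one way to arrive at a reduced feature subset; numerical attributes would need to be discretized first.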

Page 33

Top 15 predictors

1   Age                  9    Glycosuria
2   Diagnosis Duration   10   Complication Type
3   Insulin Treatment    11   BP Diastolic
4   Family History       12   Tablet Treatment
5   Smoking              13   LabTriglycerides
6   LabRBG               14   General Proteinuria
7   Diet Treatment       15   BPSystolic
8   BMI

Page 34

Design Experiments

Classification Algorithms
• Naïve Bayes – a statistical method for classification
• IB1 – instance-based nearest neighbour algorithm
• C4.5 – inductive learning algorithm using decision trees

Sampling strategy: 10-fold cross validation
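For readers unfamiliar with the sampling strategy, a minimal sketch of 10-fold cross validation follows: the data are shuffled once, split into ten folds, and each fold serves once as the test set while the other nine are used for training. The function name `k_fold_indices` and the fixed seed are illustrative assumptions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

The reported accuracy is then the average over the ten test folds, so every record contributes to the estimate exactly once.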

Page 35

Classification Results - initial

Table 1: Classification accuracy (%) for Different Sizes of Feature Subsets (10-CV/Training and Testing)

Attribute Number   Naïve Bayes   IB1     C4.5    Discretized C4.5   Average
5                  69.36         69.14   76.36   75.23              72.52
8                  74.60         70.49   76.12   75.76              74.24
10                 72.47         71.54   77.21   77.46              74.67
15                 72.92         70.37   78.73   78.12              75.04
20                 71.48         69.30   76.42   76.73              73.48
25                 69.24         67.88   77.52   77.75              73.10
30                 70.53         67.78   77.43   77.52              73.32
47                 62.35         63.44   75.38   76.37              69.39
Average            70.37         68.74   76.90   76.87              -

Page 36

[Figure: CA Based on 10-CV; Classification Accuracy (62 to 82) against Number of Features (5, 8, 10, 15, 20, 25, 30, 47) for Naïve Bayes, IB1, C4.5, D C4.5 and Average]

Page 37

Sensitivity and Specificity - initial

Attribute Number   Naïve Bayes   IB1           C4.5          Discretized C4.5
5                  0.912/0.276   0.892/0.306   0.947/0.413   0.938/0.397
8                  0.921/0.411   0.883/0.365   0.951/0.398   0.942/0.405
10                 0.782/0.615   0.907/0.349   0.962/0.409   0.957/0.426
15                 0.631/0.781   0.912/0.306   0.973/0.432   0.987/0.387
20                 0.685/0.772   0.838/0.416   0.940/0.428   0.963/0.393
25                 0.656/0.762   0.821/0.407   0.932/0.475   0.972/0.405
30                 0.708/0.700   0.835/0.377   0.935/0.467   0.955/0.431
47                 0.587/0.693   0.810/0.298   0.928/0.421   0.964/0.381
Average            0.735/0.625   0.862/0.353   0.946/0.430   0.960/0.403
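Each cell in the table above pairs two rates. As a reminder of how such a pair is derived from predictions, here is a minimal sketch; which class the study treated as positive is not stated on the slide, so the `positive` argument is left to the caller and the function name is hypothetical.

```python
def sensitivity_specificity(actual, predicted, positive):
    """Return (sensitivity, specificity) for a two-class problem.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    """
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in `actual`")
    return tp / (tp + fn), tn / (tn + fp)
```

A classifier can score high on one rate and low on the other, which is why the pair is more informative than accuracy alone when the two classes are imbalanced.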

Page 38

Discussion

• The C4.5 decision tree algorithm had the best performance for classification
• Discretization did not improve the performance of C4.5 significantly on our data set
• On average, the best results were achieved when the top 15 attributes were selected for prediction
• IB1 and Naïve Bayes did benefit from the reduction of the input parameters, C4.5 less so
• Naïve Bayes can classify both patient groups with reasonable accuracy
• Most classifiers tend to perform better at identifying the bad control cases in the population

Page 39

Relief Algorithms

• A feature weight-based method inspired by instance-based learning algorithms
• Key idea of original Relief
  – Estimate the quality of attributes according to how well their values distinguish among instances that are near to each other
  – Does not make the assumption that the attributes are conditionally independent
• ReliefF (Kononenko, 1994): the extension of Relief
  – Applicable to multi-class data sets
  – Tolerant to noisy and incomplete data
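The key idea of original Relief can be sketched as follows for two classes and numerical attributes. This is a simplified illustration, not ReliefF and not the implementation used in the study: it assumes no missing values, at least two instances per class, and a Manhattan distance over range-normalized attributes. The name `relief_weights` and its parameters are hypothetical.

```python
import random


def relief_weights(rows, labels, n_samples=20, seed=0):
    """Estimate one relevance weight per attribute with basic two-class Relief."""
    n_features = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_features)]
    highs = [max(r[j] for r in rows) for j in range(n_features)]
    spans = [(hi - lo) or 1.0 for lo, hi in zip(lows, highs)]

    def diff(j, a, b):
        # Normalized difference of attribute j between two instances.
        return abs(a[j] - b[j]) / spans[j]

    def nearest(i, same_class):
        candidates = [k for k in range(len(rows))
                      if k != i and (labels[k] == labels[i]) == same_class]
        return min(candidates,
                   key=lambda k: sum(diff(j, rows[i], rows[k]) for j in range(n_features)))

    rng = random.Random(seed)
    weights = [0.0] * n_features
    for _ in range(n_samples):
        i = rng.randrange(len(rows))
        hit, miss = nearest(i, True), nearest(i, False)
        for j in range(n_features):
            # Reward attributes that differ on the nearest miss,
            # penalize attributes that differ on the nearest hit.
            weights[j] += (diff(j, rows[i], rows[miss]) - diff(j, rows[i], rows[hit])) / n_samples
    return weights
```

Attributes that separate the classes end up with positive weights, while attributes that vary mostly within a class drift toward negative values.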

Page 40

Optimization of ReliefF

• Data transformation
  – Frequency based encoding scheme
  – Representing the categorical code of a particular variable with a numerical value derived from its relative frequency among outcomes
• Supervised Model Construction for Starter Selection
  – Generate the number of instances (m) automatically, eliminating the dependency on the selection of a “good value” for m, to improve the efficiency of the algorithm
  – Basic idea: group the “near” cases with the same class label
  – Similarity measurement: Euclidean distance function
  – Repeated until an instance with a different class label is encountered
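One possible reading of the frequency based encoding scheme is sketched below: each category is replaced by the fraction of records with that category that have a chosen outcome. The slide does not define the scheme precisely, so this interpretation, the function name `frequency_encode` and the `positive` argument are all assumptions.

```python
from collections import defaultdict


def frequency_encode(categories, outcomes, positive):
    """Replace each category by the relative frequency of `positive` among its records.

    Returns the encoded column and the category-to-value mapping.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for category, outcome in zip(categories, outcomes):
        totals[category] += 1
        if outcome == positive:
            positives[category] += 1
    mapping = {c: positives[c] / totals[c] for c in totals}
    return [mapping[c] for c in categories], mapping
```

An encoding of this kind turns a nominal attribute into a number, so that a distance function such as the Euclidean distance mentioned above can be applied to it.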

Page 41

Feature Selection via Supervised Model Construction

• Improve efficiency
• Retain accuracy
• Centre is a ‘good’ representation of cluster
• Scope of local region?

Page 42

Experiment Design

• C4.5 as the classification algorithm
• Nine benchmark UCI data sets
  – Number of cases varies from 57 to 8,124
  – Contains a mixture of nominal and numerical attributes
• 10-fold Cross Validation
• InfoGain and ReliefF were used for comparison

Page 43

Number of Selected Attributes

Data Set   Cases   After FSSMC   Reduction Rate (%)
Breast     699     45            93.6
Credit     690     159           77.0
Diabetes   768     240           68.8
Glass      214     80            62.6
Heart      294     39            86.7
Iris       150     13            91.3
Labour     57      10            82.5
Mushroom   8124    89            98.9
Soybean    683     109           84.0
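The reduction rates in the table are consistent with the formula 100 × (1 − after / cases). A one-function sketch (the name `reduction_rate` is hypothetical) makes the arithmetic explicit:

```python
def reduction_rate(cases_before, cases_after):
    """Percentage reduction, rounded to one decimal place."""
    return round(100.0 * (1 - cases_after / cases_before), 1)
```

For example, the Breast row gives 100 × (1 − 45/699), which rounds to 93.6, matching the table.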

Page 44

Processing Time (in sec.) / Classification Accuracy (%)

                      C4.5
Data Set   Before FS   InfoGain    ReliefF     FSSMC
Breast     0/94.6      0.16/94.8   2.05/95.3   0.15/95.3
Credit     0/86.4      1.15/86.7   3.58/86.4   1.20/86.8
Diabetes   0/74.5      1.26/74.1   2.63/75.8   1.34/75.8
Glass      0/65.4      0.44/69.2   0.56/69.6   0.67/69.6
Heart      0/76.2      0.22/79.9   0.32/80.6   0.41/81.2
Iris       0/95.3      0.05/95.3   0.10/95.3   0.23/95.3
Labour     0/73.7      0.22/75.4   0.41/75.4   0.51/75.4
Mushroom   0/100       0.65/100    446/100     5.86/100
Soybean    0/92.4      0.34/90.2   5.92/92.4   1.73/93.2
Average    0/85.1      0.47/86.0   46.2/86.6   1.26/86.7

Page 45

Discussion

1. InfoGain
  – The fastest approach
2. ReliefF
  – Long time to handle large data sets
3. FSSMC
  – Takes longer time on small data sets than InfoGain and ReliefF
  – No significant classification accuracy improvement
  – Achieves the best combined results (classification accuracy and efficiency) on average
  – Overcomes the computational problem of ReliefF and preserves classification accuracy

Page 46

Page 47

KNN imputation: Ulster Hospital and PIMA data

[Figure: 20 Random Simulations, Ulster Hospital; Error (0 to 18) against Missing Values (5% to 35%) for 5-NN, 10-NN, NORM, Mean imputation, EMImpute_Columns and LSImpute_Rows]

[Figure: 20 Random Simulations, PIMA; Error (0 to 18) against Missing Values (5% to 35%) for 10-NN, NORM, Mean imputation, EMImpute_Columns and LSImpute_Rows]

Comparison of different methods using different fractions of missing values in the imputation process and different datasets.
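For orientation, a bare-bones version of k-nearest-neighbour imputation is sketched below. It is not the procedure evaluated in the figures: it assumes numerical attributes on comparable scales, marks missing values with `None`, measures distance only over attributes both rows have observed, and fills a gap with the mean of the k nearest donors. The name `knn_impute` is hypothetical.

```python
import math


def knn_impute(rows, k=5):
    """Return a copy of rows with None entries filled from the k nearest rows."""
    n_cols = len(rows[0])

    def distance(a, b):
        # Root mean squared difference over attributes observed in both rows.
        shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum(shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            # Donors are other rows that have a value in column j.
            donors = [r for m, r in enumerate(rows) if m != i and r[j] is not None]
            donors.sort(key=lambda r: distance(row, r))
            nearest = donors[:k]
            if nearest:
                filled[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return filled
```

Mean imputation is the special case where every donor is used, which is one way to see why a well-chosen k can do better when similar patients have similar values.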

Page 48

Classification System based on Supervised Model

• Assessment of the risk of Coronary Heart Disease (CHD) in patients with type 2 diabetes is the main objective

• The k-Nearest Neighbour (kNN) classification algorithm will provide the basis for new decision support tools. It classifies patients according to their similarity with previous cases

• A knowledge-driven, weighted kNN (WkNN) method has been proposed to distinguish significant diagnostic markers

• A genetic algorithm (GA) that incorporates background knowledge will be developed to support such a feature relevance analysis task
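A weighted kNN classifier of the kind described above can be sketched as a nearest-neighbour vote under a feature-weighted Euclidean distance. How the proposed WkNN method actually applies its weights is not given on the slide, so the weighting inside the distance and the name `wknn_classify` are assumptions.

```python
import math
from collections import Counter


def wknn_classify(query, cases, labels, weights, k=3):
    """Classify `query` by majority vote among its k nearest cases.

    Each squared feature difference is scaled by its weight, so a weight
    of 0 removes a feature from the comparison.
    """
    def distance(a, b):
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    ranked = sorted(range(len(cases)), key=lambda i: distance(query, cases[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

Changing the weights changes which past cases count as similar, which is why the choice of weights matters enough to warrant a search procedure.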

Page 49

Background Knowledge

• User feedback
• Constraints
• Ontology
• Annotation text

Patient 1

Patient 2

Patient 3

Patient N

GA

WkNN

Results

Page 50

WkNN

          W1          W2          …   Wn          n+1
Patient   Feature 1   Feature 2   …   Feature n   Classification
1                                                 1
2                                                 0
.
.
N                                                 2

Another problem: How can you choose the right weights?

GA

Initial Population
0.5, 0.2, 0.9, …, 0.32
0.7, 0.1, 0.8, …, 0.6
0.4, 0.5, 0.6, …, 0.3
0.38, 0.2, 0.7, …, 0.1
.
0.83, 0.34, 0.98, …, 0.61

New Population
0.3, 0.4, 0.7, …, 0.3
0.7, 0.17, 0.5, …, 0.69
0.4, 0.1, 0.8, …, 0.36
0.44, 0.2, 0.1, …, 0.89
.
0.61, 0.98, 0.34, …, 0.83

W1     W2     W3     …   Wn     No. Misclassification   Fitness
0.3    0.4    0.9    …   0.3    9                       lower
0.7    0.17   0.5    …   0.69   15                      lowest
0.4    0.1    0.8    …   0.36   5                       lower
0.44   0.2    0.1    …   0.89   2                       high less
0.61   0.98   0.34   …   0.83   1                       highest

Page 51

Semi-supervised Clustering

• Combines the benefits of supervised and unsupervised classification methods
• To make use of class labels or pairwise constraints on the data to guide the clustering process
• To allow users to guide and interact with the clustering process by providing feedback during the learning and post-processing stages

Goals
• To make clustering both more effective and meaningful
• To support the selection of relevant, optimized partitions for decision support

Page 52

Data Preprocessing Clustering Model

Background Information

Clustering Output

Data

• Data: Diabetic Patients’ Records
• Data Preprocessing: Normalization, Filtration, Missing Value Estimation
• Background Information: Experts' Constraints and Feedback
• Clustering Model: Detection of Relevant Groups of Similar Data (patients) Using Different Statistical and Knowledge-Driven Optimization Criteria
• Clustering Output: Similar Groups of Data (patients) Associated with Common Characteristics (significant medical outcomes, conditions or coronary heart disease risk levels)

Page 53

Initial Test

• Simple Model on the Proposed Algorithm
• Original class distribution from PIMA dataset
  – Class 1: 2,4,6,8,11,13
  – Class 2: 1,3,5,7,9,10,12,14,15
• M set: (2,4) (6,8) (8,11) (1,5) (7,9)
• C set: (1,2) (4,7) (6,9)
• Preliminary Results (Outlier: 3, 10)
  – Class A: 2,4,6,8
  – Class B: 1,5,7,9,11,12,13,14,15
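Pairwise constraints such as the M and C sets above can be checked against a partition with a few lines of code. The sketch assumes that the M set holds must-link pairs and the C set holds cannot-link pairs, which is the usual convention but is not spelled out on the slide; `count_violations` is a hypothetical name, and pairs involving unassigned items (such as outliers) are skipped.

```python
def count_violations(assignment, must_link, cannot_link):
    """Count violated must-link and cannot-link pairs for a cluster assignment.

    `assignment` maps item id to cluster label.
    """
    def both_assigned(a, b):
        return a in assignment and b in assignment

    must = sum(1 for a, b in must_link
               if both_assigned(a, b) and assignment[a] != assignment[b])
    cannot = sum(1 for a, b in cannot_link
                 if both_assigned(a, b) and assignment[a] == assignment[b])
    return must, cannot


must_link = [(2, 4), (6, 8), (8, 11), (1, 5), (7, 9)]
cannot_link = [(1, 2), (4, 7), (6, 9)]
result = {i: "A" for i in (2, 4, 6, 8)}
result.update({i: "B" for i in (1, 5, 7, 9, 11, 12, 13, 14, 15)})
print(count_violations(result, must_link, cannot_link))  # (1, 0)
```

With the partition as read from the slide, the single violated pair is (8, 11); a count like this could serve as a penalty term when comparing candidate partitions.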

Page 54

Case Based Reasoning

• Memory-based lazy problem-solver
  – System stores training data and waits until a new problem is received before constructing a solution
  – Differs from kNN in that case attributes can be of any type (i.e. not just numeric)
• How do CBR systems solve problems?
  – CBR systems store a set of past problem cases together with their solutions in a Case Base, e.g. a case could be a set of patient symptoms + a diagnosis based on those symptoms
  – When a new problem case is received, the system retrieves one or more similar past cases, and re-uses or adapts their solutions to solve the new case
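The retrieve-and-reuse step can be sketched as follows. The similarity measure is an assumption chosen to show attributes of mixed type: numerical attributes score by closeness, all other attributes score 1 on an exact match and 0 otherwise. The function names and the idea of returning the stored solution unchanged (no adaptation) are illustrative simplifications.

```python
def similarity(case_a, case_b):
    """Average per-attribute similarity over the attributes both cases share."""
    shared = [name for name in case_a if name in case_b]
    if not shared:
        return 0.0
    score = 0.0
    for name in shared:
        a, b = case_a[name], case_b[name]
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            score += 1.0 / (1.0 + abs(a - b))   # numeric: closer is more similar
        else:
            score += 1.0 if a == b else 0.0      # symbolic: exact match only
    return score / len(shared)


def retrieve_and_reuse(problem, case_base):
    """Return the solution stored with the most similar past case.

    `case_base` is a list of (problem_description, solution) pairs.
    """
    _, solution = max(case_base, key=lambda case: similarity(problem, case[0]))
    return solution
```

A fuller system would also adapt the retrieved solution and store the newly solved case back into the case base.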

Page 55

Acknowledgements

• Medical Informatics Recognised Research Group
• NIKEL North – South Collaboration Team
• Roy Harper, Consultant at Ulster Hospital

Page 56

Thank You For Your Attention