4IZ451 - Knowledge Discovery in Databases
Project 2 – meningoencephalitis diagnosis
University of Economics in Prague
Oliver Genský (xgeno00)
12-16-2017


Table of Contents

1. Introduction
2. “Business” understanding
2.1. Background
2.1.1. Meningitis
2.1.2. Process modeling
2.2. Objectives
3. Data understanding
3.1. Data
3.2. Target variables
3.3. Input variables
4. Data preparation
4.1. Treatment of outliers and missing values
4.2. Variable transformations
4.3. Sampling and data partitioning
5. Modeling
5.1. Candidate models
5.1.1. Decision Trees
5.1.2. Neural Networks
5.2. Model selection approach
5.3. Final model
5.3.1. Overall predictive accuracy
5.3.2. Observed versus predicted target values
5.3.3. Improvement over baseline
6. Discussion
6.1. Assessment of model performance
6.2. Contribution to the solution of the problem
6.3. Deployment recommendations


1. Introduction

The purpose of this work is to use data mining methods and tools to address the given objectives in the medical field of brain diseases. The analysis of the medical data must follow the CRISP-DM methodology. This methodology provides a data mining process model that describes the approaches data mining experts commonly use to tackle problems. (Wikipedia community, n.d.) A little more about the methodology can be found in the appendix.

2. “Business” understanding

Before any data manipulation is initiated, we need to understand the terms used in the particular business (field) and get to know its concepts and processes that relate to our DM task.

2.1. Background

First, it is necessary to define how the terms and processes were understood from the assignment and how they relate to the given objectives.

2.1.1. Meningitis

Meningitis belongs among the neurological infectious diseases. Bacteria or a virus invades the dura sheet (covering the brain), which causes severe inflammation of the dura. When the brain itself is inflamed, the patient is diagnosed with "meningoencephalitis". Sometimes, when bacteria form an abscess in the brain, the patient is diagnosed with "brain abscess". (Tsumoto, 2000)

2.1.2. Process modeling

The process of diagnosing and treating the patient starts with the patient being admitted and ends with the patient being discharged. First, the doctor gathers basic information about the patient's present history and health status and subsequently performs a physical examination. After these first two steps, the doctor is ready to draw up a first diagnosis. In the medical field, it is standard to always make a differential diagnosis using different information about the patient to support or discard the first assumption. In our case, biological samples of the patient's body are used to elaborate laboratory examination findings, according to which a second diagnosis is proposed. Both diagnostic steps of our process are described more closely in the appendix. After the two diagnoses have been elaborated, the doctor decides on a suitable therapy for the patient. In the data we received, the therapy part and the status of the patient after discharge are represented by the last few columns. A diagram of this process can be seen below.

Picture 1 - process of examining and treating the patient (source: Author, in draw.io)


2.2. Objectives

1.) Please find factors important for diagnosis (DIAG and DIAG2)

2.) Please find factors for detection of bacteria or virus (CULT_FIND and CULTURE)

3.) Please find factors for predicting prognosis / predict prognosis (C_COURSE and COURSE)

The original author of the assignment proposed three different tasks. In this work, only the last one, finding factors for prognosis, will be elaborated. The author left us freedom in approaching the task, saying: “Any knowledge extraction is welcome!”. The most important predictors of the patient's final condition can be identified at different stages of the process the patient goes through. As the process proceeds, the doctor, and likewise our data mining tools, have more and more information about the patient's health condition. It is therefore assumed that proceeding through the process will improve our prediction results; nevertheless, the models are still expected to give doctors some additional information even after the first data gathering, i.e. the physical examination at admission.

These three stages will be considered for analysis:

• After the physical examination – stage1

• After the laboratory examination – stage2

• After the therapy – stage3

3. Data understanding

3.1. Data

We received a table of 140 rows, where one row represents the health record of one patient and stores all information about the patient gathered throughout the whole process. The table was received in .csv format and subsequently imported into SAS libraries as three SAS tables using the IMPORT and SAVE DATA nodes in Enterprise Miner. Before the import, the table was split into three tables, where each split table contains all data gathered up to the given stage. The table for the last stage equals the received table, as at that stage we already have all the information from the finished process. The received table contains altogether 38 variables. In this work, COURSE (grouped) will be used as the target variable. The first split has 18 predictors (present history + physical examination) and the second split has 14 additional ones, resulting in 32 predictors (split 1 + laboratory examination). The data could also have been imported in one piece; the problem was that filtering the columns for modeling would then have been very time-consuming, as SAS EM sorts attributes alphabetically, which made filtering a rather difficult task.
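The stage-wise split can also be reproduced outside of SAS EM. A minimal sketch in Python; the column lists here are abbreviated, hypothetical stand-ins, and the full stage membership follows Table 1:

```python
# Split one patient table into cumulative per-stage tables.
# The column groups below are abbreviated; the full lists follow Table 1.
STAGE1_COLS = ["AGE", "SEX", "COLD", "HEADACHE", "FEVER", "ONSET", "BT", "STIFF"]
STAGE2_COLS = STAGE1_COLS + ["WBC", "CRP", "ESR", "CT_FIND", "CSF_CELL"]
TARGET = "COURSE_GROUPED"

def stage_table(rows, cols):
    """Keep only the given predictor columns plus the target."""
    keep = cols + [TARGET]
    return [{c: r[c] for c in keep} for r in rows]

# One example record (values invented for illustration).
rows = [{"AGE": 34, "SEX": "M", "COLD": 0, "HEADACHE": 10, "FEVER": 7,
         "ONSET": "ACUTE", "BT": 38.1, "STIFF": 3, "WBC": 8000, "CRP": 2.0,
         "ESR": 15, "CT_FIND": "normal", "CSF_CELL": 1200,
         "COURSE_GROUPED": "n"}]

stage1 = stage_table(rows, STAGE1_COLS)   # present history + physical examination
stage2 = stage_table(rows, STAGE2_COLS)   # stage 1 + laboratory examination
print(len(stage1[0]), len(stage2[0]))     # 9 14
```

Because the stage tables are cumulative, every stage-1 column also appears in stage 2, mirroring how the split tables were built before import.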

Picture 2 - source data - graphically divided into 3 stages as obtained in the process (source: Author, in Excel)

3.2. Target variables

The original variable “C_COURSE” represents the patient's clinical course at discharge and can take 11 different values per record. Predicting a variable like this (multiple values) is not a common data mining task. In the grouped course variable, we already have this variable transformed into a



binary variable. The value “positive” means that the patient at discharge is dead or not completely healthy; the value “negative” is the opposite, meaning the patient was successfully treated. The distribution of this class variable is significantly uneven: 117 negatives to 23 positives. A proper operation might be to resample the distribution and make the occurrences of the two values even. Down-sampling would leave us with only a few records; up-sampling, on the other hand, is not a function implemented in Enterprise Miner. This problem can be partially solved by setting the decision matrix so that it favors hitting “positive” values. After the model is built, we can move the cut-off value to reach a preferred true negative rate or true positive rate. For example, the default cut-off of 0.5 can be moved to 0.4, which ensures a higher TPR, in our case more “positives” hit.
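The effect of moving the cut-off can be sketched as follows; the predicted probabilities are invented for illustration:

```python
# Moving the classification cut-off: lowering it from the default 0.5 to 0.4
# turns more borderline cases into predicted positives, raising the TPR.
def classify(p_positive, cutoff=0.5):
    return "p" if p_positive >= cutoff else "n"

# Invented predicted probabilities of "positive" for five patients.
scores = [0.10, 0.42, 0.47, 0.55, 0.80]

at_05 = [classify(s, 0.5) for s in scores]
at_04 = [classify(s, 0.4) for s in scores]
print(at_05)  # ['n', 'n', 'n', 'p', 'p']
print(at_04)  # ['n', 'p', 'p', 'p', 'p']
```

The two borderline patients (0.42 and 0.47) flip to "positive" at the lower cut-off; the price is a higher false positive rate on truly negative cases.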

C_COURSE – original variable

• negative: no symptoms

• EEG_abnormal: the patient had abnormality of EEG

• CT_abnormal: abnormality of CT

• frontal_sign: frontal sign is observed.

• attention: loss of attention is observed.

• aphasia: the patient cannot speak.

• amnesia: retrograde amnesia

• ataxia: motor disturbance is observed.

• epilepsy: the patient suffered from epilepsy after discharge.

• memory_loss: memory disturbance.

• dead: death

COURSE (grouped): grouped attribute of C_COURSE

• n: negative – 117 cases (83.6%)

• p: positive – 23 cases (16.4%)

Picture 3 - Bar plot showing distribution of target variable (source: Author, in SAS EM)


3.3. Input variables

Table 1 - input variables available for target prediction/classification (Source: Author, in MS Excel)

Stage 1 – Personal data
1   AGE           numerical    [10.000 ; 84.000]
2   SEX           categorical  M (82), F (58)

Stage 1 – Diagnosis
3   DIAG          categorical  ABSCESS (9), BACTERIA (24), BACTE(E) (8), TB(E) (1), VIRUS(E) (30), VIRUS (68)
4   DIAG2         categorical  BACTERIA (42), VIRUS (98)

Stage 1 – Present history
5   COLD          numerical    [0.000 ; 35.000]
6   HEADACHE      numerical    [0.000 ; 63.000]
7   FEVER         numerical    [0.000 ; 63.000]
8   NAUSEA        numerical    [0.000 ; 32.000]
9   LOC           numerical    [0.000 ; 26.000]
10  SEIZURE       numerical    [0.000 ; 6.000]
11  ONSET         categorical  SUBACUTE (7), ACUTE (130), CHRONIC (1), RECURR (2)

Stage 1 – Physical examination
12  BT            numerical    [35.500 ; 40.200]
13  STIFF         numerical    [0.000 ; 5.000]
14  KERNIG        categorical  [0.000 ; 1.000]
15  LASEGUE       categorical  [0.000 ; 1.000]
16  GSC           numerical    [9.000 ; 15.000]
17  LOC_DAT       categorical  - (98), + (42)
18  FOCAL         categorical  - (105), + (35)

Stage 2 – Laboratory examination
19  WBC           numerical    [1070.000 ; 90009.000]
20  CRP           numerical    [0.000 ; 31.000]
21  ESR           numerical    [0.000 ; 60.000]
22  CT_FIND       categorical  abnormal (39), normal (101)
23  EEG_WAVE      categorical  abnormal (117), normal (23)
24  EEG_FOCUS     categorical  - (104), + (36)
25  CSF_CELL      numerical    [0.000 ; 63350.000]
26  Cell_Poly     numerical    [0.000 ; 61520.000]
27  Cell_Mono     numerical    [0.000 ; 7840.000]
28  CSF_PRO       numerical    [0.000 ; 474.000]
29  CSF_GLU       numerical    [0.000 ; 520.000]
30  CULTURE_FIND  categorical  F (107), T (33)
31  CULTURE       categorical  - (107), neisseria (2), strepto (9), staphylo (2), tb (1), influenza (1), measles (1), pi(B) (6), varicella (3), rubella (2), adeno (1), herpes (5)

Stage 3 – Therapy and course
32  THERAPY2      categorical  multiple (10), ABPC+CZX (12), FMOX+AMK (1), ABPC (3), ope (2), Dara_P (1), ABPC+FMOX (4), LMOX (1), PCG (1), ABPC+LMOX (2), PIPC+CTX (1), no_therapy (58), ABPC+CTX (2), INH+RFP (1), ABPC+CEX (1), Zobirax (25), ARA_A (11), INH (1), globulin (3)
33  CSF_CELL3    numerical    [8.000 ; 4860.000]
34  CSF_CELL7    numerical    [0.000 ; 2137.000]


The diagnosis attributes were not suitable as input variables due to their dependence on other variables; the DIAG attribute can also be biased by the doctor's personal judgment. An attribute with almost the same value added as DIAG2 can be created by subtracting Cell_Mono from Cell_Poly (Cell_Poly - Cell_Mono) and treating a positive result of the subtraction as a BACTERIAL disease and a negative result as a VIRUS disease. Except for the "Therapy" and "Risk" attributes, all the other attributes are objective measures of the patient's physical state. The THERAPY attribute represents the approach chosen for the patient's treatment, is therefore objective, and can also be taken into account. RISK belongs among the final-state information about the patient and is obtained at the same time as the COURSE values, so it makes no sense to use it for predicting the value of COURSE.
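The Cell_Poly - Cell_Mono proxy for DIAG2 described above can be sketched as follows (the cell counts are invented, not taken from the real dataset):

```python
# Derived-attribute sketch: the sign of (Cell_Poly - Cell_Mono) as a proxy
# for DIAG2 (polymorphonuclear predominance -> BACTERIA, otherwise VIRUS).
def poly_mono_diag2(cell_poly, cell_mono):
    return "BACTERIA" if cell_poly - cell_mono > 0 else "VIRUS"

# Invented cell counts for two hypothetical patients.
print(poly_mono_diag2(4800, 300))   # BACTERIA
print(poly_mono_diag2(120, 900))    # VIRUS
```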

4. Data preparation

4.1. Treatment of outliers and missing values

No outliers were observed in the obtained dataset. The chosen modeling methods can handle the few missing values in the CSF_CELL3 attribute.

4.2. Variable transformations

• In the case of interval inputs, if skewed, a transformation to a more normal distribution could be considered for calculating parametric models. This is not the case for decision trees, which are insensitive to skewed distributions of predictors. All transformations will be done in EM before calculating the specific models. The models will also be run without transformations and compared with the transformed ones in order to know whether the transformations did not make the results even worse.

• The POLY-MONO attribute was considered, but, as proven in other work, it gives the same value added for the purposes of our task as DIAG2. For this reason, DIAG2 will be kept and used from stage 2.
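As a sketch of the kind of transformation meant in the first bullet, a log transformation of a skewed interval input (the values are invented, WBC-like in scale):

```python
import math

# Skewed interval inputs such as WBC or CSF_CELL span several orders of
# magnitude; log1p (log(1 + x), safe for zeros) compresses the right tail
# toward a more normal shape for parametric models.
def log_transform(values):
    return [math.log1p(v) for v in values]

raw = [0.0, 1070.0, 8000.0, 90009.0]   # invented, WBC-like in scale
transformed = log_transform(raw)
print([round(t, 2) for t in transformed])
```

After the transform, the gap between the largest and smallest value shrinks from five orders of magnitude to roughly a factor of two in the positive range, which is what makes the distribution friendlier to parametric models.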

4.3. Sampling and data partitioning

• In all models except decision trees, a standard 70/30 partitioning will be used: a 70% training set and a 30% validation set. The decision trees must also have a test sample, as the pruned subtrees are chosen based on their performance on the validation sample.

• Where possible, k-fold cross validation might be used during training, as the number of observations is low.
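The 70/30 partitioning can be sketched as a seeded shuffle over the 140 records (SAS EM's own partition node works differently in detail; this is only an illustration of the proportions):

```python
import random

# Reproducible 70/30 partitioning of the records (seeded shuffle).
def partition(records, train_frac=0.7, seed=42):
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    cut = int(len(records) * train_frac)
    train = [records[i] for i in idx[:cut]]
    valid = [records[i] for i in idx[cut:]]
    return train, valid

data = list(range(140))          # stand-in for the 140 patient records
train, valid = partition(data)
print(len(train), len(valid))    # 98 42
```

With only 140 rows, the validation set holds just 42 cases, which is exactly why the second bullet suggests k-fold cross validation where the tool allows it.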


5. Modeling

5.1. Candidate models

5.1.1. Decision Trees

For the purpose of having a variety to compare, all three splitting criteria (ProbChiSq, Gini, Entropy) are chosen for both the binary tree and the multiway tree. The multiway trees are set to a maximum of 5 branches. This is done in every stage, leaving us with 18 (6x3) models in total. Decisions (as rules) were used due to the higher value of finding the few "positive" cases. Cross validation is a much appreciated function here, as the dataset is small. The settings of the models can be seen below.

Table 2 - tuning parameters of Decision Trees (Source: Author, in MS Excel)

All trees share the same settings: Sig. Level: 0.2; Split search: Use decisions: Yes, Use priors: No; Subtree: Method: Largest, Measure: Decision; Cross validation: Yes - 10. The same six trees are built in each of the three stages (Stage1, Stage2, Stage3):

Model              Splitting rule (NTC)   Max. Branch
Tree1 - binary     ProbChisq              2
Tree2 - binary     Gini                   2
Tree3 - binary     Entropy                2
Tree4 - multiway   ProbChisq              5
Tree5 - multiway   Gini                   5
Tree6 - multiway   Entropy                5
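To illustrate what the Gini and Entropy splitting criteria actually measure, a minimal split-evaluation sketch: the root distribution of 23 positives to 117 negatives matches our target variable, but the candidate child counts are invented for illustration:

```python
import math

# How a binary tree scores a candidate split: impurity of the parent node
# minus the weighted impurity of the two children; larger decrease is better.
def gini(pos, neg):
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)

def entropy(pos, neg):
    n = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / n
            h -= p * math.log2(p)
    return h

def impurity_decrease(parent, left, right, measure):
    n = parent[0] + parent[1]
    wl = (left[0] + left[1]) / n
    wr = (right[0] + right[1]) / n
    return measure(*parent) - wl * measure(*left) - wr * measure(*right)

# Root: 23 positives / 117 negatives (our target); the split is invented.
root, left, right = (23, 117), (18, 22), (5, 95)
print(round(impurity_decrease(root, left, right, gini), 4))
print(round(impurity_decrease(root, left, right, entropy), 4))
```

The split search evaluates many candidate splits this way and keeps the one with the largest decrease; ProbChiSq instead ranks splits by the p-value of a chi-square test on the same contingency counts.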


5.1.2. Neural Networks

Neural network results tend to be best if the target variable is relatively evenly distributed in the learning set. (Scholderer, 2017) Our dataset is the opposite: the target variable is unevenly distributed, 84:16. Furthermore, we possess only a small dataset with not that many variables. Although neural networks are known to be a powerful modeling tool when there are not too many input variables, they are effective only when the amount of data is sufficient. (Scholderer, 2017)

Two different models will be built for each stage, resulting in a total of 6 models: for each stage, one model with 2 hidden units and one model with 5 hidden units. The interval input variables go through binning, and afterwards dummy variables are created from all categorical variables; this way, the neural network promises to work better. The neural networks were also run without this data transformation, but the results turned out to be unacceptable. The tuning parameters can be seen below.

Table 3 - tuning parameters of neural networks (Source: Author, in MS Excel)

Model            Tuning parameters
Stage1 Neural1   Hidden units: 2, Model selection criterion: Profit/Loss, archit.: MLP
Stage1 Neural2   Hidden units: 5, Model selection criterion: Profit/Loss, archit.: MLP
Stage2 Neural1   Hidden units: 2, Model selection criterion: Profit/Loss, archit.: MLP
Stage2 Neural2   Hidden units: 5, Model selection criterion: Profit/Loss, archit.: MLP
Stage3 Neural1   Hidden units: 2, Model selection criterion: Profit/Loss, archit.: MLP
Stage3 Neural2   Hidden units: 5, Model selection criterion: Profit/Loss, archit.: MLP

The combination, activation and error functions of the neural networks are left at their defaults; they are auto-picked by the software based on the other model settings and the input variables. (SAS Institute Inc., 2016)
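The binning and dummy-variable preprocessing mentioned above can be sketched as follows; the bin edges and example values are invented, while the ONSET categories come from Table 1:

```python
# Sketch of the preprocessing applied before the neural networks: bin an
# interval input into buckets and one-hot (dummy) encode a categorical one.
def bin_value(x, edges):
    """Index of the first bin whose upper edge is >= x."""
    for i, e in enumerate(edges):
        if x <= e:
            return i
    return len(edges)

def dummies(value, categories):
    return [1 if value == c else 0 for c in categories]

# Bin edges for body temperature (BT) are invented; ONSET categories are real.
bt_edges = [36.5, 37.5, 38.5]
onset_cats = ["ACUTE", "SUBACUTE", "CHRONIC", "RECURR"]

print(bin_value(38.1, bt_edges))        # 2
print(dummies("SUBACUTE", onset_cats))  # [0, 1, 0, 0]
```

Binning caps the influence of extreme interval values, and dummy coding gives the network one numeric input per category level, which is why the untransformed runs performed worse.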

5.2. Model selection approach

To find the best fitting model, misclassification was chosen as the selection criterion. The model will also be chosen with respect to average profit values. Average profit is made the secondary criterion, as this value can later be improved by moving the cut-off value, which is set to 0.5 (the default). The values of this measure in the table below are misclassification rates on validation data for the neural networks and on test data for the decision trees. We can clearly see across the stages that any additional information helped the neural networks improve their performance, i.e. lower their misclassification rate. The trees were also enriched by the additional inputs in the second stage; the third-stage inputs did not help and in some cases even worsened the results. None of the inputs from the third stage was chosen as a splitting criterion in any of the 6 DT models. Although the neural networks classified better and better as the process proceeded, they did not improve on the average profit criterion. In the second stage, the best results are achieved by the binary Gini tree, taking the misclassification rate and the average squared error into account. Its average profit criterion also offers solid results compared to the other models, so it can be considered the winning model of the second stage. In the first stage, the binary ProbChiSq tree won on all chosen criteria. Stage 3 will not have a winning model chosen, as it did not give much additional information for predicting the class variable.


Table 4 - assessing of proposed models across 3 different stages (Source: Author, in Excel)

Model / STAGE1              Misclass. Rate   Avg. Squared Error   ROC index   Avg. Profit/Loss
Neural1 - hu2               0.256            0.224                0.578       0.930
Neural2 - hu5               0.256            0.195                0.645       0.930
Tree1 - binary - prob       0.172            0.140                0.580       1.020
Tree2 - binary - gini       0.310            0.212                0.485       0.960
Tree3 - binary - entropy    0.310            0.212                0.485       0.960
Tree4 - multiway - prob     0.172            0.171                0.435       0.793
Tree5 - multiway - gini     0.379            0.196                0.450       0.828
Tree6 - multiway - entropy  0.379            0.196                0.450       0.828

Model / STAGE2              Misclass. Rate   Avg. Squared Error   ROC index   Avg. Profit/Loss
Neural1 - hu2               0.186            0.138                0.750       0.900
Neural2 - hu5               0.186            0.140                0.706       1.090
Tree1 - binary - prob       0.172            0.168                0.525       1.000
Tree2 - binary - gini       0.138            0.122                0.580       1.100
Tree3 - binary - entropy    0.138            0.126                0.535       1.000
Tree4 - multiway - prob     0.240            0.143                0.535       1.060
Tree5 - multiway - gini     0.206            0.144                0.620       1.380
Tree6 - multiway - entropy  0.206            0.144                0.620       1.380

Model / STAGE3              Misclass. Rate   Avg. Squared Error   ROC index   Avg. Profit/Loss
Neural1 - hu2               0.163            0.130                0.722       1.000
Neural2 - hu5               0.163            0.144                0.798       1.000
Tree1 - binary - prob       0.241            0.183                0.525       1.000
Tree2 - binary - gini       0.138            0.122                0.580       1.100
Tree3 - binary - entropy    0.206            0.137                0.630       1.340
Tree4 - multiway - prob     0.241            0.156                0.550       1.030
Tree5 - multiway - gini     0.206            0.144                0.620       1.380
Tree6 - multiway - entropy  0.206            0.144                0.620       1.380

5.3. Final model

The Gini binary tree for the second stage and the ProbChiSq binary tree for the first stage were chosen as the best candidates. The picture below shows the splitting attributes of the ProbChiSq binary tree and their hierarchy:


Picture 4 - ProbChiSq Binary Tree - visualization (Source: author, in SAS EM)

5.3.1. Overall predictive accuracy

If the Gini binary tree from the second stage had the task of choosing one half of the whole dataset such that this half contained as many events as possible, 63.48% of all events would be found in its choice, leaving 36.52% of the events in the other half. Without a model and predictors (random choice), only half of all events would be found in a randomly halved dataset. For the ProbChiSq binary tree, the proportion would be 67.06% to 32.94%.

Model performance (binary tree - Gini - stage2) is shown below:

Picture 5 - Cumulative % Captured response - model Gini Binary Tree (stage2) (Source: Author, in SAS EM)


Model performance (binary tree - ProbChiSq - stage1) is shown below:

Picture 6 - Cumulative % Captured response - model ProbChiSq Binary Tree (stage1) (Source: Author, in SAS EM)

5.3.2. Observed versus predicted target values

Of the total of 5 positive cases in the validation data, the model would predict 2 correctly. That is 7.14% of all 28 cases of the validation dataset and 40% of the positive cases only. 3 positive cases would be incorrectly classified as negatives (10.71% of the total). 67.86% of all cases, which is 19, would be negatives predicted correctly (82.61% of the negative cases only). 14.29% of all data would be negatives classified as positives.

Picture 7 - Comparison of classification charts - model ProbChiSq Binary Tree (stage1) - train and validate (Source: Author, in SAS EM)


Accuracy, sensitivity, specificity, etc. achieved on the validation set are given in the table below:

Table 5 - Measures of predicting binary target in Validation Data - model ProbChiSq Binary Tree (stage1) (Source: Author, in MS Excel)

Measure                    Value    Derivation
Sensitivity                0.4000   TPR = TP / (TP + FN)
Specificity                0.8261   SPC = TN / (FP + TN)
Precision                  0.3333   PPV = TP / (TP + FP)
Negative Predictive Value  0.8636   NPV = TN / (TN + FN)
False Positive Rate        0.1739   FPR = FP / (FP + TN)
False Discovery Rate       0.6667   FDR = FP / (FP + TP)
False Negative Rate        0.6000   FNR = FN / (FN + TP)
Accuracy                   0.7500   ACC = (TP + TN) / (P + N)
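The measures in Table 5 follow directly from the validation confusion matrix (TP = 2, FN = 3, TN = 19, FP = 4), which can be verified with a few lines:

```python
# The measures in Table 5, recomputed from the validation confusion matrix.
TP, FN, TN, FP = 2, 3, 19, 4   # 28 validation cases in total

sensitivity = TP / (TP + FN)                   # TPR
specificity = TN / (FP + TN)                   # SPC
precision   = TP / (TP + FP)                   # PPV
npv         = TN / (TN + FN)                   # negative predictive value
accuracy    = (TP + TN) / (TP + TN + FP + FN)  # ACC

print(round(sensitivity, 4), round(specificity, 4),
      round(precision, 4), round(npv, 4), round(accuracy, 4))
# 0.4 0.8261 0.3333 0.8636 0.75
```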

5.3.3. Improvement over baseline

The plot below shows the cumulative ratio of the percent captured responses within each decile to the baseline percent response. (SAS Institute Inc., 2016) The baseline is a ratio of 1.0. If the model checked half of the cases (a depth of 50), we would hit 1.34 times more positive cases than if we were searching randomly. This can also be read from the plot displayed below.

Picture 8 - Cumulative lift - ProbChiSq Binary Tree (stage1), selected model (Source: Author, in SAS EM)
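Cumulative lift itself is straightforward to compute from scored cases; a sketch on an invented 10-case sample:

```python
# Cumulative lift: rank cases by predicted score, take the top fraction
# ("depth"), and compare the positive rate there to the overall rate.
def cumulative_lift(scored, depth):
    """scored: list of (score, is_positive); depth: fraction in (0, 1]."""
    ranked = sorted(scored, key=lambda t: t[0], reverse=True)
    top = ranked[:max(1, int(len(ranked) * depth))]
    overall_rate = sum(y for _, y in scored) / len(scored)
    top_rate = sum(y for _, y in top) / len(top)
    return top_rate / overall_rate

# Invented sample: 10 cases, 4 positives, mostly scored high by the model.
sample = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 1), (0.3, 0), (0.2, 0), (0.1, 0), (0.05, 0)]
print(cumulative_lift(sample, 0.5))  # top half holds 3 of 4 positives
```

A lift of 1.0 is the random baseline; our reported 1.34 at depth 50 means the model's best-scored half contains about a third more positives than a random half would.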

6. Discussion

6.1. Assessment of model performance

Since only limited data treatment was performed, better performance might be expected even from the same model if various data treatment methods and a variety of model adjustments were applied (e.g. methods of input reduction were not used). As we worked with a very small sample of data, the data mining methods could not fully demonstrate their capabilities. Nevertheless, even on a tiny dataset, our mining tool was able to create models that can, at an early stage of the patient's admission, predict the patient's "positive/negative" status at discharge. The models evidently outperform simple guessing with knowledge of the distribution. Another problem, besides the not entirely convincing performance, that could be solved with a larger dataset is model stability.

6.2. Contribution to the solution of the problem

With the knowledge obtained in this work, we might assume that the final status of the patient, not just positive/negative but also the resulting disease, treatment consequences and so forth, might be predicted already in the early stages of the patient's admission. By adjusting the cut-off value, our models can reach such predictive qualities that they find every case being looked for. In our case, we


could possibly capture all the cases where the patient will have resulting health issues after treatment. All we need to do is decrease the cut-off value to 0.2, as shown in the picture below; that way, we reach a true positive rate of 100%. This also results in more false positives being classified, but that is a sacrifice which needs to be made. On the plot below, we see how the "true negative rate" (brown line) decreased.

Picture 9 - manipulating the cut-off value (Source: Author, in SAS EM)

6.3. Deployment recommendations

After deploying these kinds of models, they need to be recalculated every once in a while on newly gathered data. The models should be used only as an additional tool alongside the doctor's professional working procedures.



List of Pictures

Picture 1 - process of examining and treating the patient (Source: Author, in draw.io) .................................................................................... 2

Picture 2 - source data - graphically divided into 3 stages as obtained in the process (Source: Author, in Excel) .............................................. 3

Picture 3 - Bar plot showing distribution of target variable (source: Author, in SAS EM) .................................................................................. 4

Picture 4 - ProbChiSq Binary Tree - visualization (Source: author, in SAS EM) ................................................................................................ 10

Picture 5 - Cumulative % Captured response - model Gini Binary Tree (stage2) (Source: Author, in SAS EM) ................................................. 10

Picture 6 - Cumulative % Captured response - model ProbChiSq Binary Tree (stage1) (Source: Author, in SAS EM) ....................................... 11

Picture 7 - Comparison of classification charts - model ProbChiSq Binary Tree (stage1) - train and validate (Source: Author, in SAS EM) .... 11

Picture 8 - Cumulative lift - ProbChiSq Binary Tree (stage1), selected model (Source: Author, in SAS EM) ..................................................... 12

Picture 9 - manipulating with cut-off value (Source: Author, in SAS EM) ........................................................................................................ 13

List of Tables

Table 1 - input variables available for target prediction/classification (Source: Author, in MS Excel) ............................................................... 5

Table 2 - tuning parameters of Decision Trees (Source: Author, in MS Excel) ................................................................................................... 7

Table 3 - tuning parameters of neural networks (Source: Author, in MS Excel) ................................................................................................ 8

Table 4 - assessing of proposed models across 3 different stages (Source: Author, in MS Excel) .............................................................. 9

Table 5 - Measures of predicting binary target in Validation Data - model ProbChiSq Binary Tree (stage1) (Source: Author, in MS Excel) .... 12

References

Berka, P. & Kocka, T., 2003. Meningoencephalitis Data Analysis Based on the CRISP-DM Methodology, Prague: University of Economics in

Prague.

SAS Institute Inc., 2016. SAS Enterprise Miner Reference Help, s.l.: s.n.

Scholderer, J., 2017. Lecture 13 - Neural networks, Aarhus: BSS - Aarhus University.

Tsumoto, S., 2000. Guide to the meningoencephalitis Diagnosis Data Set. [Online]

Available at: http://www.ar.sanken.osaka-u.ac.jp/pub/washio/jkdd/menin.htm

[Accessed 29 December 2017].

Wikipedia community, n.d. Cross-industry standard process for data mining. [Online]

Available at: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining

[Accessed 29 December 2017].


Appendix

Diagnostic approaches:

Diagnosis 1: The following symptoms are checked:

• high fever (present history data)

• severe headache (present history data)

• nausea (present history data)

• vomiting (no info in data)

• neck stiffness (physical examination data)

• Kernig sign (physical examination data)

• Lasegue sign (physical examination data)

Diagnosis 2 (differential): The differential diagnosis is made as follows:

1. Check the cell count in cerebrospinal fluid (CSF).
2. If polynuclear cells are dominant, bacterial meningitis is diagnosed.
3. If mononuclear cells are dominant, viral meningitis is diagnosed.
4. For a diagnosis of brain abscess, CT is used to confirm the diagnosis.
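As a minimal sketch, the differential rules above can be written as a rule-based function (the argument names cell_poly, cell_mono, and ct_abscess are hypothetical, not actual column names from the meningoencephalitis dataset):

```python
# Illustrative rule-based encoding of the differential diagnosis above.
# Argument names are hypothetical, not the dataset's real column names.
def differential_diagnosis(cell_poly, cell_mono, ct_abscess=False):
    """Apply the CSF cell-count rules from Diagnosis 2 (differential)."""
    if ct_abscess:                 # step 4: CT confirms brain abscess
        return "brain abscess"
    if cell_poly > cell_mono:      # step 2: polynuclear cells dominant
        return "bacterial meningitis"
    if cell_mono > cell_poly:      # step 3: mononuclear cells dominant
        return "viral meningitis"
    return "inconclusive"

print(differential_diagnosis(cell_poly=220, cell_mono=15))
# prints "bacterial meningitis"
```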

CRISP-DM

CRISP-DM (CRoss-Industry Standard Process for Data Mining) is a European Commission funded project for defining a standard process model for carrying out data mining projects. CRISP-DM addresses the needs of all levels of users in deploying data mining technology to solve business problems. The project aim is to define and validate a data mining process that is generally applicable in diverse industry sectors. According to CRISP-DM, the life cycle of a data mining project consists of six phases, shown in Appendix Picture 1 below. We will follow these phases during our work with the meningoencephalitis data. (Berka & Kocka, 2003)

Appendix Picture 1 - phases of CRISP-DM


Diagrams

Appendix Picture 2 - diagram of DTs (Source: Author, in SAS EM)

Appendix Picture 3 - diagram of NNs (Source: Author, in SAS EM)
