
Predicting conversion from Mild Cognitive Impairment to Alzheimer’s Disease using a Temporal Mining approach

An exploratory study in real data

Rita Pissarra Levy

Thesis to obtain the Master of Science Degree in

Biomedical Engineering

Supervisor

Professor Sara Alexandra Cordeiro Madeira

Examination Committee

Chairperson: Professor Patrícia Margarida Piedade Figueiredo
Supervisor: Professor Sara Alexandra Cordeiro Madeira
Members of the Committee: Professor Alexandre Valério de Mendonça
Professor Alexandra Sofia Martins de Carvalho

December 2016

Ter a dúvida é saber exatamente o que estou a dizer.
José de Almada Negreiros

Acknowledgments

First, I would like to express my gratitude and acknowledgments to my supervisor Sara Madeira.

Thank you for the guidance and trust. Thank you for the opportunity to be part of this project. I also

express my gratitude to Dr. Manuela Guerreiro, Dr. Alexandre Mendonça and Sandra Cardoso for all the medical feedback given throughout this thesis.

A very special thank you to Telma Pereira. You made this work a lot easier and very enjoyable.

Thank you for putting out the fires and clearing up all my doubts, for helping me face my challenges and being there for me. I would like to thank the kdBio group, especially Andreia, Ricardo, Rui and Sofia, for the great team spirit and working environment.

To my dearest friends, Miguel, Tiago, Catarina, Rita, Joana, Mina, MJ, Aparício, Silva, Salomé and Flávio: I know I wouldn’t be here if it wasn’t for you. You made me less afraid of failing and showed me that everything is possible with hard work. You showed me that it is possible to laugh even when everything is going wrong. All the time spent working and studying was hard, but the memories it leaves are sweet because of you.

To my friends from GASTagus, thank you for opening my eyes and showing me another side of life that I would never have seen if it wasn’t for you all. You changed me, my values and my dreams.

To Inês Machado, for inspiring me to always want more, to be in a band, and for showing me that any obstacle can be overcome.

To Sal, thank you for the help, the support, and for always showing the positive side of everything.

Margarida, thank you for being the one who knows me and with whom I feel safe telling everything.

I would like to thank my entire family for being there for me. Thank you for helping me grow up.

My brother, for all the laughs, for challenging me intellectually and for making me want to give my best.

My grandmother, for asking me every day when I was going to finish my thesis.

My dear Sister Sara, thank you for all the help, all the guidance in this work and in my life.

Finally, I would like to thank my mother, for all the support, unconditional love and patience.

For my uncle.

For my father.


Abstract

Alzheimer’s Disease (AD) is a neurodegenerative disease and the most common form of dementia. AD prevalence increases each year and there is no efficient and universal treatment. Data mining methods recognize patterns in medical databases of neuropsychological data, providing information to the medical doctor and facilitating the prognosis of conversion from a Mild Cognitive Impairment (MCI) state to AD. Previous studies apply independent classifiers to this prognosis problem, ignoring any temporal information present in the dataset. This thesis contributes to the early detection of conversion to AD with two approaches: preprocessing techniques that transform the original dataset in order to capture this temporal information, and a temporal classifier able to deal with this temporal information directly.

The first approach consists of feature extraction, where temporal features that quantify progression, define temporal patterns and statistically summarize the progression between two timepoints are derived from the original values. This preprocessing workflow is followed by a classification task carried out with the Naïve Bayes classifier. The second approach relies on Hidden Markov Models (HMM) to internally process the temporal information in the original dataset. The first approach shows promising results, with many models outperforming the baseline. However, HMM did not provide a good alternative, as the environment was unable to successfully process the original data.

Overall, this exploratory work points to a future path towards better understanding the underlying AD mechanisms and improving AD prognosis with data mining methodologies.

Keywords

Alzheimer’s Disease, Mild Cognitive Impairment, Temporal Mining, Data Mining


Resumo

A Doença de Alzheimer (DA) é uma doença neurodegenerativa e a forma mais comum de demência. A sua incidência aumenta de ano para ano e até à data não existe nenhuma forma eficiente e universal de a combater. Mecanismos de mineração de dados permitem, através de uma base de dados de resultados de testes neuropsicológicos, identificar padrões e chegar a conclusões sobre estes dados que permitem ao médico prever com mais facilidade a progressão de um doente de um estado de Défice Cognitivo Ligeiro para um estado de Alzheimer. Trabalhos anteriores usam classificadores e apenas a primeira consulta para esta previsão, ignorando informação temporal. Esta tese pretende contribuir para a previsão do prognóstico de evolução para DA com duas abordagens: pré-processamento da base de dados original de maneira a captar informação temporal e uso da base de dados original num classificador temporal preparado para a receber. A primeira consiste na criação de atributos temporais a acrescentar à base de dados que consigam captar a informação temporal que um classificador independente não consegue. Foram criados atributos temporais que calculam a progressão, que definem padrões e que sumarizam estatisticamente o intervalo entre dois instantes temporais. A segunda baseia-se no uso de Hidden Markov Models de maneira a conseguir processar a informação temporal presente na base de dados. Os primeiros resultados mostram-se promissores, já que muitos conseguiram superar os resultados considerados como baseline, indicando que a informação temporal ajuda na deteção da conversão para DA. Contudo, a aplicação dos Modelos de Markov a este problema não conseguiu provar ser uma boa alternativa, já que a plataforma onde estes modelos foram usados raramente conseguiu fornecer resultados para as bases de dados em questão. No geral, este trabalho exploratório mostra um futuro caminho a seguir para a melhor compreensão dos mecanismos da DA e da capacidade de prognóstico por modelos de mineração de dados usando informação temporal.

Palavras Chave

Doença de Alzheimer, Défice Cognitivo Ligeiro, Mineração temporal de dados, Mineração de dados


Contents

1 Introduction
1.1 Motivation
1.2 Problem Formulation
1.3 Contributions
1.4 Thesis Outline

2 Background
2.1 Alzheimer’s Disease
2.1.1 Neuropsychological tests
2.2 Data Mining
2.2.1 Data Preprocessing
2.2.2 Data Classification
2.3 Temporal Data Mining
2.3.1 Similarity Measures
2.3.2 Temporal Features
2.3.3 Hidden Markov Models
2.4 Related Work

3 Methods
3.1 Database Description
3.2 Data Preprocessing: Independent Datasets
3.2.1 Data Cleaning
3.2.2 Learning Examples
3.2.3 Temporal Features
3.2.3.A Datasets without temporal features
3.2.3.B Progression Feature Datasets
3.2.3.C Temporal Pattern Datasets
3.2.3.D Statistics-Based Summarization Datasets
3.2.4 Feature Selection
3.2.5 Synthetic Minority Over-sampling Technique
3.3 Data Preprocessing: Relational Datasets
3.4 Classification Methodology
3.4.1 Model Evaluation Metrics

4 Results and Discussion
4.1 Feature Selection
4.2 Independent Datasets
4.2.1 Dataset without temporal features
4.2.2 Progression Feature Datasets
4.2.3 Temporal Pattern Datasets
4.2.4 Statistics-Based Summarization Datasets
4.3 Hidden Markov Models: Preliminary Results

5 Conclusions and Future Work

Bibliography

Appendix A Complete List of Features

Appendix B Feature Selection: Complementary Information
B.1 Dataset without temporal progression attribute
B.2 Progression Feature Datasets
B.3 Unsupervised Discretization
B.4 Only Progression Features
B.5 All timepoints
B.6 Temporal Pattern Datasets
B.7 Statistics-Based Summarization Datasets

Appendix C Results: Complete Tables
C.1 Independent Datasets
C.1.1 Dataset without temporal features
C.1.2 Progression Feature Datasets
C.1.3 Temporal Pattern Datasets
C.1.4 Statistics-Based Summarization Datasets


List of Figures

3.1 Histogram of the number of observations for each patient.
3.2 Preprocessing workflow.
3.3 Histogram of the number of missing values per instance in the original database.
3.4 Histogram of the number of observations of the total number of patients after the data cleaning process.
3.5 Stable MCI Learning Example in a 4-year time window.
3.6 Patient that does not become a learning example in a 5-year time window.
3.7 Converter MCI Learning Example in a 5-year time window.
3.8 Patient that does not become a learning example in a 4-year time window.
3.9 Temporal Processing without temporal features.
3.10 Temporal Processing with Progression Rate.
3.11 Temporal Processing with Temporal Pattern.
3.12 Temporal Processing with different means and variances.
3.13 Class Imbalance on 3Y, 4Y and 5Y time window datasets before applying SMOTE.
3.14 Class Imbalance on 3Y, 4Y and 5Y time window datasets after applying SMOTE.
4.1 Percentage of times attributes appeared in the classification model.
4.2 Percentage of times Z-score attributes appeared in the classification model.
4.3 Missing value percentage for every attribute.
4.4 Missing value percentage for every Z-score attribute.
4.5 Confusion Matrices for the cross validation results for datasets without temporal features.
4.6 Cross Validation results for the Accuracy and AUC for datasets without temporal features.
4.7 Cross Validation results for the Specificity and Sensitivity for datasets without temporal features.
4.8 Confusion Matrices for the cross validation results for datasets with progression features.
4.9 Cross Validation results for the Accuracy, AUC, Specificity and Sensitivity for datasets with progression features.
4.10 Confusion Matrices for the cross validation results for discretized datasets with progression features.
4.11 Cross Validation results for the Accuracy, AUC, Specificity and Sensitivity for discretized datasets with progression features.
4.12 Confusion Matrices for the cross validation results for datasets with progression features and with the original set of features.
4.13 Cross Validation results for the Accuracy, AUC, Specificity and Sensitivity for datasets with progression features and with the original set of features.
4.14 Confusion Matrices for the cross validation results for datasets with progression features and without the original values.
4.15 Cross Validation results for the Accuracy, AUC, Specificity and Sensitivity for datasets with progression features and without the original values.
4.16 Confusion Matrices for the cross validation results for datasets with progression features, considering all observations within the time window.
4.17 Cross Validation results for the Accuracy, AUC, Specificity and Sensitivity for datasets with progression features, considering all observations within the time window.
4.18 Confusion Matrices for the cross validation results for datasets with the temporal pattern attribute.
4.19 Cross Validation results for the Accuracy and AUC for datasets with the temporal pattern attribute.
4.20 Cross Validation results for the Specificity and Sensitivity for datasets with the temporal pattern attribute.
4.21 Confusion Matrices for the cross validation results for datasets with mean features.
4.22 Confusion Matrices for the cross validation results for datasets with variance features.
4.23 Cross Validation results for the Accuracy, AUC, Specificity and Sensitivity for datasets with statistics-based features.


List of Tables

3.1 Demographic Information of the original database for every patient at their first observation.
3.2 Number of Instances, Patients and Attributes from the original dataset and at the end of the data cleaning processes.
3.3 Demographic Information of the preprocessed database for the first observation of the patient.
3.4 Number of Patients after the creation of Learning Examples for the different time windows.
4.1 Class distribution of the three time window datasets.
4.2 Cross Validation results for datasets without temporal features.
4.3 Number of instances for every time window in datasets with only one learning example per patient and with more than one learning example per patient.
4.4 Number of instances for every time window in datasets with only one learning example per patient and with more than one learning example per patient.
4.5 Cross Validation results for the datasets with progression features.
4.6 Cross Validation results for discretized datasets with progression features.
4.7 Cross Validation results for the datasets with progression features and with the original set of features.
4.8 Cross Validation results for the datasets with progression features and without the original values.
4.9 Accuracy and AUC comparison for progression feature datasets.
4.10 Specificity and Sensitivity comparison for progression feature datasets.
4.11 Cross Validation results for the datasets with progression features, considering all observations within the time window.
4.12 Cross Validation results for the datasets with the temporal pattern attribute.
4.13 Cross Validation results for the datasets with statistics-based features.
4.14 Accuracy values for the Hidden Markov Classification Model.
4.15 AUC values for the Hidden Markov Classification Model.
4.16 Specificity values for the Hidden Markov Classification Model.
4.17 Sensitivity values for the Hidden Markov Classification Model.
A.1 Complete List of Features

B.1 Set of Features for the dataset without temporal progression attribute and 3-year window
B.2 Set of Features for the dataset without temporal progression attribute and 4-year window
B.3 Set of Features for the dataset without temporal progression attribute and 5-year window
B.4 Set of Features for Progression Feature datasets and 3-year window
B.5 Set of Features for Progression Feature datasets and 4-year window
B.6 Set of Features for Progression Feature datasets and 5-year window
B.7 Set of Features for Progression Feature datasets with Unsupervised Discretization and 3-year window
B.8 Set of Features for Progression Feature datasets with Unsupervised Discretization and 4-year window
B.9 Set of Features for Progression Feature datasets with Unsupervised Discretization and 5-year window
B.10 Set of Features for Progression Feature datasets with Only Progression Features and 3-year window
B.11 Set of Features for Progression Feature datasets with Only Progression Features and 4-year window
B.12 Set of Features for Progression Feature datasets with Only Progression Features and 5-year window
B.13 Set of Features for Progression Feature datasets with all timepoints and 3-year window
B.14 Set of Features for Progression Feature datasets with all timepoints and 4-year window
B.15 Set of Features for Progression Feature datasets with all timepoints and 5-year window
B.16 Set of Features for Temporal Pattern datasets
B.17 Set of Features for Temporal Pattern datasets with all timepoints
B.18 Set of Features for Statistics-Based Summarization and 3-year window
B.19 Set of Features for Statistics-Based Summarization and 4-year window
B.20 Set of Features for Statistics-Based Summarization and 5-year window

C.1 Cross Validation results for the datasets without temporal features.
C.2 Cross Validation results for the datasets with Progression features.
C.3 Cross Validation results for the datasets with Progression features with unsupervised discretization.
C.4 Cross Validation results for the datasets with Progression features without Feature Selection.
C.5 Cross Validation results for the datasets with Progression features without the original values.
C.6 Cross Validation results for the datasets with Progression features with all timepoints.
C.7 Cross Validation results for the datasets with temporal pattern features.
C.8 Cross Validation results for the datasets with statistics-based features.

Abbreviations

AD Alzheimer’s Disease

ADNI Alzheimer’s Disease Neuroimaging Initiative

CCC Cognitive Complaints Cohort

CSF Cerebrospinal fluid

cMCI converter Mild Cognitive Impairment

FS Feature Selection

GDS Geriatric Depression Scale

MD Medical Doctor

MCI Mild Cognitive Impairment

MMSE Mini-Mental State Examination

MRI Magnetic Resonance Imaging

pre-MCI Pre Mild Cognitive Impairment

PET Positron Emission Tomography

sMCI stable Mild Cognitive Impairment

SMOTE Synthetic Minority Over-Sampling Technique

WEKA Waikato Environment for Knowledge Analysis


1 Introduction

Contents
1.1 Motivation
1.2 Problem Formulation
1.3 Contributions
1.4 Thesis Outline

1.1 Motivation

We are what we can remember. Our lives, the experiences we have been through, the places

we visit and the people we meet make us who we are. Losing the ability to remember is losing our

identity.

Dementia is a broad term for several progressive diseases affecting memory, other cognitive abilities and behavior, which interfere significantly with a person’s ability to engage in activities of daily living [1]. In 2015, dementia affected 47 million people worldwide (roughly 5% of the world’s elderly population), a number estimated to increase to 75 million in 2030 and 132 million by 2050 [2].

Alzheimer’s disease (AD) is the most common form of dementia and may account for 60 to 70% of cases. Individuals with AD have trouble remembering recent events, become confused

and forgetful, often repeating questions and getting lost in familiar places. As the disease evolves,

the ability to remember past events is lost while disorientation and violent mood swings increase

[3]. Although dementia mainly affects elderly people, it is not a normal part of aging and it is one

of the major causes of disability and dependency among elderly people worldwide. Dependence,

sometimes referred to as need for care, is defined as the need for frequent human help or care beyond

that usually required by a healthy adult [2]. According to this definition, around 5% of the world’s population is dependent, rising to 13% among those aged 60 years and over. AD has physical,

psychological, social and economic impact on caregivers, families and society. Alzheimer’s Disease

International states in its 2016 World Alzheimer Report that the total estimated worldwide cost of

dementia is US$818 billion. These costs are mainly due to care needs, as health care costs account for only a small proportion of the total, given the limited therapeutic options. Costs associated with dependency will

increase as the number of people with dementia increases (assuming that the prevalence of dementia,

patterns of service use, and unit costs of care remain the same) [4].

Mild Cognitive Impairment (MCI) was once considered an initial stage of AD, as the symptomatic

profile is similar but less severe. However, MCI may originate from a variety of different etiologies and

pathologies, since there are cases where MCI subjects do not progress to AD, and others where there

is a reversion back to health. These cases may suggest that the clinical symptoms of MCI can occur

due to causes other than underlying AD pathology [5,6]. Nonetheless, the risk of dementia due to AD

in MCI subjects is higher when compared to cognitively normal subjects. The annual incidence rate

of healthy subjects that develop AD is 1% to 2%, while the conversion rate from MCI to AD is reported

to be approximately 10% to 15% per year [7].
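As a rough illustration of what these rates imply, a constant annual conversion rate can be compounded into a cumulative risk over a follow-up window. This is a back-of-the-envelope sketch under a strong simplifying assumption (real hazards are not constant year to year), not a model used in this thesis:

```python
def cumulative_conversion_risk(annual_rate: float, years: int) -> float:
    """Probability of converting at least once within `years`,
    assuming a constant, independent annual conversion rate."""
    return 1.0 - (1.0 - annual_rate) ** years

# MCI subjects at ~10%/year vs. healthy subjects at ~2%/year, over 5 years
mci_risk = cumulative_conversion_risk(0.10, 5)      # about 0.41
healthy_risk = cumulative_conversion_risk(0.02, 5)  # about 0.10
```

Even at the low end of the reported rates, this simplification suggests that a substantial fraction of MCI subjects would be expected to convert within a typical multi-year follow-up, which motivates the time windows studied later.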

1.2 Problem Formulation

To our knowledge, existing models of MCI progression are based on only one medical observation,

ignoring important temporal information concerning the progression to AD. The progression of a dis-

ease is usually accompanied by a progression of the symptoms. Hence, instead of looking only at the test results of one single medical observation, in this work we study the best way to combine the several medical appointments of a patient, in order to answer two important clinical questions: Does the


outcome of previous neuropsychological tests and their temporal relation help predict the progression

from MCI to AD? If so, what is the best way to look at this temporal information?

These two questions were tackled in three distinct but related problems:

i Find the best way to evaluate the temporal information in several neuropsychological tests of

the same patient, by defining a temporal mining model;

ii Find the best time frame with new temporal datasets;

iii Find the best way to model the progression of MCI.
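To make the idea of combining appointments concrete, the simplest temporal summary of two observations is a rate of change: the score difference divided by the time elapsed between them. The following is a minimal sketch with a hypothetical function name and inputs, not the thesis's actual implementation:

```python
def progression_rate(score_first: float, score_last: float,
                     years_between: float) -> float:
    """Score change per year between two medical observations.
    On tests where higher scores are better, negative values mean decline."""
    if years_between <= 0:
        raise ValueError("observations must be separated in time")
    return (score_last - score_first) / years_between

# e.g. a cognitive test score dropping from 28 to 24 over two years
rate = progression_rate(28.0, 24.0, 2.0)  # -2.0 points per year
```

Normalizing by elapsed time matters because, in real data, the interval between appointments varies from patient to patient.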

1.3 Contributions

The main goal of this thesis is to explore the usefulness of temporal information in the prognostic prediction of conversion from MCI to dementia.

While tackling this goal, this thesis:

i Develops a framework of preprocessing techniques that lead to better results when analyzing

datasets with temporal data;

ii Introduces the creation of temporal features into the Alzheimer’s Disease prognosis problem;

iii Explores these temporal features to find the one yielding better results;

iv Introduces the use of Hidden Markov Models to the Alzheimer’s Disease prognosis problem.

1.4 Thesis Outline

After this brief presentation in Chapter 1, we begin this dissertation with the

necessary medical background concepts to understand Alzheimer’s Disease, Mild Cognitive Impair-

ment and the neuropsychological tests used in the Portuguese medical context. Next, we take a look

at the engineering side of the question, focusing on Data Mining and Temporal Mining concepts as

well as the classifiers used, highlighting their strengths and limitations. Chapter 2 ends with a related

work section where we present a comparative literature review of different studies dealing with the AD

prognosis problem.

Chapter 3 describes the methods. The dataset is described and so are the preprocessing meth-

ods used: Data Cleaning, Learning Examples, Temporal Features, Feature Selection and Synthetic

Minority Over-Sampling Technique (SMOTE).

Chapter 4 presents and discusses all the experimental results. Finally, Chapter 5 discusses the

conclusions to be drawn from this work and some further work possibilities.


2 Background

Contents
2.1 Alzheimer’s Disease
2.2 Data Mining
2.3 Temporal Data Mining
2.4 Related Work

2.1 Alzheimer’s Disease

Alzheimer’s Disease (AD) is a chronic neurodegenerative disorder and a type of dementia, sometimes described as a premature aging of the brain. AD usually appears in mid-adult life and progresses rapidly to extreme loss of mental capabilities. Dementia is a broad term that defines a permanent or progressive general loss of intellectual abilities, including impairments of memory, judgment and abstract reasoning, and personality alterations [3]. Alzheimer’s Disease affects each person in different ways, but there are symptoms that are consistent across everyone affected:

(i) Temporal and spatial disorientation;

(ii) Amnesic type of memory loss;

(iii) Deterioration of language;

(iv) Behavior changes, including violent and aggressive ones;

(v) Loss of ability to care for oneself [8].

Although these symptoms do not appear overnight, the disease’s early stage is often overlooked.

In the early stage, common symptoms are subtle, like forgetfulness and losing track of the time. Yet,

as the disease starts to progress, the symptoms become more distinguishable and more restricting.

When the disease enters a middle stage, symptoms include being forgetful of recent events and people’s names and getting lost in familiar places, and the presence of some form of memory impairment becomes evident [1,2,4].

Mild Cognitive Impairment (MCI) is a common disorder characterized as cognitive decline greater

than that expected for an individual’s age and education level, but that does not interfere notably with

activities of daily life [9-15]. Although a consensual definition to describe MCI is yet to be achieved in

the research and clinical community, according to the European Consortium on Alzheimer’s Disease

[9], the diagnosis of MCI is based on the following criteria:

• Cognitive complaints coming from patients or their caregivers;

• Reporting of a decline in cognitive functioning relative to previous abilities during the past year

by the patient or caregiver;

• Presence of cognitive impairment (1.5 standard deviations below the reference mean) in at least

one neuropsychological test;

• Absence of major repercussions on daily life (the patient may report difficulties concerning com-

plex day-to-day activities).

MCI was once considered an initial stage between normal aging and dementia, particularly AD.

However, recent studies note that many patients diagnosed with MCI do not progress to AD, creating

the need for a broader definition [10,11,12,13]. Moreover, a growing awareness of the importance of

correctly diagnosing MCI gave rise to new, more extensive criteria for MCI. These criteria now form the


foundation for the National Institute on Aging (NIA)-sponsored Alzheimer’s Disease Centers Program Uniform Data Set (UDS) and the public-private neuroimaging/biomarker consortium, the Alzheimer’s

Disease Neuroimaging Initiative (ADNI) [14,15].

MCI is classified into two subtypes: amnestic MCI (aMCI), where the cognitive impairment is

mainly characterized as memory loss, and non-amnestic MCI (naMCI) [14,15]. A better understanding

of MCI and its subtypes is a step closer to understanding the progression of a healthy individual to

AD. MCI as a clinical diagnosis is used recurrently in many studies [5-7,10-20].

As AD is a neurodegenerative disease, it is believed that even mild cognitive symptoms appear long after the anatomic and physiologic changes have occurred [10]. Currently, there are no treatments for

AD, but changes in the daily routine towards healthier habits may prevent or delay the symptoms [10].

As the search for an effective treatment continues, a correct MCI diagnosis and an understanding of its progression to AD are very important.

One can only be sure of the correct AD diagnosis when made by neuropathological confirmation

in patients who had been studied in life and met the dementia criteria [20]. Consequently, unless

a biopsy or an autopsy is made, there is no way of telling unequivocally that a patient has AD. Patients are diagnosed as ”probable AD” when any other causes of the symptoms can be ruled out,

or ”possible AD” when symptoms can be due to other conditions that contribute to AD. However, the

correlation between neuropathological diagnoses of AD and clinical diagnoses of AD using the criteria established by ADNI has long been demonstrated [21].

AD is characterized by two types of lesions, senile plaques (or neuritic plaques) and neurofibrillary

tangles (NFTs). Senile plaques are microscopic foci of extracellular amyloid β-protein (Aβ) deposits

that occur principally in a filamentous form [3,22-30]. Regarding NFTs, immunocytochemical and

biochemical analyses of neurofibrillary tangles confirm that they are composed of the microtubule-

associated protein τ [25].

According to the amyloid hypothesis [26], accumulation of Aβ in the brain is the primary cause

leading to the underlying mechanisms that drive AD pathogenesis. In addition, the identification of

mutations in the Amyloid Precursor Protein (APP) gene showed that APP mutations could cause Aβ

deposition. Aβ is a normal product of APP metabolism and it can be measured in CSF, making it a

valid and important AD biomarker, as the presence of abnormal values indicates a progression to AD,

even if patients do not show other signs and symptoms of dementia. The Apolipoprotein E (APOE) gene can encode different protein isoforms, including APOE E4, which is a proven risk factor for AD. Pathologically, the APOE E4 allele is strongly associated with increased brain Aβ deposition. Genetic studies showed that APOE E4 is overrepresented in AD patients compared to the normal population, and that genetic inheritance of one or two E4 alleles increases the likelihood of developing AD [22-30].

The structural changes that occur in connection to AD can be identified with neuroimaging techniques such as PET scans using Pittsburgh Compound-B (PIB), a prototypical benzothiazole amyloid-binding agent. PIB is used as a PET tracer, as it binds to amyloid, penetrating the brain so it can

be detected in PET scans, but also being rapidly removed from brain tissue [28,30]. Therefore, a


comparison of images of normal and AD brains shows significant differences, as the latter have atypical concentrations of PIB in gray matter areas, but not in white matter areas, which are known to be relatively unaffected by amyloid deposition. Besides PIB, fludeoxyglucose (18F-FDG), a common PET tracer, can also be used to distinguish AD brains from normal ones.

The number of techniques that can be used to study Mild Cognitive Impairment and Alzheimer’s

Disease is still rising. However, PET scans and MRI are quite expensive and often not available. This

underlines the importance of neuropsychological tests, especially if they can be used to predict whether an MCI state is going to progress to AD [11].

2.1.1 Neuropsychological tests

Neuropsychological tests are a tool used to assess the state of a patient’s cognitive impairment,

by using a set of questions and tasks explicitly designed to test specific cognitive domains [7,11].

Different cognitive domains and functions are affected in dementia, and patients’ symptomatic profiles report the impairments in these domains with different severities [1]. The ability

to acquire and remember new information, changes in personality or behavior, visuospatial abilities,

language functions and reasoning and handling of complex tasks are assessed according to different

cognitive domains, including:

• Executive Functions: including problem solving and organizational skills, assessed through tests like the Toulouse-Pieron Test;

• Memory: including short and long term memory, assessed through tests like the California Verbal Learning Test;

• Attention: including sustained and divided attention, assessed through tests like the Cancellation Test;

• Language: including production and comprehension, assessed through tests like Object Naming and Object Identification;

• Conceptual and Abstract Thinking: assessed through tests like the Raven Progressive Matrices, Proverbs and the Clock Drawing Test;

• Visual and Spatial Perception: assessed through tests like the Cube Copying Test;

• Orientation: including Spatial, Temporal and Personal orientation, assessed through tests with the same names.

These tests and others constitute the Lisbon Test Battery for Dementia Evaluation, validated for

the Portuguese population. The complete list of tests is included in Appendix A. Every test has a

matching Z-score. This score measures the number of standard deviations from the mean test score

for the Portuguese population.


The outcome of a test depends on the patient’s willingness to do it and the judgment of the Medical

Doctor (MD). When it relies on information given by the caregiver, it depends on the caregiver’s

perception of the patient’s ability to have an independent life. It may be difficult to obtain test values for every task for an AD patient, mainly because, as the dementia evolves to severe stages, the patient becomes incapable of performing the tests.

Despite following established criteria that focus on standardization, neuropsychological tests have an inherent subjectivity, due not only to the MD’s perception but also to the patients’ and caregivers’ complaints.

NINCDS-ADRDA [31] criteria and DSM-IV-TR [32] criteria are two examples of unified and standard criteria that aim to make MCI and AD diagnosis more uniform across clinical practice. Nonetheless, neuropsychological tests have a great strength as a measurement of a patient’s cognitive state: they are very easy to apply, have virtually no costs and require no high-tech equipment. These

advantages over techniques like Magnetic Resonance Imaging (MRI), Cerebrospinal Fluid (CSF) analysis and Positron Emission Tomography (PET) scans support the importance of developing techniques and methods that can identify MCI stages that evolve to AD earlier in the disease’s pathological mechanism, hence making AD detection easier, less expensive and widely available [11].

2.2 Data Mining

Data mining is the extraction of nontrivial and relevant information from large information reposito-

ries to find subtle trends and relationships in the data, hidden to the naked eye. To find these trends

and relationships, the data mining process uses machine learning algorithms and statistics to process

the data. Data mining is a crucial part of a broader process known as Knowledge Discovery (KD). KD is composed of several steps: (i) data preprocessing, which cleans, selects and transforms

the dataset, (ii) data mining methods to extract data patterns, (iii) pattern evaluation where the pat-

terns found before are studied and finally (iv) knowledge presentation, where the final conclusions and

patterns are shown [33-36]. One of the main goals of data mining processes is to find data relations in a small subset of data and turn them into general rules to be applied to unseen datasets. Data mining

relies on some universal concepts. A data object is an entity that may be defined as an instance. An

instance is typically described by attributes (or features) and a data tuple is a data object identified

by a key (class) and described by a set of attribute values [33-36]. A complete set of instances is a

dataset. A dataset can be visualized as a table where rows correspond to instances and the columns

to the attributes. An attribute is a data field, representing a characteristic of an instance. Attributes

can be nominal, when comprising categorical values; binary, which is a nominal attribute with only two

states: 0 or 1 (false or true); or numeric, having measurable quantities represented in integer or real

values. The classification task present in every data mining application consists in extracting models

to describe data classes. The classification has two steps: the learning step, where a classification

model is constructed, and the classification step, where the model is applied into another dataset to


predict class labels. When the class label is known, the learning step is defined as supervised learning, contrasting with unsupervised learning (or clustering), where neither the class label nor the number of classes is known [34-36].

Datasets and instances may have several issues that have to be addressed before the classification task. Instances may deviate exceptionally from the standard values and behavior, comprising outliers, and are usually characterized as noise. Class imbalance, if not addressed correctly, may lead to overfitting. Overfitting occurs when the learning model contains rules that are only specific to that

dataset and not applicable to the general model, thus compromising the generalization used to build

the model [36].

2.2.1 Data Preprocessing

The success of the data mining process is highly dependent on the quality of the data, considering

that classification models are built by classifiers based on the data presented. Raw data tend to have low quality, as they are susceptible to missing, noisy and inconsistent values. Additionally, they

tend not to be in the most useful data format. Therefore, preprocessing techniques are necessary

to improve the efficiency and ease of the mining process. There are several data preprocessing

techniques: (i) cleaning, (ii) integration, (iii) selection or reduction, and (iv) transformation [33]. Data

cleaning methods are applied to remove missing and noisy data. Data integration is used when the

mining process must use information from multiple sources and a consistent dataset comprising all

the information is needed. Data Selection reduces the dimensionality of the dataset, this reduction

being in number of instances or attributes. Data transformation can comprise attribute construction,

instance or attribute aggregation, normalization and discretization. The use of all these methods is

problem dependent and an extensive analysis must be made [33-36].

Missing data

Missing values are a common problem in real data. The cause of the missing value is not always

clear. There can be missing values in random locations in the dataset (for example, if there are

external problems with the acquisition of information) or they can be systematic (incapability of a

subject to take a certain test), and these two cases should be handled differently. If values are missing

due to systematic problems or error, one way to handle them is to simply ignore them by deleting the

corresponding instance. This strategy compromises the number of instances in the dataset. Another

strategy is Missing Values Imputation, which consists of replacing the missing value with a specific

value, such as a global constant, the overall attribute mean (or median), the attribute mean (or median)

of the class, or even using the most probable value, relying on regressions (Bayesian or expectation-

maximization). Compared to other methods, Missing Values Imputation using the most probable value

makes the most use of the information in the dataset, while providing reliable values. On the other

hand, if the fact that the values are missing has an important meaning in the dataset, it should be considered and not ignored or replaced. In the case of nominal attributes, a new categorical value should be given to them; with numeric values, they should be granted a value according to the scale they comprise [33,37,38].
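A minimal sketch of the class-conditional mean imputation strategy mentioned above (the attribute name and class labels are hypothetical, chosen only for illustration):

```python
from statistics import mean

def impute_class_mean(instances, attr, class_attr):
    """Replace missing values (None) of `attr` with the mean of that
    attribute among instances sharing the same class label."""
    # Collect the observed values per class.
    observed = {}
    for inst in instances:
        if inst[attr] is not None:
            observed.setdefault(inst[class_attr], []).append(inst[attr])
    class_means = {c: mean(vals) for c, vals in observed.items()}
    # Fill the gaps with the mean of the instance's own class.
    for inst in instances:
        if inst[attr] is None:
            inst[attr] = class_means[inst[class_attr]]
    return instances

data = [
    {"mmse": 28, "label": "stable"},
    {"mmse": 26, "label": "stable"},
    {"mmse": None, "label": "stable"},   # imputed with (28 + 26) / 2
    {"mmse": 20, "label": "converter"},
]
impute_class_mean(data, "mmse", "label")
print(data[2]["mmse"])  # 27.0
```

This uses only the information within each class, as suggested above, rather than a global constant.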

Data Transformation

There is a wide range of data transformation techniques including (i) Attribute construction, (ii)

Instance Aggregation, (iii) Data Discretization and (iv) Data Normalization [33]. All these techniques

consolidate the information into convenient formats for the mining process. Attribute construction

techniques create new attributes that help the mining process. Data transformation by instance aggregation relies on methods that compress all the information of different but related instances into

one, or even construct multidimensional datasets [33-37]. Data discretization consists in transforming

a numerical attribute into a nominal one. This may imply loss of information. However, the goal of data

discretization is to achieve data simplification while minimizing information loss. Data normalization

scales the attribute range from 0.0 to 1.0. This is an important data transformation step, especially if

data acquisition was made by different people or using different instruments [33,35,36].
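The 0.0-to-1.0 scaling mentioned above corresponds to min-max normalization; a minimal sketch:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a numeric attribute linearly to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

# Scores collected on different scales become directly comparable.
print(min_max_normalize([10, 15, 20]))  # [0.0, 0.5, 1.0]
```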

Feature Selection

Feature Selection removes redundant attributes from the dataset, consequently reducing the

dataset size. With a reduced number of attributes, classifiers build models with only relevant informa-

tion and the discovered patterns are easier to interpret. Ideally, the resulting probability distribution

of the data classes, after Attribute Selection was applied, is as close as possible to the original distribution. Moreover, the resulting attributes are highly predictive, making the mining task easier and more successful. Feature Selection methods can be divided into Wrapper and Filter methods. Wrapper

methods perform classification tasks on different subsets of attributes, and select the subsets that maximize the classification performance. Filter methods decide between different attributes by using tests of statistical significance, which assume that the attributes are independent of one another. This decision is

used while searching for the optimal subset of attributes. Typically, heuristic methods are employed, using greedy searches through the attribute space that pick the best attribute at a time, in the hope that a locally optimal choice leads to a globally optimal solution [37,39,40].
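As an illustration of the filter idea, the toy scorer below ranks attributes by a simple univariate statistic (the absolute difference of class means scaled by the attribute's overall spread); this particular score is an assumption chosen for illustration, not a method used in this thesis:

```python
from statistics import mean, stdev

def filter_rank(dataset, attributes, labels):
    """Filter-style ranking: score each attribute independently of the
    others, then sort from most to least discriminative."""
    scores = {}
    for a in attributes:
        pos = [row[a] for row, y in zip(dataset, labels) if y == 1]
        neg = [row[a] for row, y in zip(dataset, labels) if y == 0]
        spread = stdev([row[a] for row in dataset])
        scores[a] = abs(mean(pos) - mean(neg)) / spread if spread else 0.0
    return sorted(attributes, key=lambda a: scores[a], reverse=True)

rows = [{"x": 1.0, "noise": 5.0}, {"x": 1.2, "noise": 4.0},
        {"x": 9.0, "noise": 5.5}, {"x": 9.3, "noise": 4.2}]
labels = [0, 0, 1, 1]
print(filter_rank(rows, ["x", "noise"], labels))  # ['x', 'noise']
```

A wrapper method would instead train a classifier on each candidate subset, which is more expensive but accounts for attribute interactions.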

Imbalanced Data

A dataset is considered imbalanced if the classification categories are not approximately equally

represented, which is the case of most real-data datasets. Biomedical databases are especially

imbalanced, as the disease class is rare compared with the number of healthy cases. Machine learning algorithms perform poorly on imbalanced datasets, as assigning all instances to the majority class is a cheap way to minimize error. There are two main ways of dealing with an imbalanced dataset: (i) defining

different costs to training examples from different classes by making it more expensive to classify

an instance as a given class; (ii) resampling the original dataset by oversampling the minority class,

undersampling the majority class, or both. Undersampling the majority class might remove instances with important information for the classification task, leading to information loss. This is not a good approach when dealing with small datasets. However, there are methods that carry out informed resampling, such as EasyEnsemble, BalanceCascade and NearMiss [40].

One strategy to oversample the minority class is called Synthetic Minority Over-sampling TEch-

nique (SMOTE) [41]. SMOTE creates synthetic instances of the minority class, using k-nearest neigh-

bors algorithms to create attribute values for the instances, where k is given as a parameter, according

to the percentage of oversampling needed [40-42].
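The SMOTE procedure described above can be sketched as follows; this is a simplified version of the original algorithm, in which each synthetic instance interpolates attribute-wise between a randomly chosen minority instance and one of its k nearest minority neighbours:

```python
import random

def smote(minority, k=2, n_synthetic=2, seed=0):
    """Minimal SMOTE sketch over tuples of numeric attribute values."""
    rng = random.Random(seed)

    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself).
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist(x, p))[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nn)))
    return synthetic

minority_class = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.3)]
new_points = smote(minority_class, k=2, n_synthetic=2)
print(len(new_points))  # 2
```

Because each synthetic point lies on the segment between two existing minority instances, SMOTE populates the minority region instead of merely duplicating instances.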

2.2.2 Data Classification

The classification task, present in every data mining process, develops a set of rules that define

models. The classification task has two steps (learning and classification) and they are applied in two

different sets: the training and the test set, respectively. A training set is a collection of data tuples

that can be represented by (Xi, Y ) where Xi is the instance and Y the data label or class. In the

learning step, the training set is processed and the classifier outputs a model that can be composed

by a set of rules and/or parameters. The test set is used to evaluate the behavior of this model when

presented with unseen data. Here, the real class of the data tuple is compared with the predicted

class. If the behavior is considered acceptable, the rules and model created can be extrapolated to

new data [33-36].

Naïve Bayes Classifier

Bayesian classifiers, like Naïve Bayes, are statistical classifiers that predict the probability of a given tuple belonging to a class. The naïveté of the classifier is due to two main assumptions: (i) the predictive attributes are conditionally independent and (ii) no hidden or latent attributes influence the prediction process. Conditional independence means that we assume the value of one attribute has no influence on the values of other attributes. The output probability is calculated based on the

Bayes Theorem [33]. Let X be a data tuple. X is represented by an attribute vector of size n,

X = (x1, x2, ..., xn) where x1 to xn are attribute values for the attribute set A1, A2, ..., An. Let D be a

training set of tuples and their associated class labels. The training set tuples comprise m different

classes: C1, C2, ...Cm where m is the total number of class categories. Let Ci be defined as the

hypothesis that the data tuple X belongs to a specified class Ci.

P (X) is the prior probability of X, or a priori probability of X, and is constant for all classes. The

class prior probability P (Ci) is the prior probability of Ci with 1 ≤ i ≤ m. P (Ci) can be estimated

by dividing the number of instances belonging to Class i by the total number of instances. On the

other hand, P (Ci|X) is the posterior probability, or a posteriori probability, of Ci conditioned on X.

Likewise, P (X|Ci) is the posterior probability of X conditioned on Ci. Posterior probabilities are

based on more information than prior probabilities. Bayes’ theorem provides a way of calculating the

posterior probability, P(Ci|X), from P(Ci), P(X|Ci), and P(X), according to equation 2.1:

P(Ci|X) = P(X|Ci)P(Ci) / P(X). (2.1)


The classifier will predict that X belongs to the class Ci if it has the highest posterior probability

P (Ci|X). Thus the main goal is to maximize P (Ci|X). The class Ci for which P (Ci|X) is maximized is

called the maximum posteriori hypothesis [33]. As P (X) is constant for all classes there is only need

to maximize P (X|Ci)P (Ci) and if the dataset is balanced there is only need to maximize P (X|Ci).

Computing the conditional probability of every attribute depending on another would be extremely

computationally expensive. However, the Naıve Bayes assumption of class-conditional independence

allows us to consider that, within each class, the attributes’ values are independent of one another. Thus, the class-conditional probability of the tuple can be calculated as a product of per-attribute probabilities:

P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xk|Ci) × ... × P(xn|Ci). (2.2)

To calculate P(xk|Ci) there are different methods to adopt, depending on whether the attribute is nominal or numeric. Nominal attributes are easy to deal with for Naïve Bayes, as P(xk|Ci) is the number of

tuples belonging to class Ci in D having the value xk for Ak, divided by the total number of tuples

belonging to class Ci in D. Numeric attributes, on the other hand, require further calculations. It is

assumed that the numeric attribute has a Gaussian distribution and the values for µCi (mean) and

σCi (standard deviation) are calculated with all values of attribute Ak in tuples belonging to Class Ci.

With these values, the probability is calculated from the Gaussian distribution [33,34]:

P(xk|Ci) = g(xk, µCi, σCi) = (1 / (√(2π) σCi)) exp(−(xk − µCi)² / (2σCi²)). (2.3)

The Gaussian function yields the probability density of a normally distributed random variable at xk.
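Equations 2.1 to 2.3 can be combined into a minimal Gaussian Naïve Bayes sketch; the single attribute, its values and the class labels below are hypothetical, chosen only for illustration:

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def train_gaussian_nb(X, y):
    """Estimate class priors P(Ci) and per-class Gaussian parameters
    (mean, sd) for each numeric attribute."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for c, rows in by_class.items():
        prior = len(rows) / len(X)
        params = [(mean(col), stdev(col)) for col in zip(*rows)]
        model[c] = (prior, params)
    return model

def predict(model, x):
    """Pick the class maximizing P(Ci) * prod_k P(xk|Ci) (eqs. 2.1-2.3;
    P(X) is constant across classes and can be dropped)."""
    def gaussian(v, mu, sd):
        return (math.exp(-(v - mu) ** 2 / (2 * sd ** 2))
                / (math.sqrt(2 * math.pi) * sd))
    best, best_p = None, -1.0
    for c, (prior, params) in model.items():
        p = prior
        for v, (mu, sd) in zip(x, params):
            p *= gaussian(v, mu, sd)
        if p > best_p:
            best, best_p = c, p
    return best

X = [[25.0], [27.0], [26.0], [18.0], [16.0], [17.0]]  # e.g. a test score
y = ["stable", "stable", "stable", "converter", "converter", "converter"]
model = train_gaussian_nb(X, y)
print(predict(model, [26.5]))  # stable
```

Dropping P(X), as noted above, leaves the ranking of classes unchanged because it is the same for every Ci.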

2.3 Temporal Data Mining

Data mining can retrieve useful knowledge from big datasets. However, a temporal database contains attributes related to temporal information and dependencies, and must be handled differently from a database with no temporal constraints [43]. Temporal information can be divided into two types: time

series and temporal sequences [43]. Time series are composed of measurements of real-valued variables at regular intervals of time, while temporal sequences are composed of ordered nominal values. Furthermore, time series and temporal sequences can be univariate or multivariate, depending on whether they contain one or more variables, respectively.

Back in Section 2.2 we considered a database as a table where rows were instances and the

columns were attributes. As we add the time dimension, we now have a 3-dimensional matrix where

a multivariate time series can be represented as a series of measurements [x1(t), x2(t), ..., xn(t)]

where 1 ≤ t ≤ T, forming a T × n matrix. A set of these matrices is a Temporal Database T × n × s, s being the number of multivariate time series.

The main goal of temporal data mining is to discover hidden relations between sequences and

subsequences of events [43].

Time series can be studied through methods like spectral analysis, autocorrelation and Autoregressive Integrated Moving Averages (ARIMA) [45]. Nevertheless, the kind of knowledge learned


from these methods and from temporal mining methods is quite different. The ARIMA model, for ex-

ample, aims to forecast future time points based on previous ones present in the data. On the other

hand, the scope of temporal data mining extends beyond forecasting since the variable correlations

may not be known [44].

While in data mining it is possible to see the relationship between two attributes by simply looking

at their values, in temporal data mining it is not that simple, since we are looking at events. Events are temporal entities with an occurrence time and can be composed of more than one attribute value.

The match between sequences is not always trivial. Attributes may not be a perfect match and still

belong to the same event. Therefore, it is necessary to define a similarity measure [43-45].

Let us consider the example of an Electrocardiography (ECG). A typical ECG wave is composed of 5 events, denominated the P, Q, R, S, and T waves. Each event has specific characteristics and a specific pattern, and corresponds to a polarization or depolarization of the atria and ventricles. Different hearts may have different ECG values, but if they are all healthy, they should present the same P, Q, R, S, and T waves. However, they may not be a perfect match considering all values, and the similarity measure has to overlook value mismatches and match the corresponding P waves, and so on.

Furthermore, events may have intrinsic or extrinsic noise, outlier or missing values, offsets or

amplitude differences, and thus similarity measures must be prepared to rise above these inconve-

niences [43]. Therefore, the discovery of the relations between events needs three main steps: (i)

modeling of the data sequences in a suitable form, (ii) the definition of similarity measures between se-

quences, and (iii) the application of models and representations to the actual mining problems [43-48].

2.3.1 Similarity Measures

Similarity measures can be used to compare whole sequences or to locate a subsequence within

a sequence. The Euclidean Distance is the most commonly known distance. The Euclidean Distance

between two series X = x1, x2, ..., xn and Y = y1, y2, ..., yn is defined as

dE = √((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²). (2.4)

However, this metric has limitations. Sequences must have the same baseline, scale and length, and

should not have any gaps.

One way to overcome these limitations is to use other metrics, such as Dynamic Time Warping.

Dynamic Time Warping (DTW) is a systematic and efficient method, based on dynamic program-

ming, that identifies which correspondence among feature vectors of two sequences is best, when

scoring the similarity between them [44]. DTW determines a number of warping paths w that map or

align the elements of the two time series in such a way that the distance between them is minimized

[49]. The DTW minimizes the Euclidean distance over a number of potential warping paths based on

the cumulative distance for each path, as defined by

DTW(X, Y) = min[ ∑_{k=1}^{p} δ(wk) ], (2.5)


where δ is the distance metric previously defined. DTW can be computationally heavy for long se-

quences and prone to overfitting [43].
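A minimal dynamic-programming sketch of DTW, using an absolute-difference local cost (the choice of local cost is an assumption for illustration; equation 2.5 leaves δ generic):

```python
def dtw(x, y):
    """Classic DTW: D[i][j] is the minimal cumulative distance aligning
    x[:i] with y[:j]; each cell extends the cheapest of the three
    admissible warping moves."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch y
                                 D[i][j - 1],      # stretch x
                                 D[i - 1][j - 1])  # match both
    return D[n][m]

# The same shape at two speeds: the Euclidean distance is undefined
# for these unequal lengths, yet DTW aligns them perfectly.
print(dtw([1, 2, 3], [1, 1, 2, 2, 3]))  # 0.0
```

The quadratic table explains why DTW becomes computationally heavy for long sequences, as noted above.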

Hence, the Longest Common Subsequence (LCSS) is also used as a similarity metric, where the output is a subsequence with all the values the two time series have in common. LCSS assumes the same baseline and scale for the two time series, but can easily deal with gaps [43].
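A minimal sketch of the LCSS computation for two nominal sequences; normalizing by the shorter sequence length, so the score lies in [0, 1], is a common convention assumed here for illustration:

```python
def lcss_length(x, y):
    """Longest common subsequence length via dynamic programming;
    gaps in either sequence are skipped at no cost."""
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

def lcss_similarity(x, y):
    """Two sequences are as similar as their common subsequence is long."""
    return lcss_length(x, y) / min(len(x), len(y))

print(lcss_similarity("ABCBDAB", "BDCABA"))  # 4/6 ≈ 0.667
```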

2.3.2 Temporal Features

Extracting temporal features from a temporal dataset is a way of using the temporal information in standard (non-temporal) classifiers, instead of dealing with the temporal constraints. These temporal features must be obtained in such a way that the loss of information is minimal and that they represent and adapt to the problem at hand. When finding features that represent a time series, there are two main ideas that should be respected: if two time series are considered similar in the original space, they should be considered similar in the transformation space; and the transformation should reduce the dimensionality of the search problem [43].

Transformation Based Representations take the original time series and transform it into a different domain. In this case, the Discrete Fourier Transform (DFT) [43] can be used to transform a time series into the frequency domain, and the Discrete Wavelet Transform (DWT) [43] transforms the temporal information into wavelet coefficients. However, these transforms can only deal with information from one attribute, failing to account for the influence of other attributes in a multivariate time series. A time series can also be represented with a limited vocabulary posing as nominal attributes that portray the temporal evolution. An example is the Shape Definition Language, which comprises the nominal values Up, up, stable, zero, down, Down, defining the type of gradient that a certain feature presents [43-45]. Here, two temporal sequences are as similar as the size of their longest common sequence.

Time Series Summarization methods condense the temporal sequence into a single value, significantly reducing the dataset dimensionality. Summarization values can be obtained by calculating the mean, median, mode, variance, slope and other statistics that compress global characteristics of a sequence into a single value [43].
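A minimal sketch of time series summarization, collapsing a univariate series (for instance, a test score over four hypothetical visits) into the global features named above:

```python
from statistics import mean, median, pvariance

def summarize(series):
    """Collapse a univariate time series into a few global features
    usable by a standard (non-temporal) classifier."""
    n = len(series)
    # Least-squares slope against implicit time points 0..n-1.
    t_mean = (n - 1) / 2
    s_mean = mean(series)
    slope = (sum((t - t_mean) * (s - s_mean)
                 for t, s in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return {"mean": s_mean, "median": median(series),
            "variance": pvariance(series), "slope": slope}

# A declining score across four visits yields a negative slope.
print(summarize([30, 28, 27, 25])["slope"])  # -1.6
```

The slope in particular captures the direction of cognitive evolution, which is exactly the kind of temporal information a static snapshot loses.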

2.3.3 Hidden Markov Models

A Hidden Markov Model (HMM) is a stochastic process that comprises one Markov Chain of hid-

den states and another set of stochastic processes that produce the sequence of observed states

[52]. These statistical models have been used successfully in many applications in artificial intelligence: pattern recognition, handwriting and speech recognition, pattern recognition in molecular biology and fault detection [51]. Before continuing with HMMs, let us address the Markov Model definition.

A Markov Model is a stochastic model that obeys the Markov property, where its future states depend

only on the current state and not the states before, that is, the past states are not relevant. A Markov

model of order k is a probability distribution over a sequence with the following conditional indepen-

dence property:

P(qt | q_1^{t−1}) = P(qt | q_{t−k}^{t−1}). (2.6)


This means that q_{t−k}^{t−1} contains all relevant past information. Here, qt is defined as a state variable. Due to the property described in equation 2.6, we have:

P(q_1^T) = P(q_1^k) ∏_{t=k+1}^{T} P(qt | q_{t−k}^{t−1}), (2.7)

where T is the number of time points. P(q_1^k) is defined as the initial state probabilities and P(qt | q_{t−k}^{t−1}) as the transition probabilities. In an n-order Markov sequence model, the probability distribution of the next state depends on the previous n states generated.

Considering now a Hidden Markov Model, in this case it is not assumed that the observed states obey the Markov property. Instead, hidden states are defined as being the model’s true states, comprising its true rules. It is the hidden states that have the Markov property, typically of a low order. The observed sequence y and the hidden state sequence q satisfy:

P(q_{t+1} | q_1^t, y_1^t) = P(q_{t+1} | qt) (2.8)

and

P(yt | q_1^t, y_1^{t−1}) = P(yt | qt) (2.9)

thus, the joint distribution is

P(y_1^T, q_1^T) = P(q1) ∏_{t=1}^{T−1} P(q_{t+1} | qt) ∏_{t=1}^{T} P(yt | qt), (2.10)

where $P(q_1)$ defines the initial probability of every state, meaning the probability of the hidden model beginning in a particular state at time $t = 1$; $P(y_t \mid q_t)$ defines the emission probability, meaning the probability of a particular observation given a particular hidden state; and $P(q_{t+1} \mid q_t)$ defines the transition probability, meaning the probability of a hidden state given the previous hidden state [50-55].
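As an illustration of the joint distribution in Eq. (2.10), the probability of a given hidden state path together with an observation sequence can be computed directly by multiplying the initial, transition and emission terms. The sketch below uses a toy first-order model with two hidden states and two observable symbols; all probability values are illustrative, not taken from this thesis.

```python
import numpy as np

# Toy first-order HMM with 2 hidden states and 2 observable symbols.
# All probability values below are illustrative placeholders.
pi = np.array([0.6, 0.4])                 # P(q1): initial state probabilities
A = np.array([[0.7, 0.3],                 # P(q_{t+1} | q_t): transition matrix
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],                 # P(y_t | q_t): emission matrix
              [0.3, 0.7]])

def joint_probability(states, observations):
    """Eq. (2.10): P(q1) times the product of transition and emission terms."""
    p = pi[states[0]]
    for t in range(len(states) - 1):
        p *= A[states[t], states[t + 1]]
    for t, y in enumerate(observations):
        p *= B[states[t], y]
    return p

print(joint_probability([0, 0, 1], [0, 0, 1]))
```

In practice, inference over all possible hidden paths (rather than one fixed path, as here) is carried out with the forward-backward algorithm, as implemented in the WEKA HMM package used in this thesis.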

2.4 Related Work

Lee et al. [10] developed a prognostic index to classify MCI patients according to the predicted

progression to AD. This prognostic index relies only on information that can be obtained in most clini-

cal settings, such as demographic information, neuropsychologic tests, self-reported medical history

and symptoms, vital signs and questionnaires to the caregivers, and discards MRI and PET data,

genetic factors, and blood-based or CSF biomarkers, as they are not easily and readily obtained in most

clinical settings. In their study, they used data collected from 382 ADNI subjects diagnosed with MCI at the time, who presented memory impairment (with or without impairment in other domains) but with general cognition and functional performance sufficiently preserved. Subjects within the ADNI-1 cohort were reassessed regularly, at 6, 12, 18, 24 and 36 months, and other follow-ups were performed annually

as part of ADNI-2. The methods were mainly statistical analysis of all potential predictors. Univariate

distributions were used to assess evidence of outlier values, while bivariate associations between

potential predictors and the outcome (conversion to AD) were then examined employing t-tests and analysis of variance for continuous variables and Chi-square tests for categorical variables. Cox proportional


hazards regression analyses were carried out to identify factors associated with time to AD, first by

performing domain-specific Cox analyses to identify variables from each domain that were associated

with conversion to AD with a p-value of less than 0.20, a threshold less stringent than the conventional p < 0.05, chosen to ensure consideration of a wide range of potential predictors. Then, the variables identified within each domain were considered together in a single Cox regression analysis and

each variable was assigned a point value by dividing its model coefficient by the absolute value of the smallest coefficient in the model and rounding to the nearest integer. The main outcome of

the 3-year risk index was progression to probable AD, where 46.9% of the subjects converted to AD

in 2.9 years. Subjects with less than 3 years of follow-up data were censored as needed by the Cox

proportional hazards regression model. The main features associated with the risk of converting to AD were: being female (1 point); caregiver report that a participant was stubborn/resistant to

help (2 points), became upset when separated from the caregiver (1 point), difficulty shopping alone

for household items (2 points) or remembering important appointments and events (2 points); and

poor performance on individual neuropsychological test items including 10-word recall (0 to 4 points),

orientation (0 to 2 points) or Clock Test (2 points). Afterwards, the points from this final model were

summed and a total point score was created that could range from 0 to 16. This prognostic index

successfully classified subjects into low, moderate and high risk groups.
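A point-based index of this kind reduces to a simple lookup-and-sum. The sketch below uses the point values listed above; the feature names and the risk-group cut-offs are hypothetical placeholders, since the thesis does not reproduce the cut-offs used by Lee et al.

```python
# Sketch of a point-based prognostic index in the style of Lee et al. [10].
# Point values follow the features listed above; feature names and the
# risk-group cut-offs (LOW_MAX, MODERATE_MAX) are hypothetical placeholders.
POINTS = {
    "female": 1,
    "stubborn_resistant_to_help": 2,
    "upset_when_separated": 1,
    "difficulty_shopping_alone": 2,
    "difficulty_remembering_appointments": 2,
    "clock_test_impaired": 2,
}

LOW_MAX, MODERATE_MAX = 4, 8  # hypothetical cut-offs on the 0-16 scale

def risk_group(present_features, word_recall_points=0, orientation_points=0):
    """Sum the points of the features present plus the graded test items."""
    score = sum(POINTS[f] for f in present_features)
    score += word_recall_points   # 0 to 4 points (10-word recall)
    score += orientation_points   # 0 to 2 points (orientation)
    if score <= LOW_MAX:
        return score, "low"
    if score <= MODERATE_MAX:
        return score, "moderate"
    return score, "high"

print(risk_group(["female", "clock_test_impaired"], word_recall_points=3))
```

The appeal of such an index is that it can be applied by hand in a clinical setting, with no model fitting at prediction time.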

The same group conducted a different study [56] that aimed to develop a point-based tool to predict

conversion from amnestic MCI to probable AD. However, even though subjects were from the same cohort (ADNI), this time MRI data were also used, including hippocampal subcortical volume, entorhinal

cortical volume, entorhinal cortical thickness, middle temporal cortical volume, middle temporal corti-

cal thickness, inferior temporal cortical thickness, and inferior parietal cortical thickness, and genetic

and blood-based biomarkers, including APOE ε4 genotype as well as plasma levels of Aβ40 and Aβ42.

The study was conducted on 382 ADNI participants who were diagnosed with MCI at baseline. Cox

proportional hazards model was applied with some slight additions to the protocol. The authors first

performed a series of Cox proportional hazards analyses, in which all variables within a given domain

(MRI, Demographic, Neuropsychological measures, etc.) were considered altogether. This model

determined which domain variables were most strongly associated with the outcome, conversion to AD.

A single multivariate model was developed using all the most predictive variables for each of

the 6 domains. The main predictive features were: greater functional dependence based on the

Functional Assessment Questionnaire (2-3 points); MRI middle temporal cortical thinning (1 point);

MRI hippocampal subcortical volume (1 point); worse neuropsychological test performance on the

Alzheimer’s Disease Assessment Scale-cognitive subscale (2-3 points); and impaired Clock Test (1

point). Comparing these two studies, the results are consistent with current theoretical models stating that functional decline appears later, when neuropathological evidence of AD is probably already present in the subjects. Hence, memory impairment and the functional and cognitive complaints characteristic of the MCI state appear closer to the time of progression to AD [10]. Therefore, while MRI, PET and

CSF data may comprise key information to predict the progression to AD in normal asymptomatic subjects, functional and neuropsychological data are more important when subjects already present the MCI state and it is important to identify the ones converting to dementia due to AD. In this latter case, the prediction can be based only on data easy to obtain in most clinical settings and is only slightly less accurate than the former index, which includes MRI data.

Carreiro et al. [57] conducted a study on Amyotrophic Lateral Sclerosis using data mining tech-

niques and temporal feature extraction to predict the need of non-invasive ventilation k days after

the last observation. While the population was quite different from the one addressed in this thesis,

Carreiro et al. used real-world data, a time windows methodology and a temporal mining approach, using a Bayesian classifier. After preprocessing the original data, the authors created snapshots, since a patient is not able to perform all prescribed exams in a single day. A snapshot represents the patient's condition at the time, comprising information from several exams. Then, learning examples were

created, labeling the patients in the Evolution or noEvolution class (per time window). When creating

the learning examples, the authors decided a priori how many time points were to be included in the

models, and how the temporal information would be used. The patients were also divided into two subgroups: slow and fast progressors, considering their ALS Functional Rating Scale temporal evolution.

Focusing on the extraction of temporal features, Carreiro et al. created new variables that described

the temporal evolution of the features evaluated, which could then be used in a new dataset, alone

or in conjunction with the original variables. These new temporal variables were extracted by creating

a pattern that describes, as a nominal value, whether a variable showed an increase, decrease or stabilization k days later. The classification task was performed by means of a Naïve Bayes classifier with a 5 ×

10-fold Cross Validation on the training set (75% of the original data) and then the best prognostic

model was applied on unseen data (remaining 25% of the original data). Results obtained using the

training set showed that using more temporal information presented better results than only using the first or last time point, in the 90 days window, for the Naïve Bayes classifier. Considering the 180 and

365 days windows, AUC values were significantly better when using 3 and 2 time points, respectively, compared to using information from only one time point. Adding the temporal dynamic pattern increased the AUC for the 365 days window using 3 time points, but decreased it for the 90 and 180 days windows. Carreiro et al. note that a model combining the new temporal dynamic pattern

and original features outperformed the original set of features with only 4 features. Test set results

showed AUC values up to 84.64%, 75.86% and 77.06% for, respectively, the windows of 90, 180 and

365 days, using Naïve Bayes.

In a recent study, Cabral et al. [58] investigated how the disease stage impacts the ability of machine learning methodologies to predict conversion, using ADNI data, mainly FDG-PET and MRI images. In this study, patients were diagnosed with the MCI state

when they presented a mini-mental state exam (MMSE) score between 24 and 30 (inclusive), a Clinical Dementia Rating (CDR) of 0.5, memory complaints, abnormal memory function, and general cognition and functional performance sufficiently preserved such that a diagnosis of AD

cannot be made at the time of the visit. The ADNI database comprises information from observations at baseline and after 6, 12, 18, 24 and 36 months, ensuring the same temporal frequency, which is very advantageous for temporal studies. PET and MRI images need image preprocessing techniques in order to

be suitable for the classification task. The authors considered a time point named time of conversion

(TC) and constructed the learning examples backwards from this point. The TC point was defined

considering neuropsychological test scores for the MMSE and the CDR: MCI participants who have undergone a CDR change from 0.5 to 1 and maintained it are considered to have their TC at the first

visit in which the CDR scored 1. MCI participants without this CDR change whose MMSE score was, across all visits, 26 or higher, are considered nonconverters (MCI-NC). Participants that

do not fit in these two categories were excluded from the study. The authors advocate that adding the MMSE cut-off to the MCI-NC group reduces the possibility of assigning individuals that will later convert to a dementia state to the MCI-NC group, hence making the groups more homogeneous. Aiming to assess the predictive capability of FDG-PET images acquired at 24, 18, 12 and 6

months before conversion and at the TC, Cabral et al. compared separate datasets, each made of the FDG-PET images acquired at 24, 18, 12 and 6 months before the TC, versus the MCI-NC group. This learning example separation proved to be effective and the authors were able to track AD progression from the TC back to 24 months before AD onset.

Hong-mei Yu et al. [59] developed a multi-state Markov Cox regression model with data from 10 medical examinations of 600 MCI subjects. Participants with global impairment were impaired regarding global

cognitive ability, activities of daily living, or both. Each enrolled subject was scheduled to be evaluated every six months over five years. Examinations comprised blood work to establish APOE

status, cognitive assessments, neurological and physical examinations. Results from the multivariate

analysis show that being female, increasing age, low educational level, history of hypertension or

diabetes, presence of an ApoE4 allele, and reading little were associated with increased risks of

converting to a state of severe cognitive deterioration. Considering the transient state of MCI, these

patients were more likely (67%) to remain in this state at the next cognitive assessment than to transition to another state, with transition probabilities to global impairment or AD being approximately

17% each. The same transition probabilities for the global impairment state showed that the probability of remaining in that state was only 18% of the probability of deterioration in cognitive status.
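The MCI transition probabilities reported above can be arranged as one row of a multi-state transition matrix, from which the distribution over states after several assessments follows by matrix powers. In the sketch below, only the MCI row uses the reported values (67% / ~17% / ~17%, lightly renormalized to sum to 1); the other rows are hypothetical placeholders, since they are not fully reported here.

```python
import numpy as np

# Sketch of a three-state transition matrix for a multi-state Markov model.
# States: 0 = MCI, 1 = global impairment, 2 = AD (treated as absorbing).
# Only the MCI row follows the probabilities reported above; the global
# impairment row is a hypothetical placeholder.
P = np.array([
    [0.66, 0.17, 0.17],   # from MCI (reported, renormalized)
    [0.18, 0.42, 0.40],   # from global impairment (hypothetical)
    [0.00, 0.00, 1.00],   # AD as absorbing state (assumption)
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

def distribution_after(n, start=0):
    """State distribution after n six-month assessments, starting in `start`."""
    d = np.zeros(3)
    d[start] = 1.0
    return d @ np.linalg.matrix_power(P, n)

print(distribution_after(2))  # distribution over states after two assessments
```

This kind of matrix is exactly what a multi-state Markov model estimates from the longitudinal examination data.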

3 Methods

Contents
3.1 Database Description
3.2 Data Preprocessing: Independent Datasets
3.3 Data Preprocessing: Relational Datasets
3.4 Classification Methodology

3.1 Database Description

The dataset used throughout this thesis consists of information from the Cognitive Complaints Co-

hort (CCC), a database containing information from elderly, non-demented patients with cognitive complaints, from 1999 to 2015. These patients were referred for neuropsychological evaluation at 3 institutions: the Laboratory of Language Studies at Santa Maria Hospital and Memoclínica (a Memory

Clinic), both in Lisbon, and the Neurology Department, University Hospital, in Coimbra.

Patients fulfill the inclusion criteria when they have:

• Presence of cognitive complaints;

• Presence of at least one follow-up neuropsychological assessment or clinical reevaluation.

According to the criteria of the European Consortium on Alzheimer's Disease (2006) [9], patients with cognitive complaints and neuropsychological testing at the participating institutions were excluded from the CCC in any of the following cases:

• Presence of neurological or psychiatric disorders that may induce cognitive deficits; patients

with major depression according to DSM-IV [32] or serious depressive symptoms (indicated

by a score in Geriatric Depression Scale (GDS) short version of more than 10 points) were

excluded;

• Presence of systemic illness with cerebral impact;

• History of alcohol abuse or recurrent substance abuse or dependence;

• Presence of dementia according to DSM-IV [32], or Mini-Mental State Examination (MMSE)

score below the cut-off for Portuguese population, or significant impairment on daily life activities

according to the Blessed Dementia Rating Scale.

Database structure

From the original database we consider the main unit of information to be an observation corre-

sponding to an instance with a time-stamp. An observation is a medical appointment where the state

of a patient is assessed and categorized into a class. In the original dataset, the observation can be

classified as Normal, Pre Mild Cognitive Impairment (pre-MCI), Mild Cognitive Impairment (MCI) and

Alzheimer’s Disease (AD).

There are 102 attributes, 59 being various tasks of the neuropsychological tests and scales applied and 29 the corresponding Z-score values. The demographic information features are Age, Gender

and the Number of Schooling Years of the patient. 11 auxiliary attributes are present in the database

and all the attributes are numeric.

An observation may be considered clinical or non-clinical. Clinical observations do not have val-

ues for any neuropsychological test, while non-clinical observations do, even if not all attributes have values. These attributes (also known as features) correspond to a task of a neuropsychological test

Class     Number of Patients   Age (Mean±SD)    Gender (F/M)   Years of Schooling (Mean±SD)
Normal    68 (11%)             64.22 (±9.97)    19/49          10.97 (±4.56)
Pre-MCI   68 (11%)             66.03 (±8.94)    25/42          10.02 (±4.29)
MCI       480 (78%)            69.74 (±8.46)    193/286        8.73 (±4.89)

Table 3.1: Demographic information of the original database for every patient at their first observation.

or scale. For example, if we consider the Cancellation test, where patients are asked to strike all the letters "A" they can find in a word-search-like group of letters, three attributes arise from this test: the

number of letters correctly struck, the time the patient took to complete the test and the total score of these two attributes combined. Neuropsychological scales are measures of the patient's state, assessed

with simple questions about their daily life, resulting in numbers within a defined and ordered range. Scales can be answered by the patients themselves or by their caregivers.

In terms of data structure, a patient is defined as a set of observations that belong to the same person, and it is this relation that allows us to have temporal information on the state of the disease (or

the state of the patient). The demographic information of patients at the first observation is presented

in Table 3.1.

Patients that are evaluated as AD in the first observation are not part of the CCC database, thus

there are only Normal, Pre-MCI and MCI first observations.

There are 616 patients that resolve into a total of 1604 instances. A patient can have more than

one observation, as shown in Figure 3.1.

Figure 3.1: Histogram of the number of observations for each patient.

Tools

All the preprocessing and classification tasks throughout this thesis were carried out using Eclipse's Integrated Development Environment for Java (Java JRE 7) and the software package


Waikato Environment for Knowledge Analysis (WEKA) [63], version 3.6 and version 3.8 with the HMM

package, version 0.1.1 and MultiInstance Filters package, version 1.0.8.

3.2 Data Preprocessing: Independent Datasets

Raw data from the CCC database had to undergo a number of preprocessing steps to be suitable for the classification task. The preprocessing workflow is illustrated in Figure 3.2.

From the original data, we started by cleaning up the dataset discarding all unusable instances and

discriminatory attributes. The present database is built with a generous amount of information, with all of the neuropsychological results and the consequent observation diagnosis, which can be one of four categories: Normal, pre-MCI, MCI, and AD. However, we are interested in predicting the patient's prognosis (the probability of the patient evolving to a certain outcome) instead of the diagnosis. Due

to this fact, data transformation and reclassification processes are necessary. Hence the creation of learning examples, where each patient's information was compiled into a single instance that can be categorized into one of the two prognostic classes. The prognostic task was set within a

time frame that we define as a time window. A patient is considered to be progressing to AD only if the progression happens inside the time window, rather than considering whether a patient will progress to AD at some undefined point in time. The creation of learning examples takes into account three different time windows of 3, 4 and 5 years, consequently defining three different datasets.

Further datasets were created as we considered different approaches to relate time points. This is the temporal processing step. The preprocessing task ends with Feature Selection and SMOTE. We are then ready to present the classification model.

Figure 3.2: Preprocessing workflow.


3.2.1 Data Cleaning

The cleaning process consists of a series of methods that alter or delete instances that are not suitable for the data mining problem addressed. Some attributes were also deleted, as they contained

discriminatory information about the class of the patients. Every step of this process is explained next.

Normal, Pre-MCI and MCI classes

The CCC database included 161 observations where patients had complaints, but those com-

plaints reflected a level of cognitive impairment that was considered normal for the patient’s demo-

graphics, and therefore these patients were considered normal. The first step of data cleaning was to

delete these records, since they do not represent a stage of the MCI to AD conversion/evolution.

Second, we addressed records in which the patient was diagnosed with pre-MCI. These are pa-

tients that, although they present a slight cognitive impairment, that is, cognitive abilities below what is

considered normal for their age and education level, do not fulfill the MCI criteria described in Section

3.1. This pre-MCI diagnosis represents a very early form of MCI and its existence is not yet consensual across the literature. There were 116 pre-MCI observations in the database; within the scope of this study, these observations were treated as if they were MCI.

Clinical Assessments and Missing Values

Missing Values are common to every real-data database. When the value for a certain attribute is

missing in a record, it indicates that the test was not carried out on that occasion/appointment. There may be several reasons for this: the MD did not consider the test necessary, because the patient proved capable (or incapable) on a similar test; the patient may have been too tired; or simply because of a lack of time or resources to carry out the test. Therefore, missing values can have different

meanings in different cases, but without further information it is not possible to distinguish between

those meanings. Moreover, there are tests that depend on one another, so that if one is not carried

out, the other cannot be carried out either. There is a set of attributes, the Blessed Scale attributes, that depends on answers given by the patient's caregiver. If a patient was not accompanied by their caregiver to the medical appointment, the Blessed Scale attributes had to be marked as missing. The

missing value histogram of the original dataset is shown in Figure 3.3. The majority of instances have less than 50% missing values, and instances with more than 90% are clinical observations. If these clinical observations are among a patient's first observations, they have to be deleted, as it is impossible to determine the progression of an attribute if the starting point is unknown.


Figure 3.3: Histogram of number of Missing Values per Instance in the Original Database.

Follow-up Observations

As mentioned, to study the clinical history of a patient it is necessary to have temporal information,

that is, more than one observation for the same patient. The creation of learning examples makes

it possible to assess the evolution of the tests’ results of the same patient from one observation to

another. A minimum of 3 observations of the same patient are necessary. The first two observations

(as they are from two different time points) make us capable of studding the temporal progression.

The third observation sets the class of the learning example. The creation of learning examples is

described in detail in Section 3.2.2. Therefore, we have deleted all patients with fewer than 3

observations. After this cleaning task, the number of patients in the database was 179, with a total

of 672 instances, showing a great decrease in the number of patients and instances compared to the

original dataset as can be seen in Table 3.2. The number of instances is now much smaller regarding

the initial number and not all patients will become into Learning Examples. The statistical description

of the database for patients observations and histogram of number of observations after data cleaning

process is shown in Table 3.3 and Figure 3.4.

                       Original Database   Cleaned Database
Number of Instances    1604                672
Number of Patients     616                 179
Number of Attributes   102                 98

Table 3.2: Number of instances, patients and attributes in the original dataset and at the end of the data cleaning process.

Class Observations Age (Mean±SD) Gender(F/M) Years of Schooling (Mean±SD)

MCI 179 68.75(± 7.81) 79/99 9.4 (±4.9)

Table 3.3: Demographic Information of the preprocessed database for the first observation of the patient.


Figure 3.4: Histogram of number of observations of the total number of patients after the data cleaning process.

3.2.2 Learning Examples

The creation of learning examples is a necessary and important step of the preprocessing, transforming the information in the database into useful information for the mining step.

A learning example contains useful information on a patient's state and temporal progression within a time window. Classification models are built from these learning examples. Without the creation of learning examples, the classifier would only be able to categorize instances into their observation classes (MCI and AD). As the diagnosis problem is not within the scope of this work, different datasets had to be created. Learning examples enable the classifier to build a prognostic model, as first intended.

In previous works [60,61] only two instances were necessary to determine the new class of the

patient. For each patient, the first observation is the baseline and sets the attributes values and the

other observation assesses the progression of the patient.

In order to study the temporal evolution of a patient towards AD, we have to look at more than one observation, where each observation is the assessment of a certain patient at a certain time instant and stage of the disease. By comparing two observations, it is possible to see how the disease has progressed from one instant to another.

To create a learning example, the attributes must have the information of the neuropsychological

tests for the same patient in two different time points. The class is defined by a third observation

that vouches for the stable or converter state of the patient: stable Mild Cognitive Impairment (sMCI),

if the patient's state remained stable and did not progress to AD, and converter Mild Cognitive Impairment (cMCI) if the patient's state progressed to AD during the given period of time. Hence the previously mentioned need to remove patients with fewer than 3 observations, which led to a decrease in the number of instances. We started with this minimum of two time points for two main reasons: having a uniform number of observations throughout the various patients and minimizing the cutback in the number of instances.

The first time point chosen is the first observation of the patient. The second one is the non-clinical observation closest to the end of the defined time window. As clinical observations have all the neuropsychological attributes as missing values, it is impossible to see their progression.

Time Windows

In a clinical context it is important to define a time frame in the progression of a disease. Knowing

that a patient will end up progressing to an AD state is less useful than knowing the time frame for that

progression. A patient is classified according to the time window, depending on whether the conversion point happens inside or outside the time window. Therefore, a patient that does not convert to AD in a 3-year window can convert in a 4- or 5-year window, and the number of cMCI examples tends to increase with the window size. Patients that make it as learning examples in one time window may not make it in another. However, a patient that resolves into a learning instance in a given time window will be present in all datasets for that window, independently of the temporal feature created.

To consider a patient stable, we have to define a time period during which the conversion to AD

does not occur. An sMCI learning example consists of two MCI observations within the time window

defined and a third observation outside this time window, that vouches for the stable state of the

patient, as shown in Figure 3.5. If the last MCI observation is inside the time window, there is nothing

ensuring that the patient is stable within the time window, as shown in Figure 3.6.

Figure 3.5: Stable MCI Learning Example in a 4-year time window.


Figure 3.6: Patient that does not become a learning example in a 5-year time window.

On the other hand, to state that a learning example is cMCI as outlined in Figure 3.7, the conver-

sion must happen inside the time window. So we need two MCI observations as well as a third one

diagnosed AD inside this window, assuring that the conversion occurred in the time period considered. The moment of conversion to AD is considered to be the first observation diagnosed as such, due to the fact that it is impossible to know the exact moment of conversion. Thus, if a patient has the first AD observation outside of the time window, one may argue that it is possible that the patient was already AD inside the window, but we should not infer so. This leaves patients such as the one outlined in Figure 3.8 unable to make it as learning examples.

Figure 3.7: Converter MCI Learning Example in a 5-year time window.


Figure 3.8: Patient that does not become a learning example in a 4-year time window.
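The labeling rules above can be summarized in a short procedure. The sketch below is a minimal illustration, assuming each observation is a (date, diagnosis) pair; the function name and example dates are illustrative, not taken from the CCC database.

```python
from datetime import date

# Sketch of the sMCI/cMCI labeling rules described above, assuming each
# observation is a (date, diagnosis) pair with diagnosis "MCI" or "AD".
def label_patient(observations, window_years):
    """Return 'cMCI', 'sMCI', or None (patient does not become an example)."""
    obs = sorted(observations)
    baseline = obs[0][0]
    window_end = baseline.replace(year=baseline.year + window_years)
    ad_dates = [d for d, dx in obs if dx == "AD"]
    if ad_dates:
        # The moment of conversion is the first observation diagnosed AD.
        return "cMCI" if ad_dates[0] <= window_end else None
    # Stable only if an MCI observation outside the window vouches for it.
    mci_inside = [d for d, dx in obs if dx == "MCI" and d <= window_end]
    mci_outside = [d for d, dx in obs if dx == "MCI" and d > window_end]
    if len(mci_inside) >= 2 and mci_outside:
        return "sMCI"
    return None

obs = [(date(2005, 1, 10), "MCI"), (date(2007, 3, 5), "MCI"),
       (date(2010, 6, 1), "MCI")]
print(label_patient(obs, window_years=4))  # third MCI observation lies outside the window
```

A patient for whom the function returns None corresponds to the cases outlined in Figures 3.6 and 3.8: nothing in the data vouches for either stability or conversion within the window.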

At the end of the creation of learning examples process, a patient is resolved into an instance.

However, not every patient fulfills these criteria and many do not resolve into learning examples. This

translates into another reduction in the total number of instances that compose the training and test sets of the classifiers, as can be seen in Table 3.4.

Time Window     Number of Patients
3-year window   78
4-year window   69
5-year window   67

Table 3.4: Number of patients after the creation of learning examples for the different time windows.

In some cases, datasets were constructed with more temporal information than only two obser-

vations. In these situations, the only difference when creating learning examples is that all non-clinical observations inside the time window were used as useful information. These cases are explained in

detail in Section 3.2.3.

3.2.3 Temporal Features

As mentioned before, within the scope of this work we consider a patient as a group of medical

observations of one individual at different time points. The creation of learning examples discussed in Subsection 3.2.2 transforms a patient into a single instance instead of the multiple observations in the original data. This transformation implies the use of temporal information, as a patient is not only two observations. These learning examples are fed to a non-temporal classifier, Naïve Bayes, which considers all attributes to be independent. In order to preserve as much temporal information as possible, we proceeded to a temporal processing stage, creating attributes


that held values relating two different time points. These temporal features are described in Figures

3.9, 3.10, 3.11 and 3.12. In these four figures, the first three gray lines represent different and ordered

observations (t1, t2 and t3) of the same patient that will resolve into a learning example with cMCI

class, as illustrated. The last gray lines show how the temporal relation was computed. As every

attribute has a corresponding temporal feature, the number of features grows.

Missing values had a special treatment in the creation of temporal features. Having a value for a

certain attribute at time point t1 does not imply the presence of a value for the same attribute at time point t2. This means that attributes relating two time points could eventually have a much greater percentage of missing values than the two attributes that gave origin to them. Consequently, these attributes would never survive Feature Selection and we would not be able to study the temporal relation between time points. To prevent this situation, we admitted a priori that a patient's state does not improve from one observation to another and at least stays the same, when the information in the database does not prove otherwise. Thus, when relating the values of two time points, if one is a missing value, the attribute is considered unchanged, whether the missing value is at t1 or t2.

For every relation tested, a new dataset was created; these datasets were tested in several ways, each specified next.

3.2.3.A Datasets without temporal features

The first dataset created did not contain temporal features. We start by simply adding the infor-

mation of two different time points as unrelated information. Here, for every test attribute X, two attributes were created: Xt1 and Xt2. These two attributes are considered independent from one another, as they simply contribute more information to the learning example.

The aim of this step was to see if two observations had more useful information than just one.

Figure 3.9: Temporal Processing without temporal features.

3.2.3.B Progression Feature Datasets

Next, we considered what we refer to as Progression Features. These progression features are simple arithmetic operations between the values of an attribute at the two time points, as all attributes are numeric. Each resulting attribute is normalized by the time (in days) between the two time points. Datasets with these new features were tested in several ways. We created datasets with the discrete values of the progression rate, with the progression rate of all non-clinical observations of the patient, and datasets with only the values of the progression rate and not the original scores of the neuropsychological tests.


Figure 3.10: Temporal Processing with Progression Rate.
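A minimal sketch of the progression feature described above, assuming missing values are represented as `None`:

```python
def progression_rate(x_t1, x_t2, days_between):
    """Progression feature: the change in a numeric test score between two
    time points, normalized by the elapsed time in days."""
    if x_t1 is None or x_t2 is None or days_between <= 0:
        return None
    return (x_t2 - x_t1) / days_between
```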

3.2.3.C Temporal Pattern Datasets

We related two attributes by defining a simple temporal pattern that specifies whether an attribute increased (code letter U for Up), decreased (code letter D for Down) or stayed the same (code letter S for Stable). This creates a nominal attribute for every two numeric ones. The Temporal Pattern was done with and without Feature Selection, considering all time points. When considering all time points, the nominal attribute constitutes a string of patterns that defines the overall pattern of the attribute over every non-clinical observation. For example, the string "SSUUU" indicates that a certain attribute was present in 6 non-clinical observations of the patient, had the same value in the first three observations and increased ever since.

As Naïve Bayes can treat numeric attributes as nominal ones, we created datasets that only contained information on the temporal pattern, deleting the original values of the neuropsychological tests.

Figure 3.11: Temporal Processing with Temporal Pattern.
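The U/D/S encoding can be sketched as follows (the function name is illustrative):

```python
def temporal_pattern(values):
    """Encode consecutive observations of a numeric attribute as a string of
    U (up), D (down) and S (stable) codes, one letter per pair of
    consecutive time points."""
    codes = []
    for previous, current in zip(values, values[1:]):
        if current > previous:
            codes.append("U")
        elif current < previous:
            codes.append("D")
        else:
            codes.append("S")
    return "".join(codes)
```

Note that a sequence of 6 observations yields a 5-letter string, matching the "SSUUU" example above.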

3.2.3.D Statistics-Based Summarization Datasets

The last temporal feature created was the mean of all the values of an attribute for a patient. The mean was calculated using three different formulas: the arithmetic mean as in Eq. (3.1), the geometric mean as in Eq. (3.2) and the harmonic mean as in Eq. (3.3), as well as the variance as in Eq. (3.4).

\bar{x}_a = \frac{1}{n} \sum_{i=1}^{n} x_i, \quad (3.1)

\bar{x}_g = \left( \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}}, \quad (3.2)

\bar{x}_h = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}, \quad (3.3)

and

\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2. \quad (3.4)

Summarization datasets use global characteristics of a time series to describe it in a more concise way. This information can be added to the original dataset or replace its values. Replacing the original dataset values contributes to the dimensionality reduction task, quite useful in datasets with large time series. The range of values the means provide does not contrast with the original attribute values, as mean values lie between the minimum and maximum values of the sequence. Hence, adding the mean values to the original database or replacing the original values with the mean ones does not alter the value range of the attribute. On the other hand, variance gives us the deviation of a sequence from its mean value and is also known as the second moment [45]. Sequences of similar values tend to have small variance.

The mode of a sequence is its most frequent value. This statistic can be useful to ignore outliers and small deviations from a patient's standard test value. However, these deviations are the first signal of a change in the cognitive state of patients, and using the mode would end up deleting these values. Moreover, the majority of temporal sequences in all year windows have only two observations, and the mode would not help in these cases.

The median ends up not being useful as well, considering all datasets used throughout this thesis. If a sequence has an odd number of values, the median is the middle value of the ordered sequence, whereas if the sequence has an even number of values, the median is the average of the two middle values. As said before, the majority of temporal sequences in all year windows have only two observations, so the median would end up having the same value as the arithmetic mean. For these reasons, mode and median datasets were not built.

Mean and variance datasets were created using every non-clinical observation of the patient, as can be seen in Figure 3.12, with A denoting the arithmetic mean, G the geometric mean, H the harmonic mean and VAR the variance.

Figure 3.12: Temporal Processing with different means and variance.
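Equations (3.1) to (3.4) can be computed as in the following sketch (the geometric and harmonic means assume strictly positive scores):

```python
import math

def summarize(values):
    """Statistics-based summarization of a patient's sequence: arithmetic,
    geometric and harmonic means (Eqs. 3.1-3.3) and the sample variance
    (Eq. 3.4)."""
    n = len(values)
    arithmetic = sum(values) / n
    geometric = math.exp(sum(math.log(v) for v in values) / n)
    harmonic = n / sum(1.0 / v for v in values)
    variance = sum((v - arithmetic) ** 2 for v in values) / (n - 1)
    return arithmetic, geometric, harmonic, variance
```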


3.2.4 Feature Selection

Having a large number of features and a small number of instances may not be useful at all. As the number of instances was cut down in the cleaning process and the number of features increased, the algorithms used in the data mining tasks can suffer from high variance. Feature Selection (FS) is very helpful in these cases. Feature Selection redesigns the dataset by choosing a subset of attributes that are relevant when building a classifier. Fewer features also mean that the algorithm can run faster and that the resulting model is easier to interpret. Feature Selection picks out a reduced number of features that have a low correlation with each other and a high correlation with the dataset's class. In this work we used Feature Selection from WEKA with a Correlation-based Feature Subset Evaluator, which assesses the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them, alongside a Best First search, which searches the space of attribute subsets by greedy hill climbing augmented with a backtracking facility.

Feature Selection depends on the dataset, and this technique was applied after the creation of learning examples and temporal features. The way we define the temporal relation between attributes builds different datasets that will have different features selected.

The number of features as well as the feature set itself were stored every time Feature Selection was applied, making it possible to evaluate and compare the features selected across datasets and to have further information about the datasets' results.

3.2.5 Synthetic Minority Over-sampling Technique

As shown before, different patients resolve into different learning examples across the different time windows. Thus, a patient that does not resolve into a learning example in a 4-year window can do so in a 5-year window. Therefore, the percentage of instances that belong to one class or another differs from time window to time window. In fact, the number of cMCI instances tends to increase as the window increases, since when we start to consider larger periods of time between observations it is more likely that the patients have already converted to AD. The class proportion in the three windows studied is represented in Figure 3.13, where we observe a significant class imbalance.


Figure 3.13: Class Imbalance on 3Y, 4Y and 5Y time windows datasets before applying SMOTE.

To overcome the current class imbalance, focusing on the 3 and 4-year windows, the Synthetic Minority Over-Sampling Technique (SMOTE) was applied from the WEKA environment. The percentage of minority class instances created was defined by choosing a value that approximately inverts the tendency of the class asymmetry. For the 3-year window, class balance was found at 350%, and for the 4-year window at 125%. No SMOTE was applied to the dataset of the 5-year window. SMOTE uses a k-nearest neighbors method, parameterized with k = 5, to create synthetic instances. Using these values we were able to obtain more balanced datasets, as depicted in Figure 3.14. The instances created artificially are not accounted for in the results section.

Figure 3.14: Class Imbalance on 3Y, 4Y and 5Y time windows datasets after applying SMOTE.


Other techniques to deal with class imbalance were considered. However, they focused on under-sampling the majority class and, due to the small number of instances, the dataset would get too small to be considered; therefore, these techniques were ruled out. Moreover, SMOTE has already proved to be a good technique when dealing with similar datasets [60, 61].
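A minimal sketch of SMOTE's core idea: interpolate each minority instance towards one of its k nearest minority neighbours. The `percentage` convention follows WEKA's (e.g. 350 means 3.5 synthetic instances per original one), rounded down here for simplicity; the real WEKA filter is more elaborate:

```python
import random

def smote(minority, percentage, k=5, seed=0):
    """Create synthetic minority instances by interpolating each instance
    towards a randomly chosen one of its k nearest minority neighbours
    (Euclidean distance). Instances are tuples of numeric values."""
    rng = random.Random(seed)
    per_instance = int(percentage / 100)
    synthetic = []
    for i, x in enumerate(minority):
        # nearest minority neighbours of x (squared Euclidean distance)
        others = [y for j, y in enumerate(minority) if j != i]
        others.sort(key=lambda y: sum((a - b) ** 2 for a, b in zip(x, y)))
        neighbours = others[:k]
        if not neighbours:
            continue
        for _ in range(per_instance):
            nb = rng.choice(neighbours)
            gap = rng.random()  # random point on the segment between x and nb
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```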

3.3 Data Preprocessing: Relational Datasets

The Hidden Markov Models classifier natively supports temporal information. For this reason a different preprocessing workflow was designed, illustrated in Figure 3.15. Despite the different preprocessing workflow, HMM datasets were constructed under the same reasoning as the independent datasets, as learning examples comprising patients with temporal information, from at least two observations, are still needed.

While the Naive Bayes classifier deals internally with missing values, the HMM algorithm implemented in WEKA does not support missing data, hence all missing values were replaced by the attribute mean of the training data. Feature Selection for multivariate time series comprises very different techniques from feature selection for independent datasets. Standard feature selection strategies cannot be applied to temporal data due to the temporal constraints and relations between features. There are few methods able to deal with these temporal constraints; one commonly used approach is to transform each multivariate time series into a row and then apply methods that ignore the temporal relation. CLeVer uses Principal Component Analysis to retain the correlation information among relational features [62]. In the HMM preprocessing workflow, none of these methods could be applied, as the WEKA package did not include them. Feature selection was applied when the dataset did not have relation attributes and, as a result, the selected features ended up equal to the features chosen for the independent dataset. Furthermore, WEKA was not able to support our datasets comprising numeric values, so all attributes were converted to nominal ones, as the discretization filter did not succeed in building a usable dataset. Normalization of the dataset was also necessary to acquire a valid dataset. Finally, the relational dataset was created resorting to an unsupervised attribute filter that converts a propositional dataset into a multi-instance dataset (a dataset comprising relational attributes).


Figure 3.15: Preprocessing workflow for the HMM datasets.

3.4 Classification Methodology

The independent classifier used on datasets with temporal features was Naïve Bayes. First, it is the classifier with the best results on this data, as shown by previous works of our group [60, 61]; second, as it is a Bayesian classifier, it can be compared to the Hidden Markov Models classifier, which is based on Bayesian networks as well.

Naïve Bayes deals with missing values internally by ignoring them in the classification process. WEKA allows Naïve Bayes to be parameterized in three ways:

• Use Gaussian estimator for numeric attributes;

• Use a kernel estimator for numeric attributes;

• Use supervised discretization to convert numeric attributes to nominal ones.

For every dataset, the three parametrizations were tested and grid search was used to find the best one for every model. To discern between several numbers of models, a grid search classification method was used in this thesis. A grid search classification consists of comparing several models built with different parameters for the classifier, filters and other algorithms that may be applied. While several model evaluation metrics were calculated, a common measure had to be chosen for this comparison and examined in all models' results. In this thesis, the area under the receiver


operating characteristic curve (AUC) was the metric chosen to rank the different models in the grid

search.

As we have a very small dataset in every time window, we chose not to divide the dataset into training and test sets, as these two sets would be very small, compromising the results. Instead, only Cross Validation results were obtained, using 5-fold Cross Validation on the whole dataset.

Hence, classification models present information for the best temporal relation used, the best parameters for Naïve Bayes and the time window.
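The grid search procedure can be sketched as follows, where `evaluate` is assumed to return the mean cross-validated AUC of one configuration; the configuration dictionaries and parameter names are illustrative:

```python
def grid_search(configs, evaluate):
    """Grid-search harness: evaluate every configuration (time window,
    Naive Bayes parametrization, ...) and keep the one with the best AUC."""
    best_config, best_auc = None, float("-inf")
    for config in configs:
        auc = evaluate(config)  # e.g. mean AUC over 5-fold cross-validation
        if auc > best_auc:
            best_config, best_auc = config, auc
    return best_config, best_auc

# illustrative grid matching the three WEKA parametrizations listed above
grid = [
    {"window": w, "nb": p}
    for w in ("3Y", "4Y", "5Y")
    for p in ("gaussian", "kernel", "supervised_discretization")
]
```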

3.4.1 Model Evaluation Metrics

The main task of a classifier is to correctly assign a known class to an unseen instance. The cross-validation process classifies a different set of unseen instances in every validation. Therefore, given a test set, we can know if each instance was:

• Correctly classified as belonging to the positive class, or True Positive (TP);

• Correctly classified as belonging to the negative class, or True Negative (TN);

• Incorrectly classified as belonging to the positive class, or False Positive (FP);

• Incorrectly classified as belonging to the negative class, or False Negative (FN).

Throughout this thesis, the class considered as positive in all datasets is the cMCI class, while the sMCI class is considered the negative one. Hence, if a model misclassifies an instance as positive, it means the patient belongs to the sMCI class yet the classifier considered it cMCI.

Several evaluation metrics derive from these four measures. The confusion matrix compiles this information and is the main outcome of every Cross Validation result.

Accuracy can be defined as the fraction of the total number of correctly classified instances:

\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}. \quad (3.5)

This metric, although very useful and commonly used, is biased by the class imbalance of the dataset. For example, in the dataset computed according to the 3-year time window, if a zero rule classifier was applied (classifying every instance as belonging to the majority class), the accuracy value would not be that bad, implying the classifier builds a good model for the prognosis. This is wrong because none of the minority examples were taken into consideration and identified.

Sensitivity (or Recall), defined as

\text{Sensitivity} = \frac{TP}{TP + FN}, \quad (3.6)

is the fraction of truly positive instances correctly classified as such, meaning the capability of identifying true progression cases among all instances that actually progressed. Translating this to the medical context: of all the patients that actually progressed to AD, what fraction did we correctly detect as progressing?


On the other hand, specificity, defined as

\text{Specificity} = \frac{TN}{TN + FP}, \quad (3.7)

does the exact same fraction but for the negative class. It is defined as the fraction of instances correctly classified as negative among all that actually were.

Precision is defined as

\text{Precision} = \frac{TP}{TP + FP} \quad (3.8)

and tells us, of all patients we predicted were converting, what fraction actually did. It is defined by the fraction of True Positives over the sum of True Positives and False Positives.

If we only consider that a patient progresses to AD when the outcome of the classifier gives us more than 90% confidence, we screen only the most certain progression cases; doing this, we will have higher precision but lower recall.

On the other hand, if we predict progression to AD with a lower confidence threshold, we will get a very inclusive prediction. This will cause higher recall but lower precision.

Both cases are attractive, and the values of precision and sensitivity can be balanced in order to have a model with good behavior in every case. To do so, we can calculate the F-measure, which gives us the harmonic mean of these two metrics.
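The metrics discussed above can be derived from the four confusion-matrix counts as in this sketch:

```python
def metrics(tp, fn, fp, tn):
    """Evaluation metrics derived from the confusion matrix (Eqs. 3.5-3.8),
    with cMCI as the positive class and sMCI as the negative class."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # F-measure: harmonic mean of precision and sensitivity (recall)
    f_measure = (2 * precision * sensitivity / (precision + sensitivity)
                 if precision + sensitivity else 0.0)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "f_measure": f_measure}
```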

The last metric considered is the Area Under the receiver operating characteristic (ROC) curve, or AUC. The ROC curve plots the performance of a binary classifier, with sensitivity as a function of 1-specificity. Thus, every point in this curve corresponds to a classification (sensitivity, 1-specificity) for a different classification threshold. A classification threshold is the cut-off probability for an instance belonging to a certain class, and it can range from 0 to 1. Considering the ROC plot, the (0,0) point means all instances were classified as negative and the (1,1) point means all instances were labeled as positive. Hence, the ROC curve connects these two points. The straight line between these two points represents a random guess classifier. As the performance of the classifier gets better, the ROC curve approaches (1,0), meaning all instances were truly labeled (sensitivity = 1, specificity = 1, 1-specificity = 0). To quantify this performance, the area under the ROC curve (AUC) is obtained by integrating this curve. A curve closer to (1,0) has a greater area, thus a greater AUC value than the random guess classifier line.
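A sketch of the AUC computation. Instead of integrating the curve explicitly, it uses the equivalent rank formulation: the probability that a randomly chosen positive instance scores above a randomly chosen negative one, with ties counting one half:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive instance receives a
    higher score than a random negative one (equal to the area under the
    ROC curve); labels are 1 for positive (cMCI) and 0 for negative (sMCI)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```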



4 Results and Discussion

Contents

4.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Independent Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Hidden Markov Models: Preliminary Results . . . . . . . . . . . . . . . . . . . . . 72


4.1 Feature Selection

Feature selection is an important tool to be used on real-world data, as it ends up cleaning the dataset of useless information. Having a small set of instances combined with a large set of features makes us more prone to an overfitting problem (or high variance) of the model to the dataset. Therefore, as increasing the number of instances is a time-expensive task, reducing the number of features is a crucial part of the solution to the overfitting problem.

Feature Selection was applied to almost every dataset, using a correlation-based evaluator and a best first search, as explained before in Section 3.2.4. Analyzing the frequency with which features were chosen across different datasets allowed us to discern the attributes that contribute the most useful information to the classification model. This complementary analysis aimed to identify the attributes that are seldom used and the ones that are most recurrent.

The different temporal relations created give rise to new attributes, consequently strengthening the overfitting problem, since adding more information is not always the same as adding more useful information. Hence, we attempted not only to identify the attributes with the most relevant information, by identifying the ones that are chosen most frequently, but also to go further and find whether these temporal attributes carry relevant information despite the presence of the same attribute at the two time points, t1 and t2.

For this analysis, a simple count of how many times an attribute was selected to contribute to a dataset was made. This count can go from zero to three in datasets using only two time points. For example, if a dataset was made by having two time points of an attribute and the relation between them, and the attribute was chosen each and every time, it has a count of three, for Att_t1, Att_t2 and Att_t1t2. If it was not chosen, it counts as zero. Then a simple sum of the counts of each attribute over every dataset was made and converted to a percentage, to have a clearer look at these numbers. Figures 4.1 and 4.2 illustrate the global count for every dataset and all three time windows. These figures make it clear that some attributes are more relevant than others, as their counts are higher.


Figure 4.1: Percentage of times attributes appeared in the classification model.


Figure 4.2: Percentage of times Z-scores attributes appeared in the classification model.

LM-A-INTREFER (63%), OR-TOTAL (52%), Orientation-Z (44%), MPR-Z (10%), MVI-FREE and Spacing (32%), Cancellation-Z (20%), PROVERBS (28%) and ORIENT-T (27%) were the most chosen attributes considering all datasets. The Logic Memory Interference test (LM-A-INTREFER) consists in


telling the patient a logical series of events and, after some other tests or interference with the patient, asking the patient to recall the same story. This attribute was the only one to be chosen in every dataset at least once. ORIENT-T, ORIENT-S and ORIENT-P are measurements of patients' ability to locate themselves in time, in space and with personal information, respectively, and OR-TOTAL is the sum of all these components, yet it can be missing even if the three orientation scores are not, and vice versa. Orientation-Z is the Z-score associated with the OR-TOTAL value. MRP (Raven's Progressive Matrices) are intelligence tests where image patterns are presented and the patient is asked to complete the missing image according to the pattern. MVI-FREE consists of a test of verbal memory with interference, where a group of words is said to the patient and, after some interference, the patient is asked to recall those words. Spacing is an attribute created in this work as the time (in days) between the two observations used in the learning example. Cancellation-Z is the Z-score associated with the As-cut, As-time and As-tot attributes, which consist in striking out every letter "A" from a word-search-like group of letters. PROVERBS is another frequently chosen attribute, consisting of a simple test where the patients are asked to complete a popular saying.

Further observation of these counts shows us the presence of ORIENT-T alongside OR-TOTAL as highly chosen attributes. These two attributes are related, as they derive from the same tests, even if OR-TOTAL has more information than ORIENT-T. ORIENT-S and especially ORIENT-P were poorly chosen, ORIENT-P being chosen in only 7% of the datasets created, meaning they do not have a high correlation with the class attribute and/or they have a high correlation with each other. Either way, they do not contribute relevant information to the classification model.

Comparing the features chosen with the corresponding Z-scores, it is possible to see that there is not a direct correspondence. Features with a high frequency can also have a high frequency in their Z-score form, e.g. the Orientation scores and Proverbs. The opposite behavior is also present, where a poorly chosen attribute is highly chosen in its Z-score form, e.g. MRP and the Cancellation Task (the Z-score for the As-cut, As-time and As-tot attributes). This can be due to the fact that adding the Z-score attribute first makes the original value obsolete, and vice versa for attributes like MVI-FREE that are highly chosen in the original form and poorly chosen in the Z-score form (WordRecallZ). These facts show that, while there is an obvious relation between an attribute value and its Z-score value, both can contribute new and relevant information to the model.

Even though Figures 4.1 and 4.2 do not show causality between a raw attribute being chosen and its Z-score value being chosen, there is a causal relation between an attribute's number of missing values and the frequency with which it is chosen. The missing value percentages are shown in Figures 4.3 and 4.4.


Figure 4.3: Missing values percentage for every attribute.


Figure 4.4: Missing values percentage for every Z-score attribute.

Attributes with a high percentage of missing values do not comprise a great deal of information, hence it is more difficult for that information to be relevant and especially to have a high correlation with the class. This can be noticed in the last 26 attributes of Figure 4.3, as examples. Nevertheless, the opposite reasoning is not true. Having a low number of missing values does not mean the attribute has relevant information. The demographic attributes were rarely chosen and they have no missing values.

With the feature selection information presented in Figures 4.1, 4.2, 4.3 and 4.4, it is impossible to know if a feature was chosen mainly with its original value t1, with the value of its second time point t2 or with the temporal feature t1,2. However, a high count still means the feature has useful information, whether it was chosen from t1, t2 or t1,2.

HMM Feature Selection results are not presented due to the preprocessing workflow for this algorithm. In that workflow, Feature Selection techniques were conducted, yet they bear no relation to the HMM algorithm: the database where feature selection was applied was the original database, without the creation of learning examples and temporal features.

4.2 Independent Datasets

In this section, the main results obtained with the temporal feature datasets are presented. This includes confusion matrices, accuracy, weighted AUC (AUC), sensitivity and specificity metrics for the Cross-Validation results. As Cross Validation was carried out using ten runs of five folds, the results comprise the mean of those ten seeds. Complete tables are shown in Appendix A, comprising not only the values for accuracy, AUC, sensitivity and specificity but also for F-measure and precision, along with the standard deviation for all these metrics. In every dataset and evaluation metric, the positive class is considered the cMCI class and the negative class the sMCI one. This means that sensitivity is calculated for the cMCI class and specificity for the sMCI class, as described in Subsection 3.4.1.

As mentioned before in Subsections 3.2.1 and 3.2.2, three different datasets were created with different patients categorized into two classes describing their progression (sMCI and cMCI). Every dataset for the 3-year time window (3Y) has the same patients, despite comprising different temporal features. The same happens for the 4 and 5-year time window (4Y and 5Y, respectively) datasets. The class distribution is shown in Table 4.1.

Number of Learning Examples

       sMCI   cMCI
3Y       64     14
4Y       47     22
5Y       35     32

Table 4.1: Class distribution of the three time window datasets.

It is important to notice the low number of instances in every dataset. This low number can induce different interpretation problems. First, a small change in the confusion matrix of a dataset will translate into a considerable change in the evaluation metrics presented. For example, in the 3-year window, a change of one misclassified instance of the positive class will translate into a change of roughly 7% in the sensitivity value (one out of 14 positive instances). This means that with only one instance classified differently from the baseline values, we will observe a substantial change in the evaluation metrics.

4.2.1 Dataset without temporal features

The first step taken to perceive whether information from more than one observation was in fact useful was creating learning examples with observation values from two different timepoints for each test: the value at baseline and the value of a second observation close to the end of the time window. These learning examples were created according to the methodology described in Section 3.2.3.A, where no temporal feature is created, yet new information is added to the traditional learning example, namely temporal information from a second timepoint. These two sets of values are considered independent from the classifier's perspective, as they are not stated to be related in any way.

The resulting confusion matrices are presented in Figure 4.5; cross validation results can be found in Table 4.2, and Figures 4.6 and 4.7 show the cross validation results of all ten seeds.

Figure 4.5: Confusion Matrices for the cross validation results for datasets without temporal features.

Table 4.2: Cross Validation results for datasets without temporal features.


Figure 4.6: Cross Validation results for the Accuracy and AUC for datasets without temporal features.

Figure 4.7: Cross Validation results for the Specificity and Sensitivity for datasets without temporal features.

Table 4.2 presents the values of accuracy, AUC, sensitivity and specificity for every dataset as a mean over all ten cross validation tasks carried out for each one. The first dataset, labeled t1t2, contains information for the two timepoints. In contrast, the dataset labeled t1 contains only the first timepoint and the dataset labeled t2 contains information only for the second timepoint considered in t1t2. The t1 dataset was built in the same way as previous works by Lemos [60] and Ferreira [61], although it only contained patients with more than one follow-up observation, reducing the number of instances. The results are slightly different as the dataset is slightly different. Therefore, t1 is considered the baseline to which we compare the t2 and t1t2 values, as these two are the new approaches. White table cells are the baseline values, green table cells are values better than the baseline and red table cells are worse ones.

Exploring the results presented in Table 4.2, we can see better values for t2 and t1t2 than for t1 in almost every time window and evaluation metric. Specificity values for the t2 dataset are worse than the baselines for every time window. Values within the t2 dataset are very close to the end of the time window, and consequently very close to the conversion-to-AD timepoint.

The t2 dataset was created to validate that the t1t2 dataset did not have better values just because the classifier gained from the addition of t2. By comparing t1 with t2, it is possible to see that t2 alone gives better results, probably because it is an observation closer to the observation that defines the progression state. However, the t2 model does not outperform the t1t2 model. It is interesting to see


that t2 and t1t2 have the same evaluation metric values in the 3-year window, yet, as the time window increases, the t2 values start to drop while the t1t2 values actually increase.

However, looking only at this last observation may construct a good model (better than the t1 model), but not a useful one. The last observation within the time window is very close to the real conversion to AD (in the cases where this happens), and predicting conversion to AD at this point will not do much for patients and their families. Moreover, the t1t2 results are still better than using only t2, which shows us that it is not about how far along the disease progression the observation is chosen, but about the two observations combined.

This is the starting point for the subsequent datasets, as they all combine information from exactly those two timepoints in various ways, aiming to explore ways of combining the information from the two timepoints in order to obtain better models.

1 Patient : n Learning Examples

Before exploring ways to combine two timepoints, we tried to tackle the low number of instances comprising all datasets. Hence, a different approach was also tried when creating learning examples. Let us for now consider that a patient has more than one non-clinical observation within the time window. It is possible to create more than one learning example from the patient if we combine every possible pair of observations. This process is similar to the previous one: the creation of time windows is the same, as well as the patients that comprise each dataset. The creation of learning examples is done in the same way, yet the number of instances that derive from one patient can be more than one, depending on the number of observations inside the time window. For example, if we consider a sMCI patient with three observations within the time window, three learning examples will derive from this patient: t1t2, t1t3 and t2t3, whereas in the previous dataset only the first (t1t2) would be present.

Consequently, the number of instances within every dataset increases as shown in Table 4.3.

Time Window      1 Patient : 1 Learning Example   1 Patient : n Learning Examples
3-year window                78                                193
4-year window                69                                166
5-year window                67                                170

Table 4.3: Number of instances for every time window in datasets with only one learning example per patient and with more than one learning example per patient.

       1 Patient : 1 Learning Example   1 Patient : n Learning Examples
         sMCI    cMCI                     sMCI    cMCI
3Y         64      14                      142      51
4Y         47      22                      111      55
5Y         35      32                       95      75

Table 4.4: Class distribution for every time window in datasets with only one learning example per patient and with more than one learning example per patient.


However, the 5-fold cross validation was conducted in a slightly different way. Learning examples created from the same patient had to be separated, preventing their appearance in a training fold and a test fold simultaneously: if the classifier is trained with information from a patient, that patient cannot be used for testing as well. Cross validation folds are created as subsets of the dataset, using a random seed to make every fold different, thus giving statistical significance to the results, as they are obtained on different random sets. Nevertheless, these folds try to be as similar as possible to the whole dataset, especially maintaining the class proportion and the same number of instances between folds. These constraints made it impossible to obtain statistically relevant results, as the folds created contained the same instances, even when using a great number of random seeds.
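The patient-level constraint can be sketched as follows; this is a minimal Python illustration of the idea, not the custom Java code used in this work, and the greedy balancing heuristic is an assumption:

```python
import random
from collections import defaultdict

def patient_grouped_folds(examples, n_folds=5, seed=0):
    """Split learning examples into folds such that all examples from
    the same patient land in the same fold, so no patient appears in a
    training fold and a test fold simultaneously.

    `examples` is a list of (patient_id, example) tuples.
    """
    by_patient = defaultdict(list)
    for pid, ex in examples:
        by_patient[pid].append(ex)
    patients = sorted(by_patient)
    rng = random.Random(seed)
    rng.shuffle(patients)
    folds = [[] for _ in range(n_folds)]
    # Assign each patient's whole block of examples to the currently
    # smallest fold, keeping fold sizes roughly balanced.
    for pid in patients:
        smallest = min(folds, key=len)
        smallest.extend((pid, ex) for ex in by_patient[pid])
    return folds
```

Each patient contributes to exactly one fold, which is the property the thesis enforces.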

Consider now another patient, this time a cMCI patient, with three observations within the time window. The classes of the learning examples derived from this patient are: sMCI for t1t2, sMCI for t1t3 and cMCI for t2t3, as the original CCC dataset only in rare cases has more than one AD observation for the same patient. This results in a large increase in sMCI learning examples and a small increase in cMCI learning examples, which contributes immensely to the class imbalance in every window, particularly because the datasets were already imbalanced towards the sMCI class, as we can see in Table 4.4. Class imbalance was already present in the datasets containing one learning example per patient. However, to enforce the new constraint where learning examples from the same patient cannot be in different folds, custom Java code was used instead of resorting to WEKA's cross validation. As the SMOTE filter is applied together with the classification task, to prevent the classification of synthetic instances, the cross-validation folds were created on an imbalanced set. Due to these two facts, the 5 folds for each time window ended up the same, producing results lacking statistical significance, thus ending this analysis.

4.2.2 Progression Feature Datasets

Progression Features are attributes calculated from two timepoints within the time window considered for that database. They are numeric attributes that can be converted into nominal ones using supervised or unsupervised discretization techniques. These attributes compress information from the two timepoints, relating the two values in the following ways:

PF+ = (t2 + t1) / Spacing
PF− = (t2 − t1) / Spacing
PF× = (t2 × t1) / Spacing
PF÷ = (t2 ÷ t1) / Spacing     (4.1)

where t1 is the first timepoint considered and t2 the second one. The Spacing attribute is the time, in days, between t2 and t1. Dividing the timepoints' relation by the Spacing can be seen as nothing more than a normalization of the values, as patients do not have the same time interval between observations.
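Equation 4.1 can be computed directly; a minimal sketch (with illustrative names, not the thesis implementation):

```python
def progression_features(t1, t2, spacing_days):
    """Compute the four progression features of Equation 4.1.

    `t1` and `t2` are the attribute values at the two timepoints and
    `spacing_days` is the time between the two observations, in days.
    """
    return {
        "PF+": (t2 + t1) / spacing_days,
        "PF-": (t2 - t1) / spacing_days,  # slope of the line through the two points
        "PFx": (t2 * t1) / spacing_days,
        "PF/": (t2 / t1) / spacing_days,  # growth ratio, normalized by time
    }
```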

Considering the first metric, PF+ gives us a simple summation of the two timepoints, normalized in terms of time. PF− gives us the difference between the two timepoints; divided by the spacing between them, it is the slope of the line those two points define. PF× gives us the product of the two timepoints, which makes slight differences in values much more evident, and this may translate into good predictive power. As we are only dealing with two timepoints here, PF÷ gives us the growth ratio of the second timepoint over the first one. These relation attributes were added to the original dataset containing the values of the two timepoints. An unsupervised discretization method was also tried out, and another dataset was fed to the classifier without applying the feature selection step.

When considering all timepoints within a time window, two other relations were considered: the sum of all attribute values and the sum of all slopes. These attributes were added to the dataset containing only the first timepoint considered.

Progression Feature with original values

The first four progression feature datasets delivered to the classifier contained the t1 and t2 attributes, combined with one temporal feature each. Confusion matrices are presented in Figure 4.8. Table 4.5 shows the mean values over all ten seeds of the cross validation, and the distribution of these same values is presented in Figure 4.9.

Figure 4.8: Confusion Matrices for the cross validation results for datasets with progression features.


Table 4.5: Cross Validation results for the datasets with progression features.

Figure 4.9: Cross Validation results for the Accuracy, AUC, Specificity and Sensitivity for datasets with progression features.

Accuracy and AUC results for all progression features show that the overall results of progression feature attributes are better than the values for the baseline datasets. Progression feature models surpass the baseline model for the 3-year window and the 4-year window. The 5-year window progression feature model does not outperform the baseline model, yet it is not significantly worse. Moreover, if instead of comparing the progression feature models with the baseline using two timepoints, we consider the results for just one timepoint, then for the 5-year window the progression feature model outputs better results in almost every operation and evaluation metric (with the exception of accuracy and sensitivity values for PF+ and specificity values for PF−, PF÷).

Comparing the different progression feature models, there is one that clearly surpasses the others. PF× has basically every value better than the baseline (with the exception of sensitivity values for the 4 and 5-year windows), whereas the others do not behave as well. However, PF× does not always yield the best value compared with the others in the same time window and for the same metric.

As mentioned before, PF× ends up accentuating small differences between values that are very similar, and this can be an advantage, especially in datasets comprising small time frames, where these differences are not otherwise accentuated. Hence, this progression feature can contribute useful information to the dataset and give rise to good classification models. This reasoning is validated by the good sensitivity values for this model, standing out from the overall poor sensitivity results of the other models.

As the time window increases, so does the spacing between the two timepoints considered, and the progression features derived from this information lose predictive power. While a large difference between two close timepoints may predict a fast progression, probably towards AD, the same difference over a broader time frame loses importance and does not stand out. This makes the negative instances less distinguishable, which is validated by the poor specificity values.

Progression Feature with Unsupervised Discretization

This dataset was constructed in a similar manner to the previous one, with a slight difference: an unsupervised discretization filter was applied to the dataset before the classification task. This unsupervised attribute filter discretizes attribute values into bins, ignoring the class the instances belong to. This preprocessing filter was applied to help the classification task of the Naive Bayes classifier, as this classifier operates on discrete values. The WEKA application enables different discretization methods to be parametrized into the classifier if it deals with numeric values. In these datasets, both original values and progression features were discretized. Confusion matrices are shown in Figure 4.10.
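Equal-width binning, one common default for unsupervised discretization, can be sketched as follows (a simplified illustration of what an unsupervised Discretize filter does, not WEKA's actual code):

```python
def equal_width_bins(values, n_bins=3):
    """Unsupervised equal-width discretization: each numeric value is
    replaced by the index of the bin it falls into, ignoring the class
    labels entirely."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant attribute
    out = []
    for v in values:
        b = int((v - lo) / width)
        out.append(min(b, n_bins - 1))  # the maximum value goes into the last bin
    return out
```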


Figure 4.10: Confusion Matrices for the cross validation results for discretized datasets with progression features.

The evaluation metrics that derive from these values are shown in Table 4.6, and the distribution of these values is shown in Figure 4.11.

Table 4.6: Cross Validation results for discretized datasets with progression features.


Figure 4.11: Cross Validation results for the Accuracy, AUC, Specificity and Sensitivity for discretized datasets with progression features.

Overall, the results with discretized datasets are good, outperforming baseline values in almost every metric and progression feature. The 5-year window dataset comprising the PF+ progression feature has the highest values throughout this work, with a specificity value of 0.96. Looking at the confusion matrix for this dataset, we can see that, of all negative instances, only one was misclassified. Moreover, all 5-year time window datasets have great specificity values, with a maximum of three negative instances classified as positive. However, for these same datasets, the sensitivity values do not behave as well as the specificity. These high specificity values could indicate overfitting of the classification model towards the negative class. However, the 5-year dataset is the only one naturally balanced. Positive instances are worst classified in the 5-year window, compared to all other time window datasets. All these facts indicate that instances belonging to the sMCI class are well discernible from the cMCI ones at a 5-year window: discrete values for the neuropsychological test results and the corresponding progression features of stable patients are strong indicators that patients will not evolve to AD in a 5-year window. As these datasets comprise values for the first and second timepoints as independent values, the discretization bins may be different for the same attribute at different timepoints.

3-year window datasets build classification models that outperform baseline models with every progression feature built. AUC values for these datasets are slightly worse than for 5-year window datasets, implying the model is not as good as the one built with 5-year datasets. With a closer look at the 3-year datasets, especially the classification of real positive classes, it is possible to see that the values for TP (=10) and FN (=4) give rise to different percentage values for the same number of positive instances. This happens because these TP and FN values derive from mean calculations over the ten cross validation seeds. While in the 3-year window the numbers of TP and FN seem to be the same throughout different datasets, these numbers were not the same throughout different seeds. This can happen throughout the construction of the confusion matrices.

Turning our attention to the 4-year window, it is possible to see that these datasets do not yield better classification models than the ones built with baseline datasets. However, these models are never worse than the models created from datasets with no temporal feature and only the first timepoint.

Comparing now the different progression features for the same time window, PF× ends up having, once again, the better overall values and PF− the worst. Looking at accuracy values for the 3-year time window, they indicate that the best value belongs to PF×. However, this metric is highly sensitive to class imbalance, and the 3-year dataset is very imbalanced towards the negative class. Looking now at AUC values, the model that plots the best ROC curve is PF÷ for the 3-year window, PF+ and PF÷ for the 4-year window, and PF÷ for the 5-year window, comprising some of the highest values for the AUC measure.

Overall, discrete datasets produce models with better behavior than continuous datasets, especially when using the Naive Bayes classifier.

Progression Feature without Feature Selection

As it is not possible to control the features chosen by the feature selection algorithm, datasets with progression features were also built with the original set of features. To do this, the datasets skipped the preprocessing feature selection step, thus maintaining all features. These datasets have the 88 original features plus the progression feature for each one, resulting in 176 features for 78, 69 and 67 instances in the 3, 4 and 5-year window datasets, respectively.

Confusion matrices are presented in Figure 4.12. The evaluation metrics that derive from these values are shown in Table 4.7. Box and Whiskers plots are presented in Figure 4.13 for all ten seeds of the 5-fold cross validation process.

The evaluation metrics presented in Table 4.7 show that the feature selection step is an important part of the preprocessing workflow, as results are generally worse than baseline results. The table shows few exceptions: accuracy values for the 3-year window datasets comprising the PF+, PF− and PF× progression features, all specificity values for the 3-year window datasets, and sensitivity values for the 5-year window datasets.


Figure 4.12: Confusion Matrices for the cross validation results for datasets with progression features and with the original set of features.

Table 4.7: Cross Validation results for the datasets with progression features and with the original set of features.


Figure 4.13: Cross Validation results for the Accuracy, AUC, Specificity and Sensitivity for datasets with progression features and with the original set of features.

Comparing the confusion matrices for these classification models with the baseline ones, it is possible to see that accuracy values are only better due to the good classification of negative instances, as positive ones are badly classified. The good performance on negative instances leads to good specificity values for the 3-year window. As we mentioned, the 3-year window is highly imbalanced, and adding the high variance of these datasets, caused by the high number of features, the model built here cannot overcome the overfitting problem. With a high number of features and a much smaller number of instances, namely positive ones, the model cannot learn and create rules to classify them. As the time window increases, the class imbalance gets smoother, and the 5-year window datasets are balanced. In these cases, the classification models built can adapt better to the increasing number of positive instances and learn better. Although sensitivity values seem better than the baseline, looking at the respective confusion matrices, the number of TP and FN is only better for PF+, with only one instance better classified than in the baseline.

Progression Feature without the original values

As it is not possible to control the features selected across datasets, new datasets were created comprising only the values of the progression features, deleting all original attribute values of the neuropsychological tests. Here, the progression features are the same as those created for the first progression feature datasets. Feature selection is also applied, although progression features will not be competing with the original values but with each other. Confusion matrix results are presented in Figure 4.14, the evaluation metrics that derive from these values are shown in Table 4.8, and the distribution of these values is shown in Figure 4.15.


Figure 4.14: Confusion Matrices for the cross validation results for datasets with progression features and without the original values.

Table 4.8: Cross Validation results for the datasets with progression features and without the original values.


Figure 4.15: Cross Validation results for the Accuracy, AUC, Specificity and Sensitivity for datasets with progression features and without the original values.

As Table 4.8 shows, the overall results for every dataset are worse than the baseline results. There is one model that stands out from the poor performance: PF× with a 3-year window, showing accuracy, AUC, specificity and sensitivity values better than the baseline, indicating that this model can outperform the one built without any temporal feature and with just the information from the two timepoints, seen as independent features. The worst model considered here is PF−, designed to calculate the slope between the two timepoints. This model turns out to be the worst for every dataset, as it is not able to distinguish real negative instances, as can be seen from the TN and FP values of 32 in the confusion matrix for the 3-year dataset. Sensitivity results are better than the baseline ones for almost every model, only being worse for PF−, PF× and PF÷ for the 5-year dataset.

All of this supports the hypothesis that progression features do not behave well towards negative instances, as this is the main reason identified for the bad performance of these models. The classification of positive instances is not bad, indicating that progression features are able to differentiate positive instances; it is the negative instances they cannot distinguish.

As mentioned before, these datasets were submitted to the feature selection task of the preprocessing workflow. The resulting feature sets turn out to be very poor. For instance, the feature set for the 3-year time window for the PF− model comprises only one attribute, the age attribute, and, as explained before, demographic attributes do not have progression feature values. This shows that progression features alone do not give rise to good classification models.

Progression Feature Comparison

The different progression features were constructed in the same way throughout these datasets, while small changes in the preprocessing workflow were introduced to evaluate the resilience of these features in different datasets, aiming to find the best combination and, consequently, the best solution. Table 4.9 and Table 4.10 show all evaluation metrics (accuracy, AUC, sensitivity and specificity) across all progression feature datasets (with the progression features calculated from only two observations).


Table 4.9: Accuracy and AUC comparison for progression feature datasets.

Table 4.10: Specificity and Sensitivity comparison for progression feature datasets.

In these tables, all values are compared within each metric: the lightest cell contains the lowest value, ranging up to the darkest cell with the highest value.

The worst accuracy and AUC values stand out, meaning that the PF− dataset for the 3-year time window is the worst performing model. This bad performance is due to the incapability of classifying negative instances, as its sensitivity value is not a bad one. The worst sensitivity values belong to datasets containing the original set of features, that is, those not submitted to the feature selection process. This poor performance improves as the time window increases. On the other hand, the best evaluation metric values belong to the discrete datasets, especially for the 5-year window. This good performance, indicated by the AUC values, is validated by the good accuracy, specificity and sensitivity values.


Considering those worst performances as outliers, it is possible to see that values within each metric do not vary all that much, especially if we take into account that, due to the small number of instances, the misclassification of a single instance translates into a substantial change in an evaluation metric.

This does not mean that these datasets do not improve the baseline results; quite the contrary. All these improvements may be small in these datasets but could be substantial in a more sizable dataset. They all contribute to a better classification of sMCI versus cMCI, and considering more information builds better models.

Progression Feature with all Time points

The following datasets were built considering all non-clinical observations within the time window. The sum attribute was built the same way as the PF+ attribute, yet summing the values of all non-clinical observations. Likewise, the slope attribute was built based on PF−, also summing over all non-clinical observations. These attributes were added to datasets comprising only the first timepoint t1. The confusion matrices for the cross-validation models are shown in Figure 4.16, the respective evaluation metrics are presented in Table 4.11, and the distribution of these values is shown in Figure 4.17.
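One possible reading of these two attributes is sketched below; the exact normalization used in the thesis is not fully specified, so treat both function bodies as assumptions:

```python
def sum_attribute(times, values):
    """Sum of all observed values, normalized by the total time span
    (in days), by analogy with PF+."""
    return sum(values) / (times[-1] - times[0])

def slope_attribute(times, values):
    """Sum of the slopes between consecutive observations, by analogy
    with PF-. `times` are observation dates in days."""
    pts = list(zip(times, values))
    return sum((v2 - v1) / (t2 - t1) for (t1, v1), (t2, v2) in zip(pts, pts[1:]))
```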

Figure 4.16: Confusion Matrices for the cross validation results for datasets with progression features, considering all observations within the time window.


Table 4.11: Cross Validation results for the datasets with progression features, considering all observations within the time window.

Figure 4.17: Cross Validation results for the Accuracy, AUC, Specificity and Sensitivity for datasets with progression features, considering all observations within the time window.

The overall results show that the classification models built from datasets for the 3 and 4-year windows outperform the baseline model. However, classification models built from the 5-year window datasets do not have the same behavior, showing worse results than the baseline ones.

Comparing progression features, the sum dataset behaves better than the slope dataset, following the tendency of the previous datasets comprising the PF+ and PF− progression features. This behavior throughout the time windows shows that progression features can help the classification model, especially for small time frames. Progression features accompanied only by the original attribute value of the first timepoint do not outperform the baseline model for wider time frames. This may indicate that progression features do not have the predictive power we hoped for, as the attribute value of the second timepoint is more meaningful than the slopes between all timepoints within the temporal window. This comparison, between a dataset with t1, t2 and the corresponding progression feature versus a dataset with t1 and the sum of all corresponding progression features, is approximate, as the original dataset comprises a majority of patients with only two timepoints.

4.2.3 Temporal Pattern Datasets

The temporal pattern created consist in assigning code letter to the temporal behavior of attribute,

as explained in Section 3.2. Having learning examples containing information from 2 time points, the

code letter was added to the dataset. The resultant dataset was fed to the classifier with and without

the Feature Selection step. Furthermore, the temporal pattern was also applied to learning examples

containing all Time Points. The results obtained are presented in Figure 4.18 and cross validation

results are displayed in Table 4.12 as well as the distribution of these same values are shown in

Figure 4.20.

Figure 4.18: Confusion Matrices for the cross validation results for datasets with temporal pattern attribute.


Table 4.12: Cross Validation results for the datasets with temporal pattern attribute.

Figure 4.19: Cross Validation results for the Accuracy and AUC for datasets with temporal pattern attribute.

Figure 4.20: Cross Validation results for the Specificity and Sensitivity for datasets with temporal pattern attribute.

Overall, the temporal pattern results do not show the improvement we were looking for when adding the temporal pattern attribute. Of these three models, the simple temporal pattern for two timepoints holds the better results.

The temporal pattern model for two timepoints can outperform the baseline model in the 3 and 4-year windows (except for the specificity value in the 3-year window) and seems to match the performance of the baseline model in the 5-year window. Exploring this match a little further, the feature selection results show that the features chosen by the algorithm are basically the same, adding one temporal pattern attribute for the Graphomotor Initiative's Z score. In fact, feature selection seldom chooses temporal pattern attributes. This can be because the temporal pattern represents the behavior between two timepoints, but these two timepoints can have different time intervals. As there is no possibility of performing a temporal regularization to obtain timepoints with the same spacing between them, the temporal pattern is not able to distinguish between a slow and a fast increase/decrease. Consider two learning examples from the same dataset (and consequently the same time window): while the first one can have its two timepoints two years apart, the second one can have them four years apart. An attribute that increases within 2 or 4 years can have very distinct meanings. Moreover, the gradient of this increase can be very different and still be categorized into the same code letter. All these facts contribute to a poor correlation between almost every temporal pattern attribute and the original one, and therefore to being left out of the optimal feature selection dataset.

Trying to surpass this, another dataset with temporal patterns was created, this time skipping the feature selection step (¬ FS). However, this led to a highly biased dataset, with a great number of features for a small number of instances, producing worse results than the previous temporal pattern dataset.

Furthermore, another approach was carried out concerning temporal pattern datasets. This time, the dataset was created using all timepoints available within a learning example, in order to capture more information about the temporal behavior of an attribute. Here, we have a sequence more specific than the previous ones, which were composed of only one letter. However, this created a bigger set of possible nominal values for the same number of instances, and the classifier was not able to create rules for every nominal value, leading to worse results. In this dataset, two learning examples with two and three timepoints, both with simultaneous end timepoints and always maintaining the same value for an attribute, will have different values for the temporal pattern attribute (S and SS, respectively), which the classifier perceives as two completely different values. Concluding, the temporal pattern can yield a good representation of a time series, yet not when this time series does not have regular temporal intervals. A final dataset was created trying to divide patients with the same number of timepoints, to try to achieve more regular time intervals. Sadly, there was not sufficient data to do this analysis and no results could be obtained.
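The letter coding can be sketched as follows; the alphabet ('I'ncrease, 'D'ecrease, 'S'table) is illustrative, as the actual coding is defined in Section 3.2:

```python
def temporal_pattern(values, tol=0.0):
    """Encode each consecutive transition of a sequence as a code
    letter: 'I' (increase), 'D' (decrease) or 'S' (stable within a
    tolerance `tol`)."""
    letters = []
    for v1, v2 in zip(values, values[1:]):
        if v2 - v1 > tol:
            letters.append("I")
        elif v1 - v2 > tol:
            letters.append("D")
        else:
            letters.append("S")
    return "".join(letters)
```

Note how a constant two-observation example and a constant three-observation example yield the distinct patterns S and SS, which a nominal-attribute classifier treats as unrelated values.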

4.2.4 Statistics-Based Summarization Datasets

Lastly, here we present the results concerning the statistics-based summarization datasets. In this case, for every temporal sequence of each attribute, the mean and variance were calculated and treated in different datasets. Concerning the mean attribute, datasets were created relating every observation through different means: arithmetic, geometric and harmonic. These attributes were added while maintaining the original value of the first observation. Confusion matrices are shown in Figures 4.21 and 4.22. Accuracy, weighted AUC, sensitivity and specificity results are presented in Table 4.13 for the mean and variance datasets, and the distribution of the values for the ten seeds used is presented in Figure 4.23.
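The three means can be sketched as follows (assuming strictly positive test scores, which the geometric and harmonic means require); the well-known ordering harmonic ≤ geometric ≤ arithmetic underlies the discussion below:

```python
import math

def sequence_means(values):
    """Arithmetic, geometric and harmonic means of a temporal sequence
    of strictly positive values."""
    n = len(values)
    return {
        "arithmetic": sum(values) / n,
        "geometric": math.prod(values) ** (1.0 / n),
        "harmonic": n / sum(1.0 / v for v in values),
    }
```

For example, `sequence_means([2.0, 8.0])` gives an arithmetic mean of 5.0, a geometric mean of 4.0 and a harmonic mean of 3.2, illustrating that the harmonic mean is the smallest of the three.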


Figure 4.21: Confusion Matrices for the cross validation results for datasets with mean features.

Figure 4.22: Confusion Matrices for the cross validation results for datasets with variance features.

Table 4.13: Cross Validation results for the datasets with statistics-based features.


Figure 4.23: Cross Validation results for the Accuracy, AUC, Specificity and Sensitivity for datasets with statistics-based features.

At a glance, it is possible to observe that all means had a good performance when compared to the baseline results in the 3-year window. Overall, the 4-year window has a better performance as well, though not as good as the first window, whereas the 5-year window does not outperform the baseline results. Comparing the different means, we can see that it is not possible to find a general rule that describes their behavior. The arithmetic mean has better results for all metrics for the 5-year window, and for sensitivity and specificity for the 4-year window. The geometric mean provides better results for AUC in the 3-year window and specificity for both the 3 and 5-year windows. The harmonic mean provides better accuracy values for the 3 and 4-year windows, AUC values for the 4-year window and sensitivity for the 3-year window.

Every mean summarizes the global characteristics of a temporal sequence into a single value. If a sequence is composed of equal values, the mean (be it arithmetic, geometric or harmonic) is that same value. If we consider a temporal sequence with values within a small range, the mean value can represent the sequence truthfully. However, if a sequence consists of values within a large range and containing outliers, the mean will not characterize the sequence well. Due to these facts, datasets comprising the mean attribute only outperform the original dataset in small time windows and lose quality as we increase the size of the time window. The 3-year window comprises fewer timepoints than the 4 and 5-year windows, thus its sequences tend to be more similar than the sequences found in the 4 and 5-year windows. Moreover, the dataset created with 3-year window learning examples is highly imbalanced, having a great number of sMCI examples compared with cMCI. sMCI examples, as they remain stable within the MCI condition, maintain their test scores stable as well, at least more stable than cMCI examples, consequently contributing to the good performance of the model in this time window.

The good sensitivity values obtained for basically every window may indicate a slight overfitting of the model, due to class imbalance, towards the cMCI class. As the number of instances belonging to the cMCI class is lower than the number belonging to sMCI, the means of sequences belonging to the cMCI class will be closer to each other than the means of the wide variety of sMCI sequences. All these facts can lead to the results shown in Table 4.13.

Moreover, looking at the different types of mean calculated, the combination of the harmonic mean applied to the 3-year window dataset gives us the best results. The harmonic mean is less susceptible to outlier values than the other means, and this is another reason why it behaves well in the 3-year window. Among all means, the harmonic mean is the one that gives the smallest value for the same sequence [Medias]. As the MCI values tend to be smaller than the AD ones, the harmonic mean tends to represent sMCI examples better than cMCI ones. On the other hand, the arithmetic mean gives the greatest value for the same sequence, which can explain the good performance values for the 5-year window, compared with the other means.

Variance was also tried out as a summarization method based on a statistical measure. The variance results are overall worse than the baseline, with few exceptions. Nonetheless, we can see a trend similar to the mean datasets, where the best behavior is found in the 3-year window and it gets worse as we increase the size of the temporal window. Despite being comparable to the mean datasets, the results are not quite what we expected. Variance is a measure of how far the sequence deviates from its mean value. This deviation measure should be a good predictor of the cMCI class, as those sequences tend to deviate from the stabilized values of sMCI.

Furthermore, there is one more particularity in these results that should be noted. Specificity tends to have the smallest values in all datasets, probably due to overfitting, as seen before. However, this is not the case in Table 4.13: the variance metric gives us good specificity results, even in the 5-year window. Baseline specificity for the 5-year window is 0.73, contrasting with the 0.89 of the variance dataset.

On the other hand, sensitivity values have the opposite tendency to specificity and decrease as the size of the time window increases. AUC values are close to those obtained at the baseline, showing slight increases in the 3- and 4-year windows and a slight decrease in the 5-year window (baseline: 0.89 vs. variance: 0.87).

Other statistics-based summarization datasets could be built by calculating the median and mode of the temporal sequences. However, as most temporal sequences in this dataset contain only two values, median and mode attributes do not add useful information: the median ends up having the same values as the arithmetic mean.


4.3 Hidden Markov Models: Preliminary Results

Hidden Markov Models made it possible to create evaluation models with datasets comprising temporal information. This information is compiled into relational attributes that the HMM algorithm interprets as temporal relations, classifying each instance according to this information.

Independent datasets used with Naive Bayes have the advantage of being easy to deal with, and all preprocessing and classification steps can be checked and validated.

However, the same does not happen when dealing with relational attributes. WEKA allows temporal mining classification with relational datasets through a preprocessing filter that converts instances sharing the same nominal ID into a single multi-instance containing all the information from those instances. However, these instances have to be presented in a very specific format, which is not well documented in the WEKA documentation and literature. Instances could not comprise missing values and had to be normalized. All features had to be nominal: the classification step could not be carried out with numeric datasets (despite this being said to be possible in the HMM documentation).
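For reference, a sketch of the relational (multi-instance) ARFF layout we had to target is shown below. The attribute names and values are hypothetical; the point is the overall structure, with a nominal bag ID, a relational attribute holding one row per timepoint, and the class label:

```
@relation mci-followup
@attribute patient_id {p1,p2}
@attribute bag relational
  @attribute mmse numeric
  @attribute word_recall numeric
@end bag
@attribute class {sMCI,cMCI}

@data
p1,"28,10\n26,8",cMCI
p2,"29,11\n29,11",sMCI
```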

There are various setbacks that arise from the fact that our datasets have a small number of instances. However, Naive Bayes could always classify the dataset and output the classification model and evaluation metrics. The same did not happen with the HMM package for WEKA.

The main problem underlying all these facts was that the issues encountered were not identifiable through the error messages, as these were extremely vague. Whenever an error message appeared, the dataset had to be changed. All the changes made are accounted for and described in Section 3.3.

Usable datasets were obtained for all three time windows. With these datasets, two main parameterizations were carried out: changing the number of hidden states and changing the format of the transition probability matrix. This matrix could be full, where all attributes were unconstrained; diagonal, where no correlation was set between attributes; or spherical, where all attributes were considered to have the same variance.
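The difference between the three formats is simply how many free parameters each allows per hidden state (in Gaussian-emission HMM implementations this choice usually parameterizes the emission covariance). A minimal sketch with made-up values:

```python
d = 3  # number of attributes

# Full: every pairwise covariance is unconstrained.
full = [[1.0, 0.3, 0.1],
        [0.3, 2.0, 0.4],
        [0.1, 0.4, 1.5]]

# Diagonal: off-diagonal entries fixed at zero (no correlation between attributes).
diagonal = [[1.0, 0.0, 0.0],
            [0.0, 2.0, 0.0],
            [0.0, 0.0, 1.5]]

# Spherical: one shared variance for all attributes.
shared = 1.2
spherical = [[shared if i == j else 0.0 for j in range(d)] for i in range(d)]

def free_params(kind, d):
    """Free covariance parameters per hidden state for each matrix format."""
    return {"full": d * (d + 1) // 2, "diagonal": d, "spherical": 1}[kind]
```

With our small number of instances, the single spherical parameter is the cheapest to estimate, which may be one reason why the spherical models behave better overall.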

Preliminary results were obtained and their evaluation metrics are shown in Tables 4.14 to 4.17.

Table 4.14: Accuracy values for the Hidden Markov Classification Model.


Table 4.15: AUC values for the Hidden Markov Classification Model.

Table 4.16: Specificity values for the Hidden Markov Classification Model.

Table 4.17: Sensitivity values for the Hidden Markov Classification Model.

An overall look at these values shows that they can compete with the independent dataset models, as their accuracy, AUC, sensitivity and specificity fall within the same ranges. However, it is important to notice that, due to the inability to build a classifier with SMOTE that evaluates only the real instances, these values represent evaluation metrics for the real dataset together with its synthetic instances. Even the balanced dataset (5-year window) includes synthetic instances, as the model did not allow the original dataset to be classified; hence the datasets discussed here have more patients than the independent datasets.

Considering the different transition probability matrices, spherical ones (where attributes are assumed to have the same variance) result in better models. This can be seen in all metrics except specificity, indicating that these models do not behave well when facing negative instances.

Regarding the number of hidden states, models did not show better results as this number increased. Instead, it is possible to see a plateau in these values, sometimes with slight variations.

HMM classification for WEKA is not yet a mature algorithm. For this reason, the results presented here may be affected by underlying mechanisms that still need adjustments in order to output meaningful results.


5 Conclusions and Future Work


Although there is no effective course of action following an AD diagnosis, an early prognosis is of utmost importance. An AD prognosis has a deep impact on the patient and their caregivers, be it a social, economical or emotional impact. This thesis yields an exploratory study on real-world data, questioning and examining new approaches to deal with neuropsychological data of a group of elderly and non-demented subjects. With these new approaches we aim to include temporal information that is often disregarded in other studies, where the outcome of the patient's progression is forecasted using only information from the first medical appointment. This inclusion of temporal information tries to mimic the analysis made by an MD, where the evolution of the neuropsychological test scores is appraised. However, improving the prognosis classification is not a trivial task. Most data mining tasks comprise algorithms that consider all features to be independent, meaning there is no relation between them. Temporal information has constraints and must be processed according to them. Two temporal points with neuropsychological information are neither independent nor commutative, and there is information far beyond the values of the neuropsychological tests. It is this information we try to collect and process in the most helpful way in order to obtain better results. This prognosis improvement contributes to a better understanding of the disease's underlying pathological mechanisms and consequently encourages advances in the prognosis and delivery of medical care, reducing the disease's impact on the patient, the respective caregivers and even society.

In this work, we tried to reach for a better prognosis with two lines of work. The first one consists in developing a temporal preprocessing workflow to compile temporal information and provide it to an independent classifier. The second one comprises a temporal mining workflow relying on the use of Hidden Markov Models, whose results are preliminary due to the problems raised when using the approach in WEKA. The first approach consists of a series of data transformation and cleaning techniques that alter the provided database so that it has the most beneficial format for the classification task. The focus of these alterations is the creation of learning examples and the creation of temporal features. Learning examples are developed according to time windows that restrict the progression to a known time frame, in our case 3, 4 and 5 years. Temporal features are obtained using summarization techniques, where new features are created containing information from the feature's temporal progression. Temporal features were created according to progression rates, temporal patterns and statistical summarizations.
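The learning-example construction described above can be sketched as follows; the function and field names are hypothetical, and the real preprocessing is the one described in Chapter 3:

```python
def make_learning_example(visits, window_years):
    """Pick the (t1, t2) pair for one patient under a given time window.

    visits: list of (years_since_baseline, scores) tuples, sorted by time.
    t1 is the baseline visit; t2 is the last follow-up that still falls
    inside the window. Returns None when no follow-up fits the window.
    """
    t1 = visits[0]
    in_window = [v for v in visits[1:] if v[0] <= window_years]
    if not in_window:
        return None  # patient cannot form a learning example for this window
    t2 = in_window[-1]  # closest visit to the end of the window
    return t1, t2

visits = [(0.0, {"mmse": 28}), (1.5, {"mmse": 27}), (4.2, {"mmse": 23})]
pair = make_learning_example(visits, window_years=3)  # picks the 0.0 and 1.5 visits
```

Note how widening the window to 5 years would select the 4.2-year visit instead, which is the aperiodicity issue discussed later in this chapter.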

Part of the preprocessing workflow was the feature selection task. This task should be done with a filtered classifier to prevent the subset of features from being overfitted to the dataset, as the filtered classifier applies feature selection only to the cross-validation's training folds. In this work, to be able to output and interpret the feature selection results, this was not done and feature selection was applied beforehand to the whole dataset.

Feature selection results showed us the most frequently chosen neuropsychological test features, indicating those are the most useful, as they were chosen across all datasets, temporal features and time windows. As future work, feature selection could be applied with the filtered classifier to compare results and see if there was in fact an overfitting situation. Throughout this work, feature selection proved to be an essential tool when dealing with real-world data. Not only do the chosen subsets make models more interpretable, but they also reduce the high variance of the dataset and remove useless features, reducing the percentage of missing values in the whole dataset. Regarding missing values, as they can carry different meanings, these meanings should be interpreted and introduced into the database.

There are general problems that affect all databases and consequently the models' evaluation metrics: imbalanced data, a high number of features and a reduced number of instances. Imbalanced data is a characteristic of real-world data, especially medical databases, as the positive cases (conversion to AD) are rarer than the negative ones, especially in short time frames. This can cause the model to overfit towards the rare class, as the model might learn specific rules of the positive instances in the dataset instead of learning rules for the positive class. The large number of features was due to the creation of temporal features: for every pair of timepoints, a temporal feature was created holding temporal information about those two timepoints. The high number of features can lead to high variance of the dataset. Moreover, adding to this high-variance problem, there was a reduced number of instances, caused by the preprocessing methods applied. One way to prevent this would be to eliminate features at the starting point. Tests could be carried out using only Z-score features or only one feature per test. Features could be grouped by cognitive domains and chosen among them, or tests could be carried out with only the features of one cognitive domain, exploring the best results. All these suggestions decrease the number of features and might lead to interesting and maybe better results, especially considering that increasing the number of instances is a time-expensive task. However, the low number of instances did not allow us to divide the dataset into a training set and a test set, which would give more confidence in the results, as models might overfit small training sets. Test sets would help validate these results and make them more realistic, as they present true unseen instances to the model.
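The class-imbalance handling used in this work relies on SMOTE [41]. A plain-Python sketch of the idea (not WEKA's implementation) interpolates between a minority instance and one of its nearest minority neighbours:

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """SMOTE-style oversampling sketch: each synthetic example lies on the
    segment between a minority instance and one of its k nearest minority
    neighbours. Assumes at least two minority instances."""
    rng = random.Random(seed)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8)]
new_points = smote_like(minority, n_new=2)  # points between existing neighbours
```

Because every synthetic point is an interpolation, it stays inside the region already covered by the real minority instances, which is why evaluating on such points inflates the metrics, as discussed for the HMM results.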

The first dataset created using the new learning examples showed us an important conclusion: more information matters, even as independent features. This was an excellent starting point, especially considering that these learning examples contained features from the two timepoints evenly and gave rise to better results than using only the second timepoint. The second timepoint gave better results than using the first one, but let us not forget that the second timepoint is very close to the point where the patient converts to AD, making it easy to detect these instances and tell them apart from the negative ones. In the future, more timepoints can be added to the learning examples and other preprocessing techniques can be added to this step, for example those resulting from unsupervised discretization. This good start opens a path to create temporal features and study the best ones to supply to the classification model, and the baseline was set with these t1t2 learning examples.

Progression features had better results in a discrete dataset, leading to an AUC of 0.97 in the 5-year window with the PF÷ feature. Discretization helped the NB algorithm, as it works with discrete values by default. Considering the 5-year window, neuropsychological values have a wider range and are better distributed among the discrete bins. Results get worse than the baseline for the datasets created without feature selection and for those comprising only temporal progression features, showing AUC values of 0.70 and 0.60, respectively. Excluding the results obtained with the discrete model, progression features lead to better results on the 3-year time window datasets, getting worse as the time frame is increased. Different progression features originate different models with different behaviors. PF− defines the slope of the line created by the two timepoints. If this slope is very accentuated, meaning a great change in the values, it could indicate a fast decrease in the patient's performance. Intuitively this would be a good feature to add to the model. However, the results did not agree with this reasoning: PF− was found to be the worst progression feature, accompanied by PF÷. Not only did these models output bad results, but the features calculated with these metrics were also poorly chosen. The former might be caused by the latter; however, the feature selection search algorithm creates the best local subset of features, building up to the best global subset. The best model turned out to be PF×, as it makes slightly different changes in the timepoint values more visible, and this fact might lead to better results.
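The exact formulas for the progression features are given in the preprocessing chapter; assuming the natural readings suggested above (PF− as the slope through the two scores, PF÷ as their ratio, PF× as their product), a minimal sketch is:

```python
def progression_features(v1, v2, dt):
    """Progression features for one test over a pair of timepoints.

    The definitions below are illustrative assumptions, not necessarily
    the thesis' exact formulas: PF- is the slope of the line through the
    two scores, PF/ their ratio and PFx their product.
    """
    return {
        "PF-": (v2 - v1) / dt,  # steep negative slope = fast decline
        "PF/": v2 / v1,         # below 1 when the score worsens
        "PFx": v2 * v1,         # amplifies small joint changes
    }

feats = progression_features(v1=28.0, v2=21.0, dt=2.0)
# feats["PF-"] == -3.5, feats["PF/"] == 0.75, feats["PFx"] == 588.0
```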

Another characteristic of a real-world database, namely the one used in this work, is the lack of periodic timepoints. Patients do not have the same spacing between timepoints, and this has consequences for the results obtained. First, the learning examples were built according to time windows, comprising two timepoints within them: the first sets the starting point and the second is the closest to the end of the window. However, considering a 5-year time window, the closest point to the end may be 2, 3, 4 or close to 5 years after the start point, creating a very heterogeneous group. Separating learning examples by visit frequency was not possible due to the small number of instances in the datasets. In the future, considering datasets with more instances, it would be interesting to build the temporal pattern attribute only from timepoints with a similar temporal frequency. Hence, the temporal pattern attribute had to be built on this aperiodic set, and three models were tested, even under far from perfect conditions. Within these models, we could see that feature selection continued to be essential. Using all timepoints is still not an answer to the aperiodicity problem, as it produces different strings for similar progressions, making this dataset very difficult for a model to learn. As future work, other alphabets could be used to create this temporal pattern attribute and make it more complete. The CAPSUL alphabet [43], for instance, comprising more nominal values (up, Up, down, Down, zero and stable), could be a good option. To try to minimize the aperiodicity problem, nominal values could be assigned according to the spacing of the timepoints, differentiating a value that is stable for 2 years from one that is stable for 5 years.
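A richer alphabet of this kind can be sketched as below. The thresholds are hypothetical, and this is a simplified five-symbol variant; the full CAPSUL alphabet [43] would additionally distinguish zero from stable:

```python
def encode(seq, small=1.0, large=3.0):
    """Encode a score sequence with a CAPSUL-like alphabet.

    Thresholds `small` and `large` are made-up illustration values:
    'u'/'U' mark small/large increases, 'd'/'D' small/large decreases
    and 's' stable steps, so similar progressions map to similar strings.
    """
    symbols = []
    for a, b in zip(seq, seq[1:]):
        delta = b - a
        if delta >= large:
            symbols.append("U")
        elif delta >= small:
            symbols.append("u")
        elif delta <= -large:
            symbols.append("D")
        elif delta <= -small:
            symbols.append("d")
        else:
            symbols.append("s")
    return "".join(symbols)

pattern = encode([28, 27.5, 26, 21])  # -> "sdD"
```

Spacing-aware symbols could then be obtained by dividing each delta by the elapsed time before thresholding, so that a 2-year plateau and a 5-year plateau encode differently.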

Lastly, statistical summarization models were obtained through the calculation of different means and the variance. All these models had their best results for the 3-year window. This is easy to understand, as a mean represents a sequence of numbers within a small range better than a sequence containing outlier values. However, outlier values are one of the things models should look out for, as they represent behaviors different from the usual ones. Therefore, calculating the mean value does not work well in larger sequences, as it suppresses outstanding values, yet it was a good feature to add to the small time window datasets. The variance was calculated to correct this suppression and see how distant the neuropsychological test scores are from the mean. Yet, as the results show, this did not translate into a good model, and this statistical measure should be further explored. As future work, other summarization techniques could also be explored, such as fractal-dimension and run-length-based summarizations [45]. However, these kinds of summarizations rely on a larger number of timepoints than the ones we have in our dataset.

Classification using HMM models yielded very preliminary results, as the environment used did not allow further exploration. Datasets were often unable to be classified and the error messages were not helpful. Despite all this, the models show promising results, and the CCC dataset should be tried out on other classifiers, not only temporal ones such as Dynamic Bayesian Networks, but also independent ones such as Random Forests, Support Vector Machines, Neural Networks and others. A different preprocessing workflow for the HMM methods could be designed in order to be more suitable for the temporal mining pipeline. Considering feature selection, this method was completely different when applied to an independent dataset and to a relational one. As future work, this line of work could be explored, combining different feature selection techniques and temporal algorithms to obtain better results. There are other temporal mining techniques that could be explored, for example using Dynamic Bayesian Networks with multivariate time series and comparing the results with the HMM ones. These temporal models could also be used as interpolation models, where the value of the next timepoint would be predicted. The goal of this prediction would be to create a future instance and, from that, use an independent classifier to resolve the instance into the MCI and AD classes, as a prognosis model.

As future work, this study could have an additional goal and answer a different clinical problem: what is the best time interval between medical appointments? The CCC database is an aperiodic dataset, where timepoints do not have a fixed frequency. This was mainly due to external factors, as we are dealing with the patients' availability for medical appointments. However, different datasets could be built with patients having the same time interval between appointments, to explore time frames for the AD progression. Does a model built with information from patients who have test values every 6 months have a higher predictive power than a model built with information from patients who have test values every 12 months? This could also help identify different points in the disease progression as, perhaps, in an early stage the spacing between medical appointments does not have to be as tight as in a later stage. The current learning example creation process tells us whether a patient is predicted to convert to AD in a 3-, 4- or 5-year time frame. However, an ensemble model could be created to try to predict the conversion time, where a test set or a single instance would be handed over to this ensemble model and, considering the different probabilities output by the NB models, the ensemble would give the time of conversion.


Bibliography

[1] World Health Organization. ”World health statistics 2016: Monitoring health for the SDGs, sus-

tainable development goals.” (2016).

[2] Prince, M., M. Prina, and M. Guerchet. "World Alzheimer Report 2013: Journey of caring. An analysis of long-term care for dementia." London: Alzheimer's Disease International (2013).

[3] Tortora, G. J., and B. Derrickson. ”Introduction to the human body: the essentials of anatomy

and physiology, ed 9, Hoboken, NJ, 2011.”

[4] Prince, Martin, et al. "World Alzheimer report 2016: improving healthcare for people living with

dementia: coverage, quality and costs now and in the future.” (2016).

[5] Ewers, Michael, et al. ”Prediction of conversion from mild cognitive impairment to Alzheimer’s

disease dementia based upon biomarkers and neuropsychological test performance.” Neurobiology

of aging 33.7 (2012): 1203-1214.

[6] Ritchie, K., Artero, S., Touchon, J. "Classification criteria for mild cognitive impairment: a population-based validation study." Neurology 56 (2001): 37-42.

[7] Maroco, João, et al. "Data mining methods in the prediction of Dementia: A real-data com-

parison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression,

neural networks, support vector machines, classification trees and random forests.” BMC research

notes 4.1 (2011): 299.

[8] Hall, John E. Guyton and Hall textbook of medical physiology. Elsevier Health Sciences, 2015.

[9] Portet F, Ousset PJ, Visser PJ, Frisoni GB, Nobili F, Scheltens P, Vellas B, Touchon J; MCI Work-

ing Group of the European Consortium on Alzheimer’s Disease (EADC): Mild cognitive impairment

(MCI) in medical practice: a critical review of the concept and new diagnostic procedure. Report of

the MCI Working Group of the European Consortium on Alzheimer’s Disease. J Neurol Neurosurg

Psychiatry 2006;77:714-718.

[10] Lee, Sei J., et al. ”A clinical index to predict progression from mild cognitive impairment to

dementia due to Alzheimer’s disease.” PloS one 9.12 (2014): e113535.

[11] Bennett DA, Wilson RS, Schneider JA, Evans DA, Beckett LA, et al. Natural history of mild

cognitive impairment in older persons. Neurology 59 (2002): 198-205.


[12] Brodaty H, Heffernan M, Kochan NA, Draper B, Trollor JN, et al. Mild cognitive impairment in

a community sample: the Sydney Memory and Ageing Study. Alzheimers Dement 9 (2013): 310-317.e311.

[13] Ganguli M, Snitz BE, Saxton JA, Chang CC, Lee CW, et al. (2011) Outcomes of mild cognitive

impairment by definition: a population study. Arch Neurol 68: 761-767.

[14] Petersen, Ronald C. ”Mild cognitive impairment as a diagnostic entity.” Journal of internal

medicine 256.3 (2004): 183-194.

[15] Petersen, Ronald C., et al. "Mild cognitive impairment: a concept in evolution." Journal of

internal medicine 275.3 (2014): 214-228.

[16] Fisk JD, Merry HR, Rockwood K. Variations in case definition affect prevalence but not out-

comes of mild cognitive impairment. Neurology 61 (2003): 1179-84.

[17] Chapman, Robert M., et al. ”Predicting conversion from mild cognitive impairment to Alzheimer’s

disease using neuropsychological tests and multivariate methods.” Journal of clinical and experimen-

tal neuropsychology 33.2 (2011): 187-199.

[18] Hinrichs, Chris, et al. ”Predictive markers for AD in a multi-modality framework: an analysis

of MCI progression in the ADNI population.” Neuroimage 55.2 (2011): 574-589.

[19] Langbaum, Jessica B., et al. ”An empirically derived composite cognitive test score with

improved power to track and evaluate treatments for preclinical Alzheimer’s disease.” Alzheimer’s &

Dementia 10.6 (2014): 666-674.

[20] The National Institute on Aging and Reagan Institute Working Group on Diagnostic Criteria

for the Neuropathological Assessment of Alzheimer’s Disease. Consensus Recommendations for the

Postmortem Diagnosis of Alzheimer's Disease: Neurobiol Aging. (1997); S1-S2.

[21] Lim, Alfredo, et al. "Clinico-neuropathological correlation of Alzheimer's disease in a community-based case series." Journal of the American Geriatrics Society 47.5 (1999): 564-569.

[22] Selkoe, Dennis J. ”Alzheimer’s disease: genes, proteins, and therapy.” Physiological reviews

81.2 (2001): 741-766.

[23] Forman, Mark S., John Q. Trojanowski, and Virginia M.Y. Lee. "Neurodegenerative diseases: a decade of discoveries paves the way for therapeutic breakthroughs." Nature medicine 10.10 (2004): 1055-1063.

[24] Chintamaneni, Meena, and Manju Bhaskar. ”Biomarkers in Alzheimer’s disease: a review.”

ISRN pharmacology 2012 (2012).

[25] Pooler, Amy M., Wendy Noble, and Diane P. Hanger. ”A role for tau at the synapse in

Alzheimer’s disease pathogenesis.” Neuropharmacology 76 (2014): 1-8.

[26] Hardy, John, and Dennis J. Selkoe. ”The amyloid hypothesis of Alzheimer’s disease: progress

and problems on the road to therapeutics.” Science 297.5580 (2002): 353-356.

[27] O'Brien, Richard J., and Philip C. Wong. "Amyloid precursor protein processing and Alzheimer's disease." Annual review of neuroscience 34 (2011): 185.

[28] Shaw, Leslie M., et al. ”Biomarkers of neurodegeneration for diagnosis and monitoring thera-

peutics.” Nature reviews Drug discovery 6.4 (2007): 295-303.

[29] Jack Jr, Clifford R., et al. "Update on hypothetical model of Alzheimer's disease biomarkers."

Lancet neurology 12.2 (2013): 207.

[30] Craig-Schapiro, Rebecca, Anne M. Fagan, and David M. Holtzman. ”Biomarkers of Alzheimer’s

disease.” Neurobiology of disease 35.2 (2009): 128-140.

[31] G. McKhann, D. Drachman, M. Folstein, R. Katzman, D. Price, and E. M. Stadlan, ”Clinical

diagnosis of Alzheimer’s disease: report of the NINCDS-ADRDA Work Group under the auspices of

Department of Health and Human Services Task Force on Alzheimer's Disease," Neurology, vol. 34,

no. 7, pp. 939-944, 1984.

[32] American Psychiatric Association: DSM-IV-TR (4th Ed, text revision). APA, Washington DC,

2000.

[33] Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Else-

vier, 2011.

[34] Vercellis, C., Business Intelligence: Data Mining and Optimization for Decision Making, John

Wiley & Sons, Ltd, 2009.

[35] Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth. ”From data mining to knowl-

edge discovery in databases.” AI magazine 17.3 (1996): 37.

[36] Witten, Ian H., and Eibe Frank. Data Mining: Practical machine learning tools and techniques.

Morgan Kaufmann, 2005.


[37] Farhangfar, Alireza, Lukasz Kurgan, and Jennifer Dy. ”Impact of imputation of missing values

on classification error for discrete data.” Pattern Recognition 41.12 (2008): 3692-3705.

[38] Lavrac, Nada. ”Selected techniques for data mining in medicine.” Artificial intelligence in

medicine 16.1 (1999): 3-23.

[39] Chandrashekar, Girish, and Ferat Sahin. ”A survey on feature selection methods.” Computers

& Electrical Engineering 40.1 (2014): 16-28.

[40] Nunes, Cecília M. C. B. de S., Learning from Imbalanced Neuropsychological Data (2012), Master's degree, Instituto Superior Técnico.

[41] Chawla, Nitesh V., et al. ”SMOTE: synthetic minority over-sampling technique.” Journal of

artificial intelligence research 16 (2002): 321-357.

[42] Chawla, Nitesh V. ”Data mining for imbalanced datasets: An overview.” Data mining and

knowledge discovery handbook. Springer US, 2005. 853-867.

[43] Antunes, Cláudia M., and Arlindo L. Oliveira. "Temporal data mining: An overview." KDD work-

shop on temporal data mining. Vol. 1. 2001.

[44] Laxman, Srivatsan, and P. Shanti Sastry. ”A survey of temporal data mining.” Sadhana 31.2

(2006): 173-198.

[45] Mitsa, Theophano. Temporal data mining. CRC Press, 2010.

[46] Fu, Tak-chung. ”A review on time series data mining.” Engineering Applications of Artificial

Intelligence 24.1 (2011): 164-181.

[47] Esling, Philippe, and Carlos Agon. ”Time-series data mining.” ACM Computing Surveys

(CSUR) 45.1 (2012): 12.

[48] Fu, Tak-chung. ”A review on time series data mining.” Engineering Applications of Artificial

Intelligence 24.1 (2011): 164-181.

[49] Berndt, Donald J., and James Clifford. ”Using Dynamic Time Warping to Find Patterns in

Time Series.” KDD workshop. Vol. 10. No. 16. 1994.

[50] Bilmes, Jeff A. "What HMMs can do." IEICE TRANSACTIONS on Information and Systems 89.3 (2006): 869-891.

[51] Bengio, Yoshua. ”Markovian models for sequential data.” Neural computing surveys 2.1049

(1999): 129-162.

[52] Rabiner, Lawrence, and B. Juang. ”An introduction to hidden Markov models.” ieee assp mag-

azine 3.1 (1986): 4-16.

[53] Blunsom, Phil. ”Hidden markov models.” Lecture notes, August 15 (2004): 18-19.

[54] Eddy, Sean R. ”Hidden markov models.” Current opinion in structural biology 6.3 (1996): 361-

365.

[55] Rabiner, Lawrence R. ”A tutorial on hidden Markov models and selected applications in

speech recognition.” Proceedings of the IEEE 77.2 (1989): 257-286.

[56] Barnes, Deborah E., et al. ”A point-based tool to predict conversion from mild cognitive im-

pairment to probable Alzheimer’s disease.” Alzheimer’s & Dementia 10.6 (2014): 646-655.

[57] Carreiro, André V., et al. "Predicting Non-Invasive Ventilation in ALS Patients using Time Win-

dows.”

[58] Cabral, Carlos, et al. ”Predicting conversion from MCI to AD with FDG-PET brain images at

different prodromal stages.” Computers in biology and medicine 58 (2015): 101-109.

[59] Yu, Hong-mei, et al. "Multi-state Markov model in outcome of mild cognitive impairments

among community elderly residents in Mainland China.” International Psychogeriatrics 25.05 (2013):

797-804.

[60] Lemos, Luis M. L., A data mining approach to predict conversion from mild cognitive impairment to Alzheimer's Disease (2012), Master's degree, Instituto Superior Técnico.

[61] Ferreira, Andreia L. D. F., Predicting the Conversion from Mild Cognitive Impairment to Alzheimer's Disease using Evolution Patterns (2014), Master's degree, Instituto Superior Técnico.

[62] K. Yang, H. Yoon, and C. Shahabi, "CLeVer: A Feature Subset Selection Technique for Multivariate Time Series," pp. 516-522, 2005.

[63] Hall, Mark, et al. ”The WEKA data mining software: an update.” ACM SIGKDD explorations

newsletter 11.1 (2009): 10-18.


A Complete List of Features


Table A.1: Complete List of Features


B Feature Selection: Complementary Information


B.1 Dataset without temporal progression attribute

Table B.1: Set of Features for the dataset without temporal progression attribute and 3-year window

Table B.2: Set of Features for the dataset without temporal progression attribute and 4-year window

Table B.3: Set of Features for the dataset without temporal progression attribute and 5-year window


B.2 Progression Feature Datasets

Table B.4: Set of Features for Progression Feature datasets and 3-year window

Table B.5: Set of Features for Progression Feature datasets and 4-year window


Table B.6: Set of Features for Progression Feature datasets and 5-year window

B.3 Unsupervised Discretization

Table B.7: Set of Features for Progression Feature datasets with Unsupervised Discretization and 3-year window


Table B.8: Set of Features for Progression Feature datasets with Unsupervised Discretization and 4-year window

Table B.9: Set of Features for Progression Feature datasets with Unsupervised Discretization and 5-year window


B.4 Only Progression Features

Table B.10: Set of Features for Progression Feature datasets with Only Progression Features and 3-year window

Table B.11: Set of Features for Progression Feature datasets with Only Progression Features and 4-year window


Table B.12: Set of Features for Progression Feature datasets with Only Progression Features and 5-year window

B.5 All timepoints

Table B.13: Set of Features for Progression Feature datasets with all timepoints and 3-year window


Table B.14: Set of Features for Progression Feature datasets with all timepoints and 4-year window

Table B.15: Set of Features for Progression Feature datasets with all timepoints and 5-year window


B.6 Temporal Pattern Datasets

Table B.16: Set of Features for Temporal Pattern datasets

Table B.17: Set of Features for Temporal Pattern datasets with all timepoints

B.7 Statistics-Based Summarization Datasets

Table B.18: Set of Features for Statistics-Based Summarization and 3-year window


Table B.19: Set of Features for Statistics-Based Summarization and 4-year window

Table B.20: Set of Features for Statistics-Based Summarization and 5-year window


C Results: Complete Tables



C.1 Independent Datasets

C.1.1 Dataset without temporal features

Table C.1: Cross Validation results for the datasets without temporal features.

C.1.2 Progression Feature Datasets

Table C.2: Cross Validation results for the datasets with Progression features.

Table C.3: Cross Validation results for the datasets with Progression features with unsupervised discretization.


Table C.4: Cross Validation results for the datasets with Progression features without Feature Selection.

Table C.5: Cross Validation results for the datasets with Progression features without the original values.

Table C.6: Cross Validation results for the datasets with Progression features with all timepoints.


C.1.3 Temporal Pattern Datasets

Table C.7: Cross Validation results for the datasets with temporal pattern features.

C.1.4 Statistics-Based Summarization Datasets

Table C.8: Cross Validation results for the datasets with statistics-based features.
