Outlier Detection in Survival Analysis
João Diogo Pinto
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Prof. Alexandra Sofia Martins de Carvalho
Prof. Susana de Almeida Mendes Vinga Martins
Examination Committee
Chairperson: Prof. Nuno Cavaco Gomes Horta
Supervisor: Prof. Susana de Almeida Mendes Vinga Martins
Members of the Committee: Nuno Luís Barbosa Morais
May 2015
To my parents
Acknowledgments
First, I would like to thank my supervisors, Prof. Alexandra Carvalho and Prof. Susana Vinga, for their
continuous attention and support; they gave me the freedom to do research and kept me on track when
needed. My gratitude also goes to the postdoctoral and PhD students at CSI/IDMEC and IT for their support
and useful feedback, with special thanks to André Veríssimo, whose help was essential to produce the
simulation results. I would like to give my sincere thanks to the medical team from Hospital de Santa Maria
and Instituto de Medicina Molecular for their support and availability, in particular to Prof.
Luís Costa and Dr. Irina Duarte. I would also like to express my gratitude to Fundação para a Ciência e
Tecnologia, which supported this work under the CancerSys project (EXPL/EMS-SIS/1954/2013).
I would like to thank my dear girlfriend Luegi for all her patience and understanding, and for being such a
great source of inspiration. I would also like to give special thanks to my goddaughter Benedita, my sister
Catarina and her husband Pedro; even abroad, you are a daily inspiration to me. I would like to give my sincere
thanks to my grandparents, who were always interested and supportive. Special thanks go to my parents;
this thesis is dedicated to them.
Abstract
The goal of outlier detection methods is to identify observations that are dissimilar to or inconsistent with the
rest of the data. The nature of what constitutes an outlier is subjective and commonly depends on the application.
Outlier detection is a fundamental task in many fields, such as financial fraud detection, computer network
intrusion detection, and the diagnosis of clinical diseases. Outliers can have extreme influence on data
analysis, and for this reason their presence must be taken into account. Additionally, outliers may be interesting
observations in themselves: they can provide insights about certain structures in the data or particular events
that occurred in the sample. In this thesis we propose three novel methods for outlier
detection in a survival context. The proposed methods are model-based and rely on the measurement of
the concordance c-index (Harrell et al., 1982). The first method, named One Step Deletion (OSD), relies on a
backward search for the subset of the k most outlying observations. The second method, named Bootstrap
Hypothesis Testing (BHT), is a stochastic method that obtains several measures of the concordance c-index
using the bootstrap resampling scheme (Efron, 1979). The observation under test is removed from the original
dataset and the concordance c-index is bootstrapped on the remaining data; using the resulting histogram
of concordances, a hypothesis test on the improvement of concordance is made. The stronger the tendency of the
concordance to improve when the observation under test is removed from the data, the more outlying that
observation is considered to be. The third method, named Dual Bootstrap Hypothesis Testing (DBHT), is an
extension of BHT in which two different bootstrap schemes are used: one where the observation under test is never
present in the bootstrap samples, and another where it is present at least once in every sample.
The more significant the difference between the two generated histograms, the more outlying we consider the
observation to be. The last two methods (BHT and DBHT) are single-step methods, meaning they output
an outlying score for each observation, while OSD just returns the set of the k most outlying observations, with
k given as a parameter. In the results chapter the merits of the proposed methods are assessed through a
comparative analysis with several existing methods. The performance is first assessed on a set of simulated
scenarios, and the methods are then applied to real clinical datasets. On the simulated scenarios, the DBHT method
outperformed the remaining methods in most cases. On the real clinical datasets, the predictive
ability of the Cox regression improved when a certain level of outliers was trimmed from the fit.
Keywords: outlier detection, survival analysis, model-based outlier detection, Cox proportional hazards,
bootstrap, robust estimation, concordance c-index.
Resumo
The goal of outlier detection methods is to identify individuals that show inconsistencies or extreme
differences relative to the other individuals in a sample; these are usually called outliers. The definition
of what constitutes an outlier is subjective and normally depends on the application at hand. Outlier
detection methods are applied in several areas: financial fraud detection, computer network intrusion
detection, and the diagnosis of patients. The presence of outliers in a sample can influence the analysis
disproportionately, so it normally has to be taken into account.
The work presented here proposes three new methods for outlier detection in survival data. The three
methods use a performance metric specific to survival analysis: the concordance c-index introduced by
Harrell et al. (1982). The first method, named One Step Deletion (OSD), performs a sequential search for the
subset that maximizes concordance; the k eliminated observations are considered the most outlying in the
sample. The second proposed method, named Bootstrap Hypothesis Testing (BHT), is based on the bootstrap
resampling scheme (Efron, 1979). Several bootstrap samples are generated from the dataset with the observation
under test excluded, and from these the histogram of the change in concordance relative to the concordance on
the original dataset is computed; the more the histogram shows values greater than zero, the more outlying the
observation under test is considered to be. The third method, Dual Bootstrap Hypothesis Testing (DBHT), is an
extension of BHT in which, instead of a single bootstrap, two opposing versions of the bootstrap are used: in the
first version the observation under test is absent from all generated samples, while in the dual version the
observation is included in every sample. The merit of the developed methods is assessed on a set of simulated
data, on which DBHT showed superior performance in most scenarios. On the real data, removing a certain level
of outliers was found to improve the predictive performance of the Cox model.
Keywords: outlier detection, survival analysis, model-based outlier detection,
Cox proportional hazards, concordance c-index, bootstrap, robust estimation.
Notation
$B$: number of generated bootstrap samples
BHT: Bootstrap Hypothesis Testing (proposed method)
DBHT: Dual Bootstraps Hypothesis Testing (proposed method)
$\delta_i$: event indicator for subject $i$
DEV: deviance residuals
DFB: DFBETAS
$h(t)$: hazard function
$H(t)$: cumulative hazard function
LD: Likelihood Displacement statistic
MART: Martingale residuals
OSD: One Step Deletion (proposed method)
$S(t)$: survival function
$T$: survival time
$T_i$: survival time for subject $i$
$T_i^C$: the follow-up time for individual $i$
$X_i$: covariates vector for subject $i$
Contents
1 Introduction
2 Survival Analysis
2.1 Censoring
2.2 Continuous-time Survival Function S(t)
2.3 Continuous-time Hazard Function h(t)
2.4 Discrete-time Survival Function
2.5 Discrete-time Hazard Function
2.6 Kaplan-Meier Estimator
2.7 Log-Rank Test
2.8 Cox proportional hazards
2.9 Other Survival Models
2.10 Performance Metrics for Survival Models
2.10.1 Somers' D
2.10.2 Harrell's concordance c index
2.10.3 Time-dependent ROC
2.11 Counting Processes Framework
3 Survival outlier detection and robust estimation
4 Proposed Methods
4.1 Motivation for the use of Bootstrapping
4.2 One Step Deletion
4.3 Bootstrap Hypothesis Test Outlier Detection
4.4 Dual Bootstraps Hypothesis Testing Outlier Detection
5 Results
5.1 Goals
5.2 Simulated Data
5.3 Worcester Heart Attack Study dataset
5.4 Bone Marrow Transplant dataset
5.5 CancerSys Dataset
6 Conclusions and Future Work
List of Tables
2.1 Example of data with censored observations ($\delta_i = 0$).
2.2 Example of the subject replication.
2.3 Example of coding time-dependent covariates.
3.1 Linear regression example: observations sorted by residuals.
4.1 Evolution of the OSD algorithm when applied to an example dataset.
4.2 Example of a BHT output, sorted by p values.
5.1 Outlier scenarios.
5.2 Average of TPR grouped by outlier scenarios.
5.3 Average of AUC grouped by outlier scenarios.
5.4 TPR by outlier scenario and outlier amount for the alternative methods.
5.5 Average TPR of the proposed methods grouped by outlier scenario and outlier amount k.
5.6 Average AUC of the alternative methods grouped by outlier scenario and outlier amount k.
5.7 Average AUC of proposed methods by outlier scenario and outlier amount k.
5.8 Average TPR of the alternative methods grouped by outlier scenario and level of censoring.
5.9 Average TPR of the proposed methods grouped by outlier scenario and level of censoring c.
5.10 Average AUC of the alternative methods grouped by outlier scenario and censoring amount c.
5.11 AUC of BHT and DBHT by scenario and censoring amount.
5.12 Average TPR of the alternative methods by outlier scenario and baseline hazard ($\lambda$, $\nu$).
5.13 Average TPR of the proposed methods by outlier scenario and baseline hazard ($\lambda$, $\nu$).
5.14 Average AUC of alternative methods by outlier scenario and baseline hazard type ($\lambda$, $\nu$).
5.15 Average AUC of proposed methods by outlier scenario and baseline hazard ($\lambda$, $\nu$).
5.16 Top-15 outliers detected by the methods on the WHAS100 dataset.
5.17 Cox model estimated with all WHAS observations.
5.18 WHAS: Cox estimation after 5% outlier trimming.
5.19 WHAS: Cox estimation after 10% outlier trimming.
5.20 Cox model on the WHAS data with 10% outlier trimming for alternative methods.
5.21 Cox model fit on the WHAS dataset with 10% outlier trimming, using the proposed methods.
5.22 Top-10% outliers detected by the methods on the BMT dataset.
5.23 Cox model estimation with all BMT data.
5.24 Cox model estimations with 5% outlier trimming using the alternative methods.
5.25 Cox model estimations with 5% outlier trimming using the proposed methods.
5.26 Cox model estimations with 10% outlier trimming using the alternative methods.
5.27 Cox model estimations with 10% outlier trimming using the proposed methods.
5.28 Top-15 outliers detected by the methods on the CSYS dataset.
5.29 Cox model fitted to all CSYS data.
5.30 Cox model fit with 5% outlier trimming using the alternative methods.
5.31 Cox model fit with 5% outlier trimming using the proposed methods.
5.32 Cox model with 10% outlier trimming for alternative methods.
5.33 Cox model with 10% outlier trimming for proposed methods.
5.34 Leave-one-out c-indexes for the BHT method.
5.35 Leave-one-out c-indexes for the DBHT procedure.
5.36 Leave-one-out c-indexes for the OSD procedure.
6.1 TPR of each method in the 12 outlier scenarios. c = 0.2; k = 5; $\lambda$ = 1; $\nu$ = 1.
6.2 AUC of each method in the 12 outlier scenarios. c = 0.2; k = 5; $\lambda$ = 1; $\nu$ = 1.
6.3 TPR of each method in the 12 outlier scenarios. c = 0.2; k = 10; $\lambda$ = 1; $\nu$ = 1.
6.4 AUC of each method in the 12 outlier scenarios. c = 0.2; k = 10; $\lambda$ = 1; $\nu$ = 1.
6.5 TPR of each method in the 12 outlier scenarios. c = 0.3; k = 5; $\lambda$ = 1; $\nu$ = 1.
6.6 AUC of each method in the 12 outlier scenarios. c = 0.3; k = 5; $\lambda$ = 1; $\nu$ = 1.
6.7 TPR of each method in the 12 outlier scenarios. c = 0.3; k = 10; $\lambda$ = 1; $\nu$ = 1.
6.8 AUC of each method in the 12 outlier scenarios. c = 0.3; k = 10; $\lambda$ = 1; $\nu$ = 1.
6.9 TPR of each method in the 12 outlier scenarios. c = 0.2; k = 5; $\lambda$ = 0.5; $\nu$ = 1.5.
6.10 AUC of each method in the 12 outlier scenarios. c = 0.2; k = 5; $\lambda$ = 0.5; $\nu$ = 1.5.
6.11 TPR of each method in the 12 outlier scenarios. c = 0.2; k = 10; $\lambda$ = 0.5; $\nu$ = 1.5.
6.12 AUC of each method in the 12 outlier scenarios. c = 0.2; k = 10; $\lambda$ = 0.5; $\nu$ = 1.5.
6.13 TPR of each method in the 12 outlier scenarios. c = 0.3; k = 5; $\lambda$ = 0.5; $\nu$ = 1.5.
6.14 AUC of each method in the 12 outlier scenarios. c = 0.3; k = 5; $\lambda$ = 0.5; $\nu$ = 1.5.
6.15 TPR of each method in the 12 outlier scenarios. c = 0.3; k = 10; $\lambda$ = 0.5; $\nu$ = 1.5.
6.16 AUC of each method in the 12 outlier scenarios. c = 0.3; k = 10; $\lambda$ = 0.5; $\nu$ = 1.5.
6.17 TPR of each method in the 12 outlier scenarios. c = 0.2; k = 5; $\lambda$ = 1.5; $\nu$ = 0.5.
6.18 AUC of each method in the 12 outlier scenarios. c = 0.2; k = 5; $\lambda$ = 1.5; $\nu$ = 0.5.
6.19 TPR of each method in the 12 outlier scenarios. c = 0.2; k = 10; $\lambda$ = 1.5; $\nu$ = 0.5.
6.20 AUC of each method in the 12 outlier scenarios. c = 0.2; k = 10; $\lambda$ = 1.5; $\nu$ = 0.5.
6.21 TPR of each method in the 12 outlier scenarios. c = 0.3; k = 5; $\lambda$ = 1.5; $\nu$ = 0.5.
6.22 AUC of each method in the 12 outlier scenarios. c = 0.3; k = 5; $\lambda$ = 1.5; $\nu$ = 0.5.
6.23 TPR of each method in the 12 outlier scenarios. c = 0.3; k = 10; $\lambda$ = 1.5; $\nu$ = 0.5.
6.24 AUC of each method in the 12 outlier scenarios. c = 0.3; k = 10; $\lambda$ = 1.5; $\nu$ = 0.5.
List of Figures
2.1 Example of left, right, and interval censoring.
2.2 The effect of changing the time scale on h(t).
2.3 Example of S(t) KM estimation.
2.4 Example of ROC(t).
3.1 Example of linear regression fit on a data set with outliers.
4.1 Bootstrapping a test statistic.
4.2 Bootlier multi-modal effect.
4.3 BHT: histograms for inlier and outlier.
4.4 DBHT: poison and antidote bootstraps for an outlier.
4.5 DBHT: poison and antidote bootstraps for an inlier.
5.1 A 2-D example of a general trend $\beta_G$ with examples of outlier sources.
5.2 Baseline hazards used in the simulation data.
5.3 Evolution of TPR with parameter B, DBHT in blue and BHT in red.
5.4 Evolution of AUC with parameter B, DBHT in blue and BHT in red.
Chapter 1
Introduction
Survival analysis is a body of statistical methods for studying time-to-event data. Its applications range
from the social sciences and industrial reliability to clinical studies. One of its main innovations was the
introduction of the proportional hazards model (Cox, 1972), also known as Cox regression. The simplicity
and semi-parametric nature of its estimation procedure made it one of the most used tools in survival
analysis, and the model of choice for studying the relationship between explanatory variables and the survival
time of individuals. Despite its popularity, the model has been shown to lack robustness: it has a
breakdown point of zero, meaning that the estimation can be compromised by a single corrupt observation
(Kalbfleisch and Prentice, 2011). To deal with this lack of robustness, survival-specific tools can be employed.
Several of them come from Cox regression diagnostics, such as martingale residuals and deviance residuals
(Therneau et al., 1990); others are not specific to the Cox model, such as the likelihood
displacement statistic (David Collett, 2003) and regression DFBETAS (Harrell, 2001). This thesis is concerned
with algorithms able to identify the k most likely outlying observations. Those k observations
can then be excluded prior to the estimation of a survival model, potentially increasing the robustness of
the Cox fitting process, or of the fitting of other survival models such as the Buckley-James regression (Buckley
and James, 1979), which is also known for its lack of robustness (Stare et al., 2000). Our approach to
survival outlier detection is model-based: what is evaluated is neither the explanatory variables
nor the survival outcomes of the subjects, but the relationship between them. The rationale is
that an extreme outcome can be regular once the explanatory variables are taken into account and,
conversely, an extreme value of a given explanatory variable can be regular
when an extreme outcome is also observed. We thus trade assumptions about the distributions of covariates
and outcomes for the assumption that the model is able to capture the relationship between covariates and
their respective survival outcomes.
Contribution
We propose three new methods to perform outlier detection in survival data. The methods are implemented
as an R package, which can be made available upon request. Some of the results presented in
this thesis were already published in an article that was nominated for the Best Paper Award at the international
conference BIOSTEC - BIOINFORMATICS 2015 in Lisbon: João Diogo Pinto, Alexandra M. Carvalho and
Susana Vinga, Outlier Detection in Survival Analysis based on the Concordance c-index, in Proceedings of
BIOINFORMATICS, pages 72-82, 2015.
Thesis Outline
Chapter 2 starts with a literature review covering the basic concepts in survival analysis and their
mathematical definitions; we then explore the Cox proportional hazards model, its features and estimation process,
and give an overview of alternative survival models. Chapter 3 concerns outlier detection and
robust estimation. Starting with a review of the general outlier detection literature, we then focus on outlier
detection in a survival context, including Cox residual analysis and methods that aim to increase the
robustness of Cox regression. In Chapter 4 we present our three proposed methods. In Chapter 5 we present the
experimental results, where the different methods were tested on simulated data and on three
real datasets from clinical studies.
Chapter 2
Survival Analysis
Survival analysis consists of a set of statistical and machine-learning procedures for the study of data
where the outcome of interest is the time until a given event occurs. The event of interest depends
on the application. In the medical field this event can be the death of a patient, organ rejection, relapse
or remission from disease; examples of such studies are (Crowley and Hu, 1977; Ojo et al., 2000). In the social
sciences, the time to rearrest of drug offenders can be studied using survival analysis techniques (Rocha, 2011).
Because the survival analysis literature is mostly applied to the clinical and biomedical fields, the time
until the event of interest is usually named survival time. In general, when undertaking a survival analysis
task we can consider three major goals (David G. Kleinbaum, Mitchel Klein, 2005):
1. Estimate survival distributions from the data.
2. Compare survival distributions between groups of patients.
3. Evaluate the relationship between the explanatory variables and respective survival time.
The first goal is important to characterize the survival time of the overall sample, in particular to obtain
statistics such as the mean and median survival times. The second aims at discovering statistically significant
differences in survival time among several groups. The third concerns the fact that in most applications
the event of interest is associated with an individual with its own particular characteristics. For instance, in
a clinical study the researchers obtain several biological measures from the patients in order to
associate biological phenomena with survival time. For convenience, we henceforth denote the survival time
by $T$, and very often the event of interest will be referred to as death. This is because most datasets
used for analysis, simulation and testing come from clinical studies, where in the majority of cases the
event of interest is death or relapse of disease. There is an observed survival time $T_i$ for each individual
$i$. Individuals are represented by a vector of explanatory variables, also called covariates. This vector will be
denoted by $X$, with $X_i$ being the covariate vector representing the $i$-th individual.
2.1 Censoring
It is mainly due to censoring that survival analysis is considered a field of its own: most statistical
procedures and machine learning algorithms are not designed to incorporate it.
Censoring is very frequent in survival data; it occurs when the exact failure time $T_i$ of an individual is unknown.
When a subject is censored, we only know that $T_i$ lies in a certain interval around the censoring time $T^C$, also
known as follow-up time. The follow-up time $T_i^C$ is the only observed time of the $i$-th individual; if there is
no censoring then $T_i^C = T_i$. More formally:
$$T_i \in [\,T_i^C - \Delta_{left},\; T_i^C + \Delta_{right}\,] \quad \text{with } \Delta_{left}, \Delta_{right} \ge 0.$$
Depending on the characteristics of this uncertainty interval, we can distinguish the uncensored case and three
categories of censoring:
$\Delta_{left} = 0 \wedge \Delta_{right} = 0$: the event time is uncensored;
$\Delta_{left} = 0 \wedge \Delta_{right} > 0$: the event time is right censored;
$\Delta_{left} > 0 \wedge \Delta_{right} = 0$: the event time is left censored;
$\Delta_{left} > 0 \wedge \Delta_{right} > 0$: the event time is interval censored.
Figure 2.1 illustrates the uncensored case and the three types of censoring described above; the
dotted lines represent the time interval in which the event might have occurred.
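The four cases above amount to a small decision rule on the left and right widths of the uncertainty interval. The following Python sketch is an illustration added here, not code from the thesis; the function name `classify_censoring` and the example widths are hypothetical.

```python
def classify_censoring(delta_left: float, delta_right: float) -> str:
    """Map the widths of the uncertainty interval around the follow-up time
    to one of the four cases listed above."""
    if delta_left < 0 or delta_right < 0:
        raise ValueError("interval widths must be non-negative")
    if delta_left == 0 and delta_right == 0:
        return "uncensored"
    if delta_left == 0:
        return "right censored"
    if delta_right == 0:
        return "left censored"
    return "interval censored"


# Hypothetical widths (in months) around a follow-up time.
examples = [(0.0, 0.0), (0.0, float("inf")), (6.0, 0.0), (1.5, 1.5)]
for left, right in examples:
    print(f"left={left}, right={right}: {classify_censoring(left, right)}")
```

An unbounded right width (`float("inf")`) is used for the right-censored example because, in that case, the event is only known to happen some time after the follow-up time.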
Figure 2.1: Example of left, right, and interval censoring. [Timeline over t with marks t1, t2 and t3, showing an uncensored event and left-, right-, and interval-censored events.]
Right censoring typically occurs for one of the following reasons: 1) a person experiences the event only
after the study has ended; 2) a person is lost to follow-up during the study; 3) the subject is removed from
the study by unrelated events. Left censoring may be caused by a subject entering a study with an unknown
disease onset time. For instance, if an examination to evaluate cancer recurrence gives a positive result, we
can only assume that the recurrence time is less than or equal to the time of the examination. To illustrate
interval censoring, we can return to the cancer example, this time with two examinations, one at 3 months
after surgery and another at 6 months: if the patient gets a negative result in the first examination and a positive
result in the second, we only know that the exact time of recurrence is between 3 and 6 months.
Right censoring is by far the most common type in clinical studies, and it will be the only type of censoring
considered in this work.
A typical data format for right-censored data adds an event indicator variable $\delta_i$ for
each individual $i$. When the individual experiences the event, $\delta_i = 1$; otherwise the individual is
right censored and $\delta_i = 0$. Table 2.1 shows data coded in this format. For individuals A and
C we know that the event occurred after 44.5 and 10 months, respectively; for individual B, whose
survival time was censored, we only know that it survived at least 30 months.

Table 2.1: Example of data with censored observations ($\delta_i = 0$).
Individual   $T_i^C$ (months)   $\delta_i$
A            44.5               1
B            30                 0
C            10                 1

Regarding right censoring, another kind of categorization can be made, concerning the design of the study and
the dependence between the existence of censoring and the individuals' survival times:
Type I censoring: a survival time $T_i$ is observed only if it is no larger than a pre-specified censoring time
$T_{study}$; otherwise we only know that the event happened after $T_{study}$.
Type II censoring: the study is stopped after a pre-specified number of events has been registered.
Random censoring: also called noninformative censoring; the survival time $T$ and the censoring time are
independent random variables.
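To make the link between Type I censoring and the event-indicator format concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the thesis implementation: the helper `type_one_censor`, the exponential lifetimes with a 24-month mean, and the 36-month study length are all hypothetical choices.

```python
import random


def type_one_censor(event_times, t_study):
    """Apply Type I censoring at t_study.

    Returns (follow_up_time, delta) pairs, with delta = 1 for an observed
    event and delta = 0 for a right-censored observation.
    """
    records = []
    for t in event_times:
        if t <= t_study:
            records.append((t, 1))        # event observed: follow-up time equals T_i
        else:
            records.append((t_study, 0))  # only known: event happens after t_study
    return records


rng = random.Random(0)  # fixed seed so the sketch is reproducible
true_times = [rng.expovariate(1 / 24) for _ in range(5)]  # hypothetical lifetimes (months)
observed = type_one_censor(true_times, t_study=36.0)
for follow_up, delta in observed:
    print(f"follow-up = {follow_up:6.2f} months, delta = {delta}")
```

Under this scheme every censored subject shares the same follow-up time `t_study`, which is what distinguishes Type I censoring from random censoring, where censoring times vary between subjects.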
2.2 Continuous-time Survival Function S(t)
Survival time is typically treated as a non-negative real random variable $T$. Instead of characterizing $T$ by
its distribution function $F(t) = \Pr(T < t)$, survival analysis focuses on the complement of $F(t)$, given
by $S(t) = \Pr(T \ge t) = 1 - F(t)$. $S(t)$ is called the survival function (David Collett, 2003), also known as
the survivor function (Lawless, 2003). As the name indicates, $S(t)$ gives the probability of an individual living
longer than a given time $t$. We may write:
$$S(t) = \Pr(T > t) = \int_t^{\infty} f(x)\,dx = 1 - F(t), \tag{2.1}$$
where $f(t)$ is the probability density function of $T$; since $T$ is continuous, $\Pr(T > t) = \Pr(T \ge t)$.
$S(t)$ has some properties worth noting:
$S(t): \mathbb{R}_0^+ \to [0, 1]$.
$S(t)$ is a monotone decreasing continuous function.
$S(0) = 1$ and $S(+\infty) = 0$.
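As a worked illustration of these properties (an example added here, not one taken from the cited references), assume an exponentially distributed survival time with a hypothetical rate $\lambda > 0$:

```latex
f(t) = \lambda e^{-\lambda t}, \qquad
F(t) = \int_0^{t} \lambda e^{-\lambda x}\,dx = 1 - e^{-\lambda t}, \qquad
S(t) = \int_t^{\infty} \lambda e^{-\lambda x}\,dx = e^{-\lambda t} = 1 - F(t).
```

In this case $S(0) = 1$, $S(t) \to 0$ as $t \to \infty$, and $S$ is continuous and strictly decreasing, in agreement with the three properties listed above.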
2.3 Continuous-time Hazard Function h(t)
Another fundamental concept in survival analysis is the hazard function $h(t)$, also called hazard
rate, force of mortality, conditional failure rate, or instantaneous death rate (David G. Kleinbaum, Mitchel
Klein, 2005; David Collett, 2003; Kalbfleisch and Prentice, 2011). Whereas $S(t)$
represents the probability of not failing until a given time $t$, $h(t)$ represents the failure rate at
time $t$ given that the individual has survived until $t$; the higher the values of $h(t)$, the shorter the survival
time tends to be. Because $h(t)$ expresses a rate rather than a probability, its values are non-negative but can exceed
unity. This failure rate can be thought of as the number of events occurring per unit time at
instant $t$ among the individuals surviving up to time $t$. Formally, $h(t)$ is defined by:
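\begin{align}
h(t) &= \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} \tag{2.2} \\
&= \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t,\; T \ge t)}{\Pr(T \ge t)\,\Delta t} \notag \\
&= \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t)}{\Pr(T \ge t)\,\Delta t} \notag \\
&= \lim_{\Delta t \to 0} \frac{\int_t^{t + \Delta t} f(x)\,dx}{\Pr(T \ge t)\,\Delta t} \notag \\
&= \lim_{\Delta t \to 0} \frac{\Delta t\, f(t)}{\Delta t\, S(t)}
\;\Rightarrow\; h(t) = \frac{f(t)}{S(t)} \tag{2.3}
\end{align}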
The meaning of $h(t)$ is not as intuitive as that of $S(t)$. (David G. Kleinbaum, Mitchel Klein, 2005) provide the
following conceptual interpretation: the hazard function $h(t)$ gives the instantaneous potential per unit time
for the event to occur, given that the individual has survived up to time $t$.
In (Lawless, 2003) the concept of instantaneous death rate is used, and the author notes that $h(t)\Delta t$ gives
the approximate probability for the event to occur in the interval $[t, t + \Delta t)$, given that the individual has
survived until $t$.
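The relation $h(t) = f(t)/S(t)$ and the approximation $h(t)\Delta t$ can be checked numerically. The sketch below is an illustration added here, assuming a Weibull lifetime with the parametrization $S(t) = \exp(-\lambda t^{\nu})$; the values `LAM` and `NU` and all function names are hypothetical.

```python
import math

LAM, NU = 0.5, 1.5  # hypothetical Weibull scale and shape (assumed parametrization)


def survival(t):
    """S(t) = exp(-LAM * t**NU)."""
    return math.exp(-LAM * t ** NU)


def density(t):
    """f(t) = -dS/dt for the survival function above."""
    return LAM * NU * t ** (NU - 1) * math.exp(-LAM * t ** NU)


def hazard(t):
    """h(t) = f(t) / S(t), as in (2.3)."""
    return density(t) / survival(t)


def conditional_failure_prob(t, dt):
    """Pr(t <= T < t + dt | T >= t), computed exactly from S."""
    return (survival(t) - survival(t + dt)) / survival(t)


t = 2.0
for dt in (0.5, 0.1, 0.01):
    exact = conditional_failure_prob(t, dt)
    approx = hazard(t) * dt
    print(f"dt={dt:<5} exact={exact:.5f}  h(t)*dt={approx:.5f}")
```

With these values the hazard at `t = 2.0` is larger than one, which illustrates that $h(t)$ is a rate and not a probability, while the conditional probability it approximates always stays below one.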
This function is particularly useful because it describes how the failure rate of the population evolves
over time. In many applications there may be clinical and statistical information that can be used to define
the hazard function, which in turn can help in selecting a lifetime distribution model. For example, in some
situations there are reasons to consider only a monotone increasing hazard function, which reflects a "wear
out" effect: a mechanical part is subject to an aging phenomenon that degrades it as time
passes. Some properties of $h(t)$ include:
It is a continuous function $h(t): \mathbb{R}_0^+ \to \mathbb{R}_0^+$.
It is generally not monotone.
Its scale depends on the time unit used.
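An important relation between $h(t)$ and $S(t)$ is that $S(t)$ can be obtained from a product of probabilities.
The probability of failing in an infinitesimal interval, given survival up to its start, is
$P(T \in [t, t + dt] \mid T > t) = h(t)\,dt$, so the probability of surviving that interval is $1 - h(t)\,dt$.
Multiplying these probabilities over all the infinitesimal intervals up to time $t$,
we can write $S(t)$ as the product of the probabilities of not dying in each of them (Cox, 1972):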
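\begin{align}
S(t) &= P(T \ge t) \tag{2.4} \\
&= \prod_{0}^{t} \bigl(1 - h(u)\,du\bigr) \tag{2.5} \\
&= \lim_{(\tau_{k+1} - \tau_k) \to 0} \; \prod_{\tau_k < t} \bigl[1 - h(\tau_k)(\tau_{k+1} - \tau_k)\bigr] \tag{2.6} \\
&= \exp\left(-\int_0^t h(u)\,du\right). \tag{2.7}
\end{align}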
From this we get:
$$h(t) = -\frac{d \log S(t)}{dt}.$$
The integral in the exponent of (2.7) is known as the cumulative hazard function, and it
is given by:
$$H(t) = \int_0^t h(x)\,dx. \tag{2.8}$$
As we can conclude from the definitions so far, the functions f(t), F (t), S(t), H(t) and h(t) contain the exact
same information about the distribution of the random variable T .
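As a numerical illustration of the step from 2.6 to 2.7, the following minimal sketch compares the product form with the exponential form. The Weibull hazard and its parameters are an arbitrary assumption made only for the example; they do not come from the text.

```python
import math

def weibull_hazard(t, shape=1.5, scale=2.0):
    """Hazard h(t) of a Weibull lifetime (illustrative choice, not from the text)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def survival_by_product(t, steps=100_000):
    """Approximate S(t) as prod_k [1 - h(tau_k) * (tau_{k+1} - tau_k)], as in eq. 2.6."""
    dt = t / steps
    s = 1.0
    for k in range(steps):
        s *= 1.0 - weibull_hazard(k * dt) * dt
    return s

def survival_by_integral(t, shape=1.5, scale=2.0):
    """S(t) = exp(-H(t)) as in eq. 2.7; for this Weibull, H(t) = (t / scale) ** shape."""
    return math.exp(-((t / scale) ** shape))

t = 3.0
approx = survival_by_product(t)
exact = survival_by_integral(t)
print(f"product form: {approx:.5f}   exp(-H(t)): {exact:.5f}")
```

The two values agree more closely as the number of steps grows, which is the limit taken in 2.6.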
2.4 Discrete-time Survival Function
Sometimes the individuals' lifetimes are grouped or quantized, or the available data is not appropriate for
a continuous interpretation of time; T then becomes a random variable taking values in a finite set $t_1, t_2, \ldots, t_n$. The survivor
function is defined very intuitively as the sum of the probabilities of failing at all times greater than or equal to t:
\[
\begin{aligned}
S(t) &= \Pr(T \geq t) && (2.9) \\
&= \sum_{j : t_j \geq t} p(t_j) && (2.10)
\end{aligned}
\]
where $p(t_j) = \Pr(T = t_j)$. The above calculation is not able to incorporate censoring in each of the
probabilities $p(t_j)$. In Section 2.6 we introduce the Kaplan-Meier estimator, which solves precisely this
problem.
2.5 Discrete-time Hazard Function
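The hazard function is easily defined as a failure rate (Klein and Moeschberger, 2003):
\[ h(t_i) = \Pr(T = t_i \mid T \geq t_i) = \frac{p(t_i)}{S(t_{i-1})} \tag{2.11} \]
where $S(t_{i-1})$ is to be read as $\Pr(T > t_{i-1})$, which for a discrete T equals $\Pr(T \geq t_i)$.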
It is worth noticing that if we change the time scale, for example from days to months, the numeric value
of $\Delta t_i$ gets smaller and the probability in the numerator gets larger (as it includes more events in the same
interval). This effect can be seen in Figure 2.2, where h(t) is estimated with time measured in years and
with time measured in days.
[Figure 2.2: two panels plotting the estimated hazard against time, one with time in years (hazard axis from 0 to 2.0) and one with time in days (hazard axis from 0 to 0.006).]
Figure 2.2: The effect of changing the time scale on h(t).
2.6 Kaplan-Meier Estimator
The Kaplan-Meier (KM) estimator (Kaplan and Meier, 1958) is one of the most used tools in survival analysis.
It provides a non-parametric estimate of S(t) that is able to incorporate right censoring. Consider m events in
a population of n individuals, with n − m of them being right censored. Denote by $n_j$ the number of
individuals at risk at time $t_j$, meaning the individuals still alive (not having experienced the event) at time $t_j$, and
by $d_j$ the number of deaths (occurred events) at time $t_j$. In a discrete setting, and assuming independence
between the event and censoring times (random censoring), we can interpret the KM estimator as a product
of the probabilities of surviving each time $t_j$, having survived until time $t_{j-1}$:
\[
\begin{aligned}
\hat{S}(t) &= \widehat{\Pr}(T \geq t) \\
&= \prod_{t_j < t} \left(1 - \widehat{\Pr}(T = t_j \mid T > t_{j-1})\right) \\
&= \prod_{t_j < t} \left(1 - \frac{d_j}{n_j}\right)
\end{aligned}
\]
In figure 2.3 we have an example of the KM estimation of S(t) for a survival dataset. The dashed lines
represent the 95% confidence interval which can be estimated as in (Klein and Moeschberger, 2003).
[Figure 2.3: Kaplan-Meier estimate of S(t) for the WHAS100 dataset, plotting P(T > t) against t in days (0 to 2500).]
Figure 2.3: Example of S(t) KM estimation.
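The Kaplan-Meier product of this section can be sketched in a few lines. This is a minimal illustration on made-up data, not the routine used to produce Figure 2.3; the function name and the convention that subjects censored at an event time are still counted as at risk at that time are assumptions of the sketch.

```python
def kaplan_meier(times, events):
    """Return [(t_j, estimate of S just after t_j)] for each distinct event time t_j.

    times  -- observed times (event or censoring)
    events -- 1 if the event was observed, 0 if right censored
    """
    curve = []
    s = 1.0
    for t in sorted({u for u, e in zip(times, events) if e == 1}):
        n_j = sum(1 for u in times if u >= t)  # individuals at risk at t_j
        d_j = sum(1 for u, e in zip(times, events) if u == t and e == 1)  # events at t_j
        s *= 1.0 - d_j / n_j
        curve.append((t, s))
    return curve

# Hypothetical data: five subjects, two of them right censored.
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
# 1 0.8
# 2 0.6
# 3 0.3
```

Each factor uses only the subjects still at risk, which is how the censored observations contribute without being counted as events.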
This estimate of the survival curve S(t) is completely uninformative about relationships
with the available explanatory variables, but when applied to different strata of the original data it is very
useful for assessing relevant differences between survival profiles. For example, in a clinical study, calculating
one Kaplan-Meier curve for male patients and another for female patients can be very insightful in understanding the association
between gender and survival time. As we will see later, the difference between survival curves is assessed
using statistical tests such as the Log-Rank test. The Nelson-Aalen estimator (Aalen et al., 2008) is another
non-parametric estimator, this time for the cumulative hazard function. It can also be used to estimate
the survival function through the relationship between the two, $S(t_i) = \exp[-H(t_i)]$, which yields an estimate close to the KM one.
The Nelson-Aalen estimator computes the cumulative hazard function as the sum of the failure rates until time t:
\[ H(t_i) = \sum_{j=0}^{i-1} \frac{d_j}{n_j} \tag{2.12} \]
with $d_j$ again denoting the number of deaths (events) that occurred at time $t_j$ and $n_j$ the total number of
individuals at risk at time $t_j$, in other words, subjects that did not fail until $t_{j-1}$.
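A minimal sketch of the Nelson-Aalen sum follows, on made-up data. As an assumption of the sketch, the value reported at each event time includes the increment at that time, so the indexing may differ by one step from 2.12.

```python
import math

def nelson_aalen(times, events):
    """Return [(t_j, cumulative hazard after t_j)], adding d_j / n_j at each event time."""
    curve = []
    h_cum = 0.0
    for t in sorted({u for u, e in zip(times, events) if e == 1}):
        n_j = sum(1 for u in times if u >= t)  # individuals at risk at t_j
        d_j = sum(1 for u, e in zip(times, events) if u == t and e == 1)  # events at t_j
        h_cum += d_j / n_j
        curve.append((t, h_cum))
    return curve

# Hypothetical data: five subjects, two of them right censored.
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
na_curve = nelson_aalen(times, events)
for t, h_cum in na_curve:
    # exp(-H) gives the survival estimate implied by the cumulative hazard
    print(t, round(h_cum, 3), round(math.exp(-h_cum), 3))
```

On small samples the survival values implied by exp(−H) are close to, but not identical with, the Kaplan-Meier values.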
2.7 Log-Rank Test
Although we can estimate survival curves for several groups, it is useful to assess whether the estimated curves
present statistically significant differences. One of the most used methods is the Log-Rank test, which is
presented here following (David G. Kleinbaum, Mitchel Klein, 2005). This testing procedure assesses whether
the individuals of the two (or more) groups come from the same population with respect to their survival
time. The log-rank test is a large-sample chi-square test whose statistic captures the
difference in terms of survival between groups and is able to incorporate censoring. The log-rank statistic
reflects the difference between the observed and expected numbers of failure events registered for each group at each
time step. The time steps are defined as the ordered failure times. First, the following quantities are defined
(David G. Kleinbaum, Mitchel Klein, 2005):
$m_{i,t_j}$: number of deaths in group i at time $t_j$.
$n_{i,t_j}$: number of individuals at risk (still alive) in group i at time $t_j$.
$e_{i,t_j}$: number of expected events in group i at time $t_j$.
$O_i$: total number of observed events for group i.
$E_i$: total number of expected events for group i.
The expected number of events is calculated assuming that the two groups are equally prone to have individuals
experiencing events. So for each group, the expected number of events at each failure time is proportional to the number of
individuals at risk in the group:
\[ e_{i,t_j} = \frac{n_{i,t_j}}{n_{1,t_j} + n_{2,t_j}} \left(m_{1,t_j} + m_{2,t_j}\right) \]
The total numbers of observed and expected events are summed over all failure times:
\[ O_i = \sum_j m_{i,t_j}, \qquad E_i = \sum_j e_{i,t_j} \]
The Log-Rank statistic (LR) reflects the difference between expected and observed events for each group. In
the two-group case this statistic is equal for both groups, so calculating it for group 1 we have:
\[ LR = \frac{(O_1 - E_1)^2}{Var(O_1 - E_1)} \tag{2.13} \]
Making use of this statistic, the following hypothesis test is performed:
\[ H_0 : \text{no difference in survival between groups} \tag{2.15} \]
\[ LR \sim \chi^2 \text{ with 1 d.f. under } H_0 \tag{2.16} \]
For the general case with G groups, the LR statistic is given by:
\[ LR = \sum_{i=1}^{G} \frac{(O_i - E_i)^2}{Var(O_i - E_i)} \tag{2.17} \]
In this case, under the null hypothesis the LR statistic follows a $\chi^2$ distribution with G − 1 degrees of freedom.
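The two-group statistic 2.13 can be sketched as follows. The text does not spell out Var(O1 − E1); the sketch assumes the usual hypergeometric variance summed over failure times, and the function name and data are hypothetical.

```python
def logrank_two_groups(times1, events1, times2, events2):
    """Two-group log-rank statistic (O1 - E1)^2 / Var(O1 - E1).

    The variance is the hypergeometric variance summed over the ordered
    failure times (an assumption of this sketch).
    """
    failure_times = sorted(
        {t for t, e in zip(times1, events1) if e == 1}
        | {t for t, e in zip(times2, events2) if e == 1}
    )
    o1 = e1 = var = 0.0
    for t in failure_times:
        n1 = sum(1 for u in times1 if u >= t)  # at risk in group 1
        n2 = sum(1 for u in times2 if u >= t)  # at risk in group 2
        m1 = sum(1 for u, e in zip(times1, events1) if u == t and e == 1)
        m2 = sum(1 for u, e in zip(times2, events2) if u == t and e == 1)
        n, m = n1 + n2, m1 + m2
        o1 += m1
        e1 += n1 / n * m  # expected events in group 1 at this failure time
        if n > 1:
            var += n1 * n2 * m * (n - m) / (n * n * (n - 1))
    return (o1 - e1) ** 2 / var if var > 0 else 0.0

# Hypothetical data: group A tends to survive longer than group B.
a_times, a_events = [6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1]
b_times, b_events = [1, 2, 3, 4, 5, 8], [1, 1, 0, 1, 1, 1]
lr = logrank_two_groups(a_times, a_events, b_times, b_events)
print(f"LR = {lr:.3f}")
```

Under the null hypothesis the value is compared with a chi-square distribution with one degree of freedom; swapping the two groups leaves the statistic unchanged, as stated in the text.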
2.8 Cox proportional hazards
When assessing the relationship between a vector of explanatory variables $X = (X_1, X_2, \ldots, X_p)$ and observed
survival times, regression analysis is usually very insightful and is supported by strong mathematical tools.
Unfortunately, typical regression techniques such as multivariate least squares, logistic regression and others
are not able to deal with survival data due to the existence of censoring. To fill this void, in 1972
Sir David Cox introduced the acclaimed and widely used Cox proportional hazards model (Cox, 1972), also
known as Cox regression. Its semi-parametric nature made it the most
used regression model in survival analysis. The Cox regression relies on the proportional hazards assumption:
it is assumed that for every pair of individuals i, j their hazard functions are proportional over time. This has
the elegant consequence that every hazard function $h_i(t)$ can be written in relation to an abstract baseline
function $h_0(t)$, so the hazard function for individual i is the product of two factors:
\[ h_i(t) = K_i h_0(t) \]
Cox modeled the factor $K_i$ as a function of the covariates of individual i. A natural choice to model this
factor would be a linear combination of covariates, but as the hazard function cannot take negative values,
the linear combination is exponentiated so that the values are always positive (David G. Kleinbaum, Mitchel Klein,
2005):
\[ h(t \mid x_i) = e^{\beta x_i} h_0(t) \]
The factor multiplying the baseline hazard is often called the relative risk or hazard ratio: even if it is known,
we can only compare the risks among individuals; we cannot infer any absolute hazard value, because we do
not know $h_0(t)$. In applications the relative risks are quite useful to compare survival profiles among subjects,
and to assess the impact that each variable has on the subjects' survival time.
Partial Likelihood and estimation of β
Cox's approach to the estimation of $\beta = (\beta_1, \ldots, \beta_p)$ resorts to maximum likelihood, in particular
to a conditional partial likelihood. Partial, because the baseline hazard function $h_0(t)$ does not need to be defined
in order to calculate β. Conditional, because the likelihood function is constructed considering only the set
of instants $\{t_j\}$ at which events have occurred. Let $T_i$ be the random variable representing the survival time
of individual i, and $R(t_j)$ the set of individuals at risk at time $t_j$. Cox defined the conditional probability of
$T_i = t_j$ (restricting the event time to $t_j$) as the fraction of the total hazard at time $t_j$ that is due to individual i (Cox, 1972):
\[ P(T_i = t_j \mid \text{event at } t_j) = \frac{h_i(t_j)}{\sum_{l \in R(t_j)} h_l(t_j)} \]
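$P(T_i = t_j \mid \text{event at } t_j)$ is the probability that individual i fails at time $t_j$, knowing that an individual has
died at time $t_j$. With individual j experiencing the event at time $t_j$, for $j = 1, \ldots, k$, the conditional likelihood is given by:
\[
\begin{aligned}
L(\beta) &= P(T_1 = t_1, \ldots, T_j = t_j, \ldots, T_k = t_k \mid \text{events at } t_1, \ldots, t_j, \ldots, t_k) \\
&= \prod_{j=1}^{k} \frac{h_j(t_j)}{\sum_{l \in R(t_j)} h_l(t_j)} \\
&= \prod_{j=1}^{k} \frac{e^{x_j \beta} h_0(t_j)}{\sum_{l \in R(t_j)} e^{x_l \beta} h_0(t_j)} \\
&= \prod_{j=1}^{k} \frac{e^{x_j \beta} h_0(t_j)}{h_0(t_j) \sum_{l \in R(t_j)} e^{x_l \beta}} \\
&= \prod_{j=1}^{k} \frac{e^{x_j \beta}}{\sum_{l \in R(t_j)} e^{x_l \beta}}
\end{aligned}
\]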
Following the proportional hazards assumption, the baseline hazard term cancels out and is thus irrelevant to this
maximum likelihood estimation. Note, moreover, that the information from censored subjects is incorporated
in the denominator: the hazards of all individuals in the risk set $R(t_j)$ are summed. The Cox model
log-likelihood function becomes:
\[ l(\beta) = \sum_{i=1}^{k} x_i \beta - \sum_{i=1}^{k} \log\left[\sum_{l \in R(t_i)} e^{x_l \beta}\right] \tag{2.18} \]
The cancellation of the baseline hazard is one of the main reasons for this model's
popularity: in order to calculate the regression parameters β, no assumption has to be made about the
baseline hazard function, as long as the proportional hazards assumption holds. The estimate $\hat{\beta}$ is obtained
by maximizing 2.18 with respect to β:
\[ \hat{\beta} = \arg\max_{\beta} \, l(\beta) \]
In order to maximize 2.18 we differentiate the log-likelihood with respect to each coefficient $\beta_r$, resulting in:
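\[ U_r(\beta) = \frac{\partial l(\beta)}{\partial \beta_r} = \sum_{i=1}^{k} \left[x_{r,i} - A_{r,i}(\beta)\right] \]
where
\[ A_{r,i}(\beta) = \frac{\sum_{l \in R(t_i)} x_{r,l} \, e^{x_l \beta}}{\sum_{l \in R(t_i)} e^{x_l \beta}} \]
is the risk-weighted average of covariate r over the risk set at time $t_i$.
$U_r(\beta)$ is known as the score function. The entries of the Hessian matrix are given by:
\[ \Upsilon_{r,\xi}(\beta) = \frac{\partial^2 l(\beta)}{\partial \beta_r \, \partial \beta_\xi} \]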
The score function, in conjunction with the second derivatives $\partial^2 l(\beta) / \partial \beta_r \partial \beta_\xi$, can be used with the Newton-Raphson algorithm to calculate
the maximum likelihood estimate of β (Cox, 1972).
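The Newton-Raphson iteration on the partial log-likelihood can be sketched for the simplest case of a single covariate and no tied event times. The function names and the data are hypothetical, and the sketch omits the safeguards (step halving, convergence checks) that a production routine would need.

```python
import math

def score_and_hessian(beta, times, events, x):
    """Score U(beta) and second derivative of the Cox partial log-likelihood
    for one covariate, assuming no tied event times."""
    u = 0.0
    h = 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if e_i != 1:
            continue  # censored subjects only enter through the risk sets
        risk = [x_l for t_l, x_l in zip(times, x) if t_l >= t_i]  # covariates in R(t_i)
        w = [math.exp(x_l * beta) for x_l in risk]
        s0 = sum(w)
        s1 = sum(x_l * w_l for x_l, w_l in zip(risk, w))
        s2 = sum(x_l * x_l * w_l for x_l, w_l in zip(risk, w))
        a_i = s1 / s0  # risk-weighted average of the covariate
        u += x_i - a_i
        h -= s2 / s0 - a_i ** 2
    return u, h

def fit_cox_1d(times, events, x, iterations=25):
    """Plain Newton-Raphson started at beta = 0."""
    beta = 0.0
    for _ in range(iterations):
        u, h = score_and_hessian(beta, times, events, x)
        if h == 0.0:
            break
        beta -= u / h
    return beta

# Hypothetical data: six subjects, one binary covariate, one censored subject.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 0, 1]
x = [1, 1, 0, 1, 0, 0]
beta_hat = fit_cox_1d(times, events, x)
final_score, final_hessian = score_and_hessian(beta_hat, times, events, x)
print(f"beta_hat = {beta_hat:.4f}, score at beta_hat = {final_score:.2e}")
```

Note that the baseline hazard never appears in the computation, and that the negative inverse of the second derivative at the estimate is what the Wald test uses as variance.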
Wald Test
The Wald test is used to assess whether each coefficient $\beta_r$ is significantly different from zero. The Wald
statistic is given by (Harrell, 2001):
\[ W(\beta_r) = \frac{\beta_r^2}{\sigma_{\beta_r}^2} \]
The covariance matrix of the parameter vector β can be approximated by the inverse of the negative Hessian matrix of the log-likelihood, evaluated at the estimate; the
standard error $\sigma_{\beta_r}$ of each coefficient is given by the square root of the corresponding diagonal element.
Under the null hypothesis of the coefficient being equal to zero, this statistic follows a $\chi^2$ distribution with
one degree of freedom. More formally, the hypothesis test is the following:
\[ H_0 : \beta_r = 0 \]
\[ \text{with } H_0 \text{ true}: \; W(\beta_r) \sim \chi^2 \text{ with 1 degree of freedom} \]
This test is one of the most used when assessing the significance of a coefficient; for example, it is used
to calculate the p-values of the Cox coefficients in the R routine “coxph” of the survival analysis package
“survival” (Therneau, 2014).
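As a small worked example of the Wald test, the sketch below computes the statistic and its p-value from a coefficient and its standard error. The numbers are hypothetical; the p-value uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable, so its upper tail equals erfc(sqrt(W/2)).

```python
import math

def wald_test(beta_hat, std_err):
    """Wald statistic and p-value under a chi-square distribution with 1 d.f."""
    w = (beta_hat / std_err) ** 2
    p_value = math.erfc(math.sqrt(w / 2.0))  # P(chi2_1 > w)
    return w, p_value

# Hypothetical coefficient and standard error.
w, p = wald_test(0.5, 0.25)
print(f"W = {w:.2f}, p = {p:.4f}")
# W = 4.00, p = 0.0455
```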
Testing the Proportional Hazards Assumption
As the name indicates, the key assumption in the Cox model is the proportionality of hazards over time,
given by the relation:
\[ \frac{h_0(t) e^{X_i(t)\beta}}{h_0(t) e^{X_j(t)\beta}} = \frac{e^{X_i(t)\beta}}{e^{X_j(t)\beta}}, \tag{2.19} \]
Note that even for time-changing covariates, although the relative hazard is not independent of time, the relative
impact of any two values of a covariate is always determined by β. There are a variety of approaches
to assessing the proportionality of hazards assumption. The simplest one, and perhaps the most intuitive, is to
plot the survival curves estimated, for example, using the Kaplan-Meier estimator: if the assumption holds, the
curves plotted in log-log scale should be approximately parallel ((Therneau and Grambsch, 2000) provides a
simple explanation of this result). In the Cox model, with time-fixed covariates only, the cumulative hazard
function is given by:
\[
\begin{aligned}
H_i(t) &= \int_0^t h_0(u) e^{X_i \beta} \, du && (2.20) \\
&= e^{X_i \beta} \int_0^t h_0(u) \, du && (2.21) \\
&= e^{X_i \beta} H_0(t) && (2.22)
\end{aligned}
\]
Thus, since $S_i(t) = \exp[-H_i(t)]$, we have for the survival function:
\[
\begin{aligned}
-\log S_i(t) &= e^{X_i \beta} H_0(t) \iff && (2.23) \\
\log[-\log S_i(t)] &= \log[H_0(t)] + X_i \beta && (2.24)
\end{aligned}
\]
so the log-log survival curves of two individuals differ only by a vertical shift that does not depend on time.
An alternative test, based on the work by (Grambsch and Therneau, 1994), extends the Cox model to include
time-varying coefficients: β(t) is calculated based on Schoenfeld residuals, and the closer β(t) is to constant over
time, the more reasonable the proportionality of hazards assumption is.
Ridge and LASSO
The partial log-likelihood can be adapted in order to restrict the structure of β. For example, one common
goal is to promote sparsity, since in many settings the number of covariates is large. To assure convergence
and significance of the coefficients, we want the model to shrink the coefficients, ideally forcing some of them to zero. This can be
done by introducing a penalty term, as in ridge regression (Goeman, 2010):
\[ l_{ridge}(\beta) = l_{Cox}(\beta) - \lambda \lVert \beta \rVert^2, \tag{2.25} \]
where λ is a tuning parameter usually called the regularization parameter. The ridge
penalty shrinks the coefficients but does not necessarily lead to sparsity. An alternative approach to penalization
is the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996), for which the penalized Cox partial
log-likelihood is given by:
\[ l_{lasso}(\beta) = l_{Cox}(\beta) - \lambda \sum_{r=1}^{p} |\beta_r|. \tag{2.26} \]
These methods are very useful when performing feature selection. After fitting the model, the covariates
that have coefficients close to zero are the first candidates to be discarded. Both methods are available in the R
package “penalized” (Goeman, 2012).
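To make the two penalized objectives concrete, the sketch below evaluates 2.25 and 2.26 for a given β. It only evaluates the objectives and does not maximize them; the function names and data are hypothetical, and the implementation assumes no tied event times. It is not the algorithm of the “penalized” package.

```python
import math

def cox_partial_loglik(beta, times, events, X):
    """Cox partial log-likelihood (eq. 2.18), assuming no tied event times."""
    def linear(row):
        return sum(b * v for b, v in zip(beta, row))
    total = 0.0
    for t_i, e_i, x_i in zip(times, events, X):
        if e_i == 1:
            risk_sum = sum(math.exp(linear(x_l)) for t_l, x_l in zip(times, X) if t_l >= t_i)
            total += linear(x_i) - math.log(risk_sum)
    return total

def ridge_loglik(beta, lam, times, events, X):
    """Eq. 2.25: subtract lambda times the squared Euclidean norm of beta."""
    return cox_partial_loglik(beta, times, events, X) - lam * sum(b * b for b in beta)

def lasso_loglik(beta, lam, times, events, X):
    """Eq. 2.26: subtract lambda times the sum of absolute coefficients."""
    return cox_partial_loglik(beta, times, events, X) - lam * sum(abs(b) for b in beta)

# Hypothetical data: four subjects, two covariates.
times = [2, 5, 7, 9]
events = [1, 1, 0, 1]
X = [[1.0, 0.2], [0.0, 1.5], [1.0, 0.7], [0.0, 0.1]]
beta = [0.5, -2.0]
base = cox_partial_loglik(beta, times, events, X)
print(base, ridge_loglik(beta, 0.1, times, events, X), lasso_loglik(beta, 0.1, times, events, X))
```

For the same λ, the ridge penalty grows quadratically with each coefficient while the LASSO penalty grows linearly, which is why only the latter tends to set coefficients exactly to zero.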
Competing Risks and Time-varying covariates
In clinical studies it is fairly common to have individuals that are subject to multiple risks of failure.
For example, in the Stanford heart transplant study, reported by (Crowley and Hu, 1977), the failures are
classified as either death from organ rejection or death from other complications related to the procedure.
Several approaches to analyzing this kind of data have been studied. For example, (Kalbfleisch and Prentice,
2011) treats this case by fitting two separate Cox models, one for each kind of failure, treating the other failure
types as censored observations. Another approach, adopted by various authors including (Larson and Dinse,
1985), involves fitting more complex models incorporating the different failure types. Here we present one
of the simplest methods to incorporate competing risks when fitting a survival model, that of (Lunn and
McNeil, 1995). Their approach allows the analysis of competing risks survival data directly using the standard
Cox proportional hazards model. Their strategy relies on reformulating the data according to each individual's
cause of death. Considering an example scenario of two competing risks I and II, coded as ρ = 0 and ρ = 1, each
subject appears twice (once per competing risk): if the subject dies from risk ρ = 0, it appears censored in the entry
corresponding to risk ρ = 1. The covariate vector is duplicated in order to take into account interactions of
the covariates with the type of failure. Table 2.2 shows the general case for two risks, with subjects
having covariate vector x.
Table 2.2: Example of the subject replication.
Individual   $T_i$   $\delta_i$   Type of failure (ρ)   Augmented covariates
i            $t_i$   1            $\rho_i$              $\{\rho_i x, x\}$
i            $t_i$   0            $1 - \rho_i$          $\{(1 - \rho_i) x, x\}$
As shown, an individual always has two entries in the data, one accounting for
each kind of failure. This procedure relies on the assumption that the overall hazard function
is the sum of the independent hazard functions of each competing risk; thus, using the counting process
framework, we have for the cumulative hazard function:
\[
\begin{aligned}
H(t) &= \sum_{i=0}^{n} H_i(t) && (2.27) \\
&= \sum_{i=0}^{n} H_{i,I}(t) + \sum_{i=0}^{n} H_{i,II}(t) && (2.28)
\end{aligned}
\]
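The subject replication of Table 2.2 can be sketched as a small data transformation. The function name and row layout are hypothetical; as an assumption of the sketch, a subject whose follow-up was censored produces two censored rows.

```python
def lunn_mcneil_rows(subject_id, time, event, rho, x):
    """Duplicate one subject, one row per competing risk, as in Table 2.2.

    event -- 1 if a failure was observed, 0 if the subject was censored
    rho   -- observed failure type (0 or 1); ignored for censored subjects
    x     -- list of covariate values
    Returns rows (id, time, delta, failure type, augmented covariates).
    """
    rows = []
    for risk_type in (0, 1):
        # The row is an event only for the failure type that actually occurred.
        delta = 1 if (event == 1 and rho == risk_type) else 0
        # Interaction of the covariates with the failure type, followed by the covariates.
        augmented = [risk_type * v for v in x] + list(x)
        rows.append((subject_id, time, delta, risk_type, augmented))
    return rows

# Hypothetical subject who failed from risk type 1 at time 12.
rows = lunn_mcneil_rows("s1", 12.0, 1, 1, [2.0, 5.0])
for row in rows:
    print(row)
# ('s1', 12.0, 0, 0, [0.0, 0.0, 2.0, 5.0])
# ('s1', 12.0, 1, 1, [2.0, 5.0, 2.0, 5.0])
```

The augmented data can then be passed to a standard Cox fitting routine, which is the appeal of the method.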
Time-dependent covariates are very common in clinical data; the most common types are
repeated measurements on a subject over time and changes in the patient’s therapy. To incorporate such
time series, the data is usually written in counting process format, as described in (Therneau and Grambsch,
2000). For example, given a time-varying covariate X(t), we can describe its behaviour by adding a
new record at the time the variable changed its value and censoring the record corresponding to the
previous value, as can be seen in Table 2.3.
Table 2.3: Example of coding time-dependent covariates.
id   Interval    δ   X(t)
A    (0,102]     1   0
B    (0,21]      0   0
B    (21,200]    1   1
This extension does not require any modification of the standard
Cox model. The data format allows the computer routine to pick the right value of each covariate at each
time point where a component of the partial likelihood is being calculated. There is no subject replication:
at each time point, a subject enters the partial likelihood calculation only once.
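The coding of Table 2.3 can be reproduced with a short helper that splits a subject's follow-up at each covariate change. The function name and argument layout are hypothetical.

```python
def counting_process_rows(subject_id, end_time, event, changes, x0=0):
    """Split one subject's follow-up into (start, stop] records.

    changes -- list of (time, new_value) pairs, sorted by time, at which X(t) changes
    Only the last record can carry the event; earlier records are censored.
    Returns rows (id, start, stop, delta, x).
    """
    rows = []
    start, value = 0, x0
    for change_time, new_value in changes:
        rows.append((subject_id, start, change_time, 0, value))
        start, value = change_time, new_value
    rows.append((subject_id, start, end_time, event, value))
    return rows

# Subject A never changes value; subject B changes from 0 to 1 at time 21.
table = counting_process_rows("A", 102, 1, []) + counting_process_rows("B", 200, 1, [(21, 1)])
for row in table:
    print(row)
# ('A', 0, 102, 1, 0)
# ('B', 0, 21, 0, 0)
# ('B', 21, 200, 1, 1)
```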
2.9 Other Survival Models
Buckley-James regression
A least squares approach to survival data was first introduced by (Miller, 1976). Buckley and James (Buckley
and James, 1979) presented a type of linear regression that is able to incorporate censored data and has
been shown to have good statistical properties (Miller and Halpern, 1982; Heller and Simonoff, 1990). The
Buckley-James model assumes that the survival time T is linearly related to the covariate vector X:
\[ T_i = \beta_0 + \beta X_i + \epsilon_i \]
where the $\epsilon_i$ are i.i.d. with $E[\epsilon_i] = 0$ and $Var(\epsilon_i) = \sigma^2$, and are distributed independently of $X_i$. Since for
right-censored observations we only observe $C_i$, the usual least squares regression approach is not applicable.
Buckley and James defined their regression on a new response variable $T_i^*$:
\[ T_i^* = T_i \delta_i + E[T_i \mid T_i > C_i](1 - \delta_i) \]
recalling that δ = 1 when the event is registered and δ = 0 when the observation is censored. This expression
replaces the censored survival times $C_i$ with $E[T_i \mid T_i > C_i]$. Their approach to the calculation of this conditional
expectation was (Buckley and James, 1979):
\[
\begin{aligned}
E[T_i \mid T_i > C_i] &= E[\beta_0 + \beta X_i + \epsilon_i \mid \beta_0 + \beta X_i + \epsilon_i > C_i] \\
&= \beta_0 + \beta X_i + E[\epsilon_i \mid \epsilon_i > C_i - \beta_0 - \beta X_i] \\
&= \beta_0 + \beta X_i + \frac{\int_{C_i - \beta_0 - \beta X_i}^{\infty} \epsilon \, dF(\epsilon)}{1 - F(C_i - \beta_0 - \beta X_i)}
\end{aligned}
\]
where F is the distribution of ε, obtained through the Kaplan-Meier estimator. Since this estimate depends on
β itself, the estimation process is iterative. A few reasons can be enumerated for
using the Buckley-James regression over the Cox regression (Stare et al., 2000):
The assumption of proportionality of hazards may not hold.
Predicting survival times with the Cox model requires estimating a baseline hazard $h_0(t)$,
which is not part of the Cox model estimation process.
The results of a Cox model are less intuitive than the results of a linear fit.
Survival trees by Goodness of Split
A tree-based approach to survival data was introduced by (Leblanc and Crowley, 1993). The method recursively
partitions the data until it has been split into many regions,
each containing only a few observations. The splitting rule used is the maximization of a standardized two-sample
log-rank test statistic, which is called the goodness of split G. Each partition is made under the CART
framework (Breiman et al., 1984), and the authors further restrict themselves to splits on a single covariate $X_j$. These can
be described by the following rules:
1. Each split depends on the value of one predictor $X_j$.
2. If $X_j$ is an ordered variable, the partition is made by finding a split point c such that one group has
$X_j < c$ and the other $X_j \geq c$.
3. If $X_j$ is nominal with values in $B = \{b_1, \ldots, b_r\}$, the partition is made on non-empty disjoint subsets of
B.
2.10 Performance Metrics for Survival Models
2.10.1 Somers’ D
Somers’ D (Somers, 1962) is an asymmetric measure of association between two variables. Given a predictor
variable Z and an outcome variable T, we may estimate $D_{TZ}$ as a measure of the performance of Z as a predictor of
T. For example, taking Z as the hazard predicted by a Cox model fit on a dataset, we can assess the
ability of the estimated hazard to predict the survival times of the subjects. The definition
of Somers’ D is usually expressed in terms of Kendall’s $\tau_{ZT}$ (Kendall and Gibbons, 1990):
\[ \tau_{ZT} = E\left[\mathrm{sign}(Z_i - Z_j)\,\mathrm{sign}(T_i - T_j)\right]; \]
\[
\mathrm{sign}(a) =
\begin{cases}
-1, & \text{if } a < 0 \\
1, & \text{if } a > 0 \\
0, & \text{if } a = 0
\end{cases}
\]
where $Z_i$ is the predictor variable for individual i, and $T_i$ the outcome of individual i. The expression above
is unable to incorporate censored values, because for a censored value there is uncertainty about which of the two values being compared is
greater. In order to incorporate censoring, the factors sign(·) are replaced with a censored sign difference.
The version given by (Newson, 2006) considers the general case with both right and
left censoring. Here we give the particular case for right-censored survival data coded with standard event
indicators $\delta \in \{0, 1\}$:
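\[
\mathrm{csign}(a_i, \delta_i, a_j, \delta_j) =
\begin{cases}
1, & \text{if } a_i > a_j \text{ and } \delta_j = 1 \\
-1, & \text{if } a_i < a_j \text{ and } \delta_i = 1 \\
0, & \text{otherwise}
\end{cases}
\]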
where $\delta_i$ and $\delta_j$ are the event indicators (0 for censored) of $a_i$ and $a_j$, respectively. This new operator
has an intuitive interpretation. Using an analogy with survival times and event indicators, if $a_i$ is apparently
longer than $a_j$, we can only be sure it is indeed longer if $a_j$ is not censored; otherwise there is uncertainty
and, as random censoring is assumed, the expected difference between the two is zero. For the second case,
if $a_i$ is apparently shorter than $a_j$, certainty can only be assured when $a_i$ is not censored.
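With this new operator, a new quantity $\tau^{cens}_{ZT}$, analogous to $\tau_{ZT}$, can be defined:
\[ \tau^{cens}_{ZT} = E\left[\mathrm{csign}(Z_i, R^Z_i, Z_j, R^Z_j)\,\mathrm{csign}(T_i, \delta_i, T_j, \delta_j)\right] \]
where $R^Z_i$ and $R^Z_j$ denote the censoring indicators of $Z_i$ and $Z_j$.
As in survival studies the only source of censoring is usually the T variable, the formula above can be
reduced to:
\[ \tau^{cens}_{ZT} = E\left[\mathrm{sign}(Z_i - Z_j)\,\mathrm{csign}(T_i, \delta_i, T_j, \delta_j)\right] \]
Somers’ D (Somers, 1962) is calculated by:
\[ D_{TZ} = \frac{\tau^{cens}_{ZT}}{\tau_{ZZ}} \]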
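The censored sign difference and the resulting Somers' D can be sketched directly from the definitions, replacing the expectations by averages over all ordered pairs of distinct individuals. The function names and data are hypothetical; in the example Z is a predictor for which larger values are meant to indicate longer survival, so a good predictor gives a positive D.

```python
def sign(a):
    """Sign of a: -1, 0 or 1."""
    return (a > 0) - (a < 0)

def csign(a_i, d_i, a_j, d_j):
    """Censored sign difference for right-censored values (event indicator 1 = observed)."""
    if a_i > a_j and d_j == 1:
        return 1
    if a_i < a_j and d_i == 1:
        return -1
    return 0

def somers_d(z, t, delta):
    """Ratio of the pair sums estimating tau_cens(Z, T) and tau(Z, Z)."""
    num = 0
    den = 0
    n = len(z)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            s = sign(z[i] - z[j])
            num += s * csign(t[i], delta[i], t[j], delta[j])
            den += s * s
    return num / den

# Hypothetical data: the predictor orders the subjects exactly as their times do,
# but the third subject is censored, so one pair carries no information.
z = [1, 2, 3, 4]
t = [2, 4, 6, 8]
delta = [1, 1, 0, 1]
d = somers_d(z, t, delta)
print(round(d, 4))
# 0.8333
```

The pair formed by the censored third subject and the fourth subject contributes zero, which is why D stays below 1 even though the ordering is perfect.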
2.10.2 Harrell’s concordance c index
Harrell’s concordance c index (Harrell et al., 1982) can be calculated directly from Somers’ D:
\[ c = \frac{D_{TZ}}{2} + 0.5 \]
However, this index can be interpreted in a more intuitive way: it measures the proportion of concordant pairs
among all permissible pairs of individuals. It can be calculated through the following steps:
1. Form all possible pairs over the data.
2. Omit the pairs whose shorter survival time is censored. Omit pairs i and j if $T_i = T_j$ unless at least one of them
is dead. The remaining pairs are the permissible pairs.
3. For each permissible pair where $T_i \neq T_j$, count 1 if the shorter survival time has the higher predicted risk,
and count 0.5 if the predicted outcomes are tied. For each permissible pair where $T_i = T_j$ and both are deaths, count 1 if the predicted risks are tied. For
each permissible pair where $T_i = T_j$ and only one is censored, count 1 if the uncensored one has a higher
predicted risk.
4. In the tied-time cases not specified above, count 0.5; discordant pairs with $T_i \neq T_j$ count 0.
5. The C-index is given by
\[ C = \frac{\text{Concordance}}{\text{Permissible Pairs Count}} \]
where Concordance is the sum of the counts above.
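The pair-counting procedure can be sketched as below. This is a simplified version: as an assumption of the sketch, pairs with tied survival times are skipped entirely instead of being scored with the tie rules of steps 3 and 4. The function name and data are hypothetical.

```python
def harrell_c(times, events, risk):
    """Simplified C-index: pairs with tied survival times are skipped."""
    concordance = 0.0
    permissible = 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] == times[j]:
                continue  # tie rules omitted in this sketch
            shorter, longer = (i, j) if times[i] < times[j] else (j, i)
            if events[shorter] != 1:
                continue  # shorter time is censored: pair is not permissible
            permissible += 1
            if risk[shorter] > risk[longer]:
                concordance += 1.0  # shorter survival has the higher predicted risk
            elif risk[shorter] == risk[longer]:
                concordance += 0.5  # tied predictions
    return concordance / permissible

# Hypothetical data: predicted risks misorder the second and third subjects.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [4, 2, 3, 1]
c_index = harrell_c(times, events, risk)
print(c_index)
# 0.8
```

Here five pairs are permissible (the pair formed by the last two subjects is dropped because its shorter time is censored) and four of them are concordant.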
2.10.3 Time-dependent ROC
The Receiver Operating Characteristic (ROC) curve is a common measure of predictive ability when dealing with a
binary classification problem. In general, the model outputs a continuous value, and one of the outcomes is chosen by applying some
threshold to it. Given the model output Y, a typical classification setting would be:
\[
\text{Outcome} =
\begin{cases}
1 & \text{if } Y \geq Y_{thr} \\
0 & \text{if } Y < Y_{thr}
\end{cases}
\]
ROC is particularly useful for assessing the discrimination ability of the model, because it does not depend
on any particular threshold value. To calculate the ROC curve, the classification is done for every possible threshold
value, and the values of the true positive rate (sensitivity) and false positive rate (1 − specificity) are
computed and plotted. Usually the false positive rate is placed on the horizontal axis and the sensitivity
on the vertical axis. Generally a survival problem does not fit in the binary setting specified above, and
even if we define a priori two classes depending on a chosen threshold, survival data also presents censoring.
To extend the use of ROC to survival data, (Heagerty et al., 2000) developed the time-dependent ROC. In
their work, they take as binary outcome a modified event indicator D(t). This event indicator takes the value D(t) = 0
if the event did not happen until time t and D(t) = 1 otherwise. Given that the values of the binary
outcome depend on the time at which they are evaluated, there is a ROC curve for every
time instant $t_i$. Each curve ROC($t_i$) is obtained by computing the sensitivity and specificity for each possible
value of the threshold $Y_{thr}$. Assuming that higher values of the model prediction Y indicate higher risk, that is, shorter survival,
sensitivities and specificities are given by:
\[ \text{sensitivity}(Y_{thr}, t_i) = P(Y > Y_{thr} \mid D(t_i) = 1) \]
\[ \text{specificity}(Y_{thr}, t_i) = P(Y \leq Y_{thr} \mid D(t_i) = 0) \]
In Figure 2.4 we have the ROC(t) curves used to measure the predictive ability and discrimination of a Cox regression
fit on the WHAS100 dataset. The curves were calculated using the R package “survivalROC” (Heagerty and
packaging by Paramita Saha-Chaudhuri, 2013).
[Figure 2.4: six panels plotting TPR against FPR, one per time horizon: ROC(1 year), AUC = 0.775; ROC(2 years), AUC = 0.808; ROC(3 years), AUC = 0.740; ROC(4 years), AUC = 0.709; ROC(5 years), AUC = 0.707; ROC(6 years), AUC = 0.755.]
Figure 2.4: Example of ROC(t).
2.11 Counting Processes Framework
Introduced by Aalen in (1975) and (1978), the counting process approach has been a source of many important
developments and continues to be one of the most studied fields in state-of-the-art survival analysis. Issues such
as left censoring, right censoring, left truncation and time-dependent covariates can be elegantly incorporated
in this framework. In this text we skip the heavy mathematical theory regarding stochastic integrals,
martingales and measure theory. In (Aalen et al., 2008) the authors give a very intuitive explanation of the
counting process modeling of survival data, which serves as the basis for the exposition made in the next section.
This framework is based on the theory of stochastic processes and stochastic integrals, in particular martingale
theory. First we define the event counting process N(t), which counts the number of events that occurred until
time t. For a single individual with survival time T and event indicator δ, it can be written using the indicator function I:
\[ N(t) = I(T \leq t, \delta = 1); \tag{2.29} \]
Then we define the at-risk process, a decreasing process which, summed over individuals, gives the number of subjects in the risk set at
time t:
\[ Y(t) = I(T \geq t); \tag{2.30} \]
The expected number of events is related to the survival curves of the individuals. To calculate the
expected number of individuals that fail until a given time, at each time instant we can consider an individual’s event
as a Bernoulli random variable with probability p equal for all subjects, so that the total number of events becomes
a binomial random variable with expected value np. In the counting process formulation, n is given
by the process Y(t) and each probability p is given by the population hazard function h(t). Since these
values change over time, the expected value is given by integrating their product:
\[ \Lambda(t) = \int_0^t Y(u) h(u) \, du. \tag{2.31} \]
The difference between these two quantities is called the counting process martingale:
\[ M(t) = N(t) - \Lambda(t) \tag{2.32} \]
M(t) possesses the martingale property (Mikosch, 1998). In substance, this means that the process is driftless:
its expected value at a given time in the future is equal to the current value of the process.
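For a single individual, the three processes of this section can be evaluated in closed form when the hazard is constant. The sketch below assumes such a constant hazard (an illustrative simplification, not a model used in the text), in which case the integral in 2.31 reduces to the hazard times the time spent at risk.

```python
def counting_process_terms(t, survival_time, delta, hazard):
    """N(t), Lambda(t) and M(t) for one subject under a constant hazard.

    With Y(u) = I(survival_time >= u), the integral of Y(u) * hazard over [0, t]
    equals hazard * min(t, survival_time).
    """
    n_t = 1 if (survival_time <= t and delta == 1) else 0
    lambda_t = hazard * min(t, survival_time)
    return n_t, lambda_t, n_t - lambda_t

# Hypothetical subject who fails at time 3, evaluated at t = 5 with hazard 0.2.
n_t, lambda_t, m_t = counting_process_terms(5.0, 3.0, 1, 0.2)
print(n_t, round(lambda_t, 3), round(m_t, 3))
# 1 0.6 0.4

# The same subject, had the observation been censored at time 3.
print(counting_process_terms(5.0, 3.0, 0, 0.2))
```

A positive M(t) means the subject failed earlier than the model expected; a negative value means the subject accumulated expected events without failing.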
Chapter 3
Survival outlier detection and robust estimation
The outlier detection and robust estimation fields intersect each other very often. Outlier detection concerns
the identification of outlying observations in a dataset. Robust estimation, or inference, addresses the
consequences of having outliers in the data when applying analysis methods, such as fitting regressions or
computing statistics. There are several views and definitions of what is considered an outlier, and definitions
vary greatly with applications. For example, (Hawkins, 1980) defines an outlier as an observation that deviates
so much from other observations as to arouse suspicion that it was generated by a different mechanism than
the remaining data. (Johnson et al., 1992) defines an outlier as an observation in a data set which
appears to be inconsistent with the remainder of that set of data, which alludes to a more parametric view. In
econometrics, (Jeffrey Wooldridge) gives an informal definition regarding outliers in an OLS regression
context: loosely speaking, an observation is an influential observation if dropping it from the analysis changes
key OLS estimates by a practically large amount. Informal definitions in the survival field are also abundant:
(Nardi and Schemper, 1999) define outlying observations as individuals whose survival time is too short or
too long with respect to the values of their covariates. In a regression context, the term outlier is very often
used interchangeably with overly influential; the latter term refers to the influence that one particular observation has on
the estimated model parameters, for example the slope parameter of a linear regression.
Taxonomy of Outlier Detection Methods
Outlier detection methods can be divided into several classes. In relation to data dimension, we have a clear
division between univariate methods and multivariate methods. Another important categorization is between
parametric methods and non-parametric methods. Parametric methods are based on the assumption that
the data has an underlying known distribution, or at least that the data follows a distribution of known form with unknown
parameters. These methods flag as outliers the observations that deviate from the a priori model. In the
class of non-parametric methods we have only model-free techniques, in particular distance-based methods
and clustering techniques. Regarding the method’s output, hard classifiers label each observation as outlier
or not, while the so-called soft classifiers output an outlyingness score for each observation. Another important
distinction (Davies and Gather, 1993) is between single-step and sequential procedures. Single-step procedures
do not rely on inward or outward removal of observations, while sequential procedures rely on eliminating the
most outlying observations at each step (outward) or including the least outlying at each step (inward).
Swamping and masking effects
When trying to identify outliers, one must rely on a quantity to use as a measure of outlyingness (for example,
residuals) to assess whether a given observation is in fact an outlier. When performing such an analysis, swamping
and masking phenomena are very likely to occur. Here we provide their definitions as given by (Ben-Gal, 2005):
Swamping Effect: One outlier observation swamps a second observation if the latter can be considered an
outlier in the presence of the first but not by itself.
Masking Effect: One outlier masks another outlier if the second outlier can be considered an outlier only by
itself but not in the presence of the first outlier.
When performing model-based outlier detection, as will be our case, swamping is prone to generate false
positives. These occur because the model fit is biased by the presence of the true outliers, and consequently
inlying observations can appear as outliers in relation to the fitted model. Masking will potentially generate
false negatives: true outliers are hidden, or “masked”, again because of the bias introduced in the model by
the presence of other true outliers.
Swamping and masking: a practical example
Consider the artificial data depicted in Figure 3.1. It is composed of 8 inliers that follow a strict linear trend and two outlying observations (9 and 10). One possible measure of outlyingness is the regression residual of each observation with respect to the fitted line. Table 3.1 lists the observations sorted by their linear regression residual. Instead of having the two outliers in the top two positions, we have observations 8 and 10, and observation 9 appears only in the fourth position. This is a very clear example of swamping and masking effects: observation 8 is being swamped by the two outliers, and observation 9 is being masked by observation 10.
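The same kind of behaviour can be reproduced with a few lines of code. The sketch below is only illustrative: the data are hypothetical (they are not the data of Figure 3.1), and the helper names `ols_fit` and `rank_by_residual` are introduced here for the example. Eight inliers lie exactly on the line y = x and observations 9 and 10 are planted outliers; the observations are then ranked by the absolute residual of an ordinary least squares fit.

```python
def ols_fit(xs, ys):
    """Ordinary least squares for a single predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rank_by_residual(xs, ys):
    """Observation numbers (1-based) sorted by decreasing absolute residual."""
    slope, intercept = ols_fit(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    order = sorted(range(len(xs)), key=lambda i: residuals[i], reverse=True)
    return [i + 1 for i in order]


# Hypothetical data: observations 1-8 lie exactly on y = x,
# observations 9 and 10 are the planted outliers.
xs = list(range(1, 11))
ys = [1, 2, 3, 4, 5, 6, 7, 8, 20, 40]
ranking = rank_by_residual(xs, ys)
print(ranking)
```

With these particular values, inlier 8 is ranked second (it is swamped), while outlier 9 ends up with the smallest residual of all (it is masked by observation 10).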
Cox Regression Residual Analysis
Similarly to typical regression analysis, one of the most traditional methods to detect overly influential observations is to compute the leverage the observation has on the regression line (Rousseeuw and Leroy, 2005). The leverage is a value proportional to $(X - \bar{X})^2$. The flaw of this method is that it does not take into account the outcome of the individual; distribution-based methods such as the box plot suffer from the same limitation. In survival analysis, a model-based approach is typically used when assessing for outliers. In this class, residuals specific to the Cox proportional hazards model are by far the most used in practice. In this section we review some of these residuals: Cox-Snell residuals, Martingale residuals and Deviance residuals. We then review some outlier detection methods typically used in a regression context: the likelihood displacement statistic and DFBETAS.
Figure 3.1: Example of linear regression fit on a data set with outliers.
Table 3.1: Linear regression example: observations sorted by residuals.
Observation Residual
8 25.89
10 24.12
7 10.08
9 8.48
1 1.66
6 1.59
2 0.42
5 0.39
4 0.14
3 0.00
Cox-Snell Residuals
The Cox-Snell residuals were the first kind of residuals to be defined for the Cox regression (Cox and Snell, 1968), and they are still widely used when assessing for outlying observations. The Cox-Snell residual for individual i is given by:
$$ r_{C_i} = \hat{H}_i(T_i) = e^{\beta x_i} H_0(T_i), \tag{3.1} $$
where the baseline cumulative hazard $H_0(t)$ is usually obtained using the Nelson-Aalen estimator (David Collett, 2003). It is important to notice that, as H(t) is a monotonically increasing function, censored observations are biased towards lower residual values. To correct this, the modified Cox-Snell residuals were introduced as:
$$ r'_{C_i} = \begin{cases} r_{C_i} & \text{for observed event times} \\ r_{C_i} + 0.693 & \text{for censored event times} \end{cases} \tag{3.2} $$
Martingale residuals
As seen in section 2.11, the martingale counting process is a residual-like quantity that expresses the difference between the observed and expected number of events. If we assign to every individual its own martingale process, we have:
$$ M_i(t) = N_i(t) - \Lambda_i(t). \tag{3.3} $$
The martingale residual is defined as the value of the process $M_i(t)$ at the time of failure or censoring (the follow-up time). As $N_i(t)$ takes the value 1 if the event is observed and 0 when censored, the residuals are given simply by (Therneau et al., 1990):
$$ r_{M_i} = \delta_i - \Lambda_i(T^C_i), \tag{3.4} $$
where $\delta_i$ is the event indicator for individual i and $\Lambda_i(T^C_i)$ is the value of the cumulative hazard function at the follow-up time of individual i. The value of $\delta_i$ is present in the data, while $\Lambda_i(T^C_i)$ can be calculated from the estimated Cox model with the choice of an appropriate baseline hazard function $h_0(t)$.
Deviance Residuals
The skewness of the Martingale residuals makes their plots difficult to interpret. Therneau et al. (1990) introduced the deviance residuals, which are more symmetrically distributed around zero. They are given by:
$$ r_{D_i} = \operatorname{sgn}(r_{M_i}) \left[ -2 \left\{ r_{M_i} + \delta_i \log(\delta_i - r_{M_i}) \right\} \right]^{1/2}. \tag{3.5} $$
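As a minimal sketch of Eqs. 3.4 and 3.5, the functions below compute both residuals from an event indicator and a cumulative hazard value. The function names are chosen for this example, and the cumulative hazard is assumed to have already been obtained from a fitted Cox model (and to be strictly positive for observed events). For censored observations the term $\delta_i \log(\cdot)$ is taken as zero.

```python
import math


def martingale_residual(delta, cum_hazard):
    """Eq. 3.4: event indicator minus cumulative hazard at the follow-up time."""
    return delta - cum_hazard


def deviance_residual(delta, cum_hazard):
    """Eq. 3.5, with the term delta * log(.) taken as 0 for censored observations."""
    r_m = martingale_residual(delta, cum_hazard)
    # delta - r_m equals the cumulative hazard, so the log term is log(cum_hazard).
    log_term = math.log(delta - r_m) if delta == 1 else 0.0
    inner = -2.0 * (r_m + log_term)
    sign = (r_m > 0) - (r_m < 0)
    # max() only guards against tiny negative values caused by rounding.
    return sign * math.sqrt(max(inner, 0.0))


print(deviance_residual(1, 0.5), deviance_residual(0, 0.5))
```

An observed event with a cumulative hazard of exactly 1 has both residuals equal to zero, since the observed and expected number of events coincide.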
The likelihood displacement statistic
Let $\hat\beta$ be the value of $\beta$ that maximizes the Cox partial likelihood and $\hat\beta_{(-i)}$ the estimate obtained when observation i is eliminated from the fit. The likelihood displacement (LD) statistic (David Collett, 2003) is given by:
$$ LD_i = 2\log L(\hat\beta) - 2\log L(\hat\beta_{(-i)}). \tag{3.6} $$
Under the null hypothesis $\beta_{(-i)} = \beta$, the LD statistic follows a chi-square distribution with one degree of freedom. We therefore calculate the p-value of this test for every observation; the observations with the most significant p-values are considered the most outlying.
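Given the two log-likelihood values of Eq. 3.6, the p-value under a chi-square distribution with one degree of freedom has a closed form, since $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$. The sketch below assumes the log-likelihoods have already been computed elsewhere; the function names and the numeric inputs are hypothetical.

```python
import math


def likelihood_displacement(loglik_full, loglik_without_i):
    """Eq. 3.6, given the two partial log-likelihood values."""
    return 2.0 * loglik_full - 2.0 * loglik_without_i


def chi2_1df_p_value(ld):
    """Upper-tail probability of a chi-square(1) variable: erfc(sqrt(ld / 2))."""
    return math.erfc(math.sqrt(max(ld, 0.0) / 2.0))


ld = likelihood_displacement(-100.0, -101.5)  # hypothetical log-likelihoods
print(ld, chi2_1df_p_value(ld))
```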
DFBETAS
One common criterion for assessing the influence of a particular observation is to measure its impact on the estimated parameter vector $\beta$. DFBETAS measure the change in the estimated values upon deleting each observation in turn, scaled by their standard errors (Harrell, 2001). More formally, the j-th component of DFBETAS for a given observation i is given by:
$$ \mathrm{DFBETA}_j = \left( \hat\beta_j - \hat\beta_j^{-i} \right) / \sigma_\beta. \tag{3.7} $$
The standard deviation $\sigma_\beta$ can be obtained from the square root of the corresponding diagonal component of the partial likelihood Hessian matrix (2.19). Since DFBETAS is a vector-valued quantity, analyzing the components associated with each covariate allows us to study in which components the observation shows an outlying behavior.
Finite Sample Breakdown Point
The concept of breakdown point was introduced by Hampel in 1968, who provided an asymptotic definition. In this work we review it in a finite sample context (Donoho and Huber, 1983). Qualitatively, the breakdown point corresponds to the smallest fraction of data that may cause an estimator to take on arbitrary values (Huber, 2011). For instance, when estimating the mean of a sample by the sample mean estimator, a single corrupt observation is able to offset the estimator by an arbitrarily large value; in this case the breakdown point is said to be 1/N, where N is the dataset size. On the other hand, to offset the median estimator by an arbitrary value, half of the observations need to be corrupt; in this case the breakdown point is 1/2.
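The contrast between the two estimators can be checked numerically. In the small sketch below (the values are arbitrary and chosen only for illustration), a single observation is replaced by a very large number: the sample mean moves by an amount proportional to the corruption, while the median does not move at all.

```python
import statistics

clean = [2.0, 3.0, 5.0, 7.0, 11.0]
corrupt = clean[:-1] + [1e9]  # replace a single observation by an arbitrary value

mean_shift = abs(statistics.mean(corrupt) - statistics.mean(clean))
median_shift = abs(statistics.median(corrupt) - statistics.median(clean))
print(mean_shift, median_shift)
```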
Robust methods for the Cox model estimation
It has been shown that the estimation process of a Cox model can be severely affected by the presence of overly influential or outlying observations. In terms of robustness, it has been pointed out in (Kalbfleisch and Prentice, 2011; Struthers and Kalbfleisch, 1986) that the Cox regression has a breakdown point of 1/N. In order to perform robust Cox regression in the presence of outliers, (Farcomeni and Viviani, 2011) proposed a modified Cox model that is fit by trimming the smallest contributions to the partial likelihood. Given a trimming level $\alpha$, typically between 0.1 and 0.2, the problem corresponds to finding the subset of cardinality $(1-\alpha)n$ that maximizes the Cox partial likelihood. Their results indicate that trimming increases the robustness of the method: on simulated data with contamination levels of 5%, the trimmed model was the procedure whose estimates were closest to the values obtained on data without contamination. Unfortunately, the authors did not publish any code or package for download, so we will not be able to use it in our results section for comparison with the proposed methods. Bednarski (Bednarski, 1993) proposed another way to increase the robustness of the Cox model estimation, which consists of modifying the maximization process of the Cox partial likelihood. This method is available through the routine "coxr" from the "coxrobust" R package (Bednarski and Borowicz, 2006).
Chapter 4
Proposed Methods
In this chapter we present three new methods for performing outlier detection in a survival analysis context. We start by reviewing the Bootstrap resampling procedure and related concepts. Next, we provide the insight of how a test statistic in conjunction with a resampling technique (like the Bootstrap) can be used to perform outlier detection. The test statistic used must be sensitive to the presence of outliers. Our three methods rely on the performance of a survival model as a statistic sensitive to outliers; in particular, we use Harrell's concordance c-index (Harrell et al., 1982). The rationale is the belief that the larger the number of outliers present in the data, the lower the performance of the model.
The first proposed method is One Step Deletion (OSD), which maximizes the concordance of the model on a dataset by removing the most likely outlying observation at each step. The second method is the Bootstrap Hypothesis Test (BHT), which performs a hypothesis test for each observation in the dataset, where the null hypothesis states that removing the observation does not increase concordance (that is, the observation does not behave as an outlier); the significance of this test is used as a measure of outlyingness. The third method is the Dual Bootstrap Hypothesis Test (DBHT), which extends BHT by performing a hypothesis test on the inequality of two random variables, produced by resampling concordance under two different variants of the bootstrap specially designed for this method.
General Strategy
The proposed methods share the same underlying strategy. Let a dataset be denoted by D, with observations $d_1, d_2, \ldots, d_N$, and let a sample from this dataset be denoted by $D^*$. Let $\tau$ represent an arbitrary test statistic, with $\tau(D^*)$ representing the value of the test statistic on sample $D^*$. To search for outliers, we study the impact of removing each observation $d_i$ on the value of $\tau$, again with the requirement that $\tau$ must be sensitive to the presence of outliers. Bootstrap resampling is employed to better assess the impact that observation $d_i$ has on concordance, aiming to make this assessment more resistant to masking and swamping interactions.
The Bootstrap
The Bootstrap (Efron, 1979) is a resampling technique whose main goal is to recreate the underlying distribution of the data. It is used when the underlying distribution is unknown or when simplifying assumptions are not reasonable. Bootstrapping can be useful when one wants to gain insight into the behaviour of a test statistic $\tau$ on the underlying distribution. Given a dataset D with N observations, one bootstrap sample is obtained by sampling N observations from D with replacement. The bootstrapped test statistic is obtained by calculating the value of $\tau$ for each bootstrap sample, as illustrated in Figure 4.1. One typical example (Efron, 1979) concerns the sample mean of a data sample, for which it would be very useful to know the standard error. One way to obtain it is the following Bootstrap procedure: 1) generate B bootstrap samples from the original data; 2) calculate the sample mean for each of the B bootstrap samples; 3) calculate the standard deviation of the B sample means, which estimates the standard error of the sample mean. The rationale behind this resampling procedure is that, when resampling with replacement, we are using the original dataset as a distribution, and the original dataset taken as an empirical distribution is the best available approximation of the underlying distribution of the observed data.
Figure 4.1: Bootstrapping a test statistic $\tau$: from the original dataset $D = (d_1, d_2, \ldots, d_N)$, B bootstrap samples $D^*_1, D^*_2, \ldots, D^*_B$ are drawn, giving B bootstrap replications $\tau(D^*_1), \tau(D^*_2), \ldots, \tau(D^*_B)$. Adapted from (Efron and Tibshirani, 1994).
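The three-step procedure for the standard error of the sample mean can be sketched as follows. This is an illustrative implementation using only the Python standard library; the function names, the synthetic Gaussian data and the choice B = 2000 are assumptions made for the example. For the sample mean, the bootstrap estimate can be compared with the usual formula $s/\sqrt{N}$.

```python
import random
import statistics


def bootstrap_replications(data, statistic, n_boot, rng):
    """Evaluate `statistic` on n_boot samples drawn with replacement from data."""
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_boot)]


def bootstrap_standard_error(data, statistic, n_boot=2000, seed=0):
    """Standard deviation of the bootstrap replications of `statistic`."""
    rng = random.Random(seed)
    return statistics.stdev(bootstrap_replications(data, statistic, n_boot, rng))


rng = random.Random(1)
sample = [rng.gauss(10.0, 2.0) for _ in range(50)]  # synthetic data
se_boot = bootstrap_standard_error(sample, statistics.mean)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5
print(se_boot, se_formula)
```

The same `bootstrap_replications` helper works for any statistic that takes a list of observations, which is what makes the approach attractive when no closed-form standard error is available.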
The Plug-in principle
We denote by F the true underlying distribution that generated dataset D. The empirical distribution $\hat{F}$ is defined as the discrete distribution that puts probability 1/N on each observation $d_i$, $i = 1, 2, \ldots, N$. The plug-in estimate (Efron and Tibshirani, 1994) of a parameter $\theta = \tau(F)$ is given by:
$$ \hat\theta = \tau(\hat{F}). $$
This means that a test statistic of the probability distribution F is estimated by the same function of the empirical distribution $\hat{F}$.
Bootstrap Hypothesis testing
In two of our methods we perform hypothesis tests following a Bootstrap approach, further explained in (MacKinnon, 2009). Given a test statistic $\tau$ with an observed value $\tau_0$ on D, to assess where $\tau_0$ lies on the distribution $F(\tau)$ we apply the plug-in principle, given that F is unknown. For example, suppose we want to calculate the approximate p-value $\hat{p}$ for the test $H_0 : \tau > \tau_0$. Using a bootstrap approach, this can be done by generating B bootstrap samples and then calculating the fraction of samples whose values of $\tau$ are larger than $\tau_0$:
$$ \hat{p} = \hat{P}(\tau > \tau_0) = \frac{1}{B} \sum_{j=1}^{B} I\left( \tau(D^*_j) > \tau_0 \right), \tag{4.1} $$
where I represents the indicator function.
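Eq. 4.1 translates almost directly into code. The sketch below is a minimal illustration with the standard library; the function name and its parameters are chosen for this example, and any callable that maps a list of observations to a number can be used as the statistic.

```python
import random


def bootstrap_p_value(data, statistic, tau0, n_boot=1000, seed=0):
    """Eq. 4.1: fraction of bootstrap samples whose statistic exceeds tau0."""
    rng = random.Random(seed)
    n = len(data)
    exceed = sum(statistic(rng.choices(data, k=n)) > tau0 for _ in range(n_boot))
    return exceed / n_boot


# Toy usage: the statistic is the sample maximum of a three-point dataset.
p_hat = bootstrap_p_value([1, 2, 3], max, tau0=2)
print(p_hat)
```

In this toy case the exact probability is $1 - (2/3)^3 \approx 0.70$, the chance that the value 3 appears at least once in a bootstrap sample of size 3, so the estimate should be close to that value.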
4.1 Motivation for the use of Bootstrapping
The semi-clairvoyant outlier counter
In this section we explain the idea behind using a bootstrap approach to perform outlier detection. We start by defining a test statistic that is sensitive to the presence of outliers and showing how it can be used for outlier detection. Finally, we propose a way of improving the utility of such a statistic by using a bootstrap approach. As an outlier-sensitive test statistic we present what we named the semi-clairvoyant outlier counter (SOC) test statistic. This test statistic counts outliers in an imperfect way, simulating an entity with some expertise in counting outliers, but which sometimes misses true outliers and other times takes inliers for outliers. First we define $I_{sc}$, the semi-clairvoyant outlier indicator function, with the following characteristics:
$$ I_{sc}(d^*_i) = \begin{cases} \mathrm{Bernoulli}(p_{TP}) & \text{when } d^*_i \text{ is an outlier} \\ \mathrm{Bernoulli}(p_{FP}) & \text{when } d^*_i \text{ is not an outlier (inlier)} \end{cases} \tag{4.2} $$
where $p_{TP}$ represents the probability of counting a true outlier as an outlier (true positive), and $p_{FP}$ the probability of counting an inlier as an outlier (false positive). The SOC is given by summing the indications over all observations in a given data sample $D^*$:
$$ SOC(D^*) = \sum_{i=1}^{n} I_{sc}(d^*_i). \tag{4.3} $$
It is fairly intuitive to verify that under certain conditions the SOC is an outlier-sensitive test statistic: in particular, if $p_{TP} > p_{FP}$ we can expect higher counts for data samples with larger numbers of outliers.
Using SOC for outlier detection
Considering a dataset D of N observations with k < N outliers, our strategy to perform outlier detection using the SOC operator is based on its expected value under two different scenarios: 1) when one outlier is removed from D; and 2) when one inlier is removed from D. The remaining dataset, now with N - 1 observations, is denoted by $D^-$ when an outlier is removed and by $D^+$ when the removed observation was an inlier. For the first scenario, the expected value of SOC on the remaining data is:
$$ E\left[SOC(D^-)\right] = (N - k)p_{FP} + (k - 1)p_{TP} \tag{4.4} $$
$$ = Np_{FP} + k(p_{TP} - p_{FP}) - p_{TP}, \tag{4.5} $$
and similarly, for the second scenario, the expected value is:
$$ E\left[SOC(D^+)\right] = (N - k - 1)p_{FP} + kp_{TP} \tag{4.6} $$
$$ = Np_{FP} + k(p_{TP} - p_{FP}) - p_{FP}. \tag{4.7} $$
Taking the difference between these two expected values, we have:
$$ E\left[SOC(D^+)\right] - E\left[SOC(D^-)\right] = p_{TP} - p_{FP}. \tag{4.8} $$
Considering $p_{TP} > p_{FP}$, we can expect lower values of SOC when an outlier is removed. With this expected difference under the two scenarios, the following outlier detection strategy could be devised:
1. For all observations remove one at a time.
2. For each removal calculate SOC on the remaining data.
3. The lower the value of SOC on the remaining data, the more outlying the removed observation is considered to be.
One potential problem with this strategy is that the difference $p_{TP} - p_{FP}$ may be very small in relation to the standard deviation of the SOC statistic in each scenario, which is given by:
$$ SD\left[SOC(D^-)\right] = \sqrt{(N - k)p_{FP}(1 - p_{FP}) + (k - 1)p_{TP}(1 - p_{TP})}; \tag{4.9} $$
$$ SD\left[SOC(D^+)\right] = \sqrt{(N - k - 1)p_{FP}(1 - p_{FP}) + kp_{TP}(1 - p_{TP})}. \tag{4.10} $$
By inspection we see that the standard deviation grows with N. For example, with 100 individuals of which 10 are outliers, and a SOC with $p_{TP} = 0.8$ and $p_{FP} = 0.1$, the difference in expected values is $p_{TP} - p_{FP} = 0.7$, while the standard deviations of SOC for the outlier-removed and inlier-removed scenarios are 3.09 and 3.10, respectively. So, when removing an observation, the variance of the statistic on the remaining data introduces a large amount of confusion in our method, since the standard deviation is very large in comparison with the difference between expected values. To solve this problem we need an outlier-sensitive statistic with much lower variance; to achieve this, we employ Bootstrap resampling, as explained in the next section.
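The numbers in this worked example follow from treating SOC as a sum of independent Bernoulli indicators. The short sketch below (the helper name `soc_moments` is introduced here for illustration) evaluates Eqs. 4.4, 4.6, 4.9 and 4.10 for N = 100, k = 10, $p_{TP} = 0.8$ and $p_{FP} = 0.1$.

```python
import math


def soc_moments(n_inliers, n_outliers, p_tp, p_fp):
    """Mean and standard deviation of a sum of independent Bernoulli indicators."""
    mean = n_inliers * p_fp + n_outliers * p_tp
    var = n_inliers * p_fp * (1 - p_fp) + n_outliers * p_tp * (1 - p_tp)
    return mean, math.sqrt(var)


N, k, p_tp, p_fp = 100, 10, 0.8, 0.1
mean_minus, sd_minus = soc_moments(N - k, k - 1, p_tp, p_fp)  # one outlier removed
mean_plus, sd_plus = soc_moments(N - k - 1, k, p_tp, p_fp)    # one inlier removed
print(mean_plus - mean_minus, sd_minus, sd_plus)
```

The difference in means is 0.7, against standard deviations of roughly 3.09 and 3.10, which is the mismatch in scale discussed above.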
Bootstrapping the SOC test statistic
For typical cases, the standard deviations in Eqs. 4.9 and 4.10 will be much higher than the difference between the expected values of the two scenarios (Eq. 4.8). It is predictable that a single calculation of the SOC statistic when removing each observation will have very low significance, due to the high variance of SOC. To mitigate this effect, a bootstrap approach can be employed. Even though SOC was described as a stochastic test statistic, we consider that it takes the same value for the same dataset, and probabilities reflect the uncertainty about the data sample to which SOC is applied. Drawing B bootstrap samples from D, we can define a new test statistic named $SOC_B$, corresponding to the mean SOC over the B bootstrap samples $D^*_1, D^*_2, \ldots, D^*_B$:
$$ SOC_B = \frac{\sum_{i=1}^{B} SOC(D^*_i)}{B}. $$
By the linearity of the expected value, the expected difference between scenarios continues to be $(p_{TP} - p_{FP})$, while the standard deviation is now given by:
$$ SD\left[ \frac{\sum_{i=1}^{B} SOC(D^*_i)}{B} \right] = \sqrt{\frac{B}{B^2} Var\left(SOC(D^*_i)\right)} \tag{4.11} $$
$$ = \sqrt{\frac{1}{B}} \times SD\left[SOC(D^*_i)\right], \tag{4.12} $$
where $SD[SOC(D^*_i)]$ is the standard deviation seen in expression 4.9. This last expression shows the advantage of intensive simulation in reducing the variance of the SOC statistic. For the same setting as in section 4.1 with B = 100, instead of a standard deviation of about 3 we would have an approximate standard deviation of $\sqrt{1/100} \times 3 = 3/10 = 0.3$. This will ultimately allow for a significant outlying score. In order to achieve more significance for the outlying scores we have to increase B; this value will necessarily depend on the size of the data set and on the level of significance one wants for the statistic.
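The reduction by a factor of $\sqrt{B}$ can be checked with a small simulation. The sketch below is an idealisation made for illustration: it treats the B evaluations of SOC as independent draws from the outlier-removed scenario of the worked example (90 inliers and 9 outliers), as Eq. 4.11 does, rather than resampling an actual dataset. Function names and the number of repetitions are arbitrary choices.

```python
import random
import statistics


def draw_soc(n_inliers, n_outliers, p_tp, p_fp, rng):
    """One realisation of the SOC count on a sample with the given composition."""
    hits = sum(rng.random() < p_fp for _ in range(n_inliers))
    hits += sum(rng.random() < p_tp for _ in range(n_outliers))
    return hits


def draw_soc_b(n_boot, n_inliers, n_outliers, p_tp, p_fp, rng):
    """Mean of n_boot independent SOC realisations (an idealised SOC_B)."""
    total = sum(draw_soc(n_inliers, n_outliers, p_tp, p_fp, rng) for _ in range(n_boot))
    return total / n_boot


rng = random.Random(0)
# 200 repetitions of SOC_B with B = 100, each on 90 inliers and 9 outliers.
values = [draw_soc_b(100, 90, 9, 0.8, 0.1, rng) for _ in range(200)]
sd_soc_b = statistics.stdev(values)
print(sd_soc_b)
```

The empirical standard deviation should be close to 0.3, one tenth of the single-evaluation value of about 3.09.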
Concordance as an outlier-sensitive statistic
The choice of which test statistic to use as a potential outlier-sensitive statistic will greatly influence the performance of our methods. As mentioned, the statistic we use is related to a survival model's performance; to assess it, we use Harrell's concordance c-index. The main assumption underlying our methods is that the c-index of a survival model fit on a given dataset will increase as the number of outliers in the data decreases. Behind this choice is the fact that the c-index is a rank measure, so it only measures how well predicted values are concordant with the rank-ordered response variables. For example, the c-index for two patients with predicted hazard ratios of 0.4 and 0.6 is the same as if the patients had hazard ratios of 0.1 and 0.9 (Harrell, 2001); it only measures whether the outcome is concordant with the response variables or not. Thus, unlike measures such as the sum of squared errors, one observation by itself has a limited contribution to the overall concordance. This robustness may allow for the maximization of the c-index without worrying that it is being maximized at the cost of the majority of the data, only to better fit one or a cluster of outlying observations, as can happen with the sum of squared errors, as exemplified in (Fischler and Bolles, 1981).
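To make the rank-based nature of the c-index concrete, the following is a simplified sketch of Harrell's concordance for right-censored data. It is not a reference implementation: the function name is chosen for this example, pairs with tied follow-up times are simply skipped, and ties in predicted risk count as one half.

```python
def concordance_index(times, events, risks):
    """Simplified Harrell's c-index for right-censored data.

    A pair (i, j) is usable when the shorter follow-up time ends in an observed
    event. It is concordant when the subject with the shorter time has the
    higher predicted risk; ties in predicted risk count as 1/2.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Only the ordering of the predicted risks matters, not their magnitude.
print(concordance_index([1, 2], [1, 1], [0.6, 0.4]))
print(concordance_index([1, 2], [1, 1], [0.9, 0.1]))
```

Both calls give the same value, mirroring the two-patient example in the text.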
4.2 One Step Deletion
The One Step Deletion (OSD) algorithm removes a subset of observations in order to maximize the concordance of a model fit on the remaining data. This maximization is carried out by a greedy search. At each step, every observation is temporarily removed and the corresponding concordance is computed for the model on the remaining data. The observation whose removal causes the highest improvement in concordance is then permanently eliminated from the data. Each of these eliminated observations is considered more outlying than the ones that remain in the dataset. The algorithm terminates when the number of removed observations equals a reasonable number of expected outliers. The output consists of the subset of observations that were eliminated, which are considered the most outlying ones.
Input
Besides the input dataset D and a survival model, this method has one input parameter, k, corresponding to the maximum number of outliers expected in the data. This parameter is needed since it is impossible to remove all observations: eventually the model estimation will cease to converge as the number of remaining observations gets too low.
Output
The output is the subset of the k most outlying observations according to the method. No score of outlyingness is defined within this subset. The reason is that, due to masking and swamping effects, one cannot conclude that the first observation to be removed was the most extreme outlier: more extreme outliers might have been masked by this particular observation, or the removed observation might have been swamped by extreme outliers without being an outlier itself.
Algorithm
The input parameter k determines the number of steps of the algorithm. At each step one of the observations is removed from the data and is no longer part of it in further steps. To decide which observation is removed, at each step the method removes one observation at a time and fits a Cox model on the remaining data; the observation whose removal led to the highest concordance is the one removed. In the rare case when no removal improves the concordance of the model, the algorithm terminates, returning the observations removed until then. These steps are given in detail in Algorithm 1.
Algorithm 1 One Step Deletion
Inputs: D: input dataset; Model: survival model; k: number of expected outliers.
Output: subset of removed observations.
OutlierSet = ∅ {stores the observations already removed (outliers)}
count = 0 {counts the number of observations already removed}
D_actual = D {D_actual contains the current set of remaining observations}
while count < k do
    C_0 = C(D_actual, Model) {compute the model concordance for the current set}
    ΔC_candidate = 0 {initialize the concordance variation with zero}
    d_candidate = null {start with no candidate to be removed}
    for all d_i ∈ D_actual do
        D^i_actual = D_actual \ d_i {remove observation i from the current set}
        ΔC_i = C(D^i_actual, Model) - C_0 {compute the concordance variation upon removing observation i}
        if ΔC_i > ΔC_candidate then
            ΔC_candidate = ΔC_i
            d_candidate = d_i {if the concordance improvement of i is larger than that of the previous candidate, i is now the candidate observation to be removed}
        end if
    end for
    if d_candidate ≠ null then
        OutlierSet = OutlierSet ∪ d_candidate {add the candidate to the output}
        D_actual = D_actual \ d_candidate {remove the observation from the current set}
        count++ {increment the number of removed observations}
    else
        return OutlierSet {special case when no removal improves concordance}
    end if
end while
return OutlierSet
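The greedy loop of Algorithm 1 can be sketched in Python as shown below. This is an illustration under stated assumptions, not the implementation used in this work: `concordance_fn` is a hypothetical callable that would fit the survival model on a list of observations and return its concordance, and here it is replaced by a toy score so that the example runs on its own. The extra guard that keeps at least one observation is also an addition of this sketch.

```python
def one_step_deletion(dataset, concordance_fn, k):
    """Greedy removal of up to k observations; returns them in removal order.

    concordance_fn(list_of_observations) -> float is a hypothetical callable
    that fits the survival model and returns its concordance.
    """
    remaining = list(dataset)
    removed = []
    while len(removed) < k and len(remaining) > 1:
        baseline = concordance_fn(remaining)
        best_gain, best_index = 0.0, None
        for i in range(len(remaining)):
            trial = remaining[:i] + remaining[i + 1:]
            gain = concordance_fn(trial) - baseline
            if gain > best_gain:
                best_gain, best_index = gain, i
        if best_index is None:
            break  # no single removal improves concordance
        removed.append(remaining.pop(best_index))
    return removed


def toy_score(values):
    """Stand-in for model concordance: higher when the values are less spread out."""
    return 1.0 / (1.0 + max(values) - min(values))


print(one_step_deletion([1, 2, 3, 100, -50], toy_score, k=2))
```

With the toy score, the value 100 is removed first and -50 second, since each of those removals gives the largest increase of the score at its step. Each step requires one model fit per remaining observation, so the cost is on the order of k times N fits.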
Example
In this example we apply OSD to a dataset, to illustrate how it works and to analyze its output. As survival model we chose the Cox model, with the parameter k equal to 15 (in this case 15% of the total observations). In Table 4.1 we can observe the algorithm flow. The column ΔC displays the concordance improvement at each step (a removal). The last column, C, contains the concordance c-index of the fitted model on the remaining dataset. The respective output of the algorithm is given in Table ??. The outlyingness measure used is the order in which the observations were removed, so the sooner an observation is removed, the more outlying it is considered to be. By inspecting Table 4.1 we verify that, if the outlyingness measure were ΔC, the ranking of the outliers would be different. Experimentally, sorting by order of removal has been shown to be more successful at identifying the outliers. This may be because, due to masking and swamping, the values of ΔC depend strongly on the observations already removed.
Table 4.1: Evolution of the OSD algorithm when applied to an example dataset.
Step Observations removed ΔC C
0 {} 0 0.692
1 {1} 0.0111 0.7031
2 {1,67} 0.0099 0.7130
3 {1,67,97} 0.0112 0.7242
4 {1,67,97,51} 0.0068 0.7310
5 {1,67,97,51,23} 0.0089 0.7399
6 {1,67,97,51,23,31} 0.0059 0.7458
7 {1,67,97,51,23,31,93} 0.0049 0.7507
8 {1,67,97,51,23,31,93,52} 0.0109 0.7616
9 {1,67,97,51,23,31,93,52,56} 0.0063 0.7679
10 {1,67,97,51,23,31,93,52,56,57} 0.0098 0.7777
11 {1,67,97,51,23,31,93,52,56,57,7} 0.0090 0.7867
12 {1,67,97,51,23,31,93,52,56,57,7,30} 0.0113 0.7980
13 {1,67,97,51,23,31,93,52,56,57,7,30,13} 0.0106 0.8086
14 {1,67,97,51,23,31,93,52,56,57,7,30,13,78} 0.0110 0.8196
15 {1,67,97,51,23,31,93,52,56,57,7,30,13,78,8} 0.0124 0.8320
Comments
Although we previously argued that concordance can be relatively resistant to masking and swamping, the step-wise process of removing one observation at a time can be very fragile. We can have outlying observations that are only perceived as outliers at a later stage of the process, because the observations responsible for masking them were not removed at an early stage. Similarly, for swamping, we can have non-outlying observations erroneously removed at an early stage because the model fitting was severely affected by the presence of strong outliers that were swamping those regular observations. On the other hand, if the swamping and masking characteristics are not extreme and concordance captures the effect of outliers, we have a computationally cheap and extremely simple algorithm for survival outlier detection.
4.3 Bootstrap Hypothesis Test Outlier Detection
Singh and Xie introduced the Bootlier plot (Singh and Xie, 2003), a method that uses the bootstrap resampling technique to detect outliers. As an example of their rationale, they consider the bootstrapping of the sample mean on a dataset of n real numbers containing only one outlier. The probability of having a bootstrap sample free of outliers is given by $(1 - 1/n)^n \approx 1/e \ (\approx 37\%)$ as $n \to \infty$. This means that about 37% of the bootstrap samples will be outlier-free, while the remaining 63% will contain the outlier at least once. If the outlier observation seriously affects the sample mean, the authors argue that the histogram of the B sample mean values obtained from B bootstrap samples will present a multi-modal effect. In particular, there will be two modes: one resulting from the 37% of outlier-free samples, and the other from the remaining samples, which contain the outlier at least once and whose sample mean is therefore severely contaminated. This effect is illustrated in Figure 4.2, where each large circle containing smaller circles represents a bootstrap sample. A portion of the samples do not contain the outlier, so their mean maps much lower on the histogram. The bootstrap samples containing the outlier map much higher on the histogram.
Figure 4.2: Representation of the multi-modal effect of the sample mean with one outlier in the data (red circle).
With only one outlier, the following strategy to detect outliers can be devised: 1) for each observation i, compute the histogram of the sample mean from B bootstrap samples generated from the data without observation i; 2) the observation that, when removed, does not cause a multi-modal histogram is the outlier. Unfortunately, a single-outlier setting is a very limited scenario. To overcome this limitation, we use the belief that the concordance of a survival model is an outlier-sensitive test statistic: it tends to increase in data samples with fewer outliers. Under this assumption, we remove one observation at a time, and the observations that most systematically improve concordance (when absent) are considered the most outlying ones. This method performs one hypothesis test for each observation in the dataset. The resulting p-value is assigned as the outlying measure of the observation under test. The hypothesis test for each observation i can be stated as:
$$ H_0 : C_{\mathrm{Model}, \sim \hat{F}_{-i}} \le C_{\mathrm{original}} $$
$$ H_1 : C_{\mathrm{Model}, \sim \hat{F}_{-i}} > C_{\mathrm{original}} $$
The hypothesis tests are carried out following the bootstrap approach explained in section 4. Each empirical distribution $\hat{F}_{-i}$ represents a distribution where observation $d_i$ has probability zero. Writing $\Delta C_i = C_{\mathrm{Model}, (X,T,\delta) \sim \mathrm{Data}_{-i}} - C_{\mathrm{original}}$, it is more useful to formulate the hypothesis test as:
$$ H_0 : \Delta C_i \le 0 \tag{4.13} $$
$$ H_1 : \Delta C_i > 0 \tag{4.14} $$
Input
Besides a survival model and the input dataset D, this method has one input parameter, B, corresponding to the number of bootstrap samples generated from the empirical distribution. The value of B needs to be large enough to achieve convergence of the p-values. The number of bootstrap samples needed for the output to converge has been shown to depend on the number of individuals and on the number of covariates. In our tests the value of B was iteratively increased until convergence of the p-values.
Output
The BHT method is a soft-classifier and single-step method, the output consists of an outlying score for each
observation.
Algorithm
Algorithm 2 gives the sequence of operations needed to compute each observation's p-value. First we compute the baseline concordance C_0 as the concordance of the model fit with all observations. Then, for each observation, we remove the observation under test from the data and generate B bootstrap samples from the remaining data. Following the convention of Eq. 4.1, the p-value is the proportion of samples that fall under the null hypothesis (4.13), that is, the proportion of samples whose model concordance is not higher than the baseline, so that lower p-values indicate more outlying observations.
Algorithm 2 Bootstrap Hypothesis Test
Input: D: input dataset; Model: survival model; B: number of bootstrap samples.
Output: a p-value for each observation d_i ∈ D.
C_0 = C(D, Model) {compute the baseline concordance C_0 as the concordance on the original dataset}
for all d_i ∈ D do
    D_{-i} = D \ d_i {remove observation i from the original dataset}
    From D_{-i} generate B bootstrap samples D*_1, D*_2, ..., D*_B.
    p[i] = (1/B) Σ_{j=1}^{B} I(C(D*_j, Model) ≤ C_0) {compute the p-value for observation i}
end for
return the vector of p-values p
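A possible Python sketch of the BHT loop is given below. Several choices here are assumptions of the sketch rather than statements about the original implementation: `concordance_fn` is a hypothetical callable standing in for fitting the survival model and computing its concordance, each bootstrap sample has the size of the reduced dataset, and the p-value is taken as the fraction of samples in which concordance did not increase, so that lower values indicate more outlying observations (the convention under which Table 4.2 is sorted). A toy score replaces the model so that the example is self-contained.

```python
import random


def bootstrap_hypothesis_test(dataset, concordance_fn, n_boot, seed=0):
    """One p-value per observation: the fraction of bootstrap samples, drawn
    without that observation, whose concordance does not exceed the baseline."""
    rng = random.Random(seed)
    baseline = concordance_fn(dataset)
    p_values = []
    for i in range(len(dataset)):
        without_i = dataset[:i] + dataset[i + 1:]
        not_improved = 0
        for _ in range(n_boot):
            sample = rng.choices(without_i, k=len(without_i))
            if concordance_fn(sample) <= baseline:
                not_improved += 1
        p_values.append(not_improved / n_boot)
    return p_values


def toy_score(values):
    """Stand-in for model concordance: higher when the values are less spread out."""
    return 1.0 / (1.0 + max(values) - min(values))


p = bootstrap_hypothesis_test([1, 2, 3, 4, 100], toy_score, n_boot=200)
print(p)
```

In this toy setting, removing the value 100 always improves the score, so its p-value is 0, while removing an interior value such as 2 leaves many samples that still contain both extremes and therefore gives a clearly larger p-value. The cost is N times B model fits, which is the main practical limitation of the approach.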
Example
To illustrate how the BHT method works, we present some results of applying BHT to a dataset. An example of BHT's output is presented in Table 4.2, where the observations are sorted by their p-value from the hypothesis test in expression 4.13. Figure 4.3 shows the overlapped histograms of the bootstrapped concordance variation for two different observations from the dataset, one of which, following the concordance criterion, is clearly more outlying than the other: the blue histogram corresponds to a more outlying observation than the red histogram. The blue histogram is shifted further to the right of ΔC = 0 than the red one, so the p-value of the test is lower and the observation is considered more outlying. Using the p-value as a measure of outlyingness allows us to measure how systematically the removal of an observation leads to an improvement of concordance on the remaining data.
Figure 4.3: Two histograms: outlier (blue) and inlier observation (red) produced by BHT.
Comments
This method extends the ideas of the Bootlier plot (Singh and Xie, 2003). Being a single-step procedure, its output is more flexible in terms of analysis; mainly, it allows the definition of a significance threshold above or below which an observation is considered an outlier. Being single-step, it also tends to be less sensitive to masking and swamping effects.
4.4 Dual Bootstraps Hypothesis Testing Outlier Detection
This method aims to improve the approach taken in the BHT method. In the BHT method, removing one
observation from the dataset, and then assess the impact of each removal on concordance has a undesired
e↵ect, since the model has less observations to fit (observation under test is removed), there is tendency for
the concordance to increase, this potentially introduces confusion in the hypothesis test made in BHT, in
particular it may increase the number of “false positives’. The rationale behind DBHT is to generate two
36
Table 4.2: Example of a BHT output, sorted by p values
# Observation p
1 67 0.274
2 1 0.284
3 78 0.285
4 56 0.285
5 69 0.293
6 8 0.294
7 45 0.300
8 93 0.308
9 30 0.313
10 32 0.315
11 23 0.316
12 100 0.316
13 91 0.325
14 29 0.326
15 13 0.328
histograms from two antagonistic versions of the bootstrap procedure, the poison and antidote bootstraps,
and then compare them. The antidote bootstrap excludes the observation under test from every bootstrap
sample. It can be described as:
1: Input: dataset $D$; index of observation under test $i$.
2: $D_{-i} = D \setminus d_i$.
3: Generate $B$ bootstrap samples from $D_{-i}$, each with size $N$.
4: Output: $B$ antidote bootstrap samples.
The poison bootstrap works by forcing the observation under test to be part of every bootstrap sample. The
procedure is the following:
1: Input: dataset $D$; index of observation under test $i$.
2: Generate $B$ bootstrap samples from $D_{-i}$, each with size $N - 1$.
3: Add observation $d_i$ to each bootstrap sample generated.
4: Output: $B$ poison bootstrap samples.
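The two resampling procedures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: observations are represented by their integer indices, and the function names `antidote_bootstrap` and `poison_bootstrap` are hypothetical, not taken from the thesis implementation.

```python
import random

def antidote_bootstrap(n, i, B, rng):
    """B samples of size n, drawn with replacement from all indices except i."""
    pool = [j for j in range(n) if j != i]
    return [rng.choices(pool, k=n) for _ in range(B)]

def poison_bootstrap(n, i, B, rng):
    """B samples of size n: n - 1 draws from the indices except i, plus i itself."""
    pool = [j for j in range(n) if j != i]
    return [rng.choices(pool, k=n - 1) + [i] for _ in range(B)]

# Toy usage: 10 observations, observation 3 under test, 5 samples of each kind.
rng = random.Random(0)
ant = antidote_bootstrap(n=10, i=3, B=5, rng=rng)
psn = poison_bootstrap(n=10, i=3, B=5, rng=rng)
```

Each index list would then be used to select rows of the dataset before fitting the survival model; in this sketch the observation under test appears exactly once in every poison sample and never in an antidote sample.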
Using these two bootstrap variants, the strategy is the following: for each observation $i$ we hypothesize
that the observation is "poison" (meaning the observation is an outlier). To test this, we compare the
histograms of the concordance variation $\Delta C$ between the antidote and poison bootstraps. If the
observation is an outlier, we expect the antidote bootstrap to push the histogram toward higher values of
$\Delta C$, since that outlier is always absent from the samples. Additionally, we expect the poison bootstrap
to generate lower values of $\Delta C$, since every sample contains the outlier. The more the poison histogram is to the
left of the antidote histogram, the more outlying the observation is considered to be. We consider
$\Delta C_{antidote}$ and $\Delta C_{poison}$ as two real random variables and we test the following hypotheses:

$$H_0 : E[\Delta C_{antidote}] > E[\Delta C_{poison}]; \tag{4.15}$$
$$H_1 : E[\Delta C_{antidote}] \leq E[\Delta C_{poison}]. \tag{4.16}$$

To calculate the p-value of the test we use an independent two-sample t test with unequal variances, as described
in (Rajagopalan, 2006) ("test for equality of population means with known equal variances").
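As a rough illustration of the unequal-variance (Welch) form of the two-sample t statistic, the sketch below uses only the Python standard library. The names `welch_t` and `upper_tail_p` are hypothetical. Two assumptions are made that the source does not confirm: the tail probability uses a normal approximation, which is only reasonable when the degrees of freedom are large (as with $B$ in the hundreds or thousands), and the upper tail is reported, whereas the appropriate tail depends on how the hypotheses are oriented. The toy values of `ant` and `psn` are invented for the example.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom for mean(a) - mean(b)."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def upper_tail_p(t):
    """Normal approximation to P(T >= t); an assumption valid only for large df."""
    return 1.0 - NormalDist().cdf(t)

# Invented concordance variations for the antidote and poison bootstraps.
ant = [0.02, 0.03, 0.01, 0.04]
psn = [-0.01, 0.00, -0.02, 0.01]
t_stat, df = welch_t(ant, psn)
p = upper_tail_p(t_stat)
```

For small samples the exact t distribution with `df` degrees of freedom should replace the normal approximation.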
Input
Similar to the BHT method, besides a survival model and the input dataset $D$, DBHT takes only one input
parameter: $B$, the number of bootstrap samples used in the antidote and poison bootstrap procedures.
Output
The DBHT method is a soft-classifier and a single-step method. Thus the output is an outlying measure for each
observation, from which one can extract the $k$ most outlying observations.
Algorithm
Algorithm 3 Dual Bootstraps Hypothesis Test
Input: $D$: input dataset; Model: survival model; $B$: number of bootstrap samples.
Output: a p-value for each observation.
for all $d_i \in D$ do
    $D_{-i} = D \setminus d_i$ {remove observation $i$ from the original dataset}
    Generate $B$ poison bootstrap samples.
    Generate $B$ antidote bootstrap samples from $D_{-i}$.
    Compute the $B$ values of $\Delta C_{poison}$ and store them in vector psn.
    Compute the $B$ values of $\Delta C_{antidote}$ and store them in vector ant.
    From psn and ant compute the p-value using a t test for equality of means.
end for
return the vector of p-values $p$
Example
In Figure 4.4 we have the poison and antidote histograms of observation 1, overlapped and clearly apart
from each other, as confirmed by the low p-value of the test. In contrast, Figure 4.5 shows
the histograms for observation 82: there is no clear distinction between the two histograms, which indicates
that this observation has little influence on the concordance of a survival model and therefore does not
appear to be an outlier.
Figure 4.4: Contrast between antidote (blue) and poison (red) bootstrap histograms of concordance variation
— for a typical outlier.
Figure 4.5: Antidote (blue) and poison (red) bootstrap histograms of concordance variation — for a typical
inlier.
Chapter 5
Results
5.1 Goals
In this section we assess the performance of the developed methods (OSD, BHT, and DBHT) on several
datasets. To compare their performance we also apply to the same datasets some of the alternative
methods mentioned in Chapter 3, in particular martingale residuals (MART), deviance residuals
(DEV), the likelihood displacement statistic (LD), and DFBETAS (DFB). Two types of datasets were used:
1) artificial datasets, which simulate a set of survival observations containing outliers; and 2) real clinical
datasets. For the artificial datasets, since we know in advance which observations are outliers, we can assess
each method's performance at outlier identification. Our goal is to check whether our concordance-based methods
can match or even surpass the alternative methods on the simulated scenarios. We
analyze how each method's performance evolves under different conditions, such as the amount of censoring
and the quantity of outliers, and whether it is affected by the type of baseline hazard used to simulate the data.
Another goal is to study the behavior of parameter $B$, the number of bootstrap samples, for the methods BHT
and DBHT, more specifically the relation between $B$ and the method's performance. On the real datasets
containing clinical data, we study our approach to robust Cox regression based on trimming a certain
quantity of outliers from the original data. Given that the breakdown point of the Cox estimation
process is the lowest possible (see section 3), if the methods are accurate enough we may exclude observations
that could have been distorting the model estimation when all observations were included in the fit. We call
this process outlier trimming. Outlier trimming consists of removing a certain level of outliers, for example
the 5% or 10% most outlying observations, in order to fit the model on the remaining uncontaminated
dataset. Based on this, we compare the cross-validated c-index of a Cox model when fitting the model
with all observations, and when trimming the top-3, the top-10, and finally the top-30 outliers.
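The trimming step itself is simple to express in code. The sketch below assumes that each method returns one score per observation in which lower values mean more outlying (as with the p-values of BHT and DBHT); the function name `trim_top_k` and the toy p-values are hypothetical.

```python
def trim_top_k(p_values, k):
    """Return the indices kept after removing the k observations with the lowest p-values."""
    ranked = sorted(range(len(p_values)), key=lambda j: p_values[j])
    removed = set(ranked[:k])
    return [j for j in range(len(p_values)) if j not in removed]

# Toy usage: five observations, trim the two most outlying ones.
p_values = [0.50, 0.01, 0.30, 0.02, 0.90]
kept = trim_top_k(p_values, k=2)
```

The model would then be refit on the rows indexed by `kept`. Passing an explicit count `k` avoids committing to a rounding convention when a percentage of the dataset is not a whole number of observations.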
5.2 Simulated Data
Before generating the simulated data, we have to define what an outlier is in our simulation study. An
observation is considered outlying with regard to the relationship between the vector of covariates $X_i$
and $(T_i, \delta_i)$.
Our rationale is based on the fact that in general we cannot judge the outlyingness of an observation
only by looking at its covariates. Having unusual covariates is not enough to be considered an outlier; instead
we focus on unusual behavior. This relationship can be expressed by a survival model such as the Cox
proportional hazards model, which belongs to the class of generalized linear models (GLM). In a survival context,
GLM models describe the hazard of an individual by a function of a linear combination of its covariates and
time. The hazard function in a generalized linear model is described as:

$$h(t, X) = g(t, \beta X); \tag{5.1}$$

for the Cox model, $g$ is proportional to the exponential of $\beta X$. This type of model is completely
characterized by its parameter vector $\beta$, which defines the direction of hazard, meaning it describes the
effects of the covariates on survival time. We consider an outlying observation to be one generated by a model
very distinct from the model generating the large majority of the data. The more distinct the model is, the more
outlying we can consider the observation to be. To evaluate the distinction between the models generating
each observation, we only have to look at the $\beta$ parameters. In Figure 5.1 we have the example of a
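Figure 5.1: A 2-D example of a general trend $\beta_G$ with examples of outlier sources. [Plot with axes $\beta_1$ and $\beta_2$ showing the vectors $\beta_G$, $\beta'$, $\beta''$, $\beta'''$, $\beta''''$ and the angle $\Theta'''$.]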
pure model $\beta_G$ that generates the large majority of the data; the remaining vectors are models that produce
outlying observations. Each vector represents a two-dimensional $\beta$ parameter of a GLM model. Looking at
each one separately, $\beta'$ is similar in direction to the general trend but shorter, meaning the individuals
will have lower hazard for the same covariates; $\beta''$ is also very similar in direction but has a larger norm,
which means the same covariates will produce a higher hazard; $\beta'''$ points in the opposite direction of the general
trend, so the effects of the two covariates are the opposite of those in the general trend model. Other
variants are models that are not opposite but have a negative dot product with the general trend, like $\beta''''$.
We define two measures of outlyingness, both relative to the general trend model $\beta_G$: the discrepancy
between norms $|\beta|/|\beta'|$ and the value of $\cos\theta$. The lower the value of $\cos\theta$, the more opposite
the effect of the covariates is in relation to the general trend. We can consider, for example, an outlier to be
a person whose response to a drug is opposite to that of the vast majority: if in general the administration of
a certain drug decreases the patient's hazard, for this outlier it will increase it. The difference in norms aims
to capture the discrepancy in the magnitude of the hazards for the same covariates. Continuing with the drug
example, for different norms, if the administration of a certain drug decreases the hazard, for the outlying
patient it will also decrease it, but by a much lower or higher quantity than the general trend. In our results
section we generate outliers by varying both the angle and the magnitude of the $\beta$ parameters of the outlying
models. When measuring the performance of outlier detection methods on the simulated datasets, we have to take
into account that the observations are randomly generated from distributions: the inliers from the general
distribution $\beta_G = (1, 1, 1)$, and the outliers from an outlying distribution $\beta'$. It may happen that
observations initially intended to be inliers are drawn from the lower or upper tail of the distribution and
behave as outliers. For the same reason, observations generated from the outlying distribution may behave as
inliers. Our analysis of performance assumes that, for each scenario, the observations generated from the general
distribution are inliers and the observations generated from the outlying distribution are outliers. To assess
performance we use two measures: the True Positive Rate (TPR), also known as sensitivity, and the area under
the ROC curve (AUC). For datasets with $k$ outliers, the TPR measures for each scenario the fraction of true
outliers found among the top-$k$ most outlying observations indicated by each method. The AUC provides a
threshold-independent measure of outlier detection ability. The AUC is not applicable to the output of the OSD
method, because OSD does not provide an outlying score for every observation. The TPR and AUC are computed on
50 random datasets per simulation configuration, and we then take the mean and standard deviation of the
metrics.
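The two performance measures can be sketched as follows. This is a minimal illustration, assuming each method yields a score per observation in which higher means more outlying (for p-value based methods one could use, for instance, one minus the p-value); the function names `top_k_tpr` and `auc` and the toy data are hypothetical. The AUC is computed through its rank interpretation: the probability that a randomly chosen outlier scores above a randomly chosen inlier.

```python
def top_k_tpr(scores, is_outlier):
    """Fraction of true outliers among the k highest scores, with k the number of true outliers."""
    k = sum(is_outlier)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sum(is_outlier[j] for j in ranked[:k]) / k

def auc(scores, is_outlier):
    """Probability that a random outlier scores above a random inlier (ties count 1/2)."""
    pos = [s for s, o in zip(scores, is_outlier) if o]
    neg = [s for s, o in zip(scores, is_outlier) if not o]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy usage: five observations, two of them true outliers.
scores = [0.9, 0.1, 0.8, 0.4, 0.3]
is_outlier = [1, 0, 0, 1, 0]
tpr_value = top_k_tpr(scores, is_outlier)
auc_value = auc(scores, is_outlier)
```

The pairwise AUC above is quadratic in the number of observations, which is unproblematic for datasets of 100 observations.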
Generating survival observations from a Cox model
The model chosen to recreate survival times was the Cox proportional hazards model. The simulated observations
are generated from two different Cox models, a general trend model $\beta_G$ and an outlier model $\beta'$. From
the Cox hazard function, the distribution of $T$ is given by:

$$F(t|X) = 1 - \exp\left[-H_0(t) \times \exp(\beta X)\right];$$

the vector of covariates $X$ characterizing each individual is generated from a three-dimensional normal
distribution with zero mean and identity covariance matrix. The survival times are generated using the
methodology explained in (Bender et al., 2005); each observation time, as a function of the covariate vector $X$,
is given by:

$$T = H_0^{-1}\left[-\log(U) \times \exp(-\beta X)\right], \tag{5.2}$$
where $U$ is a random variable uniformly distributed on the interval $[0, 1]$. Before being able to generate
survival times from the hazard functions produced by each Cox model, we need to define the baseline hazard
function $h_0(t)$. Our choice is the Weibull function, characterized by two parameters: scale $\lambda$ and
shape $\nu$. For $T \sim \mathrm{Weibull}(\lambda, \nu)$ the corresponding baseline hazard is:

$$h_{Weibull}(t) = \lambda \nu t^{\nu - 1}$$

The inverse of the corresponding cumulative hazard function is given by:

$$H_0^{-1}(t) = (\lambda^{-1} t)^{1/\nu}; \tag{5.3}$$

inserting this inverse cumulative baseline hazard function in equation 5.2 we have:

$$T = \left(\lambda^{-1}\left[-\log(U) \times \exp(-\beta X)\right]\right)^{1/\nu} = \left(\frac{-\log(U)}{\lambda \times \exp(\beta X)}\right)^{1/\nu}.$$
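The closed form above translates directly into an inverse-transform sampler. The sketch below is illustrative only: the function name `cox_weibull_time` is hypothetical, $\beta X$ is taken to be the dot product of the coefficient and covariate vectors, and the uniform draw is taken from $(0, 1]$ so that the logarithm is always defined.

```python
import math
import random

def cox_weibull_time(x, beta, lam, nu, u):
    """Survival time from a Cox model with Weibull baseline (scale lam, shape nu), for u in (0, 1]."""
    lin = sum(b * xi for b, xi in zip(beta, x))  # linear predictor beta . x
    return (-math.log(u) / (lam * math.exp(lin))) ** (1.0 / nu)

# Toy usage: one individual with three standard normal covariates and beta_G = (1, 1, 1).
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(3)]
t = cox_weibull_time(x, beta=(1.0, 1.0, 1.0), lam=1.0, nu=1.0, u=1.0 - rng.random())
```

A quick sanity check is that plugging the generated time back into the survival function $\exp[-\lambda t^{\nu} \exp(\beta X)]$ returns the uniform draw that produced it.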
In order to recreate random censoring we generate event indicators as:

$$\delta_i \sim \mathrm{Bernoulli}(p = c). \tag{5.4}$$
Several scenarios are simulated. For each one, the vector of covariates is given by $X_i \sim N(0, I)$, where $I$
is the identity matrix. Each simulated dataset contains 100 observations with hazard functions given by:

$$h_i(t) = \begin{cases} h_0(t) \exp\{\beta_G X\}, & 1 \leq i \leq n - k; \\ h_0(t) \exp\{\beta' X\}, & n - k < i \leq n; \end{cases} \tag{5.5}$$

where $\beta_G$, the pure model, is always equal to $(1, 1, 1)$; the twelve different vectors simulated for $\beta'$
can be seen in Table 5.1. Concerning censoring, we experiment with scenarios with proportions $c = 0.2$ and
$c = 0.3$ of censored observations. Regarding the outliers, levels of $k = 5$ and $k = 10$ are simulated.
When generating the observations, three types of baseline hazard $h_0(t)$ are used. The baselines correspond
to a baseline survival function that follows a Weibull distribution; for its scale $\lambda$ and shape
$\nu$ parameters, three configurations were used, corresponding to constant, strictly decreasing, and strictly
increasing hazard functions, represented in Figure 5.2. The following dimension values were considered in
our simulation:
Levels of censoring: $c = 0.2$ and $c = 0.3$.
Outlier amounts: $k = 5$ and $k = 10$.
Weibull baseline hazards with parameters: $(\lambda = 1, \nu = 1)$, $(\lambda = 1.5, \nu = 0.5)$, $(\lambda = 0.5, \nu = 1.5)$.
The methods use the Cox proportional hazards model as the survival model; OSD is parametrized with $k = 10$, and
for DBHT and BHT the number of bootstrap samples used was $B = 1000$.
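The three Weibull configurations listed above can be checked numerically against the baseline hazard $h_0(t) = \lambda \nu t^{\nu - 1}$. The following sketch evaluates the hazard on a small grid of time points; the function name `weibull_hazard` and the grid are illustrative choices, not part of the simulation setup.

```python
def weibull_hazard(t, lam, nu):
    """Weibull baseline hazard lam * nu * t**(nu - 1), for t > 0."""
    return lam * nu * t ** (nu - 1.0)

# (scale, shape) pairs used in the simulation, labeled by the shape of the hazard.
configs = {"constant": (1.0, 1.0), "decreasing": (1.5, 0.5), "increasing": (0.5, 1.5)}
grid = [0.5, 1.0, 2.0, 4.0]
curves = {name: [weibull_hazard(t, lam, nu) for t in grid]
          for name, (lam, nu) in configs.items()}
```

With shape $\nu = 1$ the hazard is flat, with $\nu < 1$ it decreases in $t$, and with $\nu > 1$ it increases, which matches the three cases described in the text.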
Table 5.1: The different outlier configurations used in the simulation data. The pure model is $\beta_G = (1, 1, 1)$.
#    $\Theta'$ (degrees)    $\|\beta'\|/\|\beta_G\|$    $\beta'$
1    180    1      (-1, -1, -1)
2    180    0.2    (-0.2, -0.2, -0.2)
3    180    5      (-5, -5, -5)
4    135    0.2    (-0.143, 0, -0.283)
5    135    5      (-3.6, 0, -7.07)
6    90     0.2    (-0.245, 0, -0.245)
7    90     5      (6.12, 0, -6.12)
8    0      0.2    (0.2, 0.2, 0.2)
9    0      5      (5, 5, 5)
10   180    10     (-10, -10, -10)
11   0      10     (10, 10, 10)
12   135    10     (-7.15, 0, -14.15)
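The two quantities tabulated above, the angle to the general trend and the ratio of norms, follow from elementary vector operations. The sketch below computes $\cos\theta$ and $\|\beta'\|/\|\beta_G\|$ for two of the listed configurations; the function name `direction_and_scale` is hypothetical.

```python
import math

def direction_and_scale(beta_out, beta_gen):
    """Return (cos of the angle between the vectors, ratio of norms outlier / general)."""
    dot = sum(a * b for a, b in zip(beta_out, beta_gen))
    norm_out = math.sqrt(sum(a * a for a in beta_out))
    norm_gen = math.sqrt(sum(b * b for b in beta_gen))
    return dot / (norm_out * norm_gen), norm_out / norm_gen

beta_g = (1.0, 1.0, 1.0)
cos1, ratio1 = direction_and_scale((-1.0, -1.0, -1.0), beta_g)  # scenario 1
cos7, ratio7 = direction_and_scale((6.12, 0.0, -6.12), beta_g)  # scenario 7
```

Scenario 1 gives a cosine of $-1$ (opposite direction, same magnitude), while scenario 7 gives a cosine of zero (orthogonal to the general trend) with a norm ratio close to 5.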
Figure 5.2: The three types of Weibull baseline hazards used, each plotted as $h(t)$ against $t$: $\lambda = 1$, $\nu = 1$ (blue); $\lambda = 1.5$, $\nu = 0.5$ (orange); $\lambda = 0.5$, $\nu = 1.5$ (red).
Simulation results
The average TPR and AUC for each scenario are displayed in Tables 5.2 and 5.3, respectively; more detailed
results of the simulation can be seen in appendix I 6. The highest value for each scenario is marked in bold.
We observe that, for both the TPR and AUC, the DBHT method attains the best performance in 9
of the 12 different outlier scenarios. Also worth noticing is the very poor performance in scenarios 9 and 11;
these are the only two scenarios where the angle between $\beta_G$ and $\beta'$ is zero and
the magnitude (thus the hazard) is higher than the general trend (see Table 5.1). One possible explanation
for the low performance is that this type of outlier may help point
the model in the right direction, and can possibly mitigate the effect of other outliers originating from the
general distribution. Additionally, as explained previously, concordance is a rank coefficient, so differences
in magnitude are not as easily captured as differences in angle, as indicated by the overall results (Tables 5.3
and 5.2).
Table 5.2: Average of TPR grouped by outlier scenarios.
Scenario # MART DEV LD DFB OSD BHT DBHT
1 0.29 0.36 0.43 0.36 0.47 0.43 0.47
2 0.22 0.25 0.31 0.29 0.32 0.31 0.34
3 0.50 0.58 0.59 0.52 0.63 0.59 0.65
4 0.22 0.23 0.30 0.28 0.30 0.29 0.32
5 0.44 0.54 0.52 0.48 0.58 0.53 0.58
6 0.21 0.22 0.28 0.26 0.27 0.26 0.28
7 0.40 0.50 0.40 0.41 0.44 0.37 0.42
8 0.18 0.18 0.23 0.22 0.22 0.20 0.23
9 0.32 0.36 0.18 0.25 0.09 0.06 0.07
10 0.53 0.63 0.64 0.57 0.68 0.60 0.70
11 0.38 0.46 0.24 0.32 0.14 0.11 0.12
12 0.49 0.60 0.54 0.51 0.60 0.52 0.60
Scenario sensitivity analysis
Here we analyze how the performance of the outlier detection methods behaves under different
simulation conditions. For each of the 12 outlier scenarios we break down the averaged values shown in
Tables 5.2 and 5.3 by each simulation dimension: outlier amount, level of censoring, and baseline hazard
type. The methods' performance split by the two values of $k$ is presented in Tables 5.4, 5.5, 5.6, and 5.7.
Among the alternative methods, martingale residuals (MART) is the only one that presents a consistent
TPR increase when going from 5 to 10 outliers in the data; the TPR of our proposed method
also increases consistently when going from 5 to 10 outliers. Regarding AUC, there is an overall tendency to
decrease as the outlier level goes from 5 to 10. The two metrics point in different directions regarding the
relation between censoring and performance.
Table 5.3: Average of AUC grouped by outlier scenarios.
Scenario # MART DEV LD DFB BHT DBHT
1 0.70 0.70 0.74 0.68 0.78 0.82
2 0.65 0.65 0.70 0.64 0.71 0.75
3 0.80 0.80 0.78 0.77 0.86 0.90
4 0.64 0.64 0.69 0.63 0.71 0.73
5 0.78 0.77 0.74 0.75 0.82 0.84
6 0.63 0.63 0.67 0.63 0.68 0.71
7 0.76 0.76 0.66 0.73 0.70 0.72
8 0.62 0.62 0.66 0.62 0.65 0.68
9 0.74 0.72 0.61 0.69 0.60 0.60
10 0.83 0.83 0.80 0.81 0.87 0.92
11 0.78 0.76 0.61 0.73 0.59 0.61
12 0.80 0.80 0.74 0.78 0.81 0.86
Table 5.5: Average TPR of the proposed methods grouped by outlier scenario and outlier amount k.
OSD BHT DBHT
k = 5 k = 10 k = 5 k = 10 k = 5 k = 10
1 0.43 0.51 0.40 0.46 0.43 0.51
2 0.27 0.37 0.27 0.35 0.29 0.38
3 0.59 0.67 0.58 0.59 0.63 0.67
4 0.25 0.34 0.26 0.32 0.28 0.36
5 0.57 0.60 0.54 0.52 0.57 0.58
6 0.24 0.31 0.23 0.29 0.25 0.32
7 0.41 0.47 0.37 0.37 0.39 0.44
8 0.17 0.26 0.16 0.24 0.18 0.28
9 0.05 0.13 0.04 0.09 0.04 0.09
10 0.66 0.70 0.60 0.60 0.69 0.71
11 0.10 0.18 0.08 0.13 0.08 0.16
12 0.57 0.63 0.53 0.51 0.59 0.61
Table 5.4: Average TPR of the alternative methods grouped by outlier scenario and outlier amount k.
MART DEV LD DFB
k = 5 k = 10 k = 5 k = 10 k = 5 k = 10 k = 5 k = 10
1 0.26 0.31 0.35 0.36 0.41 0.45 0.36 0.37
2 0.19 0.24 0.22 0.28 0.28 0.35 0.27 0.30
3 0.45 0.55 0.58 0.57 0.61 0.56 0.54 0.50
4 0.20 0.24 0.22 0.25 0.27 0.33 0.26 0.31
5 0.38 0.50 0.56 0.52 0.56 0.48 0.50 0.47
6 0.20 0.22 0.19 0.25 0.26 0.31 0.25 0.28
7 0.34 0.45 0.50 0.51 0.41 0.38 0.42 0.41
8 0.16 0.19 0.15 0.21 0.20 0.27 0.20 0.25
9 0.29 0.35 0.35 0.38 0.16 0.19 0.22 0.28
10 0.48 0.58 0.65 0.62 0.68 0.59 0.58 0.56
11 0.36 0.41 0.45 0.48 0.22 0.26 0.29 0.35
12 0.44 0.54 0.62 0.58 0.58 0.50 0.52 0.50
Table 5.6: Average AUC of the alternative methods grouped by outlier scenario and outlier amount k.
MART DEV LD DFB
k = 5 k = 10 k = 5 k = 10 k = 5 k = 10 k = 5 k = 10
1 0.70 0.69 0.70 0.69 0.74 0.74 0.69 0.66
2 0.66 0.63 0.66 0.63 0.70 0.70 0.65 0.62
3 0.82 0.79 0.82 0.78 0.81 0.81 0.79 0.75
4 0.65 0.63 0.65 0.62 0.70 0.70 0.64 0.62
5 0.78 0.77 0.79 0.76 0.78 0.78 0.77 0.74
6 0.65 0.62 0.65 0.62 0.68 0.68 0.65 0.61
7 0.77 0.75 0.77 0.75 0.70 0.70 0.74 0.71
8 0.63 0.61 0.63 0.61 0.66 0.66 0.63 0.60
9 0.74 0.74 0.73 0.71 0.63 0.63 0.70 0.67
10 0.84 0.82 0.85 0.82 0.84 0.84 0.82 0.79
11 0.78 0.78 0.77 0.76 0.64 0.64 0.73 0.73
12 0.81 0.79 0.82 0.78 0.78 0.78 0.79 0.76
Table 5.7: Average AUC of proposed methods by outlier scenario and outlier amount k.
BHT DBHT
k = 5 k = 10 k = 5 k = 10
1 0.77 0.79 0.81 0.82
2 0.71 0.72 0.74 0.75
3 0.87 0.85 0.90 0.89
4 0.72 0.71 0.73 0.73
5 0.83 0.81 0.85 0.83
6 0.69 0.67 0.71 0.70
7 0.72 0.68 0.73 0.71
8 0.66 0.64 0.68 0.67
9 0.62 0.58 0.61 0.58
10 0.88 0.86 0.92 0.91
11 0.61 0.58 0.62 0.59
12 0.83 0.80 0.87 0.85
Concerning the level of censoring, Tables 5.8, 5.9, 5.10, and 5.11 display the performance metrics for
each method, split by censoring levels $c = 0.2$ and $c = 0.3$. In terms of TPR, the alternative methods
show a slight decrease in performance when the censoring level is increased from 0.2 to 0.3, and the proposed
methods present a similar behavior. Analyzing the changes in AUC when going from $c = 0.2$ to $c = 0.3$, we
verify that all alternative methods present a significant drop in performance, while the proposed methods
experience only a small decrease, similar to the decrease in TPR.
Table 5.8: Average TPR of the alternative methods grouped by outlier scenario and level of censoring.
MART DEV LD DFB
c = 0.2 c = 0.3 c = 0.2 c = 0.3 c = 0.2 c = 0.3 c = 0.2 c = 0.3
1 0.29 0.29 0.36 0.35 0.44 0.42 0.36 0.37
2 0.22 0.22 0.26 0.24 0.34 0.29 0.29 0.28
3 0.49 0.50 0.60 0.55 0.63 0.54 0.53 0.51
4 0.21 0.22 0.24 0.23 0.31 0.29 0.27 0.29
5 0.44 0.44 0.56 0.52 0.56 0.49 0.50 0.47
6 0.22 0.20 0.24 0.21 0.30 0.26 0.28 0.25
7 0.39 0.41 0.52 0.49 0.41 0.38 0.42 0.41
8 0.17 0.18 0.16 0.19 0.23 0.23 0.21 0.23
9 0.31 0.33 0.39 0.34 0.19 0.16 0.26 0.24
10 0.53 0.53 0.65 0.62 0.68 0.59 0.57 0.56
11 0.39 0.37 0.48 0.45 0.25 0.24 0.33 0.30
12 0.49 0.49 0.63 0.57 0.59 0.49 0.53 0.50
Table 5.9: Average TPR of the proposed methods grouped by outlier scenario and level of censoring c.
OSD BHT DBHT
c = 0.2 c = 0.3 c = 0.2 c = 0.3 c = 0.2 c = 0.3
1 0.48 0.47 0.44 0.42 0.47 0.47
2 0.33 0.31 0.32 0.30 0.35 0.32
3 0.65 0.61 0.62 0.55 0.67 0.63
4 0.30 0.30 0.30 0.28 0.32 0.31
5 0.61 0.55 0.56 0.51 0.60 0.55
6 0.29 0.25 0.29 0.23 0.30 0.26
7 0.45 0.43 0.39 0.36 0.43 0.41
8 0.20 0.23 0.18 0.21 0.22 0.24
9 0.09 0.09 0.06 0.07 0.06 0.07
10 0.70 0.66 0.61 0.59 0.72 0.68
11 0.14 0.14 0.10 0.11 0.12 0.12
12 0.63 0.57 0.54 0.50 0.63 0.57
Table 5.10: Average AUC of the alternative methods grouped by outlier scenario and censoring amount c.
MART DEV LD DFB
c = 0.2 c = 0.3 c = 0.2 c = 0.3 c = 0.2 c = 0.3 c = 0.2 c = 0.3
1 0.70 0.69 0.70 0.69 0.75 0.73 0.69 0.67
2 0.66 0.64 0.66 0.63 0.72 0.67 0.64 0.63
3 0.82 0.79 0.83 0.78 0.82 0.74 0.78 0.76
4 0.64 0.64 0.64 0.63 0.70 0.68 0.63 0.63
5 0.80 0.76 0.80 0.75 0.77 0.71 0.77 0.74
6 0.64 0.62 0.64 0.63 0.70 0.65 0.64 0.62
7 0.78 0.74 0.78 0.74 0.68 0.65 0.74 0.72
8 0.61 0.63 0.61 0.62 0.67 0.64 0.61 0.63
9 0.76 0.72 0.75 0.69 0.63 0.59 0.71 0.66
10 0.85 0.81 0.85 0.81 0.84 0.76 0.82 0.79
11 0.80 0.76 0.79 0.74 0.62 0.61 0.75 0.72
12 0.84 0.77 0.84 0.76 0.79 0.70 0.81 0.74
Table 5.11: Average AUC of proposed methods by outlier scenario and censoring amount c.
BHT DBHT
c = 0.2 c = 0.3 c = 0.2 c = 0.3
1 0.79 0.77 0.82 0.81
2 0.72 0.70 0.75 0.74
3 0.88 0.84 0.91 0.89
4 0.72 0.70 0.74 0.72
5 0.83 0.80 0.86 0.83
6 0.70 0.66 0.72 0.69
7 0.71 0.69 0.73 0.71
8 0.64 0.65 0.67 0.68
9 0.60 0.60 0.60 0.59
10 0.88 0.86 0.92 0.91
11 0.60 0.59 0.61 0.60
12 0.83 0.80 0.88 0.84
Analyzing the effects of different baseline hazards on performance, the simulation results grouped by type
of baseline hazard are shown in Tables 5.12, 5.13, 5.14, and 5.15. In terms of TPR, neither the alternative nor
the proposed methods experienced significant changes between the three types of baseline hazard. Regarding
AUC the behavior is similar, with no significant changes detected for any of the methods between types of
baseline hazard.
Table 5.12: Average TPR of the alternative methods by outlier scenario and baseline hazard $(\lambda, \nu)$.
MART DEV LD DFB
(1, 1) (0.5, 1.5) (1.5, 0.5) (1, 1) (0.5, 1.5) (1.5, 0.5) (1, 1) (0.5, 1.5) (1.5, 0.5) (1, 1) (0.5, 1.5) (1.5, 0.5)
1 0.29 0.28 0.29 0.36 0.35 0.37 0.44 0.42 0.43 0.36 0.36 0.38
2 0.22 0.22 0.21 0.25 0.26 0.24 0.32 0.31 0.31 0.29 0.30 0.27
3 0.51 0.51 0.47 0.59 0.57 0.57 0.59 0.59 0.57 0.53 0.52 0.51
4 0.23 0.23 0.20 0.23 0.24 0.23 0.30 0.31 0.29 0.28 0.29 0.27
5 0.45 0.43 0.44 0.52 0.56 0.54 0.50 0.55 0.52 0.48 0.50 0.47
6 0.20 0.21 0.21 0.23 0.23 0.22 0.29 0.29 0.27 0.27 0.26 0.26
7 0.40 0.40 0.40 0.51 0.51 0.50 0.38 0.40 0.40 0.40 0.42 0.42
8 0.18 0.17 0.18 0.17 0.19 0.17 0.23 0.24 0.22 0.23 0.23 0.21
9 0.32 0.31 0.33 0.37 0.37 0.35 0.17 0.18 0.17 0.24 0.24 0.27
10 0.55 0.52 0.52 0.62 0.63 0.65 0.65 0.62 0.63 0.57 0.55 0.57
11 0.39 0.38 0.38 0.47 0.45 0.48 0.24 0.23 0.25 0.32 0.30 0.33
12 0.48 0.49 0.49 0.59 0.60 0.61 0.53 0.54 0.54 0.49 0.51 0.53
Table 5.13: Average TPR of the proposed methods by outlier scenario and baseline hazard $(\lambda, \nu)$.
OSD BHT DBHT
(1, 1) (0.5, 1.5) (1.5, 0.5) (1, 1) (0.5, 1.5) (1.5, 0.5) (1, 1) (0.5, 1.5) (1.5, 0.5)
1 0.47 0.46 0.49 0.44 0.43 0.43 0.48 0.46 0.47
2 0.33 0.33 0.30 0.32 0.31 0.30 0.34 0.34 0.33
3 0.65 0.63 0.60 0.60 0.57 0.58 0.67 0.65 0.62
4 0.30 0.30 0.30 0.30 0.28 0.28 0.32 0.33 0.30
5 0.56 0.60 0.59 0.51 0.55 0.54 0.55 0.59 0.58
6 0.27 0.28 0.27 0.27 0.27 0.24 0.29 0.29 0.27
7 0.45 0.43 0.44 0.38 0.37 0.36 0.42 0.41 0.43
8 0.22 0.22 0.20 0.21 0.20 0.18 0.24 0.24 0.20
9 0.09 0.10 0.08 0.06 0.07 0.06 0.07 0.07 0.06
10 0.67 0.69 0.69 0.61 0.60 0.59 0.70 0.70 0.70
11 0.12 0.13 0.16 0.10 0.09 0.12 0.11 0.12 0.13
12 0.58 0.62 0.60 0.53 0.53 0.51 0.59 0.61 0.60
Table 5.14: Average AUC of alternative methods by outlier scenario and baseline hazard type $(\lambda, \nu)$.
MART DEV LD DFB
(1, 1) (0.5, 1.5) (1.5, 0.5) (1, 1) (0.5, 1.5) (1.5, 0.5) (1, 1) (0.5, 1.5) (1.5, 0.5) (1, 1) (0.5, 1.5) (1.5, 0.5)
1 0.70 0.70 0.70 0.70 0.70 0.69 0.74 0.73 0.74 0.67 0.69 0.68
2 0.65 0.64 0.65 0.65 0.65 0.64 0.69 0.69 0.70 0.65 0.63 0.64
3 0.80 0.81 0.80 0.80 0.80 0.80 0.78 0.79 0.78 0.77 0.77 0.77
4 0.63 0.65 0.63 0.64 0.65 0.62 0.69 0.70 0.68 0.62 0.65 0.62
5 0.77 0.79 0.78 0.76 0.79 0.77 0.73 0.77 0.73 0.75 0.76 0.75
6 0.64 0.63 0.63 0.64 0.63 0.63 0.69 0.67 0.67 0.63 0.63 0.62
7 0.76 0.77 0.76 0.76 0.77 0.76 0.66 0.67 0.66 0.72 0.74 0.73
8 0.62 0.62 0.62 0.62 0.62 0.62 0.66 0.65 0.65 0.61 0.62 0.63
9 0.74 0.74 0.74 0.72 0.73 0.72 0.61 0.62 0.61 0.69 0.68 0.68
10 0.84 0.83 0.82 0.84 0.83 0.82 0.81 0.80 0.79 0.82 0.80 0.80
11 0.78 0.78 0.78 0.76 0.76 0.77 0.62 0.61 0.61 0.73 0.73 0.73
12 0.78 0.80 0.82 0.78 0.80 0.82 0.73 0.74 0.76 0.76 0.78 0.80
BHT and DBHT sensitivity to parameter B
The assessment was made using the outlier scenario where BHT and DBHT have shown median performance
(in terms of AUC and TPR), which for both corresponds to outlier scenario #7. For each of the 4 configurations
(baseline hazard fixed at $\lambda = 1$, $\nu = 1$) 20 runs were made, making 80 runs for each value of $B$. The values
of $B$ were increased in increments of 100, until there was no further improvement in performance. The mean
values of TPR and AUC were taken, and are depicted in Figures 5.3 and 5.4. As expected from the previous
results, DBHT converges much faster to its maximum performance than BHT: DBHT's TPR
converges at about 400 bootstrap samples, whereas BHT's converges only at $B = 800$. In
terms of AUC, DBHT converges to its maximum performance at about $B = 400$, and BHT only at about $B = 800$.
Table 5.15: Average AUC of proposed methods by outlier scenario and baseline hazard $(\lambda, \nu)$.
BHT DBHT
(1, 1) (0.5, 1.5) (1.5, 0.5) (1, 1) (0.5, 1.5) (1.5, 0.5)
1 0.78 0.77 0.79 0.81 0.81 0.82
2 0.72 0.71 0.72 0.75 0.74 0.75
3 0.87 0.86 0.86 0.90 0.90 0.90
4 0.71 0.72 0.70 0.73 0.74 0.72
5 0.81 0.83 0.82 0.83 0.85 0.85
6 0.68 0.68 0.68 0.72 0.70 0.71
7 0.70 0.70 0.70 0.72 0.72 0.72
8 0.65 0.65 0.65 0.68 0.69 0.66
9 0.59 0.59 0.61 0.59 0.60 0.60
10 0.88 0.87 0.86 0.92 0.91 0.91
11 0.59 0.59 0.60 0.60 0.61 0.61
12 0.81 0.82 0.81 0.85 0.87 0.86
Figure 5.3: Evolution of TPR with parameter $B$ (TPR plotted against $B$), DBHT in blue and BHT in red.
Figure 5.4: Evolution of AUC with parameter $B$ (AUC plotted against $B$), DBHT in blue and BHT in red.
5.3 Worcester Heart Attack Study dataset
The dataset from the Worcester Heart Attack Study contains data on 100 individuals, each with 5 covariates.
This data concerns the survival times of patients having their first heart attack. This dataset is publicly
available at https://www.umass.edu/statdata/statdata/data/. The outliers detected by the methods for the
WHAS dataset are presented in Table 5.16. The selection corresponds to the fifteen observations with the
lowest p-values. The estimates of the regression coefficients when fitting the Cox model to all observations
are given in Table 5.17; we observe that only two covariates are statistically significant: the age at the first
heart attack (Age) and the body mass index (BMI). Performing a 5% trimming of outliers, the Cox estimates
Table 5.16: Top-15 outliers detected by the methods on the WHAS100 dataset.
# MART DEV LD DFB OSD BHT DBHT
1 93 1 97 8 1 67 1
2 51 31 67 97 67 1 67
3 90 56 1 93 97 78 97
4 33 85 52 52 51 56 56
5 11 97 23 30 23 69 23
6 27 93 7 10 31 8 90
7 40 30 57 78 93 45 93
8 1 78 78 7 52 93 8
9 31 51 56 56 56 30 78
10 56 90 17 32 57 32 51
11 85 67 29 54 7 13 29
12 97 91 31 98 30 33 69
13 46 11 91 57 13 47 72
14 69 23 30 51 78 51 30
15 30 27 32 90 8 97 17
Table 5.17: Cox model estimated with all WHAS observations.
$\beta$ p-value
Los -0.0220 0.3967
Age 0.0386 0.0025
Gender 0.1558 0.6066
BMI -0.0711 0.0497
Table 5.18: Cox model fit on the WHAS dataset with 5% outlier trimming, using the alternative methods.
MART DEV LD DFB
$X_i$ $\beta_i$ p $\beta_i$ p $\beta_i$ p $\beta_i$ p
Los -0.0322 0.2074 -0.0194 0.4543 -0.0169 0.5012 -0.1001 0.0541
Age 0.0443 0.0006 0.0492 0.0005 0.0588 0.0001 0.0490 0.0003
Gender 0.5758 0.0559 0.0751 0.8154 -0.0316 0.9214 -0.0716 0.8253
BMI -0.0794 0.0286 -0.0967 0.0160 -0.0970 0.0156 -0.1305 0.0020
Table 5.19: Cox model fit on the WHAS dataset with 5% outlier trimming, using the proposed methods.
OSD BHT DBHT
$X_i$ $\beta_i$ p $\beta_i$ p $\beta_i$ p
Los -0.0195 0.4412 -0.1048 0.0434 -0.0216 0.4260
Age 0.0525 0.0003 0.0485 0.0003 0.0555 0.0002
Gender 0.1408 0.6544 0.1786 0.5690 -0.0020 0.9951
BMI -0.1064 0.0067 -0.1214 0.0031 -0.1010 0.0122
Table 5.20: Cox model on the WHAS data with 10% outlier trimming for alternative methods.
MART DEV LD DFB
$X_i$ $\beta_i$ p $\beta_i$ p $\beta_i$ p $\beta_i$ p
Los -0.0399 0.1646 -0.0308 0.3014 -0.0158 0.5654 -0.1139 0.0358
Age 0.0523 0.0001 0.0481 0.0006 0.0690 0.0000 0.0503 0.0012
Gender 0.5013 0.1064 0.3778 0.2475 -0.2390 0.4821 -0.2272 0.5076
BMI -0.0936 0.0134 -0.1613 0.0002 -0.1458 0.0013 -0.1706 0.0004
Table 5.21: Cox model fit on the WHAS dataset with 10% outlier trimming, using the proposed methods.
OSD BHT DBHT
$X_i$ $\beta_i$ p $\beta_i$ p $\beta_i$ p
Los -0.0252 0.3743 -0.1662 0.0062 -0.1298 0.0172
Age 0.0677 0.0000 0.0478 0.0000 0.0588 0.0001
Gender 0.0422 0.8983 0.0029 0.9921 0.2585 0.4247
BMI -0.1336 0.0024 -0.1620 0.0012 -0.1618 0.0003
5.4 Bone Marrow Transplant dataset
The Bone Marrow Transplant dataset (BMT) (Klein and Moeschberger, 2003) contains data about 137
leukemia patients, each with 10 covariates. The data concerns the survival time after the bone marrow
transplant. It is publicly available in the R package "KMsurv" (by Klein et al., 2012). The outliers detected
by the methods in the BMT dataset are presented in Table 5.22. The observations presented correspond to
the ones with the 10% lowest p-values. For BHT and DBHT, $B = 2000$ bootstrap samples
proved to be sufficient for convergence.
Table 5.22: Top-10% outliers detected by the methods on the BMT dataset.
# MART DEV LD DFB OSD BHT DBHT
1 65 129 129 65 129 129 129
2 103 35 132 26 132 103 132
3 99 108 89 129 30 99 99
4 97 65 90 99 130 65 65
5 13 132 26 2 26 30 130
6 42 87 30 6 28 132 103
7 63 84 28 89 65 13 30
8 40 103 130 43 13 130 89
9 92 30 17 84 103 16 13
10 14 99 105 103 14 136 28
11 43 97 136 130 72 15 14
12 39 28 116 132 89 26 105
13 49 109 72 30 50 97 116
14 10 80 36 10 99 131 90
When using all the data, the statistically significant covariates are FAB, Hospital and MTX (Table 5.23).
After identifying the most outlying observations with each method, we are able to perform outlier trimming
in order to make a more robust estimation of the Cox model. Starting with a trimming level of 5%, new
Cox models were estimated (Table 5.24 and Table 5.25); there are no major changes from the model with
all data, apart from the coefficients of Donor Age and CMV, which experience a considerable reduction of
their p-value. With an outlier trimming level of 10%, we can verify that covariate Donor Age for two of the
Table 5.23: Cox model estimation with all BMT data (coefficients with p-value below 5% in bold).
$X_i$ $\beta_i$ p-value
Age Diagn -0.0017 0.9357
Donor Age 0.0316 0.1072
Sex -0.2738 0.2651
Donor Sex 0.0409 0.8662
CMV -0.1701 0.4922
Donor CMV 0.0038 0.9875
Wait Time -0.0001 0.8701
FAB 0.7917 0.0012
Hospital -0.5570 0.0004
MTX 1.0062 0.0026
Table 5.24: Cox model estimations with 5% outlier trimming using the alternative methods.
MART DEV LD DFB
Xi βi p βi p βi p βi p
Age Diagn 0.0057 0.7733 0.0184 0.4007 0.0087 0.6996 0.0022 0.9163
Donor Age 0.0338 0.0630 0.0194 0.3390 0.0352 0.0936 0.0353 0.0787
Sex -0.4350 0.0754 -0.3857 0.1318 -0.3110 0.2309 -0.2544 0.3129
Donor Sex -0.0826 0.7417 0.2214 0.3959 0.2268 0.3834 0.2096 0.4284
CMV -0.4239 0.1077 -0.3828 0.1580 -0.3386 0.1910 -0.4058 0.1264
Donor CMV 0.0150 0.9511 0.0326 0.8968 0.1254 0.6206 -0.0774 0.7585
Wait Time 0.0000 0.9538 0.0001 0.7803 -0.0003 0.5304 0.0005 0.3899
FAB 1.1493 0.0000 1.0162 0.0001 0.9587 0.0002 1.1592 0.0000
Hospital -0.6484 0.0000 -0.6842 0.0001 -1.0732 0.0001 -0.7807 0.0000
MTX 1.6598 0.0000 1.2379 0.0008 1.7474 0.0003 1.6405 0.0000
alternative methods is now significant at the 5% level. For all three of our proposed methods, trimming
the 13 most outlying observations from the data results in the covariate CMV
appearing with a p-value below the 5% level.
Table 5.25: Cox model estimations with 5% outlier trimming using the proposed methods.
OSD BHT DBHT
Xi βi p βi p βi p
Age Diagn 0.0192 0.3837 0.0086 0.6762 0.0183 0.3779
Donor Age 0.0242 0.2356 0.0327 0.0910 0.0250 0.2001
Sex -0.3655 0.1508 -0.3076 0.2234 -0.2922 0.2524
Donor Sex 0.3848 0.1510 0.0357 0.8904 0.1293 0.6206
CMV -0.5158 0.0574 -0.3469 0.1868 -0.4485 0.0919
Donor CMV 0.0064 0.9797 0.0514 0.8349 0.0265 0.9153
Wait Time -0.0003 0.5408 0.0001 0.7567 0.0001 0.6760
FAB 1.0752 0.0000 1.2288 0.0000 1.2365 0.0000
Hospital -0.8930 0.0000 -0.7683 0.0000 -0.8743 0.0000
MTX 1.6513 0.0001 1.6951 0.0000 1.8809 0.0000
Table 5.26: Cox model estimations with 10% outlier trimming using the alternative methods.
MART DEV LD DFB
Xi βi p βi p βi p βi p
Age Diagn 0.0029 0.8845 0.0295 0.1748 0.0017 0.9421 0.0102 0.6457
Donor Age 0.0335 0.0778 0.0225 0.2738 0.0555 0.0144 0.0368 0.0849
Sex -0.3882 0.1233 -0.5748 0.0359 -0.3793 0.1673 -0.3072 0.2421
Donor Sex 0.0508 0.8399 0.2325 0.3946 0.3354 0.2309 0.2538 0.3634
CMV -0.4251 0.1097 -0.4897 0.0765 -0.3968 0.1484 -0.5183 0.0637
Donor CMV 0.0068 0.9777 0.1063 0.6800 -0.0026 0.9922 0.0010 0.9968
Wait Time -0.0001 0.7227 0.0003 0.3269 -0.0002 0.6049 0.0005 0.3632
FAB 1.1796 0.0000 1.2746 0.0000 1.0994 0.0000 1.3017 0.0000
Hospital -0.6857 0.0000 -0.7872 0.0000 -1.4023 0.0001 -0.9851 0.0000
MTX 1.9100 0.0000 1.5219 0.0001 2.2493 0.0002 2.0618 0.0000
5.5 CancerSys Dataset
The CancerSys dataset (CSYS) contains data on 161 cancer patients with bone metastasis (91 after removing
missing values). The recorded time corresponds to the follow-up time after bone metastasis was
diagnosed. The top-10% outliers detected by the methods are displayed in Table 5.28. Fitting a Cox model to
all observations (see Table 5.29), we may verify that only two covariates are significant at the 5% level:
AgeDiagn and ExtraMets. After performing 5% outlier trimming and refitting the Cox model (see
Tables 5.30 and 5.31), we observe important changes. For three of the alternative methods, trimming resulted
in the covariate Sex becoming significant; for all three of our proposed methods, removing the 4 most
Table 5.27: Cox model estimations with 10% outlier trimming using the proposed methods.
OSD BHT DBHT
Xi βi p βi p βi p
Age Diagn 0.0278 0.1945 -0.0174 0.4183 0.0226 0.2863
Donor Age 0.0124 0.5390 0.0332 0.0972 0.0288 0.1394
Sex -0.4453 0.0873 -0.4120 0.1148 -0.4235 0.1124
Donor Sex 0.2619 0.3427 0.0760 0.7804 0.1546 0.5717
CMV -0.6221 0.0263 -0.5407 0.0466 -0.5840 0.0378
Donor CMV 0.0597 0.8164 -0.0239 0.9256 0.1857 0.4689
Wait Time -0.0002 0.7155 0.0001 0.6230 0.0002 0.5920
FAB 1.2761 0.0000 1.2596 0.0000 1.2792 0.0000
Hospital -1.3875 0.0000 -0.9910 0.0000 -1.5141 0.0000
MTX 3.0082 0.0000 2.1267 0.0000 3.0513 0.0000
outlying observations resulted in the covariate XRayPattern emerging as significant. It is noteworthy that,
for DBHT, removing only 4 of the 91 observations decreased the p-value of XRayPattern from 0.0997 (Table 5.29) to 0.0073 (Table 5.31).
Table 5.28: Top-10% outliers detected by the methods on the CSYS dataset.
# MART DEV LD DFB OSD BHT DBHT
1 68 60 49 68 83 83 83
2 83 143 112 78 126 112 126
3 34 68 60 34 84 22 112
4 78 110 124 53 91 49 124
5 53 64 22 57 112 47 28
6 97 124 28 23 49 124 49
7 126 112 32 126 22 126 60
8 23 78 62 83 62 62 91
9 91 53 83 9 60 28 73
10 57 126 140 97 124 91 62
Table 5.29: Cox model fitted to all CSYS data.
Xi βi p-value
Sex 0.2848 0.2658
AgeDiagn 0.0268 0.0034
XRayPattern -0.2442 0.0997
NSRE -0.0565 0.6210
ExtraMets 0.8031 0.0035
NTXBase 0.0004 0.2406
Table 5.30: Cox model fit with 5% outlier trimming using the alternative methods.
MART DEV LD DFB
Xi βi p βi p βi p βi p
Sex 0.6734 0.0130 0.5709 0.0396 0.2684 0.3174 0.8407 0.0029
AgeDiagn 0.0209 0.0450 0.0198 0.0567 0.0311 0.0012 0.0147 0.1627
XRayPattern -0.2175 0.1766 -0.2532 0.1017 -0.3116 0.0428 -0.1344 0.3911
NSRE -0.0648 0.5530 -0.0858 0.4680 -0.0096 0.9339 -0.0275 0.8017
ExtraMets 1.0720 0.0002 0.9979 0.0006 1.0453 0.0005 1.1404 0.0001
NTXBase 0.0002 0.5102 0.0004 0.1568 0.0005 0.1512 0.0005 0.1436
Table 5.31: Cox model fit with 5% outlier trimming using the proposed methods.
OSD BHT DBHT
Xi βi p βi p βi p
Sex 0.1289 0.6167 0.2300 0.3803 0.2739 0.2917
AgeDiagn 0.0351 0.0002 0.0306 0.0012 0.0344 0.0003
XRayPattern -0.4881 0.0036 -0.3423 0.0282 -0.4292 0.0073
NSRE -0.1377 0.2560 -0.0461 0.6838 -0.0842 0.4665
ExtraMets 1.0456 0.0003 1.0025 0.0008 0.9654 0.0008
NTXBase 0.0003 0.2913 0.0005 0.1506 0.0004 0.2524
With a 10% level of outlier trimming, the MART method, which returned Sex as significant at the 5%
trimming level, once again finds Sex non-significant. Our proposed methods remain
consistent: all three yield the same significant covariates, and their coefficient estimates are
similar.
Table 5.32: Cox model with 10% outlier trimming for alternative methods.
MART DEV LD DFB
Xi βi p βi p βi p βi p
Sex 0.3642 0.1902 0.8061 0.0052 0.2234 0.4153 0.6558 0.0218
AgeDiagn 0.0365 0.0012 0.0158 0.1367 0.0352 0.0004 0.0289 0.0152
XRayPattern -0.3291 0.0502 -0.1408 0.4040 -0.3998 0.0136 -0.3261 0.0503
NSRE -0.1056 0.3328 -0.0105 0.9290 -0.0332 0.7764 -0.0188 0.8842
ExtraMets 1.0650 0.0004 1.2823 0.0000 1.2264 0.0001 0.9939 0.0011
NTXBase 0.0001 0.6687 0.0004 0.2584 0.0005 0.1074 0.0004 0.3105
Table 5.33: Cox model with 10% outlier trimming for proposed methods.
OSD BHT DBHT
Xi βi p βi p βi p
Sex -0.0204 0.9400 0.1799 0.5039 0.2191 0.4228
AgeDiagn 0.0398 0.0001 0.0379 0.0001 0.0394 0.0001
XRayPattern -0.5477 0.0015 -0.4688 0.0042 -0.5169 0.0022
NSRE -0.1258 0.3132 -0.0831 0.4848 -0.0724 0.5272
ExtraMets 1.3058 0.0000 1.1139 0.0003 1.1872 0.0001
NTXBase 0.0005 0.1512 0.0005 0.1541 0.0004 0.2351
Leave-one-out Cross-validation of the c-index
To assess the predictive ability of the model when facing new observations, we perform leave-one-out
cross-validation of the c-index. In these results the outliers are still part of the test sets, but they are
never present in the training sets used to estimate the models. Three levels of outlier trimming were used:
removal of the 3, 10 and 30 most outlying observations. The leave-one-out values
obtained for the three real datasets are presented for each method in Tables 5.34, 5.35 and 5.36. The results
are very positive for the three methods, with the concordance generally increasing as more of the
most outlying observations of each dataset are removed.
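The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce Tables 5.34 to 5.36: `fit_cox_1d` is a hypothetical one-covariate Cox fit by damped Newton steps that stands in for the survival model, `loo_cindex` pools the held-out risk scores into a single Harrell c-index (one plausible reading of the procedure), and the toy data are invented.

```python
import math


def harrell_c(times, events, risk):
    """Harrell's c-index: share of comparable pairs ranked concordantly."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def fit_cox_1d(times, events, x, iters=25):
    """Stand-in model (assumption): one-covariate Cox fit, damped Newton steps."""
    beta = 0.0
    n = len(times)
    for _ in range(iters):
        grad, info = 0.0, 0.0
        for i in range(n):
            if not events[i]:
                continue
            risk_set = [j for j in range(n) if times[j] >= times[i]]
            w = [math.exp(beta * x[j]) for j in risk_set]
            s0 = sum(w)
            s1 = sum(wj * x[j] for wj, j in zip(w, risk_set))
            s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk_set))
            grad += x[i] - s1 / s0
            info += s2 / s0 - (s1 / s0) ** 2
        if info < 1e-12:
            break
        step = max(-1.0, min(1.0, grad / info))  # damping keeps the toy fit stable
        beta += step
        if abs(step) < 1e-8:
            break
    return beta


def loo_cindex(times, events, x, outliers=()):
    """Leave-one-out c-index; `outliers` (0-based) never enter a training set."""
    excluded = set(outliers)
    scores = []
    for i in range(len(times)):
        train = [j for j in range(len(times)) if j != i and j not in excluded]
        beta = fit_cox_1d([times[j] for j in train],
                          [events[j] for j in train],
                          [x[j] for j in train])
        scores.append(beta * x[i])  # risk score of the held-out observation
    return harrell_c(times, events, scores)


# Invented toy data: ten observations, one covariate.
toy_times = [5, 8, 12, 3, 9, 15, 7, 2, 11, 6]
toy_events = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
toy_x = [1.2, 0.4, -0.3, 1.5, 0.1, -1.0, 0.9, 0.2, 0.5, -0.2]

c_all = loo_cindex(toy_times, toy_events, toy_x)
c_trim = loo_cindex(toy_times, toy_events, toy_x, outliers=[7])
print(c_all, c_trim)
```

Because the model is only used to fit and to score, any other survival model could replace `fit_cox_1d` without changing `loo_cindex`.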
Table 5.34: Leave-one-out c-indexes for the BHT method.
Dataset All data top-3 top-10 top-30
WHAS 0.6607 0.6813 0.6824 0.6900
BMT 0.6208 0.6314 0.6441 0.6668
CSYS 0.5963 0.6053 0.6147 0.6186
Table 5.35: Leave-one-out c-indexes for the DBHT procedure.
Dataset All data top-3 top-10 top-30
WHAS 0.6607 0.6710 0.6807 0.6910
BMT 0.6208 0.6288 0.6462 0.6630
CSYS 0.5963 0.6114 0.6160 0.6240
Table 5.36: Leave-one-out c-indexes for the OSD procedure.
Dataset All data top-3 top-10 top-30
WHAS 0.6607 0.6832 0.6853 0.6986
BMT 0.6208 0.6314 0.6441 0.6629
CSYS 0.5963 0.6100 0.6214 0.6196
Chapter 6
Conclusions and Future Work
We proposed three methods for outlier detection in a survival context. In our simulation study the methods
achieved, in general, better performance than the alternative methods. Overall, DBHT
has shown promising results, being the best method in nine of the twelve simulated outlier scenarios. In
the three scenarios where the outlier source is collinear with the general model, the performance is poor for
all of our proposed methods. One possible cause is that concordance fails to capture this type of outlier,
given that such outliers have the same hazard direction as the inliers. We have also verified that the
performance of the proposed methods is relatively robust to changes in scenario conditions when compared
with the alternative methods. On the real datasets, performing outlier trimming prior to fitting a Cox model
looks promising, since it can potentially unveil Cox regression results that might have been distorted by the
presence of outliers in the original data. In terms of the model’s predictive ability, the leave-one-out
c-indexes increased when the most outlying observations were excluded from the fit.
In terms of future work, the developed methods leave room for many extensions. In particular, because
the methods use the survival model as a black box (only to fit and to calculate concordance), they allow
other models to be used instead of only the Cox model, as we did in this work. The bootstrapping
frameworks of BHT and DBHT can also be applied to outlier detection on other kinds of data: the
methods only need a test statistic that is believed to be sensitive to outliers. In conclusion, this work
shows once more that outlier detection methods can provide tools for performing robust regression and for
improving model interpretability and accuracy in many fields of research.
Bibliography
Aalen, O., Borgan, O., and Gjessing, H. (2008). Survival and event history analysis: a process point of view.
Springer Science & Business Media.
Bednarski, T. (1993). Robust estimation in cox’s regression model. Scandinavian Journal of Statistics, pages
213–225.
Bednarski, T. and Borowicz, F. (2006). coxrobust: Robust Estimation in Cox Model. R package version 1.0.
Ben-Gal, I. (2005). Outlier detection. In Data Mining and Knowledge Discovery Handbook, pages 131–146.
Springer.
Bender, R., Augustin, T., and Blettner, M. (2005). Generating survival times to simulate cox proportional
hazards models. Statistics in medicine, 24(11):1713–1723.
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984). Classification and regression trees. CRC
press.
Buckley, J. and James, I. (1979). Linear regression with censored data. Biometrika, 66(3):429–436.
Klein, J. P., Moeschberger, M. L., and Yan, J. (2012). KMsurv: Data sets from Klein and
Moeschberger (1997), Survival Analysis. R package version 0.1-5.
Cox, D. R. (1972). Regression Models and Life Tables. Journal of the Royal Statistical Society, B(34):187–202.
Cox, D. R. and Snell, E. J. (1968). A general definition of residuals. Journal of the Royal Statistical Society.
Series B (Methodological), pages 248–275.
Crowley, J. and Hu, M. (1977). Covariance analysis of heart transplant survival data. Journal of the American
Statistical Association, 72(357):27–36.
Collett, D. (2003). Modelling survival data in medical research. Chapman & Hall/CRC, Boca Raton, FL.
Kleinbaum, D. G. and Klein, M. (2005). Survival analysis: a self-learning text. Springer, New York, NY.
Davies, L. and Gather, U. (1993). The identification of multiple outliers. Journal of the American Statistical
Association, 88(423):782–792.
Donoho, D. L. and Huber, P. J. (1983). The notion of breakdown point. A Festschrift for Erich L. Lehmann,
pages 157–184.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The annals of Statistics, pages 1–26.
Efron, B. and Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.
Farcomeni, A. and Viviani, S. (2011). Robust estimation for the cox regression model based on trimming.
Biometrical Journal, 53(6):956–973.
Fischler, M. and Bolles, R. (1981). Random Sample Consensus: A Paradigm for Model Fitting with Applica-
tions to Image Analysis and Automated Cartography. Communications of the ACM.
Goeman, J. J. (2010). L1 penalized estimation in the cox proportional hazards model. Biometrical Journal,
(52):–14.
Goeman, J. J. (2012). Penalized: L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs
and in the Cox model. R package version.
Grambsch, P. M. and Therneau, T. M. (1994). Proportional hazards tests and diagnostics based on weighted
residuals. Biometrika, 81(3):515–526.
Harrell, F. E. (2001). Regression modeling strategies: with applications to linear models, logistic regression,
and survival analysis. Springer.
Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L., and Rosati, R. A. (1982). Evaluating the yield of medical
tests. Jama, 247(18):2543–2546.
Hawkins, D. M. (1980). Identification of outliers, volume 11. Springer.
Heagerty, P. J., Lumley, T., and Pepe, M. S. (2000). Time-dependent roc curves for censored survival data
and a diagnostic marker. Biometrics, 56(2):337–344.
Heagerty, P. J. and packaging by Paramita Saha-Chaudhuri (2013). survivalROC: Time-dependent ROC curve
estimation from censored survival data. R package version 1.0.3.
Heller, G. and Simonoff, J. S. (1990). A comparison of estimators for regression with a censored response
variable. Biometrika, 77(3):515–520.
Huber, P. J. (2011). Robust statistics. Springer.
Johnson, R. A. and Wichern, D. W. (1992). Applied multivariate statistical analysis, volume 4.
Prentice Hall, Englewood Cliffs, NJ.
Kalbfleisch, J. D. and Prentice, R. L. (2011). The statistical analysis of failure time data, volume 360. John
Wiley & Sons.
Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the
American statistical association, 53(282):457–481.
Kendall, M. and Gibbons, J. D. (1990). Rank Correlation Methods. A Charles Griffin Title, 5 edition.
Klein, J. P. and Moeschberger, M. L. (2003). Survival Analysis Techniques for Censored and Truncated Data.
Second edition.
Larson, M. G. and Dinse, G. E. (1985). A mixture model for the regression analysis of competing risks data.
Applied statistics, pages 201–211.
Lawless, J. F. (2003). Statistical Models and Methods for Lifetime Data. John Wiley & Sons, 2nd edition.
Leblanc, M. and Crowley, J. (1993). Survival trees by goodness of split. Journal of the American Statistical
Association, 88(422):457–467.
Lunn, M. and McNeil, D. (1995). Applying cox regression to competing risks. Biometrics, pages 524–532.
MacKinnon, J. G. (2009). Bootstrap hypothesis testing. Handbook of Computational Econometrics, pages
183–213.
Mikosch, T. (1998). Elementary stochastic calculus, with finance in view. AMC, 10:12.
Miller, R. and Halpern, J. (1982). Regression with censored data. Biometrika, 69(3):521–531.
Miller, R. G. (1976). Least squares regression with censored data. Biometrika, 63(3):449–464.
Nardi, A. and Schemper, M. (1999). New residuals for cox regression and their application to outlier screening.
Biometrics, 55(2):523–529.
Newson, R. (2006). Confidence intervals for rank statistics: Somers’ d and extensions. Stata Journal, 6(3):309.
Ojo, A. O., Hanson, J. A., Wolfe, R. A., Leichtman, A. B., Agodoa, L. Y., and Port, F. K. (2000). Long-term
survival in renal transplant recipients with graft function. Kidney international, 57(1):307–313.
Rajagopalan, V. (2006). Selected Statistical Tests. New Age International.
Rocha, C. (2011). Examining time to rearrest by drug treatment experience of drug court eligible offenders.
Rousseeuw, P. J. and Leroy, A. M. (2005). Robust regression and outlier detection, volume 589. John Wiley
& Sons.
Singh, K. and Xie, M. (2003). Bootlier-Plot: Bootstrap Based Outlier Detection Plot. Sankhya: The Indian
Journal of Statistics (2003-2007), 65(3):532–559.
Somers, R. H. (1962). A new asymmetric measure of association for ordinal variables. American sociological
review, pages 799–811.
Stare, J., Heinzl, H., and Harrell, F. (2000). On the use of buckley and james least squares regression for
survival data. New approaches in applied statistics: Metodoloski zvezki, 16.
Struthers, C. A. and Kalbfleisch, J. D. (1986). Misspecified proportional hazard models. Biometrika,
73(2):363–369.
Therneau, T. M. (2014). A Package for Survival Analysis in S. R package version 2.37-7.
Therneau, T. M. and Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. Springer.
Therneau, T. M., Grambsch, P. M., and Fleming, T. R. (1990). Martingale-based residuals for survival models.
Biometrika, 77(1):147–160.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.
Series B (Methodological), pages 267–288.
Appendix A: Results on the simulation data
Table 6.1: TPR of each method in the 12 outlier scenarios. c = 0.2; k = 5; λ = 1; ν = 1.
# Scenario MART DEV LD DFB OSD BHT DBHT
1 0.28(0.19) 0.39(0.19) 0.43(0.19) 0.37(0.16) 0.44(0.21) 0.42(0.16) 0.44(0.21)
2 0.2(0.16) 0.22(0.17) 0.28(0.19) 0.27(0.15) 0.26(0.19) 0.28(0.18) 0.28(0.18)
3 0.44(0.22) 0.59(0.16) 0.62(0.18) 0.55(0.17) 0.62(0.19) 0.58(0.19) 0.63(0.19)
4 0.21(0.14) 0.23(0.14) 0.32(0.14) 0.25(0.15) 0.28(0.17) 0.3(0.15) 0.31(0.16)
5 0.41(0.2) 0.54(0.17) 0.53(0.19) 0.49(0.19) 0.54(0.2) 0.51(0.19) 0.56(0.21)
6 0.2(0.16) 0.24(0.14) 0.31(0.18) 0.28(0.17) 0.25(0.17) 0.28(0.17) 0.28(0.18)
7 0.3(0.19) 0.54(0.17) 0.41(0.17) 0.38(0.15) 0.4(0.2) 0.39(0.19) 0.41(0.2)
8 0.16(0.13) 0.13(0.13) 0.24(0.16) 0.22(0.13) 0.18(0.17) 0.17(0.14) 0.2(0.17)
9 0.28(0.17) 0.39(0.18) 0.17(0.15) 0.23(0.16) 0.05(0.1) 0.04(0.08) 0.06(0.1)
10 0.53(0.19) 0.7(0.19) 0.72(0.18) 0.64(0.16) 0.72(0.19) 0.62(0.16) 0.73(0.18)
11 0.38(0.2) 0.49(0.17) 0.25(0.14) 0.32(0.15) 0.06(0.13) 0.07(0.13) 0.06(0.11)
12 0.39(0.22) 0.64(0.2) 0.61(0.2) 0.53(0.18) 0.6(0.19) 0.54(0.19) 0.59(0.2)
Table 6.2: AUC of each method in the 12 outlier scenarios. c = 0.2; k = 5; λ = 1; ν = 1.
# Scenario MART DEV LD DFB BHT DBHT
1 0.71(0.13) 0.72(0.14) 0.74(0.14) 0.7(0.13) 0.77(0.15) 0.79(0.15)
2 0.66(0.12) 0.66(0.13) 0.71(0.15) 0.65(0.13) 0.71(0.14) 0.75(0.13)
3 0.82(0.13) 0.83(0.13) 0.83(0.13) 0.8(0.14) 0.86(0.11) 0.91(0.08)
4 0.65(0.13) 0.67(0.12) 0.71(0.11) 0.62(0.12) 0.72(0.12) 0.73(0.13)
5 0.8(0.12) 0.8(0.12) 0.78(0.13) 0.78(0.11) 0.82(0.13) 0.85(0.1)
6 0.66(0.11) 0.66(0.11) 0.74(0.12) 0.67(0.12) 0.72(0.13) 0.75(0.13)
7 0.77(0.13) 0.8(0.13) 0.71(0.14) 0.74(0.12) 0.7(0.12) 0.73(0.13)
8 0.61(0.1) 0.62(0.1) 0.7(0.11) 0.61(0.12) 0.66(0.12) 0.67(0.13)
9 0.76(0.12) 0.76(0.12) 0.65(0.13) 0.71(0.13) 0.61(0.1) 0.61(0.09)
10 0.9(0.11) 0.9(0.1) 0.88(0.14) 0.88(0.12) 0.91(0.09) 0.95(0.06)
11 0.82(0.12) 0.82(0.12) 0.66(0.11) 0.76(0.13) 0.59(0.09) 0.62(0.1)
12 0.8(0.15) 0.82(0.15) 0.81(0.16) 0.8(0.14) 0.84(0.12) 0.88(0.1)
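As a rough illustration of the two metrics reported in these tables, the sketch below computes a TPR and an AUC from outlyingness scores and ground-truth labels. It assumes that the TPR is the fraction of true outliers found among the k highest-scoring observations and that the AUC is the rank-based (Mann-Whitney) estimate; the function names and the toy data are hypothetical and are not taken from the simulation code. When a method returns p-values, where lower means more outlying, a score such as 1 - p can be used.

```python
def tpr_at_k(scores, is_outlier, k):
    """Fraction of the true outliers that appear among the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if is_outlier[i])
    total = sum(1 for flag in is_outlier if flag)
    return hits / total


def auc(scores, is_outlier):
    """Probability that a random outlier outscores a random inlier (ties count 0.5)."""
    pos = [s for s, flag in zip(scores, is_outlier) if flag]
    neg = [s for s, flag in zip(scores, is_outlier) if not flag]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented scores for six observations, two of which are true outliers.
toy_scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
toy_labels = [True, False, False, False, True, False]

print(tpr_at_k(toy_scores, toy_labels, k=2))  # 0.5
print(auc(toy_scores, toy_labels))            # 0.875
```

The TPR depends on the cut-off k, whereas the AUC summarises the whole ranking, which is why a method can rank differently under the two metrics.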
Table 6.3: TPR of each method in the 12 outlier scenarios. c = 0.2; k = 10; λ = 1; ν = 1.
# Scenario MART DEV LD DFB OSD BHT DBHT
1 0.32(0.11) 0.37(0.14) 0.48(0.13) 0.37(0.12) 0.53(0.12) 0.51(0.12) 0.55(0.12)
2 0.22(0.1) 0.28(0.09) 0.38(0.11) 0.3(0.12) 0.39(0.11) 0.39(0.14) 0.39(0.12)
3 0.59(0.13) 0.6(0.15) 0.63(0.15) 0.51(0.12) 0.7(0.14) 0.67(0.11) 0.71(0.11)
4 0.23(0.1) 0.22(0.12) 0.31(0.13) 0.29(0.1) 0.32(0.13) 0.31(0.13) 0.35(0.11)
5 0.52(0.16) 0.51(0.14) 0.49(0.14) 0.51(0.1) 0.6(0.14) 0.53(0.15) 0.57(0.12)
6 0.21(0.13) 0.25(0.11) 0.29(0.14) 0.29(0.12) 0.29(0.14) 0.29(0.13) 0.3(0.14)
7 0.46(0.11) 0.51(0.14) 0.35(0.12) 0.4(0.08) 0.5(0.15) 0.4(0.14) 0.45(0.15)
8 0.18(0.07) 0.17(0.1) 0.25(0.13) 0.22(0.09) 0.23(0.15) 0.21(0.12) 0.25(0.12)
9 0.33(0.11) 0.42(0.15) 0.16(0.1) 0.28(0.1) 0.14(0.09) 0.08(0.09) 0.08(0.07)
10 0.6(0.12) 0.62(0.14) 0.66(0.15) 0.55(0.11) 0.68(0.17) 0.65(0.13) 0.73(0.1)
11 0.39(0.14) 0.5(0.14) 0.26(0.13) 0.35(0.12) 0.17(0.09) 0.12(0.09) 0.15(0.09)
12 0.55(0.12) 0.61(0.13) 0.53(0.16) 0.51(0.13) 0.62(0.16) 0.54(0.18) 0.62(0.16)
Table 6.4: AUC of each method in the 12 outlier scenarios. c = 0.2; k = 10; λ = 1; ν = 1.
# Scenario MART DEV LD DFB BHT DBHT
1 0.71(0.08) 0.71(0.08) 0.76(0.1) 0.67(0.1) 0.81(0.07) 0.84(0.07)
2 0.64(0.09) 0.63(0.09) 0.72(0.11) 0.63(0.1) 0.74(0.11) 0.74(0.11)
3 0.82(0.1) 0.81(0.1) 0.8(0.1) 0.77(0.11) 0.89(0.06) 0.91(0.07)
4 0.6(0.08) 0.61(0.08) 0.68(0.09) 0.58(0.07) 0.71(0.1) 0.72(0.09)
5 0.79(0.1) 0.77(0.1) 0.72(0.11) 0.76(0.1) 0.82(0.08) 0.84(0.07)
6 0.64(0.09) 0.63(0.08) 0.68(0.1) 0.62(0.08) 0.67(0.09) 0.7(0.1)
7 0.74(0.1) 0.75(0.09) 0.62(0.1) 0.68(0.1) 0.7(0.11) 0.73(0.11)
8 0.6(0.06) 0.6(0.06) 0.66(0.09) 0.57(0.07) 0.63(0.09) 0.66(0.1)
9 0.76(0.08) 0.74(0.1) 0.61(0.09) 0.7(0.08) 0.55(0.07) 0.56(0.07)
10 0.85(0.07) 0.85(0.08) 0.83(0.1) 0.8(0.07) 0.89(0.06) 0.92(0.05)
11 0.78(0.09) 0.77(0.08) 0.6(0.06) 0.73(0.11) 0.58(0.06) 0.58(0.06)
12 0.82(0.1) 0.8(0.1) 0.73(0.11) 0.79(0.09) 0.81(0.1) 0.86(0.1)
Table 6.5: TPR of each method in the 12 outlier scenarios. c = 0.3; k = 5; λ = 1; ν = 1.
# Scenario MART DEV LD DFB OSD BHT DBHT
1 0.28(0.18) 0.3(0.17) 0.39(0.18) 0.33(0.16) 0.41(0.19) 0.39(0.18) 0.42(0.18)
2 0.21(0.16) 0.22(0.15) 0.28(0.19) 0.28(0.16) 0.29(0.2) 0.26(0.17) 0.3(0.19)
3 0.44(0.21) 0.6(0.17) 0.61(0.17) 0.54(0.17) 0.62(0.18) 0.58(0.17) 0.66(0.16)
4 0.22(0.16) 0.22(0.17) 0.24(0.18) 0.28(0.17) 0.24(0.19) 0.27(0.19) 0.28(0.2)
5 0.36(0.19) 0.5(0.21) 0.52(0.21) 0.46(0.17) 0.52(0.2) 0.48(0.19) 0.5(0.19)
6 0.18(0.15) 0.18(0.17) 0.26(0.19) 0.24(0.17) 0.22(0.2) 0.21(0.16) 0.26(0.2)
7 0.35(0.19) 0.47(0.19) 0.41(0.18) 0.4(0.16) 0.42(0.21) 0.34(0.2) 0.4(0.2)
8 0.18(0.16) 0.17(0.15) 0.21(0.16) 0.23(0.16) 0.21(0.17) 0.18(0.14) 0.23(0.17)
9 0.32(0.17) 0.31(0.18) 0.15(0.16) 0.22(0.17) 0.04(0.08) 0.04(0.08) 0.04(0.09)
10 0.5(0.2) 0.6(0.18) 0.66(0.16) 0.56(0.13) 0.63(0.15) 0.58(0.18) 0.68(0.15)
11 0.38(0.19) 0.43(0.19) 0.21(0.15) 0.28(0.16) 0.08(0.13) 0.09(0.12) 0.09(0.13)
12 0.43(0.21) 0.57(0.19) 0.5(0.19) 0.48(0.19) 0.52(0.21) 0.5(0.2) 0.57(0.18)
Table 6.6: AUC of each method in the 12 outlier scenarios. c = 0.3; k = 5; λ = 1; ν = 1.
# Scenario MART DEV LD DFB BHT DBHT
1 0.69(0.12) 0.68(0.12) 0.73(0.12) 0.67(0.13) 0.77(0.11) 0.81(0.13)
2 0.66(0.13) 0.67(0.14) 0.67(0.14) 0.69(0.12) 0.71(0.13) 0.74(0.14)
3 0.81(0.14) 0.81(0.15) 0.79(0.13) 0.78(0.14) 0.87(0.1) 0.89(0.08)
4 0.66(0.11) 0.65(0.11) 0.72(0.11) 0.65(0.13) 0.72(0.12) 0.73(0.15)
5 0.75(0.15) 0.75(0.16) 0.76(0.17) 0.74(0.15) 0.8(0.14) 0.83(0.13)
6 0.65(0.12) 0.65(0.12) 0.68(0.12) 0.63(0.13) 0.67(0.13) 0.72(0.12)
7 0.74(0.14) 0.74(0.14) 0.7(0.13) 0.72(0.14) 0.71(0.13) 0.72(0.14)
8 0.66(0.11) 0.65(0.11) 0.65(0.13) 0.65(0.13) 0.66(0.13) 0.7(0.12)
9 0.72(0.14) 0.7(0.13) 0.61(0.12) 0.68(0.13) 0.62(0.1) 0.61(0.08)
10 0.84(0.12) 0.84(0.11) 0.83(0.13) 0.83(0.12) 0.88(0.08) 0.92(0.06)
11 0.76(0.14) 0.75(0.12) 0.64(0.1) 0.73(0.13) 0.59(0.11) 0.61(0.09)
12 0.77(0.15) 0.76(0.15) 0.74(0.15) 0.75(0.14) 0.81(0.12) 0.84(0.12)
Table 6.7: TPR of each method in the 12 outlier scenarios. c = 0.3; k = 10; λ = 1; ν = 1.
# Scenario MART DEV LD DFB OSD BHT DBHT
1 0.29(0.11) 0.37(0.12) 0.45(0.14) 0.36(0.11) 0.49(0.14) 0.43(0.14) 0.48(0.13)
2 0.26(0.1) 0.26(0.1) 0.34(0.13) 0.32(0.13) 0.37(0.15) 0.33(0.13) 0.37(0.14)
3 0.57(0.14) 0.56(0.12) 0.52(0.13) 0.5(0.12) 0.67(0.13) 0.58(0.1) 0.67(0.12)
4 0.24(0.1) 0.26(0.1) 0.33(0.13) 0.31(0.11) 0.35(0.12) 0.34(0.13) 0.35(0.13)
5 0.5(0.13) 0.52(0.13) 0.44(0.17) 0.47(0.11) 0.56(0.16) 0.51(0.13) 0.57(0.14)
6 0.23(0.11) 0.25(0.11) 0.3(0.11) 0.29(0.11) 0.33(0.12) 0.3(0.12) 0.32(0.11)
7 0.47(0.15) 0.51(0.14) 0.37(0.13) 0.41(0.12) 0.47(0.13) 0.37(0.12) 0.45(0.11)
8 0.2(0.09) 0.22(0.12) 0.24(0.14) 0.25(0.11) 0.27(0.12) 0.26(0.12) 0.28(0.11)
9 0.33(0.14) 0.37(0.13) 0.19(0.14) 0.25(0.13) 0.13(0.12) 0.07(0.09) 0.09(0.1)
10 0.56(0.13) 0.57(0.15) 0.55(0.14) 0.54(0.11) 0.66(0.14) 0.61(0.12) 0.67(0.12)
11 0.42(0.16) 0.45(0.15) 0.25(0.15) 0.34(0.14) 0.17(0.15) 0.14(0.14) 0.16(0.14)
12 0.55(0.14) 0.54(0.15) 0.46(0.13) 0.45(0.13) 0.59(0.15) 0.52(0.14) 0.59(0.12)
Table 6.8: AUC of each method in the 12 outlier scenarios. c = 0.3; k = 10; λ = 1; ν = 1.
# Scenario MART DEV LD DFB BHT DBHT
1 0.67(0.09) 0.67(0.09) 0.72(0.11) 0.64(0.11) 0.76(0.09) 0.82(0.07)
2 0.63(0.09) 0.64(0.09) 0.67(0.09) 0.63(0.1) 0.72(0.09) 0.75(0.09)
3 0.78(0.11) 0.76(0.11) 0.69(0.13) 0.74(0.11) 0.84(0.07) 0.89(0.05)
4 0.62(0.09) 0.61(0.09) 0.66(0.09) 0.61(0.09) 0.7(0.09) 0.72(0.09)
5 0.74(0.08) 0.73(0.08) 0.68(0.09) 0.71(0.09) 0.79(0.08) 0.82(0.09)
6 0.61(0.09) 0.62(0.08) 0.64(0.11) 0.6(0.09) 0.68(0.09) 0.71(0.08)
7 0.77(0.08) 0.75(0.09) 0.61(0.11) 0.73(0.09) 0.67(0.1) 0.71(0.09)
8 0.6(0.08) 0.6(0.08) 0.64(0.08) 0.61(0.09) 0.64(0.1) 0.68(0.09)
9 0.72(0.1) 0.67(0.11) 0.57(0.08) 0.65(0.09) 0.6(0.08) 0.6(0.08)
10 0.78(0.11) 0.77(0.11) 0.7(0.11) 0.76(0.11) 0.86(0.08) 0.9(0.07)
11 0.75(0.11) 0.71(0.12) 0.59(0.08) 0.7(0.12) 0.58(0.08) 0.59(0.08)
12 0.74(0.11) 0.73(0.1) 0.64(0.1) 0.7(0.1) 0.8(0.07) 0.83(0.08)
Table 6.9: TPR of each method in the 12 outlier scenarios. c = 0.2; k = 5; λ = 0.5; ν = 1.5.
# Scenario MART DEV LD DFB OSD BHT DBHT
1 0.24(0.16) 0.35(0.2) 0.38(0.21) 0.33(0.14) 0.42(0.22) 0.41(0.22) 0.42(0.2)
2 0.2(0.16) 0.25(0.14) 0.29(0.19) 0.3(0.2) 0.32(0.2) 0.27(0.19) 0.32(0.19)
3 0.52(0.2) 0.6(0.15) 0.69(0.14) 0.58(0.14) 0.64(0.25) 0.64(0.22) 0.7(0.2)
4 0.25(0.18) 0.25(0.18) 0.26(0.2) 0.28(0.16) 0.24(0.22) 0.28(0.19) 0.28(0.2)
5 0.35(0.24) 0.6(0.21) 0.62(0.16) 0.56(0.14) 0.64(0.2) 0.6(0.19) 0.62(0.18)
6 0.21(0.15) 0.22(0.19) 0.3(0.18) 0.27(0.18) 0.27(0.13) 0.28(0.19) 0.29(0.19)
7 0.34(0.17) 0.55(0.19) 0.46(0.15) 0.45(0.13) 0.45(0.17) 0.43(0.16) 0.42(0.17)
8 0.16(0.19) 0.15(0.16) 0.2(0.17) 0.19(0.18) 0.16(0.19) 0.16(0.12) 0.2(0.15)
9 0.23(0.15) 0.37(0.22) 0.2(0.18) 0.19(0.14) 0.06(0.11) 0.02(0.06) 0.02(0.06)
10 0.4(0.23) 0.65(0.18) 0.69(0.2) 0.51(0.2) 0.68(0.19) 0.57(0.2) 0.69(0.19)
11 0.34(0.16) 0.41(0.21) 0.21(0.15) 0.28(0.15) 0.08(0.12) 0.06(0.13) 0.08(0.12)
12 0.45(0.21) 0.69(0.18) 0.63(0.12) 0.55(0.16) 0.67(0.16) 0.57(0.16) 0.67(0.16)
Table 6.10: AUC of each method in the 12 outlier scenarios. c = 0.2; k = 5; λ = 0.5; ν = 1.5.
# Scenario MART DEV LD DFB BHT DBHT
1 0.68(0.13) 0.69(0.14) 0.72(0.13) 0.7(0.13) 0.77(0.12) 0.8(0.11)
2 0.67(0.11) 0.69(0.12) 0.73(0.14) 0.65(0.15) 0.69(0.14) 0.75(0.12)
3 0.85(0.12) 0.86(0.11) 0.87(0.12) 0.83(0.11) 0.92(0.07) 0.93(0.06)
4 0.69(0.14) 0.68(0.12) 0.69(0.15) 0.67(0.12) 0.74(0.13) 0.74(0.15)
5 0.81(0.13) 0.83(0.14) 0.83(0.16) 0.82(0.14) 0.85(0.13) 0.87(0.11)
6 0.65(0.1) 0.67(0.12) 0.67(0.15) 0.67(0.13) 0.71(0.13) 0.71(0.11)
7 0.8(0.13) 0.81(0.14) 0.74(0.17) 0.77(0.13) 0.76(0.14) 0.76(0.15)
8 0.64(0.12) 0.63(0.11) 0.68(0.14) 0.63(0.14) 0.66(0.12) 0.69(0.13)
9 0.75(0.13) 0.76(0.13) 0.65(0.14) 0.72(0.11) 0.62(0.12) 0.63(0.13)
10 0.83(0.12) 0.85(0.14) 0.86(0.13) 0.81(0.13) 0.88(0.12) 0.91(0.09)
11 0.77(0.11) 0.77(0.12) 0.63(0.1) 0.75(0.13) 0.62(0.1) 0.63(0.11)
12 0.85(0.11) 0.86(0.12) 0.83(0.11) 0.84(0.1) 0.85(0.11) 0.92(0.07)
Table 6.11: TPR of each method in the 12 outlier scenarios. c = 0.2; k = 10; λ = 0.5; ν = 1.5.
# Scenario MART DEV LD DFB OSD BHT DBHT
1 0.3(0.12) 0.35(0.13) 0.43(0.13) 0.37(0.12) 0.48(0.14) 0.44(0.11) 0.49(0.14)
2 0.22(0.1) 0.28(0.12) 0.37(0.15) 0.32(0.12) 0.36(0.14) 0.33(0.15) 0.39(0.14)
3 0.52(0.18) 0.59(0.13) 0.62(0.14) 0.5(0.13) 0.69(0.13) 0.6(0.13) 0.69(0.14)
4 0.22(0.12) 0.23(0.09) 0.36(0.12) 0.29(0.07) 0.35(0.12) 0.32(0.12) 0.36(0.11)
5 0.46(0.15) 0.52(0.13) 0.57(0.16) 0.45(0.13) 0.62(0.11) 0.58(0.13) 0.62(0.12)
6 0.22(0.1) 0.28(0.09) 0.36(0.1) 0.28(0.09) 0.34(0.1) 0.36(0.13) 0.36(0.09)
7 0.45(0.14) 0.53(0.12) 0.41(0.11) 0.43(0.13) 0.46(0.16) 0.39(0.11) 0.43(0.11)
8 0.19(0.1) 0.21(0.11) 0.27(0.13) 0.25(0.13) 0.27(0.12) 0.24(0.09) 0.29(0.13)
9 0.34(0.13) 0.39(0.12) 0.19(0.13) 0.28(0.11) 0.13(0.1) 0.11(0.09) 0.1(0.1)
10 0.57(0.12) 0.65(0.11) 0.67(0.13) 0.55(0.12) 0.74(0.1) 0.61(0.12) 0.74(0.11)
11 0.44(0.15) 0.48(0.14) 0.25(0.12) 0.33(0.1) 0.18(0.13) 0.12(0.11) 0.16(0.16)
12 0.52(0.14) 0.61(0.17) 0.6(0.12) 0.5(0.12) 0.7(0.1) 0.56(0.13) 0.68(0.1)
Table 6.12: AUC of each method in the 12 outlier scenarios. c = 0.2; k = 10; λ = 0.5; ν = 1.5.
# Scenario MART DEV LD DFB BHT DBHT
1 0.68(0.1) 0.69(0.09) 0.72(0.1) 0.67(0.1) 0.79(0.08) 0.82(0.07)
2 0.64(0.09) 0.64(0.09) 0.72(0.1) 0.63(0.08) 0.71(0.1) 0.75(0.1)
3 0.8(0.1) 0.8(0.1) 0.79(0.11) 0.76(0.11) 0.86(0.07) 0.9(0.06)
4 0.63(0.07) 0.63(0.09) 0.73(0.1) 0.65(0.1) 0.73(0.08) 0.76(0.08)
5 0.79(0.1) 0.8(0.11) 0.79(0.13) 0.74(0.11) 0.85(0.09) 0.86(0.06)
6 0.63(0.07) 0.62(0.09) 0.7(0.1) 0.6(0.1) 0.71(0.06) 0.73(0.05)
7 0.79(0.1) 0.78(0.1) 0.64(0.1) 0.75(0.1) 0.7(0.09) 0.7(0.1)
8 0.61(0.08) 0.61(0.09) 0.66(0.09) 0.61(0.08) 0.64(0.09) 0.68(0.09)
9 0.76(0.09) 0.73(0.1) 0.6(0.07) 0.68(0.08) 0.57(0.07) 0.59(0.07)
10 0.85(0.1) 0.85(0.1) 0.83(0.09) 0.8(0.1) 0.87(0.07) 0.92(0.05)
11 0.82(0.09) 0.8(0.07) 0.59(0.08) 0.74(0.08) 0.57(0.07) 0.61(0.08)
12 0.84(0.07) 0.84(0.08) 0.77(0.1) 0.8(0.07) 0.84(0.05) 0.89(0.05)
Table 6.13: TPR of each method in the 12 outlier scenarios. c = 0.3; k = 5; λ = 0.5; ν = 1.5.
# Scenario MART DEV LD DFB OSD BHT DBHT
1 0.27(0.17) 0.36(0.18) 0.44(0.2) 0.36(0.16) 0.46(0.24) 0.42(0.2) 0.45(0.21)
2 0.17(0.15) 0.2(0.17) 0.24(0.19) 0.26(0.2) 0.24(0.2) 0.26(0.2) 0.24(0.2)
3 0.45(0.23) 0.54(0.2) 0.55(0.18) 0.51(0.18) 0.57(0.23) 0.52(0.17) 0.59(0.18)
4 0.18(0.17) 0.19(0.15) 0.31(0.18) 0.28(0.17) 0.28(0.21) 0.24(0.17) 0.29(0.18)
5 0.38(0.19) 0.56(0.19) 0.53(0.17) 0.51(0.15) 0.55(0.18) 0.52(0.16) 0.56(0.17)
6 0.2(0.16) 0.17(0.15) 0.22(0.18) 0.22(0.17) 0.2(0.19) 0.19(0.18) 0.21(0.18)
7 0.35(0.2) 0.46(0.18) 0.38(0.19) 0.4(0.17) 0.38(0.18) 0.34(0.18) 0.36(0.17)
8 0.14(0.15) 0.17(0.15) 0.21(0.16) 0.2(0.15) 0.18(0.17) 0.15(0.12) 0.18(0.15)
9 0.3(0.17) 0.34(0.18) 0.15(0.16) 0.21(0.15) 0.05(0.1) 0.06(0.11) 0.05(0.1)
10 0.52(0.24) 0.65(0.18) 0.63(0.18) 0.59(0.18) 0.62(0.2) 0.58(0.19) 0.65(0.17)
11 0.31(0.18) 0.42(0.19) 0.2(0.15) 0.25(0.15) 0.08(0.12) 0.06(0.11) 0.07(0.11)
12 0.45(0.22) 0.56(0.17) 0.52(0.17) 0.49(0.17) 0.52(0.19) 0.51(0.19) 0.55(0.18)
Table 6.14: AUC of each method in the 12 outlier scenarios. c = 0.3; k = 5; λ = 0.5; ν = 1.5.
# Scenario MART DEV LD DFB BHT DBHT
1 0.74(0.13) 0.73(0.14) 0.77(0.14) 0.73(0.12) 0.77(0.13) 0.82(0.14)
2 0.63(0.12) 0.64(0.11) 0.66(0.13) 0.63(0.14) 0.7(0.13) 0.71(0.11)
3 0.8(0.16) 0.8(0.15) 0.77(0.14) 0.78(0.16) 0.85(0.11) 0.89(0.1)
4 0.65(0.11) 0.64(0.12) 0.71(0.13) 0.67(0.13) 0.74(0.14) 0.74(0.13)
5 0.77(0.13) 0.76(0.14) 0.74(0.14) 0.75(0.13) 0.82(0.11) 0.84(0.11)
6 0.63(0.11) 0.63(0.11) 0.68(0.12) 0.63(0.12) 0.65(0.12) 0.68(0.11)
7 0.75(0.15) 0.75(0.14) 0.67(0.16) 0.73(0.14) 0.7(0.13) 0.73(0.14)
8 0.65(0.1) 0.63(0.11) 0.65(0.1) 0.63(0.12) 0.65(0.12) 0.68(0.13)
9 0.71(0.13) 0.71(0.13) 0.63(0.13) 0.68(0.12) 0.61(0.11) 0.6(0.1)
10 0.84(0.13) 0.84(0.13) 0.82(0.11) 0.83(0.13) 0.87(0.11) 0.91(0.08)
11 0.74(0.14) 0.73(0.14) 0.63(0.11) 0.72(0.13) 0.61(0.1) 0.62(0.11)
12 0.76(0.16) 0.76(0.15) 0.71(0.17) 0.74(0.14) 0.82(0.11) 0.85(0.09)
Table 6.15: TPR of each method in the 12 outlier scenarios. c = 0.3; k = 10; λ = 0.5; ν = 1.5.
# Scenario MART DEV LD DFB OSD BHT DBHT
1 0.33(0.11) 0.34(0.11) 0.43(0.14) 0.37(0.11) 0.49(0.15) 0.43(0.12) 0.49(0.13)
2 0.28(0.12) 0.32(0.11) 0.33(0.12) 0.32(0.1) 0.39(0.15) 0.37(0.16) 0.4(0.13)
3 0.56(0.11) 0.55(0.11) 0.52(0.13) 0.47(0.1) 0.64(0.13) 0.52(0.12) 0.64(0.13)
4 0.26(0.09) 0.27(0.11) 0.32(0.14) 0.32(0.13) 0.35(0.13) 0.3(0.12) 0.37(0.13)
5 0.53(0.15) 0.56(0.14) 0.48(0.14) 0.46(0.11) 0.59(0.15) 0.51(0.13) 0.58(0.14)
6 0.22(0.12) 0.24(0.12) 0.29(0.12) 0.27(0.11) 0.3(0.14) 0.25(0.12) 0.31(0.14)
7 0.46(0.15) 0.48(0.13) 0.36(0.14) 0.4(0.12) 0.45(0.15) 0.33(0.13) 0.42(0.12)
8 0.19(0.09) 0.22(0.12) 0.28(0.13) 0.27(0.13) 0.29(0.14) 0.25(0.11) 0.29(0.12)
9 0.38(0.12) 0.38(0.13) 0.19(0.13) 0.28(0.11) 0.15(0.12) 0.1(0.1) 0.1(0.11)
10 0.61(0.14) 0.58(0.14) 0.51(0.15) 0.55(0.1) 0.71(0.12) 0.62(0.13) 0.72(0.11)
11 0.41(0.16) 0.49(0.12) 0.26(0.13) 0.34(0.13) 0.18(0.11) 0.13(0.11) 0.16(0.1)
12 0.54(0.08) 0.55(0.2) 0.42(0.18) 0.49(0.12) 0.59(0.17) 0.46(0.15) 0.55(0.14)
Table 6.16: AUC of each method in the 12 outlier scenarios. c = 0.3; k = 10; λ = 0.5; ν = 1.5.
# Scenario MART DEV LD DFB BHT DBHT
1 0.68(0.09) 0.67(0.09) 0.71(0.11) 0.66(0.1) 0.77(0.11) 0.81(0.09)
2 0.64(0.1) 0.64(0.08) 0.66(0.11) 0.61(0.08) 0.72(0.1) 0.76(0.07)
3 0.77(0.09) 0.76(0.08) 0.72(0.09) 0.72(0.09) 0.81(0.08) 0.88(0.05)
4 0.63(0.1) 0.64(0.09) 0.67(0.09) 0.62(0.1) 0.68(0.07) 0.73(0.09)
5 0.79(0.09) 0.78(0.1) 0.7(0.1) 0.75(0.1) 0.8(0.08) 0.83(0.08)
6 0.61(0.08) 0.61(0.08) 0.62(0.11) 0.6(0.08) 0.65(0.09) 0.69(0.09)
7 0.73(0.11) 0.72(0.1) 0.62(0.1) 0.7(0.1) 0.66(0.1) 0.7(0.09)
8 0.6(0.07) 0.6(0.08) 0.63(0.1) 0.61(0.08) 0.65(0.09) 0.69(0.08)
9 0.72(0.09) 0.69(0.1) 0.58(0.08) 0.66(0.09) 0.57(0.06) 0.58(0.07)
10 0.81(0.1) 0.79(0.1) 0.7(0.11) 0.78(0.09) 0.87(0.06) 0.9(0.05)
11 0.77(0.1) 0.75(0.09) 0.59(0.09) 0.72(0.09) 0.57(0.07) 0.58(0.08)
12 0.76(0.1) 0.73(0.13) 0.65(0.13) 0.73(0.11) 0.77(0.08) 0.82(0.07)
Table 6.17: TPR of each method in the 12 outlier scenarios. c = 0.2; k = 5; λ = 1.5; ν = 0.5.
# Scenario MART DEV LD DFB OSD BHT DBHT
1 0.25(0.21) 0.36(0.19) 0.42(0.2) 0.38(0.19) 0.45(0.2) 0.42(0.2) 0.41(0.19)
2 0.2(0.17) 0.26(0.15) 0.36(0.19) 0.29(0.17) 0.3(0.18) 0.33(0.23) 0.36(0.21)
3 0.35(0.18) 0.62(0.19) 0.62(0.16) 0.51(0.21) 0.56(0.21) 0.62(0.17) 0.6(0.18)
4 0.14(0.13) 0.22(0.16) 0.25(0.17) 0.21(0.15) 0.25(0.16) 0.23(0.18) 0.24(0.18)
5 0.4(0.16) 0.64(0.17) 0.61(0.18) 0.5(0.17) 0.65(0.16) 0.63(0.15) 0.64(0.12)
6 0.24(0.17) 0.2(0.17) 0.25(0.18) 0.25(0.17) 0.29(0.2) 0.21(0.14) 0.25(0.16)
7 0.36(0.15) 0.48(0.18) 0.39(0.14) 0.4(0.11) 0.38(0.24) 0.31(0.17) 0.37(0.2)
8 0.16(0.14) 0.11(0.14) 0.15(0.16) 0.16(0.15) 0.13(0.13) 0.12(0.15) 0.11(0.14)
9 0.3(0.17) 0.36(0.17) 0.2(0.15) 0.27(0.18) 0.07(0.12) 0.04(0.1) 0.04(0.1)
10 0.48(0.26) 0.64(0.12) 0.71(0.12) 0.61(0.15) 0.66(0.2) 0.63(0.12) 0.71(0.12)
11 0.39(0.18) 0.48(0.18) 0.23(0.19) 0.33(0.18) 0.14(0.15) 0.11(0.14) 0.11(0.15)
12 0.47(0.21) 0.63(0.18) 0.62(0.19) 0.54(0.16) 0.54(0.18) 0.5(0.21) 0.56(0.19)
Table 6.18: AUC of each method in the 12 outlier scenarios. c = 0.2; k = 5; λ = 1.5; ν = 0.5.
# Scenario MART DEV LD DFB BHT DBHT
1 0.73(0.11) 0.72(0.11) 0.79(0.11) 0.72(0.12) 0.83(0.08) 0.86(0.08)
2 0.7(0.12) 0.7(0.14) 0.74(0.14) 0.69(0.11) 0.79(0.12) 0.78(0.12)
3 0.8(0.13) 0.83(0.12) 0.82(0.15) 0.77(0.15) 0.88(0.09) 0.91(0.09)
4 0.63(0.11) 0.61(0.14) 0.68(0.11) 0.62(0.12) 0.72(0.13) 0.73(0.11)
5 0.82(0.12) 0.84(0.12) 0.82(0.17) 0.79(0.12) 0.87(0.1) 0.88(0.1)
6 0.65(0.11) 0.64(0.12) 0.69(0.15) 0.66(0.13) 0.72(0.14) 0.74(0.12)
7 0.77(0.11) 0.77(0.12) 0.68(0.14) 0.74(0.13) 0.72(0.15) 0.73(0.13)
8 0.6(0.11) 0.62(0.11) 0.66(0.09) 0.64(0.12) 0.67(0.1) 0.64(0.12)
9 0.75(0.13) 0.76(0.13) 0.65(0.11) 0.7(0.13) 0.64(0.12) 0.64(0.1)
10 0.82(0.11) 0.83(0.1) 0.84(0.12) 0.81(0.1) 0.89(0.07) 0.92(0.06)
11 0.8(0.1) 0.78(0.11) 0.64(0.12) 0.75(0.14) 0.65(0.1) 0.66(0.1)
12 0.86(0.12) 0.87(0.12) 0.83(0.15) 0.83(0.13) 0.84(0.12) 0.88(0.11)
Table 6.19: TPR of each method in the 12 outlier scenarios. c = 0.2; k = 10; λ = 1.5; ν = 0.5.
# Scenario MART DEV LD DFB OSD BHT DBHT
1 0.31(0.11) 0.36(0.11) 0.47(0.13) 0.34(0.12) 0.54(0.13) 0.46(0.12) 0.52(0.11)
2 0.25(0.11) 0.25(0.13) 0.33(0.13) 0.29(0.11) 0.34(0.12) 0.31(0.13) 0.35(0.14)
3 0.54(0.17) 0.6(0.15) 0.61(0.14) 0.53(0.14) 0.69(0.15) 0.61(0.13) 0.68(0.13)
4 0.25(0.11) 0.26(0.11) 0.38(0.12) 0.32(0.13) 0.38(0.12) 0.34(0.13) 0.39(0.13)
5 0.48(0.16) 0.54(0.15) 0.51(0.16) 0.48(0.11) 0.63(0.15) 0.5(0.14) 0.6(0.14)
6 0.21(0.09) 0.25(0.13) 0.32(0.12) 0.3(0.1) 0.33(0.11) 0.3(0.13) 0.33(0.12)
7 0.44(0.13) 0.52(0.15) 0.43(0.12) 0.44(0.13) 0.5(0.14) 0.4(0.09) 0.49(0.11)
8 0.2(0.13) 0.19(0.13) 0.26(0.12) 0.24(0.13) 0.25(0.14) 0.21(0.11) 0.27(0.14)
9 0.37(0.12) 0.39(0.13) 0.19(0.13) 0.3(0.13) 0.11(0.11) 0.08(0.08) 0.09(0.1)
10 0.58(0.17) 0.66(0.14) 0.65(0.14) 0.58(0.1) 0.73(0.14) 0.58(0.13) 0.72(0.12)
11 0.39(0.14) 0.52(0.15) 0.29(0.11) 0.36(0.13) 0.18(0.09) 0.11(0.09) 0.16(0.08)
12 0.54(0.15) 0.62(0.15) 0.53(0.18) 0.53(0.12) 0.67(0.15) 0.52(0.14) 0.66(0.13)
Table 6.20: AUC of each method in the 12 outlier scenarios. c = 0.2; k = 10; λ = 1.5; ν = 0.5.
# Scenario MART DEV LD DFB BHT DBHT
1 0.68(0.09) 0.69(0.11) 0.74(0.11) 0.66(0.1) 0.78(0.08) 0.83(0.06)
2 0.63(0.1) 0.63(0.09) 0.69(0.1) 0.62(0.1) 0.7(0.08) 0.75(0.08)
3 0.82(0.1) 0.81(0.09) 0.81(0.1) 0.78(0.1) 0.86(0.08) 0.9(0.06)
4 0.62(0.09) 0.62(0.09) 0.7(0.09) 0.62(0.08) 0.71(0.09) 0.75(0.08)
5 0.77(0.11) 0.77(0.11) 0.71(0.13) 0.73(0.11) 0.8(0.08) 0.85(0.08)
6 0.61(0.07) 0.62(0.08) 0.69(0.08) 0.6(0.08) 0.67(0.09) 0.72(0.08)
7 0.79(0.1) 0.78(0.12) 0.66(0.11) 0.75(0.11) 0.69(0.09) 0.72(0.11)
8 0.62(0.08) 0.6(0.08) 0.65(0.1) 0.61(0.09) 0.62(0.09) 0.66(0.1)
9 0.77(0.1) 0.74(0.1) 0.59(0.08) 0.71(0.1) 0.59(0.06) 0.59(0.07)
10 0.84(0.11) 0.84(0.1) 0.81(0.1) 0.81(0.1) 0.86(0.08) 0.92(0.06)
11 0.8(0.1) 0.8(0.11) 0.61(0.08) 0.75(0.09) 0.57(0.07) 0.59(0.08)
12 0.84(0.1) 0.84(0.1) 0.75(0.12) 0.81(0.1) 0.8(0.1) 0.88(0.08)
Table 6.21: TPR of each method in the 12 outlier scenarios. c = 0.3; k = 5; λ = 1.5; ν = 0.5.
# Scenario MART DEV LD DFB OSD BHT DBHT
1 0.26(0.17) 0.37(0.14) 0.38(0.17) 0.38(0.18) 0.41(0.17) 0.36(0.17) 0.42(0.16)
2 0.17(0.16) 0.17(0.16) 0.22(0.17) 0.22(0.17) 0.21(0.19) 0.23(0.19) 0.23(0.19)
3 0.48(0.25) 0.54(0.17) 0.6(0.19) 0.55(0.18) 0.55(0.2) 0.54(0.15) 0.6(0.18)
4 0.17(0.14) 0.2(0.15) 0.22(0.15) 0.24(0.14) 0.23(0.15) 0.22(0.13) 0.26(0.15)
5 0.36(0.17) 0.52(0.15) 0.56(0.2) 0.47(0.18) 0.53(0.16) 0.51(0.17) 0.55(0.18)
6 0.17(0.13) 0.16(0.15) 0.21(0.19) 0.22(0.14) 0.18(0.17) 0.19(0.16) 0.2(0.16)
7 0.36(0.2) 0.5(0.2) 0.41(0.2) 0.46(0.19) 0.42(0.23) 0.38(0.22) 0.42(0.23)
8 0.16(0.17) 0.14(0.13) 0.18(0.16) 0.18(0.16) 0.13(0.17) 0.15(0.15) 0.16(0.15)
9 0.32(0.21) 0.32(0.17) 0.12(0.13) 0.22(0.18) 0.03(0.07) 0.05(0.08) 0.03(0.07)
10 0.45(0.2) 0.69(0.19) 0.68(0.16) 0.53(0.17) 0.67(0.17) 0.61(0.17) 0.7(0.16)
11 0.34(0.21) 0.46(0.14) 0.22(0.16) 0.26(0.16) 0.14(0.17) 0.12(0.13) 0.1(0.14)
12 0.46(0.24) 0.62(0.19) 0.56(0.18) 0.56(0.18) 0.57(0.19) 0.55(0.17) 0.59(0.18)
Table 6.22: AUC of each method in the 12 outlier scenarios. c = 0.3; k = 5; λ = 1.5; ν = 0.5.
# Scenario MART DEV LD DFB BHT DBHT
1 0.68(0.12) 0.68(0.12) 0.71(0.13) 0.66(0.15) 0.74(0.12) 0.78(0.11)
2 0.63(0.09) 0.61(0.11) 0.69(0.11) 0.62(0.1) 0.66(0.11) 0.72(0.1)
3 0.82(0.15) 0.82(0.13) 0.81(0.13) 0.8(0.14) 0.84(0.11) 0.9(0.08)
4 0.62(0.12) 0.63(0.1) 0.69(0.12) 0.61(0.12) 0.67(0.13) 0.71(0.13)
5 0.76(0.18) 0.74(0.17) 0.77(0.2) 0.74(0.16) 0.83(0.1) 0.85(0.15)
6 0.62(0.1) 0.63(0.09) 0.65(0.12) 0.62(0.1) 0.66(0.13) 0.68(0.12)
7 0.76(0.14) 0.76(0.14) 0.69(0.13) 0.75(0.16) 0.72(0.14) 0.73(0.14)
8 0.63(0.11) 0.62(0.11) 0.63(0.11) 0.64(0.14) 0.66(0.12) 0.68(0.13)
9 0.74(0.13) 0.71(0.13) 0.6(0.09) 0.67(0.12) 0.61(0.1) 0.6(0.1)
10 0.8(0.15) 0.82(0.14) 0.81(0.15) 0.78(0.14) 0.87(0.09) 0.91(0.08)
11 0.76(0.13) 0.75(0.12) 0.61(0.08) 0.7(0.1) 0.6(0.11) 0.6(0.12)
12 0.82(0.14) 0.82(0.13) 0.78(0.14) 0.8(0.13) 0.82(0.15) 0.87(0.11)
Table 6.23: TPR of each method in the 12 outlier scenarios. c = 0.3; k = 10; λ = 1.5; ν = 0.5.
# Scenario MART DEV LD DFB OSD BHT DBHT
1 0.32(0.12) 0.38(0.08) 0.44(0.14) 0.4(0.12) 0.54(0.12) 0.46(0.11) 0.54(0.09)
2 0.22(0.12) 0.27(0.11) 0.34(0.12) 0.28(0.13) 0.36(0.15) 0.34(0.14) 0.38(0.15)
3 0.51(0.13) 0.51(0.1) 0.46(0.14) 0.46(0.12) 0.61(0.14) 0.57(0.15) 0.61(0.14)
4 0.22(0.11) 0.24(0.12) 0.29(0.1) 0.3(0.15) 0.33(0.12) 0.32(0.15) 0.33(0.13)
5 0.5(0.11) 0.47(0.13) 0.39(0.14) 0.45(0.11) 0.57(0.16) 0.51(0.15) 0.54(0.14)
6 0.23(0.11) 0.24(0.14) 0.28(0.17) 0.26(0.09) 0.26(0.16) 0.25(0.15) 0.28(0.14)
7 0.44(0.17) 0.49(0.14) 0.37(0.14) 0.4(0.13) 0.44(0.18) 0.37(0.13) 0.44(0.16)
8 0.2(0.13) 0.24(0.13) 0.29(0.15) 0.26(0.14) 0.28(0.14) 0.25(0.11) 0.28(0.15)
9 0.34(0.12) 0.34(0.15) 0.18(0.13) 0.27(0.14) 0.13(0.1) 0.09(0.07) 0.08(0.07)
10 0.57(0.16) 0.61(0.15) 0.5(0.17) 0.56(0.1) 0.69(0.15) 0.55(0.11) 0.67(0.11)
11 0.39(0.15) 0.45(0.12) 0.27(0.1) 0.36(0.14) 0.16(0.11) 0.14(0.12) 0.15(0.13)
12 0.52(0.14) 0.56(0.15) 0.46(0.14) 0.51(0.14) 0.62(0.15) 0.48(0.11) 0.57(0.11)
Table 6.24: AUC of each method in the 12 outlier scenarios. c = 0.3; k = 10; λ = 1.5; ν = 0.5.
# Scenario MART DEV LD DFB BHT DBHT
1 0.69(0.09) 0.68(0.09) 0.72(0.11) 0.68(0.09) 0.8(0.07) 0.83(0.05)
2 0.62(0.09) 0.61(0.1) 0.69(0.1) 0.62(0.11) 0.71(0.1) 0.75(0.09)
3 0.76(0.11) 0.75(0.11) 0.68(0.14) 0.74(0.1) 0.84(0.07) 0.88(0.07)
4 0.64(0.09) 0.63(0.09) 0.65(0.1) 0.63(0.09) 0.71(0.1) 0.69(0.11)
5 0.75(0.08) 0.71(0.09) 0.64(0.1) 0.73(0.08) 0.79(0.08) 0.81(0.08)
6 0.63(0.1) 0.63(0.1) 0.66(0.12) 0.61(0.1) 0.66(0.13) 0.68(0.11)
7 0.71(0.12) 0.71(0.11) 0.61(0.09) 0.69(0.11) 0.67(0.11) 0.7(0.13)
8 0.64(0.11) 0.63(0.1) 0.66(0.09) 0.62(0.08) 0.63(0.08) 0.67(0.09)
9 0.71(0.11) 0.68(0.11) 0.58(0.08) 0.65(0.09) 0.59(0.05) 0.58(0.07)
10 0.8(0.11) 0.8(0.12) 0.69(0.12) 0.78(0.12) 0.83(0.08) 0.9(0.06)
11 0.77(0.13) 0.74(0.12) 0.57(0.09) 0.73(0.1) 0.6(0.07) 0.59(0.08)
12 0.76(0.09) 0.75(0.09) 0.68(0.1) 0.74(0.09) 0.77(0.09) 0.83(0.09)