Trend Analysis and Risk Identification
Lenka Nováková1, Jiří Kléma1, Michal Jakob1, Simon Rawles2, Olga Štěpánková1
1 The Gerstner Laboratory for Intelligent Decision Making and Control, Czech Technical University, Prague
2 Department of Computer Science, University of Bristol, Bristol, UK
PKDD 2003, Discovery Challenge
Outline
STULONG data, orientation towards CVD
Used tools
– SumatraTT, Statistica, Weka
Used techniques
– mainly statistical tests: ANOVA, chi-square, etc.
Exploratory analysis and subgroup discovery
– Entry table
Trend analysis
– Entry and Control tables
– three principal ways of preprocessing
– derived aggregated attributes
– univariate and multivariate analysis
STULONG Data
Four tables: Entry, Control, Letter, Death
Dependent variable: CVD
– CardioVascular Disease
– boolean attribute derived from the A2 questionnaire (Control table)
CVD = false: the patient has no coronary disease.
CVD = true: the patient has at least one of these attributes true (Hodn1, Hodn2, Hodn3, Hodn11, Hodn13, Hodn14)
We remove patients who have only diabetes (Hodn4) or cancer (Hodn15).
Diagnoses covered by the CVD attributes: positive angina pectoris, (silent) myocardial infarction, cerebrovascular accident, ischaemic heart disease
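The labelling rule above can be sketched in a few lines. This is only an illustration: the record layout (a dict of boolean Hodn flags per patient) and the function name `derive_cvd` are assumptions, and "diabetes or cancer only" is read here as "Hodn4 or Hodn15 set while no CVD attribute is set".

```python
# Hypothetical record layout: one dict of boolean Hodn flags per patient.
CVD_FLAGS = ("Hodn1", "Hodn2", "Hodn3", "Hodn11", "Hodn13", "Hodn14")
EXCLUDE_FLAGS = ("Hodn4", "Hodn15")  # diabetes, cancer


def derive_cvd(patient):
    """Return True/False for the CVD label, or None if the patient is removed."""
    if any(patient.get(flag, False) for flag in CVD_FLAGS):
        return True
    # Assumed reading of the slide: diabetes or cancer without any CVD flag
    # means the patient is dropped from the analysis.
    if any(patient.get(flag, False) for flag in EXCLUDE_FLAGS):
        return None
    return False


def label_patients(patients):
    """Keep only patients with a defined label; returns (patient, cvd) pairs."""
    labelled = ((p, derive_cvd(p)) for p in patients)
    return [(p, cvd) for p, cvd in labelled if cvd is not None]
```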
ENTRY - subgroup discovery
AQ no.6: Are there any differences in the ENTRY examination for different CVD groups?
Statistica 6.0
– module for interactive decision tree induction
– two-tailed t-test or chi-square test to assess significance of subgroups
Dependencies are relatively weak
Interesting dependencies found
– social characteristics: derived attribute AGE_of_ENTRY
– alcohol: positive effect of beer, no effect of wine
– sugar consumption increases CVD risk
– well-known dependencies are not mentioned (smoking, BMI, cholesterol)
ENTRY - general model
General CVD model (in WEKA)
– feature selection + modeling (e.g., decision trees)
– tends to generate trivial models (always predicting false)
– asymmetric error-cost matrix does not help
Predict CVD risk
– identify principal variables (chi-squared test)
– Naïve Bayes + ROC evaluation
– three independent variables
• discretized AGE_of_ENTRY
• discretized BMI
• Cholrisk, derived from CHLST
– AUC = 0.66
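For readers unfamiliar with the AUC figure, a minimal sketch of how it can be computed from model scores follows. It uses the rank interpretation (the share of positive/negative pairs that the score orders correctly), not WEKA's own implementation; the function name `auc` and the toy scores are illustrative assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count 0.5).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores, not STULONG data).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [False, False, True, True]
toy_auc = auc(toy_scores, toy_labels)
```

A value of 0.5 corresponds to a score with no discriminative power; values below 0.5 mean that higher scores go with the negative class.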
CONTROL - trend analysis
AQ no.7: Are there any differences in development of risk factors for different CVD groups?
ENTRY table: ICO (primary key), year of birth, year of entry, smoking, alcohol, cholesterol, body mass index, blood pressure
CONTR table: ICO, risk factors followed during 20 years
Global Approach
Risk factors to be observed are selected
– SYST, DIAST, TRIGL, BMI, CHLSTMG
Selected control examinations are transformed
– pivoting
Patients with no control entries are removed
– about 60 patients
Trend aggregates are calculated
Resulting table, one row per patient (ICO_1, ICO_2, ...): ICO, Entry, Contr1, Contr2, ..., ContrM, Aggr1, ..., AggrN
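The pivoting step can be pictured as turning one row per examination into one row per patient. The sketch below assumes a hypothetical long format in which each record is a dict with the keys `ICO`, `exam` (0 for entry, 1..M for controls) and the risk factor value; these key names are not taken from the STULONG schema.

```python
from collections import defaultdict


def pivot(records, factor):
    """Pivot long examination records into one ordered series per patient.

    records: iterable of dicts with hypothetical keys "ICO", "exam", factor.
    Returns {ICO: [entry value, control 1 value, control 2 value, ...]}.
    """
    by_patient = defaultdict(dict)
    for rec in records:
        by_patient[rec["ICO"]][rec["exam"]] = rec[factor]
    # Order each patient's values by examination number (0 = entry).
    return {ico: [vals[k] for k in sorted(vals)] for ico, vals in by_patient.items()}


def drop_without_controls(series_by_patient):
    """Remove patients who have the entry examination only."""
    return {ico: s for ico, s in series_by_patient.items() if len(s) > 1}
```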
Derived trend attributes
Intercept
Gradient
Correlation coefficient
Standard deviation
Mean
x: decimal time (~ year + month/12), measured from the referential time (1975)
y: observed variable
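A minimal sketch of these aggregates for one patient and one risk factor, using an ordinary least-squares line through the (x, y) points. Subtracting the reference year so that the intercept is the fitted value in 1975, and using the sample (n - 1) standard deviation, are assumptions; the slides do not spell out either choice. Function names are illustrative.

```python
import math

REFERENCE_YEAR = 1975  # referential time named on the slide


def decimal_time(year, month):
    """Decimal time relative to the reference year (assumed convention)."""
    return (year - REFERENCE_YEAR) + month / 12.0


def trend_aggregates(xs, ys):
    """Least-squares trend aggregates of y over x for one series."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all observations share the same time")
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    # Correlation is undefined for a constant series; report 0.0 in that case.
    correlation = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    return {
        "intercept": intercept,
        "gradient": gradient,
        "correlation": correlation,
        "std": math.sqrt(syy / (n - 1)),  # sample standard deviation
        "mean": mean_y,
    }
```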
Global Approach - results
The derived aggregates were discretized
– e.g., the gradient can be strongly decreasing, decreasing, constant, increasing, strongly increasing
Chi-square test for independence w.r.t. CVD
A large number of aggregates proved to be significant, including gradients (chi-square test at the 0.05 significance level)
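The discretization and the independence test can be sketched as follows. The cut points in `discretize_gradient` are hypothetical (the slides do not give the thresholds), and the test is the textbook Pearson chi-square statistic compared against standard 0.05 critical values rather than Statistica's output.

```python
# Standard chi-square critical values at the 0.05 level, by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def discretize_gradient(g, weak=0.2, strong=1.0):
    """Map a gradient to one of five trend categories (thresholds are assumed)."""
    if g <= -strong:
        return "strongly decreasing"
    if g <= -weak:
        return "decreasing"
    if g < weak:
        return "constant"
    if g < strong:
        return "increasing"
    return "strongly increasing"


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table.

    table: list of rows (e.g. CVD = false / true), each a list of category
    counts. Every row and column total must be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def is_significant(table):
    """True if independence is rejected at the 0.05 level (dof 1 to 4 only)."""
    stat, dof = chi_square(table)
    return stat > CHI2_CRITICAL_05[dof]
```

With two CVD groups and five trend categories the table is 2 x 5, giving 4 degrees of freedom.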
Figure: TRIGL_MG_Grad pie charts by CVD group (legend: strongly decreasing, decreasing, constant, increasing, strongly increasing; slice values in extraction order)
CVD = false: 78 (15%), 82 (16%), 128 (24%), 46 (9%), 192 (37%)
CVD = true: 44 (21%), 52 (25%), 48 (23%), 37 (18%), 27 (13%)
Figure: TRIGL_MG_Grad pie charts by TRIGLMGCount (legend: strongly decreasing, decreasing, constant, increasing, strongly increasing; slice values in extraction order)
TRIGLMGCount <= 5: 50 (24%), 9 (4%), 62 (30%), 16 (8%), 73 (35%)
TRIGLMGCount (5,13]: 42 (19%), 48 (22%), 53 (24%), 74 (34%)
TRIGLMGCount (13,18]: 24 (10%), 20 (8%), 79 (32%), 124 (50%)
TRIGLMGCount > 18: 6 (10%), 17 (28%), 37 (62%)
ControlCount vs. CVD
ControlCount
– number of examinations
– strong relation with CVD
– AUC = 0.35
– higher ControlCount goes with lower CVD risk (AUC below 0.5)
– anachronistic attribute
– introduced by the design of the study
ControlCount influences the trend aggregates
– with a low ControlCount, gradients tend to be steeper, etc.
Conclusion: the global approach cannot be applied (at least with these aggregates)
Windowing Approach I.
The same risk factors, the same pivoting transformation and similar trend aggregates, BUT a constant number of examinations
Issues:
– window
• time period vs. number of examinations
• 5 examinations are enough to express trend
– patients : records (1 : ControlCount – 3)
• entry is used as the first examination
• records are dependent
– CVD classification
• time from the last examination to CVD
• yes/no (yes = CVD in the next year or CVD in future)
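The record count in the list above follows from sliding a window of 5 examinations over each patient's series, with the entry examination counted as the first one. A minimal sketch, with the function name `windows` and the example values as illustrative assumptions:

```python
WINDOW_SIZE = 5  # "5 examinations are enough to express trend"


def windows(exams, size=WINDOW_SIZE):
    """Return all consecutive windows of `size` examinations.

    exams: the entry value followed by the control values, in time order.
    A patient with ControlCount controls has ControlCount + 1 values and
    therefore yields ControlCount - 3 windows (none if there are too few).
    """
    return [exams[i:i + size] for i in range(len(exams) - size + 1)]


# Made-up series: entry plus 6 control examinations -> 6 - 3 = 3 records.
example = [120, 125, 130, 128, 135, 140, 138]
example_records = windows(example)
```

Because consecutive windows share four of their five examinations, the resulting records are not independent, which is the dependence issue the slide points out.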
Figure: Windowing Approach I. (window sliding over the patient's data, starting at Entry; labels: First vector, New vector)
Aggregate tests
Trend aggregates approach the normal distribution in all (both) the specified CVD groups
Two groups were selected: CVD never appears in the future (1000) vs. CVD appears at the next exam (1)
T-test for comparison of the group means can be applied (p <= 0.05)
Do the means of the calculated aggregates differ in the different CVD groups?
Only a few of them do
– only two variables (gradients!) are clearly significant
• SYST and DIAST
– two significant intercepts
• TRIGL and CHLST
T-tests; Grouping: CVD (Group 1: 1000, Group 2: 1)
Variable      Mean 1000   Mean 1   t-value    p
DIASTTrend2   -0.0802     0.5151   -3.04421   0.002345
SYSTTrend2     0.3296     1.1794   -2.69381   0.007088
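For reference, the t-value column can be reproduced from raw group data with a two-sample t statistic. The sketch below uses the classical pooled-variance (Student) form; whether the original analysis used pooled or separate variances is not stated on the slides, so treat that choice as an assumption. Turning t into a p-value needs the t distribution, which is left to a statistics package.

```python
import math


def t_statistic(group1, group2):
    """Pooled-variance two-sample t statistic and its degrees of freedom.

    The sign follows mean(group1) - mean(group2), so a lower mean in the
    healthy group (1000) than in the CVD group (1) gives a negative t.
    """
    n1, n2 = len(group1), len(group2)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    dof = n1 + n2 - 2
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / dof
    t = (m1 - m2) / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return t, dof
```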
Further tests of SYST, DIAST
Try to test the gradients for all the CVD groups, not only the two extreme groups
Repeated ANOVA can be applied
– development of SYST/DIAST trend for different CVD groups
Figure: Record counts categorized by time from the last control to CVD attack (x axis: Time_from_last_mod, bin limits 1, 2, 3, 4, 5, 10, 999; y axis: No of obs, 0 to 6000)
Bar labels: 122, 115, 105, 91, 91, 331, 125, 4837
Figure: Repeated ANOVA, DIAST. Healthy group (CVD=1000) vs. group getting ill at the next exam (CVD=1). Vertical bars denote 0.95 confidence intervals. Series: CVD_categ 1, CVD_categ 1000; x axis: R1 (DIAST1 to DIAST5); y axis: DV_1 (81 to 89).
Figure: Repeated ANOVA, DIAST. Getting ill after 5 exams (CVD=5) vs. getting ill at the next exam (CVD=1). Vertical bars denote 0.95 confidence intervals. Series: CVD_categ 1, CVD_categ 5; x axis: R1 (DIAST1 to DIAST5); y axis: DV_1 (80 to 89).
DIASTTrend vs Time_to_CVD
Figure: share of DIASTTrend categories for each Time_from_last_mod bin (one pie chart per bin)
Time_from_last_mod   <= -0.5   (-0.5,0]   (0,0.5]   > 0.5
<= 1                 13%       33%        30%       25%
(1,2]                18%       31%        30%       21%
(2,3]                22%       31%        33%       13%
(3,4]                30%       41%        24%       5%
(4,5]                24%       43%        25%       8%
(5,10]               19%       42%        25%       15%
(10,999]             13%       42%        30%       14%
> 999                17%       41%        28%       14%
Windowing Approach II.
There are missing values of risk factors
Windowing I.
– skips missing values
– different numbers of rows are generated for different factors
Windowing II.
– replaces the missing values
– the same numbers of rows are generated for different factors
– enables multivariate analysis
• combination of different aggregates and their relation with CVD
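The slides do not say how Windowing II. replaces missing values. Purely as an illustration of why replacement keeps the rows aligned across factors, the sketch below carries the last observed value forward; this is a stand-in technique chosen for brevity, not the authors' method, and the function name is hypothetical.

```python
def fill_forward(values):
    """Replace None with the most recent observed value (stand-in imputation).

    Leading missing values stay None because nothing has been observed yet.
    The output always has the same length as the input, so every risk
    factor yields the same number of examinations and hence of windows.
    """
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Made-up series for two factors of the same patient.
syst = fill_forward([120, None, 130, None, None])
bmi = fill_forward([25.0, 25.5, None, 26.0, 26.2])
```

Skipping the gaps instead (as in Windowing I.) would leave 2 values for the first series and 4 for the second, so their windows could no longer be combined row by row.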
Figure: Windowing II. (window sliding over the patient's data, starting at Entry; labels: First vector, New vector)
Figure: Factorial ANOVA, BMITrend. Multiple effects: CVD x BMI risk. Vertical bars denote 0.95 confidence intervals. Series: CVD_categ 1, CVD_categ 1000; x axis: OBEZRISK (0, 1); y axis: BMITrend (-0.10 to 0.25). Annotation: 27 patients only!
Conclusions
The main scope
– AQ no.7: Are there any differences in development of risk factors for different CVD groups?
Contributions
– Pitfalls of the global approach revealed
– Using windowing: differences proved for SYST and DIAST blood pressures
– Other assumptions and ideas:
• interesting course of development of risk factors (DIAST first decreases, then increases, and CVD appears)
• other trends may have influence under specific conditions (BMITrend and overweight, etc.)