Trend Analysis and Risk Identification

Lenka Nováková1, Jiří Kléma1, Michal Jakob1, Simon Rawles2, Olga Štěpánková1

1 The Gerstner Laboratory for Intelligent Decision Making and Control, Czech Technical University, Prague

2 Department of Computer Science, University of Bristol, Bristol, UK

PKDD 2003, Discovery Challenge

Outline

STULONG data, orientation towards CVD
Used tools
– SumatraTT, Statistica, Weka
Used techniques
– mainly statistical tests: ANOVA, chi-square, etc.
Exploratory analysis and subgroup discovery
– Entry table
Trend analysis
– Entry and Control tables
– three principal ways of preprocessing
– derived aggregated attributes
– univariate and multivariate analysis

STULONG Data

Four tables: Entry, Control, Letter, Death
Dependent variable: CVD
– CardioVascular Disease
– boolean attribute derived from the A2 questionnaire (Control table)
CVD = false: the patient has no coronary disease.
CVD = true: the patient has at least one of these attributes true (Hodn1, Hodn2, Hodn3, Hodn11, Hodn13, Hodn14).
We remove patients who have only diabetes (Hodn4) or cancer (Hodn15).
Diagnoses labelled on the slide: positive angina pectoris, (silent) myocardial infarction, cerebrovascular accident, ischaemic heart disease
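The CVD labelling rule can be sketched as follows. This is only an illustration: the Hodn attribute names come from the slide, but the record layout (one dictionary per patient) and the function name derive_cvd are hypothetical, and the reading that "diabetes or cancer only" means "no CVD attribute set" is an assumption.

```python
# Attribute names are taken from the slide; the record layout is hypothetical.
CVD_ATTRS = ("Hodn1", "Hodn2", "Hodn3", "Hodn11", "Hodn13", "Hodn14")
EXCLUDE_ATTRS = ("Hodn4", "Hodn15")  # diabetes, cancer


def derive_cvd(record):
    """Return True/False for CVD, or None if the patient is to be removed."""
    if any(record.get(attr, False) for attr in CVD_ATTRS):
        return True
    if any(record.get(attr, False) for attr in EXCLUDE_ATTRS):
        # Assumption: diabetes or cancer without any CVD attribute -> removed.
        return None
    return False


# Invented example patients (not STULONG data).
patients = [
    {"ICO": 1, "Hodn2": True},
    {"ICO": 2, "Hodn4": True},
    {"ICO": 3},
    {"ICO": 4, "Hodn4": True, "Hodn13": True},
]
labelled = {p["ICO"]: derive_cvd(p) for p in patients}
kept = {ico: cvd for ico, cvd in labelled.items() if cvd is not None}
print(kept)
```

Patient 2 is dropped because only the diabetes attribute is set; patient 4 stays in the CVD = true group because a CVD attribute is set as well.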

ENTRY - subgroup discovery

AQ no.6: Are there any differences in the ENTRY examination for different CVD groups?
Statistica 6.0
– module for interactive decision tree induction
– two-tailed t-test or chi-square test to assess the significance of subgroups
Dependencies are relatively weak
Interesting dependencies found
– social characteristics: derived attribute AGE_of_ENTRY
– alcohol: positive effect of beer, no effect of wine
– sugar consumption increases CVD risk
– well-known dependencies are not mentioned (smoking, BMI, cholesterol)
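A chi-square test of a subgroup against CVD can be sketched for the 2 x 2 case as below. The counts are invented for illustration, the function name chi_square_2x2 is hypothetical, and this is not the Statistica procedure itself. With one degree of freedom the p-value has a closed form through the complementary error function.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the table [[a, b], [c, d]] (1 degree of freedom)."""
    observed = ((a, b), (c, d))
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: rows = inside / outside the subgroup, columns = CVD true / false.
chi2, p_value = chi_square_2x2(30, 70, 20, 180)
print(f"chi2 = {chi2:.2f}, significant at 0.05: {p_value < 0.05}")
```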

ENTRY - general model

General CVD model (in WEKA)
– feature selection + modeling (e.g., decision trees)
– tends to generate trivial models (always predicting false)
– asymmetric error-cost matrix does not help
Predict CVD risk
– identify principal variables (chi-squared test)
– Naïve Bayes + ROC evaluation
– three independent variables
• discretized AGE_of_ENTRY
• discretized BMI
• Cholrisk, derived from CHLST
– AUC = 0.66
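The AUC used for the ROC evaluation can be read as the probability that a randomly chosen CVD patient receives a higher score than a randomly chosen non-CVD patient. A minimal sketch of that pairwise definition follows; the scores and labels are invented and the function name auc is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Invented classifier scores and CVD labels.
scores = [0.9, 0.8, 0.3, 0.4, 0.2]
labels = [True, False, True, False, False]
value = auc(scores, labels)
inverted = auc([-s for s in scores], labels)
print(round(value, 3), round(inverted, 3))
```

Reversing the score order turns an AUC of a into 1 - a, so a value below 0.5 indicates an inverse relation between the score and CVD rather than no relation.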

CONTROL - trend analysis

AQ no.7: Are there any differences in development of risk factors for different CVD groups?
ENTRY table: ICO (primary key), year of birth, year of entry, smoking, alcohol, cholesterol, body mass index, blood pressure
CONTR table: ICO, risk factors followed during 20 years

Global Approach

Risk factors to be observed are selected
– SYST, DIAST, TRIGL, BMI, CHLSTMG
Selected control examinations are transformed
– pivoting
Patients with no control entries are removed
– about 60 patients
Trend aggregates are calculated

Resulting table layout (one row per patient):
ICO     Entry   Contr1   Contr2   ...   ContrM   Aggr1   ...   AggrN
ICO_1
ICO_2
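The pivoting step can be sketched as below: long rows (one per examination) become one row per patient, and patients without any control examination are dropped. The input layout, the example values, and the function name pivot are hypothetical; the slides do not show how SumatraTT performed this step.

```python
from collections import defaultdict

# Hypothetical long format: (ICO, examination index, BMI value).
# Index 0 stands for the entry examination, 1..M for the controls.
long_rows = [
    (1, 0, 27.1), (1, 1, 27.4), (1, 2, 28.0),
    (2, 0, 24.3),                    # no control examination
    (3, 1, 25.2), (3, 0, 25.0),      # rows may arrive unordered
]


def pivot(rows):
    """Return {ICO: [entry, contr1, ...]}, dropping patients with no controls."""
    by_patient = defaultdict(dict)
    for ico, exam_index, value in rows:
        by_patient[ico][exam_index] = value
    return {
        ico: [value for _, value in sorted(exams.items())]
        for ico, exams in by_patient.items()
        if len(exams) > 1
    }


wide = pivot(long_rows)
print(wide)
```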

Derived trend attributes

[Figure: observed variable y plotted against x (decimal time ~ year + 1/12 month), with the referential time (1975) marked.]
Derived attributes:
– Intercept
– Gradient
– Correlation coefficient
– Standard deviation
– Mean
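The five trend attributes can be computed with an ordinary least-squares line, as in the sketch below. It assumes that time is measured in decimal years relative to the referential year 1975, so that the intercept is the fitted value at 1975, and that the standard deviation is the sample standard deviation of y; the slides do not confirm either detail. The function names and the example readings are hypothetical.

```python
import math

REFERENCE_YEAR = 1975  # referential time from the slide


def decimal_time(year, month):
    """Decimal time (year + month / 12), shifted so that 1975 maps to 0."""
    return year + month / 12 - REFERENCE_YEAR


def trend_aggregates(times, values):
    """Least-squares line through (times, values) plus simple summary statistics."""
    n = len(values)
    mean_x = sum(times) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in times)
    syy = sum((y - mean_y) ** 2 for y in values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(times, values))
    gradient = sxy / sxx
    return {
        "intercept": mean_y - gradient * mean_x,
        "gradient": gradient,
        "correlation": sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0,
        "std": math.sqrt(syy / (n - 1)),  # sample standard deviation (assumption)
        "mean": mean_y,
    }


# Invented diastolic pressure readings at three examinations.
times = [decimal_time(1976, 6), decimal_time(1978, 6), decimal_time(1980, 6)]
aggregates = trend_aggregates(times, [80.0, 84.0, 88.0])
print(aggregates)
```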

Global Approach - results

The derived aggregates were discretized
– e.g., the gradient can be strongly decreasing, decreasing, constant, increasing, strongly increasing
Chi-square test for independence with respect to CVD
A large number of aggregates proved to be significant, including gradients (chi-square test, p = 0.05)
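Discretizing a gradient into the five categories and cross-tabulating it with CVD can be sketched as follows. The cut points in CUTS are purely hypothetical, since the slides do not give the thresholds that were used, and the patient values are invented. Intervals are closed on the right, matching the (a,b] notation used elsewhere in the slides.

```python
from bisect import bisect_left
from collections import Counter

LABELS = (
    "strongly decreasing",
    "decreasing",
    "constant",
    "increasing",
    "strongly increasing",
)
CUTS = (-1.0, -0.1, 0.1, 1.0)  # hypothetical thresholds, right-closed intervals


def discretize(gradient):
    """Map a gradient to one of the five trend categories."""
    return LABELS[bisect_left(CUTS, gradient)]


# Invented (gradient, CVD) pairs.
patients = [(-1.4, False), (0.05, False), (0.3, True), (2.0, True), (0.0, False)]
table = Counter((discretize(g), cvd) for g, cvd in patients)
for (category, cvd), count in sorted(table.items()):
    print(category, cvd, count)
```

The resulting counts form the contingency table on which a chi-square test of independence is run.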

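[Figure: TRIGL_MG_Grad pie charts by CVD group. Legend: strongly decreasing, decreasing, constant, increasing, strongly increasing. The assignment of slices to legend categories is not preserved.]
CVD = false (count, share): 78, 15%; 82, 16%; 128, 24%; 46, 9%; 192, 37%
CVD = true (count, share): 44, 21%; 52, 25%; 48, 23%; 37, 18%; 27, 13%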

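[Figure: TRIGL_MG_Grad pie charts by TRIGLMGCount. Legend: strongly decreasing, decreasing, constant, increasing, strongly increasing. The assignment of slices to legend categories is not preserved.]
TRIGLMGCount <= 5 (count, share): 50, 24%; 9, 4%; 62, 30%; 16, 8%; 73, 35%
TRIGLMGCount (5,13] (count, share): 42, 19%; 48, 22%; 53, 24%; 74, 34%
TRIGLMGCount (13,18] (count, share): 24, 10%; 20, 8%; 79, 32%; 124, 50%
TRIGLMGCount > 18 (count, share): 6, 10%; 17, 28%; 37, 62%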

ControlCount vs. CVD

ControlCount
– number of examinations
– strong relation with CVD
– AUC = 0.35
– higher ControlCount goes with lower CVD risk (the AUC is below 0.5)
– anachronistic attribute
– introduced by the design of the study
ControlCount influences the trend aggregates, e.g., the steepness of the gradients depends on ControlCount.
Conclusion: the global approach cannot be applied (at least with these aggregates).

Windowing Approach I.

The same risk factors, the same pivoting transformation and similar trend aggregates, but with a constant number of examinations.
Issues:
– window
• time period vs. number of examinations
• 5 examinations are enough to express a trend
– patients : records (1 : ControlCount – 3)
• the entry examination is used as the first examination
• records are dependent
– CVD classification
• time from the last examination to CVD
• yes/no (yes = CVD in the next year or CVD in the future)
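The record count of ControlCount - 3 per patient follows from sliding a window of 5 examinations over the entry examination plus ControlCount controls: (ControlCount + 1) - 5 + 1 = ControlCount - 3. A minimal sketch with invented values (the function name sliding_windows is hypothetical):

```python
WINDOW_SIZE = 5  # examinations per window, as stated on the slide


def sliding_windows(entry_value, control_values, size=WINDOW_SIZE):
    """Entry is the first examination; each window yields one record."""
    series = [entry_value] + list(control_values)
    return [series[i:i + size] for i in range(len(series) - size + 1)]


# Invented systolic pressure values: one entry reading and 7 controls.
controls = [128, 131, 135, 133, 138, 140, 144]
records = sliding_windows(130, controls)
print(len(records))  # ControlCount - 3
print(records[0], records[-1])
```

Consecutive windows share four of their five examinations, which is why the records of one patient are dependent.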

[Figure: Windowing Approach I. Labels: Data, Entry, First vector, New vector.]

Aggregate tests

Trend aggregates are approximately normally distributed in both of the specified CVD groups
Two groups were selected: CVD never appears in the future (1000) vs. CVD appears at the next exam (1)
T-test for comparison of the group means can be applied (p <= 0.05)
Do the means of the calculated aggregates differ in the different CVD groups?
Just a few of them
– only two gradients are clearly significant
• SYST and DIAST
– two significant intercepts
• TRIGL and CHLST

T-tests; grouping by CVD (Group 1: 1000, Group 2: 1)

Variable      Mean (1000)   Mean (1)   t-value    p
DIASTTrend2   -0.0802       0.5151     -3.04421   0.002345
SYSTTrend2     0.3296       1.1794     -2.69381   0.007088
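A two-sample t-test of this kind can be sketched as below. The pooled-variance form and the normal approximation of the p-value are assumptions made for this illustration (the slides do not say which variant Statistica applied), the sample data are invented, and the function names are hypothetical. For large groups the normal approximation gives p-values close to those in the table.

```python
import math


def pooled_t(sample_a, sample_b):
    """Student's two-sample t statistic with pooled variance, plus degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in sample_a)
    ss_b = sum((x - mean_b) ** 2 for x in sample_b)
    dof = n_a + n_b - 2
    pooled_variance = (ss_a + ss_b) / dof
    t = (mean_a - mean_b) / math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return t, dof


def two_sided_p_normal(t):
    """Two-sided p-value under a normal approximation (reasonable for large dof)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Invented small samples, only to show the call.
t_small, dof_small = pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(round(t_small, 4), dof_small)

# The t-values from the table, pushed through the normal approximation.
p_diast = two_sided_p_normal(-3.04421)
p_syst = two_sided_p_normal(-2.69381)
print(round(p_diast, 4), round(p_syst, 4))
```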

Further tests of SYST, DIAST

Try to test the gradients for all the CVD groups, not only the two extreme groups
Repeated ANOVA can be applied
– development of SYST/DIAST trend for different CVD groups

[Figure: bar chart of record counts categorized by time from the last control to CVD attack. x-axis: Time_from_last_mod (bin edges 1, 2, 3, 4, 5, 10, 999); y-axis: No of obs (0 to 6000).]

[Figure: Repeated ANOVA, DIAST. Healthy group (CVD = 1000) vs. group getting ill at the next exam (CVD = 1). x-axis: R1 (DIAST1 to DIAST5); y-axis: DV_1 (81 to 89); series: CVD_categ 1 and CVD_categ 1000. Vertical bars denote 0.95 confidence intervals.]

[Figure: Repeated ANOVA, DIAST. Getting ill after 5 exams (CVD = 5) vs. getting ill at the next exam (CVD = 1). x-axis: R1 (DIAST1 to DIAST5); y-axis: DV_1 (80 to 89).]

DIASTTrend vs Time_to_CVD

[Figure: pie charts of discretized DIASTTrend, one panel per Time_from_last_mod bin. Six panel titles between (1,2] and > 999 did not survive; their values are listed in their original order.]

Time_from_last_mod   <= -0.5   (-0.5,0]   (0,0.5]   > 0.5
<= 1                 13%       33%        30%       25%
(1,2]                18%       31%        30%       21%
(unlabeled panel)    22%       31%        33%       13%
(unlabeled panel)    30%       41%        24%       5%
(unlabeled panel)    24%       43%        25%       8%
(unlabeled panel)    19%       42%        25%       15%
(unlabeled panel)    13%       42%        30%       14%
> 999                17%       41%        28%       14%

Windowing Approach II.

There are missing values of risk factors
Windowing I.
– skips missing values
– different numbers of rows are generated for different factors
Windowing II.
– replaces the missing values
– the same number of rows is generated for different factors
– enables multivariate analysis
• combination of different aggregates and their relation with CVD
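The difference between the two windowing variants can be sketched as follows. The slides do not say how Windowing II replaces missing values; carrying the last observation forward is used here purely as a stand-in assumption, and the data and function names are invented.

```python
def fill_forward(series):
    """Replace None by the last observed value (assumed strategy, not from the slides)."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # a leading None stays None
    return filled


def count_windows(series, size=5):
    """Number of sliding windows of the given size over the series."""
    return max(len(series) - size + 1, 0)


# Invented series for one patient; None marks a missing measurement.
syst = [130, None, 135, 140, None, 138, 142]
bmi = [27.0, 27.5, 27.3, 28.0, 28.2, 28.1, 28.4]

# Windowing I: missing values are skipped, so factors yield different row counts.
rows_syst_skip = count_windows([v for v in syst if v is not None])
rows_bmi = count_windows(bmi)

# Windowing II: missing values are replaced, so row counts agree across factors.
rows_syst_filled = count_windows(fill_forward(syst))
print(rows_syst_skip, rows_bmi, rows_syst_filled)
```

Equal row counts let the aggregates of different factors be joined row by row, which is what makes the multivariate analysis possible.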

[Figure: Windowing II. Labels: Data, Entry, First vector, New vector.]

[Figure: Factorial ANOVA, BMITrend. Multiple effects: CVD x BMI risk. x-axis: OBEZRISK (0, 1); y-axis: BMITrend (-0.10 to 0.25). Annotation: 27 patients only!]

Conclusions

The main scope
– AQ no.7: Are there any differences in development of risk factors for different CVD groups?
Contributions
– Pitfalls of the global approach revealed
– Using windowing, differences proved for SYST and DIAST blood pressures
– Other assumptions and ideas:
• interesting course of development of risk factors (DIAST decreases first, then increases, and CVD appears)
• other trends may have influence under specific conditions (BMITrend and overweight, etc.)
