MISSING VALUES
What proportion? Large or small?
Random or non-random?
What pattern do the missing values follow?

A simple solution: discard the "subjects" with incomplete responses ("available-case" or "complete-case" analysis). Drawbacks:
Inefficient use of the information
Complete cases may be very different from the others
Generalisation? Representativeness?

Classification
Example: 2 variables, Y = income, X = age

Missing Completely At Random (MCAR): the missing data are a representative sample of the complete dataset.
The probability that income is recorded is the same for all individuals: MCAR.

Missing At Random (MAR): the probability that a value is missing depends on the values of the measured variables.
The probability that income is recorded depends on the respondents' age but does not vary with their income within age groups: MAR.

Missing Not At Random (MNAR): the occurrence of missing values for a variable depends on the true but unobserved value of that variable.
If the probability that income is recorded also varies with income within age groups: MNAR.

MCAR and MAR = "ignorable" missingness; MNAR = "non-ignorable" missingness.
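
The three mechanisms can be illustrated with a small simulation of the income/age example. This is only a sketch: the population model, the missingness probabilities and the helper names (observe, mean_income) are assumptions made for illustration, not part of the slides.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: income rises with age, plus noise.
people = []
for _ in range(20000):
    age = rng.randint(20, 70)
    income = 1000 + 40 * age + rng.gauss(0, 500)
    people.append((age, income))

def observe(records, p_missing):
    """Keep the records whose income is recorded.

    p_missing(age, income) is the probability that income is NOT recorded.
    """
    return [(a, y) for a, y in records if rng.random() >= p_missing(a, y)]

def mean_income(records, min_age=20):
    """Mean recorded income among people aged min_age or more."""
    return mean(y for a, y in records if a >= min_age)

mcar = observe(people, lambda a, y: 0.3)                      # same for everyone
mar = observe(people, lambda a, y: 0.5 if a >= 45 else 0.1)   # depends on age only
mnar = observe(people, lambda a, y: 0.5 if y >= 2800 else 0.1)  # depends on income itself

for name, data in [("complete", people), ("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name:8s} overall mean {mean_income(data):7.0f}   "
          f"mean for age >= 45 {mean_income(data, 45):7.0f}")
```

Under MAR the overall mean of the recorded incomes is biased, but the mean within an age group is not, so conditioning on age repairs the estimate. Under MNAR the bias remains even within the age group.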

Methods of analysis
Two main types of approach:
Imputation
Likelihood-based ("Expectation-Maximization" algorithm): maximum-likelihood estimation of parameters from the incomplete data.

Main difference between the two approaches:
Imputation fills in the missing values.
The likelihood-based approach does not estimate the missing values explicitly but requires specifying a model, and software is less readily available for some analyses.
With large samples, the two methods give similar results; with small samples, is MI superior?

IMPUTATION:
Single imputation
Multiple imputation

Single imputation
Value based on prior knowledge
Mean of the available observations for other subjects with identical characteristics
Values predicted by regression or stochastic regression (missing values replaced by predicted values + residuals, to reflect the uncertainty in the predicted value)
Hot deck: imputed value drawn from the distribution estimated for each missing value
Cold deck: replace a missing value with a constant value from an external source (e.g. an earlier study)
Longitudinal study: last observation carried forward (LOCF)
Substitution: replace selected units with others not selected in the sample (experimental stage)
…
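
Three of the single-imputation methods listed above can be sketched in a few lines. The function names are hypothetical, missing values are represented by None, and the stochastic variant draws its residual from the observed residuals, which is one possible choice among several.

```python
import random
from statistics import mean

def impute_mean(values):
    """Replace each missing value (None) by the mean of the observed values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

def impute_regression(x, y, stochastic=False, rng=None):
    """Impute missing y from a least-squares line fitted on the complete cases.

    With stochastic=True, a residual drawn at random from the observed
    residuals is added to each predicted value.
    """
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = mean(xi for xi, _ in pairs)
    my = mean(yi for _, yi in pairs)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
             / sum((xi - mx) ** 2 for xi, _ in pairs))
    intercept = my - slope * mx
    residuals = [yi - (intercept + slope * xi) for xi, yi in pairs]
    rng = rng or random.Random(0)
    completed = []
    for xi, yi in zip(x, y):
        if yi is None:
            noise = rng.choice(residuals) if stochastic else 0.0
            yi = intercept + slope * xi + noise
        completed.append(yi)
    return completed

def locf(series):
    """Last observation carried forward; leading missing values stay None."""
    completed, last = [], None
    for v in series:
        if v is not None:
            last = v
        completed.append(last)
    return completed

print(impute_mean([1, None, 3]))
print(impute_regression([1, 2, 3, 4], [2, 4, None, 8]))
print(locf([None, 1, None, None, 4, None]))
```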

Single imputation: problems
Prior knowledge: acceptable if the number of missing values is small and the researcher is experienced.
Analysing the completed database as if the added values were real measurements ignores the uncertainty due to the imputation process.
Standard errors are generally underestimated.
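
A minimal sketch of the standard-error problem, assuming simulated data with about 40% of the values deleted at random: after mean imputation the filled-in values add no spread but inflate the sample size, so the naive standard error of the mean is smaller than the complete-case one. The data and variable names are illustrative only.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)
complete = [rng.gauss(50, 10) for _ in range(200)]
# Delete roughly 40% of the values completely at random.
observed = [v if rng.random() > 0.4 else None for v in complete]

cases = [v for v in observed if v is not None]                 # complete cases
filled = [mean(cases) if v is None else v for v in observed]   # mean imputation

def standard_error(values):
    """Naive standard error of the mean, treating every value as a real measurement."""
    return stdev(values) / len(values) ** 0.5

se_complete_cases = standard_error(cases)
se_mean_imputed = standard_error(filled)
print(f"complete cases: n = {len(cases)}, SE = {se_complete_cases:.3f}")
print(f"mean-imputed:   n = {len(filled)}, SE = {se_mean_imputed:.3f}")
```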

Multiple imputation (MI)
Does not add a single set of values: several "complete" datasets are analysed.
Simulations: M = 3 repeated imputations is sufficient with 20% missing.
Unless the percentage of missing values is very large, there is little benefit beyond 10 imputations; 5 imputations are recommended.
Adjusts the statistics to account for the uncertainty due to imputation.
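
The adjustment for imputation uncertainty is usually done with Rubin's combining rules, sketched below. The slides do not state which rules they have in mind, so this is an assumption; the function names are hypothetical. The second function is Rubin's relative-efficiency formula, where gamma is the fraction of missing information (not necessarily the same as the percentage of missing values).

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine M completed-data estimates and their variances (Rubin's rules).

    Returns the pooled estimate and its total variance
    T = W + (1 + 1/M) * B, with W the mean within-imputation variance
    and B the between-imputation variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # sample variance, divisor M - 1
    total = within + (1 + 1 / m) * between
    return pooled, total

def relative_efficiency(gamma, m):
    """Efficiency of M imputations relative to infinitely many imputations."""
    return 1 / (1 + gamma / m)

pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total)
for m in (3, 5, 10):
    print(m, round(relative_efficiency(0.2, m), 4))
```

With gamma = 0.2, three imputations already reach about 94% efficiency, which is consistent with the rule of thumb above that a small M is enough when little information is missing.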

Remark
The methods chosen to handle missing values in clinical trials affect sample size calculations.

Some situations
Analyses with classical models
Clinical trials
Longitudinal studies
…

Example 1
Developing a prognostic model in the presence of missing data: an ovarian cancer case study
Taane G. Clark*, Douglas G. Altman
Journal of Clinical Epidemiology 56 (2003) 28–37
Missing values for 8 of the 10 potential prognostic factors: 2-43%
Survival times known

Example 1 - steps of the procedure
1. Investigating the missing data
a. Quantifying the multivariate patterns of the missing data.
b. Plotting the proportion of missing data for each potential prognostic factor against diagnosis year to show time trends in measurement practice.
c. Exploring the relationship between missing data of potential prognostic factors with other prognostic variables, survival information [i.e., (log) survival time and the censoring indicator], and auxiliary variables.
2. Specifying an imputation model.
3. Using the model to generate (via a random sampling procedure) M sets of imputed values for the missing data points, thus creating M completed datasets.
4. For each completed dataset, carrying out a Cox regression, obtaining the estimate of interest and its estimated variance.
5. Combining the results from the different datasets to obtain a prognostic model.
6. Constructing a final "completed data" model (Model 2) by removing the covariate with the highest P-value and repeating steps 4 and 5 until all remaining covariates were significant at a 5% level (backward elimination).
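
Step 1a (quantifying the multivariate patterns of missing data) can be sketched as a simple tabulation. The records, the variable names and the function name below are made up for illustration; they are not the study data.

```python
from collections import Counter

def missing_patterns(rows, variables):
    """Count how many records share each combination of missing variables.

    rows is a list of dicts; a value of None (or an absent key) means missing.
    Returns (pattern, count) pairs, most frequent first, where pattern is the
    tuple of variables that are missing together.
    """
    counts = Counter(
        tuple(v for v in variables if row.get(v) is None) for row in rows
    )
    return counts.most_common()

# Hypothetical records with three prognostic factors.
patients = [
    {"grade": "II", "ascites": 1, "ca125": 5.2},
    {"grade": None, "ascites": 1, "ca125": None},
    {"grade": "III", "ascites": 0, "ca125": None},
    {"grade": "I", "ascites": 0, "ca125": 4.8},
    {"grade": "III", "ascites": None, "ca125": None},
    {"grade": "II", "ascites": 1, "ca125": None},
]

for pattern, count in missing_patterns(patients, ["grade", "ascites", "ca125"]):
    label = ", ".join(pattern) if pattern else "(complete)"
    print(f"{count:3d}  {label}")
```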

Example 1
Step 1: missing data = MAR
Steps 2 and 3 = Bayesian simulation
Step 3: number of repeated imputations = 10

Example 1 - Step 1 - Missing-data pattern
Prognostic variable       N (%)
Grade
  Unknown                 139 (11.7)
Ascites
  Presence                707 (59.5)
  Absence                 417 (35.1)
  Unknown                  65 (5.5)
Alkaline phosphatase      793 (66.7)

Example 1 - Step 1 - Missing-data pattern
The number of patients contributing to a complete case analysis using all the prognostic factors would be 358 (245 deaths).
Plots of the proportion of missing data by diagnosis year show that the proportions for ascites, alkaline phosphatase, albumin, grade, and residual disease were constant.
The proportion of missing CA125 data decreased linearly in time from 85 to 21% between 1984 and 1999.
The proportion of missing performance status had an increasing trend in time with a minimum of 18% in 1986 and a maximum of 71% in 1995.
An analysis of the survival distributions of non-missing and missing strata within each of the factors (log) CA125, grade, FIGO stage, and performance status showed no visual or statistical evidence of significant differences.

Example 1 - Step 1 - Evidence of MAR data
Difference between the survival distributions of patients with and without missing data for ascites (P .002), albumin (P .003), alkaline phosphatase (P .020) and residual disease (P .020).
Those patients missing albumin and alkaline phosphatase results had a better prognosis, suggesting that eliminating the patients with missing values would lead to an underestimate of the true survival of the cohort. The opposite effect was seen for ascites and residual disease.
The univariate logistic models indicated that histology and clinical trial participation were associated with the missingness of all but one prognostic variable.
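
For a binary covariate, a univariate logistic model of the missingness indicator gives the same odds ratio as the 2x2 table of "missing / observed" by covariate level, so the association can be sketched without a regression library. The data and the function name are hypothetical, and the confidence interval uses the usual Woolf (log odds ratio) approximation.

```python
import math

def missingness_odds_ratio(values, group):
    """Odds ratio of 'value is missing' for group == 1 versus group == 0.

    values: measurements, None when missing. group: 0/1 covariate
    (for example, clinical trial participation). All four cells of the
    2x2 table must be non-empty.
    """
    a = sum(1 for v, g in zip(values, group) if v is None and g == 1)
    b = sum(1 for v, g in zip(values, group) if v is not None and g == 1)
    c = sum(1 for v, g in zip(values, group) if v is None and g == 0)
    d = sum(1 for v, g in zip(values, group) if v is not None and g == 0)
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - 1.96 * se_log)
    high = math.exp(math.log(odds_ratio) + 1.96 * se_log)
    return odds_ratio, (low, high)

# Hypothetical data: the measurement is missing more often when group == 1.
values = [None, None, None, 5.1, None, 4.2, 3.9, 4.8]
group = [1, 1, 1, 1, 0, 0, 0, 0]
odds_ratio, interval = missingness_odds_ratio(values, group)
print(odds_ratio, interval)
```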

Example 1 - Steps 2 to 5 - Imputation
We completed 10 data sets by imputing 2,045 values in each. As a consequence, 6,265 additional real data values were incorporated into each dataset.

Example 1 - Step 2 - Imputation model
For binary variables (e.g., the presence or absence of ascites) we used a logistic model.
For categorical variables with three or more ordered levels (e.g., performance status) we applied a polytomous (>2 levels) logistic model.
For continuous variables (e.g., log CA125) we used normal linear regression truncated where appropriate to the credible range of values.

Example 1 - Steps 2 to 5 - Imputation
The prevalences (%) of categorical prognostic factors in the original data (ignoring missing data) were consistent with those from the 10 imputations.

Example 1 - Steps 2 to 5 - Imputation
                     Original          Completed (a)
Prognostic factor    #      %          Median   Range      Overall %
Grade
  I                  131    12.5       149      144–153    12.5
  II                 278    26.5       315      310–321    26.5
  III                641    61.0       724      716–732    60.9
  Unknown            139    0          -        -          -
Ascites
  Presence           707    62.9       750      747–752    63.0
  Absence            417    37.1       440      437–442    37.0
  Unknown             65    0          -        -          -
(a) 10 datasets with original data augmented by imputed missing values.

Example 1 - Steps 2 to 5 - Imputation
The median and range of albumin, log CA125, and alkaline phosphatase in the original data were consistent with the median of the medians of the 10 imputation distributions and the extreme values of these distributions, respectively.

                     Original                    Completed
Prognostic factor    Median    Range             Median   Range
Log CA125            (5.34)    (1.79–10.04)      5.16     1.79–10.04
Albumin              (39.0)    (20.0–50.0)       39.0     20.0–50.0
Log Alk. Phos.       (4.54)    (3.26–7.50)       4.54     3.26–7.50

Example 1 - Steps 2 to 5 - Imputation
The narrow ranges of imputation values for each potential prognostic variable coincide with the visual impression that the distributions for each of the potential prognostic factors in the 10 imputed datasets were similar.

Example 1 - Step 6 - Fitting the Cox models
Model 1: as four factors, each with missing values, were found not to be prognostic, the analysable dataset was 518 (380 deaths).
Model 2: pooled analysis using 10 complete datasets with imputed missing values.
Grade and ascites were statistically significant in Model 2, but not in Model 1.
A complete case analysis based on Model 2 would include only 449 patients (319 deaths).
The confidence limits are narrower in the augmented data, especially for those with fewer missing observations in the original dataset.
The models applied to completed data (i.e., the 10 datasets with imputed missing values) had better calibration (i.e., greater ability to produce unbiased estimates of outcome) and superior discrimination (i.e., improved ability to provide accurate predictions for individual patients).
There was little difference between the discrimination measures of Model 1 and Model 2 when applied to the completed data.

Example 1 - Conclusion
Most data are multivariate in nature, so a small proportion of missing data for several variables can lead to a severely depleted complete case analysis.
MI seems appropriate in this setting if the original dataset is not too small.
Using imputed data we are incorporating patients that are removed merely because one or more of their prognostic factors are missing and, as a result, increasing power and adding precision to an analysis.
Our approach may be viewed as a sensitivity analysis, and ultimately we need to use judgement about the plausibility of assumptions in a particular situation to assess which is the primary analysis.

Example 2: a longitudinal study
Attrition in longitudinal studies: How to deal with missing data
Jos Twisk*, Wieke de Vente
Journal of Clinical Epidemiology 55 (2002) 329–337

Example 2 - Conclusion
When MANOVA for repeated measurements is used to analyze a longitudinal dataset with missing data, imputation methods to replace these missing data are highly recommendable (because MANOVA, as implemented in the software used (SPSS), uses listwise deletion of cases with a missing value).
When GEE is used to analyze a longitudinal dataset with missing data, not imputing at all may be better than any of the imputation methods applied.
If one chooses to impute missing values, longitudinal methods are generally preferred above cross-sectional methods.
Using the more refined multiple imputation method to impute missing values did not lead to different point estimates than the single imputation techniques.
The estimated standard errors were higher than the ones obtained from the complete dataset, which seems to be theoretically justified, because they reflect uncertainty in estimation caused by missing values.

Example 2 - Limitations
Specific observational longitudinal dataset
Four missing data scenarios
Limited number of imputation techniques
Missingness dependent on the outcome variable
Two statistical methods
Less advanced multiple imputation estimation procedures

Example 3 - A clinical trial
Excerpt from "Multiple Imputation: a primer".
JL Schafer
Statistical Methods in Medical Research, 1999; 8 (1) 3-15

Software
Routines for STATA
http://www.stat.harvard.edu/~barnard/
S-PLUS
SAS
NORM (free on the Internet; Schafer, 1999)
SOLAS™ for Missing Data Analysis and Multiple Imputation: http://www.statsol.ie/solas/solas.htm

And SPSS?
MVA module
Patterns of missing values
Substitution methods: Regression, EM

Univariate Statistics
                                             Missing            No. of Extremes (a)
Variable     N      Mean     Std. Deviation  Count   Percent    Low    High
poidbebe     2081   2.9353   .67960            35      1.7       46     29
hemog1       1302   6.8703   1.04820          814     38.5      109     22
agem         1995   25.06    7.115            121      5.7        0     13
perbg        1939   24.486   2.5238           177      8.4       41     27
baude        1916   17.062   1.6391           200      9.5        5     39
parite       2083   2.66     2.950             33      1.6        0     48
gead         1252                             864     40.8
gretum       2080                              36      1.7
a. Number of cases outside the range (Q1 - 1.5*IQR, Q3 + 1.5*IQR).

Separate Variance t Tests (a)
Indicator   Statistic        poidbebe   hemog1    agem     perbg    baude    parite
hemog1      t                -1.8       .         -3.9     2.9      -19.3    -4.2
            df               1598.7     .         1301.1   1240.3   949.7    1600.5
            # Present        1302       1302      1301     1297     1300     1302
            # Missing        779        0         694      642      616      781
            Mean(Present)    2.9145     6.8703    24.60    24.605   16.568   2.45
            Mean(Missing)    2.9699     .         25.93    24.246   18.106   3.01
agem        t                -2.5       .         .        2.3      -11.9    -2.7
            df               120.8      .         .        73.0     68.3     123.3
            # Present        1968       1301      1995     1870     1854     1971
            # Missing        113        1         0        69       62       112
            Mean(Present)    2.9246     6.8694    25.06    24.511   17.001   2.62
            Mean(Missing)    3.1212     8.0000    .        23.794   18.903   3.40
perbg       t                -1.9       -2.2      -1.8     .        -6.3     -1.9
            df               173.8      4.1       139.0    .        32.4     180.7
            # Present        1922       1297      1870     1939     1883     1926
            # Missing        159        5         125      0        33       157
            Mean(Present)    2.9252     6.8682    24.98    24.486   17.012   2.63
            Mean(Missing)    3.0569     7.4000    26.22    .        19.909   3.11
baude       t                -2.0       -1.5      -2.6     .2       .        -2.8
            df               206.5      1.0       159.1    55.9     .        212.4
            # Present        1900       1300      1854     1883     1916     1904
            # Missing        181        2         141      56       0        179
            Mean(Present)    2.9247     6.8697    24.94    24.489   17.062   2.61
            Mean(Missing)    3.0457     7.2500    26.67    24.373   .        3.25
gead        t                -2.1       -.5       -4.3     2.1      -18.5    -4.4
            df               1717.9     86.5      1449.5   1387.7   1073.0   1748.4
            # Present        1252       1226      1250     1245     1248     1251
            # Missing        829        76        745      694      668      832
            Mean(Present)    2.9094     6.8667    24.52    24.578   16.560   2.43
            Mean(Missing)    2.9744     6.9276    25.96    24.320   17.999   3.01
For each quantitative variable, pairs of groups are formed by indicator variables (present, missing).
a. Indicator variables with less than 5% missing are not displayed.
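
Each cell of the separate-variance table compares the mean of one variable between the cases where an indicator variable is present and those where it is missing. A minimal sketch of that computation, using the standard separate-variance (Welch) t statistic and its approximate degrees of freedom, is given below. The function name and the toy data are assumptions, and this is not SPSS's own code.

```python
from statistics import mean, variance

def separate_variance_t(target, indicator):
    """Welch t test of `target` between cases with `indicator` present vs missing.

    Both arguments are equal-length lists; None marks a missing value.
    Returns (t, df), with t computed as mean(present) - mean(missing).
    """
    present = [t for t, i in zip(target, indicator)
               if t is not None and i is not None]
    missing = [t for t, i in zip(target, indicator)
               if t is not None and i is None]
    v1 = variance(present) / len(present)   # squared standard error, group 1
    v2 = variance(missing) / len(missing)   # squared standard error, group 2
    t = (mean(present) - mean(missing)) / (v1 + v2) ** 0.5
    df = (v1 + v2) ** 2 / (
        v1 ** 2 / (len(present) - 1) + v2 ** 2 / (len(missing) - 1)
    )
    return t, df

# Toy data: the indicator variable is missing for the last four cases.
target = [1, 2, 3, 4, 2, 4, 6, 8]
indicator = [1, 1, 1, 1, None, None, None, None]
t_stat, df = separate_variance_t(target, indicator)
print(round(t_stat, 3), round(df, 2))
```

A large absolute t suggests that missingness of the indicator variable is related to the target variable, which argues against MCAR.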

gead
                                              Missing
Variable   Statistic             Total   0       1       SysMis
hemog1     Present: Count        1302    821     405     76
           Present: Percent      61.5    97.0    99.8    8.8
           Missing: % SysMis     38.5    3.0     .2      91.2
agem       Present: Count        1995    844     406     745
           Present: Percent      94.3    99.8    100.0   86.2
           Missing: % SysMis     5.7     .2      .0      13.8
perbg      Present: Count        1939    840     405     694
           Present: Percent      91.6    99.3    99.8    80.3
           Missing: % SysMis     8.4     .7      .2      19.7
baude      Present: Count        1916    842     406     668
           Present: Percent      90.5    99.5    100.0   77.3
           Missing: % SysMis     9.4     .4      .0      22.7
           Missing: % 60.0       .0      .1      .0      .0
Indicator variables with less than 5% missing are not displayed.

gretum
                                                Missing
Variable   Statistic             Total   .00     1.00    SysMis
hemog1     Present: Count        1302    629     673     0
           Present: Percent      61.5    58.1    67.4    .0
           Missing: % SysMis     38.5    41.9    32.6    100.0
agem       Present: Count        1995    1001    977     17
           Present: Percent      94.3    92.5    97.9    47.2
           Missing: % SysMis     5.7     7.5     2.1     52.8
perbg      Present: Count        1939    974     949     16
           Present: Percent      91.6    90.0    95.1    44.4
           Missing: % SysMis     8.4     10.0    4.9     55.6
baude      Present: Count        1916    954     950     12
           Present: Percent      90.5    88.2    95.2    33.3
           Missing: % SysMis     9.4     11.7    4.8     66.7
           Missing: % 60.0       .0      .1      .0      .0
gead       Present: Count        1252    612     639     1
           Present: Percent      59.2    56.6    64.0    2.8
           Missing: % SysMis     40.8    43.4    36.0    97.2
Indicator variables with less than 5% missing are not displayed.

Percent Mismatch of Indicator Variables (a, b)
           agem     perbg    baude    hemog1   gead
agem       5.72
perbg      9.17     8.36
baude      9.59     4.21     9.45
hemog1     32.84    30.58    29.21    38.47
gead       35.30    33.13    31.76    4.82     40.83
The diagonal elements are the percentages missing, and the off-diagonal elements are the mismatch percentages of indicator variables.
a. Variables are sorted on missing patterns.
b. Indicator variables with less than 5% missing values are not displayed.

Tabulated Patterns
Pattern columns (a): parite, gretum, poidbebe, agem, perbg, baude, hemog1, gead
Number of cases   Missing patterns (a)   Complete if ... (b)
1220                                     1220
22                X                      1242
46                X X X                  1848
483               X X                    1800
75                X                      1295
38                X X X                  1839
70                X X X X                1937
22                X X X                  1826
31                X X X X X              2031
Patterns with less than 1% cases (21 or fewer) are not displayed.
a. Variables are sorted on missing patterns.
b. Number of complete cases if variables missing in that pattern (marked with X) are not used.

EM Means (a)
poidbebe   hemog1   agem    perbg    baude    parite
2.9351     6.8889   25.12   24.489   17.077   2.65
a. Little's MCAR test: Chi-Square = 622.509, DF = 57, Sig. = .000

Descriptive Statistics
Variable             N      Minimum   Maximum   Mean     Std. Deviation
poidbebe             2081   .97       9.04      2.9353   .67960
hemog1               1302   1.50      9.50      6.8703   1.04820
agem                 1995   7         54        25.06    7.115
perbg                1939   10.0      40.1      24.486   2.5238
baude                1916   11.7      30.0      17.062   1.6391
Valid N (listwise)   1295

EM Correlations (a)
           poidbebe   hemog1   agem    perbg   baude   parite
poidbebe   1
hemog1     .050       1
agem       .042       -.039    1
perbg      .118       .014     .114    1
baude      .192       .067     .212    .299    1
parite     .063       -.035    .806    .099    .185    1
a. Little's MCAR test: Chi-Square = 622.509, DF = 57, Sig. = .000

Correlations
                                 agem     perbg    baude    poidbebe   hemog1
agem       Pearson Correlation   1        .118**   .195**   .031       -.046
           Sig. (2-tailed)       .        .000     .000     .165       .095
           N                     1995     1870     1854     1968       1301
perbg      Pearson Correlation   .118**   1        .287**   .111**     .017
           Sig. (2-tailed)       .000     .        .000     .000       .552
           N                     1870     1939     1883     1922       1297
baude      Pearson Correlation   .195**   .287**   1        .204**     .045
           Sig. (2-tailed)       .000     .000     .        .000       .106
           N                     1854     1883     1916     1900       1300
poidbebe   Pearson Correlation   .031     .111**   .204**   1          .076**
           Sig. (2-tailed)       .165     .000     .000     .          .006
           N                     1968     1922     1900     2081       1302
hemog1     Pearson Correlation   -.046    .017     .045     .076**     1
           Sig. (2-tailed)       .095     .552     .106     .006       .
           N                     1301     1297     1300     1302       1302
**. Correlation is significant at the 0.01 level (2-tailed).

EM
Two steps: E = expected values of the missing data; M = estimation of the parameters (correlations) as if the missing values had been filled in.
With SPSS MVA, a multiple imputation can be simulated.
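
The two EM steps can be sketched for the simplest case: two variables assumed bivariate normal, x fully observed and y missing for some cases. This is an illustrative sketch under those assumptions, not the SPSS MVA implementation; the function name and the simulated data are hypothetical.

```python
import random

def em_bivariate(x, y, n_iter=200):
    """EM estimates of means, variances and covariance when some y are None."""
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n
    # Starting values for y: complete-case mean and variance, zero covariance.
    seen = [yi for yi in y if yi is not None]
    mu_y = sum(seen) / len(seen)
    s_yy = sum((yi - mu_y) ** 2 for yi in seen) / len(seen)
    s_xy = 0.0
    for _ in range(n_iter):
        beta = s_xy / s_xx                 # regression slope of y on x
        resid_var = s_yy - beta * s_xy     # variance of y given x
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                # E step: expected y and y^2 given x and current parameters.
                ey = mu_y + beta * (xi - mu_x)
                eyy = ey * ey + resid_var
            else:
                ey, eyy = yi, yi * yi
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M step: update the parameters as if the data were complete.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    corr = s_xy / (s_xx * s_yy) ** 0.5
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx,
            "s_yy": s_yy, "s_xy": s_xy, "corr": corr}

# Simulated data where y is missing more often for large x (MAR).
rng = random.Random(2)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [2 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
y_obs = [yi if rng.random() > (0.6 if xi > 0 else 0.2) else None
         for xi, yi in zip(x, y)]
est = em_bivariate(x, y_obs)
print({k: round(v, 3) for k, v in est.items()})
```

Note that the E step adds the residual variance to the expected square of each missing y. Filling in the predicted value alone would understate the variance, which is the same problem as with deterministic regression imputation.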