Upload
dangtuyen
View
225
Download
0
Embed Size (px)
Citation preview
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
Alexander Kowarik,Matthias Templ
Statistik AustriaApril, 2014
NEW FEATURES OF VIM -VISUALIZATION ANDIMPUTATION OF MISSINGVALUES
Kowarik, Templ (www.statistik.at) 1 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
Implemented Imputation Methods
I Hot-deck Imputation
I k Nearest Neighbour Imputation
I Regression Imputation
I Iterative Robust Model-Based Imputation
Kowarik, Templ (www.statistik.at) 2 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
Hot-deck Imputation
old fashioned, but fast ⇒ new implementation based on data.table soon
hotdeck(data , variable = NULL , ord_var = NULL ,
domain_var = NULL , makeNA = NULL , NAcond = NULL ,
impNA = TRUE , donorcond = NULL , imp_var = TRUE ,
imp_suffix = "imp")
I data - data.frame
I variable - variables to be imputed
I ord var - variables for sorting
I domain var - variables for buildin imputation classes
Kowarik, Templ (www.statistik.at) 3 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
kNN
kNN(data , variable = colnames(data), metric = NULL ,
k = 5, dist_var = colnames(data), weights = NULL ,
numFun = median , catFun = maxCat , makeNA = NULL ,
NAcond = NULL , impNA = TRUE , donorcond = NULL ,
mixed = vector(), mixed.constant = NULL ,
trace = FALSE , imp_var = TRUE , imp_suffix = "imp",
addRandom = FALSE)
I dist var - variables to be used for distance computation
I weights - weights for the distance computation
I numFun, catFun - funtion for aggregating numeric or categoricalvariables (sampleCat, maxCat).
I addRandom - add a random value to the distance computation(with very low weight)
Kowarik, Templ (www.statistik.at) 4 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
Regression Imputation
Modeldefinition pro Variable
regressionImp(formula , data , family = "AUTO",
robust = FALSE , imp_var = TRUE , imp_suffix = "imp",
mod_cat = FALSE)
I formula,data - model formula and data frame
I robust - robust or classical regression
I family - default value “Auto” tries t guess the right family
I mod cat - TRUE=category with highest probability, FALSE=sample
Kowarik, Templ (www.statistik.at) 5 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
Iterative Robust Model-Based Imputation
1. Initialization (Default: kNN)
2. Cycle through all variables with missing values and impute by amodel with all other variables as predictors
3. Repeat 2. until convergence is reached
Kowarik, Templ (www.statistik.at) 6 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
Iterative Robust Model-Based Imputation
I Robust Models
I variable type ⇒ automatic model selection
I semi-continuous variables are modeled in two steps
I stepwise variable selection possible
I multiple imputation possible
I ⇒ fully automatic approach
I future improvement ⇒ define model formula per variable
Kowarik, Templ (www.statistik.at) 7 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
Iterative Robust Model-Based Imputation
irmi(x, eps = 5, maxit = 100, mixed = NULL ,
mixed.constant = NULL , count = NULL , step = FALSE ,
robust = FALSE , takeAll = TRUE , noise = TRUE ,
noise.factor = 1, force = FALSE , robMethod = "MM",
force.mixed = TRUE , mi = 1, addMixedFactors = FALSE ,
trace = FALSE , init.method = "kNN")
I robust - robust or classical regression
I step - stepAIC in each iteration
I mixed - column indiced of semi-continuous variables
I count - column indiced of count variables(Poisson)
Kowarik, Templ (www.statistik.at) 8 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
Visualization of Missing Values
I Simple plots to visualize the distribution of missingnessP
ropo
rtio
n of
mis
sing
s
0.00
0.02
0.04
0.06
0.08
0.10
py01
0n
py03
5n
py05
0n
py07
0n
py08
0n
py09
0n
py10
0n
py11
0n
py12
0n
py13
0n
py14
0n
Com
bina
tions
py01
0n
py03
5n
py05
0n
py07
0n
py08
0n
py09
0n
py10
0n
py11
0n
py12
0n
py13
0n
py14
0n
3827425119534832231311119865443222222211111111111
Kowarik, Templ (www.statistik.at) 9 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
Visualization of Missing Values II
I Multivariate plotsag
e
P03
3000
py01
0n
py03
5n
Kowarik, Templ (www.statistik.at) 10 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
VIMGUI
I Graphical user interface for imputation and visualization
I compatibel with survey-objects
I R-Script (reproducibility)
Kowarik, Templ (www.statistik.at) 11 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
VIMGUI
Kowarik, Templ (www.statistik.at) 12 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
VIMGUI
Kowarik, Templ (www.statistik.at) 13 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
VIMGUI
Kowarik, Templ (www.statistik.at) 14 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
VIMGUI
Kowarik, Templ (www.statistik.at) 15 / 16 | April 2014
S T A T I S T I K A U S T R I A
D i e I n f o r m a t i o n s m a n a g e r
VIMGUI
Kowarik, Templ (www.statistik.at) 16 / 16 | April 2014