Motivation Data Model Simulation studies Real data analysis
Model-based approach for household clustering with mixed scale variables
Luis E. Nieto Barajas
Department of Statistics, ITAM
Trinity College – 22 June 2018
Luis E. Nieto Barajas Household clustering Trinity College – 22 june 2018 1 / 31
Contents
1 Motivation
2 Data
3 Model
4 Simulation studies
5 Real data analysis
Context
SEDESOL: Ministry of Social Development (a government agency)
Aim of SEDESOL: help reduce social backwardness
To fulfil this aim, SEDESOL creates social programmes that target specific needs:
- Social inclusion
- Life insurance for single mothers
- Feeding support
- Pension for elderly people
- Day care centers
- Milk provision
- etc.
Objective
Currently, each existing programme has its own selection rules (mainly based on income)
Objective: Unify the selection procedure by creating clusters of households with similar needs and socio-economic conditions.
CONEVAL
How do we measure poverty in households?
In 2009, CONEVAL proposed a new methodology based on multiple dimensions: income dimension, social deprivations and social cohesion
Multidimensional poverty indicators:
1 income per capita
2 education deprivation
3 access to health services
4 access to social security
5 housing quality
6 access to basic public services
7 access to feeding
ENIGH
Every 2 years the National Institute for Official Statistics (INEGI) carries out a household survey
The survey follows a complex sampling design, i.e., each household has a different selection probability
The survey is nationwide and includes a module on socio-economic conditions
Clusters will be produced based on the 7 multidimensional poverty indicators plus a few other variables
Available data are mixed mode, i.e., continuous, discrete, ordinal and nominal
Challenge: Produce clusters of households using mixed scale data, taking into account the complex sampling design of the survey
Notation
$y_i = (y_{ij})$, $j = 1, \ldots, p$, $i = 1, \ldots, n$, is the multivariate response
$p$ = total number of variables, such that $p = c + o + m$, where
$c$ = # continuous variables,
$o$ = # ordinal variables,
$m$ = # nominal variables
In summary, $y_i' = (y_{i,1}, \ldots, y_{i,c}, y_{i,c+1}, \ldots, y_{i,c+o}, y_{i,c+o+1}, \ldots, y_{i,c+o+m})$
Each $y_i$ of dimension $p$ ⇐⇒ latent $z_i$ of dimension $q$, with $q \geq p$
Latent variables
If $y_{ij}$ is continuous ⇒ $z_{ij} = g_j(y_{ij})$, where $g_j(\cdot)$ is a normalising transformation
If $y_{ij}$ is ordinal with $K_j$ different values $\vartheta_1, \ldots, \vartheta_{K_j}$ ⇒ $z_{ij}$ satisfies $y_{ij} = \vartheta_k$ iff $\gamma_{j,k-1} < z_{ij} \leq \gamma_{j,k}$, where $\gamma_{j,0}, \ldots, \gamma_{j,K_j}$ are fixed thresholds
Note: a binary variable is a special case of an ordinal variable with $K_j = 2$
If $y_{ij}$ is nominal with $L_j$ categories, we need $L_j - 1$ latent variables $z_{i,l}$, for
\[ l = c + o + \sum_{h=c+o+1}^{j-1} (L_h - 1) + 1, \; \ldots, \; c + o + \sum_{h=c+o+1}^{j} (L_h - 1), \]
such that
\[ y_{ij} = \begin{cases} L_j, & \text{if } \max_l (z_{i,l}) < 0 \\ k, & \text{if } z_{i,s} = \max_l (z_{i,l}) \text{ and } z_{i,s} > 0 \end{cases} \]
with $s = c + o + \sum_{h=c+o+1}^{j-1} (L_h - 1) + k$, and $k = 1, \ldots, L_j - 1$
Finally, $q = c + o + \sum_{j=c+o+1}^{c+o+m} (L_j - 1)$
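The mapping from latent to observed variables can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from the BNPMIXcluster package: the function names `ordinal_from_latent`, `nominal_from_latents` and `latent_dimension` are hypothetical, and the ordinal function returns the category index k rather than the value ϑ_k.

```python
import numpy as np

def ordinal_from_latent(z, thresholds):
    """Return k in 1..K such that thresholds[k-1] < z <= thresholds[k].

    thresholds = [gamma_0, ..., gamma_K] with gamma_0 = -inf and gamma_K = +inf.
    """
    inner = np.asarray(thresholds[1:-1], dtype=float)
    # side="left" counts the inner thresholds strictly below z
    return int(np.searchsorted(inner, z, side="left")) + 1

def nominal_from_latents(z_block):
    """Map the L-1 latents of one nominal variable to a category in 1..L."""
    z_block = np.asarray(z_block, dtype=float)
    n_categories = z_block.size + 1
    if z_block.max() < 0:
        return n_categories            # all latents negative: category L
    # Otherwise the largest latent decides; exact zeros and ties have
    # probability zero under a continuous latent distribution.
    return int(np.argmax(z_block)) + 1

def latent_dimension(c, o, nominal_categories):
    """q = c + o + sum over nominal variables of (L_j - 1)."""
    return c + o + sum(L - 1 for L in nominal_categories)
```

For instance, one continuous variable, seven ordinal variables (binary ones included) and one nominal variable with four categories give `latent_dimension(1, 7, [4]) == 11`.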
Non iid data
Data come from a complex survey sample
Individual $y_i$ has known sampling probability $\pi_i$
$\omega_i = 1/\pi_i$ are the sampling design weights or expansion factors
[Lumley, 2010]: Weighted least squares
\[ \min \sum_{i=1}^{n} \frac{1}{\pi_i} (y_i - \alpha - \beta x_i)^2 \]
[Chambers and Skinner, 2003]: Likelihood re-weighting
\[ \prod_{i=1}^{n} f(y_i \mid \theta)^{1/\pi_i} \]
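Both re-weighting ideas can be illustrated with a short Python sketch. It assumes a simple linear regression with one covariate; the function names are hypothetical and the code only shows where the weights 1/π_i enter.

```python
import numpy as np

def weighted_least_squares(x, y, pi):
    """Minimise sum_i (1/pi_i) * (y_i - alpha - beta * x_i)^2; return (alpha, beta)."""
    x, y, pi = (np.asarray(v, dtype=float) for v in (x, y, pi))
    w = 1.0 / pi                                  # design weights omega_i = 1 / pi_i
    X = np.column_stack([np.ones_like(x), x])
    # Scaling each row by sqrt(w_i) turns the weighted problem into ordinary least squares
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return float(coef[0]), float(coef[1])

def weighted_log_likelihood(log_density_terms, pi):
    """Log of prod_i f(y_i | theta)^(1/pi_i): each log-density is multiplied by 1/pi_i."""
    terms = np.asarray(log_density_terms, dtype=float)
    return float(np.sum(terms / np.asarray(pi, dtype=float)))
```

Observations with a small sampling probability receive a large weight, so they count more in both criteria.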
Model
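Observed $y$ ⇐⇒ latent $z$
We propose
\[ z_i \mid \mu_i, \Sigma \sim N_q(\mu_i, \kappa \pi_i \Sigma) \]
plus some estimability constraints on $\Sigma$:
For $j = 1, \ldots, c$ ⇒ $\sigma_j^2 > 0$
For $j = c + 1, \ldots, c + o$ ⇒ $\sigma_j^2 > 0$ if $K_j > 2$ and $\sigma_j^2 = 1$ if $K_j = 2$
For $j = c + o + 1, \ldots, c + o + \sum_{h=c+o+1}^{c+o+m} (L_h - 1)$ ⇒ $\sigma_j^2 = 1$
To ensure positive definiteness: separation strategy
\[ \Sigma = \Lambda \Omega \Lambda \]
with $\Lambda = \mathrm{diag}(\sigma_1, \ldots, \sigma_q)$ and $\Omega$ a correlation matrix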
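As a small numerical illustration of the separation strategy, the sketch below builds Σ from standard deviations and a correlation matrix. The function name `build_sigma` and the example values are hypothetical.

```python
import numpy as np

def build_sigma(sigmas, omega):
    """Return Sigma = Lambda @ Omega @ Lambda with Lambda = diag(sigma_1, ..., sigma_q)."""
    lam = np.diag(np.asarray(sigmas, dtype=float))
    return lam @ np.asarray(omega, dtype=float) @ lam

# Example with q = 2: the second variance is fixed at 1, as for a binary variable
omega = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
sigma = build_sigma([2.0, 1.0], omega)
# Positive standard deviations and a positive definite Omega give a positive definite Sigma
eigenvalues = np.linalg.eigvalsh(sigma)
```

The constraints on the variances are imposed through Λ alone, while Ω carries the dependence between latent variables.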
Priors
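For $\mu_i$: Nonparametric prior [Pitman and Yor, 1997]
\[ \mu_i \mid G \overset{iid}{\sim} G, \quad i = 1, \ldots, n, \quad \text{with } G \sim PD(a, b, G_0) \]
with $G_0(\mu) = N(0, \Sigma_\mu)$ and $\Sigma_\mu = \mathrm{diag}(\sigma_{\mu 1}^2, \ldots, \sigma_{\mu q}^2)$
$G$ is a.s. discrete: $G(\cdot) = \sum_{k=1}^{\infty} w_k \delta_{\xi_k}(\cdot)$ with $\xi_k \overset{iid}{\sim} G_0$, $w_1 = v_1$ and $w_k = v_k \prod_{l<k} (1 - v_l)$, with $v_k \overset{ind}{\sim} \mathrm{Be}(1 - a, b + ka)$
Marginally,
\[ f(\mu_i \mid \mu_{-i}) = \frac{b + a r_i}{b + n - 1} \, g_0(\mu_i) + \sum_{j=1}^{r_i} \frac{n_{j,i}^* - a}{b + n - 1} \, \delta_{\mu_{j,i}^*}(\mu_i) \]
For $\sigma_j^2$:
\[ \sigma_j^2 \overset{iid}{\sim} \mathrm{IGa}(d_0^z, d_1^z) \]
For $\Omega$: Marginally uniform [Barnard et al., 2000]
\[ f(\Omega) \propto |\Omega|^{q(q-1)/2 - 1} \prod_j |\Omega_{jj}|^{-(q+1)/2} \]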
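The stick-breaking weights and the marginal (predictive) probabilities of the Poisson-Dirichlet prior can be sketched numerically. The snippet below is an illustration under stated assumptions: it truncates the infinite sum at a finite number of atoms, and the function names are hypothetical.

```python
import numpy as np

def stick_breaking_weights(a, b, n_atoms, rng):
    """Draw the first n_atoms weights w_k = v_k * prod_{l<k} (1 - v_l),
    with v_k ~ Be(1 - a, b + k a), for 0 <= a < 1 and b > -a."""
    k = np.arange(1, n_atoms + 1)
    v = rng.beta(1.0 - a, b + k * a)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining

def predictive_probabilities(a, b, cluster_sizes):
    """Probability of a new value and of each existing distinct value.

    cluster_sizes holds n*_{j,i}, the sizes of the r_i clusters formed by
    the other n - 1 observations, so they sum to n - 1.
    """
    sizes = np.asarray(cluster_sizes, dtype=float)
    denom = b + sizes.sum()                     # b + n - 1
    p_new = (b + a * sizes.size) / denom        # (b + a r_i) / (b + n - 1)
    p_existing = (sizes - a) / denom            # (n*_{j,i} - a) / (b + n - 1)
    return p_new, p_existing
```

The truncated weights sum to less than one because the remaining mass belongs to the atoms that were not drawn; the predictive probabilities always sum to one.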
Hyper priors
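For $(a, b)$: $a \in [0, 1)$ and $b > -a$
\[ f(a, b) = f(a) f(b \mid a) \]
with
\[ f(a) = \alpha I_0(a) + (1 - \alpha) \mathrm{Be}(a \mid d_0^a, d_1^a) \]
and
\[ f(b \mid a) = \mathrm{Ga}(b + a \mid d_0^b, d_1^b) \]
For $\sigma_{\mu j}^2$:
\[ \sigma_{\mu j}^2 \overset{iid}{\sim} \mathrm{IGa}(d_0^\mu, d_1^\mu) \]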
Posteriors
Posterior inference is based on MCMC
Apart from the conditional posterior distributions of all model parameters, we also sample from the conditional predictive distributions of the $z_{ij}$, which are truncated normals for the latent variables not associated with continuous variables
Implemented in the R package BNPMIXcluster
Clustering selection
For each MCMC iteration we have an $n \times n$ adjacency matrix (1 if elements $i$ and $j$ share the same value of $\mu^*$ and 0 otherwise)
Compute a similarity matrix (Monte Carlo average of all the adjacency matrices). This represents the "average clustering".
We select the adjacency matrix of the iteration with minimum squared distance from the similarity matrix [Dahl, 2006]
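The selection rule can be sketched in a few lines of Python. This is an illustrative version, not the BNPMIXcluster implementation; it assumes the MCMC output is available as an array of cluster labels with one row per kept iteration, and the function names are hypothetical.

```python
import numpy as np

def adjacency(labels):
    """n x n matrix with 1 where two elements share a cluster label, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def least_squares_clustering(label_draws):
    """Return the index of the draw closest to the average clustering,
    together with the similarity matrix."""
    label_draws = np.asarray(label_draws)
    n_draws, n = label_draws.shape
    # First pass: Monte Carlo average of the adjacency matrices
    similarity = np.zeros((n, n))
    for labels in label_draws:
        similarity += adjacency(labels)
    similarity /= n_draws
    # Second pass: squared distance of each adjacency matrix to the average
    distances = [np.sum((adjacency(labels) - similarity) ** 2) for labels in label_draws]
    return int(np.argmin(distances)), similarity
```

Two passes over the draws avoid holding every adjacency matrix in memory at once, which matters when n is in the thousands. The adjacency matrix is unaffected by relabelling, so label switching across iterations does not change the result.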
Clustering comparison
To remove the scales of the variables, define new variables $y_{ij}^*$ as follows: for a numerical variable, $y_{ij}^*$ is $y_{ij}$ standardized across all individuals; for a categorical variable, if the number of categories is two then $y_{ij}^* = y_{ij}$, otherwise define $y_{il}^*$ as an indicator variable for each category $l = 1, \ldots, L_j$
\[ HM(C_1, \ldots, C_r) = \sum_{k=1}^{r} n_k \sum_{j=1}^{p^*} S_{kj}^2, \quad \text{where} \quad S_{kj}^2 = \sum_{i=1}^{n_k} w_i^{(k)} {y_{ij}^*}^2 - \left( \sum_{i=1}^{n_k} w_i^{(k)} y_{ij}^* \right)^2, \qquad (1) \]
with $w_i^{(k)} = w_i / \sum_{l \in C_k} w_l$ the normalized weights and $p^*$ the number of resulting $y_{ij}^*$ variables
We want small HM and small $r$.
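A direct transcription of the heterogeneity measure HM into Python might look as follows. The sketch assumes the rescaled variables are already stored in an n × p* matrix; the function and argument names are hypothetical.

```python
import numpy as np

def heterogeneity_measure(y_star, labels, weights):
    """HM = sum_k n_k * sum_j S2_kj, with S2_kj the weighted within-cluster variance."""
    y_star = np.asarray(y_star, dtype=float)     # n x p* matrix of rescaled variables
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    hm = 0.0
    for k in np.unique(labels):
        members = labels == k
        w = weights[members] / weights[members].sum()   # normalized weights w_i^(k)
        yk = y_star[members]
        mean = w @ yk                                   # weighted mean of each variable
        s2 = w @ yk**2 - mean**2                        # weighted variance S2_kj
        hm += members.sum() * s2.sum()                  # n_k * sum_j S2_kj
    return float(hm)
```

With one variable taking the values 0, 2, 5, 5 and equal weights, the partition {0, 2}, {5, 5} gives HM = 2, whereas a single cluster gives HM = 18.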
Simulation study 1
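3 underlying groups defined by three variables $z' = (z_1, z_2, z_3)$
\[ f(z) = \tfrac{1}{3} N(z \mid \mu_1, \Sigma_1) + \tfrac{1}{3} N(z \mid \mu_2, \Sigma_2) + \tfrac{1}{3} N(z \mid \mu_3, \Sigma_3) \]
\[ \mu_1 = \begin{pmatrix} 2 \\ 2 \\ 5 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 6 \\ 4 \\ 2 \end{pmatrix}, \quad \mu_3 = \begin{pmatrix} 1 \\ 6 \\ 2 \end{pmatrix} \]
\[ \Sigma_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 0.1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0.1 \end{pmatrix}, \quad \Sigma_3 = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 0.1 & 0 \\ 0 & 0 & 0.1 \end{pmatrix} \]
$n = 100$ data points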
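Data from this three-component mixture can be simulated with the short sketch below. The means and variances are those of the slide; the function name, the seed and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

# Component means and diagonal variances of Sigma_1, Sigma_2, Sigma_3
MEANS = np.array([[2.0, 2.0, 5.0],
                  [6.0, 4.0, 2.0],
                  [1.0, 6.0, 2.0]])
VARIANCES = np.array([[1.0, 1.0, 1.0],
                      [0.1, 2.0, 0.1],
                      [2.0, 0.1, 0.1]])

def simulate_ss1(n, rng):
    """Draw n points from the equally weighted mixture; return (z, group labels 0..2)."""
    groups = rng.integers(0, 3, size=n)          # each group has probability 1/3
    noise = rng.standard_normal((n, 3))
    # Diagonal covariances: scale independent standard normals by the standard deviations
    z = MEANS[groups] + np.sqrt(VARIANCES[groups]) * noise
    return z, groups

z, groups = simulate_ss1(100, np.random.default_rng(2018))
```

Because all three covariance matrices are diagonal, no Cholesky factorisation is needed.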
SS1 : Data
[Figure: scatter plot of the simulated data on the axes Z1, Z2 and Z3, together with pairwise scatter plots of (Z1, Z2), (Z1, Z3) and (Z2, Z3)]
SS1 : Variables
Considered three scenarios:
(I) Three continuous variables $(y_1, y_2, y_3)$: $y_i = z_i$, $i = 1, 2, 3$.
(II) Two binary variables $(y_1, y_3)$: $y_1 = 1(z_1 > 5)$ and $y_3 = 1(z_3 > 3)$.
(III) Two binary variables $(y_1, y_3)$, as in (II), plus one ordinal variable $y_2 = 1(4 < z_2 \leq 5) + 2 \cdot 1(z_2 > 5)$, and a continuous variable $y_4 \sim N(0, 1)$
Cut-off points $(\gamma_0, \gamma_1, \gamma_2) = (-\infty, 0, \infty)$ for the binary variables, and $(\gamma_0, \gamma_1, \gamma_2, \gamma_3) = (-\infty, 0, 4, \infty)$ for the ordinal variable with 3 categories
Priors on variances:
A) $d_0^z = d_0^\mu = 0.1$ and $d_1^z = d_1^\mu = 0.1$
B) $d_0^z = d_0^\mu = 1$ and $d_1^z = d_1^\mu = 1$
C) $d_0^z = d_0^\mu = 2.1$ and $d_1^z = d_1^\mu = 30$
$\alpha = 0.5$, $d_0^a = d_1^a = d_0^b = d_1^b = 1$, $\kappa = 1$ and $\pi_i = 1$.
Gibbs sampler with 4700 iterations, a burn-in of 200 and a thinning of 3. A total of 1500 MCMC draws were kept for inference
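The construction of the observed variables in the three scenarios, and the count of kept draws, can be checked with a small sketch. It assumes the latent sample is an n × 3 NumPy array; the function names are hypothetical.

```python
import numpy as np

def scenario_variables(z, rng):
    """Build the observed variables of scenarios (I), (II) and (III) from latent z."""
    z = np.asarray(z, dtype=float)
    z1, z2, z3 = z[:, 0], z[:, 1], z[:, 2]
    y1 = (z1 > 5).astype(int)                                  # binary
    y3 = (z3 > 3).astype(int)                                  # binary
    # Ordinal with values 0, 1, 2: 1 if 4 < z2 <= 5, 2 if z2 > 5
    y2 = ((z2 > 4) & (z2 <= 5)).astype(int) + 2 * (z2 > 5).astype(int)
    y4 = rng.standard_normal(len(z))                           # unrelated continuous noise
    return {
        "I": z.copy(),
        "II": np.column_stack([y1, y3]),
        "III": np.column_stack([y1, y2, y3, y4]),
    }

def kept_draws(iterations, burn_in, thinning):
    """Number of MCMC draws left after discarding the burn-in and thinning."""
    return (iterations - burn_in) // thinning
```

With the settings of the slide, `kept_draws(4700, 200, 3)` returns 1500, which matches the number of draws reported.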
SS1 : Convergence speed
[Figure: four panels plotting the model cluster (vertical axis) against the real cluster (horizontal axis, values 1 to 3) at iterations 1, 11, 20 and 30. The vertical axis spans roughly 0 to 60 at iteration 1, 2 to 8 at iteration 11, 1 to 4 at iteration 20 and 1 to 3 at iteration 30.]
SS1 : Results
[Figure: nine bar charts of the distribution of the number of clusters, one per scenario (I, II, III) and prior (A, B, C). Horizontal-axis ranges: I (A) 81 to 97, I (B) 30 to 55, I (C) 3 to 6; II (A) 3 to 37, II (B) 3 to 38, II (C) 3 to 18; III (A) 74 to 98, III (B) 54 to 75, III (C) 4 to 16.]
Simulation study 2
[Figure: density $f(z)$ plotted against $z$ over the range 0 to 50]
We took $n = 200$ mutually exclusive intervals $A_i = (\tau_{i-1}, \tau_i]$, where $\tau_0 = 0$ and $\tau_i = \tau_{i-1} + 0.25$, for $i = 1, \ldots, n$, and calculated $p_i = \Pr(A_i)$ under the density $f(z)$
We simulated a single value $z_i$ uniformly from $A_i$, that is $z_i \sim \mathrm{Un}(\tau_{i-1}, \tau_i]$, and defined $y_i = z_i$, $i = 1, \ldots, n$. Clearly, the data $y_i$ look like a uniform sample on the interval $(0, 50]$.
SS2 : Priors
Let $N$ = hypothetical population size and define $w_i = N p_i$, as in a complex sampling design. Here $\sum_{i=1}^{n} w_i = N$ and $\bar{w} = N/n$.
We consider three scenarios:
(IV) Ignoring the sampling design, $\pi_i = 1$ and $\kappa = 1$
(V) Acknowledging the sampling design, $\pi_i = 1/w_i$ and $\kappa = \bar{w}/15$
(VI) Acknowledging the sampling design, $\pi_i = 1/w_i$ and $\kappa = \bar{w}/25$
Note that $w_i$ and $\kappa$ affect the variance of the latent variables multiplicatively and, since $\bar{w}/w_i = \bar{p}/p_i$, where $\bar{p} = (1/n) \sum_{i=1}^{n} p_i$, there is no need to specify $N$
We took prior (C) for the variances and the same specifications for all other priors
The Gibbs sampler was run for 4,700 iterations with a burn-in of 200 and a thinning of 3
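The claim that N need not be specified can be verified numerically. The sketch below is only an illustration: the density used in the slides is not given in the text, so a stand-in two-component normal mixture is assumed, the interval probabilities are renormalised so that the weights add up to N, and all function names are hypothetical.

```python
import math
import numpy as np

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of N(mean, sd^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def stand_in_cdf(x):
    """Assumed mixture CDF; NOT the density f(z) used in the slides."""
    return 0.5 * normal_cdf(x, 12.0, 4.0) + 0.5 * normal_cdf(x, 32.0, 6.0)

def variance_factors(population_size, divisor, n=200, width=0.25):
    """Return kappa * pi_i for each interval, plus the weights w_i = N p_i."""
    tau = width * np.arange(n + 1)                 # tau_0 = 0, ..., tau_n = 50
    cdf = np.array([stand_in_cdf(t) for t in tau])
    p = np.diff(cdf)                               # p_i = Pr(A_i)
    p = p / p.sum()                                # renormalise so that sum_i w_i = N
    w = population_size * p                        # w_i = N p_i
    kappa = w.mean() / divisor                     # kappa = w_bar / divisor
    pi = 1.0 / w                                   # pi_i = 1 / w_i
    return kappa * pi, w

factors_small, w_small = variance_factors(1_000, 15)
factors_large, w_large = variance_factors(1_000_000, 15)
```

Both calls give the same factors κπ_i, because κπ_i = w̄/(15 w_i) = p̄/(15 p_i) does not involve N. Intervals with a small probability p_i receive a larger variance factor.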
SS2 : Results
[Figure: three bar charts of the distribution of the number of clusters, one per scenario. Horizontal-axis ranges: IV 1 to 6, V 3 to 15, VI 3 to 23.]
Real data : EdoMex
Facts of EdoMex:
- 16.2 million inhabitants (13.5% of the country's population); largest state
- 4.2 million households
- ENIGH is a sample of 1,730 households (0.04% of the households)
- Each household in the sample represents between 960 and 5,286 households
[Figure: density of the sampling weights w_i, with the horizontal axis running from 1000 to 5000]
Real data : Variables
Continuous:
- Y1 = Income per capita ⇒ $Z_1 = \log(Y_1 + c)$, $c = \xi_{0.01}$
Binary: $\gamma_{j,0} = -\infty$, $\gamma_{j,1} = 0$, $\gamma_{j,2} = \infty$
- Y2 = Education deprivation (yes or no)
- Y3 = Access to health services (yes or no)
- Y4 = Access to social security (yes or no)
- Y5 = Housing quality (bad or good)
- Y6 = Access to basic public services (yes or no)
- Y7 = Access to feeding (yes or no)
Ordinal: $\gamma_{8,0} = -\infty$, $\gamma_{8,1} = 0$, $\gamma_{8,2} = 4$ and $\gamma_{8,3} = \infty$
- Y8 = Education level (0–incomplete primary, 1–incomplete secondary, 2–complete secondary or more)
Nominal:
- Y9 = Town size (1–[100000, ∞), 2–[15000, 100000), 3–[2500, 15000), 4–(0, 2500) inhabitants)
Real data : Prior specifications
Considered three cases:
i) Ignoring the sampling design, $\pi_i = 1$ and $\kappa = 1$
ii) Acknowledging the sampling design, $\pi_i = 1/w_i$ and $\kappa = 2\bar{w}$
iii) Acknowledging the sampling design, $\pi_i = 1/w_i$ and $\kappa = 4\bar{w}$
Prior (C) for the variances and the same specifications for all other priors
The Gibbs sampler was run for 3,200 iterations with a burn-in of 200 and a thinning of 3
Real data : (i)
r = 163, HM = 1246
Real data : (ii)
r = 35, HM = 2240
Real data : (iii)
r = 9, HM = 3000
Real data : Group means (iii)
group  income  feed  health  house  edu   serv  ss    hedu  ts:1  ts:2  ts:3  ts:4  size
1        5934  0.22  0.35    0.11   0.32  0.11  0.80  1.91  0.59  0.13  0.15  0.13  36.0%
2       11374  0.16  0.34    0.07   0.36  0.07  0.71  1.93  0.64  0.12  0.14  0.10  30.5%
3       22682  0.06  0.34    0.01   0.23  0.04  0.67  1.99  0.74  0.09  0.10  0.06  12.4%
4        3091  0.26  0.42    0.06   0.43  0.24  0.89  1.78  0.54  0.11  0.16  0.20  9.0%
5        1783  0.37  0.23    0.09   0.95  0.41  0.60  0.21  0.41  0.05  0.18  0.36  4.0%
6        5006  0.25  0.32    0.13   0.94  0.15  0.64  0.53  0.46  0.11  0.26  0.16  4.2%
7       44991  0.04  0.09    0.00   0.14  0.02  0.30  2.00  0.63  0.29  0.06  0.02  2.7%
8         570  0.69  0.24    0.46   0.68  0.29  1.00  1.46  0.16  0.06  0.48  0.30  0.7%
9      219578  0.16  0.27    0.00   0.00  0.00  0.27  2.00  0.89  0.11  0.00  0.00  0.5%
pop     11212  0.19  0.34    0.08   0.38  0.11  0.74  1.79  0.61  0.12  0.15  0.13  4240837
References
Barnard, J., McCulloch, R. and Meng, X.-L. (2000). Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica 10, 1281–1311.
Chambers, R. L. and Skinner, C. J. (2003). Analysis of Survey Data. Wiley, Chichester.
Carmona, C., Nieto-Barajas, L. E. and Canale, A. (2017). Model-based approach for household clustering with mixed scale variables. Submitted to Advances in Data Analysis and Classification.
Coneval (2009). Metodología para la medición multidimensional de la pobreza en México. Consejo Nacional de Evaluación de la Política de Desarrollo Social, México. (In Spanish). Available at: http://www.coneval.org.mx/rw/resource/Metodologia_Medicion_Multidimensional.pdf
Dahl, D. B. (2006). Model-based clustering for expression data via a Dirichlet process mixture model. In Bayesian Inference for Gene Expression and Proteomics, Eds. M. Vannucci, K.-A. Do and P. Müller. Cambridge University Press, Cambridge.
Lumley, T. (2010). Complex Surveys. Wiley, New Jersey.
Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability 25, 855–900.