Model-based approach for household clustering with mixed scale variables

Luis E. Nieto Barajas

Department of Statistics, ITAM

Trinity College – 22 June 2018



Contents

1 Motivation

2 Data

3 Model

4 Simulation studies

5 Real data analysis



Context

SEDESOL: Ministry of Social Development (government agency)

Aim of SEDESOL: help reduce social backwardness
To fulfil it, SEDESOL creates social programmes that target specific needs:

  Social inclusion
  Life insurance for single mothers
  Feeding support
  Pension for elderly people
  Day care centers
  Milk provision
  etc.



Objective

Currently each existing programme has its own selection rules (mainly based on income)

Objective: Unify the selection procedure by creating clusters of households with similar needs and socio-economic conditions.



CONEVAL

How do we measure poverty in households?
In 2009, CONEVAL proposed a new methodology based on multiple dimensions: income dimension, social deprivations and social cohesion
Multidimensional poverty indicators:

1 income per capita
2 education deprivation
3 access to health services
4 access to social security
5 housing quality
6 access to basic public services
7 access to feeding



ENIGH

Every 2 years the National Institute for Official Statistics (INEGI) implements a survey

The survey is based on a complex design of households, i.e., different selection probabilities for each household

The survey is nationwide and includes a module of socio-economic conditions

Clusters will be produced based on these 7-dimensional poverty indicators plus a few other variables

Available data are mixed mode, i.e., continuous, discrete, ordinal and nominal

Challenge: Produce clusters of households using mixed scale data, taking into account the complex sampling design of the survey



Notation

$y_i = (y_{ij})$, $j = 1, \ldots, p$, $i = 1, \ldots, n$, is the multivariate response

$p$ = total number of variables, such that $p = c + o + m$, where
  $c$ = # continuous variables,
  $o$ = # ordinal variables,
  $m$ = # nominal variables

In summary, $y_i' = (y_{i,1}, \ldots, y_{i,c}, y_{i,c+1}, \ldots, y_{i,c+o}, y_{i,c+o+1}, \ldots, y_{i,c+o+m})$

Each $y_i$ of dimension $p$ $\Longleftrightarrow$ latent $z_i$ of dimension $q$, with $q \ge p$



Latent variables

If $y_{ij}$ is continuous $\Rightarrow$ $z_{ij} = g_j(y_{ij})$, where $g_j(\cdot)$ is a normalising transformation

If $y_{ij}$ is ordinal with $K_j$ different values $\vartheta_1, \ldots, \vartheta_{K_j}$ $\Rightarrow$ $z_{ij}$ satisfies $y_{ij} = \vartheta_k$ iff $\gamma_{j,k-1} < z_{ij} \le \gamma_{j,k}$, where $\gamma_{j,0}, \ldots, \gamma_{j,K_j}$ are fixed thresholds

Note: a binary variable is a special case of an ordinal variable with $K_j = 2$

If $y_{ij}$ is nominal with $L_j$ categories we need $L_j - 1$ latents $z_{ij} = (z_{il})$, $l = c + o + \sum_{h=c+o+1}^{j-1}(L_h - 1) + 1, \ldots, c + o + \sum_{h=c+o+1}^{j}(L_h - 1)$, such that

$$
y_{ij} =
\begin{cases}
L_j, & \text{if } \max_l(z_{i,l}) < 0 \\
k, & \text{if } z_{i,s} = \max_l(z_{i,l}) \text{ and } z_{i,s} > 0
\end{cases}
$$

with $s = c + o + \sum_{h=c+o+1}^{j-1}(L_h - 1) + k$, and $k = 1, \ldots, L_j - 1$

Finally, $q = c + o + \sum_{j=c+o+1}^{c+o+m}(L_j - 1)$
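The mapping from latent values back to observed categories can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for this illustration; thresholds are passed as a list such as (−∞, 0, ∞) for a binary variable, and the ordinal function returns the index k of the category rather than the label ϑ_k. A maximum exactly equal to zero is not covered by the rule on the slide; the sketch assigns it to category k, which is an assumption.

```python
import math

def ordinal_from_latent(z, thresholds):
    """Return k (1-based) such that thresholds[k-1] < z <= thresholds[k]."""
    for k in range(1, len(thresholds)):
        if thresholds[k - 1] < z <= thresholds[k]:
            return k
    raise ValueError("z is not covered by the thresholds")

def nominal_from_latents(z_block):
    """Map the L_j - 1 latents of one nominal variable to a category in 1..L_j."""
    n_categories = len(z_block) + 1
    best = max(z_block)
    if best < 0:
        return n_categories            # all latents negative: category L_j
    return z_block.index(best) + 1     # position k of the (non-negative) maximum

# Binary variable (K_j = 2) with thresholds (-inf, 0, inf)
binary_thresholds = [-math.inf, 0.0, math.inf]
# Ordinal variable with three categories and thresholds (-inf, 0, 4, inf)
ordinal_thresholds = [-math.inf, 0.0, 4.0, math.inf]

print(ordinal_from_latent(-1.3, binary_thresholds))
print(ordinal_from_latent(2.5, ordinal_thresholds))
print(nominal_from_latents([-0.4, -2.0, -0.1]))
print(nominal_from_latents([0.3, 1.7, -0.5]))
```

A nominal variable with four categories therefore contributes three latent coordinates to $z_i$, which is why $q \ge p$.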



Non iid data

Data come from a complex survey sample
Individual $y_i$ has known sampling probability $\pi_i$
$\omega_i = 1/\pi_i$ are sampling design weights or expansion factors

[Lumley, 2010]: Weighted least squares

$$
\min \sum_{i=1}^{n} \frac{1}{\pi_i} (y_i - \alpha - \beta x_i)^2
$$

[Chambers and Skinner, 2003]: Likelihood re-weighting

$$
\prod_{i=1}^{n} f(y_i \mid \theta)^{1/\pi_i}
$$
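Both re-weighting ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the regression has a single covariate so the minimiser has a closed form, and the inclusion probabilities in the example are made up.

```python
def weighted_least_squares(x, y, pi):
    """Minimise sum_i (1 / pi_i) * (y_i - alpha - beta * x_i)^2 in closed form."""
    w = [1.0 / p for p in pi]                      # design weights 1 / pi_i
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta

def weighted_log_likelihood(log_density, y, pi, theta):
    """Logarithm of prod_i f(y_i | theta)^(1 / pi_i)."""
    return sum(log_density(yi, theta) / p for yi, p in zip(y, pi))

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]            # lies exactly on y = 1 + 2x
pi = [0.5, 0.1, 0.2, 0.25]          # hypothetical inclusion probabilities
alpha, beta = weighted_least_squares(x, y, pi)
print(alpha, beta)
```

Units with a small $\pi_i$ receive a large exponent $1/\pi_i$, so they count as if they had been observed several times.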



Model

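Observed $y$ $\Longleftrightarrow$ latent $z$

We propose

$$
z_i \mid \mu_i, \Sigma \sim N_q(\mu_i, \kappa \pi_i \Sigma)
$$

plus some estimability constraints on $\Sigma$:

For $j = 1, \ldots, c$ $\Rightarrow$ $\sigma_j^2 > 0$
For $j = c + 1, \ldots, c + o$ $\Rightarrow$ $\sigma_j^2 > 0$ if $K_j > 2$ and $\sigma_j^2 = 1$ if $K_j = 2$
For $j = c + o + 1, \ldots, c + o + \sum_{h=c+o+1}^{c+o+m}(L_h - 1)$ $\Rightarrow$ $\sigma_j^2 = 1$

To ensure positive definiteness: separation strategy

$$
\Sigma = \Lambda \Omega \Lambda
$$

with $\Lambda = \mathrm{diag}(\sigma_1, \ldots, \sigma_q)$ and $\Omega$ a correlation matrix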
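As a small numerical illustration of the separation strategy, the sketch below builds $\Sigma = \Lambda \Omega \Lambda$ from standard deviations and a correlation matrix. The function name and the numbers are hypothetical; the example fixes $\sigma_j = 1$ for the second and third coordinates, as the constraints require for binary and nominal latents.

```python
def separation(sigmas, omega):
    """Build Sigma = Lambda * Omega * Lambda with Lambda = diag(sigmas)."""
    q = len(sigmas)
    return [[sigmas[i] * omega[i][j] * sigmas[j] for j in range(q)]
            for i in range(q)]

# Hypothetical q = 3: one continuous latent (free sigma) and two latents with sigma = 1
sigmas = [2.0, 1.0, 1.0]
omega = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, -0.3],
    [0.0, -0.3, 1.0],
]
sigma = separation(sigmas, omega)
for row in sigma:
    print(row)
```

Because $\Lambda$ only rescales rows and columns, $\Sigma$ is positive definite whenever $\Omega$ is a valid correlation matrix and every $\sigma_j > 0$.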



Priors

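For $\mu_i$: Nonparametric prior [Pitman and Yor, 1997]

$$
\mu_i \mid G \overset{iid}{\sim} G, \text{ for } i = 1, \ldots, n, \text{ with } G \sim PD(a, b, G_0)
$$

with $G_0(\mu) = N(0, \Sigma_\mu)$ and $\Sigma_\mu = \mathrm{diag}(\sigma_{\mu 1}^2, \ldots, \sigma_{\mu q}^2)$

$G$ is a.s. discrete: $G(\cdot) = \sum_{k=1}^{\infty} w_k \delta_{\xi_k}(\cdot)$ with $\xi_k \overset{iid}{\sim} G_0$ and $w_1 = v_1$, $w_k = v_k \prod_{l<k}(1 - v_l)$, with $v_k \overset{ind}{\sim} Be(1 - a, b + ka)$

Marginally

$$
f(\mu_i \mid \mu_{-i}) = \frac{b + a\, r_i}{b + n - 1}\, g_0(\mu_i) + \sum_{j=1}^{r_i} \frac{n^*_{j,i} - a}{b + n - 1}\, \delta_{\mu^*_{j,i}}(\mu_i)
$$

For $\sigma_j^2$:

$$
\sigma_j^2 \overset{iid}{\sim} IGa(d_0^z, d_1^z)
$$

For $\Omega$: Marginally uniform [Barnard et al., 2000]

$$
f(\Omega) \propto |\Omega|^{q(q-1)/2 - 1} \prod_j |\Omega_{jj}|^{-(q+1)/2}
$$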
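The stick-breaking construction of the weights $w_k$ and the prior weight of a new cluster in the marginal formula can be sketched as follows. This is an illustration only: the infinite sum is truncated at a finite number of atoms (an assumption of the sketch), the function names are hypothetical, and the values of $a$ and $b$ are arbitrary.

```python
import random

def stick_breaking_weights(a, b, n_atoms, rng):
    """Truncated stick-breaking: v_k ~ Be(1 - a, b + k a), w_k = v_k prod_{l<k} (1 - v_l)."""
    weights = []
    remaining = 1.0                       # length of stick not yet assigned
    for k in range(1, n_atoms + 1):
        v = rng.betavariate(1.0 - a, b + k * a)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining

def new_cluster_weight(a, b, n, r_i):
    """Weight of g_0 in f(mu_i | mu_{-i}): (b + a r_i) / (b + n - 1)."""
    return (b + a * r_i) / (b + n - 1)

rng = random.Random(2018)
weights, leftover = stick_breaking_weights(a=0.5, b=1.0, n_atoms=50, rng=rng)
print(sum(weights) + leftover)            # total stick length
print(new_cluster_weight(a=0.5, b=1.0, n=11, r_i=3))
```

With $a = 0$ the weight of a new cluster reduces to $b/(b + n - 1)$, which does not depend on the current number of clusters $r_i$.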



Hyper priors

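For $(a, b)$: $a \in [0, 1)$ and $b > -a$

$$
f(a, b) = f(a)\, f(b \mid a)
$$

with

$$
f(a) = \alpha I_0(a) + (1 - \alpha)\, Be(a \mid d_0^a, d_1^a)
$$

and

$$
f(b \mid a) = Ga(b + a \mid d_0^b, d_1^b)
$$

For $\sigma_{\mu j}^2$:

$$
\sigma_{\mu j}^2 \overset{iid}{\sim} IGa(d_0^\mu, d_1^\mu)
$$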
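Drawing from the prior on $(a, b)$ can be sketched as below: $a$ is zero with probability $\alpha$ and Beta otherwise, and $b$ is obtained by shifting a Gamma draw by $-a$ so that $b > -a$. The function names are hypothetical, and reading $Ga(\cdot \mid d_0, d_1)$ as shape $d_0$ and rate $d_1$ is an assumption of this sketch, since the slides do not state the parameterisation.

```python
import random

def sample_a(alpha, d0, d1, rng):
    """Draw a from alpha * (point mass at 0) + (1 - alpha) * Be(d0, d1)."""
    if rng.random() < alpha:
        return 0.0
    return rng.betavariate(d0, d1)

def sample_b_given_a(a, d0, d1, rng):
    """Draw b such that b + a ~ Ga(d0, d1); shape/rate is assumed here."""
    # random.gammavariate takes (shape, scale), so scale = 1 / rate
    return rng.gammavariate(d0, 1.0 / d1) - a

rng = random.Random(22)
draws = []
for _ in range(1000):
    a = sample_a(alpha=0.5, d0=1.0, d1=1.0, rng=rng)
    b = sample_b_given_a(a, d0=1.0, d1=1.0, rng=rng)
    draws.append((a, b))

share_zero = sum(1 for a, _ in draws if a == 0.0) / len(draws)
print(share_zero)
```

The point mass at zero lets the prior place positive probability on $a = 0$.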



Posteriors

Posterior inference is based on MCMC

Apart from the conditional posterior distributions of all model parameters, we also sample from the conditional predictive distributions of the $z_{ij}$, which are truncated normals for the latent variables not associated with continuous variables

Implemented in the R package BNPMIXcluster



Clustering selection

For each MCMC iteration we have an $n \times n$ adjacency matrix (1 if elements $i$ and $j$ share the same value of $\mu^*$ and 0 otherwise)

Compute a similarity matrix (Monte Carlo average of all the adjacency matrices). This represents the "average clustering".

We select the adjacency matrix of the iteration with minimum squared distance from the average similarity matrix [Dahl, 2006]
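The three steps above can be sketched in plain Python. The cluster labels per iteration are a made-up toy example and the function names are hypothetical; the sketch only illustrates the least-squares selection rule, not the package implementation.

```python
def adjacency(labels):
    """n x n matrix with 1 where two elements share a cluster label, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def select_clustering(draws):
    """Return the index of the draw closest to the average adjacency matrix."""
    n = len(draws[0])
    mats = [adjacency(d) for d in draws]
    m = len(mats)
    similarity = [[sum(a[i][j] for a in mats) / m for j in range(n)]
                  for i in range(n)]

    def squared_distance(a):
        return sum((a[i][j] - similarity[i][j]) ** 2
                   for i in range(n) for j in range(n))

    best = min(range(m), key=lambda t: squared_distance(mats[t]))
    return best, similarity

# Toy cluster labels for 4 elements over 4 hypothetical MCMC iterations
draws = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],   # same partition as the first draw, different labels
]
best, similarity = select_clustering(draws)
print(best, similarity[2][3])
```

Working with adjacency matrices rather than raw labels makes the rule invariant to relabelling of the clusters, as the last toy draw shows.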



Clustering comparison

To remove the scales of the variables, define new variables $y^*_{ij}$ as follows: for a numerical variable, $y_{ij}$ is standardized across all individuals; for a categorical variable, if the number of categories is two then $y^*_{ij} = y_{ij}$, otherwise define $y^*_{il}$, a latent indicator variable for each category $l = 1, \ldots, L_j$

$$
HM(C_1, \ldots, C_r) = \sum_{k=1}^{r} n_k \sum_{j=1}^{p^*} S^2_{kj}, \quad \text{where } S^2_{kj} = \sum_{i=1}^{n_k} w_i^{(k)} \left(y^*_{ij}\right)^2 - \left( \sum_{i=1}^{n_k} w_i^{(k)} y^*_{ij} \right)^2, \tag{1}
$$

with $w_i^{(k)} = w_i / \sum_{l \in C_k} w_l$ the normalized weights and $p^*$ the number of resulting $y^*_{ij}$ variables

We want small HM and small $r$.
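A direct transcription of HM into Python may help to see what it measures: within each cluster, $S^2_{kj}$ is a weighted variance, and the cluster contributes in proportion to its size $n_k$. The function name and the toy data are hypothetical.

```python
def hm(y_star, weights, labels):
    """HM = sum_k n_k sum_j S2_kj, with S2_kj a weighted within-cluster variance."""
    n_vars = len(y_star[0])
    total = 0.0
    for k in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == k]
        cluster_weight = sum(weights[i] for i in members)
        norm_w = [weights[i] / cluster_weight for i in members]   # w_i^(k)
        for j in range(n_vars):
            mean = sum(w * y_star[i][j] for w, i in zip(norm_w, members))
            mean_sq = sum(w * y_star[i][j] ** 2 for w, i in zip(norm_w, members))
            total += len(members) * (mean_sq - mean ** 2)
    return total

# Toy data: 4 households, one rescaled variable, hypothetical design weights
y_star = [[0.0], [2.0], [5.0], [5.0]]
weights = [1.0, 1.0, 2.0, 3.0]

two_clusters = hm(y_star, weights, [0, 0, 1, 1])
one_cluster = hm(y_star, weights, [0, 0, 0, 0])
print(two_clusters, one_cluster)
```

Splitting into more clusters can only help HM up to a point of diminishing returns, which is why the slides ask for both a small HM and a small $r$.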



Simulation study 1

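3 underlying groups defined by three variables $z' = (z_1, z_2, z_3)$

$$
f(z) = \frac{1}{3} N(z \mid \mu_1, \Sigma_1) + \frac{1}{3} N(z \mid \mu_2, \Sigma_2) + \frac{1}{3} N(z \mid \mu_3, \Sigma_3)
$$

$$
\mu_1 = \begin{pmatrix} 2 \\ 2 \\ 5 \end{pmatrix}, \quad
\mu_2 = \begin{pmatrix} 6 \\ 4 \\ 2 \end{pmatrix}, \quad
\mu_3 = \begin{pmatrix} 1 \\ 6 \\ 2 \end{pmatrix}
$$

$$
\Sigma_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
\Sigma_2 = \begin{pmatrix} 0.1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0.1 \end{pmatrix}, \quad
\Sigma_3 = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 0.1 & 0 \\ 0 & 0 & 0.1 \end{pmatrix}
$$

$n = 100$ data points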
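Because the three covariance matrices are diagonal, data from this mixture can be simulated coordinate by coordinate. The sketch below is one way to do it in Python; the function name and the random seed are arbitrary choices, not taken from the slides.

```python
import random

MEANS = [(2.0, 2.0, 5.0), (6.0, 4.0, 2.0), (1.0, 6.0, 2.0)]
# Diagonal entries of Sigma_1, Sigma_2, Sigma_3
VARIANCES = [(1.0, 1.0, 1.0), (0.1, 2.0, 0.1), (2.0, 0.1, 0.1)]

def simulate_mixture(n, rng):
    """Draw n points from the equal-weight mixture of three diagonal normals."""
    groups, z = [], []
    for _ in range(n):
        k = rng.randrange(3)                      # each group has probability 1/3
        point = tuple(rng.gauss(m, v ** 0.5)      # gauss takes a standard deviation
                      for m, v in zip(MEANS[k], VARIANCES[k]))
        groups.append(k + 1)
        z.append(point)
    return groups, z

rng = random.Random(1)
groups, z = simulate_mixture(100, rng)
print(len(z), sorted(set(groups)))
```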



SS1 : Data

[Figure: scatter plots of the simulated data: one three-dimensional plot of (Z1, Z2, Z3) and three pairwise plots, Z1 vs Z2, Z1 vs Z3 and Z2 vs Z3]



SS1 : Variables

Considered three scenarios:
(I) Three continuous variables $(y_1, y_2, y_3)$: $y_i = z_i$, $i = 1, 2, 3$
(II) Two binary variables $(y_1, y_3)$: $y_1 = \mathbf{1}(z_1 > 5)$ and $y_3 = \mathbf{1}(z_3 > 3)$
(III) Two binary variables $(y_1, y_3)$, as in (II), plus one ordinal variable $y_2 = \mathbf{1}(4 < z_2 \le 5) + 2\,\mathbf{1}(z_2 > 5)$, and a continuous variable $y_4 \sim N(0, 1)$

Cut-off points $(\gamma_0, \gamma_1, \gamma_2) = (-\infty, 0, \infty)$ for the binary variables, and $(\gamma_0, \gamma_1, \gamma_2, \gamma_3) = (-\infty, 0, 4, \infty)$ for the ordinal variable with 3 categories

Priors on variances:
A) $d_0^z = d_0^\mu = 0.1$ and $d_1^z = d_1^\mu = 0.1$
B) $d_0^z = d_0^\mu = 1$ and $d_1^z = d_1^\mu = 1$
C) $d_0^z = d_0^\mu = 2.1$ and $d_1^z = d_1^\mu = 30$

$\alpha = 0.5$, $d_0^a = d_1^a = d_0^b = d_1^b = 1$, $\kappa = 1$ and $\pi_i = 1$.

Gibbs sampler with 4700 iterations, a burn-in of 200 and a thinning of 3. A total of 1500 MCMC draws were kept for inference
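The observed variables of scenarios (II) and (III) are simple discretisations of the latent $z$, and the number of retained draws follows from the burn-in and thinning. A short Python sketch, with hypothetical function names and an arbitrary example point:

```python
import random

def scenario_two(z):
    """(y1, y3) with y1 = 1(z1 > 5) and y3 = 1(z3 > 3)."""
    z1, _, z3 = z
    return int(z1 > 5), int(z3 > 3)

def ordinal_y2(z2):
    """y2 = 1(4 < z2 <= 5) + 2 * 1(z2 > 5), taking values 0, 1, 2."""
    return int(4 < z2 <= 5) + 2 * int(z2 > 5)

def scenario_three(z, rng):
    """(y1, y2, y3, y4): scenario (II) plus ordinal y2 and an independent y4 ~ N(0, 1)."""
    y1, y3 = scenario_two(z)
    return y1, ordinal_y2(z[1]), y3, rng.gauss(0.0, 1.0)

def kept_draws(iterations, burn_in, thinning):
    """Number of MCMC draws retained after burn-in and thinning."""
    return (iterations - burn_in) // thinning

rng = random.Random(3)
print(scenario_two((6.0, 4.0, 2.0)))
print(scenario_three((6.0, 4.5, 2.0), rng)[:3])
print(kept_draws(4700, 200, 3))
```

Note that $y_4$ is drawn independently of $z$, so in scenario (III) it carries no information about the three groups.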



SS1 : Convergence speed

[Figure: four panels plotting model cluster against real cluster (1, 2, 3) at iterations 1, 11, 20 and 30]



SS1 : Results

[Figure: nine panels, I (A) to III (C), each showing relative frequencies of the number of clusters. Number of clusters shown on the horizontal axis: I (A) 81 to 97; I (B) 30 to 55; I (C) 3 to 6; II (A) 3 to 37; II (B) 3 to 38; II (C) 3 to 18; III (A) 74 to 98; III (B) 54 to 75; III (C) 4 to 16]



Simulation study 2

[Figure: density $f(z)$, with $z$ on the horizontal axis from 0 to 50 and density on the vertical axis from 0 to 0.12]

Took $n = 200$ mutually exclusive intervals $A_i = (\tau_{i-1}, \tau_i]$, where $\tau_0 = 0$ and $\tau_i = \tau_{i-1} + 0.25$, for $i = 1, \ldots, n$. Calculated $p_i = \Pr(A_i)$ under density $f(z)$

Simulated a single value $z_i$ uniformly from $A_i$, that is $z_i \sim Un(\tau_{i-1}, \tau_i]$, and defined $y_i = z_i$, $i = 1, \ldots, n$. Clearly, the data $y_i$ look like a uniform sample in the interval $(0, 50]$.



SS2 : Priors

Let $N$ be a hypothetical population size and define $w_i = N p_i$, as in a complex sampling design. Here $\sum_{i=1}^{n} w_i = N$ and $\bar{w} = N/n$.

We consider three scenarios:
(IV) Ignoring the sample design, $\pi_i = 1$ and $\kappa = 1$
(V) Acknowledging the sample design, $\pi_i = 1/w_i$ and $\kappa = \bar{w}/15$
(VI) Acknowledging the sample design, $\pi_i = 1/w_i$ and $\kappa = \bar{w}/25$

Note that $w_i$ and $\kappa$ affect the variance of the latent variables multiplicatively, and since $\bar{w}/w_i = \bar{p}/p_i$, where $\bar{p} = (1/n) \sum_{i=1}^{n} p_i$, there is no need to specify $N$

We took prior (C) for the variances and the same specifications for all other priors

The Gibbs sampler was run for 4,700 iterations with a burn-in of 200 and a thinning of 3
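The claim that $N$ need not be specified can be checked numerically: the factor $\kappa \pi_i$ that scales the latent variance is the same whether it is computed from the weights $w_i = N p_i$ or directly from the $p_i$. The sketch below uses made-up probabilities and hypothetical function names.

```python
def variance_scale_from_weights(p, divisor, pop_size):
    """kappa * pi_i with w_i = N p_i, pi_i = 1 / w_i and kappa = w_bar / divisor."""
    w = [pop_size * p_i for p_i in p]
    w_bar = sum(w) / len(w)
    kappa = w_bar / divisor
    return [kappa / w_i for w_i in w]

def variance_scale_from_probs(p, divisor):
    """Same quantity written without N: p_bar / (divisor * p_i)."""
    p_bar = sum(p) / len(p)
    return [p_bar / (divisor * p_i) for p_i in p]

p = [0.1, 0.2, 0.3, 0.4]          # hypothetical interval probabilities
with_n_small = variance_scale_from_weights(p, divisor=15, pop_size=1_000)
with_n_large = variance_scale_from_weights(p, divisor=15, pop_size=5_000_000)
without_n = variance_scale_from_probs(p, divisor=15)
print(without_n)
```

Observations from low-probability intervals get a larger latent variance, and a larger divisor (25 instead of 15) shrinks all variances by the same factor.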



SS2 : Results

[Figure: three panels, IV, V and VI, each showing relative frequencies of the number of clusters. Number of clusters shown on the horizontal axis: IV 1 to 6; V 3 to 15; VI 3 to 23]



Real data : EdoMex

Facts of EdoMex:
  16.2 million inhabitants (13.5% of the country's population); largest state
  4.2 million households
  ENIGH is a sample of 1,730 households (0.04% of the households)
  Each household in the sample represents between 960 and 5,286 households

[Figure: density of the weights $w_i$, horizontal axis from 1000 to 5000]



Real data : Variables

Continuous:
  Y1 = Income per capita $\Rightarrow$ $Z_1 = \log(Y_1 + c)$, $c = \xi_{0.01}$

Binary: $\gamma_{j0} = -\infty$, $\gamma_{j1} = 0$, $\gamma_{j2} = \infty$
  Y2 = Education deprivation (yes or no)
  Y3 = Access to health services (yes or no)
  Y4 = Access to social security (yes or no)
  Y5 = Housing quality (bad or good)
  Y6 = Access to basic public services (yes or no)
  Y7 = Access to feeding (yes or no)

Ordinal: $\gamma_{8,0} = -\infty$, $\gamma_{8,1} = 0$, $\gamma_{8,2} = 4$ and $\gamma_{8,3} = \infty$
  Y8 = Education level (0–incomplete primary, 1–incomplete secondary, 2–complete secondary or more)

Nominal:
  Y9 = Town size (1–[100000, ∞), 2–[15000, 100000), 3–[2500, 15000), 4–(0, 2500) inhabitants)



Real data : Prior specifications

Considered three cases:
i) Ignoring the sampling design, $\pi_i = 1$ and $\kappa = 1$
ii) Acknowledging the sample design, $\pi_i = 1/w_i$ and $\kappa = 2\bar{w}$
iii) Acknowledging the sample design, $\pi_i = 1/w_i$ and $\kappa = 4\bar{w}$

Prior (C) for the variances and the same specifications for all other priors

Gibbs sampler was run for 3,200 iterations with a burn-in of 200 and a thinning of 3



Real data : (i)

r = 163, HM = 1246



Real data : (ii)

r = 35, HM = 2240



Real data : (iii)

r = 9, HM = 3000



Real data : Group means (iii)

group  income  feed  health  house  edu   serv  ss    hedu  ts:1  ts:2  ts:3  ts:4  size
1        5934  0.22  0.35    0.11   0.32  0.11  0.80  1.91  0.59  0.13  0.15  0.13  36.0%
2       11374  0.16  0.34    0.07   0.36  0.07  0.71  1.93  0.64  0.12  0.14  0.10  30.5%
3       22682  0.06  0.34    0.01   0.23  0.04  0.67  1.99  0.74  0.09  0.10  0.06  12.4%
4        3091  0.26  0.42    0.06   0.43  0.24  0.89  1.78  0.54  0.11  0.16  0.20  9.0%
5        1783  0.37  0.23    0.09   0.95  0.41  0.60  0.21  0.41  0.05  0.18  0.36  4.0%
6        5006  0.25  0.32    0.13   0.94  0.15  0.64  0.53  0.46  0.11  0.26  0.16  4.2%
7       44991  0.04  0.09    0.00   0.14  0.02  0.30  2.00  0.63  0.29  0.06  0.02  2.7%
8         570  0.69  0.24    0.46   0.68  0.29  1.00  1.46  0.16  0.06  0.48  0.30  0.7%
9      219578  0.16  0.27    0.00   0.00  0.00  0.27  2.00  0.89  0.11  0.00  0.00  0.5%

pop     11212  0.19  0.34    0.08   0.38  0.11  0.74  1.79  0.61  0.12  0.15  0.13  4240837

Columns: hedu is the mean education level (Y8); ts:1 to ts:4 are the proportions of households in each town size category (Y9); size is the share of households in each group, with the total number of households in the pop row.



References

Barnard, J., McCulloch, R. and Meng, X.-L. (2000). Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica 10, 1281–1311.

Chambers, R. L. and Skinner, C. J. (2003). Analysis of Survey Data. Wiley, Chichester.

Carmona, C., Nieto-Barajas, L. E. and Canale, A. (2017). Model-based approach for household clustering with mixed scale variables. Submitted to Advances in Data Analysis and Classification.

Coneval (2009). Metodología para la medición multidimensional de la pobreza en México. Consejo Nacional de Evaluación de la Política de Desarrollo Social, México. (In Spanish). Available at: http://www.coneval.org.mx/rw/resource/Metodologia_Medicion_Multidimensional.pdf

Dahl, D. B. (2006). Model-based clustering for expression data via a Dirichlet process mixture model. In Bayesian Inference for Gene Expression and Proteomics, Eds. M. Vanucci, K.-A. Do and P. Müller. Cambridge University Press, Cambridge.

Lumley, T. (2010). Complex Surveys. Wiley, New Jersey.

Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability 25, 855–900.

