Cross-sectional Markov model for trend analysis of population characteristics


  • 7/26/2019 Cross-sectional Markov model for trend analysis of population characteristics

    1/21

    Cross-sectional Markov model for trend analysis of population characteristics

    Agnieszka Werpachowska and Roman Werpachowski

    London, United Kingdom

    Abstract

    We present a stochastic model of population dynamics exploiting cross-sectional data in trend analysis and forecasts for groups and cohorts of a population. While sharing the convenient features of classic Markov models, it alleviates the practical problems experienced in longitudinal studies. Based on statistical and information-theoretical analysis, we adopt a maximum likelihood estimation procedure to determine model parameters, facilitating the use of a range of model selection methods. Their application to several synthetic and empirical datasets shows that the proposed framework is robust, stable and superior to a regression-based approach. We develop extensions for simulations of the memory of the process, distinguishing its short and long-term trends, as well as helping to avoid the ecological fallacy. The presented model illustrations yield new and interesting results, such as an even rate of weight gain across the generations of the English population, suggesting a common driving factor, and yo-yo dieting in the U.S. data.

    Keywords: cross-sectional data, incomplete longitudinal data, aggregate data, Markov model, forecasting, BMI, marijuana

    1 Introduction

    The abundance of statistical surveys and censuses from past years invites new enhanced methods for studying various aspects of the composition and dynamics of populations. Gathered in different forms, as (repeated) cross-sectional or longitudinal data, they provide information on large, independent or overlapping, sets of subjects observed at several points in time. The former presents a snapshot of the population for quantitative and comparative analysis, while the latter tracks selected individuals, facilitating cohort and causal inferences. Cross-sectional data is often regarded as inferior to longitudinal data, as it does not capture the mechanisms underpinning observed effects. At the same time, however, it is oblivious to such problems as attrition, conditioning or response bias, while its much cheaper and faster collection procedure does not raise concerns about confidentiality and data protection legislation. For these reasons, it is tempting to search for ways of employing it in longitudinal analysis.

    Making inferences about the population dynamics on the basis of severed longitudinal information gleaned from cross-sectional data requires a suitable theoretical approach and modelling tools. Several proposed methods, e.g. [1–14], are essentially based on regression techniques. In this paper we present a cross-sectional Markov (CSM) model for the transition analysis of survey data exploiting information from time series of cross-sectional samples. While sharing the attractive features of classic Markov models, it avoids the practical problems associated with longitudinal data, and, due to its focus on population transfer rates between discrete states, it is particularly

    [email protected]


    well adapted to microsimulation modelling. The presented framework provides a set of versatile and robust tools for the analysis and forecasting of data trends, with applications in various disciplines, including epidemiology, economics, marketing and political sciences.

    The following sections introduce the CSM model, presenting its detailed mathematical description and placing it in the context of statistics and information theory. We adapt it to analyse ageing cohorts and processes with memory, and demonstrate a regularisation method helping to avoid fallacious inferences. The framework is extended to describe incomplete longitudinal data, giving the possibility of fully exploiting all available information, e.g. from aggregated surveys of different types. We also outline the model selection procedure required in its empirical applications as an integral part of any statistical data modelling. Finally, we test the developed methods and illustrate them by examples using actual demographic statistics, obtaining new and interesting results.

    2 Theoretical description

    We begin by expounding the mathematical background and derivation of the CSM model in

    Sec. 2.1. Sections 2.2 and 2.3 discuss it within the context of popular methods of statistical inference: Bayesian analysis and maximum likelihood estimation. In Sec. 2.4 we give the information-theoretical interpretation of the maximum likelihood estimation used to determine the CSM model parameters. Sections 2.5 and 2.6, respectively, present the time-inhomogeneous extension for simulations of ageing cohorts and the process memory. Finally, in Sec. 2.7 we describe the model selection procedure to be used in applications of the CSM framework.

    2.1 Model assumptions and parameters

    We analyse a time series of observations of a certain characteristic (such as a risk factor, an exposure or a disease) in a studied population. The frequency and pattern of the characteristic in a surveyed representative sample is described by the distribution of a categorical variable X. The variable can take N distinct values k = 0, 1, …, N − 1 corresponding to different categories of

    the characteristic (e.g. ranges of risk factor values or stages of a disease). The observations {n_t} are made at constant time intervals t = 0, …, T − 1. At each time point t, the sample of size n_t consists of groups of n_{kt} individuals assigned to category k, n_t = (n_{0t}, …, n_{N−1,t}); if data are missing for any t, then n_t = 0. Thus, the empirical distribution of X_t is given by p̂_{kt} := n_{kt}/n_t. Our goal is to estimate the true probability p_{kt} := P(X_t = k) and its confidence intervals, as well as to extrapolate the obtained result beyond the surveyed period.

    When the survey follows the same individuals (a cohort) over time, as in longitudinal studies, a common approach is to use a (discrete-time) Markov model. It describes directly the stochastic dynamics of X_{it}, which is the value of variable X for an observed individual i at time t. The probability of X_{it} = k conditioned on the value of X_i at time t − 1 is given by a constant N × N transition matrix Π with elements π_{kl} satisfying the constraints π_{kl} ∈ [0, 1] and ∀_l Σ_{k=0}^{N−1} π_{kl} = 1:

        P(X_{it} = k | X_{i,t−1} = l) = π_{kl} . (1)

    From the above it immediately follows that

        p_{kt} = Σ_{l=0}^{N−1} π_{kl} p_{l,t−1} , (2)

    i.e., p_t = Π p_{t−1} in vector notation, where p_{k,t} ∈ [0, 1] and Σ_{k=0}^{N−1} p_{k,t} = 1.

    While Eq. (1) involves tracking the same individual over time, Eq. (2) is conspicuously free from this assumption. It transforms the distribution of variable X in a cross-sectional sample collected at time t − 1 into the distribution of this variable in another such sample collected at time t, using the transition matrix Π. By rearranging its terms, we can trace changes in the frequency of


    property k between consecutive observations:

        n_{kt}/n_t − n_{k,t−1}/n_{t−1} = Σ_{l=0}^{N−1} (π_{kl} − δ_{kl}) n_{l,t−1}/n_{t−1} ,

    where n_t and n_{kt} are the sizes of the population and of its categories, and δ_{kl} is the Kronecker delta. Therefore, Eq. (2) can be the basis of a more robust dynamical model, which we propose in this article. Due to its mathematical construction, we call it the cross-sectional Markov model. It facilitates the description of temporal trends of investigated characteristics in ageing cohorts and (groups of) the population based on repeated cross-sectional data. The CSM model parameters, the transition matrix Π and the initial distribution p_0, are determined by fitting p̃_t = Π^t p_0 (where Π^t is the t-th power of the matrix Π) to the observed distributions p̂_t. In its extended version, the model utilises information collected in any form, including aggregates of cross-sectional and (incomplete) longitudinal data. This is achieved by maximising the log-likelihood of the data over Π and p_0. Detailed mathematical reasoning and procedures are described in the following sections.
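    The fitting target p̃_t = Π^t p_0 is cheap to evaluate; the sketch below (pure Python, hypothetical names, not the authors' C++ implementation) applies Eq. (2) repeatedly to propagate an initial distribution through a column-stochastic transition matrix.

```python
def step(pi, p):
    """One application of Eq. (2): p_t[k] = sum_l pi[k][l] * p[l]."""
    n = len(p)
    return [sum(pi[k][l] * p[l] for l in range(n)) for k in range(n)]

def propagate(pi, p0, t):
    """Return the model fit Pi^t p0 by applying the transition matrix t times."""
    p = list(p0)
    for _ in range(t):
        p = step(pi, p)
    return p

# pi[k][l] = P(X_t = k | X_{t-1} = l); each column sums to 1.
pi = [[0.9, 0.2],
      [0.1, 0.8]]
p0 = [1.0, 0.0]
p1 = propagate(pi, p0, 1)  # -> [0.9, 0.1]
```

    Because each column of Π sums to one, the propagated vector remains a probability distribution at every step.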

    2.2 Bayesian analysis

    Estimation of standard Markov model parameters amounts to measuring the initial distribution of an investigated variable and counting the frequencies of transitions between its states at consecutive time steps. This straightforward procedure owes to the fact that continuous longitudinal trajectories provide full information about the dynamics of the observed process. In contrast, repeated cross-sectional data do not capture the individual transitions. To understand the implications of this difference for the CSM model, we perform a Bayesian analysis of the model parameters. We will demonstrate that the missing longitudinal information in repeated cross-sectional data results in a more complicated form of the posterior distribution of p_0 and Π. This creates additional challenges in its estimation and in calculations of confidence intervals.

    Within the Bayesian paradigm, the joint probability distribution of p_0 and Π is inferred from observed data I. Before making any observations, we assume that all their values are equally probable, i.e., their prior probability density function is π(p_0, Π) = 1, which carries maximum entropy and therefore least extraneous information. Next, by conditioning it on the observation results we obtain the posterior distribution with the density π(p_0, Π | I), best representing our state of knowledge about the distribution of model parameters given our prior knowledge and the data.

    In a longitudinal study, we track changes of the investigated characteristic in a group of the same individuals over a period of time. From Bayes' theorem, we update the posterior with the value of variable X_i for each individual i added to the sample:

        π(p_0, Π | I, X_{i,0} = k) ∝ p_{k,0} π(p_0, Π | I) (3)

    and at each time step t of his or her longitudinal trajectory:

        π(p_0, Π | I, X_{i,t} = k ∧ X_{i,t−1} = l) ∝ π_{kl} π(p_0, Π | I) , (4)

    where I is the previously observed data. From the above updating rules it follows that the final posterior joint density of the standard Markov model parameters is a product of Dirichlet distribution densities:

        π(p_0, Π | I) = D_{α_0}(p_0) ∏_{k=0}^{N−1} D_{α_{k+1}}(π_k) , (5)

    where D_α(x) = B(α)^{−1} ∏_{k=0}^{N−1} x_k^{α_k−1} is an N-dimensional Dirichlet density with a parameter vector α (i.e., n_{k,0} = α_{0,k} − 1 is the number of individuals in category k at time 0 and α_{(k+1)l} − 1 is the overall number of transitions from category k to l), B(α) = ∏_{k=0}^{N−1} Γ(α_k)/Γ(Σ_{k=0}^{N−1} α_k), and π_k is the k-th column vector of Π. Such Bayesian estimation of Π for longitudinal data can be easily implemented numerically.


    In the case of cross-sectional data, the same non-informative prior is updated by observations of the form X_t = k, leading to the posterior density

        π(p_0, Π | I, X_t = k) ∝ p_{kt} π(p_0, Π | I) = (Π^t p_0)_k π(p_0, Π | I) .

    For example, for t = 1 we obtain π(p_0, Π | I, X_1 = k) ∝ Σ_{l=0}^{N−1} π_{kl} p_{l,0} π(p_0, Π | I). Thus, the posterior density of model parameters conditioned on the observed cross-sectional data is not a product of Dirichlet densities, as in the longitudinal case, but a mixture of such products:

        π(p_0, Π | {n_t}) = Σ_{q=0}^{Q−1} ω_q π_q(p_0, Π | {n_t}) = Σ_{q=0}^{Q−1} ω_q D_{α_{q0}}(p_0) ∏_{k=0}^{N−1} D_{α_{q,k+1}}(π_k) , (6)

    where π_q is the posterior distribution corresponding to each possible set of longitudinal trajectories q realising the observed cross-sectional data (only categories with non-zero counts, i.e., those with 1_+(n_{kt}) = 1, where 1_+(a) equals 1 for a > 0 and 0 otherwise, can appear in such trajectories), Q is the number of such sets, and Σ_{q=0}^{Q−1} ω_q = 1, ω_q ≥ 0. Consequently, the number of components of π for cross-sectional data grows exponentially with the sample size and the number of time steps, making its direct calculation impossible in practice.

    The posterior densities (5) and (6) are also characterised by different covariance

    structures. To demonstrate it, let x_j for j = 0, …, N be defined as

        x_j = p_0      for j = 0 ,
        x_j = π_{j−1}  for 1 ≤ j ≤ N ,

    i.e., x_0 is the initial distribution and x_j for j ≥ 1 is the (j−1)-th column vector of Π.

    For longitudinal data, vectors x_j and x_{j′} are statistically independent for j ≠ j′, due to the product structure of the posterior (5). However, for the cross-sectional posterior (6) we obtain E[(x_j)_k (x_{j′})_{k′}] = Σ_q ω_q E_q[(x_j)_k] E_q[(x_{j′})_{k′}] for j ≠ j′, where E_q denotes the expectation value calculated using π_q(p_0, Π | {n_t}), and E[(x_j)_k] = Σ_q ω_q E_q[(x_j)_k]. Hence, E[(x_j)_k (x_{j′})_{k′}] ≠ E[(x_j)_k] E[(x_{j′})_{k′}] for j ≠ j′, which indicates that x_j and x_{j′} are not statistically independent. This makes the analytic calculation of confidence intervals for the estimated probabilities more difficult, unless one resorts to the delta method approach (postulating that the likelihood function is a multivariate Gaussian and expanding ln(p_{kt}/(1 − p_{kt})) up to linear terms in p_0 and Π), which can significantly underestimate their width. In general, due to its complex structure, the posterior density for cross-sectional data is broader than for longitudinal data, leading to wider confidence intervals.

    For the above reasons (the impracticable calculation of the posterior distribution and the correlation of model parameters for cross-sectional data) we will use the maximum-likelihood method to obtain Π* and p*_0, the point estimates of Π and p_0 (the next section), and bootstrapping for their confidence intervals (Appendix A).

    2.3 Maximum likelihood estimation

    Given the difficulties in calculating the distribution of the CSM model parameters in the Bayesian framework, we attempt instead to find their best values Π* and p*_0 by maximising the log-likelihood l[p_0, Π] := ln P({n_t} | p_0, Π). For cross-sectional data we obtain

        l_CS[p_0, Π] = ln ∏_{t=0}^{T−1} ∏_{i=0}^{n_t−1} P(X_t = k_{it} | p_0, Π) = Σ_{t=0}^{T−1} n_t Σ_{k=0}^{N−1} p̂_{kt} ln p_{kt} . (7)
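    A direct transcription of Eq. (7) might look as follows (a sketch with hypothetical names, assuming π_{kl} is stored as pi[k][l] and the counts n_{kt} as counts[t][k], so that n_t p̂_{kt} = n_{kt}):

```python
import math

def csm_loglik(pi, p0, counts):
    """Cross-sectional log-likelihood of Eq. (7):
    l_CS = sum_t n_t sum_k phat_{kt} ln p_{kt}, with p_t = Pi^t p0."""
    n_cat = len(p0)
    p = list(p0)
    ll = 0.0
    for t, n_t in enumerate(counts):  # counts[t][k] = n_{kt}
        if t > 0:  # advance the model distribution by one step
            p = [sum(pi[k][l] * p[l] for l in range(n_cat)) for k in range(n_cat)]
        for k in range(n_cat):
            if n_t[k] > 0:
                ll += n_t[k] * math.log(p[k])
    return ll

pi = [[0.9, 0.2], [0.1, 0.8]]
ll = csm_loglik(pi, [0.5, 0.5], [[5, 5], [6, 4]])
```

    In an actual calibration this function would be handed to a constrained optimiser, maximised over p_0 and the entries of Π.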

    From Bayes' theorem, π(p_0, Π | {n_t}) = P({n_t} | p_0, Π) π(p_0, Π)/P({n_t}), where P({n_t}) = ∫ P({n_t} | p_0, Π) π(p_0, Π) d^N p_0 d^{N²}Π. Assuming a flat prior as in Sec. 2.2, we obtain π(p_0, Π | {n_t}) ∝ P({n_t} | p_0, Π). Thus, the maximum likelihood estimator of the model parameters is equivalent to their maximum posterior density estimator, bringing the maximum likelihood estimation procedure closer to the Bayesian analysis discussed in the previous section.

    In specific situations the CSM model may not be able to uniquely recover the transition matrix Π because of the lack of information about individual transitions in repeated cross-sectional data,


    facing the threat of the ecological inference fallacy [15]. For example, for N = 2 and a constant time series p̂_t ≡ (1/2, 1/2), both π_{kl} = δ_{kl} and π_{kl} = 1/2 (maximum and minimum correlation between X_t and X_{t−1}, respectively) are perfect solutions of Eq. (2), and so is any convex combination of the two.

    To steer the model estimation procedure towards a particular outcome, one can subtract from l_CS a regularisation term λ_1 Σ_{k,l=0}^{N−1} (π_{kl} − λ_2 δ_{kl})², where λ_1 ≥ 0 is the regularisation strength and λ_2 ∈ [0, 1] specifies the type of required solution (from π_{kl} = δ_{kl} for λ_2 = 1 to π_{kl} = 1/2 for λ_2 = 0). Within the Bayesian framework, adding a penalty term of the above form amounts to replacing the uniform prior with π(p_0, Π) ∝ ∏_{k,l=0}^{N−1} φ(π_{kl}; λ_2 δ_{kl}, (2λ_1)^{−1}), where φ(·; μ, σ²) is the density of the normal distribution N(μ, σ²). Consequently, we solve a maximum posterior rather than a maximum likelihood problem, and the two are not equivalent in this situation. While realistic time-dependent data usually constrain the solution space for Π sufficiently, some form of regularisation may be required at times, e.g. for purely cross-sectional data exhibiting simple temporal trends.
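    The penalty term alone is easy to write down; the sketch below (hypothetical names) evaluates λ_1 Σ_{k,l} (π_{kl} − λ_2 δ_{kl})² and checks it against the two limiting solutions mentioned above.

```python
def reg_penalty(pi, lam1, lam2):
    """Regularisation term lam1 * sum_{k,l} (pi_kl - lam2 * delta_kl)^2,
    subtracted from l_CS before maximisation."""
    n = len(pi)
    return lam1 * sum((pi[k][l] - (lam2 if k == l else 0.0)) ** 2
                      for k in range(n) for l in range(n))

identity = [[1.0, 0.0], [0.0, 1.0]]   # maximum-correlation solution
uniform = [[0.5, 0.5], [0.5, 0.5]]    # minimum-correlation solution
```

    With λ_2 = 1 the identity matrix incurs no penalty, while λ_2 = 0 favours the uniform matrix, reproducing the two regimes described in the text.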

    The CSM framework can be extended to allow the analysis of incomplete longitudinal data (distorted by attrition or non-adherence). They can be represented as a set Θ = {t_i, k_i}, consisting of Q independent trajectories, i.e., vectors k_i of τ_i consecutive categories measured at τ_i points in time t_i = {t_{i,s}}, where i = 0, …, Q − 1 and s = 0, …, τ_i − 1. This notation enables us to describe trajectories starting and ending at different times and having gaps. The likelihood of observing a trajectory k_i ∈ [0, N − 1]^{τ_i} given p_0 and Π is

        P(k_i | p_0, Π) = (Π^{t_{i,0}} p_0)_{k_{i,0}} ∏_{s=1}^{τ_i−1} (Π^{t_{i,s}−t_{i,s−1}})_{k_{i,s},k_{i,s−1}} . (8)

    Hence, the log-likelihood of the whole dataset is

        l_L[p_0, Π] = Σ_{i=0}^{Q−1} { ln[(Π^{t_{i,0}} p_0)_{k_{i,0}}] + Σ_{s=1}^{τ_i−1} ln[(Π^{t_{i,s}−t_{i,s−1}})_{k_{i,s},k_{i,s−1}}] } . (9)

    It is easy to notice that for complete longitudinal trajectories, i.e., t_i = (0, 1, …, τ_i − 1) for all i, the above result reduces to the simple product of Dirichlet distributions discussed in Sec. 2.2.

    To model an aggregate of cross-sectional and longitudinal data, we maximise the sum of the log-likelihood functions (7) and (9), l_CS + l_L, over Π and p_0. In doing this, it is useful to know that although the cross-sectional part can be described by Eq. (9) for τ_i = 1, it is more efficient to treat it as a separate term using Eq. (7). Numerically, the task requires solving a highly non-linear optimisation problem with the following constraints: ∀_l Σ_{k=0}^{N−1} π_{kl} = 1, Σ_{k=0}^{N−1} p_{k,0} = 1 and π_{kl}, p_{k,0} ∈ [0, 1].
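    The trajectory likelihood (8) can be sketched as follows (hypothetical names; matrix powers computed naively, which suffices for small N and short gaps):

```python
import math

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][j] * b[j][k] for j in range(n)) for k in range(n)]
            for i in range(n)]

def mat_pow(a, t):
    n = len(a)
    r = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(t):
        r = mat_mul(r, a)
    return r

def trajectory_loglik(pi, p0, times, states):
    """Log of Eq. (8) for one (possibly gapped) trajectory:
    ln[(Pi^{t_0} p0)_{k_0}] + sum_s ln[(Pi^{t_s - t_{s-1}})_{k_s, k_{s-1}}]."""
    n = len(p0)
    first = mat_pow(pi, times[0])
    p = [sum(first[k][l] * p0[l] for l in range(n)) for k in range(n)]
    ll = math.log(p[states[0]])
    for s in range(1, len(times)):
        gap = mat_pow(pi, times[s] - times[s - 1])
        ll += math.log(gap[states[s]][states[s - 1]])
    return ll

pi = [[0.9, 0.2], [0.1, 0.8]]
```

    Summing trajectory_loglik over all trajectories in Θ gives l_L of Eq. (9); a gap of length g simply contributes an element of Π^g instead of Π.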

    2.4 Connection to information theory

    To better understand the relation between the CSM model and other popular approaches to modelling repeated cross-sectional data (such as regression methods), it is helpful to couch it in terms of information theory. Within this framework, we represent our procedure of finding the model parameters as a fitting of the estimated probability distributions p̃_t := Π^t p_0 to the observed ones p̂_t. The norm of the estimation error minimised by the calibration procedure is the Kullback–Leibler divergence [16, 17],

        D_KL(p̂_t ∥ p̃_t) := Σ_{k=0}^{N−1} ln(p̂_{kt}/p̃_{kt}) p̂_{kt} = Σ_{k=0}^{N−1} p̂_{kt} ln p̂_{kt} − Σ_{k=0}^{N−1} p̂_{kt} ln p̃_{kt} .

    This important information-theoretical measure corresponds to the amount of information lost when replacing the probabilities indicated directly by the data with our model estimates. Hence, by minimising D_KL(p̂_t ∥ p̃_t) over Π and p_0 we ensure that our model exploits the maximum information from the data, without making any distributional assumptions about the error [18, 19].
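    A minimal implementation of the divergence (with the usual 0 · ln 0 = 0 convention; names hypothetical) might read:

```python
import math

def kl_divergence(p_obs, p_model):
    """D_KL(p_obs || p_model) = sum_k p_obs[k] * ln(p_obs[k] / p_model[k]);
    terms with p_obs[k] = 0 contribute zero by convention."""
    return sum(o * math.log(o / m) for o, m in zip(p_obs, p_model) if o > 0)

d = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

    The divergence vanishes exactly when the model reproduces the observed distribution, and grows as the fit deteriorates.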


    It is easy to show that maximising the log-likelihood for cross-sectional data (7) over p_0 and Π is equivalent to minimising the weighted sum of Kullback–Leibler divergences D_KL(p̂_t ∥ p̃_t),

        min_{p_0,Π} Σ_{t=0}^{T−1} n_t Σ_{k=0}^{N−1} ln(p̂_{kt}/p̃_{kt}) p̂_{kt} . (10)

    Similarly, in the simple case of complete longitudinal trajectories with equal lengths τ_i = T, maximising their log-likelihood is equivalent to minimising the Kullback–Leibler divergence of two probability distributions over the space of all possible trajectories of length T. To demonstrate this, let n_k be the number of observations of trajectory k in the dataset Θ. Then, p̂_k = n_k/Q is the distribution of different trajectories, which we want to approximate, an equivalent of p̂_{kt} in Eq. (10). The Kullback–Leibler divergence to be minimised over p_0 and Π is

        D_KL(p̂_k ∥ p_k) = −Σ_k p̂_k (ln P(k | p_0, Π) − ln p̂_k) ,

    where P(k | p_0, Π) is given by Eq. (8). The first term in this formula is equal to minus the log-likelihood of Θ given by Eq. (9) (up to the factor 1/Q), which confirms our assertion. It is worth noting that if we observe just a small subset of the N^T possible trajectories (e.g. very few individuals with long trajectories), minimising D_KL(p̂_k ∥ p_k) will require concentrating the mass of the probability distribution P(k | p_0, Π) in a small subregion of the available configuration space, making the optimisation of model parameters more difficult numerically than in the case of cross-sectional data.

    2.5 Modelling ageing cohorts based on cross-sectional data

    The CSM model can be adjusted to analyse trends of characteristics in ageing cohorts based on repeated cross-sectional data or their aggregates with longitudinal data. Figure 1 presents a schematic solution of this problem. It requires the estimation of CSM transition matrices Π and initial distributions p_0 for each age group and in all time periods covered by the available data.

    The age brackets are numbered from 1 to n and assumed, for simplicity, to have the same length l. Beginning with the initial state for the first bracket, we apply the transition matrix Π_1 obtained for this bracket l times, thus increasing the age of the cohort. This gives the first l years of the CSM model fit p̃. Next, we switch to the second age bracket and apply the transition matrix Π_2 to the current distribution p̃_l; we perform the operation l times, obtaining the fit for the l subsequent years. The procedure repeats until the last age bracket, at which point we keep on applying the last transition matrix Π_n until the end of the desired extrapolation period.

    Figure 1: Schematic of the CSM model analysis for an ageing cohort.
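    The bracket-switching procedure above can be sketched as follows (hypothetical names; matrices holds Π_1, …, Π_n in order, and the last matrix is reused beyond the last bracket for extrapolation):

```python
def simulate_cohort(matrices, p0, bracket_len, horizon):
    """Propagate an ageing cohort: apply the transition matrix of the current
    age bracket bracket_len times, switch to the next bracket's matrix, and
    keep applying the last matrix until the extrapolation horizon."""
    n = len(p0)
    path = [list(p0)]
    p = list(p0)
    for t in range(horizon):
        pi = matrices[min(t // bracket_len, len(matrices) - 1)]
        p = [sum(pi[k][l] * p[l] for l in range(n)) for k in range(n)]
        path.append(p)
    return path

identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
path = simulate_cohort([identity, swap], [1.0, 0.0], 2, 5)
```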

    2.6 Non-zero memory

    In discrete-time Markov models, the memory of the process is introduced by conditioning the next state of the variable X not just on the current one, but also on one or more preceding states. For example, to model a one-step memory we can replace Eq. (1) with

        P(X_{it} = k | X_{i,t−1} = l ∧ X_{i,t−2} = m) = π_{klm} , (11)

    with constraints ∀_{l,m} Σ_{k=0}^{N−1} π_{klm} = 1 and π_{klm} ∈ [0, 1].


    A similar procedure can be performed in the CSM framework by estimating the joint distribution of X_t and X_{t−1} based on repeated cross-sectional data. For this purpose, we define a random variable Z_t := (X_t, X_{t−1}), noting that Z_t and Z_{t−1} are always correlated since both depend on X_{t−1}. The distribution of Z_t is denoted by q_t; its dynamics are governed by a transition matrix Ξ, the counterpart of Π. Hence, Ξ has the form Ξ_{(k,l),(l′,m)} = δ_{ll′} π_{klm}, where π_{klm} satisfies the same constraints as in Eq. (11). It follows that q_{(k,l),t} = Σ_{m=0}^{N−1} π_{klm} q_{(l,m),t−1}. Since we do not observe the process Z_t directly, we have to estimate the initial state q_0 and the transition matrix Ξ based on p̂_t. This requires reducing the dimension of the distributions q_t = Ξ^t q_0 from N² to N, so that p̃_t = R[q_t], where the reduction operator is defined as

        R[q_t]_k := Σ_{l=0}^{N−1} q_{(k,l),t} .

    The estimates Ξ* and q*_0 are found by minimising the total weighted Kullback–Leibler distance,

        min_{Ξ,q_0} Σ_{t=0}^{T−1} n_t D_KL(p̂_t ∥ R[Ξ^t q_0]) .

    The above procedure easily extends to models with longer memory and has a straightforwardnumerical implementation.
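    The construction of Ξ from π_{klm} and the reduction operator R can be sketched as follows (hypothetical names; the pair (k, l) is flattened to the index k·N + l):

```python
def build_xi(pi3, n):
    """Xi_{(k,l),(l',m)} = delta_{l,l'} * pi3[k][l][m], with the pair (k, l)
    flattened to index k * n + l, as in Sec. 2.6."""
    xi = [[0.0] * (n * n) for _ in range(n * n)]
    for k in range(n):
        for l in range(n):
            for m in range(n):
                xi[k * n + l][l * n + m] = pi3[k][l][m]
    return xi

def reduce_q(q, n):
    """Reduction operator R: p_k = sum_l q_{(k,l)}."""
    return [sum(q[k * n + l] for l in range(n)) for k in range(n)]

# pi3[k][l][m] = P(X_{t+1} = k | X_t = l, X_{t-1} = m); here independent of l, m.
pi3 = [[[0.7, 0.7], [0.7, 0.7]], [[0.3, 0.3], [0.3, 0.3]]]
xi = build_xi(pi3, 2)
q0 = [0.25, 0.25, 0.25, 0.25]
q1 = [sum(xi[i][j] * q0[j] for j in range(4)) for i in range(4)]
p1 = reduce_q(q1, 2)  # -> approximately [0.7, 0.3]
```

    One step of Ξ followed by R reproduces the marginal distribution of X_t, which is exactly what the calibration in the minimisation above compares against the observed p̂_t.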

    In the general case of aggregated or incomplete longitudinal data, introducing the memory length μ leads to a more complicated form of the likelihood function (8). Let G = [0, N − 1]^{μ+1} be the set of values of Z_t = (X_t, …, X_{t−μ}), containing all possible contiguous sequences of states of the variable X spanning the memory length. The transition matrix Ξ is indexed by such a sequence γ ∈ G (representing a trajectory over the time interval [t − μ, t]) and a state k ∈ [0, N − 1] at t + 1, that is, Ξ_{kγ}. We define Γ(k_i)_t ⊆ G as the set of only those sequences γ for which the event Z_t = γ is not contradicted by the observed trajectory k_i. For example, suppose we model a variable with N = 3 states and memory μ = 2 based on three consecutive longitudinal measurements, one of which is missing: (X_0 = 1, X_1 = unknown, X_2 = 2); then Γ(k_i)_2 = {(2, 0, 1), (2, 1, 1), (2, 2, 1)}. The final form of the likelihood function for k_i starting at t_{i,0} and ending at t_{i,τ_i−1} is

        P(k_i | q_0, Ξ) = Σ_{γ_{τ_i−1} ∈ Γ(k_i)_{t_{i,τ_i−1}}} ⋯ Σ_{γ_0 ∈ Γ(k_i)_{t_{i,0}}} Ξ_{γ_{τ_i−1},γ_{τ_i−2}} ⋯ Ξ_{γ_1,γ_0} q_{0,γ_0} ,

    where Ξ_{γ,γ′} := Ξ_{γ_0 γ′} δ_{(γ_1,…,γ_μ),(γ′_0,…,γ′_{μ−1})}, i.e., it is non-zero only for sequences γ and γ′ which overlap consistently. A fully-specified trajectory (without gaps) will always have only a single element in every set Γ(k_i)_t for t_{i,0} + μ ≤ t ≤ t_{i,τ_i−1}.

    The error estimation by bootstrapping is performed in the same way as in the case of zero memory.

    2.7 Selecting the best model

    An integral part of all statistical work with data is choosing a suitable model for their analysis. This section invokes the most popular model selection techniques and incorporates them into the proposed framework. They will enable us to quantitatively compare the performance of the CSM model variants and other methods. On this basis, we will decide which model best explains the mechanisms underlying the data, providing the most accurate and stable predictions.

    Assuming that the calibration procedure converges numerically, one can always improve the fit by increasing the model complexity (e.g. extending the CSM memory length). However, indiscriminately adding new model parameters leads to overfitting. To strike a balance between these two factors, we calculate the Akaike Information Criterion (AIC) [20],

        AIC = 2k − 2l_max , (12)

    and the Bayesian Information Criterion (BIC) [21],

        BIC = k(ln n − ln 2π) − 2l_max , (13)


    where k is the number of model degrees of freedom and n is the sample size (i.e., the number of collected surveys for cross-sectional data and the number of observed trajectories for longitudinal data).1 Both criteria are constructed as a penalty for the complexity of a candidate model minus twice the maximised log-likelihood value l_max, but they are derived from different mathematical perspectives. AIC intends to minimise the Kullback–Leibler distance between the true data-generating probability density and that predicted by the candidate model, while BIC seeks the model with the maximum posterior probability given the data. Consequently, they do not always select the same best candidate. In particular, BIC assures consistency (for a very large sample size it will choose the correct model with probability approaching 1), while AIC aims at optimality (being more benevolent to complexity, it leads to a lower variance of the estimated model parameters, especially if their true values are close to those from oversimplified models) [22, 23]. The difference between the criteria becomes evident with growing sample size: AIC allows additional model parameters to describe the new data (increasing the predictive accuracy if they represent new information, or overfitting if they carry mostly noise and outliers), while BIC is more stringent and favours smaller models (with fewer parameters).
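    Both criteria are one-liners; the sketch below (hypothetical names) implements Eq. (12) and the (ln n − ln 2π) variant of Eq. (13), with the more common textbook form k ln n − 2 l_max noted in the comment.

```python
import math

def aic(k, lmax):
    """Akaike Information Criterion, Eq. (12)."""
    return 2 * k - 2 * lmax

def bic(k, n, lmax):
    """Bayesian Information Criterion in the (ln n - ln 2*pi) form of Eq. (13);
    the textbook variant is k * ln(n) - 2 * lmax."""
    return k * (math.log(n) - math.log(2 * math.pi)) - 2 * lmax
```

    Lower scores are better; only differences between candidate models matter, which is why the results in Sec. 3 are reported relative to the best model.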

    While the information criteria focus on accuracy and parsimony, cross-validation analysis is a natural and practical way of assessing the predictive performance and robustness of the model [24]. As a realisation of out-of-sample testing, it does not rely on analytical approximations but on exact algorithms, and it provides a stronger check against overfitting. We will use its most common variants: leave-one-out (LOOCV) and k-fold cross-validation (kFCV), which under certain conditions behave similarly to AIC and BIC, respectively [22], as well as time series cross-validation (TSCV).

    The application of the above techniques is straightforward when working with cross-sectional data. In the case of LOOCV, given a time sequence of T observations {p̂_t}, for each t ∈ [1, T) we calibrate the CSM model to all data points but p̂_t and next use the obtained model to calculate the approximate value of the omitted point, p̃_t^(−t). The Kullback–Leibler divergence D_t^LOOCV := D_KL(p̂_t ∥ p̃_t^(−t)) measures the error with which the model recovers p̂_t. The sum of D_t^LOOCV over t, D^LOOCV, defines a measure of the model error which additionally discourages overfitting.

    The kFCV consists in randomly partitioning, one or several times, the sequence of observations

    into k subsets of equal size and performing the leave-one-out analysis on the subsets, adding theerrors up for all folds and averaging the sums over the partitionings. Since in each fold we leaveout more data points, kFCV is a stronger test for overfitting than LOOCV.

    Both LOOCV and kFCV are sufficient for testing the model's approximation and forecasting abilities; however, they may fail at the latter in the setting of highly autocorrelated data [25]. To assess the CSM model performance in forecasting highly correlated time series we need a specialised method such as TSCV [26]. In this approach, we fit the model to the first T′ time points t = 0, …, T′ − 1, where T′ ≥ 2, and use it to predict the probability distribution for T′, p̃_{T′}^TS. The total TSCV error is the sum of the Kullback–Leibler divergences D_KL(p̂_{T′} ∥ p̃_{T′}^TS) over T′ ∈ [2, T).

    When working with data containing longitudinal information, we consider each individual trajectory (complete or with gaps) to be an independent observation. Hence, we perform kFCV by dividing the set Θ of observed trajectories (see Sec. 2.3) into k subsets. Due to the multitude of trajectories, we are able to perform only a limited number of iterations of the procedure, namely 30. We perform TSCV by truncating the trajectories to test the stability of extrapolation results. The error of extrapolation from T − 1 to T is measured as the difference of the log-likelihoods of trajectories truncated at T − 1 and at T, calculated using the model fitted to trajectories truncated at T − 1.
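    The TSCV loop for cross-sectional data can be sketched as follows; calibrate and predict stand for the (hypothetical) model fitting and forecasting routines and are not part of the framework's actual interface:

```python
import math

def kl(p_obs, p_model):
    """D_KL(p_obs || p_model) with the 0 * ln 0 = 0 convention."""
    return sum(o * math.log(o / m) for o, m in zip(p_obs, p_model) if o > 0)

def tscv_error(observed, calibrate, predict):
    """Time series cross-validation: for each T' in [2, T) fit the model to the
    first T' observed distributions and score its forecast for time T'.

    calibrate(observed[:T']) -> model, predict(model, T') -> distribution."""
    total = 0.0
    for t_prime in range(2, len(observed)):
        model = calibrate(observed[:t_prime])
        total += kl(observed[t_prime], predict(model, t_prime))
    return total

# Trivial check: a constant series forecast perfectly gives zero TSCV error.
observed = [[0.5, 0.5]] * 4
err = tscv_error(observed, lambda data: None, lambda model, t: [0.5, 0.5])
```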

    Finally, a relevant measure of how suitable and robust the model is for analysing a particulardataset is whether the extrapolated trends it produces behave reasonably and stably.

    1 Using the number of subjects as the longitudinal sample size is a very conservative assumption, which can underestimate the effective size. However, the CSM framework is amenable to this basic approach, as we will demonstrate in Sec. 3. Additionally, since typical applications of the CSM model concern large n, we neglect the commonly used small-sample correction to AIC.


    3 Tests and empirical illustration

    In this section we test the proposed framework using synthetic data (Sec. 3.1) and demonstrate its

    practical applications to a selection of repeated cross-sectional and incomplete longitudinal samples collected from real-life observational studies (Secs. 3.2–3.4). We compare different variants of the CSM model (with memory and regularisation) and estimate confidence intervals for the extrapolated results. Our analysis includes the popular multinomial logistic regression (MLR) [13], using the same maximum likelihood estimator. The model selection procedure enables us to choose the best model, representing the most trustworthy and reliable statistical description and prediction tool for the investigated problem.

    All models were implemented in C++11, employing the open-source optimisation library NLopt [27], the automatic differentiation library Sacado [28] and the linear algebra library Eigen [29].

    3.1 Synthetic data

    To test the CSM framework, we use synthetic data consisting of a set of 1000 longitudinal trajectories, each of length 30, generated by a 3-state Markov process with a one-period memory, as in Eq. (11). The initial state is

        P_input(X_1 = m ∧ X_0 = l) :=  | 0.08  0.14  0.10 |
                                       | 0.14  0.08  0.10 |          (14)
                                       | 0.08  0.08  0.20 |_{lm}

    and the transition matrix is described by Table 1.

        m   l     π^input_klm = P(X_{t+1} = k | X_t = l ∧ X_{t−1} = m)
                  k = 0     k = 1     k = 2
        0   0     0.8       0.1       0.1
        0   1     0.15      0.75      0.1
        0   2     0.18      0.7       0.12
        1   0     0.8       0.19      0.01
        1   1     0.03      0.94      0.03
        1   2     0.01      0.14      0.85
        2   0     0.1       0.6       0.3
        2   1     0.2       0.5       0.3
        2   2     0.09      0.9       0.01
    Table 1: Transition matrix coefficients used to generate synthetic data.

    We use variants of the CSM model with different memory lengths to analyse the longitudinal data and its reduction to cross-sectional form. The latter is obtained by calculating the count of category l at time t given a set of trajectories Θ (see Sec. 2.3),

        n_{lt} = Σ_{i=0}^{Q−1} Σ_{q=0}^{τ_i−1} 1_{{t}}(t_{i,q}) 1_{{l}}(k_{i,q}) . (15)
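    Eq. (15) is a simple counting pass over the trajectory set; a sketch (hypothetical names, with each trajectory stored as (time, category) pairs):

```python
def reduce_to_counts(trajectories, n_cat, horizon):
    """Eq. (15): n_{lt} = sum_i sum_q 1{t_{i,q} = t} * 1{k_{i,q} = l}.
    Each trajectory is a list of (time, category) pairs; gaps are allowed."""
    counts = [[0] * n_cat for _ in range(horizon)]
    for traj in trajectories:
        for t, k in traj:
            counts[t][k] += 1
    return counts

# Two short trajectories; the second one has a gap at t = 1.
trajs = [[(0, 1), (1, 2)], [(0, 1), (2, 0)]]
counts = reduce_to_counts(trajs, 3, 3)  # counts[t][l] = n_{lt}
```

    The resulting count table discards all individual-level linkage, which is precisely what distinguishes the cross-sectional input of CSM(μ)CS from the longitudinal input of CSM(μ)L.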

    The optimisation procedure, detailed in Secs. 2.3 and 2.6, employs the log-likelihood (9) for the longitudinal case, CSM(μ)L, and (7) for the reduced, cross-sectional case, CSM(μ)CS. Its results will give us an insight into the models' behaviour depending on the nature of the analysed problem.

Figure 2 presents the obtained fit and extrapolation results. All models, except for CSM(0)L, recover the observed cross-sectional distribution trends very well. Their long-term extrapolations converge towards the common steady state, defined by the equation p = Πp, signifying the stability of the CSM framework. Similarly, their confidence intervals (see Appendix A), displayed for CSM(1)L, have comparable widths and encompass the majority of data points.
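The common steady state is the fixed point p = Πp of the transition matrix; a power-iteration sketch with an illustrative column-stochastic matrix (ours, not a fitted one):

```python
# A sketch of computing the steady state p = Pi p by power iteration.
def steady_state(Pi, iters=2000):
    n = len(Pi)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(Pi[k][l] * p[l] for l in range(n)) for k in range(n)]
    return p

Pi = [[0.9, 0.2, 0.1],   # column l holds P(X_{t+1} = k | X_t = l)
      [0.1, 0.7, 0.3],
      [0.0, 0.1, 0.6]]
p_inf = steady_state(Pi)
# p_inf satisfies p = Pi p up to numerical precision
```

Power iteration converges here because the chain is irreducible and aperiodic; for a chain with memory the same idea applies to the augmented state (X_t, X_{t−1}).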

We perform the model selection procedure, as described in Sec. 2.7, to verify our results and pick the best candidate for the problem. Table 2 (top) summarises the longitudinal case.


[Figure 2: probability p̂_t^k of each category (red, green, blue) vs. time, with curves for CSM(0)L, CSM(1)L, CSM(2)L, CSM(0)CS and CSM(1)CS.]

Figure 2: The CSM model results for the synthetic data generated by the 3-state Markov process. The CSM(·)L and CSM(·)CS models have been calibrated to longitudinal and (reduced) cross-sectional data, respectively, with different memory lengths. The 95% confidence intervals are shown for the best longitudinal model, CSM(1)L (see Table 2).

The reported AIC and BIC scores, as well as the LOOCV, kFCV and TSCV residuals, are relative to the best model according to the respective statistic (i.e., the model with a zero in the corresponding column). As a rule of thumb, a difference of more than 10 in a BIC value is considered strongly relevant [30], which we can also apply to the other metrics as they depend on the maximum log-likelihood in a similar way. The variant of the model with a memory length of one period, CSM(1)L, is unequivocally the best. The remaining candidates, CSM(0)L and CSM(2)L, contain too few or too many parameters to fit the available amount of data, leading to under- and overfitting, respectively. We conclude that the CSM framework correctly recovers the properties of the data-generating process and is congruent with standard model selection techniques.
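For reference, the scores can be sketched with their standard definitions, AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L (k free parameters, n observations; the numbers below are made up for illustration, not taken from Table 2):

```python
import math

# Sketch of the information criteria; lower is better, and only differences
# between candidate models matter.
def aic(log_lik, k):
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    return k * math.log(n) - 2 * log_lik

# Toy comparison: a richer model must improve ln L by more than its extra
# parameters cost to win on AIC.
better = aic(-1000.0, 10) < aic(-995.0, 20)
```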

    Model     DOF   K-L div.   AIC      BIC      kFCV     TSCV
    CSM(0)L   8     16290.7    5106.4   5052.5   2551.8   2311.3
    CSM(1)L   26    13718.9    0        0        0        0
    CSM(2)L   80    13706.7    96.2     249.4    159.3    512.5

    Model      K-L div.   AIC    BIC      LOOCV    kFCV     TSCV
    CSM(0)CS   32.28      0      0        0        0        0
    CSM(1)CS   14.77      1.01   117.45   166.73   546.49   38.15

Table 2: Model selection results for the original (longitudinal) and reduced synthetic data. For kFCV, k = 5 and the averaging is performed over 300 iterations.

Table 2 (bottom) presents the results for the cross-sectional case. Reducing longitudinal trajectories to cross-sectional distributions removed most information about the memory of the process. Consequently, we need fewer parameters to describe the data and the best model is the memory-less CSM(0)CS.


Lastly, we compare the initial state

    P_CSM_L(1)(X_1 = m ∧ X_0 = l) :=  [ 0.08  0.19  0.04
                                        0.15  0.08  0.09
                                        0.04  0.07  0.25 ]_ml

and transition matrix (Table 3) obtained from the best model CSM_L(1) to P_input and Π_input,

    m  l    (Π_CSM_L(1))_klm
             k = 0    k = 1    k = 2
    0  0     0.78     0.12     0.1
    0  1     0.15     0.74     0.11
    0  2     0.18     0.73     0.09
    1  0     0.767    0.215    0.018
    1  1     0.03     0.94     0.03
    1  2     0        0.09     0.91
    2  0     0.11     0.6      0.29
    2  1     0.2      0.5      0.3
    2  2     0.09     0.9      0.01

    Table 3: Transition matrix coefficients obtained for the longitudinal synthetic data, (Π_CSM_L(1))_klm = P(X_{t+1} = k | X_t = l ∧ X_{t−1} = m).

demonstrating that the calibration procedure accurately recovers the true parameters of the data-generating process. In particular, the Frobenius distance between the transition matrices is ‖Π_CSM(1)L − Π_input‖ = 0.00039. By comparison, for CSM(1)CS, calibrated to the reduced data devoid of longitudinal information, the corresponding value is predictably higher and equals ‖Π_CSM(1)CS − Π_input‖ = 1.75. This affords an additional confirmation of the correct behaviour of the employed numerical procedures.

    3.2 Cross-sectional BMI data on the English population

We employ the CSM framework to analyse the repeated cross-sectional data on the Body Mass Index (BMI [kg/m²]) collected by the Health Survey for England (HSE) in years 1993–2013 [31]. The twenty independent samples consist of between 3851 and 15303 persons aged 18 and older, each assigned to one of three BMI categories: normal weight (NW) with BMI ≤ 25, overweight (OW) with 25 < BMI ≤ 30 and obese (OB) with BMI > 30. We investigate the BMI trends in the whole English population and in selected birth cohorts.

The tested variants of the CSM model include the memory-less CSM(0), the 1-year memory CSM(1), as well as CSM(0)reg with regularisation penalising jumps by two BMI categories in one year (since a person's BMI changes continuously through the ordered categories, it is reasonable to expect that such a jump is less likely than remaining in the same or moving to a neighbouring category). The regularisation helps us to obtain realistic conditional probabilities of transitions by constraining the solution space for the matrix, helping to avoid the ecological fallacy as explained in Sec. 2.3. Accordingly, we subtract from the log-likelihood l[p0, Π] the penalty term

    λ Σ_{k,l=0}^{N−1} d_kl π_kl²,   where   d_kl = 1 if |k − l| > 1  and  d_kl = 0 if |k − l| ≤ 1,

and we set λ = 10³.

The above procedure is an example of Tikhonov regularisation [32] and leads to a new model, which needs to be compared with the other candidates using the information criteria (slightly less reliable in this case, as we are no longer inputting into them the logarithm of the maximum likelihood) and cross-validation. The MLR results are included for comparison.
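The penalty term is straightforward to sketch (variable names are ours; the paper sets λ = 10³):

```python
# A sketch of the regularisation term above: lambda times the sum of squared
# transition probabilities for jumps of more than one ordered category.
LAMBDA = 1e3

def penalty(Pi):
    n = len(Pi)
    return LAMBDA * sum(Pi[k][l] ** 2
                        for k in range(n) for l in range(n)
                        if abs(k - l) > 1)

# With three ordered categories only the (0, 2) and (2, 0) entries are
# penalised; the matrix below is purely illustrative.
Pi = [[0.9, 0.1, 0.2],
      [0.1, 0.8, 0.3],
      [0.0, 0.1, 0.5]]
# penalty(Pi) == 1e3 * (0.2**2 + 0.0**2) == 40.0
```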

The projected BMI trends in the adult English population are presented in Fig. 3. All CSM models predict an increasing prevalence of excessive body weight: the fraction of obese persons


increases, while that of normal-weight ones decreases; the overweight fraction exhibits a very mild decrease. The long-term trends flatten, converging towards a steady state consisting of 32% NW, 38% OW and 31% OB for CSM(0)reg. Other CSM models attain a similar infinite-time limit (despite visible overfitting in the CSM(1) case), proving the stability of the CSM framework. By comparison, the MLR method anticipates a much faster, persistent growth of the obese group with a simultaneous shrinkage of the normal weight and overweight groups, converging to an entirely obese population, which is rather imposed by the mathematical structure of the method than founded on the data.

[Figure 3: probability of each BMI category (NW, OW, OB) vs. year, 1995–2040, with curves for CSM(0)reg, CSM(0), CSM(1) and MLR.]

Figure 3: Fits and projections of the BMI trends (with 95% confidence intervals) in the adult English population obtained by CSM(·) models (variants with and without regularisation and with different memory lengths) and the MLR method, based on the repeated cross-sectional HSE data for years 1993–2013 (the marker area is proportional to the category count). The regularisation in CSM(0)reg decreases the variance of the model parameters and thus narrows the confidence intervals as compared to the penalty-free versions.

The transition matrices implied by the CSM(0) and CSM(0)reg models,

    Π_CSM(0) =  [ 0.957  0      0.048
                  0.043  0.945  0.021
                  0      0.055  0.932 ]

and

    Π_CSM(0)reg =  [ 0.911  0.072  0.004
                     0.089  0.873  0.065
                     0      0.055  0.931 ],        (16)

differ substantially, reflecting the previously invoked problem of multiple solutions for purely cross-sectional data and the subsequent need for regularisation. Specifically, whereas both models agree that the BMI of the majority of the population does not change, they achieve the observed trends (Fig. 3) in different ways. In CSM(0), normal weight and overweight persons (the first two columns of Π_CSM(0)) are only allowed to put on weight. To balance out this effect, the obese ones (third column) are more likely to reduce their BMI by two categories in one year. The regularised CSM(0)reg avoids this unrealistic effect and attributes most BMI changes to transitions between neighbouring categories. For instance, based on its predictions, England's 45-million adult population currently consists of 34% NW, 38% OW and 28% OB persons. By the following year, the obesity rate will increase by 0.2% (88 thousand people), the number by which the normal weight group will decline. The surprisingly rich dynamics behind this process involves 9% of normal weight persons becoming overweight, 5.6% of overweight persons turning obese and 7% reducing their weight to normal, as well as 6.5% of obese persons dropping to the overweight category. (Changing the regularisation parameter does not affect the result significantly.)
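The one-year update behind these figures is a single matrix-vector product; a sketch using the regularised matrix of Eq. (16) (the current NW/OW/OB shares are those quoted in the text):

```python
# One-year update of the BMI distribution, p' = Pi p, with the regularised
# transition matrix of Eq. (16) (columns condition on NW, OW, OB).
PI_REG = [[0.911, 0.072, 0.004],
          [0.089, 0.873, 0.065],
          [0.000, 0.055, 0.931]]

def step(p):
    """Advance the BMI distribution by one year."""
    return [sum(PI_REG[k][l] * p[l] for l in range(3)) for k in range(3)]

p_now = [0.34, 0.38, 0.28]    # current NW/OW/OB shares from the text
p_next = step(p_now)          # the NW share falls and the OB share rises
```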


The model selection procedure summarised in Table 4 provides a rigorous evaluation of the considered candidates. Comparison of the approximation errors shows that the CSM models fit the data better than MLR, with CSM(1) being the best. However, we need to appreciate the fact that it has more free parameters than the others. Increasing the number of degrees of freedom (DOF) almost always results in a better fit, but can also make the model less stable and more sensitive to noise and outliers. This is one of the reasons why the minimised error value or a visual inspection of the fit are insufficient to validate the model, and we always need to resort to a comprehensive statistical methodology. Accordingly, we find that AIC is dominated by the maximum log-likelihood term due to the large sample size, and thus picks the most complex model, CSM(1), as the best. At the same time, BIC, with a stronger penalty term, prefers the simpler CSM(0) and CSM(0)reg. Since we expect to see simple trends in the data, we interpret the AIC result as overfitting, which can indeed be observed in Fig. 3. Furthermore, the cross-validation procedure unequivocally selects both memory-less CSM models. We conclude that the best models for studying temporal trends of BMI distributions are CSM(0) and CSM(0)reg, the latter additionally yielding a realistic description of individual transitions between BMI categories.

    Model       DOF   K-L div.   AIC    BIC    LOOCV   kFCV    TSCV
    CSM(0)reg   8     86.6       21.5   0.7    0       0       0
    CSM(0)      8     90.5       20.9   0      1.9     1.4     2
    CSM(1)      26    62.1       0      130    814     14638   41.6
    MLR         4     160.1      152    97.6   80.2    83      58.9

Table 4: Model selection results for the BMI dataset. For kFCV, k = 5 and the averaging is performed over 300 iterations.

[Figure 4: probability of each BMI category (NW, OW, OB) vs. age, 20–90, for the four cohorts listed in the caption.]

Figure 4: The CSM(0)reg fit and projection of BMI trends for four cohorts of the English population: aged 30–34 in 1993, aged 18–22 in 1993 (with 95% confidence intervals), aged 18–22 in 1997 and aged 18–22 in 2001, based on the HSE data for years 1993–2013.

We employ the CSM(0)reg model to calculate BMI trends for birth cohorts, as described in Sec. 2.5. We analyse four cohorts throughout their adulthood: those aged 30–34 in 1993, 18–22 in 1993, 18–22 in 1997 and 18–22 in 2001. As presented in Fig. 4, all of them display identical tendencies: the BMI distributions in the cohorts are very similar and weight gain occurs with age at the same rate.

The above CSM model results combined suggest that, contrary to popular opinion, excessive weight gain is experienced by the whole English population throughout adult life,


not just by younger generations. This may suggest that it is driven by a common factor rather than by individual lifestyles specific to certain cohorts.

    3.3 Cross-sectional data on marijuana use among US teenagers

The repeated cross-sectional data on the recreational use of marijuana (in the past month) among American 12th-graders were collected from the Monitoring the Future (MTF) survey in years 1975–2011 (39 independent samples of varying size, ca. 15000 on average) [33]. Their trends, shown in Fig. 5, evince a complex generating mechanism steered by changing state and federal policies, inviting more sophisticated statistical analysis. For this purpose we test several memory variants of the CSM model (from memory-less to 5-year memory) and the MLR method.

Figure 5 presents the obtained extrapolated fits of the marijuana use prevalence. All CSM models reproduce its overall long-term tendency, converging to a similar steady state of around 20% of users. Increasing the memory length enables us to recover more details, but poses the risk of overfitting. In contrast, the MLR method is too constrained to describe the data: it doesn't capture their rich trend and converges fast to a steady state (no drug users in the population) which is imposed by the mathematical structure of the method rather than founded on the data.

[Figure 5: marijuana use prevalence vs. year, 1980–2040, with curves for CSM(0) through CSM(5) and MLR.]

Figure 5: Fits and extrapolations (with 95% confidence intervals) of marijuana use prevalence among American 12th-graders based on the MTF survey from years 1975–2011, obtained by CSM(·) models with different memory lengths and the MLR method. The marker area is proportional to the observed category count.

We verify our observations by performing the model selection procedure summarised in Table 5. In the setting of complex data trends, the AIC and BIC values are dominated by the maximum log-likelihood of the model parameters, accommodating more subtle effects. They favour CSM models with the longest memory, with more degrees of freedom than observed prevalence values. Hence their choice should be treated with caution, and a more precise and reliable model assessment provided by the cross-validation techniques is required. Accordingly, LOOCV picks CSM(2) as the best candidate, with CSM(3) relatively close to it. At the same time, kFCV, due to its insensitivity to the details of observed trends, selects CSM(0), but also has a local minimum for models with non-zero memory at CSM(3). The latter is also preferred by TSCV, with CSM(2) next to it. The MLR method is indeed insufficient to describe the analysed dataset. Figure 6 helps to visualise the model selection results, identifying CSM(3) as the best model, followed by CSM(2).


    Model     DOF   K-L div.   AIC    BIC    LOOCV   kFCV    TSCV
    MLR       2     4139       7818   6638   2854    1200    3424
    CSM(0)    3     2590       4723   3553   1422    0       1366
    CSM(1)    7     2123       3797   2664   6866    53538   1062
    CSM(2)    15    1203       1972   915    0       15292   320
    CSM(3)    31    1077       1754   847    223     7526    0
    CSM(4)    63    533        729    125    4243    15149   2712
    CSM(5)    127   105        0      0      —       —       1533

Table 5: Model selection results for the marijuana dataset. For kFCV, k = 5 and the averaging is performed over 300 iterations. LOOCV and kFCV results for the memory length of 5 are not provided due to computational limitations (note that their values increase sharply already for a memory length of 4).

[Figure 6: bar chart of the AIC, BIC, LOOCV, kFCV/10 and TSCV scores from Table 5 for MLR and CSM(0) through CSM(5).]

Figure 6: Model selection results for the marijuana dataset, based on Table 5. Note that the kFCV values were rescaled for visibility.

    3.4 Longitudinal BMI data on the US population

We apply the CSM framework to longitudinal data on BMI, affected by attrition and non-adherence, collected from the National Longitudinal Survey of Youth 1979 run by the U.S. Bureau of Labor Statistics [34]. The dataset consists of 12686 trajectories belonging to men and women from a cohort born in 1957–65, interviewed in years 1981, 1982, 1985, 2006, 2008, 2010 and 2012. The BMI values were calculated from self-reported body weights and heights, and classified into one of the three categories defined in Sec. 3.2: normal weight (NW), overweight (OW) or obese (OB).

The properties of the investigated data are displayed in Fig. 7. Since the interviews were conducted at irregular time intervals, we adjust their numbering as illustrated in panel a to prevent numerical instabilities and speed up the calibration process, while introducing only minor inaccuracies. Attrition after the first three interviews is significant, apparently due to the intermission between years 1985 and 2006 (9 intervals), leaving only 60% of participants in the last four rounds. Yet almost all of them completed the survey with full adherence, as indicated by panel b. The longest continuous fragments of trajectories consist of 3 or 4 consecutive data points, corresponding to the initial or final rounds of interviews, respectively, and comprise 80% of the set of observed continuous trajectory fragments, as presented in panel c. The most frequent gap in the data corresponds to the survey intermission.

We calibrate different variants of the CSM model to the relabelled longitudinal data and their cross-sectional reduction defined by Eq. (15); the MLR method is used in the latter case for comparison. Figure 8 presents the extrapolated fits of the BMI distributions. According to the model selection procedure for the cross-sectional data summarised in Table 6, BIC and TSCV choose CSM(0)CS, while LOOCV and kFCV prefer MLR. Models with memory, for instance CSM(1)CS selected by AIC, are affected by overfitting. However, the two best models give markedly different results: CSM(0)CS achieves the steady state comprised of 37% overweight and 63%


[Figure 7: three panels showing a) the relabelling of the interview years (1981, 1982, 1985, 2006, 2008, 2010, 2012) and the number of survey participants, b) the distribution of the number of observations per individual, c) the distribution of lengths of gaps and continuous fragments in the trajectories.]

Figure 7: Survey data properties: a) data structure and renumbering of the years of consecutive interviews into labels (1, 2, 3, 13, 14, 15, 16) to improve numerical performance, with one period equal to approximately 2 years; b) distribution of the number of data points collected for survey participants; c) distribution of lengths of continuous fragments of data and gaps in the trajectories.

    obese persons (through transitions between neighbouring categories, making the regularisationunnecessary), whereas MLR converges to an entirely obese population.

    Model      DOF   K-L div.   AIC      BIC     LOOCV    kFCV     TSCV
    MLR        4     94.85      143.55   28.43   0        0        104.27
    CSM(0)CS   8     62.16      86.17    0       349.02   346.54   0
    CSM(1)CS   26    1.06       0        44.09   5761     11563    44.05

    Model      DOF   −l.l.    AIC    BIC     kFCV     TSCV
    CSM(0)L    8     40432    5231   4828    2452     2761
    CSM(1)L    26    38317    1035   733.4   355.7    0
    CSM(2)L    80    37744    0      0       0        8573

Table 6: Model selection results for the reduced and original NLSY79 data. Since the calculation of Shannon entropy is numerically difficult for longitudinal data due to the large state space for trajectories, we present the fit error as minus the log-likelihood of the data. kFCV results were calculated for 300 and 30 iterations for the reduced and original data, respectively. For the latter, LOOCV has been skipped as ineffective, while the TSCV result for CSM(2)L is distorted by the survey intermission.

Describing the longitudinal information contained in the original data requires additional model parameters. This intuition is confirmed by the results of the selection procedure carried out for models with different memory lengths, the best of which are presented in Table 6. The simplest model, CSM(0)L, is unable to accommodate all available information and consequently produces a worse fit of the observed BMI trends than CSM(0)CS, despite having the same number of degrees of freedom (see Fig. 8). The best candidate is CSM(2)L, yielding the most satisfactory overall model selection results (we ignore the TSCV outcome, distorted for models with long memory by the survey intermission). Its estimation of the BMI distribution trends shown in Fig. 8 is


[Figure 8: probability of each BMI category (NW, OW, OB) vs. year, 1985–2040, with curves for CSM(0)L, CSM(1)L, CSM(2)L, CSM(0)CS and MLR.]

Figure 8: BMI fits and predictions based on the NLSY79 longitudinal survey. We compare models with different memory lengths (approx. 2 years each, see the relabelling in Fig. 7) calibrated to the original longitudinal data, CSM(·)L, and their cross-sectional reduction, CSM(·)CS, together with the MLR result for the latter.

similar to that obtained from CSM(0)CS, which signifies the stability and effectiveness of the CSM framework. The obtained steady state consists of 3% NW, 17% OW and 80% OB persons. Including the longitudinal information significantly narrows the confidence intervals, giving more precise estimates of the analysed effects. Additionally, our results suggest that the MLR projections based on cross-sectional data may be unrealistic and prompt misleading conclusions in general.

[Figure 9: panels a–c show the probabilities of BMI category changes (stable, increase, decrease) vs. year, with longitudinal data points for NW, OW and OB; panel d summarises the stable/increase/decrease fractions.]

Figure 9: Probabilities of moving between BMI categories in a 2-year period (with 95% confidence intervals) according to CSM(2)L; CSM(0)CS and the observed longitudinal transitions (the marker area is proportional to the category count) are shown for comparison.

    17

  • 7/26/2019 Cross-sectional Markov model for trend analysis of population characteristics

    18/21

Another valuable piece of information produced by the CSM framework, by a simple application of Bayes' theorem, is the variability of BMI in the population. Figure 9 displays the joint probabilities of belonging to particular BMI categories at the current and the preceding time steps, calculated using CSM(2)L. Panels a–c indicate whether a person remains in the same or moves to a different category within the following 2 years. The calculated trends match the values derived directly from the data by counting transitions in all available continuous two-point fragments of observed trajectories. The summary of the results is presented in panel d: over 80% of the population stays in the same category, while the rest tends to experience a BMI increase rather than a decrease in the short period. We also indicate the respective CSM(0)CS results to demonstrate that only calibration to longitudinal data can accurately reproduce the joint probability trends.
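The chain-rule step can be sketched as follows (memory-less notation for brevity, whereas CSM(2)L conditions on two past states; the matrix and distribution below are illustrative, not fitted values):

```python
# The joint probability of being in category l at time t-1 and in category k
# at time t is pi_{kl} * p_{t-1, l}.
def joint(Pi, p_prev):
    n = len(Pi)
    return [[Pi[k][l] * p_prev[l] for l in range(n)] for k in range(n)]

Pi = [[0.9, 0.1, 0.0],
      [0.1, 0.8, 0.2],
      [0.0, 0.1, 0.8]]
J = joint(Pi, [0.5, 0.3, 0.2])
stay = sum(J[k][k] for k in range(3))   # probability of staying in category
```

Summing the diagonal of the joint matrix gives the overall probability of remaining in the same category, the quantity shown in panel d.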

[Figure 10: a 3×3 grid of panels showing the probability of each two-step BMI transition (NW→NW, NW→OW, NW→OB, OW→NW, OW→OW, OW→OB, OB→NW, OB→OW, OB→OB) vs. year, with longitudinal data points; the below-diagonal panels are marked "yo-yo".]

Figure 10: Probabilities of moving between the BMI categories in a 4-year period (with 95% confidence intervals) according to CSM(2)L, as compared to the observed longitudinal transitions (the marker area is proportional to the category count).

Figure 10 presents a similar analysis for the joint probabilities of belonging to particular BMI categories at three contiguous time steps, facilitating the analysis of long-term (approx. 4 years) BMI changes using CSM(2)L. We can distinguish three stable patterns: BMI remains unchanged for the majority of the population (diagonal panels), in concordance with the short-term predictions; most persons who have moved to a higher BMI category remain in it (above-diagonal panels); over half of those who have managed to reduce their BMI experience the yo-yo effect, i.e., the cyclical loss and gain of weight (below-diagonal panels). These results are compared with the longitudinal data by counting transitions in all available continuous three-point fragments of observed trajectories. Since their number is much smaller than in the previous case (see Fig. 7c), some disagreement is likely. The CSM framework unifies all available data, facilitating trend analysis and forecasts of BMI tendencies in the long period.


    4 Summary

The presented CSM framework can utilise any type of available information, from cross-sectional to longitudinal data and their aggregates, for comprehensive studies of trends in groups and cohorts of the population. Its mathematical structure, based on classic Markov models, has a clear dynamical interpretation, providing an insight into the mechanisms generating the observed process and making it particularly well adapted to microsimulation modelling. The employed maximum likelihood estimation procedure is compatible with popular model scoring methods, while the efficient numerical implementation facilitates an extensive cross-validation of the model results and an accurate estimation of confidence intervals by bootstrapping.

The versatility of the CSM approach enables us to analyse simple dependencies as well as complex trends, obtaining realistic projections and steady states while avoiding the ecological fallacy and overfitting. The provided examples of model applications to real-world data yield new and interesting results. In particular, the combined results on BMI trends in the English population and its birth cohorts based on cross-sectional data show that the excessive weight problem affects all generations equally, suggesting a common driving factor. The rich trend of marijuana use among American teenagers, shaped by historical policies, has been recovered assuming a 3-year-long memory of the process and shown to have achieved its steady state of about 20% of users. We have described the interesting dynamics of BMI changes behind the obesity growth in the US population concealed in incomplete longitudinal data (e.g. the yo-yo effect), which cannot be recovered from the cross-sectional information alone. The above data analyses have included the model selection procedure, which enabled us to choose the most appropriate variants of the CSM model and revealed that the MLR method can produce incorrect and misleading results.

    A Error estimation by bootstrapping

Bootstrapping is a very general method of calculating confidence intervals for quantities estimated from statistical data [35]. In its simplest form, we construct an empirical distribution of such a quantity by drawing randomly with replacement from the dataset to obtain a new sample, from which a new value of the quantity can be computed. By doing so a sufficient number of times, we can calculate e.g. 95% confidence intervals for the quantity. A more advanced version, parametric bootstrapping, first estimates the distribution of the data based on the observed sample (e.g. using Bayesian inference) and then draws from this distribution.

In our case, the datasets consist of sets of surveys collected at multiple times or of observed longitudinal trajectories. Our quantity can be the transition matrix Π, the initial state p0 or an extrapolated distribution pt inferred by the CSM model. In the case of cross-sectional data, for each time t with a non-zero number nt of surveys, we assume a flat Dirichlet prior for pt, and thus its posterior distribution is Dir(αt) with αt,k = 1 + n_t^k. We draw from Dir(αt) a new distribution p′t and next draw from it a set of nt new values of Xt. A resampled distribution p̃t is obtained by counting the ñ_t^k occurrences of each k. For longitudinal data, given the set of observed trajectories, we draw with replacement a new set of the same size. We use this simple procedure instead of estimating a Dirichlet distribution because its dimension can be very large in this case. The above approach respects the observed trends in the data and is free from assumptions about the error distribution or analytical simplifications (unlike the commonly used delta method mentioned in Sec. 2.2).
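The cross-sectional resampling step can be sketched as follows (helper names are ours; the Dirichlet draw uses normalised Gamma variates):

```python
import random
from collections import Counter

# Resample cross-sectional counts: posterior Dirichlet weights
# alpha_k = 1 + n_t^k, a Dirichlet draw via normalised Gamma variates,
# then a multinomial redraw of the same sample size.
def resample_counts(counts, rng):
    alphas = [1 + n for n in counts]
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    probs = [g / total for g in gammas]          # p'_t ~ Dirichlet(alpha)
    size = sum(counts)
    draws = rng.choices(range(len(counts)), weights=probs, k=size)
    hist = Counter(draws)
    return [hist.get(k, 0) for k in range(len(counts))]

rng = random.Random(1)
new_counts = resample_counts([50, 30, 20], rng)   # resampled n_t^k, total 100
```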

Having generated a new input set p̃t, we obtain a new transition matrix Π̃, initial state p̃0 and, consequently, a new extrapolated trend p̃t, following Sec. 2.3. To obtain confidence intervals for each p_t^k, we sort the bootstrapped trends p̃t by their total Kullback–Leibler divergence from pt, Σ_{t=0}^{T2−1} D_KL(p̃t ‖ pt), where T2 ≥ T is the number of extrapolated periods, and remove the furthest trends (the fraction corresponding to the desired confidence level). The upper and lower confidence bounds for p_t^k are the maximum and minimum of the remaining p̃_t^k values. This also provides the confidence intervals for p0. A similar algorithm can be applied to Π by treating its columns as probability distributions and sorting the bootstrapped matrices by their total Kullback–Leibler divergence from Π (summing over columns).
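The ranking distance can be sketched as follows (a small epsilon guarding against zero probabilities is our choice, not the paper's):

```python
import math

# Total Kullback-Leibler divergence of a resampled trend from the central one,
# used to rank bootstrapped trends before trimming the most distant ones.
def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def total_kl(trend_a, trend_b):
    return sum(kl(pa, pb) for pa, pb in zip(trend_a, trend_b))

# Identical trends lie at distance zero; any deviation gives a positive score.
d0 = total_kl([[0.5, 0.5], [0.3, 0.7]], [[0.5, 0.5], [0.3, 0.7]])
d1 = total_kl([[0.9, 0.1]], [[0.5, 0.5]])
```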


    References

[1] P. McCullagh and J. A. Nelder. Generalized Linear Models (Second edition). Chapman & Hall, London, 1989.

[2] S. J. Long. Regression Models for Categorical and Limited Dependent Variables. Sage, Thousand Oaks, London, New Delhi, 1997.

[3] A. J. Dobson and A. G. Barnett. An Introduction to Generalized Linear Models, Third Edition. Chapman and Hall/CRC, Thousand Oaks, London, New Delhi, 2008.

[4] L. Goodman. Ecological regression and the behaviour of individuals. American Sociological Review, 18:663–664, 1959.

[5] G. King, O. Rosen, and M. A. Tanner. A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton University Press, Princeton, 1997.

    [6] M. Penubarti and A.A. Schuessler. Inferring Micro from Macrolevel Change: Ecological PanelInference in Surveys. Los Angeles: University of California, 1998.

[7] R. Moffitt. The effect of the U.S. welfare system on marital status. Journal of Public Economics, 41(1):101–124, 1990.

[8] R. Moffitt. Identification and estimation of dynamic models with a time series of repeated cross-sections. Journal of Econometrics, 59(1–2):99–123, 1993.

[9] M. D. Collado. Estimating dynamic models from time series of independent cross-sections. Journal of Econometrics, 82(1):37–62, 1997.

    [10] C.H. Achen and W.P. Shively. Cross-Level Inference. University of Chicago Press, Chicago,1995.

[11] R. Eisinga. Recovering transitions from repeated cross-sectional samples. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 4(4):139–151, 2008.

    [12] B. Pelzer, R. Eisinga, and Ph.H. Franses. Inferring transition probabilities from repeatedcross sections. Political Analysis, 10(2):113133, 2002.

    [13] M. Verbeek, F. Vella, and K. U. Leuven. Estimating dynamic models from repeated cross-sections, 2000.

    [14] Y. Meng, A. Brennan, R. Purshouse, D. Hill-McManus, C. Angus, J. Holmes, and P.S. Meier.Estimation of own and cross price elasticities of alcohol demand in the UK - A pseudo-panelapproach using the Living Costs and Food Survey 20012009. Journal of Health Economics,34:96 103, 2014.

    [15] Ch. Ess and F. Sudweeks.Culture, technology, communication: towards an intercultural globalvillage. SUNY Press, 2001.

    [16] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of MathematicalStatistics, 22(1):7986, 03 1951.

    [17] S. Kullback and R. A. Leibler. Letter to the editor: The KullbackLeibler distance. TheAmerican Statistician, 4(41):34041, 1987.

    [18] S. Kullback. Information theory and statistics. John Wiley and Sons, New York, 1959.

    [19] C.D. Manning, P. Raghavan, and H. Schutze. An Introduction to Information Retrieval.Cambridge University Press, Cambridge, 2008.


[20] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.

[21] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.

[22] Y. Yang. Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92(4):937–950, 2005.

[23] G. Claeskens and N. L. Hjort. Model Selection and Model Averaging. Cambridge University Press, 2008.

[24] S. Arlot and A. Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.

[25] J. D. Hart and P. Vieu. Data-driven bandwidth choice for density estimation based on dependent data. The Annals of Statistics, 18(2):873–890, 1990.

[26] J. D. Hart. Automated kernel smoothing of dependent data by using time series cross-validation. Journal of the Royal Statistical Society, Series B (Methodological), 56(3):529–542, 1994.

[27] S. G. Johnson. The NLopt nonlinear-optimization package. http://ab-initio.mit.edu/nlopt.

[28] Sacado. http://trilinos.org/packages/sacado.

[29] G. Guennebaud, B. Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.

[30] R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.

[31] Department of Health, NHS. Health Survey for England. London, 2011.

[32] A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58(3):54–59, 1962.

[33] L. D. Johnston, P. M. O'Malley, J. G. Bachman, and J. E. Schulenberg. Monitoring the Future: National Results on Adolescent Drug Use, 2011.

[34] National Longitudinal Survey of Youth 1979. https://www.nlsinfo.org/content/cohorts/nlsy79.

[35] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, NY, 1993.