
Dietary assessment and estimation of intake densities

Michael J. Daniels

Alicia Carriquiry

Michael Daniels is corresponding author, ���G Snedecor Hall, Department of Statistics, Iowa State University, Ames, IA ����������. E-mail: mdaniels@iastate.edu.

Michael Daniels is Assistant Professor, Department of Statistics, Iowa State University.

Alicia Carriquiry is Associate Professor, Department of Statistics and Center for Agricultural and Rural Development, Iowa State University.

This work was partially funded through contracts number ����� ���� and ����� � �� between the National Center for Health Statistics, Center for Disease Control and Prevention, and the Department of Statistics, Iowa State University, and by Research Grant No. ������� from FONDECYT, Chile.

Summary

The U.S. government has conducted nationwide food consumption surveys since ����. Information obtained from these surveys is used to design food assistance programs, guide food and nutrition policy, and monitor the dietary status of the population. The distribution of usual intakes of a nutrient in the population is of interest to policy makers. Here, usual intake is defined as the long-run average intake of a nutrient by an individual.

Usual intakes are not observable in practice. Instead, we observe daily intakes for a sample of individuals and a small number of days, and assume that observed intakes measure usual intakes with error. The distributions of observed intakes, however, are typically very skewed, and the day-to-day variability in intakes tends to be large relative to the between-individual variance, and can be heterogeneous across individuals (Nusser, Carriquiry, Dodd, and Fuller, ����).

In this paper, we present a Bayesian approach to estimating the distribution of usual intakes of a nutrient in a population. Starting with a sample of dietary intakes, we model a function to map the intakes into the normal scale. This function combines a power transformation and a cubic spline constrained to be monotonic, with unknown number and location of knots, and is estimated using reversible jump Markov chain Monte Carlo methods (Green, ����). From each draw of the transformation function, we obtain a transformed set of intakes which are approximately normally distributed. We then remove the day-to-day variability in daily intakes by fitting a measurement error model to each set of transformed observations. Each set of estimated individual usual intakes is then mapped back to the original scale using the inverse of the transformation function. Posterior distributions of percentiles and other attributes of the density for each nutrient are estimated accounting for all major sources of uncertainty.

We apply these methods to a subset of the �������� Continuing Survey of Food Intakes by Individuals (CSFII) collected by the USDA (USDA, ����).

Key words: Dietary data, CSFII, Measurement error models, Splines, Reversible jump, Markov chain Monte Carlo

1 Introduction

The United States government has collected dietary intake data since the ����s. Nationwide food consumption surveys are conducted approximately once a year, in which a large sample of individuals is asked to report their food consumption during the previous 24 hours. Thus, the survey instruments used to collect this information are called 24-hour recalls. Most nationwide food consumption surveys collect replicate 24-hour recalls for at least some of the individuals in the sample. Often, these repeated observations are not collected on consecutive days, so that multiple observations within an individual can be considered to be independent.

Other survey instruments, for example food frequency questionnaires (FFQs), can also be used to collect dietary intake data and to estimate usual nutrient intake distributions (e.g., Carroll, Freedman, and Hartman, ����). In this paper, we consider only the analyses of intake data collected via 24-hour recalls.

The information obtained from these dietary surveys is used by policy makers to design, implement, monitor and evaluate food assistance programs and other nutrition-related policies. For example, policy makers might be interested in comparing the nutritional status of children from low-income households who are enrolled in the School Lunch program versus that of children who are not. The effectiveness of the Food Stamps program might be evaluated by, for instance, monitoring the proportion of low-income elderly who are consuming enough of some essential nutrient, or the proportion of teenaged girls who consume adequate amounts of folate or calcium. Since food assistance programs managed by the U.S. Department of Agriculture (USDA) alone cost approximately �� billion dollars a year, it is important that information obtained from dietary data be as accurate as possible, and that measures of uncertainty be available for all estimates.

How do we obtain information about the intake of a nutrient from a dietary intake survey? Individuals participating in food consumption surveys are asked to recall their food (including beverages, snacks, and meals) consumption for the previous day. A database managed by the USDA is then used to "map" foods into their nutrient components. This USDA database contains approximately ����� entries, and is updated periodically. For example, we can obtain the content of about �� different nutrients of a lunch composed of a slice of pepperoni pizza, an �-ounce can of Diet Coke, and an apple. It is well known (e.g., Schubert, Holden, and Wolf, ����; Haytowitz, Pehrsson, Smith, Gebhardt, Mathews, and Anderson, ����) that this food database is not error free. We do not, however, address the issue in this work.

The data we obtain for analysis, then, are replicate observations (for at least a subsample of individuals) of daily intakes of a large set of nutrients for individuals in the sample. We use $Y_{ij}$ to denote the observed intake of a nutrient for individual $i$ on day $j$. Because these data are costly to collect, the number of replicate observations is typically no more than two or three for a subsample of the individuals in the survey. We use $d_i$ to denote the number of days of intake information available for each individual in the sample.

The relationship between diet and health underlies much of the government's goal of providing the population with the means to consume an adequate diet. Often, the effect of nutrient consumption on health-related outcomes is chronic, so that researchers are interested in the long-run average intake of a nutrient by an individual. This long-run average intake is known as the usual intake of a nutrient by an individual, and is denoted by $y_i$, for $i = 1, \ldots, n$, where $n$ is the number of individuals in the sample. Formally, $y_i = E\{Y_{ij} \mid i\}$. Furthermore, population-level assessments such as those described earlier require that we estimate the distribution of usual intakes $F(y)$ in the group of interest. This usual intake distribution concept was set forth in a report by the National Research Council (NRC, ����).

The problem of estimating usual nutrient intake distributions from dietary survey data is a challenging one. Usual intakes are not observable in practice, and observed daily intakes measure usual intakes with error. Furthermore, various characteristics of dietary intake data (described in the next section) prevent the use of standard normal-theory methods for analysis. Nusser, Carriquiry, Dodd, and Fuller (����), Eckert, Carroll, and Wang (����), Chen (����), and Carriquiry (����), among others, have recently proposed approaches for analyzing dietary intake data. In particular, Nusser et al. (����) propose a measurement error model approach on transformed intake data that results in estimators of usual intake distributions that perform well in simulation studies. The Nusser et al. methodology, however, is developed from a frequentist viewpoint, and consists of several steps. Thus, it is not possible to obtain expressions for standard errors of various estimates that properly incorporate all uncertainties accumulated along the way. In fact, the estimators of standard errors for percentiles of the usual intake distribution given in Nusser et al. (����) are obtained under the assumption that the function used to transform the data into the normal scale, and the variance components in the measurement error model, are fixed and known.

We revisit the Nusser et al. (����) approach to estimating usual intake distributions from dietary intake data, and reformulate it within a Bayesian framework. Our objective is to derive marginal posterior distributions for parameters of the usual intake distribution of a nutrient that are of interest to policy makers and researchers in nutrition. We focus on the marginal posterior distributions of percentiles of the usual intake distribution, and argue that the posterior variances we obtain reflect all uncertainties accrued in the various steps of the procedure. We use Markov chain Monte Carlo methods (MCMC; e.g., Smith and Roberts, ����) throughout to perform all computations. As will be described in Section 3, the transformation step involves solving a varying-dimensional problem; thus, we proceed as in Green (����) and Denison, Mallick, and Smith (����) and use a reversible-jump MCMC algorithm to obtain the transformation function.

The paper is organized as follows. In Section 2 we briefly discuss the characteristics of dietary intake data, and describe a subset of the Continuing Survey of Food Intakes by Individuals (CSFII; USDA, ����) that was used for illustration of the procedure. The model and proposed estimation strategy are given in Section 3. We apply the methodology to a subset of CSFII and present results in Section 4. Finally, Section 5 gives a discussion of the approach we propose and of related problems in nutrition that merit further investigation.

2 Characteristics of dietary intake data

We consider dietary intake data obtained via 24-hour recalls by the CSFII carried out in ��������. The CSFII is a nationwide food consumption survey designed as a multistage stratified area probability sample of the �� states and the District of Columbia, and is intended to be self-weighting. We consider the subset consisting of males and females aged �� to �� years, who were interviewed between ���� and ����. Two observations were collected for each individual in the sample. Both observations were obtained by personal interview if possible; otherwise, the second-day interview was done over the phone. Within an individual, intakes were collected at least a week apart from each other; thus, we assume that observations within an individual are independent. Because of nonnegligible attrition rates, regression weights (e.g., Huang and Fuller, ����) were constructed to adjust for nonresponse. The analyses we present in Section 4 are performed on weighted data, where weights, once computed, are assumed to be fixed and known.

Observed intake data are affected not only by individual, but also by nuisance effects such as day of the week, month of the year, interview sequence (first or later days), and interview method (in person or by phone). Prior to analysis, we adjust the data to remove these nuisance effects. We proceed as in Nusser et al. (����) and use a ratio adjustment based on a regression model to partially remove the effects of day of week and interview method from observed intake data. To avoid carrying the survey weights throughout our analyses, we linearly transform the intake data to obtain a set of "equal weight" observations, as described in Dodd (����). The unweighted analyses of the equal weight observations are essentially equivalent to the analyses that would be conducted on the original observations and their weights. In the remainder, $Y_{ij}$ denotes the adjusted, equal-weight intake for individual $i$ on day $j$.

Dietary intake data have attributes that make their analysis challenging. Observed daily intakes have skewed distributions, and exhibit both between- and within-individual variability. In fact, the within-individual variance in observed intakes of most nutrients is sometimes larger than (or of the same order of magnitude as) between-individual variation, and is heterogeneous across individuals. Typically, as the mean intake of a nutrient increases, so does the variance of those intakes. Since our objective is to estimate $F(y)$, the distribution of the usual intakes, we must remove the day-to-day variability from the observed intakes.

An additive relationship between observed intake and usual intake in the normal scale is often adopted to model transformed, observed intakes. A linear measurement error model approach that allows for the incorporation of heterogeneous within-individual measurement error variances is then appropriate for dietary intake data. How to transform observed intakes into the normal scale so that transformed intakes are normally distributed and the additive relationship holds is a matter of ongoing discussion (Nusser et al., ����; Stefanski and Bay, ����; Chen, ����). Here, we adopt the Nusser et al. (����) approach, and assume that in the normal scale, a linear measurement error model is a reasonable choice to describe the relationship between observed and usual intakes.

3 Model and estimation strategy

We implement a fully Bayesian approach to the problem of estimating the marginal posterior distributions of percentiles of the usual intake distribution of dietary components. The basic approach uses three nested sampling algorithms to properly account for all uncertainties: (1) Transformation of observed dietary intake data to normality; (2) Removal of measurement error in the normal scale; (3) Back-transformation to the original scale. We now describe each of these steps in detail.

3.1 Transformation to normality

As discussed in Nusser et al. (����), standard power transformations fail to properly transform intake data to normality for most nutrients. We use cubic splines to improve the transformation. Our data consist of pairs $(Y_{ij}^{\alpha}, z_{ij})$, where $Y_{ij}^{\alpha}$ is the observed intake for the $i$th individual on the $j$th day raised to the power $\alpha$ which provides the best (in terms of minimizing mean squared error) transformation to normality, and the $z_{ij}$ are the corresponding normal scores. We use Blom's (����) formula to compute the $z_{ij}$.
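The normal scores can be illustrated with a short sketch. It assumes Blom's formula in its commonly cited form, $z = \Phi^{-1}\{(r - 3/8)/(N + 1/4)\}$ for the observation of rank $r$ among $N$. The function name `blom_scores`, the toy intakes, and the tie handling (ties are ranked in order of appearance) are assumptions of this example, not details taken from the paper.

```python
from statistics import NormalDist


def blom_scores(values):
    """Return Blom normal scores, in the same order as `values`.

    Assumes the usual form of Blom's formula:
    z = Phi^{-1}((rank - 3/8) / (N + 1/4)).
    """
    n = len(values)
    # Indices of the observations sorted by value (stable, so ties keep input order).
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = std_normal.inv_cdf((rank - 0.375) / (n + 0.25))
    return scores


# Hypothetical (made-up) daily intakes, already power-transformed.
intakes = [812.0, 455.0, 1310.0, 640.0, 990.0]
z = blom_scores(intakes)
```

The scores are symmetric about zero by construction, so the middle observation of an odd-sized sample receives a score of zero.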

Our goal is to compute a function $g^{-1}(z; \theta)$ such that $g(Y_{ij}^{\alpha}) = X_{ij}$, where $X_{ij}$ is approximately normal. We postulate a cubic spline for $g^{-1}(z; \theta)$, indexed by a vector of unknown parameters $\theta$ and contaminated by normal noise. We use maximum likelihood (ML) to estimate the parameters in the model. The model is

$$Y_{ij}^{\alpha} = g^{-1}(z_{ij}; \theta) + \epsilon_{ij} = \beta_0 + \sum_{p=1}^{3} \beta_p z_{ij}^{p} + \sum_{p=1}^{k} \beta_{p+3}\, t_p + \epsilon_{ij}, \tag{1}$$

where $t_p = (z_{ij} - r_p)^3 I\{z_{ij} > r_p\}$ and the $\epsilon_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, n$, are normal random variables with mean 0. The number of knots in (1) is given by $k$, and their locations are denoted $r_1, r_2, \ldots, r_k$. We define $r^{(k)} = (r_1, r_2, \ldots, r_k)$, and $\theta = (k, r^{(k)}, \beta)$.
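To make the spline concrete, the following minimal sketch evaluates a cubic spline written in truncated power form: an intercept, three polynomial terms, and one truncated cubic term per knot. The function names `spline_row` and `g_inverse`, and the coefficient values used in the check, are hypothetical.

```python
def spline_row(z, knots):
    """Basis row [1, z, z^2, z^3, (z - r_1)^3_+, ..., (z - r_k)^3_+]."""
    row = [1.0, z, z ** 2, z ** 3]
    # Each truncated cubic term is zero to the left of its knot.
    row += [(z - r) ** 3 if z > r else 0.0 for r in knots]
    return row


def g_inverse(z, beta, knots):
    """Spline value at z; `beta` holds the 4 + k coefficients."""
    return sum(b * x for b, x in zip(beta, spline_row(z, knots)))


# A row for z = 2 with two knots has 2 + 4 = 6 entries.
example_row = spline_row(2.0, [1.0, 3.0])
example_value = g_inverse(2.0, [1.0, 0.5, 0.0, 0.0, 2.0, 7.0], [1.0, 3.0])
```

Stacking one such row per observation gives a design matrix with as many columns as there are coefficients, which is what the least squares step operates on.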

The $Y_{ij}^{\alpha}$ are the sample quantiles of the power-transformed data, and thus cannot be considered to be iid random variables. As a result, the covariance matrix of $\epsilon$ might be modelled as a scale factor $\sigma^2$ times a weight matrix $W$, which is proportional to the asymptotic variance of the sample quantiles (see, e.g., Schervish, ����, pp. �������). The variance of $\epsilon_{ij}$ will take the form $\sigma^2 p_{ij}(1 - p_{ij}) / f^2(y^{\alpha}_{p_{ij}})$, where $y^{\alpha}_{p_{ij}}$ is the true sample quantile and $p_{ij}$ corresponds to the $p_{ij}$th percentage point ($0 < p_{ij} < 1$); the covariance between $\epsilon_{ij}$ and $\epsilon_{kl}$ is given by $\mathrm{Cov}(\epsilon_{ij}, \epsilon_{kl}) = \sigma^2 (\min\{p_{ij}, p_{kl}\} - p_{ij} p_{kl}) / (f(y^{\alpha}_{p_{ij}})\, f(y^{\alpha}_{p_{kl}}))$. To approximate these terms, we use kernel density estimation.
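As a rough illustration of how the diagonal of the weight matrix could be approximated, the sketch below plugs a Gaussian kernel density estimate into the asymptotic variance $p(1-p)/f^2$ of a sample quantile. The bandwidth (a common rule of thumb), the plotting positions $p = (r - 3/8)/(N + 1/4)$, and all function names are assumptions of this example; the paper does not specify them.

```python
import math
from statistics import stdev


def gaussian_kde(data, bandwidth):
    """Return a function evaluating a Gaussian kernel density estimate."""
    norm = len(data) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm

    return density


def diagonal_weights(y_alpha):
    """Approximate p(1 - p) / f(y)^2 at each order statistic (sorted order)."""
    ys = sorted(y_alpha)
    n = len(ys)
    # Rule-of-thumb bandwidth; an assumption, not the paper's choice.
    h = 1.06 * stdev(ys) * n ** (-0.2)
    f_hat = gaussian_kde(ys, h)
    weights = []
    for rank, y in enumerate(ys, start=1):
        p = (rank - 0.375) / (n + 0.25)  # assumed plotting position
        weights.append(p * (1.0 - p) / f_hat(y) ** 2)
    return weights


w = diagonal_weights([-3.0, -1.0, 0.0, 1.0, 3.0])
```

Because the estimated density is small in the tails, the variances there are inflated, so a weighted fit gives tail quantiles less influence than central ones.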

Given the number of knots $k$ and their location $r^{(k)}$, the ML estimate of $\beta$ is obtained via the generalized least squares equations

$$Z' W^{-1} Z \hat{\beta} = Z' W^{-1} Y^{\alpha},$$

where $Z$ is an $N \times (k + 4)$ design matrix with $N = \sum_{i=1}^{n} d_i$, and $Y^{\alpha}$ is the vector of power-transformed observations. In the remainder, and to keep notation simple, we assume that $d_i = d$ for all individuals, so that $N = nd$.
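For a diagonal weight matrix, the generalized least squares equations reduce to weighted normal equations that are small enough to solve directly. The following standard-library sketch shows one way to do this; the helper names, the Gaussian elimination solver, and the noise-free toy data are assumptions made for illustration only.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def weighted_least_squares(design, y, w_diag):
    """Solve Z' W^{-1} Z beta = Z' W^{-1} y when W is diagonal."""
    p = len(design[0])
    zwz = [[sum(row[a] * row[b] / w for row, w in zip(design, w_diag))
            for b in range(p)] for a in range(p)]
    zwy = [sum(row[a] * yi / w for row, yi, w in zip(design, y, w_diag))
           for a in range(p)]
    return solve_linear(zwz, zwy)


# Toy check: noise-free data from a cubic spline with one knot at zero.
knots = [0.0]
true_beta = [5.0, 2.0, 0.1, 0.05, 0.3]
z_grid = [-2.0 + 0.5 * i for i in range(9)]
design = [[1.0, z, z ** 2, z ** 3] + [max(z - r, 0.0) ** 3 for r in knots]
          for z in z_grid]
y = [sum(b * x for b, x in zip(true_beta, row)) for row in design]
w_diag = [1.0 + abs(z) for z in z_grid]  # arbitrary positive variances
beta_hat = weighted_least_squares(design, y, w_diag)
```

With noise-free data the fit recovers the generating coefficients whatever positive weights are used; with real data the weights determine how strongly each quantile pulls on the fitted curve.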

In most applications, the weight matrix $W$ is very large (its dimension equals the number of observations $N$), and therefore computation of its inverse is impractical. To investigate whether estimates of the parameters in (1) are sensitive to a simplified formulation of the model, we considered an alternative representation for $W$ in our application: a diagonal matrix obtained by setting all off-diagonal elements of $W$ to 0.

We proceed as in Denison et al. (����) and specify prior distributions for the number of knots, $k$, and the location of the knots, $r^{(k)}$. We chose a discrete uniform prior distribution for the knot location, conditional on $k$, so that $r \mid k \sim \text{discrete } U(z_1, \ldots, z_{nd})$ (with additional constraints), and a Poisson distribution with rate $\lambda$ for the number of knots $k$, so that $k \sim \text{Poisson}(\lambda)$. In the example given in Section 4, we fix $\lambda$ at some (known) value.

3.1.1 Details of algorithm for transformation

The dimension of the parameter vector $\theta$ changes with $k$, the number of knots in model (1). As a result, we use reversible jump MCMC as discussed in Green (����) and Denison et al. (����) to simulate from the posterior distribution of $\theta$, which specifies the appropriate transformation.

The idea is simple. At each iteration $l = 1, \ldots, M_1$, a new knot can be introduced, an old knot can be deleted, or an old knot can be moved to a new location. Consequently, each iteration consists of three steps:

1. Choose type of move:

   • Birth of a new knot, with probability $b_k$.

   • Death of an existing knot, with probability $d_k$.

   • New location for a knot, with probability $\eta_k$.

2. Compute MLE $\hat{\beta}^{(k),l}$ and check monotonicity of $g^{-1}(\theta^{(k),l})$.

3. Accept move, with probability $\rho^{(l)}$ (defined below).

For $M_1$ large enough, the algorithm "converges". We monitor the behavior of the iterations using a mean squared error criterion computed as

$$\mathrm{MSE}^{(l)} = (nd)^{-1} \left(Y^{\alpha} - g^{-1}(\theta^{(l)})\right)' W^{-1} \left(Y^{\alpha} - g^{-1}(\theta^{(l)})\right). \tag{2}$$

Once the algorithm has "converged", we invert draws $l = 1, \ldots, m_1$, with $m_1 < M_1$, of functions $g^{-1}(\theta^{(l)})$ and evaluate each draw at the set of $nd$ values of $Y_{ij}^{\alpha}$ to obtain a sample of $\{X_{ij}\}^{(l)}$ that are approximately standard normal. That is,

$$\{X_{ij}\}^{(l)} = g^{(l)}(Y_{ij}^{\alpha}) \sim N(0, 1).$$

To compute the MLE of $\beta^{(k),l}$ we use generalized least squares as described in Section 3.1. Because $g^{-1}(\cdot)$ must be monotonic, at each step we check that the $l$th draw satisfies the condition by evaluating the derivative of $g^{-1}(\theta^{(l)})$ on a grid of values of $z$ given by the knots and midpoints between the knots. If these function evaluations are not all positive, we obtain an estimate of $\beta$ via linear programming, as the objective function and all constraints are linear in $\beta$. Nonmonotonicity may occur between midpoints and knots, and thus our approach does not guarantee that $g^{-1}(\theta^{(l)})$ is monotonic. However, we are reasonably confident that nonmonotonicity will usually be uncovered by focusing on the grid.
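The grid check can be sketched as follows, assuming the cubic spline is stored in truncated power form with coefficients ordered as intercept, linear, quadratic, cubic, and then one coefficient per knot. The function names are hypothetical, and the linear programming fallback used when the check fails is not shown.

```python
def spline_derivative(z, beta, knots):
    """Derivative in z of b0 + b1 z + b2 z^2 + b3 z^3 + sum_p b_{p+3} (z - r_p)^3_+."""
    slope = beta[1] + 2.0 * beta[2] * z + 3.0 * beta[3] * z ** 2
    for coef, r in zip(beta[4:], knots):
        if z > r:
            slope += 3.0 * coef * (z - r) ** 2
    return slope


def check_grid(knots):
    """Knots plus the midpoints between adjacent knots, in increasing order."""
    ks = sorted(knots)
    mids = [(a + b) / 2.0 for a, b in zip(ks, ks[1:])]
    return sorted(ks + mids)


def is_monotone_on_grid(beta, knots):
    """True if the derivative is positive at every grid point.

    With no knots the grid is empty and the check passes trivially.
    """
    return all(spline_derivative(z, beta, knots) > 0.0 for z in check_grid(knots))


grid = check_grid([-1.0, 1.0])
increasing = is_monotone_on_grid([0.0, 1.0, 0.0, 0.0, 0.5, 0.5], [-1.0, 1.0])
decreasing = is_monotone_on_grid([0.0, 1.0, 0.0, 0.0, -2.0, 0.0], [-1.0, 1.0])
```

As the text notes, a positive derivative on this grid does not rule out a sign change between grid points; a finer grid would tighten the check at extra cost.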

Given $k \sim p(k)$ and $c = ���$, we follow Denison et al. (����) and define $b_k = c \min\{1, p(k+1)/p(k)\}$, $d_k = c \min\{1, p(k-1)/p(k)\}$, and $\eta_k = 1 - b_k - d_k$, where $p(k)$ is the prior density for the number of knots. Note that for $k = �$, $b_k = �$, and for $k = k_{\max}$, $b_k = �$. With this formulation, the probability of accepting the proposed move has a very simple form:

$$\rho = \min\{1, \text{(likelihood ratio)} \times \text{(prior ratio)} \times \text{(proposal ratio)}\},$$

where

$$\begin{aligned}
\rho_{\text{birth}} &= \min\{1, \text{(likelihood ratio)} \times k^{*}\}, \\
\rho_{\text{death}} &= \min\{1, \text{(likelihood ratio)} \times (k^{*})^{-1}\}, \\
\rho_{\text{move}} &= \min\{1, \text{(likelihood ratio)}\},
\end{aligned} \tag{3}$$

and

$$k^{*} = \frac{nd - � - �k}{nd}.$$

The quantity $k^{*}$ is the ratio of the number of locations at which a knot may be placed to the number of data points. The above result is specific to a cubic spline; for additional details, see Denison et al. (����), p. ���.
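The move-type and acceptance probabilities can be sketched as below, assuming a Poisson prior on the number of knots. The treatment of the boundaries (no death move when there are no knots, no birth move at the maximum number of knots) is an assumption of this example, and the ratio $k^{*}$ is passed in as the argument `k_star` rather than computed. All function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass at k for rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def move_probabilities(k, lam, c, k_max):
    """Return (birth, death, move) probabilities for the current k.

    Requires c <= 0.5 so that the move probability cannot be negative.
    """
    p_k = poisson_pmf(k, lam)
    birth = c * min(1.0, poisson_pmf(k + 1, lam) / p_k) if k < k_max else 0.0
    death = c * min(1.0, poisson_pmf(k - 1, lam) / p_k) if k > 0 else 0.0
    return birth, death, 1.0 - birth - death


def acceptance_probability(move, likelihood_ratio, k_star):
    """Acceptance probability for a 'birth', 'death' or 'move' proposal."""
    if move == "birth":
        return min(1.0, likelihood_ratio * k_star)
    if move == "death":
        return min(1.0, likelihood_ratio / k_star)
    return min(1.0, likelihood_ratio)


b, d, eta = move_probabilities(k=2, lam=2.0, c=0.4, k_max=10)
```

For a Poisson prior the two prior ratios simplify to `lam / (k + 1)` and `k / lam`, so birth moves become less likely as the number of knots grows past the prior rate.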

3.2 Measurement error model

We make the assumption that the measurement error is additive in the normal scale. Using $m_1 < M_1$ sets of transformed values, $\{X_{ij}\}^{(l)}$, $l = 1, \ldots, m_1$, we fit an additive measurement error model (MEM) as proposed by Nusser et al. (����):

$$X_{ij}^{(l)} = x_i^{(l)} + u_{ij}^{(l)}, \tag{4}$$

where $x_i^{(l)}$ is the usual intake of the nutrient for the $i$th individual for the $l$th draw, and $u_{ij}^{(l)}$ is the measurement error for the $i$th individual on the $j$th day, in the normal scale for the $l$th draw.

There may be considerable heterogeneity of the measurement error variances across individuals (see, e.g., Nusser et al., ����), so we formulate our MEM as a hierarchical model with three levels. We omit the superscript that denotes draw to keep the notation simple, but it is important to remember that the hierarchical model is formulated for each draw $\{X_{ij}\}^{(l)}$, $l = 1, \ldots, m_1$.

In level 1, the individual's daily intake is modelled as a normally distributed random variable with mean equal to the individual's usual intake and with a subject-specific measurement error variance:

$$X_{ij} \mid x_i, \sigma^2_{u_i} \sim N(x_i, \sigma^2_{u_i}).$$

In level 2, we model the heterogeneity in the usual intakes and in the measurement error variances across individuals:

$$x_i \mid \mu_x, \sigma^2_x \sim N(\mu_x, \sigma^2_x),$$

$$\log(\sigma^2_{u_i}) \mid \mu_A, \sigma^2_A \sim N(\log(\mu_A), \sigma^2_A).$$

Finally, in level 3 we place flat priors on the remaining hyperparameters:

$$\mu_x,\ \sigma^2_x,\ \log(\mu_A),\ \sigma^2_A \sim \text{Uniform}.$$

We use the Gibbs sampler to draw values from the posterior distribution of the parameters in the hierarchical MEM model. All full conditionals are of standard form, with the exception being the full conditional distribution of $\sigma^2_{u_i}$, which is proportional to

$$\pi(\log(\sigma^2_{u_i}) \mid x_i, \mu_A, \sigma^2_A, X_i) \propto \prod_j (\sigma^2_{u_i})^{-1/2} \exp\left\{-\frac{1}{2\sigma^2_{u_i}} \sum_j (X_{ij} - x_i)^2\right\} \times \exp\left\{-\frac{1}{2\sigma^2_A} \left(\log(\sigma^2_{u_i}) - \log(\mu_A)\right)^2\right\},$$

where $X_i = (X_{i1}, \ldots, X_{id})'$. To draw values from $\pi(\log(\sigma^2_{u_i}) \mid x_i, \mu_A, \sigma^2_A, X_i)$, we use a Metropolis-Hastings algorithm (e.g., Smith and Roberts, ����) with a normal approximation to the full conditional distribution of $\log(\sigma^2_{u_i})$ as a candidate density.
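The non-standard update can be illustrated with a small sketch of the log full conditional of the log measurement error variance for one individual, together with a single Metropolis step. For simplicity the sketch swaps in a random-walk proposal instead of the normal-approximation candidate density described in the text; the function names, the step size, and the toy values are assumptions.

```python
import math
import random


def log_full_conditional(log_s2, x_days, x_usual, mu_a, s2_a):
    """Log of the full conditional of log(sigma^2_u), up to an additive constant."""
    s2 = math.exp(log_s2)
    d = len(x_days)
    ss = sum((x - x_usual) ** 2 for x in x_days)
    return (-0.5 * d * log_s2
            - ss / (2.0 * s2)
            - (log_s2 - math.log(mu_a)) ** 2 / (2.0 * s2_a))


def metropolis_step(log_s2, x_days, x_usual, mu_a, s2_a, step, rng):
    """One random-walk Metropolis update (a stand-in for the paper's candidate)."""
    proposal = log_s2 + rng.gauss(0.0, step)
    log_ratio = (log_full_conditional(proposal, x_days, x_usual, mu_a, s2_a)
                 - log_full_conditional(log_s2, x_days, x_usual, mu_a, s2_a))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return log_s2


# Toy run for one individual with two transformed daily intakes.
rng = random.Random(1)
chain = [0.0]
for _ in range(500):
    chain.append(metropolis_step(chain[-1], [1.0, -1.0], 0.0, 1.0, 1.0, 0.5, rng))
```

A symmetric random-walk proposal cancels in the acceptance ratio, which keeps the step short; a well-matched normal approximation, as used in the paper, would typically accept more often.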

For each transformed sample $\{X_{ij}\}^{(l)}$, $l = 1, \ldots, m_1$, we obtained $M_2$ draws from the joint posterior distribution of $\{\mu_x, \sigma^2_x, \mu_A, \sigma^2_A\}$. For $m_2 < M_2$ of these, we simulated sets of $(\tilde{x}_i^{(s)}, \tilde{\sigma}^{2(s)}_{u_i})$, $i = 1, \ldots, n$, $s = 1, \ldots, m_2$, from $x_i \mid \mu_x^{(s)}, \sigma_x^{2(s)}$ and $\log(\sigma^2_{u_i}) \mid \mu_A^{(s)}, \sigma_A^{2(s)}$, respectively, to transform back to original scale. Note that by sampling from the population as opposed to transforming back the original subjects, we are accounting for the additional variability of only having a finite (incomplete) sample of individuals from the population.


3.3 Transformation back to original scale

As we described earlier, for each of the $m_2$ draws, we obtained a sample of $n$ usual intakes and $n$ measurement error variances $(\tilde{x}_i^{(s)}, \tilde{\sigma}^{2(s)}_{u_i})$ in the normal scale. To make inferences about the quantiles of the intake distribution, we now need to transform the usual intake draws back to the original scale. By definition,

$$\tilde{y} = E\{Y \mid x = \tilde{x}\} = E\{g^{-1}(\tilde{x} + u) \mid x = \tilde{x}\}.$$

To estimate this expectation, for each $(\tilde{x}_i, \tilde{\sigma}^2_{u_i})$ draw, we generate a large number $q$ of $u_{ij}$ from $u_{ij} \sim N(0, \tilde{\sigma}^2_{u_i})$ and approximate the expectation using a Monte Carlo mean, $\tilde{y}_i = q^{-1} \sum_{j=1}^{q} g^{-1}(\tilde{x}_i + u_{ij})$. The number of Monte Carlo replicates $q$ is chosen so as to obtain the required precision for $\tilde{y}_i$.
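The Monte Carlo mean can be sketched as follows. Here `g_inverse` stands for any callable that maps a value in the normal scale back to the original intake scale; in the check it is replaced by the exponential function purely because its expectation is known in closed form. The function name and the toy inputs are assumptions.

```python
import math
import random


def back_transform(x_usual, s2_u, g_inverse, q, rng):
    """Monte Carlo estimate of E[g_inverse(x + u)] with u ~ N(0, s2_u)."""
    sd = math.sqrt(s2_u)
    total = 0.0
    for _ in range(q):
        total += g_inverse(x_usual + rng.gauss(0.0, sd))
    return total / q


# Check against a case with a known answer: E[exp(x + u)] = exp(x + s2_u / 2).
rng = random.Random(0)
estimate = back_transform(0.0, 0.25, math.exp, 20000, rng)
exact = math.exp(0.25 / 2.0)
```

Averaging over the measurement error, rather than evaluating the inverse transformation at the usual intake alone, matters because the inverse transformation is nonlinear: in the check above the naive value would be 1, while the expectation is about 1.13.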

For $m_1$ transformations and $m_2$ samples of usual intakes from the measurement error model, we get $m_1 \times m_2$ samples of size $n$, $\{\tilde{y}_i\}^{(t)}$, $t = 1, \ldots, m_1 \times m_2$, from which we can approximate marginal posterior distributions of interest. For example, we derive the marginal posterior distribution of percentiles of the usual intake distribution of interest, $\Pr\{y^{(t)} \le a\} = \gamma$, for $\gamma$ = ����, ����, ..., ����. We discuss this further in Section 4.
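One way to summarize the back-transformed samples is to compute the percentile of interest within each sample and then summarize those values across samples. The sketch below does this with an interpolated empirical quantile; the quantile definition, the default interval level, and the function names are assumptions of this example.

```python
def percentile(sample, prob):
    """Empirical quantile with linear interpolation between order statistics."""
    ys = sorted(sample)
    pos = prob * (len(ys) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ys) - 1)
    return ys[lo] + (pos - lo) * (ys[hi] - ys[lo])


def posterior_summary(samples, prob, level=0.95):
    """Posterior median and equal-tailed interval of a usual intake percentile.

    `samples` is a list of back-transformed samples, one per posterior draw.
    """
    draws = [percentile(s, prob) for s in samples]
    tail = (1.0 - level) / 2.0
    return (percentile(draws, 0.5),
            percentile(draws, tail),
            percentile(draws, 1.0 - tail))


# Toy input: five samples whose medians are 5, 6, 7, 8 and 9.
toy_samples = [[float(i + t) for i in range(11)] for t in range(5)]
median, lower, upper = posterior_summary(toy_samples, 0.5)
```

Because each sample comes from a different draw of the transformation and of the measurement error model, the spread of the per-sample percentiles carries the uncertainty from all stages of the procedure.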

3.4 Summary of Complete Algorithm

The three stages described in Section 3 can be summarized as follows:

1. Draw transformations $g^{(l)}(Y) = X$, $l = 1, \ldots, M_1$.

2. Obtain transformed intakes $X^{(l)}_{11}, \ldots, X^{(l)}_{nd}$ for $l = 1, \ldots, m_1$ out of $M_1$ draws.

3. Using transformed sample $\{X_{ij}\}^{(l)}$, fit MEM

$$X_{ij}^{(l)} = x_i^{(l)} + u_{ij}^{(l)}$$

via Gibbs, and obtain $m_1 \times m_2$ samples ($m_2$ out of $M_2$ draws for MEM),

$$(\tilde{x}_1^{(l,s)}, \tilde{\sigma}^{2(l,s)}_{u_1}), \ldots, (\tilde{x}_n^{(l,s)}, \tilde{\sigma}^{2(l,s)}_{u_n}), \quad l = 1, \ldots, m_1, \quad s = 1, \ldots, m_2.$$

4. Back-transform:

$$\tilde{y}_i^{(l,s)} = E^{*}_{q}\{g^{-1(l)}(x + u) \mid x = \tilde{x}_i^{(l,s)}\},$$

where $E^{*}_{q}(\cdot)$ is the MC average over draws $u_v^{(l,s)} \sim N(0, \tilde{\sigma}^{2(l,s)}_{u_i})$, $v = 1, \ldots, q$.

5. Obtain marginal posterior distributions of percentiles of $f^{(l,s)}(y)$ and other relevant quantities.

4 Example

As stated in Section �, we now illustrate the methodology using a cohort of females and males, ages ����, from CSFII ��������. The female cohort consisted of ��� individuals, each of whom had dietary data collected on two noncontiguous days. The male cohort consisted of ��� individuals, also with two nonconsecutive days of dietary intake data each. We focus on six dietary components: calcium, cholesterol, iron, protein, vitamin A, and vitamin C. In the case of calcium, iron, protein, vitamin A, and vitamin C, we are interested in estimating the proportion of teenagers whose usual intakes do not meet recommendations. In the case of cholesterol, we are concerned with excessive intakes, and thus focus on the right tail of the distribution.

4.1 Performance of algorithm

The reversible jump MCMC algorithm worked well. Figures � and � show two realizations ($l$ = � and $l$ = �� ���) from the posterior distribution of $g(\cdot)$ and the pairs $(Y_{ij}^{\alpha}, z_{ij})$, for females and males respectively, for each dietary component. We see from these figures that the WLS procedure places more weight on the center of the distribution and less weight on the tails (where there is considerably more variability). The transformation draws shown in the figures correspond to the case where the weight matrix $W$ was taken to be diagonal. The reversible jump MCMC algorithm converges quickly as monitored by the MSE (2) and to the same value based on multiple starting points (not shown in figures).

For the prior distribution on the number of knots, we set $\lambda$ = �. We chose a small value for the mean number of knots as the data had already been power-transformed, and just a few additional knots are likely to be needed to complete the transformation to normality. Results were not sensitive to changes in the value of $\lambda$ in the range ���. The number of knots drawn from the posterior distribution for the various dietary components ranged from about two to fourteen.

We monitored the convergence of the Markov chain of the parameters of the measurement error model using Gelman and Rubin-type statistics (Gelman and Rubin, ����) and autocorrelation plots (as suggested in Cowles and Carlin, ����). The convergence again was rather quick (within about ��� iterations).

For posterior inference, we sampled $m_1$ = �� transformations every ��th iteration after a burn-in of ����� ($M_1$ = ���), and for each transformation, sampled $m_2$ = �� iterations every ��th iteration after a burn-in of ���� ($M_2$ = ���) from the measurement error model, for a total of ��� back-transformed samples of size ��� for females (��� for males) of the usual intakes, for which we compute posterior medians and ��% credible intervals (using the ���th and ����th quantiles of the posterior distribution) of the quantiles, and compute density plots.

4.2 Choice of weight matrix

As mentioned earlier, the weight matrix $W$ has dimensions $N \times N$. In our example, $N$ = ��� × � for females, and $N$ = ��� × � for males. As $N$ can be quite large, the inversion of $W$ can be impractical and very time consuming. Thus, we investigated whether results would be sensitive to using a simplified (diagonal) version of $W$ for computation.

We chose to use the diagonal weight matrix for model-fitting as a compromise, since we can account for the extra variability of the quantiles in the tails and yet keep computations manageable. Because a kernel density estimator is used to estimate the density at the quantiles, use of the full weight matrix $W$ may result in a procedure that is not only inconvenient from a computational point of view, but also unstable, as density estimates at the tails get very small.

To decide whether results are sensitive to the choice of a diagonal version of $W$ vis-à-vis the "complete" version, we repeated the analyses using both forms of the weight matrix, for several of the dietary components under consideration. We only show results obtained for vitamin C (females), which appear in Table � and Figure �.

Ignoring the off-diagonal elements of $W$ in the computations had very little effect on the final results. Estimates of quantities of interest, such as the mean, the standard deviation, and the quantiles of usual intakes, are very similar regardless of the weight matrix chosen. For example, every ��% credible interval obtained using the diagonal weight matrix covers the corresponding point estimate obtained using the full weight matrix, and in fact, most point estimates are within a standard deviation of each other.

4.3 CSFII ��������

We applied the method we propose to dietary intake data collected in the CSFII during the period ��������, for the two cohorts described in Section �. Figs. � and � display two estimates of the usual intake distribution of each dietary component for females and males, respectively. The density estimates drawn in dotted lines correspond to the distribution of individual two-day means. These observed "mean" distributions are skewed for all dietary components except protein, whose empirical mean distribution is almost symmetric but leptokurtic. As a result, it would not be appropriate to fit a normal measurement error model to intake data to remove the within-individual variance. Thus, a different parametric form must be chosen for the distribution of observed intake means, or dietary intake data should be transformed into normality prior to variance estimation.

Following the approach described in Section 3, we obtained the usual intake density estimates shown in solid lines in Figs. � and �. The figures show, as expected, that after removal of measurement error, the estimated distributions of usual intakes have smaller variability than the distributions of two-day means.

Tables � and � show the mean, standard deviation, and selected percentiles of the distribution of observed individual means for each dietary component, for females and males, respectively. In addition, the tables also show the ratio of within- to between-individual variances for each dietary component. These variance ratios are all close to one, indicating that the measurement error variances are of about the same order of magnitude as the between-individual variances. Therefore, these within-individual variance components cannot be ignored.

Tables � and � show the mean and the ���th and ����th percentiles of the posterior distribution of the mean, standard deviation, and selected percentiles of the usual intake distribution for each dietary component, for females and males, respectively. A comparison of the entries in Tables � and � to those in Tables � and � confirms what Figs. � and � show: the estimated usual intake distributions have less variability and lighter tails, which result from the model's removal of the measurement error in the observed daily intakes. The differences between the two estimated densities can be large; ��% credible intervals in Tables � and � often do not contain the corresponding quantile of the observed individual mean distributions. This is particularly noticeable in the upper tail of the distributions.

Table � shows the mean and the ���th and ����th percentiles of the posterior distribution of the prevalence of nutrient inadequacy or, in the case of cholesterol, the prevalence of excessive intake, for females and males. Here, we estimate the prevalence of nutrient inadequacy as the proportion of individuals whose usual intake of the dietary component is less than ��% of the Recommended Dietary Allowance (RDA; e.g., NRC, ����, page ���) for the nutrient (see, e.g., Carriquiry, ����; IOM, ����). For calcium, iron, protein, vitamin A, and vitamin C, Table � shows selected attributes of the posterior distribution of Pr(y < ���� RDA), for females and males, respectively. In the case of cholesterol, we show the mean, �th and ��th percentiles of the posterior distribution of Pr(y > ��� mg). The interpretation of the entries in the table is the usual one. For example, for females, the point estimate of the prevalence of nutrient inadequacy for calcium is ���, and a posteriori, the probability that prevalence is between ��� and ��� is ���.
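As a rough illustration of how such posterior summaries of prevalence can be computed, the sketch below assumes that each posterior draw yields a simulated set of usual intakes for the population. The function `prevalence_summary`, the cutoff, the default percentile levels, and the simulated draws are all hypothetical placeholders, not values or code from the paper.

```python
import numpy as np

def prevalence_summary(usual_draws, cutoff, lower=2.5, upper=97.5):
    """Posterior mean and percentiles of Pr(usual intake < cutoff).

    usual_draws: array of shape (n_draws, n_individuals); each row holds
    usual intakes simulated under one posterior draw of the parameters.
    The percentile levels are assumed defaults, not the paper's values.
    """
    draws = np.asarray(usual_draws, dtype=float)
    # Prevalence below the cutoff, computed separately for each draw.
    prev = (draws < cutoff).mean(axis=1)
    lo, hi = np.percentile(prev, [lower, upper])
    return float(prev.mean()), float(lo), float(hi)

# Hypothetical posterior draws: the population mean varies across draws.
rng = np.random.default_rng(1)
centers = rng.normal(800.0, 20.0, size=400)
usual_draws = rng.normal(centers[:, None], 200.0, size=(400, 2000))
post_mean, post_lo, post_hi = prevalence_summary(usual_draws, cutoff=700.0)
print(f"prevalence: {post_mean:.3f} ({post_lo:.3f}, {post_hi:.3f})")
```

For a quantity such as excessive cholesterol intake, the same idea applies with the inequality reversed.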

� Discussion

The analysis of dietary intake data is challenging, even if we do not take into account the various sources of biases and errors that are often present in this type of data. It is recognized (see, e.g., IOM, ����) that individuals tend to underreport the amount of food they consume. The extent of the underreporting is known to vary by nutrient and by gender-age-ethnic group, but little additional information about the direction and size of the biases is available. Attempts have been made to calibrate reported intake using various biochemical markers (see, e.g., IOM, ����). These methods, however, are still in the experimental stage, are very costly, and are useful to adjust energy intakes at best. Nothing is known about the underreporting of, for example, trace minerals.

It is also known that the USDA databases used to map foods into nutrients are not always error-free (Schubert et al. ����; Haytowitz et al. ����). For example, the USDA databases lack precise information on the folate content of foods, as a national fortification effort that adds folate to various food items was implemented only in ���� (IOM, ����). In this work, we do not take into account these potential sources of biases in dietary intake data. Rather, we focus on the problem of developing appropriate methods to analyze the data.

Estimating usual intake distributions of nutrients from dietary intake data can be difficult, as was argued in Section �. The approach we have chosen consists of transforming the observed intakes into the normal scale, removing the measurement error in the normal scale, and then transforming individual estimated usual intakes back into the original scale. An alternative approach consists of using a parametric model other than the normal to represent the relationship between observed and usual intakes. For example, a Weibull or a Gamma distribution might be an appropriate representation for the distribution of intakes in the population. This approach has the drawback that each new dietary component would require the identification of the most suitable model, thereby limiting the usefulness of the method for researchers in nutrition and areas other than statistics.
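The three steps of the transformation approach (transform, remove measurement error, back-transform) can be sketched in a deliberately simplified form. This is not the method of the paper: it swaps in a fixed power transformation instead of an estimated power-plus-spline function, uses moment estimators of the variance components, and back-transforms with a naive inverse power. The function name `usual_intake_sketch`, the power 0.5, and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def usual_intake_sketch(daily, power=0.5):
    """Crude transform / shrink / back-transform estimate of usual intakes.

    daily: (n individuals x d days) array of positive intakes.
    """
    # Step 1: move to a scale where intakes are closer to normal.
    y = np.asarray(daily, dtype=float) ** power
    n, d = y.shape
    means = y.mean(axis=1)
    grand = means.mean()
    # Step 2: estimate variance components and shrink individual means so
    # that their spread matches the between-individual variance only.
    within = ((y - means[:, None]) ** 2).sum() / (n * (d - 1))
    var_means = means.var(ddof=1)
    between = max(var_means - within / d, 0.0)
    shrink = np.sqrt(between / var_means)
    adjusted = grand + shrink * (means - grand)
    # Step 3: naive back-transformation to the original scale.
    return adjusted ** (1.0 / power)

# Simulated two-day intakes that are normal on the square-root scale.
rng = np.random.default_rng(2)
usual_root = rng.normal(10.0, 1.5, size=3000)
daily = (usual_root[:, None] + rng.normal(0.0, 2.0, size=(3000, 2))) ** 2
estimated = usual_intake_sketch(daily, power=0.5)
two_day_means = daily.mean(axis=1)
print(f"sd of two-day means: {two_day_means.std():.1f}")
print(f"sd of adjusted intakes: {estimated.std():.1f}")
```

Even in this crude version, the adjusted intakes are less variable than the two-day means, which is the qualitative behavior described in the text.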

The normal-scale measurement error model we propose in Section ��� makes an assumption that is not necessarily satisfied: that once observed individual intake means are transformed into normality, both the usual intake and the measurement error components are also normally distributed. This is not necessarily so, although informal tests suggest that for all the dietary components we investigated, the assumptions of model (�) appear to hold. A deconvolution approach that guarantees that both the usual intakes and the measurement errors are normally distributed has also been proposed (Stefanski and Carroll, ����; Stefanski and Carroll, ����; Chen, ����). For the specific case of dietary intake data, Chen (����) argues that results obtained using a deconvolution approach are not noticeably different from those obtained by Nusser et al. (����) using a frequentist version of the method we discuss in this manuscript.

We argue in Section � that a Bayesian framework is the most appropriate in this estimation problem, as the method for estimating usual nutrient intake distributions consists of several steps. Because the estimated transformation into normality and the estimated variance components in the measurement error model are used as if they were true values, the standard errors for estimators of the parameters of the usual intake distribution in the Nusser et al. (����) approach underestimate the true uncertainty about the value of those parameters. An advantage of the Bayesian paradigm is that it permits proper accounting of all uncertainties, so that the posterior variance of, for example, the prevalence of nutrient inadequacy reflects the uncertainty about all parameters in the model. Thus, we expect that the ��% credible intervals obtained from the marginal posterior distributions will be wider than the ��% confidence intervals obtained from a frequentist analysis such as that presented by Nusser et al. (����). Direct comparison of the Bayesian and frequentist approaches is not possible, as the model used for the transformation function in this paper is different from the one used in the Nusser et al. (����) manuscript. Nonetheless, we carried out the analysis using the frequentist version of the method. Computations were done using C-SIDE (Iowa State University, ����), software developed to implement the Nusser et al. (����) method. Results obtained from a frequentist viewpoint are presented in Tables � and �, for females and males, respectively. Point estimates of percentiles are somewhat similar when comparing both approaches. The ��% credible sets, however, tend to be wider, and need not be symmetric around the posterior means of the percentiles.

In our example, we estimated the prevalence of nutrient inadequacy in the population as the proportion of individuals with usual intakes below ��% of the RDA (NRC, ����) for the nutrient. It has been argued (e.g., Beaton, ����; Carriquiry, ����) that the appropriate cutoff is the median of the distribution of requirements in the population, rather than the RDA. The National Academy of Sciences, however, has not yet published the value of the median requirement for any gender-age group. The exception is calcium, for which the Academy of Sciences has concluded that the median requirement for any group cannot be determined with the information that is currently available about calcium intakes and requirements (IOM, ����b). Under simple assumptions, ��% of the RDA is approximately equal to the median requirement of the nutrient.
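One way to see how a fixed fraction of the RDA can approximate the median requirement is the following sketch. The assumptions are illustrative and are not spelled out in the text: requirements are taken to be symmetrically distributed with median $\mu_R$, standard deviation $\sigma_R$, and coefficient of variation $c = \sigma_R / \mu_R$, and the RDA is taken to be set two standard deviations above the median requirement.

```latex
\mathrm{RDA} \;=\; \mu_R + 2\sigma_R \;=\; \mu_R\,(1 + 2c)
\quad\Longrightarrow\quad
\mu_R \;=\; \frac{\mathrm{RDA}}{1 + 2c}.
```

Under these assumptions the fraction depends only on $c$; for instance, $c = 0.15$ gives $\mu_R = \mathrm{RDA}/1.3 \approx 0.77\,\mathrm{RDA}$, while other values of $c$ give other fractions.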

In Section ���, we used a generalized least squares approach to estimate the parameters of the function that transforms daily intakes into normality. An alternative approach is as follows: define $g$ to be the function $g^{-1}(z, \theta)$ such that $P(Y^{*}_{ij} \le y) = P(Z \le g^{-1}(y))$, where $Z$ is distributed as a standard normal random variable. Again consider a cubic spline form for the function $g$. In this case, an iterative procedure is needed to obtain maximum likelihood estimates of the parameters in the model. If we let $\theta = (k, r^{(k)})$, as before, the likelihood for this model is

$$\prod_{i=1}^{n} \prod_{j=1}^{d_i} f\left(g^{-1}(y_{ij}, \theta)\right),$$

where $f$ denotes a standard normal density. To estimate the parameters in this model, we obtain an initial value for the parameters using GLS, and then carry out a single Newton-Raphson step to approach the MLE using analytic derivatives.
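The one-step strategy (a reasonable starting value followed by a single Newton-Raphson update with analytic derivatives) can be sketched generically. The example below does not use the spline likelihood; as a stand-in it estimates the location of a logistic distribution, with the sample median playing the role of the GLS starting value. The function `one_step_newton` and the toy model are assumptions made for illustration.

```python
import numpy as np

def one_step_newton(theta0, score, hessian):
    """One Newton-Raphson update: theta1 = theta0 - H(theta0)^{-1} s(theta0)."""
    theta0 = np.atleast_1d(np.asarray(theta0, dtype=float))
    step = np.linalg.solve(hessian(theta0), score(theta0))
    return theta0 - step

# Toy data: logistic with unknown location and known unit scale.
rng = np.random.default_rng(3)
x = rng.logistic(loc=2.0, scale=1.0, size=2000)

def score(theta):
    # Analytic first derivative of the logistic log-likelihood in the location.
    F = 1.0 / (1.0 + np.exp(-(x - theta[0])))
    return np.array([np.sum(2.0 * F - 1.0)])

def hessian(theta):
    # Analytic second derivative (a 1 x 1 matrix here).
    F = 1.0 / (1.0 + np.exp(-(x - theta[0])))
    return np.array([[-2.0 * np.sum(F * (1.0 - F))]])

theta_init = np.array([np.median(x)])   # stand-in for the GLS initial value
theta_one_step = one_step_newton(theta_init, score, hessian)
print(theta_init, theta_one_step)
```

When the starting value is already close to the maximum, as a consistent initial estimate typically is in large samples, a single step usually brings the score much closer to zero, which is the rationale for not iterating to convergence.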

We have discussed the specific problem of estimating usual nutrient intake distributions, and presented an application consisting of estimating the prevalence of nutrient inadequacy among teenagers using dietary intake data collected between ���� and ����. Several related problems still require investigation. An extension of the methods presented here to the case where the usual intake distributions of foods are of interest is not straightforward. Difficulties arise because in the case of foods, it is important to consider not only the amount of a food consumed, but also the probability that the individual would have consumed the food on the day when the interview was conducted. For many food items, the probability of consumption is not independent of the amount consumed, so estimating the marginal distribution of usual intake of foods can be challenging. Yet, the problem is an important one, as the distribution of usual food intakes is required to assess exposure rates to toxicants found in the food supply in a group.

Ratios of dietary components are also of importance. For example, researchers may be interested in assessing the proportion of individuals in a group who consume, on average, more than ��% of calories from fat, or more than ��% of calories from saturated fat. The methods presented in this paper for estimating the usual intake distribution for a nutrient cannot be directly applied to ratios of dietary components such as those described above. Typically, both the numerator and the denominator in the ratio are observed subject to measurement error, and cannot be assumed to be independent.

References

Beaton, G.H. (����). Criteria of an adequate diet. In: Shils, R.E., Olson, J.A., Shike, M. (eds.), Modern Nutrition in Health and Disease. Lea and Febiger, Philadelphia.

Blom, G. (����). Statistical Estimates and Transformed Beta Variables. Wiley, New York.

Carriquiry, A.L. (����). Assessing the prevalence of nutrient inadequacy. Public Health Nutrition. In press.

Carroll, R.J., Freedman, L.S., and Hartman, A.M. (����). Use of semiquantitative food frequency questionnaires to estimate the distribution of usual intake. American Journal of Epidemiology, �����������


Chen, C. (����). Spline estimators of the distribution function of a variable measured with error. Doctoral Thesis, Department of Statistics, Iowa State University.

Cowles, K., and Carlin, B.S. (����). Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, ��������

Denison, D.G.T., Mallik, B.K., and Smith, A.F.M. (����). Automatic Bayesian curve fitting. Applied Statistics, ����������

Dodd, K. (����). A Technical Guide to C-SIDE. Technical Report ��-TR ��, Dietary Assessment Research Series Report �, Department of Statistics and Center for Agricultural and Rural Development (CARD), Iowa State University, Ames.

Eckert, R.S., Carroll, R.J., and Wang, N. (����). Transformations to additivity in measurement error models. Biometrics, ����������

Gelman, A., and Rubin, D.B. (����). Inference from iterative simulation using multiple sequences. Statistical Science, ���������

Green, P.J. (����). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, ����������

Haytowitz, D.B., Pehrsson, P.R., Smith, J., Gebhardt, S.E., Mathews, R.H., and Anderson, B.A. (����). Key foods: setting priorities for nutrient analysis. Journal of Food Composition and Analysis, ���������

Huang, E.T., and Fuller, W.A. (����). Nonnegative regression estimation for sample survey data. ASA Proceedings of the Social Statistics Section, �������

Institute of Medicine (����a). Dietary Reference Intakes: Thiamin, Riboflavin, Niacin, Vitamin B6, Folate, Vitamin B12, Pantothenic Acid, Biotin, and Choline. Preprint. National Academy Press, Washington, DC.

Institute of Medicine (����b). Dietary Reference Intakes: Calcium, Phosphorus, Magnesium, Vitamin D, and Fluoride. Preprint. National Academy Press, Washington, DC.

Department of Statistics and Center for Agricultural and Rural Development, Iowa State University (����). A User's Guide to C-SIDE: Software for Intake Distribution, Version ���. Technical Report ��-TR ��, Center for Agricultural and Rural Development, Iowa State University, Ames.

National Research Council (����). Nutrient Adequacy. National Academy Press, Washington, DC.

National Research Council (����). Recommended Dietary Allowances, ��th ed. National Academy Press, Washington, DC.

Nusser, S.M., Carriquiry, A.L., Dodd, K.W., and Fuller, W.A. (����). A semiparametric transformation approach to estimating usual daily intake distributions. Journal of the American Statistical Association, ������������

Schubert, A., Holden, J.M., and Wolf, W.R. (����). Selenium content of a core group of foods based on a critical evaluation of published analytical data. Journal of the American Dietetic Association, ����������

Smith, A.F.M., and Roberts, G.O. (����). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B, �������

Schervish, M. (����). Theory of Statistics. Springer-Verlag, New York.


Spiegelhalter, D.J., Best, N.G., Gilks, W.R., and Inskip, H. (����). Hepatitis B: a case study in MCMC methods. In: Markov Chain Monte Carlo in Practice, eds. Gilks, W.R., Richardson, S., Spiegelhalter, D.J. Chapman and Hall, pp. �������

Stefanski, L.A., and Bay, J.M. (����). Simulation extrapolation deconvolution of finite population cumulative distribution function estimators. Biometrika, ����������

Stefanski, L.A., and Carroll, R.J. (����). Deconvoluting kernel density estimators. Statistics, ����������

Stefanski, L.A., and Carroll, R.J. (����). Deconvolution-based score tests in measurement error models. The Annals of Statistics, ����������

U.S. Department of Agriculture, Agricultural Research Service (����). Continuing Survey of Food Intakes by Individuals, ��� ������ CSFII Report. Washington, DC, U.S. Government Printing Office.


Figure �: Power transformed intake data $Y^{*}_{ij}$ (points), and two draws of the transformation function, at the �nd (solid line) and �����th iterations (dashed line). Data correspond to females.


Figure �: Power transformed intake data $Y^{*}_{ij}$ (points), and two draws of the transformation function, at the �nd (solid line) and �����th iterations (dashed line). Data correspond to males.


Figure �: Densities of the usual intake of vitamin C for females aged ��, estimated using the diagonal and non-diagonal forms of the weight matrix W in the transformation into normality.


Figure �: Estimated densities of the usual intake of dietary components in females aged ��. The dotted curves correspond to the distribution of two-day means. The solid curves correspond to the Bayesian estimator described in Section �.


Figure �: Estimated densities of the usual intake of dietary components in males aged ��. The dotted curves correspond to the distribution of two-day means. The solid curves correspond to the Bayesian estimator described in Section �.


Diagonal W Full W

Mean ���� ��������� ����� ����� �����

Std� Dev� ���� ��������� ����� ����� �����

�st percentile ���� �������� ����� ���� �����

�th percentile ���� ��������� ����� ����� �����

�th percentile ���� ��������� ����� ����� �����

�th percentile ���� ��������� ����� ����� �����

th percentile ����� ����������� ������ ������ ������

�th percentile ����� ����������� ������ ������ ������

th percentile ����� ����������� ������ ������ ������

Table �: Mean, standard deviation, and quantiles of the usual intake distribution of vitamin C for females ��. Values in parentheses are the ��% credible intervals. Estimates were obtained using a diagonal and a full weight matrix for estimation of the transformation into normality.


Calcium Cholesterol Iron Protein Vit A Vit C

mg� mg� mg� g� �g� mg�Mean ��� ��� ���� ���� ��� ����Std Dev ��� ��� ��� ���� ��� ����Ratio �� ��� ��� ��� ��� ����st ��� �� ��� ���� ��� ����th ��� �� ��� ���� ��� �����th ��� �� ��� ���� ��� �����th ��� ��� ���� ���� ��� ����th ���� ��� ���� ���� ���� ������th ���� ��� ���� ����� ���� �����th ���� ��� ���� ����� ���� �����

Table �: Mean, standard deviation, and selected percentiles of the distribution of two-day individual means, and ratio of within- to between-individual variance in intakes for females aged ��.

Calcium Cholesterol Iron Protein Vit A Vit C

mg� mg� mg� g� �g� mg�Mean ���� ��� ���� ����� ���� �����Std Dev ��� ��� ���� ���� ��� ����Ratio �� ��� ��� ��� ��� ����st ��� �� ��� ���� ��� ����th ��� ��� ��� ���� ��� �����th ��� ��� ���� ���� ��� �����th ���� ��� ���� ���� ��� ����th ���� ��� ���� ����� ���� ������th ���� ��� ���� ����� ���� �����th ���� ��� ���� ����� ���� �����

Table �: Mean, standard deviation, and percentiles of the distribution of two-day individual means, and ratio of within- to between-individual variances in intakes for males aged ��.


Calcium Cholesterol Iron Protein Vit A Vit C

mg� mg� mg� g� �g� mg�Mean ��� ��� ���� ���� ��� ����

���� ���� ���� ���� ����� ����� ����� ����� ���� ���� ����� �����Std� ��� �� ��� ���� ��� ����

���� ���� ��� ��� ���� ���� ����� ����� ���� ���� ����� ������st ��� �� ��� ���� ��� ����

���� ���� ��� ���� ���� ���� ����� ����� ���� ���� ���� ������th ��� ��� ��� ���� ��� ����

���� ���� ��� ���� ���� ���� ����� ����� ���� ���� ����� ������th ��� ��� ��� ���� ��� ����

���� ���� ���� ���� ���� ���� ����� ����� ���� ���� ����� ������th ��� ��� ���� ���� ��� ����

���� ���� ���� ���� ����� ����� ����� ����� ���� ���� ����� �����th ���� ��� ���� ���� ���� �����

���� ����� ���� ���� ����� ����� ����� ����� ����� ����� ������ �������th ���� ��� ���� ���� ���� �����

����� ����� ���� ���� ����� ����� ����� ����� ����� ����� ������ ������th ���� ��� ���� ����� ���� �����

����� ����� ���� ���� ����� ����� ����� ������ ����� ����� ������ ������

Table �: Mean, standard deviation, and selected percentiles of the usual intake distribution of dietary components for females aged ��. Values in parentheses are the lower and upper bounds of ��% credible intervals.


Calcium Cholesterol Iron Protein Vit A Vit C

mg� mg� mg� g� �g� mg�Mean ���� ��� ���� ���� ���� �����

����� ����� ���� ���� ����� ����� ����� ����� ����� ����� ������ ������Std� ��� ��� ��� ���� ��� ����

���� ���� ��� ���� ���� ���� ����� ����� ���� ���� ����� ������st ��� ��� ��� ���� ��� ����

���� ���� ��� ���� ���� ����� ����� ����� ���� ���� ����� ������th ��� ��� ���� ���� ��� ����

���� ���� ���� ���� ����� ����� ����� ����� ���� ���� ����� ������th ��� ��� ���� ���� ��� ����

���� ���� ���� ���� ����� ����� ����� ����� ���� ���� ����� ������th ���� ��� ���� ���� ���� ����

����� ����� ���� ���� ����� ����� ����� ����� ���� ����� ����� ������th ���� ��� ���� ���� ���� �����

����� ����� ���� ���� ����� ����� ����� ����� ����� ����� ������ �������th ���� ��� ���� ���� ���� �����

����� ����� ���� ���� ����� ����� ����� ����� ����� ����� ������ ������th ���� ��� ���� ����� ���� �����

����� ����� ���� ���� ����� ����� ����� ������ ����� ����� ������ ������

Table �: Mean, standard deviation, and selected percentiles of the usual intake distribution of dietary components for males aged ��. Values in parentheses are the lower and upper bounds of ��% credible intervals.


Females Males

Calcium RDA ����� mg ����� mgPrevalence ��� ���

���� ���� ���� ����Iron RDA �� mg �� mg

Prevalence ��� ������� ���� ���� ����

Protein RDA �� g �� gPrevalence ��� ���

���� ���� ���� ����Vitamin A RDA ��� �g ����� �g

Prevalence ��� ������� ���� ���� ����

Vitamin C RDA �� mg �� mgPrevalence ��� ���

���� ���� ���� ����

Cholesterol Cutpoint ��� mg ��� mgPrevalence ��� ���

���� ���� ���� ����

Table �: Mean of the posterior distribution of prevalence of nutrient inadequacy among females and males aged ��, and �th and ��th posterior percentiles. Here, prevalence is defined as Pr(y < ���� RDA), where the RDA for each nutrient is the value published in the ��� NRC report. For cholesterol, we report the mean, �th and ��th percentiles of the posterior distribution of the prevalence of excessive intakes, Pr(y > ��� mg).


Calcium Cholesterol Iron Protein Vit A Vit C

mg� mg� mg� g� �g� mg��st ��� ��� ��� ���� ��� ��

���� ���� ��� ���� ���� ���� ����� ����� ���� ���� �� ����th ��� ��� ��� ���� ��� ��

���� ���� ���� ���� ���� ���� ����� ����� ���� ���� ��� ����th ��� ��� ��� ���� ��� ��

���� ���� ���� ���� ���� ���� ����� ����� ���� ���� ��� ����th ��� ��� ���� ���� ��� ��

���� ���� ���� ���� ����� ����� ����� ����� ���� ���� ��� ���th ���� ��� ���� ���� ���� ���

���� ����� ���� ���� ����� ����� ����� ����� ����� ����� ���� �����th ���� ��� ���� ���� ���� ���

����� ����� ���� ���� ����� ����� ����� ����� ����� ����� ���� ����th ���� ��� ���� ����� ���� ���

����� ����� ���� ���� ����� ����� ����� ������ ����� ����� ���� ����

Table �: Selected percentiles of the usual intake distribution of dietary components for females aged ��, estimated using the Nusser et al. (����) frequentist approach. Values in parentheses are the lower and upper bounds of the approximate ��% confidence intervals computed using a balanced repeated replication method.


Calcium Cholesterol Iron Protein Vit A Vit C

mg� mg� mg� g� �g� mg��st ��� ��� ��� �� ��� ��

���� ���� ���� ���� ���� ����� ��� ��� ���� ���� ��� ����th ��� ��� ���� �� ��� ��

���� ���� ���� ���� ����� ����� ��� ��� ���� ���� ��� ����th ��� ��� ���� �� ��� ��

���� ���� ���� ���� ����� ����� ��� ��� ���� ���� ��� ����th ���� ��� ���� �� ���� ���

����� ����� ���� ���� ����� ����� ��� ���� ����� ����� ��� ����th ���� ��� ���� ��� ���� ���

����� ����� ���� ���� ����� ����� ���� ���� ����� ����� ���� �����th ���� ��� ���� ��� ���� ���

����� ����� ���� ���� ����� ����� ���� ���� ����� ����� ���� ����th ���� ��� ���� ��� ���� ���

����� ����� ���� ���� ����� ����� ���� ���� ����� ����� ���� ����

Table �: Selected percentiles of the usual intake distribution of dietary components for males aged ��, estimated using the Nusser et al. (����) frequentist approach. Values in parentheses are the lower and upper bounds of the approximate ��% confidence intervals computed using a balanced repeated replication method.
