This article was downloaded by: [Umeå University Library]
On: 21 November 2014, At: 10:55
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Biopharmaceutical Statistics
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/lbps20

DO YOU KNOW YOUR TOTAL CHOLESTEROL (TC) NUMBER?
Cong Chen a, Zhigang Zhou a & Robert W. Tipping a
a Merck Research Laboratories, BL X-27, P.O. Box 4, West Point, PA, 19486, U.S.A.
Published online: 05 Oct 2011.

To cite this article: Cong Chen, Zhigang Zhou & Robert W. Tipping (2002) DO YOU KNOW YOUR TOTAL CHOLESTEROL (TC) NUMBER?, Journal of Biopharmaceutical Statistics, 12:2, 179-192, DOI: 10.1081/BIP-120015742
To link to this article: http://dx.doi.org/10.1081/BIP-120015742
DO YOU KNOW YOUR TOTAL CHOLESTEROL (TC) NUMBER?
Cong Chen,* Zhigang Zhou, and Robert W. Tipping
Merck Research Laboratories, BL X-27, P.O. Box 4,
West Point, PA 19486, USA
ABSTRACT
Total cholesterol (TC) measurements are subject to errors primarily because
of temporal variations in cholesterol levels within each individual. These
errors make it difficult to estimate the proportion of study subjects with true
(error-free) TC in a specific range, an extremely important parameter to policy
makers in health care management. To properly address this issue, it is key to
accurately estimate the distribution function of the true TC, which typically
deviates from the normal distribution. To better approximate the distribution
function of the true TC, we propose a constrained maximum likelihood
estimator based on a mixture-of-normals model. A simulation study illustrates
that the proposed estimator performs better than an estimator based on the
normality assumption that is frequently used in the literature to address the
same issue. Finally, the proposed estimator is applied to data from a study, and
its performance is once again compared with that of an estimator based on the
normality assumption.
Key Words: Constrained maximum likelihood estimation; EM algorithm;
Measurement error model; National Cholesterol Education Program;
Response analysis
DOI: 10.1081/BIP-120015742 1054-3406 (Print); 1520-5711 (Online)
Copyright © 2002 by Marcel Dekker, Inc. www.dekker.com
*Corresponding author. E-mail: [email protected]
JOURNAL OF BIOPHARMACEUTICAL STATISTICS
Vol. 12, No. 2, pp. 179–192, 2002
©2002 Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc.
MARCEL DEKKER, INC. • 270 MADISON AVENUE • NEW YORK, NY 10016
INTRODUCTION
Total cholesterol (TC) measurements are used in clinical studies to evaluate
the efficacy of certain treatments for hypercholesterolemia, or in epidemiologic
studies to assess the risk of a population for coronary heart disease. In these studies
it is often important to estimate the proportion of individuals with TC within a
specific range (for example, <200 or 200–240). Decision makers frequently rely
on these proportion estimates to make recommendations that may have a great
impact on health care policies. An analysis of this type is sometimes referred to as
a “responder” or “threshold” analysis. One difficulty with this type of analysis
comes from measurement errors. It is well known that TC measurements are
subject to measurement errors because of temporal variations in cholesterol levels
within each individual. These errors could then lead to substantial
misclassifications, as noted in Ref. [1]. To emphasize the impact of measurement errors on
cholesterol control, current guidelines set by an expert panel of the National
Cholesterol Education Program (NCEP) recommend the use of an average of at
least two measurements of TC obtained from lipoprotein analysis to estimate the
risk of coronary heart disease.[2]
In the bio-statistical literature that deals with the measurement error issue, it
is usually assumed that both the variable of interest (true TC in Ref. [1] or error-
free measurement of bone mineral density in Ref. [3]) and the measurement error
are normally distributed. However, while the measurement error is generally
considered to have a normal distribution, as in Ref. [4], the distribution of the
variable of interest could differ from case to case. When it is the probabilities
(proportions) that are of interest, mis-specification of the distribution function of
the variable of interest may lead to unacceptable bias.
Notice that if both of these variables are indeed normally distributed, the
underlying variable of the observed data, which is the sum of the variable of
interest and the measurement error, would also have a normal distribution. To
check for the normality of the variable of interest, it is therefore helpful to check
for normality on the observed data. From our experience with population-based
studies, the observed data on major cholesterol measurements, such as TC, low
density lipoprotein cholesterol (LDL), or high density lipoprotein cholesterol (HDL), usually
have an empirical distribution that deviates from normality. Therefore, to properly
address the measurement error issue, alternative distribution functions other than
the normal distribution must be considered to better approximate the distribution
of true TC.
To facilitate the discussion, we will maintain the conventional normality
assumption on the measurement error throughout this article. Notice that, in
general, when a variable of interest is measured with large errors from a normal
distribution, the empirical distribution of the observed data could be very different
from the truth. For example, the empirical distribution of the observed data tends
to show less skewness and/or fewer modes than that of the error-free variable. The
reason is that because of the normally distributed errors, the distribution of
the observed data has essentially incorporated the features of a normal
distribution. As a result, it is very difficult to correctly specify the distribution
of the variable of interest from the observed data.
By nature, the underlying statistical problem is in the framework of
“deconvolution,” which deals with the estimation of the distribution function
of a variable measured with error. The term deconvolution is derived from the
fact that the distribution of the underlying variable of the observed data is a
convolution of the distribution of the variable of interest and that of the
measurement error.
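Written out, the convolution has a simple integral form. The sketch below assumes that the variable of interest $X$ and the measurement error $e$ are independent and that the observed value is $Y = X + e$ (the notation the article introduces in the Motivation section); the symbol $f_e$ for the error density is introduced here only for illustration.

```latex
f_Y(y) \;=\; (f_X * f_e)(y) \;=\; \int_{-\infty}^{\infty} f_X(x)\, f_e(y - x)\, dx
```

Deconvolution then amounts to recovering $f_X$ from an estimate of $f_Y$ when $f_e$ is known or assumed.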
Generally speaking, there are three different forms of estimators for
determining distribution functions: parametric, semiparametric, and nonparametric.
Parametric estimation assumes a specific parametric form of the
distribution function (such as normal distributions in Ref. [1] or in Ref. [3]).
Semiparametric estimation assumes that the distribution function is in a class of
certain functionals (such as mixture-of-normals in Ref. [5] and spline functions in
Refs. [6,7]). Nonparametric estimation does not assume any particular functional
form (such as kernel estimators in Ref. [8]).
Using a kernel estimator, Fan shows that the deconvolution problem is one of
the most challenging problems in statistics with an extremely low convergence
rate because of the substantial variance of the estimator.[9] In order to have a
precise estimate, the sample size has to be very large. While nonparametric
estimators are important in the investigation of the theoretical issues of the
deconvolution problem, they are difficult to interpret and could face strong
resistance in the medical community. Besides, from an extensive simulation study
in Ref. [7], it appears that their performance is inferior to that of semiparametric
estimators. From a practitioner’s point of view, when a specific parametric form
requires a very strong assumption and is hard to define, semiparametric estimators
should be considered as viable alternatives because of their flexibility. The spline
function estimator described in Refs. [6,7] performs very well but is difficult to
interpret and is computationally unfriendly. The mixture-of-normals estimator
may be viewed as a direct extension of the normal distribution and is arguably the
most attractive among semiparametric estimators.
In this article we propose a mixture-of-normals estimator for practical use.
This proposed estimator extends and improves upon the mixture-of-normals
estimator described by Cordy and Thomas.[5] The second section of this article
provides the motivation for using this proposed estimator and includes definitions
and notations. “Parameter Estimation” provides the details of the methodology for
using the proposed estimator and the simulation results that demonstrate its
performance. “Application to Real Data” applies the proposed methodology to
data from an actual study. The data example is selected because it carries typical
features of cholesterol measurements, and is presented to illustrate the
performance of the proposed estimator. It is not used to draw any particular
medical conclusions (such as misclassification rates) since the study was not
designed for that purpose. The last section concludes with discussions.
MOTIVATION
In a recent study of label testing to evaluate potential users’ ability to
properly self-select a low-dose statin as a nonprescription adjunct to a healthy diet
and exercise for the treatment of moderate hypercholesterolemia, participants in
the study were administered a questionnaire that asked whether they knew their
specific TC number (“Do You Know Your Total Cholesterol (TC) Number?”). To
determine their eligibility for the test drug, it would be ideal to have their average
TC levels over a short period, such as a week. However, for the sake of label
testing, the participants were only asked to provide their most recent TC levels. If
they did not know the specific number, they were then asked to select one of the
following four possibilities: “<200,” “>240,” “200–240,” and “don’t know.”
A total of 2416 participants (males: 72%; Caucasians: 80%; average age: 56)
were screened and were administered the questionnaire. Among them, 2094
participants either provided a specific TC number or a TC range, and were
included in our analysis. The rest gave a final answer of “don’t know” and were
excluded. Let $Y_i$ represent the reported TC level of the $i$th individual, where
$i = 1, \ldots, n$ ($n = 2094$). Notice that $Y_i$ could either be a specific number or be in a
range. If it is in a range, we write it as $Y_i \in V_i$, where $V_i$ takes one of the following
intervals: “<200,” “200–240,” or “>240.” Let $X_i$ be the true or error-free TC
for $Y_i$ and let $e_i$ be the corresponding additive measurement error (i.e.,
$Y_i = X_i + e_i$), independent of $X_i$, where $i = 1, \ldots, n$. For the purpose of the study, $X$
might be considered as the average TC over a week, and $e$ might be considered as
the weekly intraindividual variability.
Throughout this article, we assume that participants who reported a specific
TC number and participants who reported a TC range have the same distribution
function in the true TC levels [i.e., $X_i$ ($i = 1, \ldots, n$) are identically distributed]. As is
standard in the literature for measurement error problems, $e$ is assumed to have a
normal density with a zero mean and a variance of $\sigma_e^2$, i.e., $N(0, \sigma_e^2)$. Certainly, $\sigma_e^2$
cannot be determined from this study since we only have one observation for each
participant. However, the range of $\sigma_e^2$ can be reasonably derived from similar
studies. To facilitate the discussion, we assume that $\sigma_e^2$ is known. The density
function of a variable from $N(\cdot,\cdot)$ is denoted by $\phi(\cdot,\cdot)$, and the corresponding
cumulative distribution function (CDF) is denoted by $\Phi(\cdot,\cdot)$. Let $f_x$ and $f_y$ be the
corresponding probability density functions (PDF) of $X$ and $Y$, respectively.
Figure 1 presents the normal Quantile–Quantile plot of the specific TC
values from the study. Clearly, the plot shows that the TC level of the study
population does not have a normal distribution. Instead it is skewed, implying
skewness of the distribution ($f_x$) of the true (error-free) TC level. Beyond the
skewness, the figure does not clearly reveal other features of $f_x$. We use a mixture-
of-normals density function to approximate $f_x$. A PDF in the form of a mixture-of-
normals can, in theory, approximate a density function within any desired error
range. Assume that $f_x$ is a $p$-component mixture-of-normals whose $j$th component
($j = 1, \ldots, p$) is $\phi(\mu_j, \sigma^2)$ with a coefficient of $\alpha_j$ ($\alpha_j \ge 0$ and $\sum_j \alpha_j = 1$).
Without loss of generality, we assume that the $\mu_j$ are ordered such that
$\mu_p > \mu_{p-1} > \cdots > \mu_1$. The coefficient $\alpha_j$ can be interpreted as the probability
that $X$ is from $\phi(\mu_j, \sigma^2)$, $j = 1, \ldots, p$. Under this assumption,

$$f_y(y) = f_x * \phi(0, \sigma_e^2)(y) = \sum_{j=1}^{p} \alpha_j\, \phi(\mu_j, \sigma^2) * \phi(0, \sigma_e^2)(y) = \sum_{j=1}^{p} \alpha_j\, \phi(\mu_j, \sigma^2 + \sigma_e^2)(y) \qquad (1)$$

where “*” denotes convolution. Notice that $f_y$ is a mixture-of-normals with means
$\mu_j$ ($j = 1, \ldots, p$) and common variance $\sigma^2 + \sigma_e^2$. As in the case of $X$,
Eq. (1) implies that $Y$ is generated from $N(\mu_j, \sigma^2 + \sigma_e^2)$ with probability $\alpha_j$,
$j = 1, \ldots, p$. When $Y_i \in V_i$, $f_y(Y_i)$ is defined to be $\int_{V_i} f_y(y)\,dy$, the probability that
$Y_i$ falls into $V_i$.
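As a rough illustration of Eq. (1), the following Python sketch evaluates the observed-data density for a specific value and the probability of a reported range. The function names and all numerical values in the usage lines (coefficients, means, and variances) are hypothetical, chosen only for illustration; they are not estimates from the study.

```python
from statistics import NormalDist


def observed_density(y, alpha, mu, sigma2, sigma2_e):
    """f_y(y) from Eq. (1): every component has variance sigma2 + sigma2_e."""
    sd = (sigma2 + sigma2_e) ** 0.5
    return sum(a * NormalDist(m, sd).pdf(y) for a, m in zip(alpha, mu))


def interval_probability(lo, hi, alpha, mu, sigma2, sigma2_e):
    """Probability that Y falls in (lo, hi]; pass -inf or inf for open-ended ranges."""
    sd = (sigma2 + sigma2_e) ** 0.5
    total = 0.0
    for a, m in zip(alpha, mu):
        comp = NormalDist(m, sd)
        upper = 1.0 if hi == float("inf") else comp.cdf(hi)
        lower = 0.0 if lo == float("-inf") else comp.cdf(lo)
        total += a * (upper - lower)
    return total


# Hypothetical three-component mixture (illustrative values only).
alpha = [0.3, 0.5, 0.2]
mu = [200.0, 230.0, 270.0]
sigma2, sigma2_e = 350.0, 238.0

density_at_230 = observed_density(230.0, alpha, mu, sigma2, sigma2_e)
ranges = [(float("-inf"), 200.0), (200.0, 240.0), (240.0, float("inf"))]
range_probs = [interval_probability(lo, hi, alpha, mu, sigma2, sigma2_e)
               for lo, hi in ranges]
```

Because the three reported ranges partition the real line, their probabilities add up to one, which is a convenient sanity check on any implementation.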
PARAMETER ESTIMATION
Constrained Maximum Likelihood Estimation by EM Algorithm
The log-likelihood based on the observed data, $Y = (Y_1, \ldots, Y_n)^T$, is given
by

$$L(\alpha, \sigma^2 \mid Y) = \sum_{i=1}^{n} \log\{ f_y(Y_i) \}$$
Figure 1. Normal-QQ plot of specific TC levels. It shows clear deviation from a normal
distribution, typical of cholesterol measurements.
where $f_y$ is defined in Eq. (1). Certainly, it is difficult to directly maximize
$L(\alpha, \sigma^2 \mid Y)$ to obtain the maximum likelihood estimator (MLE) of $f_x$. However, the
fact that the coefficients have a probabilistic interpretation leads to a simple EM
algorithm for the maximum likelihood estimation. In this subsection, we discuss
the constrained maximum likelihood estimation of $\alpha = (\alpha_1, \ldots, \alpha_p)^T$ and $\sigma^2$ for
fixed $p$ and $\mu_j$ ($j = 1, \ldots, p$). In the next subsection we provide guidelines on the
choice of $\mu_j$ ($j = 1, \ldots, p$) and $p$.
Cordy and Thomas proposed a mixture-of-normals estimator with equally
spaced means and a common variance to approximate the distribution function of
a variable measured with error.[5] The size of the common variance ($\sigma^2$) and the
choice of $p$ were suggested from a simulation study. One major difference between
our proposed methodology and the approach described by Cordy and Thomas is
that we determine the common variance directly from the data, and make the
choice of $p$ less of an issue. Specifically, for a fixed $p$ and $\mu_j$ ($j = 1, \ldots, p$), we
estimate the parameters, $\alpha$ and $\sigma^2$, by maximizing the log-likelihood function
$L(\alpha, \sigma^2 \mid Y)$ subject to the constraints on the first two moments:

$$E\{Y\} = E\{X\} = \sum_{j=1}^{p} \alpha_j \mu_j \quad \text{and} \quad E\{Y^2\} = E\{X^2\} + \sigma_e^2 = \sum_{j=1}^{p} \alpha_j \mu_j^2 + \sigma^2 + \sigma_e^2 \qquad (2)$$

with $E\{Y\}$ and $E\{Y^2\}$ replaced with the corresponding first two empirical moments
of the specific TC levels, $\hat{E}\{Y\}$ and $\hat{E}\{Y^2\}$. Notice that under these two constraints
and the natural constraint on coefficients ($\sum_j \alpha_j = 1$), the effective number of
parameters to estimate is $(p - 2)$: the $p + 1$ parameters ($\alpha_1, \ldots, \alpha_p$ and $\sigma^2$) less the
three constraints. Updated estimates of $\alpha$ and $\sigma^2$ in an EM
algorithm are obtained through the steps described next. In the following
equations, $K_j$ denotes the number of participants with true TC levels from the $j$th
normal component of the mixture-of-normals in Eq. (1).
E-Step: Calculate the conditional expectations to estimate $K_j$ ($j = 1, \ldots, p$)
by

$$\hat{K}_j = E\{K_j \mid Y, \alpha^{old}, \sigma^{2,old}\} = \sum_{i=1}^{n} \left( \frac{\alpha_j^{old}\, \phi(\mu_j, \sigma^{2,old} + \sigma_e^2)(Y_i)}{\sum_{l} \alpha_l^{old}\, \phi(\mu_l, \sigma^{2,old} + \sigma_e^2)(Y_i)} \right).$$

In this equation, $\alpha^{old}$ is a prior estimate of $\alpha$, and $\sigma^{2,old}$ is a prior estimate of $\sigma^2$.
In addition, it is understood that when $Y_i \in V_i$, $\phi(\mu_l, \sigma^{2,old} + \sigma_e^2)(Y_i) =
\int_{V_i} \phi(\mu_l, \sigma^{2,old} + \sigma_e^2)(y)\,dy$. Regardless of whether $Y_i$ is a specific value or
interval data, the summand on the right-hand side of the above equation represents the
conditional probability that $Y_i$ is from the $j$th component. Therefore, the EM
algorithm is naturally suitable for interval data.
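The E-step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a specific value is represented as a float, and a reported range is represented as a `(lo, hi)` tuple with `float("inf")` for open ends.

```python
from statistics import NormalDist


def component_weight(y, mean, sd):
    """phi(mu_j, var)(Y_i): a density for a specific value, a probability for a range."""
    comp = NormalDist(mean, sd)
    if isinstance(y, tuple):            # interval data (lo, hi)
        lo, hi = y
        return comp.cdf(hi) - comp.cdf(lo)
    return comp.pdf(y)                  # specific value


def e_step(data, alpha_old, mu, sigma2_old, sigma2_e):
    """Return the expected component counts K_hat_j given the prior estimates."""
    sd = (sigma2_old + sigma2_e) ** 0.5
    k_hat = [0.0] * len(mu)
    for y in data:
        weights = [a * component_weight(y, m, sd) for a, m in zip(alpha_old, mu)]
        total = sum(weights)
        for j, w in enumerate(weights):
            k_hat[j] += w / total       # conditional probability that Y_i is from component j
    return k_hat
```

Each observation contributes conditional probabilities that sum to one, so the returned counts sum to the number of observations.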
M-Step: Calculate the maximum likelihood estimate of $\alpha$, which is referred
to as $\alpha^{new}$, by maximizing the log-likelihood function of the multinomial
©2002 Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc.
MARCEL DEKKER, INC. • 270 MADISON AVENUE • NEW YORK, NY 10016
Dow
nloa
ded
by [
Um
eå U
nive
rsity
Lib
rary
] at
10:
55 2
1 N
ovem
ber
2014
distribution, $\sum_j \hat{K}_j \log(\alpha_j^{new})$, subject to $\sum_j \alpha_j^{new} = 1$ and $\hat{E}\{Y\} = \sum_{j=1}^{p} \alpha_j^{new} \mu_j$.
For $j = 1, \ldots, p$, the MLE of the mixture-of-normals coefficients is given as

$$\alpha_j^{new} = \frac{\hat{K}_j}{n - \lambda[\hat{E}\{Y\} - \mu_j]}$$

where $\lambda$, serving as a Lagrange multiplier, is the unique solution of the following
equation calculated by the standard Newton–Raphson method:

$$h(\lambda) = \sum_{j=1}^{p} \frac{\hat{K}_j[\hat{E}\{Y\} - \mu_j]}{n - \lambda[\hat{E}\{Y\} - \mu_j]} = 0.$$

One way to see the uniqueness of this solution is by noticing that the first
derivative of $h(\lambda)$ with respect to $\lambda$ is strictly greater than zero. Once $\alpha^{new}$ is
obtained, the maximum likelihood estimate of $\sigma^2$, referred to as $\sigma^{2,new}$, is
calculated from the constraint on the second moment in Eq. (2). That is,

$$\sigma^{2,new} = \hat{E}\{Y^2\} - \sigma_e^2 - \sum_{j=1}^{p} \alpha_j^{new} \mu_j^2.$$

The EM algorithm is very fast and usually converges in fewer than 20
iterations. Once the EM algorithm converges and the estimates $\hat{\alpha}$ and $\hat{\sigma}^2$
are obtained, the corresponding estimate of $f_x$ is

$$\hat{f}_x(x) = \sum_{j=1}^{p} \hat{\alpha}_j\, \phi(\mu_j, \hat{\sigma}^2)(x). \qquad (3)$$
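The M-step can be sketched as follows, taking the expected counts from the E-step as given. This is a minimal illustration with hypothetical names; the step-halving safeguard that keeps every denominator positive is an addition made here for numerical safety and is not described in the article.

```python
def m_step(k_hat, mu, mean_y, second_moment_y, sigma2_e, tol=1e-12, max_iter=100):
    """Constrained update of alpha and sigma^2 from the expected counts k_hat.

    mean_y and second_moment_y are the empirical first two moments of Y.
    """
    n = sum(k_hat)                                  # expected counts sum to the sample size
    d = [mean_y - m for m in mu]                    # E_hat{Y} - mu_j
    lam = 0.0
    for _ in range(max_iter):
        h = sum(k * dj / (n - lam * dj) for k, dj in zip(k_hat, d))
        dh = sum(k * dj * dj / (n - lam * dj) ** 2 for k, dj in zip(k_hat, d))
        step = h / dh                               # Newton-Raphson step
        # Safeguard (an assumption of this sketch): shrink the step until
        # all denominators n - lambda * d_j remain positive.
        while any(n - (lam - step) * dj <= 0.0 for dj in d):
            step *= 0.5
        lam -= step
        if abs(step) < tol:
            break
    alpha_new = [k / (n - lam * dj) for k, dj in zip(k_hat, d)]
    sigma2_new = second_moment_y - sigma2_e - sum(
        a * m * m for a, m in zip(alpha_new, mu))
    return alpha_new, sigma2_new
```

At the root of $h(\lambda)$ the updated coefficients satisfy both the sum-to-one constraint and the first-moment constraint, and the variance update enforces the second-moment constraint in Eq. (2) by construction. Nothing in this sketch prevents the variance update from becoming non-positive for poorly chosen inputs, so a practical implementation would need to check for that.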
Choice of $p$ and $\mu_j$ ($j = 1, \ldots, p$)
Ideally, we need $p$ to be as large as possible to approximate $f_x$ accurately.
However, the variance of $\hat{f}_x$ increases with $p$. The introduction of the constraint on
the first two moments is partly intended to downplay the effect of $p$. It is
anticipated that when assisted with the two constraints, the variance of $\hat{f}_x$ does not
increase drastically with $p$. As a result, a slight mis-specification of $p$ does not
greatly affect the performance of $\hat{f}_x$. Given the difficulty in specifying $p$, the
introduction of the two constraints in Eq. (2) plays an important role in the
estimation. In practice, for data of moderate sample size (say, 200–4000) $p$ may be
taken from 5 to 15.
The location of $\mu_j$ ($j = 1, \ldots, p$) plays a less significant role than the
choice of $p$ because generally, the normal components are highly correlated
with each other. Misplacing a normal mean, as long as the misplacement is not
unreasonable, would not greatly affect the overall performance of the estimator.
Nevertheless, we try to optimize the selection procedure and locate the
means at the average of those from an equal-probability scheme and an equal-
space scheme. The same mixing strategy was used in the location of join
points of spline functions in Refs. [6,7].
In both the equal-probability scheme and the equal-space scheme, $\mu_1$ is
always fixed at the $[0.05n]$-th smallest specific TC and $\mu_p$ is always fixed at
the $[0.05n]$-th largest specific TC, where $[0.05n]$ is the closest integer to $0.05n$. As
for the rest of the means in the equal-probability scheme, $\mu_{j,ep}$ ($j = 2, \ldots, p - 1$),
there is approximately the same number of specific TC numbers in the interval
between two consecutive means. The rest of the means in the equal-space scheme,
$\mu_{j,es}$ ($j = 2, \ldots, p - 1$), are equally spaced between the first and the last. The
intermediate means used in the model are then calculated as $\mu_j = 0.5(\mu_{j,ep} +
\mu_{j,es})$, $j = 2, \ldots, p - 1$. Under the mixing strategy, $\mu_j$ ($j = 1, \ldots, p$) tend to be
more effectively located, and the resulting estimates of $f_x$ tend to be more stable.
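One possible reading of this mixing strategy is sketched below. The function name is hypothetical, and the way the equal-probability means are picked (order statistics at equally spaced ranks between the two end points) is an assumption of this sketch, since the text only says that consecutive means enclose approximately the same number of specific values.

```python
def choose_means(values, p):
    """Average of equal-probability and equal-space locations for the p means."""
    ys = sorted(values)
    n = len(ys)
    k = max(1, round(0.05 * n))          # [0.05n]; Python rounds exact halves to even
    lo_idx, hi_idx = k - 1, n - k        # zero-based k-th smallest and k-th largest
    first, last = ys[lo_idx], ys[hi_idx]
    means = []
    for j in range(p):
        frac = j / (p - 1)
        equal_space = first + frac * (last - first)
        equal_prob = ys[lo_idx + round(frac * (hi_idx - lo_idx))]
        means.append(0.5 * (equal_space + equal_prob))
    return means
```

The first and last means coincide in both schemes, so averaging leaves them at the chosen order statistics; only the intermediate means are pulled toward regions where the data are dense.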
Simulation Study
Data used in our simulation study, $\{Y_i;\ i = 1, \ldots, n\}$, take specific values,
and are generated as a sum of true data and measurement error. The true (or error-
free) data, $\{X_i;\ i = 1, \ldots, n\}$, are generated from a predetermined distribution
function and the measurement errors, $\{e_i;\ i = 1, \ldots, n\}$, are generated from
$N(0, \sigma_e^2)$ with $\sigma_e^2$ being treated as known. In our simulation study, $n$ is held fixed at
2000 to be comparable with that in the previously mentioned study. Two right
skewed density functions, one unimodal and the other bimodal, are used to
generate $X$. They are $0.2N(0, 1) + 0.2N(-3/5, 4/9) + 0.6N(-3/2, 25/81)$ and
$0.7N(0, 1) + 0.3N(-3/2, 2/9)$, respectively.
An important measure of contamination of the data is the ratio (denoted by $r$)
of the variance of the measurement error to the variance of $X$. Two different
values, 0.2 and 0.4, were considered for $r$. The selected values were in a range
typical for TC measurements.[1] The number of normal components ($p$)
considered for the proposed estimator ranged from 5 to 11. As expected, the
performance of the proposed estimator varied little with $p$. We only report the
outcomes based on $p = 6$ and 10, which result in 4 and 8 effective parameters,
respectively.
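A data-generating step of this kind can be sketched as follows. The names are hypothetical, the second argument of each $N(\cdot,\cdot)$ is treated as a variance (consistent with the article's $N(0, \sigma_e^2)$ notation), and the signs of the component means follow the reading of the extracted text adopted above, so they should be checked against the printed article before reuse.

```python
import random

# (weight, mean, variance) triples for the two mixtures used to generate X.
UNIMODAL = [(0.2, 0.0, 1.0), (0.2, -3 / 5, 4 / 9), (0.6, -3 / 2, 25 / 81)]
BIMODAL = [(0.7, 0.0, 1.0), (0.3, -3 / 2, 2 / 9)]


def mixture_variance(components):
    """Variance of a mixture of normals given (weight, mean, variance) triples."""
    mean = sum(w * m for w, m, _ in components)
    return sum(w * (v + m * m) for w, m, v in components) - mean ** 2


def simulate(components, n, r, rng):
    """Draw n pairs (X_i, Y_i) with Y_i = X_i + e_i and Var(e) = r * Var(X)."""
    sigma_e = (r * mixture_variance(components)) ** 0.5
    weights = [w for w, _, _ in components]
    xs, ys = [], []
    for _ in range(n):
        _, m, v = rng.choices(components, weights=weights)[0]
        x = rng.gauss(m, v ** 0.5)
        xs.append(x)
        ys.append(x + rng.gauss(0.0, sigma_e))
    return xs, ys


xs, ys = simulate(UNIMODAL, n=2000, r=0.2, rng=random.Random(0))
```

Only `ys` would be passed to an estimator; `xs` is kept so that estimates can be compared with the truth.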
Two CDF estimators are used for comparison. The first one is $\hat{F}_x$, the
CDF estimator corresponding to $\hat{f}_x$ in Eq. (3), and the second one is $\tilde{F}_x$, or
$N(\bar{Y}, s_y^2 - \sigma_e^2)$, where $\bar{Y}$ is the sample mean of $Y$ and $s_y^2$ is the sample variance of $Y$. The first
one ($\hat{F}_x$) is the proposed estimator, and the second one ($\tilde{F}_x$) is essentially the same
estimator that is based on the normality assumption as discussed in Refs. [1,3].
Notice that $\hat{F}_x$ and $\tilde{F}_x$ have the same mean and variance. We compare the bias of
the two CDF estimators at selected percentiles of the underlying distribution
function of $X$. Denoting by $t_\alpha$ the $100\alpha$-percentile of the distribution function of $X$,
the bias of an estimator, such as $\tilde{F}_x$, at $t_\alpha$ is $(\tilde{F}_x(t_\alpha) - \alpha)$.
Table 1 reports the average bias and standard deviation in percentage points
(i.e., multiplied by 100) of the two estimators from 1000 replicates. Clearly, the
proposed estimator, either with $p = 6$ or 10, has much less bias and slightly
greater standard deviation than the estimator based on the normality assumption
($\tilde{F}_x$). Overall, the proposed estimator is superior to the estimator based on
the normality assumption. As predicted, the performance of the proposed
estimator ($\hat{F}_x$) varies little with $p$.

Table 1. Empirical Bias (Standard Deviation) in Percentage Points Based on 1000 Replicates at $100\alpha$-Percentiles of the Underlying Distribution of $X$

                                                                        $\alpha$
Estimator               r      0.975        0.95         0.90         0.75         0.50         0.25         0.10         0.05         0.025

Distribution of X: skewed unimodal
$\tilde{F}_x$           0.2   −1.6 (0.2)   −2.2 (0.4)   −1.6 (0.7)    4.0 (1.1)    6.2 (1.0)    0.7 (0.8)   −3.0 (0.6)   −3.5 (0.5)   −3.2 (0.4)
$\hat{F}_x$, $p = 6$    0.2   −0.1 (0.4)    0.5 (0.6)    0.3 (0.7)   −0.1 (1.1)    1.3 (1.3)    0.2 (0.9)   −0.8 (0.7)   −0.8 (0.6)   −0.6 (0.5)
$\hat{F}_x$, $p = 10$   0.2   −0.1 (0.4)    0.5 (0.5)    0.3 (0.7)   −0.2 (1.0)    1.4 (1.2)    0.3 (1.0)   −0.7 (0.7)   −0.8 (0.6)   −0.7 (0.4)
$\tilde{F}_x$           0.4   −1.6 (0.2)   −2.2 (0.4)   −1.6 (0.7)    4.0 (1.2)    6.2 (1.1)    0.7 (0.9)   −3.0 (0.7)   −3.5 (0.6)   −3.2 (0.5)
$\hat{F}_x$, $p = 6$    0.4    0.1 (0.5)    0.8 (0.7)    0.2 (0.7)   −0.3 (1.2)    1.2 (1.4)    0.4 (1.2)   −0.6 (1.0)   −0.7 (0.8)   −0.6 (0.6)
$\hat{F}_x$, $p = 10$   0.4    0.0 (0.5)    0.8 (0.7)    0.3 (0.8)   −0.3 (1.2)    1.3 (1.4)    0.3 (1.2)   −0.7 (0.9)   −0.8 (0.7)   −0.7 (0.5)

Distribution of X: skewed bimodal
$\tilde{F}_x$           0.2   −0.4 (0.2)   −0.8 (0.4)   −1.4 (0.6)   −1.9 (0.9)    3.2 (1.0)    4.0 (0.7)   −1.0 (0.5)   −2.5 (0.4)   −2.7 (0.4)
$\hat{F}_x$, $p = 6$    0.2    0.1 (0.4)    0.3 (0.6)    0.0 (0.7)    0.1 (1.1)   −0.4 (1.2)    0.9 (1.1)    0.1 (0.8)   −0.4 (0.6)   −0.5 (0.4)
$\hat{F}_x$, $p = 10$   0.2    0.0 (0.5)    0.4 (0.6)    0.3 (0.7)   −0.1 (1.2)   −0.2 (1.3)    0.7 (1.2)   −0.4 (0.8)   −0.6 (0.7)   −0.5 (0.5)
$\tilde{F}_x$           0.4   −0.3 (0.3)   −0.7 (0.4)   −1.3 (0.6)   −1.8 (1.0)    3.3 (1.1)    4.1 (0.9)   −1.0 (0.6)   −2.4 (0.5)   −2.7 (0.4)
$\hat{F}_x$, $p = 6$    0.4    0.6 (0.5)    0.5 (0.6)   −0.2 (0.8)   −0.3 (1.3)   −0.4 (1.4)    1.1 (1.3)    0.3 (0.9)   −0.4 (0.7)   −0.8 (0.6)
$\hat{F}_x$, $p = 10$   0.4    0.3 (0.7)    0.7 (0.6)    0.6 (0.7)   −0.5 (1.1)   −0.3 (1.4)    1.3 (1.2)   −0.5 (1.0)   −0.7 (0.8)   −0.9 (0.6)
APPLICATION TO REAL DATA
We now use the data from the previously mentioned study to demonstrate the
magnitude of the difference in key estimates between the mixture-of-normals
approach and the conventional normality approach. Of the 2094 participants
included in the analysis, 1672 (79.8%) provided specific TC levels with a mean of
234 mg/dL and a variance of 952 (mg/dL)². The remaining 422 participants
provided a TC range, with 11 (3%) having a TC level below 200 mg/dL, 317
(75%) having a TC level between 200 and 240 mg/dL, and 94 (22%) having a TC
level greater than 240 mg/dL. The normality-based density estimator ($\tilde{f}_x$) for the
data would be $N(234, 952 - \sigma_e^2)$.
The true TC level that we are interested in is each participant’s average TC
level during the course of a week. It has been estimated from a reliability sub-study
of the well-known Framingham Heart Study that TC measurements taken one
week apart have an intra-individual variance of 238 (mg/dL)².[1,10] Clearly, if the
observed TC level is the average of two independent measurements taken within a
week, the intra-individual variance would be 119 (mg/dL)².
In our study, the TC levels reported by some of the participants might
actually be based on more than one measurement. We consider both 119 and 238
as the possible values of $\sigma_e^2$. Using the sample variance of the participants’ specific
TC values, the ratio of the variance of the measurement error to that of the true TC
is 33.3% when $\sigma_e^2 = 238$ and is 14.3% when $\sigma_e^2 = 119$. It is understood that the
true value of $\sigma_e^2$ for the data may not be exactly 119 or 238, but may actually be
somewhere between these two values. Nevertheless, we use these values to
illustrate the performance of the proposed estimator and highlight its difference
from the estimator based on the normality assumption.
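The two ratios follow from simple arithmetic: the variance of the true TC is the sample variance of the specific values (952) minus $\sigma_e^2$, and the ratio divides $\sigma_e^2$ by that difference. A minimal check, with a hypothetical helper name:

```python
def error_to_true_ratio(sample_variance, sigma2_e):
    """Ratio of measurement-error variance to the variance of the true TC."""
    true_variance = sample_variance - sigma2_e   # Var(X) = Var(Y) - sigma_e^2
    return sigma2_e / true_variance


single = error_to_true_ratio(952.0, 238.0)       # 238 / 714, one measurement
averaged = error_to_true_ratio(952.0, 238.0 / 2) # 119 / 833, average of two measurements
```

These evaluate to one third and one seventh, matching the 33.3% and 14.3% quoted in the text.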
The number of normal components ($p$) we have tried ranges from 6 to 10.
Because of the large sample size in this study, the density estimator, $\hat{f}_x$, changes
little with $p$, as is demonstrated in the simulation described in the preceding
section. The results presented here are based on an estimator of eight normal
components with the following means: 197.0, 210.9, 221.2, 232.1, 242.0, 250.3,
263.1, and 286. The estimated common variance is $\hat{\sigma}^2 = 342.0$ when $\sigma_e^2 = 238$,
and $\hat{\sigma}^2 = 384.9$ when $\sigma_e^2 = 119$. The estimated coefficients are $\hat{\alpha} =
(0.057, 0.000, 0.099, 0.742, 0.001, 0.000, 0.000, 0.102)^T$ when $\sigma_e^2 = 238$, and
$\hat{\alpha} = (0.072, 0.001, 0.223, 0.501, 0.043, 0.020, 0.041, 0.099)^T$ when $\sigma_e^2 = 119$.
For each $\sigma_e^2$, the corresponding density estimator $\hat{f}_x$ shows moderate skewness that
was also found in the specific TC levels.
Figure 2 presents the difference of the estimated CDFs ($\hat{F}_x - \tilde{F}_x$) in
percentage points by $\sigma_e^2$. The difference is higher if there is more measurement
error, and takes its largest values around 250. We do not know
the true distribution for these data, but judging from the simulation results in the
preceding section, it is reasonable to suggest that the difference was largely due to
the bias of the normality estimator. From the figure it is clear that the normality
approach would very likely over-estimate the proportion of patients with
TC ≤ 200 mg/dL and under-estimate the proportion of patients with TC ≤ 240
mg/dL, each by a couple of percentage points. Given the importance of the two
cut-off points (200 mg/dL and 240 mg/dL) in cholesterol control, this result shows the
normality approach would be inappropriate.
We further compare $\hat{f}_x$ and $\tilde{f}_x$ in the evaluation of two misclassification rates:
the false negative rate (FN) and the false positive rate (FP). A false negative
occurs when an individual is classified as having TC ≤ $y$ with $y$ ≤ 240 when
actually the true TC is >240 (“high risk” according to an expert panel of the
NCEP[2]). A false positive occurs when an individual is classified as having
TC ≥ $y$ with $y$ > 200 when actually the true TC is <200 (“desirable” according
to an expert panel of the NCEP[2]). The two rates are formally defined in the
following expressions:

$$FN(y) = \Pr(Y \le y \mid X > 240) \quad \text{and} \quad FP(y) = \Pr(Y > y \mid X < 200).$$
Figure 2. Difference in percentage points between the two estimated CDFs $(\hat{F}_x(x) - \tilde{F}_x(x))$. The
left figure is based on $\sigma_e^2 = 238$ and the right figure is based on $\sigma_e^2 = 119$.
A serious false negative occurs at $y = 200$, and a serious false positive occurs at
$y = 240$.
Under the mixture-of-normals assumption, the joint density function of $(X, Y)$
is estimated to be
$$f_{(X,Y)}(x, y) = \sum_{j=1}^{p} \alpha_j f_{(X_j, Y_j)}(x, y),$$
where $(X_j, Y_j)$ has a bivariate normal distribution with mean vector $(\mu_j, \mu_j)$ and a
common variance–covariance matrix with the element corresponding to the
variance of $Y_j$ being $(\sigma^2 + \sigma_e^2)$ and the rest being $\sigma^2$. The FN and the FP using the
mixture-of-normals estimator are given as
$$\mathrm{FN}_{\hat{f}_x}(y) = \frac{\sum_{j=1}^{p} \alpha_j \int_0^{y}\int_{240}^{\infty} f_{(X_j, Y_j)}(x, y)\,dx\,dy}{\sum_{j=1}^{p} \alpha_j \int_{240}^{\infty} f_{X_j}(x)\,dx}$$
and
$$\mathrm{FP}_{\hat{f}_x}(y) = \frac{\sum_{j=1}^{p} \alpha_j \int_{y}^{\infty}\int_0^{200} f_{(X_j, Y_j)}(x, y)\,dx\,dy}{\sum_{j=1}^{p} \alpha_j \int_0^{200} f_{X_j}(x)\,dx},$$
respectively.
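The two ratios above can be evaluated numerically with a short sketch such as the following. It relies on the fact that, within each component, the stated covariance structure makes the observed value given the true value $x$ normal with mean $x$ and variance $\sigma_e^2$, so the inner integral has a closed form and only the integral over $x$ needs quadrature. The function names, the composite Simpson rule, the finite stand-in for the infinite upper limit, and all mixture parameters are assumptions made for illustration; only $\sigma_e^2 = 238$ is taken from the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def simpson(func, lo, hi, n=2000):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3


def fn_fp_mixture(y, alphas, means, sigma, sigma_e, low=200.0, high=240.0):
    """Return (FN(y), FP(y)) for a mixture-of-normals model of the true TC."""
    # Finite stand-in for the infinite upper limit of the integral over x.
    upper = max(high, max(means)) + 10 * sigma

    def p_obs_below(x):   # Pr(0 <= Y <= y | X = x)
        return STD_NORMAL.cdf((y - x) / sigma_e) - STD_NORMAL.cdf(-x / sigma_e)

    def p_obs_above(x):   # Pr(Y > y | X = x)
        return 1.0 - STD_NORMAL.cdf((y - x) / sigma_e)

    fn_num = fn_den = fp_num = fp_den = 0.0
    for a, m in zip(alphas, means):
        comp = NormalDist(m, sigma)   # marginal density of X in component j
        fn_num += a * simpson(lambda x: comp.pdf(x) * p_obs_below(x), high, upper)
        fn_den += a * (1.0 - comp.cdf(high))
        fp_num += a * simpson(lambda x: comp.pdf(x) * p_obs_above(x), 0.0, low)
        fp_den += a * (comp.cdf(low) - comp.cdf(0.0))
    return fn_num / fn_den, fp_num / fp_den


# Hypothetical mixture (NOT the fitted values); sigma_e^2 = 238 as in the text.
alphas = [0.10, 0.60, 0.30]
means = [190.0, 225.0, 270.0]
sigma = 15.0
sigma_e = 238 ** 0.5

for y in (200.0, 220.0, 240.0):
    fn, fp = fn_fp_mixture(y, alphas, means, sigma, sigma_e)
    print(f"y = {y:.0f}: FN = {fn:.4f}, FP = {fp:.4f}")
```

Setting `alphas = [1.0]` with a single mean reduces the same function to a one-component normal model, which gives a convenient way to compare the two approaches on equal footing.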
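Under the normality assumption the FN and the FP are written as follows:
$$\mathrm{FN}_{\tilde{f}_x}(y) = \frac{\int_0^{y}\int_{240}^{\infty} \tilde{f}_{(X,Y)}(x, y)\,dx\,dy}{\int_{240}^{\infty} \tilde{f}_X(x)\,dx} \quad\text{and}\quad \mathrm{FP}_{\tilde{f}_x}(y) = \frac{\int_{y}^{\infty}\int_0^{200} \tilde{f}_{(X,Y)}(x, y)\,dx\,dy}{\int_0^{200} \tilde{f}_X(x)\,dx}.$$
In these last two equations, $(X, Y)$ has a bivariate normal distribution with mean
vector $(\bar{Y}, \bar{Y})$ and a variance–covariance matrix with the element corresponding to
the variance of $Y$ being $\sigma_y^2$ and the rest being $(\sigma_y^2 - \sigma_e^2)$.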
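As a brief check that is not part of the original derivation, the covariance structure just described corresponds to the additive error model in which the observed value equals the true value plus independent normal noise with variance $\sigma_e^2$. Writing $m$ for the common mean of $X$ and $Y$, $\Phi$ for the standard normal CDF, and $t$ for the dummy variable of the observed value (all three are notational assumptions introduced here), the standard conditional-normal formulas give:

```latex
\begin{align*}
E(Y \mid X = x) &= m + \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}\,(x - m)
  = m + \frac{\sigma_y^2 - \sigma_e^2}{\sigma_y^2 - \sigma_e^2}\,(x - m) = x, \\
\operatorname{Var}(Y \mid X = x) &= \operatorname{Var}(Y) - \frac{\operatorname{Cov}(X,Y)^2}{\operatorname{Var}(X)}
  = \sigma_y^2 - (\sigma_y^2 - \sigma_e^2) = \sigma_e^2, \\
\int_0^{y}\!\int_{240}^{\infty} \tilde{f}_{(X,Y)}(x,t)\,dx\,dt
  &= \int_{240}^{\infty} \tilde{f}_X(x)
     \left[\Phi\!\left(\frac{y - x}{\sigma_e}\right) - \Phi\!\left(\frac{-x}{\sigma_e}\right)\right] dx .
\end{align*}
```

The last line shows that the numerator of the false negative rate reduces to a one-dimensional integral; the numerator of the false positive rate reduces in the same way with $1 - \Phi\{(y - x)/\sigma_e\}$ integrated over $0 \leq x \leq 200$.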
Clearly, from the definition of the misclassification rates, the better the
estimator of the density function of X is, the closer the estimated misclassification
rates are to the truth. The simulation results in the preceding section imply that the
estimates of the misclassification rates from the mixture-of-normals approach
should be much closer to the truth than those from the normality
approach.
Figure 3 presents the difference in estimated misclassification rates between
the two estimators ($\mathrm{FN}_{\hat{f}_x} - \mathrm{FN}_{\tilde{f}_x}$ and $\mathrm{FP}_{\hat{f}_x} - \mathrm{FP}_{\tilde{f}_x}$) for $200 \leq y \leq 240$. Clearly, the
normality approach under-reports misclassification rates as compared to the
mixture-of-normals approach. Therefore, it is very likely that the normality
approach would also under-report the true misclassification rates and would be
inadequate for the evaluation purpose.
CONCLUSIONS
A mixture-of-normals estimator is proposed for estimating the distribution
function of a variable measured with errors. The constrained MLEs of the
parameters are obtained by an EM algorithm, which naturally takes care of interval
data. With the two constraints imposed on the two moments, the proposed estimator
directly estimates the common variance of the normal components. More
importantly, the added constraints help strengthen the robustness of the mixture-of-
normals estimator, making it less dependent upon the number of normal
components. Guidelines on the choice of normal means are provided, which make
the estimator particularly appealing to practitioners. In a simulation study, the
proposed estimator performed better than an estimator based on the normality
assumption, which was frequently used in the literature to address similar issues.
The proposed method was applied to the analysis of TC data based on a
population potentially interested in the treatment of moderate hypercholesterolemia.
The difference in key estimates between the proposed approach and the
normality approach could be as large as a couple of percentage points. Judging
from the simulation results, it is reasonable to suggest that the normality approach
would very likely be inadequate for the analysis of such data. Encouraged by
the improvement of the mixture-of-normals approach over the conventional
normality approach, we are extending it to the study of the
joint distribution of two variables (e.g., LDL and HDL, which play a critical role in
the new guidelines of NCEP[2]). A mixture of bivariate normals with a common
correlation coefficient has been considered for the approximation of the joint
distribution. Preliminary results show that a constrained MLE with constraints
imposed on the variance–covariance matrix has very good performance.

Figure 3. Difference in percentage points between the two estimated FNs $(\mathrm{FN}_{\hat{f}_x}(y) - \mathrm{FN}_{\tilde{f}_x}(y))$ and
between the two estimated FPs $(\mathrm{FP}_{\hat{f}_x}(y) - \mathrm{FP}_{\tilde{f}_x}(y))$. The left figure is based on $\sigma_e^2 = 238$, and the right
figure is based on $\sigma_e^2 = 119$.
ACKNOWLEDGMENTS
We are thankful to our colleague Dr. John R. Cook for his insightful comments that
have led to a much improved presentation. We also thank two anonymous referees for
their helpful suggestions.
REFERENCES
1. Roeback, J.R.; Cook, J.R.; Guess, H.A.; Heyse, J.F. Time-Dependent Variability in
Repeated Measurements of Cholesterol Levels: Clinical Implications for Risk
Misclassification and Intervention Monitoring. J. Clin. Epidemiol. 1993, 46,
1159–1171.
2. National Heart, Lung, and Blood Institute. Executive Summary of the Third National
Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation and
Treatment of High Blood Cholesterol in Adults (Adult Treatment Panel III). J. Am.
Med. Assoc. 2001, 285, 2486–2497.
3. Oppenheimer, L.; Kher, U. The Impact of Measurement Error on the Comparison of
Two Treatments Using a Responder Analysis. Stat. Med. 1999, 18, 2177–2188.
4. Fuller, W.A. Measurement Error Models; Wiley: New York, 1987.
5. Cordy, C.B.; Thomas, D.R. Deconvolution of a Distribution Function. J. Am. Stat.
Assoc. 1997, 92, 1459–1465.
6. Chen, C.; Fuller, W.A.; Breidt, J. A Semiparametric Estimator of the Density
Function of a Variable Measured with Error. Commun. Stat.—Theory Methods
2000, 29, 1293–1310.
7. Chen, C.; Fuller, W.A.; Breidt, J. Spline Estimator of the Density Function of a
Variable Measured with Error. Commun. Stat.—Simul. Comput. 2003, 32, in press.
8. Stefanski, L.A.; Carroll, R.J. Deconvoluting Kernel Density Estimators. Statistics
1990, 21, 249–259.
9. Fan, J. On the Optimal Rates of Convergence for Nonparametric Deconvolution
Problem. Ann. Stat. 1991, 19, 1257–1272.
10. Gordon, T.; Kannel, W.B. Introduction and General Background: The Framingham
Study: An Epidemiological Investigation of Cardiovascular Disease. Section 1; NIH
Publications, 1968, p. 188.
Received March 2002
Revised May 2002, June 2002
Accepted June 2002