DO YOU KNOW YOUR TOTAL CHOLESTEROL (TC) NUMBER?

This article was downloaded by: [Umeå University Library] On: 21 November 2014, At: 10:55
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Biopharmaceutical Statistics
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/lbps20

DO YOU KNOW YOUR TOTAL CHOLESTEROL (TC) NUMBER?
Cong Chen (a), Zhigang Zhou (a) & Robert W. Tipping (a)
(a) Merck Research Laboratories, BL X-27, P.O. Box 4, West Point, PA, 19486, U.S.A.
Published online: 05 Oct 2011.

To cite this article: Cong Chen, Zhigang Zhou & Robert W. Tipping (2002) DO YOU KNOW YOUR TOTAL CHOLESTEROL (TC) NUMBER?, Journal of Biopharmaceutical Statistics, 12:2, 179-192, DOI: 10.1081/BIP-120015742

To link to this article: http://dx.doi.org/10.1081/BIP-120015742

Cong Chen,* Zhigang Zhou, and Robert W. Tipping

Merck Research Laboratories, BL X-27, P.O. Box 4,

West Point, PA 19486, USA

ABSTRACT

Total cholesterol (TC) measurements are subject to errors primarily because

of temporal variations in cholesterol levels within each individual. These

errors make it difficult to estimate the proportion of study subjects with true

(error-free) TC in a specific range, an extremely important parameter to policy

makers in health care management. To properly address this issue, it is key to

accurately estimate the distribution function of the true TC, which typically

deviates from the normal distribution. To better approximate the distribution

function of the true TC, we propose a constrained maximum likelihood

estimator based on a mixture-of-normals model. A simulation study illustrates

that the proposed estimator performs better than an estimator based on the

normality assumption that is frequently used in the literature to address the

same issue. Finally, the proposed estimator is applied to data from a study, and

its performance is once again compared with that of an estimator based on the

normality assumption.

Key Words: Constrained maximum likelihood estimation; EM algorithm;

Measurement error model; National Cholesterol Education Program;

Response analysis

DOI: 10.1081/BIP-120015742 1054-3406 (Print); 1520-5711 (Online) Copyright © 2002 by Marcel Dekker, Inc. www.dekker.com

*Corresponding author. E-mail: [email protected]

JOURNAL OF BIOPHARMACEUTICAL STATISTICS

Vol. 12, No. 2, pp. 179–192, 2002

©2002 Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc.

MARCEL DEKKER, INC. • 270 MADISON AVENUE • NEW YORK, NY 10016

INTRODUCTION

Total cholesterol (TC) measurements are used in clinical studies to evaluate

the efficacy of certain treatments for hypercholesterolemia, or in epidemiologic

studies to assess the risk of a population for coronary heart disease. In these studies

it is often important to estimate the proportion of individuals with TC within a

specific range (for example, <200 or 200–240). Decision makers frequently rely

on these proportion estimates to make recommendations that may have a great

impact on health care policies. An analysis of this type is sometimes referred to as

a “responder” or “threshold” analysis. One difficulty with this type of analysis

comes from measurement errors. It is well known that TC measurements are

subject to measurement errors because of temporal variations in cholesterol levels

within each individual. These errors could then lead to substantial

misclassifications, as noted in Ref. [1]. To emphasize the impact of measurement errors on

cholesterol control, current guidelines set by an expert panel of the National

Cholesterol Education Program (NCEP) recommend the use of an average of at

least two measurements of TC obtained from lipoprotein analysis to estimate the

risk of coronary heart disease.[2]

In the biostatistical literature that deals with the measurement error issue, it

is usually assumed that both the variable of interest (true TC in Ref. [1] or error-

free measurement of bone mineral density in Ref. [3]) and the measurement error

are normally distributed. However, while the measurement error is generally

considered to have a normal distribution, as in Ref. [4], the distribution of the

variable of interest could differ from case to case. When it is the probabilities

(proportions) that are of interest, mis-specification of the distribution function of

the variable of interest may lead to unacceptable bias.

Notice that if both of these variables are indeed normally distributed, the

underlying variable of the observed data, which is the sum of the variable of

interest and the measurement error, would also have a normal distribution. To

check for the normality of the variable of interest, it is therefore helpful to check

for normality on the observed data. From our experience with population-based

studies, the observed data on major cholesterol measurements, such as TC, low

density lipoprotein cholesterol (LDL), or high density lipoprotein cholesterol (HDL), usually

have an empirical distribution that deviates from normality. Therefore, to properly

address the measurement error issue, alternative distribution functions other than

the normal distribution must be considered to better approximate the distribution

of true TC.

To facilitate the discussion, we will maintain the conventional normality

assumption on the measurement error throughout this article. Notice that, in

general, when a variable of interest is measured with large errors from a normal

distribution, the empirical distribution of the observed data could be very different

from the truth. For example, the empirical distribution of the observed data tends

to show less skewness and/or fewer modes than that of the error-free variable. The

reason is that because of the normally distributed errors, the distribution of

the observed data has essentially incorporated the features of a normal

distribution. As a result, it is very difficult to correctly specify the distribution

of the variable of interest from the observed data.
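
The loss of skewness can be quantified. Cumulants of independent variables add, and a normal error has a zero third cumulant, so the third central moment of Y = X + e equals that of X while the variance grows from Var(X) to Var(X) + Var(e). The short Python sketch below computes the resulting skewness of the observed data; the function name and the argument `r` (the ratio Var(e)/Var(X)) are illustrative choices, not notation taken from the article.

```python
def observed_skewness(true_skewness: float, r: float) -> float:
    """Skewness of Y = X + e for a normal error e independent of X.

    r is the ratio Var(e) / Var(X). The third central moment is unchanged by
    the normal error, while the variance is inflated by the factor (1 + r).
    """
    if r < 0:
        raise ValueError("r must be non-negative")
    return true_skewness / (1.0 + r) ** 1.5


# Illustration with a hypothetical true skewness of 1.0.
for ratio in (0.0, 0.2, 0.4):
    print(ratio, round(observed_skewness(1.0, ratio), 3))
```

With an error variance equal to 40% of the true variance, the observed skewness is roughly 60% of the true skewness, which illustrates why the observed data look closer to normal than the error-free variable.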

In essence, the underlying statistical problem falls within the framework of

“deconvolution,” which deals with the estimation of the distribution function

of a variable measured with error. The term deconvolution is derived from the

fact that the distribution of the underlying variable of the observed data is a

convolution of the distribution of the variable of interest and that of the

measurement error.

Generally speaking, there are three different forms of estimators for

determining distribution functions: parametric, semiparametric, and nonparametric.

Parametric estimation assumes a specific parametric form of the

distribution function (such as normal distributions in Ref. [1] or in Ref. [3]).

Semiparametric estimation assumes that the distribution function is in a class of

certain functionals (such as mixture-of-normals in Ref. [5] and spline functions in

Refs. [6,7]). Nonparametric estimation does not assume any particular functional

form (such as kernel estimators in Ref. [8]).

Using a kernel estimator, Fan shows that the deconvolution problem is one of

the most challenging problems in statistics with an extremely low convergence

rate because of the substantial variance of the estimator.[9] In order to have a

precise estimate, the sample size has to be very large. While nonparametric

estimators are important in the investigation of the theoretical issues of the

deconvolution problem, they are difficult to interpret and could face strong

resistance in the medical community. Besides, from an extensive simulation study

in Ref. [7], it appears that their performance is inferior to that of semiparametric

estimators. From a practitioner’s point of view, when a specific parametric form

requires a very strong assumption and is hard to define, semiparametric estimators

should be considered as viable alternatives because of their flexibility. The spline

function estimator described in Refs. [6,7] performs very well but is difficult to

interpret and is computationally unfriendly. The mixture-of-normals estimator

may be viewed as a direct extension of the normal distribution and is arguably the

most attractive among semiparametric estimators.

In this article we propose a mixture-of-normals estimator for practical use.

This proposed estimator extends and improves upon the mixture-of-normals

estimator described by Cordy and Thomas.[5] The second section of this article

provides the motivation for using this proposed estimator and includes definitions

and notations. “Parameter Estimation” provides the details of the methodology for

using the proposed estimator and the simulation results that demonstrate its

performance. “Application to Real Data” applies the proposed methodology to

data from an actual study. The data example is selected because it carries typical

features of cholesterol measurements, and is presented to illustrate the

performance of the proposed estimator. It is not used to draw any particular

medical conclusions (such as misclassification rates) since the study was not

designed for that purpose. The last section concludes with discussions.

MOTIVATION

In a recent study of label testing to evaluate potential users’ ability to

properly self-select a low-dose statin as a nonprescription adjunct to a healthy diet

and exercise for the treatment of moderate hypercholesterolemia, participants in

the study were administered a questionnaire that asked whether they knew their

specific TC number (“Do You Know Your Total Cholesterol (TC) Number?”). To

determine their eligibility for the test drug, it would be ideal to have their average

TC levels over a short period, such as a week. However, for the sake of label

testing, the participants were only asked to provide their most recent TC levels. If

they did not know the specific number, they were then asked to select one of the

following four possibilities: “<200,” “>240,” “200–240,” and “don’t know.”

A total of 2416 participants (males: 72%; Caucasians: 80%; average age: 56)

were screened and were administered the questionnaire. Among them, 2094

participants either provided a specific TC number or a TC range, and were

included in our analysis. The rest gave a final answer of “don’t know” and were

excluded. Let $Y_i$ represent the reported TC level of the $i$th individual, where $i = 1, \ldots, n$ ($n = 2094$). Notice that $Y_i$ could either be a specific number or be in a range. If it is in a range, we write it as $Y_i \in V_i$, where $V_i$ is one of the following intervals: “<200,” “200–240,” or “>240.” Let $X_i$ be the true or error-free TC for $Y_i$ and let $e_i$ be the corresponding additive measurement error (i.e., $Y_i = X_i + e_i$), independent of $X_i$, where $i = 1, \ldots, n$. For the purpose of the study, $X$ might be considered as the average TC over a week, and $e$ might be considered as the weekly intraindividual variability.

Throughout this article, we assume that participants who reported a specific TC number and participants who reported a TC range have the same distribution function in the true TC levels [i.e., $X_i$ ($i = 1, \ldots, n$) are identically distributed]. As is standard in the literature for measurement error problems, $e$ is assumed to have a normal density with a zero mean and a variance of $\sigma_e^2$, i.e., $N(0, \sigma_e^2)$. Certainly, $\sigma_e^2$ cannot be determined from this study since we only have one observation for each participant. However, the range of $\sigma_e^2$ can be reasonably derived from similar studies. To facilitate the discussion, we assume that $\sigma_e^2$ is known. The density function of a variable from $N(\cdot,\cdot)$ is denoted by $\phi(\cdot,\cdot)$, and the corresponding cumulative distribution function (CDF) is denoted by $\Phi(\cdot,\cdot)$. Let $f_x$ and $f_y$ be the corresponding probability density functions (PDF) of $X$ and $Y$, respectively.

Figure 1 presents the normal Quantile–Quantile plot of the specific TC

values from the study. Clearly, the plot shows that the TC level of the study

population does not have a normal distribution. Instead it is skewed, implying

skewness of the distribution ( fx) of the true (error-free) TC level. In addition to the

skewness, the figure is unclear in depicting other features of fx. We use a mixture-

of-normals density function to approximate fx. A PDF in the form of a mixture-of-

normals can, in theory, approximate a density function within any desired error

range. Assume that $f_x$ is a $p$-component mixture-of-normals whose $j$th component ($j = 1, \ldots, p$) is $\phi(\mu_j, \sigma^2)$ with a coefficient of $\alpha_j$ ($\alpha_j \ge 0$ and $\sum_j \alpha_j = 1$). Without loss of generality, we assume that the $\mu_j$ are ordered such that $\mu_p > \mu_{p-1} > \cdots > \mu_1$. The coefficient $\alpha_j$ can be interpreted as the probability that $X$ is from $\phi(\mu_j, \sigma^2)$, $j = 1, \ldots, p$. Under this assumption,

$$f_y(y) = f_x * \phi(0, \sigma_e^2)(y) = \sum_{j=1}^{p} \alpha_j \, \phi(\mu_j, \sigma^2) * \phi(0, \sigma_e^2)(y) = \sum_{j=1}^{p} \alpha_j \, \phi(\mu_j, \sigma^2 + \sigma_e^2)(y) \qquad (1)$$

where “$*$” denotes convolution. Notice that $f_y$ is a mixture-of-normals with means $\mu_j$ ($j = 1, \ldots, p$) and common variance $\sigma^2 + \sigma_e^2$. As in the case of $X$, Eq. (1) implies that $Y$ is generated from $N(\mu_j, \sigma^2 + \sigma_e^2)$ with probability $\alpha_j$, $j = 1, \ldots, p$. When $Y_i \in V_i$, $f_y(Y_i)$ is defined to be $\int_{V_i} f_y(y)\,dy$, the probability that $Y_i$ falls into $V_i$.
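
As a minimal sketch of how Eq. (1) is evaluated for the two kinds of responses, the Python functions below return the density of the observed TC at a specific value and the probability of a reported range. The function names and the three-component parameter values are hypothetical and serve only as an illustration; only the standard library is used.

```python
import math
from statistics import NormalDist


def fy_specific(y, alphas, mus, sigma2, sigma2_e):
    """Density of the observed TC at a specific value y, as in Eq. (1)."""
    sd = math.sqrt(sigma2 + sigma2_e)
    return sum(a * NormalDist(mu, sd).pdf(y) for a, mu in zip(alphas, mus))


def fy_interval(lo, hi, alphas, mus, sigma2, sigma2_e):
    """Probability that the observed TC falls in the interval (lo, hi)."""
    sd = math.sqrt(sigma2 + sigma2_e)
    return sum(a * (NormalDist(mu, sd).cdf(hi) - NormalDist(mu, sd).cdf(lo))
               for a, mu in zip(alphas, mus))


# Hypothetical three-component model, for illustration only.
alphas, mus, sigma2, sigma2_e = [0.3, 0.5, 0.2], [200.0, 230.0, 270.0], 350.0, 238.0
ranges = [(-math.inf, 200.0), (200.0, 240.0), (240.0, math.inf)]
probs = [fy_interval(lo, hi, alphas, mus, sigma2, sigma2_e) for lo, hi in ranges]
print(probs, sum(probs))
```

The three range probabilities sum to one, so a participant who reports only a range contributes a proper probability to the likelihood in the same way that a specific value contributes a density.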

PARAMETER ESTIMATION

Constrained Maximum Likelihood Estimation by EM Algorithm

Figure 1. Normal-QQ plot of specific TC levels. It shows clear deviation from a normal distribution, typical of cholesterol measurements.

The log-likelihood based on the observed data, $Y = (Y_1, \ldots, Y_n)^T$, is given by

$$L(\alpha, \sigma^2 \mid Y) = \sum_{i=1}^{n} \log\{ f_y(Y_i) \}$$

where $f_y$ is defined in Eq. (1). Certainly, it is difficult to directly maximize $L(\alpha, \sigma^2 \mid Y)$ to obtain the maximum likelihood estimator (MLE) of $f_x$. However, the fact that the coefficients have a probabilistic interpretation leads to a simple EM algorithm for the maximum likelihood estimation. In this subsection, we discuss the constrained maximum likelihood estimation of $\alpha = (\alpha_1, \ldots, \alpha_p)^T$ and $\sigma^2$ for fixed $p$ and $\mu_j$ ($j = 1, \ldots, p$). In the next subsection we provide guidelines on the choice of $\mu_j$ ($j = 1, \ldots, p$) and $p$.

Cordy and Thomas proposed a mixture-of-normals estimator with equally

spaced means and a common variance to approximate the distribution function of

a variable measured with error.[5] The size of the common variance ($\sigma^2$) and the choice of $p$ were suggested from a simulation study. One major difference between our proposed methodology and the approach described by Cordy and Thomas is that we determine the common variance directly from the data, which makes the choice of $p$ less of an issue. Specifically, for fixed $p$ and $\mu_j$ ($j = 1, \ldots, p$), we estimate the parameters, $\alpha$ and $\sigma^2$, by maximizing the log-likelihood function $L(\alpha, \sigma^2 \mid Y)$ subject to the constraints on the first two moments:

$$E\{Y\} = E\{X\} = \sum_{j=1}^{p} \alpha_j \mu_j \quad \text{and} \quad E\{Y^2\} = E\{X^2\} + \sigma_e^2 = \sum_{j=1}^{p} \alpha_j \mu_j^2 + \sigma^2 + \sigma_e^2 \qquad (2)$$

with $E\{Y\}$ and $E\{Y^2\}$ replaced with the corresponding first two empirical moments of the specific TC levels, $\hat{E}\{Y\}$ and $\hat{E}\{Y^2\}$. Notice that under these two constraints and the natural constraint on the coefficients ($\sum_j \alpha_j = 1$), the effective number of parameters to estimate is $(p - 2)$. Updated estimates of $\alpha$ and $\sigma^2$ in an EM algorithm are obtained through the steps described next. In the following equations, $K_j$ denotes the number of participants with true TC levels from the $j$th normal component of the mixture-of-normals in Eq. (1).

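E-Step: Calculate the conditional expectations to estimate $K_j$ ($j = 1, \ldots, p$) by

$$K_j = E\{K_j \mid Y; \alpha^{\mathrm{old}} \text{ and } \sigma^{2,\mathrm{old}}\} = \sum_{i=1}^{n} \left( \frac{\alpha_j^{\mathrm{old}} \, \phi(\mu_j, \sigma^{2,\mathrm{old}} + \sigma_e^2)(Y_i)}{\sum_{l} \alpha_l^{\mathrm{old}} \, \phi(\mu_l, \sigma^{2,\mathrm{old}} + \sigma_e^2)(Y_i)} \right).$$

In this equation, $\alpha^{\mathrm{old}}$ is a prior estimate of $\alpha$, and $\sigma^{2,\mathrm{old}}$ is a prior estimate of $\sigma^2$. In addition, it is understood that when $Y_i \in V_i$, $\phi(\mu_l, \sigma^{2,\mathrm{old}} + \sigma_e^2)(Y_i) = \int_{V_i} \phi(\mu_l, \sigma^{2,\mathrm{old}} + \sigma_e^2)(y)\,dy$. Regardless of whether $Y_i$ is a specific value or an interval, the summand on the right-hand side of the above equation represents the conditional probability that $Y_i$ is from the $j$th component. Therefore, the EM algorithm is naturally suitable for interval data.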

M-Step: Calculate the maximum likelihood estimate of $\alpha$, which is referred to as $\alpha^{\mathrm{new}}$, by maximizing the log-likelihood function of the multinomial distribution, $\sum_j K_j \log(\alpha_j^{\mathrm{new}})$, subject to $\sum_j \alpha_j^{\mathrm{new}} = 1$ and $\hat{E}\{Y\} = \sum_{j=1}^{p} \alpha_j^{\mathrm{new}} \mu_j$. For $j = 1, \ldots, p$, the MLE of the mixture-of-normals coefficients is given as

$$\alpha_j^{\mathrm{new}} = \frac{K_j}{n - \lambda [\hat{E}\{Y\} - \mu_j]}$$

where $\lambda$, serving as a Lagrange multiplier, is the unique solution of the following equation, calculated by the standard Newton–Raphson method:

$$h(\lambda) = \sum_{j=1}^{p} \frac{K_j [\hat{E}\{Y\} - \mu_j]}{n - \lambda [\hat{E}\{Y\} - \mu_j]} = 0.$$

One way to see the uniqueness of this solution is by noticing that the first derivative of $h(\lambda)$ with respect to $\lambda$ is strictly greater than zero. Once $\alpha^{\mathrm{new}}$ is obtained, the maximum likelihood estimate of $\sigma^2$, referred to as $\sigma^{2,\mathrm{new}}$, is calculated from the constraint on the second moment in Eq. (2). That is, $\sigma^{2,\mathrm{new}} = \hat{E}\{Y^2\} - \sigma_e^2 - \sum_{j=1}^{p} \alpha_j^{\mathrm{new}} \mu_j^2$.

The EM algorithm is very fast and usually converges in fewer than 20 iterations. Once the EM algorithm converges and the estimates, $\hat{\alpha}$ and $\hat{\sigma}^2$, are obtained, the corresponding estimate of $f_x$ is

$$\hat{f}_x(x) = \sum_{j=1}^{p} \hat{\alpha}_j \, \phi(\mu_j, \hat{\sigma}^2)(x). \qquad (3)$$
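
The E-step and M-step can be sketched in Python as follows. This is an illustrative reading of the steps, not the authors' implementation. Several details are assumptions of the sketch: the starting values, the stopping rule, a bisection fallback that keeps the Newton–Raphson iterate where every denominator is positive, and a small positive floor on the updated common variance. The function names are hypothetical and only the standard library is used.

```python
import math
from statistics import NormalDist


def solve_lambda(K, d, n, tol=1e-12, max_iter=200):
    """Solve h(lam) = sum_j K_j d_j / (n - lam d_j) = 0, where d_j = E{Y} - mu_j.

    Newton-Raphson steps are used; a bisection fallback (an added safeguard,
    not part of the article) keeps lam where all denominators are positive.
    """
    if not min(d) < 0.0 < max(d):
        raise ValueError("empirical mean must lie strictly between the extreme means")
    lo, hi = n / min(d), n / max(d)           # open interval of admissible lam
    lam = 0.0
    for _ in range(max_iter):
        h = sum(k * dj / (n - lam * dj) for k, dj in zip(K, d))
        dh = sum(k * dj * dj / (n - lam * dj) ** 2 for k, dj in zip(K, d))
        step = h / dh
        if abs(step) <= tol * (1.0 + abs(lam)):
            break
        if h > 0.0:                           # h is increasing, so the root is below lam
            hi = lam
        else:
            lo = lam
        new = lam - step
        lam = new if lo < new < hi else 0.5 * (lo + hi)
    return lam


def em_fit(points, intervals, mus, sigma2_e, max_iter=200, tol=1e-8, floor=1e-6):
    """Constrained EM for (alpha, sigma2) with fixed means mus and known sigma2_e.

    points are specific values, intervals are (lo, hi) pairs. The floor on
    sigma2 and the starting values are assumptions of this sketch.
    """
    n, p = len(points) + len(intervals), len(mus)
    m1 = sum(points) / len(points)                  # empirical E{Y}
    m2 = sum(y * y for y in points) / len(points)   # empirical E{Y^2}
    d = [m1 - mu for mu in mus]
    alpha = [1.0 / p] * p
    sigma2 = max(0.5 * (m2 - m1 * m1 - sigma2_e), floor)
    for _ in range(max_iter):
        comps = [NormalDist(mu, math.sqrt(sigma2 + sigma2_e)) for mu in mus]
        K = [0.0] * p
        for y in points:                            # E-step: specific values
            w = [a * c.pdf(y) for a, c in zip(alpha, comps)]
            total = sum(w)
            K = [k + wj / total for k, wj in zip(K, w)]
        for lo, hi in intervals:                    # E-step: interval data
            w = [a * (c.cdf(hi) - c.cdf(lo)) for a, c in zip(alpha, comps)]
            total = sum(w)
            K = [k + wj / total for k, wj in zip(K, w)]
        lam = solve_lambda(K, d, n)                 # M-step
        new_alpha = [k / (n - lam * dj) for k, dj in zip(K, d)]
        second = sum(a * mu * mu for a, mu in zip(new_alpha, mus))
        new_sigma2 = max(m2 - sigma2_e - second, floor)
        change = max(abs(a - b) for a, b in zip(alpha, new_alpha)) + abs(new_sigma2 - sigma2)
        alpha, sigma2 = new_alpha, new_sigma2
        if change < tol:
            break
    return alpha, sigma2
```

A call such as `em_fit(specific_values, reported_ranges, means, 238.0)` returns the coefficient vector and the common variance. By construction the returned coefficients sum to one and reproduce the empirical mean of the specific values, which are the first-moment constraints described above.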

Choice of $p$ and $\mu_j$ ($j = 1, \ldots, p$)

Ideally, we need p to be as large as possible to approximate fx accurately.

However, the variance of fx increases with p. The introduction of the constraint on

the first two moments is partly intended to downplay the effect of p. It is

anticipated that when assisted with the two constraints, the variance of fx does not

increase drastically with p. As a result, a slight mis-specification of p does not

greatly affect the performance of fx. Given the difficulty in specifying p, the

introduction of the two constraints in Eq. (2) plays a remarkable role in the

estimation. In practice, for data of moderate sample size (say, 200–4000) p may be

taken from 5 to 15.

The location of $\mu_j$ ($j = 1, \ldots, p$) plays a less significant role than the choice of $p$ because, generally, the normal components are highly correlated with each other. Misplacing a normal mean, as long as the misplacement is not unreasonable, would not greatly affect the overall performance of the estimator. Nevertheless, we try to optimize the selection procedure and locate the means at the average of those from an equal-probability scheme and an equal-space scheme. The same mixing strategy was used in the location of join points of spline functions in Refs. [6,7].




In both the equal-probability scheme and the equal-space scheme, $\mu_1$ is always fixed at the $[0.05n]$-th smallest specific TC and $\mu_p$ is always fixed at the $[0.05n]$-th largest specific TC, where $[0.05n]$ is the closest integer to $0.05n$. The remaining means in the equal-probability scheme, $\mu_{j,ep}$ ($j = 2, \ldots, p - 1$), are placed so that there is approximately the same number of specific TC numbers in the interval between two consecutive means. The remaining means in the equal-space scheme, $\mu_{j,es}$ ($j = 2, \ldots, p - 1$), are equally spaced between the first and the last. The intermediate means used in the model are then calculated as $\mu_j = 0.5(\mu_{j,ep} + \mu_{j,es})$, $j = 2, \ldots, p - 1$. Under the mixing strategy, $\mu_j$ ($j = 1, \ldots, p$) tend to be more effectively located, and the resulting estimates of $f_x$ tend to be more stable.
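
One possible reading of this mixing strategy is sketched below in Python. The equal-probability means are taken at equally spaced ranks between the two anchoring order statistics, which is an interpretation of "approximately the same number of specific TC numbers" between consecutive means rather than a rule stated in the article. The function name is hypothetical.

```python
def locate_means(values, p):
    """Average of equal-probability and equal-space means for p >= 2 components."""
    if p < 2:
        raise ValueError("need at least two components")
    ys = sorted(values)
    n = len(ys)
    k = max(1, round(0.05 * n))        # [0.05 n]; Python rounds exact halves to even
    lo_idx, hi_idx = k - 1, n - k      # 0-based ranks of the k-th smallest / largest
    first, last = ys[lo_idx], ys[hi_idx]
    means = []
    for j in range(p):
        frac = j / (p - 1)
        ep = ys[round(lo_idx + frac * (hi_idx - lo_idx))]   # equal-probability scheme
        es = first + frac * (last - first)                  # equal-space scheme
        means.append(0.5 * (ep + es))
    return means
```

For the first and last components both schemes coincide with the anchoring order statistics, so the averaging only affects the intermediate means.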

Simulation Study

Data used in our simulation study, $\{Y_i; i = 1, \ldots, n\}$, take specific values, and are generated as a sum of true data and measurement error. The true (or error-free) data, $\{X_i; i = 1, \ldots, n\}$, are generated from a predetermined distribution function and the measurement errors, $\{e_i; i = 1, \ldots, n\}$, are generated from $N(0, \sigma_e^2)$ with $\sigma_e^2$ being treated as known. In our simulation study, $n$ is held fixed at 2000 to be comparable with that in the previously mentioned study. Two right skewed density functions, one unimodal and the other bimodal, are used to generate $X$. They are $0.2N(0, 1) + 0.2N(-3/5, 4/9) + 0.6N(-3/2, 25/81)$ and $0.7N(0, 1) + 0.3N(-3/2, 2/9)$, respectively.
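
A minimal Python sketch of this data-generating step is given below. It assumes that the second argument of each $N(\cdot,\cdot)$ is a variance, as elsewhere in the article, and it sets the error variance through a ratio `r` of error variance to the variance of $X$. The names `simulate` and `mixture_variance` are illustrative.

```python
import math
import random

# (weight, mean, variance) triples, as read from the two densities in the text.
SKEWED_UNIMODAL = [(0.2, 0.0, 1.0), (0.2, -3 / 5, 4 / 9), (0.6, -3 / 2, 25 / 81)]
SKEWED_BIMODAL = [(0.7, 0.0, 1.0), (0.3, -3 / 2, 2 / 9)]


def mixture_variance(mix):
    """Variance of a finite mixture of normals."""
    mean = sum(w * m for w, m, _ in mix)
    return sum(w * (v + m * m) for w, m, v in mix) - mean * mean


def simulate(mix, r, n, rng):
    """Draw n pairs (X, Y = X + e) with Var(e) = r * Var(X)."""
    sd_e = math.sqrt(r * mixture_variance(mix))
    weights = [w for w, _, _ in mix]
    xs, ys = [], []
    for _ in range(n):
        _, m, v = rng.choices(mix, weights=weights)[0]   # pick a component
        x = rng.gauss(m, math.sqrt(v))                   # error-free value
        xs.append(x)
        ys.append(x + rng.gauss(0.0, sd_e))              # observed value
    return xs, ys


xs, ys = simulate(SKEWED_UNIMODAL, 0.2, 2000, random.Random(0))
```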

An important measure of contamination of the data is the ratio (denoted by r )

of the variance of the measurement error to the variance of X. Two different

values, 0.2 and 0.4, were considered for r. The selected values were in a range

typical for TC measurements.[1] The number of normal components ( p )

considered for the proposed estimator ranged from 5 to 11. As expected, the

performance of the proposed estimator varied little with p. We only report the

outcomes based on p = 6 and p = 10, which result in 4 and 8 effective parameters,

respectively.

Two CDF estimators are used for comparison. The first one is $\hat{F}_x$, the CDF estimator corresponding to $\hat{f}_x$ in Eq. (3), and the second one is $\tilde{F}_x$, or $N(\bar{Y}, s_y^2 - \sigma_e^2)$, where $\bar{Y}$ is the sample mean of $Y$ and $s_y^2$ is the sample variance of $Y$. The first one ($\hat{F}_x$) is the proposed estimator, and the second one ($\tilde{F}_x$) is essentially the estimator based on the normality assumption discussed in Refs. [1,3]. Notice that $\hat{F}_x$ and $\tilde{F}_x$ have the same mean and variance. We compare the bias of the two CDF estimators at selected percentiles of the underlying distribution function of $X$. Denoting by $t_\alpha$ the $100\alpha$-percentile of the distribution function of $X$, the bias of an estimator, such as $\tilde{F}_x$, at $t_\alpha$ is $(\tilde{F}_x(t_\alpha) - \alpha)$.

Table 1 reports the average bias and standard deviation in percentage points (i.e., multiplied by 100) of the two estimators from 1000 replicates. Clearly, the proposed estimator, with either p = 6 or p = 10, has much less bias and slightly greater standard deviation than the estimator based on the normality assumption ($\tilde{F}_x$). Overall, the proposed estimator is superior to the estimator based on the normality assumption. As predicted, the performance of the proposed estimator ($\hat{F}_x$) varies little with $p$.

Table 1. Empirical Bias (Standard Deviation) in Percentage Points Based on 1000 Replicates at 100α-Percentiles of the Underlying Distribution of X

                                                                           α
Estimator                r     0.975       0.95        0.90        0.75        0.50        0.25        0.10        0.05        0.025

Distribution of X: skewed unimodal
$\tilde{F}_x$            0.2   -1.6 (0.2)  -2.2 (0.4)  -1.6 (0.7)   4.0 (1.1)   6.2 (1.0)   0.7 (0.8)  -3.0 (0.6)  -3.5 (0.5)  -3.2 (0.4)
$\hat{F}_x$, p = 6       0.2   -0.1 (0.4)   0.5 (0.6)   0.3 (0.7)  -0.1 (1.1)   1.3 (1.3)   0.2 (0.9)  -0.8 (0.7)  -0.8 (0.6)  -0.6 (0.5)
$\hat{F}_x$, p = 10      0.2   -0.1 (0.4)   0.5 (0.5)   0.3 (0.7)  -0.2 (1.0)   1.4 (1.2)   0.3 (1.0)  -0.7 (0.7)  -0.8 (0.6)  -0.7 (0.4)
$\tilde{F}_x$            0.4   -1.6 (0.2)  -2.2 (0.4)  -1.6 (0.7)   4.0 (1.2)   6.2 (1.1)   0.7 (0.9)  -3.0 (0.7)  -3.5 (0.6)  -3.2 (0.5)
$\hat{F}_x$, p = 6       0.4    0.1 (0.5)   0.8 (0.7)   0.2 (0.7)  -0.3 (1.2)   1.2 (1.4)   0.4 (1.2)  -0.6 (1.0)  -0.7 (0.8)  -0.6 (0.6)
$\hat{F}_x$, p = 10      0.4    0.0 (0.5)   0.8 (0.7)   0.3 (0.8)  -0.3 (1.2)   1.3 (1.4)   0.3 (1.2)  -0.7 (0.9)  -0.8 (0.7)  -0.7 (0.5)

Distribution of X: skewed bimodal
$\tilde{F}_x$            0.2   -0.4 (0.2)  -0.8 (0.4)  -1.4 (0.6)  -1.9 (0.9)   3.2 (1.0)   4.0 (0.7)  -1.0 (0.5)  -2.5 (0.4)  -2.7 (0.4)
$\hat{F}_x$, p = 6       0.2    0.1 (0.4)   0.3 (0.6)   0.0 (0.7)   0.1 (1.1)  -0.4 (1.2)   0.9 (1.1)   0.1 (0.8)  -0.4 (0.6)  -0.5 (0.4)
$\hat{F}_x$, p = 10      0.2    0.0 (0.5)   0.4 (0.6)   0.3 (0.7)  -0.1 (1.2)  -0.2 (1.3)   0.7 (1.2)  -0.4 (0.8)  -0.6 (0.7)  -0.5 (0.5)
$\tilde{F}_x$            0.4   -0.3 (0.3)  -0.7 (0.4)  -1.3 (0.6)  -1.8 (1.0)   3.3 (1.1)   4.1 (0.9)  -1.0 (0.6)  -2.4 (0.5)  -2.7 (0.4)
$\hat{F}_x$, p = 6       0.4    0.6 (0.5)   0.5 (0.6)  -0.2 (0.8)  -0.3 (1.3)  -0.4 (1.4)   1.1 (1.3)   0.3 (0.9)  -0.4 (0.7)  -0.8 (0.6)
$\hat{F}_x$, p = 10      0.4    0.3 (0.7)   0.7 (0.6)   0.6 (0.7)  -0.5 (1.1)  -0.3 (1.4)   1.3 (1.2)  -0.5 (1.0)  -0.7 (0.8)  -0.9 (0.6)

APPLICATION TO REAL DATA

We now use the data from the previously mentioned study to demonstrate the magnitude of the difference in key estimates between the mixture-of-normals approach and the conventional normality approach. Of the 2094 participants included in the analysis, 1672 (79.8%) provided specific TC levels with a mean of 234 mg/dL and a variance of 952 (mg/dL)². The remaining 422 participants provided a TC range, with 11 (3%) having a TC level below 200 mg/dL, 317 (75%) having a TC level between 200 and 240 mg/dL, and 94 (22%) having a TC level greater than 240 mg/dL. The normality-based density estimator ($\tilde{f}_x$) for the data would be $N(234, 952 - \sigma_e^2)$.

The true TC level that we are interested in is each participant’s average TC

level during the course of a week. It has been estimated from a reliability sub-study

of the well-known Framingham Heart Study that TC measurements taken one

week apart have an intra-individual variance of 238 (mg/dL)².[1,10] Clearly, if the

observed TC level is the average of two independent measurements taken within a

week, the intra-individual variance would be 119 (mg/dL)².

In our study, the TC levels reported by some of the participants might actually be based on more than one measurement. We consider both 119 and 238 as the possible values of $\sigma_e^2$. Using the sample variance of the participants’ specific TC values, the ratio of the variance of the measurement error to that of the true TC is 33.3% when $\sigma_e^2 = 238$ and is 14.3% when $\sigma_e^2 = 119$. It is understood that the true value of $\sigma_e^2$ for the data may not be exactly 119 or 238, but may actually be somewhere between these two values. Nevertheless, we use these values to illustrate the performance of the proposed estimator and highlight its difference from the estimator based on the normality assumption.
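
The two ratios follow from subtracting the error variance from the observed variance of 952 to obtain the variance of the true TC. A small Python check of this arithmetic is shown below; the function name is an illustrative choice.

```python
def error_to_true_variance_ratio(observed_variance, error_variance):
    """Ratio of the measurement-error variance to the variance of the true value."""
    true_variance = observed_variance - error_variance
    if true_variance <= 0:
        raise ValueError("error variance must be smaller than the observed variance")
    return error_variance / true_variance


for sigma2_e in (238, 119):
    ratio = error_to_true_variance_ratio(952, sigma2_e)
    print(sigma2_e, round(100 * ratio, 1))
```

Here 238 / (952 - 238) = 238 / 714 and 119 / (952 - 119) = 119 / 833, which give the 33.3% and 14.3% quoted in the text.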

The number of normal components ($p$) we have tried ranges from 6 to 10. Because of the large sample size in this study, the density estimator, $\hat{f}_x$, changes little with $p$, as is demonstrated in the simulation described in the preceding section. The results presented here are based on an estimator of eight normal components with the following means: 197.0, 210.9, 221.2, 232.1, 242.0, 250.3, 263.1, and 286. The estimated common variance is $\hat{\sigma}^2 = 342.0$ when $\sigma_e^2 = 238$, and $\hat{\sigma}^2 = 384.9$ when $\sigma_e^2 = 119$. The estimated coefficients are $\hat{\alpha} = (0.057, 0.000, 0.099, 0.742, 0.001, 0.000, 0.000, 0.102)^T$ when $\sigma_e^2 = 238$, and $\hat{\alpha} = (0.072, 0.001, 0.223, 0.501, 0.043, 0.020, 0.041, 0.099)^T$ when $\sigma_e^2 = 119$. For each $\sigma_e^2$, the corresponding density estimator $\hat{f}_x$ shows moderate skewness that was also found in the specific TC levels.

Figure 2 presents the difference of the estimated CDFs ($\hat{F}_x - \tilde{F}_x$) in percentage points by $\sigma_e^2$. The difference is higher if there is more measurement error, and takes its largest values at approximately 250. We do not know the true distribution for these data, but judging from the simulation results in the preceding section, it is reasonable to suggest that the difference was largely due to the bias of the normality estimator. From the figure it is clear that the normality approach would very likely over-estimate the proportion of patients with TC ≤ 200 mg/dL and under-estimate the proportion of patients with TC ≤ 240 mg/dL, each by a couple of percentage points. Given the importance of the two cut-off points (200 mg/dL and 240 mg/dL) in cholesterol control, this result shows the normality approach would be inappropriate.

We further compare fx and fx in the evaluation of two misclassification rates;

the false negative rate (FN) and the false positive rate (FP). A false negative

occurs when an individual is classified as having TC # y with y # 240 when

actually the true TC is .240 (“high risk” according to an expert panel of the

NCEP[2]). A false positive occurs when an individual is classified as having

TC $ y with y . 200 when actually the true TC is ,200 (“desirable” according

to an expert panel of the NCEP[2]). The two rates are formally defined in the

following expressions:

FNðyÞ ¼ PrðY # yjX . 240Þ and FPðyÞ ¼ PrðY . yjX , 200Þ:

Figure 2. Difference in percentage points between the two estimated CDFs ðFxðxÞ2 ~FxðxÞÞ: The

left figure is based on s 2e ¼ 238 and the right figure is based on s 2

e ¼ 119:




Dow

nloa

ded

by [

Um

eå U

nive

rsity

Lib

rary

] at

10:

55 2

1 N

ovem

ber

2014

A serious false negative occurs at $y = 200$, and a serious false positive occurs at $y = 240$.

Under the mixture-of-normals assumption, the joint density function of $(X, Y)$ is estimated to be

$$f_{(X,Y)}(x, y) = \sum_{j=1}^{p} \alpha_j f_{(X_j,Y_j)}(x, y),$$

where $(X_j, Y_j)$ has a bivariate normal distribution with mean vector $(\mu_j, \mu_j)$ and a common variance–covariance matrix with the element corresponding to the variance of $Y_j$ being $(\sigma^2 + \sigma_e^2)$ and the rest being $\sigma^2$. The FN and the FP using the mixture-of-normals estimator are given as

$$\mathrm{FN}_{f_x}(y) = \frac{\sum_{j=1}^{p} \alpha_j \int_0^{y} \int_{240}^{\infty} f_{(X_j,Y_j)}(x, y)\,dx\,dy}{\sum_{j=1}^{p} \alpha_j \int_{240}^{\infty} f_{X_j}(x)\,dx}$$

and

$$\mathrm{FP}_{f_x}(y) = \frac{\sum_{j=1}^{p} \alpha_j \int_{y}^{\infty} \int_0^{200} f_{(X_j,Y_j)}(x, y)\,dx\,dy}{\sum_{j=1}^{p} \alpha_j \int_0^{200} f_{X_j}(x)\,dx},$$

respectively.
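The two ratios above can be evaluated numerically. The sketch below is one possible way to do so, not the computation used for Figure 3. It relies on the fact that the stated covariance structure is equivalent to $Y_j = X_j + e$ with $e \sim N(0, \sigma_e^2)$ independent of $X_j$, so the inner integral over $y$ reduces to a difference of normal CDFs and only the integral over $x$ needs quadrature (a midpoint rule here). The mixture weights, means, and $\sigma$ are hypothetical; only $\sigma_e^2 = 238$ is taken from the text.

```python
from math import sqrt
from statistics import NormalDist

INF = float("inf")


def joint_prob(mu, sigma, sigma_e, x_lo, x_hi, y_lo, y_hi, n=4000):
    """Pr(x_lo < X < x_hi, y_lo < Y <= y_hi) for X ~ N(mu, sigma^2) and
    Y = X + e with e ~ N(0, sigma_e^2), using a midpoint rule over x."""
    x_lo = max(x_lo, mu - 10.0 * sigma)  # drop numerically negligible tails
    x_hi = min(x_hi, mu + 10.0 * sigma)
    if x_lo >= x_hi:
        return 0.0
    x_dist, err = NormalDist(mu, sigma), NormalDist(0.0, sigma_e)
    h = (x_hi - x_lo) / n
    total = 0.0
    for i in range(n):
        x = x_lo + (i + 0.5) * h
        total += x_dist.pdf(x) * (err.cdf(y_hi - x) - err.cdf(y_lo - x))
    return total * h


def fn_fp_mixture(y, alphas, mus, sigma, sigma_e):
    """FN(y) and FP(y) under a mixture of bivariate normals."""
    comps = list(zip(alphas, mus))
    fn_num = sum(a * joint_prob(m, sigma, sigma_e, 240.0, INF, 0.0, y) for a, m in comps)
    fn_den = sum(a * (1.0 - NormalDist(m, sigma).cdf(240.0)) for a, m in comps)
    fp_num = sum(a * joint_prob(m, sigma, sigma_e, 0.0, 200.0, y, INF) for a, m in comps)
    fp_den = sum(
        a * (NormalDist(m, sigma).cdf(200.0) - NormalDist(m, sigma).cdf(0.0))
        for a, m in comps
    )
    return fn_num / fn_den, fp_num / fp_den


# Hypothetical mixture parameters (assumptions); sigma_e^2 = 238 as in the text.
alphas, mus, sigma = [0.5, 0.3, 0.2], [200.0, 230.0, 270.0], 15.0
sigma_e = sqrt(238.0)
for y in (200.0, 220.0, 240.0):
    fn, fp = fn_fp_mixture(y, alphas, mus, sigma, sigma_e)
    print(f"y = {y:.0f}: FN = {100 * fn:.2f}%, FP = {100 * fp:.2f}%")
```

As expected from the definitions, FN increases and FP decreases as the threshold $y$ moves from 200 to 240.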

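Under the normality assumption, the FN and the FP are written as follows:

$$\mathrm{FN}_{\tilde{f}_x}(y) = \frac{\int_0^{y} \int_{240}^{\infty} \tilde{f}_{(X,Y)}(x, y)\,dx\,dy}{\int_{240}^{\infty} \tilde{f}_X(x)\,dx} \quad \text{and} \quad \mathrm{FP}_{\tilde{f}_x}(y) = \frac{\int_{y}^{\infty} \int_0^{200} \tilde{f}_{(X,Y)}(x, y)\,dx\,dy}{\int_0^{200} \tilde{f}_X(x)\,dx}.$$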

In these last two equations, $(X, Y)$ has a bivariate normal distribution with mean vector $(\bar{Y}, \bar{Y})$ and a variance–covariance matrix with the element corresponding to the variance of $Y$ being $\sigma_y^2$ and the rest being $(\sigma_y^2 - \sigma_e^2)$.
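A minimal sketch of the normality-based rates follows, under the assumption that the covariance structure just described can be read as $X \sim N(\bar{Y}, \sigma_y^2 - \sigma_e^2)$ with $Y$ given $X = x$ distributed as $N(x, \sigma_e^2)$. The function name and the input values for the mean and $\sigma_y^2$ are hypothetical; only $\sigma_e^2 = 238$ comes from the text.

```python
from math import sqrt
from statistics import NormalDist

INF = float("inf")


def fn_fp_normal(y, y_bar, var_y, var_e, n=4000):
    """FN(y) and FP(y) when (X, Y) is a single bivariate normal with mean
    (y_bar, y_bar), Var(Y) = var_y and Var(X) = Cov(X, Y) = var_y - var_e."""
    var_x = var_y - var_e
    if var_x <= 0.0:
        raise ValueError("var_y must exceed var_e")
    x_dist = NormalDist(y_bar, sqrt(var_x))
    err = NormalDist(0.0, sqrt(var_e))  # Y - X given X is N(0, var_e)

    def joint(x_lo, x_hi, y_lo, y_hi):
        # Midpoint rule for Pr(x_lo < X < x_hi, y_lo < Y <= y_hi).
        x_lo = max(x_lo, y_bar - 10.0 * x_dist.stdev)
        x_hi = min(x_hi, y_bar + 10.0 * x_dist.stdev)
        if x_lo >= x_hi:
            return 0.0
        h = (x_hi - x_lo) / n
        mids = (x_lo + (i + 0.5) * h for i in range(n))
        return h * sum(
            x_dist.pdf(x) * (err.cdf(y_hi - x) - err.cdf(y_lo - x)) for x in mids
        )

    fn = joint(240.0, INF, 0.0, y) / (1.0 - x_dist.cdf(240.0))
    fp = joint(0.0, 200.0, y, INF) / (x_dist.cdf(200.0) - x_dist.cdf(0.0))
    return fn, fp


# Hypothetical sample mean and variance of Y (assumptions); var_e as in the text.
y_bar, var_y, var_e = 223.0, 1184.0, 238.0
for y in (200.0, 220.0, 240.0):
    fn, fp = fn_fp_normal(y, y_bar, var_y, var_e)
    print(f"y = {y:.0f}: FN = {100 * fn:.2f}%, FP = {100 * fp:.2f}%")
```

The check on `var_x` reflects a practical limitation of this approach: it only makes sense when the observed variance exceeds the measurement-error variance.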

Clearly, from the definition of the misclassification rates, the better the

estimator of the density function of X is, the closer the estimated misclassification

rates are to the truth. The simulation results in the preceding section imply that the

estimates of the misclassification rates from the mixture-of-normals approach

should be much closer to the truth than those from the normality

approach.

Figure 3 presents the difference in estimated misclassification rates between

the two estimators ($\mathrm{FN}_{f_x} - \mathrm{FN}_{\tilde{f}_x}$ and $\mathrm{FP}_{f_x} - \mathrm{FP}_{\tilde{f}_x}$) for $200 \le y \le 240$. Clearly, the

normality approach under-reports misclassification rates as compared to the

mixture-of-normals approach. Therefore, it is very likely that the normality

approach would also under-report the true misclassification rates and would be

inadequate for the evaluation purpose.





CONCLUSIONS

A mixture-of-normals estimator is proposed for estimating the distribution

function of a variable measured with errors. The constrained MLEs of the

parameters are obtained by an EM algorithm, which naturally takes care of interval

data. With the two constraints imposed on the two moments, the proposed estimator

directly estimates the common variance of the normal components. More

importantly, the added constraints help strengthen the robustness of the

mixture-of-normals estimator, making it less dependent upon the number of normal

components. Guidelines on the choice of normal means are provided, which make

the estimator particularly appealing to practitioners. In a simulation study, the

proposed estimator performed better than an estimator based on the normality

assumption, which was frequently used in the literature to address similar issues.

The proposed method was applied to the analysis of TC data based on a population potentially interested in the treatment of moderate hypercholesterolemia. The difference in key estimates between the proposed approach and the normality approach could be as large as a couple of percentage points. Judging from the simulation results, it is reasonable to suggest that the normality approach would very likely be inadequate for the analysis of such data. Encouraged by the improvement of the mixture-of-normals approach over the conventional normality approach, research work is underway on extending it to the study of the joint distribution of two variables (e.g., LDL and HDL, which play a critical role in the new guidelines of NCEP[2]). A mixture of bivariate normals with a common correlation coefficient has been considered for the approximation of the joint distribution. Preliminary results show that a constrained MLE with constraints imposed on the variance–covariance matrix has very good performance.

Figure 3. Difference in percentage points between the two estimated FNs, $\mathrm{FN}_{f_x}(y) - \mathrm{FN}_{\tilde{f}_x}(y)$, and between the two estimated FPs, $\mathrm{FP}_{f_x}(y) - \mathrm{FP}_{\tilde{f}_x}(y)$. The left figure is based on $\sigma_e^2 = 238$, and the right figure is based on $\sigma_e^2 = 119$.

ACKNOWLEDGMENTS

We are thankful to our colleague Dr. John R. Cook for his insightful comments that

have led to a much improved presentation. We also thank two anonymous referees for

their helpful suggestions.

REFERENCES

1. Roeback, J.R.; Cook, J.R.; Guess, H.A.; Heyse, J.F. Time-Dependent Variability in

Repeated Measurements of Cholesterol Levels: Clinical Implications for Risk

Misclassification and Intervention Monitoring. J. Clin. Epidemiol. 1993, 46,

1159–1171.

2. National Heart, Lung, and Blood Institute. Executive Summary of the Third National

Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation and

Treatment of High Blood Cholesterol in Adults (Adult Treatment Panel III). J. Am.

Med. Assoc. 2001, 285, 2486–2497.

3. Oppenheimer, L.; Kher, U. The Impact of Measurement Error on the Comparison of

Two Treatments Using a Responder Analysis. Stat. Med. 1999, 18, 2177–2188.

4. Fuller, W.A. Measurement Error Models; Wiley: New York, 1987.

5. Cordy, C.B.; Thomas, D.R. Deconvolution of a Distribution Function. J. Am. Stat.

Assoc. 1997, 92, 1459–1465.

6. Chen, C.; Fuller, W.A.; Breidt, J. A Semiparametric Estimator of the Density

Function of a Variable Measured with Error. Commun. Stat.—Theory Methods

2000, 29, 1293–1310.

7. Chen, C.; Fuller, W.A.; Breidt, J. Spline Estimator of the Density Function of a

Variable Measured with Error. Commun. Stat.—Simul. Comput. 2003, 32, in press.

8. Stefanski, L.A.; Carroll, R.J. Deconvoluting Kernel Density Estimators. Statistics

1990, 21, 249–259.

9. Fan, J. On the Optimal Rates of Convergence for Nonparametric Deconvolution

Problem. Ann. Stat. 1991, 19, 1257–1272.

10. Gordon, T.; Kannel, W.B. Introduction and General Background: The Framingham

Study: An Epidemiological Investigation of Cardiovascular Disease. Section 1; NIH

Publications, 1968, p. 188.

Received March 2002

Revised May 2002, June 2002

Accepted June 2002



