Approximate Local D-optimal Experimental Design for Binary ... dms/GLM_Design/Technical Report...¢ 

  • View
    0

  • Download
    0

Embed Size (px)

Text of Approximate Local D-optimal Experimental Design for Binary ... dms/GLM_Design/Technical...

  • Approximate Local D-optimal Experimental Design for Binary Response

    Technical Report RP-SOR-0501

    Hovav A. Dror and David M. Steinberg

    Department of Statistics and Operations Research

    Raymond and Beverly Sackler Faculty of Exact Sciences

    Tel Aviv University

    Ramat Aviv 69978

    Israel

    Email: dms@post.tau.ac.il

    November 2005

    Abstract: A fast and simple method is proposed that produces approximate multivariate local D-optimal

    designs of high e¢ ciency for models with binary response. The method assumes availability of a D-optimal

    design for a parallel normal response linear problem that has the same linear predictor, with an assumption

    of homogenous variance; the change required to transform the standard design into an e¢ cient one for a

    multivariate logit or probit model is to shift any design point whose probability is very low or very high (and

    is therefore non-informative) into the nearest feasible point of moderate probability.

    KEY WORDS: Generalized Linear Models ; Logistic Response; Logit; Probit; Design of Experiments

    1. INTRODUCTION

    Construction of an experimental design for a generalized linear model presents a level of complexity greater than

    that required for a model with a normally distributed error term of constant variance. Finding an optimal design

    for a linear model is already a numerically intensive optimization problem. However, some linear models can be

    characterized and have a known trivial optimal design, and designs for more complicated models can be sought

    through available software such as "gosset" (Hardin and Sloane 1993), the statistical toolbox in MATLAB (The

    Mathworks, inc), JMP or the SAS Optex procedure. Extension from a linear model to a logistic one does not

    retain these qualities. Trivial solutions to linear model design problems are often factorial designs, utilizing the

    corners of the design region; for a logisitic response, some of these corners have probabilities that are close to

    zero or one, so that the responses are almost deterministic, and are therefore non-informative. Furthermore,

    optimal design for binary response depends on the unknown coe¢ cients, and hence two experiments having

    the same model but di¤erent coe¢ cient values will typically have di¤erent optimal designs. Common software

    packages assume all observations have homogenous variance and so do not provide a remedy for generalized

    linear models.

    1

  • This paper proposes a simple method for constructing an approximate local D-optimal design for a multi-

    variate binary response problem with the logit or probit link.

    2. EXPERIMENTAL DESIGN FOR GENERALIZED LINEAR MODELS

    Most work on Experimental Design focuses on linear models with a continuous response. A common assumption

    is that the error term of the model is normally distributed and that its variance is constant over the design

    region. These assumptions are not met (even asymptotically) with the popular logistic model or with the probit

    model for binary response data. Standard procedure for the analysis of these models uses iteratively reweighted

    least squares, which asymptotically leads to weights that enable solving the problem as a simple linear model

    (McCullagh and Nelder 1989). These weights depend on the parameters of the problem and are generally

    di¤erent for two binary response problems of the same structure but di¤erent true (unknown) coe¢ cients.

    Frequently, Experimental Design uses optimality criteria based on Fishers Information Matrix. Of these,

    D-optimality is the most intensively studied. For the normal response setting with y = � + " and � = F�;

    D-optimality maximizes the determinant of the information matrix FTF and so minimizes the volume of the

    condence ellipsoid for the unknown coe¢ cients �. Atkinson & Donev (1992) note that the assumptions of

    normality and constancy of variance enter the optimal design through the information matrix. They indicate

    that instead of FTF; other matrices are appropriate for non-normal distributions, and that once the appropriate

    matrix has been dened the principles and practice of optimal experimental design are similar to continuous

    response problems. This means that the optimal experimental design is driven from the information matrix

    of the parameters FTWF ; the matrix of weights, W , depends on the unknown parameters and on the design.

    This dependence creates a di¢ culty in constructing an optimal experimental design.

    2.1 Local D-Optimal Designs

    Because a D-optimal design depends on the coe¢ cients, its derivation require us to make some assumptions

    on the values of the unknowns we want to estimate. When we use an initial estimate (rather than a Bayesian

    a-priori distribution, for instance) we produce a design that is optimal only locally, for the coe¢ cient values

    that were adopted. If the initial estimate is poor, the designs performance may be far from optimal. D-optimal

    designs that are based on an initial estimate for the unknown coe¢ cient values are designated as local D-optimal

    designs.

    2.2 Local D-Optimal Designs for a logistic model

    Abdelbasit & Plackett (1983) discuss the construction of local D-optimal designs for binary response with one

    explanatory variable. They show that for an unconstrained univariate problem with the model logit(p) =

    �0 + �1x the local D-optimal design places half the points where the estimated probability is 0.824 and the

    other half at the location with probability 0.176. Univariate binary response has been further considered by

    many authors, see for example Atkinson & Donev (1992), Sitter (1992), Hedayat, Yan & Pezzuto (1997), or

    Mathew & Sinha (2001).

    2

  • Although the nal coordinates creating the optimal design are found numerically, the function to maximize

    is derived through analytical procedures. Increasing the number of unknowns complicates both the analytical

    route and the numerical process. This is probably the reason not much has been published on multivariate

    design problems. Chipman and Welch (1996) compare D-optimal designs for Generalized Linear Models over

    a constrained region to linear regression D-optimal designs, including multifactor problems. Their comparison

    was based on computer generated D-optimal designs, using weighted linear models. This way they show the

    general phenomenon that points in the linear model optimal design may "move in" from the edges of the design

    region for the logistic model. Sitter and Torsney (1995) analyze D- and c- optimal designs for binary response

    experiments with two design variables and various link functions; Smith and Ridout (2003) obtain optimal

    Bayesian designs for bioassays involving two parallel dose response relationships where the main interest is in

    doses of one substance; Atkinson, Demetrio & Zocchi (1995) describe a dose response experiment with two

    variables - one continuous and one indicator variable, considering the special case in which the value of the

    indicator variable is unknown during the process of the experiment, but is available for the posterior analysis.

    Atkinson (2005) gives examples of rst order D-optimal designs with two variables and discusses usage of

    standard factorial design as an approximation, similar to the method proposed here; but - assuming one needs

    to search for the approximate design over irregular design regions Atkinson concludes that "Searching over the

    original design space to nd optimal design for the generalized linear model would both be easier and lead to a

    more e¢ cient design than would trying to nd such a regression approximation".

    It is noted that the treatment described so far was limited to a rst order model with two variables. Woods,

    Lewis, Eccleston and Russel (2005) o¤er a method of creating multivariate compromise designs that are robust

    to uncertainty in aspects of the model, including the uncertainty in the coe¢ cient values, di¤erent choices of

    a link function and various models, including interactions. As can be expected for such complex designs, their

    method requires intensive computation.

    We will now suggest a simple method of constructing approximate local D-optimal designs. This method is

    not limited to a small number of covariates, or to rst-order designs without interactions. It can produce ap-

    proximate designs almost instantaneously, as it does not require intensive computation, and it is straightforward

    to implement. There are two cornerstones for this algorithm; rst, one needs an optimal design for an analogous

    linear model (that is, the optimal design for a problem with the same formulation and a constant variance).

    Second, we use computation of the probability at design points according to the specic initial coe¢ cient values.

    3. REGION OF LINEARITY

    We are considering a binary response with the logistic link. That is pi = e Fi�

    1+eFi� ; Fi being the i-th row of the

    regression matrix F: We attempt to nd an approximation for a local optimal design. As explained, the term

    "local" implies that the design relates to particular values of the coe¢ cients.

    Note that while both � and feasible points for the design space may be of any di