Summer School on Statistical Experimental Design

areaestadistica.uclm.es/s3ed/Documents/GOOS.pdf

Almagro, Spain 7th – 10th June 2011

Lecture: Peter Goos

Copyright © 2008, SAS Institute Inc. All rights reserved.

Design of Industrial Experiments

Peter Goos

Bradley Jones

Who am I ?

• Professor in Statistics
  - Faculty of Applied Economics, University of Antwerp
  - Erasmus School of Economics, Erasmus University Rotterdam

• early work on optimal design of industrial experiments

• side steps to marketing, health economics, transportation, . . .

Other short course

• annually at City Campus of University of Antwerp
  - 19, 20 and 21 September 2011
  - http://www.ua.ac.be/peter.goos

• starts from scratch and builds up to Bayesian

• optimal experimental design algorithms, optimality criteria, software, linear and nonlinear models, . . .

• intuitive approach

• accessible for broad audience

• cheap

Today’s course is based on …


Part 1

Blocking Designed Experiments

Goals

1. Introduce the concept of blocking

2. Illustrate the concept with an example

Experiments Run in Blocks

1. Many processes have sources of variability that are not controllable factors.

2. Examples are day-to-day changes in set-up, changing lots of raw material, etc.

3. It is statistically more efficient to group the experimental runs so that the runs within each group are more homogeneous than runs in different groups.

4. The groups are called blocks.

5. The grouping variable is called a blocking factor.

Model for Fixed Blocks

Factor Effects Block Effects

Equivalent matrix form is:

The columns of Z are dummy columns treating the blocks as levels of a categorical factor. Z has one fewer column than there are blocks.
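As a sketch in assumed notation (the symbols $X$, $\beta$, $\gamma$, $\varepsilon$ and the normality assumption are illustrative and not taken from the slide), a fixed-block model of the kind described here is commonly written as:

```latex
Y = \underbrace{X\beta}_{\text{factor effects}} + \underbrace{Z\gamma}_{\text{block effects}} + \varepsilon,
\qquad \varepsilon \sim N(0, \sigma_\varepsilon^2 I_n)
```

Under this reading, $X$ holds the intercept and the factor-effect columns, and with $b$ blocks $Z$ has $b-1$ dummy columns because the intercept already absorbs one block level.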

Least-Squares Estimates and Variances


A Screening Experiment with Fixed Blocks

Scenario

1. There are 6 factors.

2. The model contains all main effects and two-factor interactions.

3. There is day-to-day variation in the process.

4. Only 4 experimental runs per day are possible.

A Screening Experiment with Fixed Blocks

Factors

1. Boundedness.

2. Oil Red O.

3. Oxybenzone.

4. Beta Carotene.

5. Sulisobenzone.

6. Deoxybenzone.

Orthogonally Blocked Design

Problem

Three of the two-factor interactions are confounded with the block effects!

D-optimal Fixed Block Design

Minimize determinant of covariance matrix

Maximize determinant of information matrix


Approach

1. Specify factors

2. Specify number of blocks.

3. Specify number of runs per block.

4. Specify model.

5. Compute D-optimal design.
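The five steps above can be sketched as a small coordinate-exchange search. This is only an illustration under stated assumptions: two-level factors coded -1/+1, a model with intercept, main effects and two-factor interactions, and a toy problem size (3 factors, 4 blocks of 4 runs) chosen for speed rather than the 6-factor scenario. The function names are hypothetical, and this is not necessarily the algorithm used by any particular software package.

```python
import itertools
import numpy as np

def model_matrix(design, block_ids, n_blocks):
    """Intercept, main effects, two-factor interactions and block dummies."""
    n, k = design.shape
    cols = [np.ones(n)]
    cols += [design[:, j] for j in range(k)]
    cols += [design[:, a] * design[:, b]
             for a, b in itertools.combinations(range(k), 2)]
    # One dummy per block except the last, which the intercept absorbs.
    cols += [(block_ids == b).astype(float) for b in range(n_blocks - 1)]
    return np.column_stack(cols)

def d_criterion(design, block_ids, n_blocks):
    """log-determinant of the information matrix X'X (-inf if singular)."""
    X = model_matrix(design, block_ids, n_blocks)
    sign, logdet = np.linalg.slogdet(X.T @ X)
    return logdet if sign > 0 else -np.inf

def coordinate_exchange(k, n_blocks, block_size, n_starts=20, seed=0):
    rng = np.random.default_rng(seed)
    block_ids = np.repeat(np.arange(n_blocks), block_size)
    best_design, best_value = None, -np.inf
    for _ in range(n_starts):
        design = rng.choice([-1.0, 1.0], size=(n_blocks * block_size, k))
        value = d_criterion(design, block_ids, n_blocks)
        improved = True
        while improved:
            improved = False
            for i in range(design.shape[0]):
                for j in range(k):
                    design[i, j] *= -1          # try the other level
                    new_value = d_criterion(design, block_ids, n_blocks)
                    if new_value > value + 1e-10:
                        value, improved = new_value, True
                    else:
                        design[i, j] *= -1      # revert the change
        if value > best_value:
            best_design, best_value = design.copy(), value
    return best_design, block_ids, best_value

design, blocks, value = coordinate_exchange(k=3, n_blocks=4, block_size=4)
print(np.column_stack([blocks, design]))
print("log |X'X| =", value)
```

Maximizing the log-determinant gives the same design as maximizing the determinant itself; the logarithm is used only for numerical convenience.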

D-optimal Fixed Block Design

No Problem

All the model terms are estimable as well as the block effects.

Variance Inflation Due to Non-orthogonal Blocking

Covariance matrix in the presence of blocks

Covariance matrix in the absence of blocks

Variance Inflation Due to Non-orthogonal Blocking


Design and Data Parameter Estimates

Part 1 – Conclusions

1. Blocking is one of Fisher’s four fundamental principles of design.

2. Including blocking variables in a design reduces the amount of random error, which makes it easier to detect the effects of the factors of interest.

3. Block effects do not have to be orthogonal to the factor effects to be useful.

Part 2

Designed Experiments with Random Blocks

Goals

1. Introduce the concept of random block effects.

2. Develop a model for the design of blocked experiments with random block effects.

3. Compare random to fixed block effects.

4. Provide an example of an experiment with random block effects.


Random Blocks

1. We call a blocking factor random if we consider the blocks used in the experiment to be representative of a population of blocks.

2. By contrast, fixed blocks are viewed as being the only blocks of interest.

3. Inference for random blocks extends to the other blocks in the represented population.

4. Inference for fixed blocks is limited to the observed blocks only.

Model with Random Block Effects

Estimator for β
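As a sketch in assumed notation (the symbols and the normality assumptions below are illustrative, since the slide's own formulas are not reproduced here), a random-block model and the corresponding generalized least squares estimator are commonly written as:

```latex
\begin{aligned}
Y &= X\beta + Z\gamma + \varepsilon,
  \qquad \gamma \sim N(0, \sigma_\gamma^2 I_b),
  \quad \varepsilon \sim N(0, \sigma_\varepsilon^2 I_n) \\
V &= \operatorname{var}(Y) = \sigma_\varepsilon^2 I_n + \sigma_\gamma^2 Z Z^T \\
\hat{\beta} &= \left(X^T V^{-1} X\right)^{-1} X^T V^{-1} Y
\end{aligned}
```

In this form $Z$ has one indicator column per block, and the design depends on the variance components only through the ratio $\sigma_\gamma^2 / \sigma_\varepsilon^2$, which is why the approach asks for that ratio.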

RSM Design with Random Blocks

Scenario

1. Blocks of 4 runs in a day.

2. Seven days allocated for the experiment.

3. Need to fit full quadratic model in 3 factors.

Factors


Standard Orthogonally Blocked Design versus D-optimal Random Block Design



Approach

1. Specify factors.

2. Specify number of blocks.

3. Specify number of runs per block.

4. Specify model.

5. Specify ratio of variance components.

6. Compute D-optimal design.

7. Study sensitivity to ratio of variance components.

D-optimal Design and Response Data


Model Coefficients and Inference Results

Model Coefficients – Simplified Model

Recommended Factor Settings

Setting the flow rate to 40.5, the moisture content to 22.3 and the screw speed to 300.4 yields the target expansion indices.

Part 2 – Conclusions

1. Blocked designs with random blocks allow for wider inference.

2. Random blocks can save resources because you do not have to estimate a coefficient for each block.


Part 3

Designed Split-plot Experiments

Goals

1. Introduce the idea behind split-plot experiments.

2. Develop a model for the design of split-plot experiments.

3. Compare blocked to split-plot experiments.

4. Provide an example of a split-plot experiment.

Split-plot Graphic Definition

Sub-Plots

Split-plot Definition

A split-plot experiment is a blocked experiment, where the levels of some of the factors are constant within each block.

Model for Split-plot Experiments

Estimator for β


Split-plot versus Random Blocks

1. Split-plot designs are a special case of random block designs.

2. The difference is that in split-plot designs, certain factors (the “whole plot” factors) do not change within the blocks but only between blocks.

3. In ordinary random block designs, all the factors may change within each block. In split-plot designs, only “sub-plot” factors change within blocks.

D-optimal Split-Plot Design



I-optimal Split-Plot Design

Minimize average prediction variance

Approach

1. Specify whole plot factors.

2. Specify sub-plot factors

3. Specify number of whole plots.

4. Specify number of runs per whole plot.

5. Specify model.

6. Specify ratio of variance components.

7. Compute optimal split-plot design.

8. Study sensitivity to ratio of variance components.


Split-Plot Example

Scenario

1. Four factors.

2. Two are hard-to-change and two are easy-to-change.

3. Hard-to-change factor design can only have 10 runs.

4. Budget of 50 runs for the full design.

Factor Table

Ad hoc Design #1 and Ad hoc Design #2 (figures; plot axes: Front Ride Height, Rear Ride Height, Yaw Angle, Grille Coverage)


Comparison of Coefficient Variances

Left column is for ad hoc design #2, right column is for I-optimal split-plot design.

OLS vs GLS Data Analysis

OLS Analysis GLS Analysis


Part 3 – Summary

1. Split-plot designs are common in industry.

2. They are not commonly recognized as being split-plot designs.

3. As a result, these designs are mistakenly analyzed using OLS.

4. Explicitly taking randomization restrictions into account makes the design process more economical, often more statistically efficient and more likely to produce valid analytical results.


Optimal design of experiments

Session 7: Nonlinear models

Peter Goos

Binary data with logistic link

• example:
  - y = 0 or 1 (adhesion or no adhesion)
  - explanatory variable x = time of plasma etching
  - n = 2 observations

• logistic regression model:

$$P(Y_i = 1) = \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0+\beta_1 x_i}}, \qquad P(Y_i = 0) = \frac{1}{1+e^{\beta_0+\beta_1 x_i}}$$

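Likelihood

• likelihood function for observation i

$$L_i = P(Y_i = y_i) = \left(\frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0+\beta_1 x_i}}\right)^{y_i} \left(\frac{1}{1+e^{\beta_0+\beta_1 x_i}}\right)^{1-y_i} = \frac{e^{y_i(\beta_0+\beta_1 x_i)}}{1+e^{\beta_0+\beta_1 x_i}}$$

• log-likelihood for observation i

$$\ln L_i = \ln e^{y_i(\beta_0+\beta_1 x_i)} - \ln\left(1+e^{\beta_0+\beta_1 x_i}\right) = y_i(\beta_0+\beta_1 x_i) - \ln\left(1+e^{\beta_0+\beta_1 x_i}\right)$$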
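A minimal sketch that checks the compact log-likelihood against the direct definition ln P(Y_i = y_i). The parameter values and the x value are illustrative choices, not taken from the slides, and the function names are hypothetical.

```python
import math

def prob_success(x, beta0, beta1):
    """P(Y = 1) under the logistic model."""
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def log_likelihood_i(y, x, beta0, beta1):
    """ln L_i = y (beta0 + beta1 x) - ln(1 + exp(beta0 + beta1 x))."""
    eta = beta0 + beta1 * x
    return y * eta - math.log(1.0 + math.exp(eta))

# Illustrative values only (assumptions, not from the slides)
beta0, beta1, x = -2.0, 2.0, 0.5
p = prob_success(x, beta0, beta1)
direct = {1: math.log(p), 0: math.log(1.0 - p)}
compact = {y: log_likelihood_i(y, x, beta0, beta1) for y in (0, 1)}
print(direct)
print(compact)
```

Both dictionaries should agree up to rounding, which confirms that the single expression covers the y = 1 and y = 0 cases.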

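Information matrix

• general definition for observation i:

$$M_i = -E\left(\frac{\partial^2 \ln L_i}{\partial\theta\,\partial\theta^T}\right) = E\left(\left(\frac{\partial \ln L_i}{\partial\theta}\right)\left(\frac{\partial \ln L_i}{\partial\theta}\right)^T\right)$$

  with θ the vector of model parameters

• total information matrix

$$M = \sum_{i=1}^{n} M_i$$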

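Binary logistic regression

• information matrix for observation i:

$$M_i = -E\begin{pmatrix} \dfrac{\partial^2 \ln L_i}{\partial\beta_0^2} & \dfrac{\partial^2 \ln L_i}{\partial\beta_0\,\partial\beta_1} \\[2ex] \dfrac{\partial^2 \ln L_i}{\partial\beta_1\,\partial\beta_0} & \dfrac{\partial^2 \ln L_i}{\partial\beta_1^2} \end{pmatrix}$$

• log-likelihood:

$$\ln L_i = y_i(\beta_0+\beta_1 x_i) - \ln\left(1+e^{\beta_0+\beta_1 x_i}\right)$$

• first derivatives:

$$\frac{\partial \ln L_i}{\partial\beta_0} = y_i - \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0+\beta_1 x_i}}$$

$$\frac{\partial \ln L_i}{\partial\beta_1} = y_i x_i - \frac{e^{\beta_0+\beta_1 x_i}\, x_i}{1+e^{\beta_0+\beta_1 x_i}}$$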

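Binary logistic regression

• second derivatives:

$$\frac{\partial^2 \ln L_i}{\partial\beta_0^2} = -\frac{\left(1+e^{\beta_0+\beta_1 x_i}\right)e^{\beta_0+\beta_1 x_i} - e^{\beta_0+\beta_1 x_i}\,e^{\beta_0+\beta_1 x_i}}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2} = -\frac{e^{\beta_0+\beta_1 x_i}}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2}$$

$$\frac{\partial^2 \ln L_i}{\partial\beta_0\,\partial\beta_1} = -\frac{\left(1+e^{\beta_0+\beta_1 x_i}\right)e^{\beta_0+\beta_1 x_i} x_i - e^{\beta_0+\beta_1 x_i}\,e^{\beta_0+\beta_1 x_i} x_i}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2} = -\frac{e^{\beta_0+\beta_1 x_i}\, x_i}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2} = \frac{\partial^2 \ln L_i}{\partial\beta_1\,\partial\beta_0}$$

$$\frac{\partial^2 \ln L_i}{\partial\beta_1^2} = -\frac{\left(1+e^{\beta_0+\beta_1 x_i}\right)e^{\beta_0+\beta_1 x_i} x_i^2 - e^{\beta_0+\beta_1 x_i} x_i\,e^{\beta_0+\beta_1 x_i} x_i}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2} = -\frac{e^{\beta_0+\beta_1 x_i}\, x_i^2}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2}$$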

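Information matrix

• observation i

$$M_i = -E\begin{pmatrix} -\dfrac{e^{\beta_0+\beta_1 x_i}}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2} & -\dfrac{x_i\,e^{\beta_0+\beta_1 x_i}}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2} \\[2ex] -\dfrac{x_i\,e^{\beta_0+\beta_1 x_i}}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2} & -\dfrac{x_i^2\,e^{\beta_0+\beta_1 x_i}}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2} \end{pmatrix}$$

• total information matrix

$$M = \sum_{i=1}^{n} M_i$$

• the information matrix (and thus the amount of information) on the unknown parameters depends on the unknown parameters

• to maximize the information content of your experiment, you need a guess for β0 and β1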

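Information matrix

• observation i: the entries do not involve y_i, so taking the expectation leaves them unchanged and only the signs flip

$$M_i = \begin{pmatrix} \dfrac{e^{\beta_0+\beta_1 x_i}}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2} & \dfrac{x_i\,e^{\beta_0+\beta_1 x_i}}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2} \\[2ex] \dfrac{x_i\,e^{\beta_0+\beta_1 x_i}}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2} & \dfrac{x_i^2\,e^{\beta_0+\beta_1 x_i}}{\left(1+e^{\beta_0+\beta_1 x_i}\right)^2} \end{pmatrix}$$

• total information matrix

$$M = \sum_{i=1}^{n} M_i$$

• the information matrix (and thus the amount of information) on the unknown parameters depends on the unknown parameters

• to maximize the information content of your experiment, you need a guess for β0 and β1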
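A minimal sketch of how the total information matrix can be evaluated for a given design and a given guess for β0 and β1. The function names are hypothetical; the design [0.228, 1.772] and the parameter sets are the ones that appear later in these slides.

```python
import math

def weight(x, beta0, beta1):
    """e^eta / (1 + e^eta)^2, the common factor in every entry of M_i."""
    e = math.exp(beta0 + beta1 * x)
    return e / (1.0 + e) ** 2

def information_matrix(design, beta0, beta1):
    """Sum of the per-observation 2x2 matrices M_i over the design points."""
    m00 = m01 = m11 = 0.0
    for x in design:
        w = weight(x, beta0, beta1)
        m00 += w
        m01 += w * x
        m11 += w * x * x
    return [[m00, m01], [m01, m11]]

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

design = [0.228, 1.772]
det_set1 = det2(information_matrix(design, -2.0, 2.0))  # guess beta1 = 2
det_set2 = det2(information_matrix(design, -2.0, 3.0))  # guess beta1 = 3
print(round(det_set1, 4), round(det_set2, 4))
```

The same design gives a different determinant under each parameter guess, which illustrates why the design cannot be chosen without some prior idea about β0 and β1.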

Locally optimal design

• binary.xls

• 2 examples are given:
  - parameter set 1: β0 = −2 and β1 = +2
  - parameter set 2: β0 = −2 and β1 = +3

• set 1 leads to x1 = 0.228, x2 = 1.772

• set 2 leads to x1 = 0.152, x2 = 1.181

• these designs are called locally optimal (optimal for just one set of β’s)
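The locally optimal designs above can be approximated with a plain grid search, sketched below. The design region [0, 3] and the grid step are assumptions made for illustration (the slides rely on a spreadsheet instead), and the function names are hypothetical. For a two-point design the determinant simplifies to w(x1) w(x2) (x2 − x1)², with w the logistic weight e^η / (1 + e^η)².

```python
import math

def weight(x, beta0, beta1):
    e = math.exp(beta0 + beta1 * x)
    return e / (1.0 + e) ** 2

def locally_d_optimal(beta0, beta1, lo=0.0, hi=3.0, step=0.004):
    """Exhaustive search over all grid pairs x1 < x2 (assumed region [lo, hi])."""
    n = int(round((hi - lo) / step)) + 1
    grid = [lo + i * step for i in range(n)]
    w = [weight(x, beta0, beta1) for x in grid]
    best_det, best_x1, best_x2 = -1.0, None, None
    for i in range(n):
        for j in range(i + 1, n):
            det = w[i] * w[j] * (grid[j] - grid[i]) ** 2
            if det > best_det:
                best_det, best_x1, best_x2 = det, grid[i], grid[j]
    return best_x1, best_x2, best_det

set1 = locally_d_optimal(-2.0, 2.0)
set2 = locally_d_optimal(-2.0, 3.0)
print("set 1:", set1)
print("set 2:", set2)
```

Up to the grid resolution, the search should land close to the two designs listed on the slide.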

Bayesian approach

• problem with locally optimal designs: they might not be very good for other β’s

• a Bayesian (D-)optimal design is a design that performs well on average

• how? specify some density/distribution for each β_i:

$$\beta_i \sim \text{NORMAL}(a, b^2)$$

  - a: I think β_i is around a
  - b²: I’m not that sure, I might be wrong (small b: I’m pretty sure ↔ large b: unsure)

Simple example

• β0 = −2 and β1 = 2 (50% chance) or 3 (50% chance), instead of a normal prior

• construct the information matrix for every set of β’s

• calculate |M| for each set of β’s: |M|_1, |M|_2

• what you have to maximize is the Bayesian D-criterion

$$0.5\,|M|_1 + 0.5\,|M|_2$$

  where the first 0.5 is the probability of the first set of β’s and the second 0.5 is the probability of the second set of β’s

• example: Bayesian binary.xls

  Bayesian D-optimal design: x1 = 0.2, x2 = 1.573
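A minimal sketch of the Bayesian D-criterion for this two-point prior, evaluated at the Bayesian design and at the two locally optimal designs from the earlier slide. The function names are hypothetical; only the numbers come from the slides.

```python
import math

def det_information(design, beta0, beta1):
    """Determinant of the 2x2 logistic information matrix for a design."""
    m00 = m01 = m11 = 0.0
    for x in design:
        e = math.exp(beta0 + beta1 * x)
        w = e / (1.0 + e) ** 2
        m00 += w
        m01 += w * x
        m11 += w * x * x
    return m00 * m11 - m01 ** 2

# Prior: beta0 = -2, beta1 = 2 or 3 with probability 0.5 each
prior = [(0.5, -2.0, 2.0), (0.5, -2.0, 3.0)]

def bayesian_d_criterion(design):
    """Prior-weighted average of the determinants."""
    return sum(p * det_information(design, b0, b1) for p, b0, b1 in prior)

crit_bayes = bayesian_d_criterion([0.2, 1.573])
crit_local1 = bayesian_d_criterion([0.228, 1.772])
crit_local2 = bayesian_d_criterion([0.152, 1.181])
print(crit_bayes, crit_local1, crit_local2)
```

The Bayesian design should score higher on the averaged criterion than either locally optimal design, even though each local design is better for its own parameter set.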

Implementation normal prior distribution

• what if β_i ∼ NORMAL?

• generate “a lot” of β_i’s from the normal distribution (R = number of draws)

• maximize the Bayesian D-criterion

$$\sum_{j=1}^{R} \frac{1}{R}\,|M|_j$$

  where |M|_j is the determinant for the jth set of β’s you randomly drew from the normal distributions for the β_i’s

• this is done to approximate

$$\int_{\mathbb{R}^k} |M(\beta)|\,\pi(\beta)\,d\beta$$

  where π(β) is the joint probability distribution of the β_i’s

Implementation normal prior distribution

• usually, a Monte Carlo sample is drawn from the prior distribution

• for this to work well, you need to draw a lot of random samples

• this is computationally demanding

• solution: do not draw samples randomly but systematically
  - Halton sequences
  - Sobol sequences
  - Gaussian quadrature

• in that case, you need far fewer draws
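A minimal sketch of systematic draws using a Halton sequence, one of the options listed above. The prior means and standard deviations (β0 ∼ N(−2, 0.5²), β1 ∼ N(2.5, 0.5²)), the number of draws and the function names are assumptions for illustration only. Each Halton coordinate in (0, 1) is mapped to a normal draw through the inverse normal CDF.

```python
import math
from statistics import NormalDist

def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in a given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result

def halton_normal_draws(n_draws, means, sds, bases=(2, 3)):
    """Halton points in (0, 1)^2 pushed through the inverse normal CDF."""
    draws = []
    for j in range(1, n_draws + 1):
        point = tuple(
            NormalDist(mean, sd).inv_cdf(radical_inverse(j, base))
            for mean, sd, base in zip(means, sds, bases)
        )
        draws.append(point)
    return draws

def det_information(design, beta0, beta1):
    m00 = m01 = m11 = 0.0
    for x in design:
        e = math.exp(beta0 + beta1 * x)
        w = e / (1.0 + e) ** 2
        m00 += w
        m01 += w * x
        m11 += w * x * x
    return m00 * m11 - m01 ** 2

def bayesian_d_criterion(design, draws):
    """Average determinant over the prior draws."""
    return sum(det_information(design, b0, b1) for b0, b1 in draws) / len(draws)

# Hypothetical prior, chosen only to make the example concrete
draws = halton_normal_draws(128, means=(-2.0, 2.5), sds=(0.5, 0.5))
value = bayesian_d_criterion([0.2, 1.573], draws)
print(value)
```

Because the Halton points fill the unit square evenly, the average settles with fewer draws than a purely random sample would typically need.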

More on Bayesian optimal design

• non-Bayesian design: maximizing |M| and log |M| is the same thing

• Bayesian design: maximizing

$$\sum_{j=1}^{R} \frac{1}{R}\,|M|_j \qquad\text{and}\qquad \sum_{j=1}^{R} \frac{1}{R}\,\log|M|_j$$

  is NOT the same thing!

• see Bayesian binary (version 2).xls

  Bayesian D-optimal design: x1 = 0.179, x2 = 1.419
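A minimal sketch comparing the two criteria on the two Bayesian designs reported in these slides, using the two-point prior from the simple example (β0 = −2, β1 = 2 or 3 with probability 0.5 each). The function names are hypothetical.

```python
import math

def det_information(design, beta0, beta1):
    m00 = m01 = m11 = 0.0
    for x in design:
        e = math.exp(beta0 + beta1 * x)
        w = e / (1.0 + e) ** 2
        m00 += w
        m01 += w * x
        m11 += w * x * x
    return m00 * m11 - m01 ** 2

prior = [(0.5, -2.0, 2.0), (0.5, -2.0, 3.0)]

def average_det(design):
    """Prior-weighted average of |M|."""
    return sum(p * det_information(design, b0, b1) for p, b0, b1 in prior)

def average_log_det(design):
    """Prior-weighted average of log |M|."""
    return sum(p * math.log(det_information(design, b0, b1))
               for p, b0, b1 in prior)

design_a = [0.2, 1.573]    # reported optimum for the average of |M|
design_b = [0.179, 1.419]  # reported optimum for the average of log |M|
print(average_det(design_a), average_det(design_b))
print(average_log_det(design_a), average_log_det(design_b))
```

Each design should win on its own criterion and lose on the other, which is the sense in which the two criteria are not the same thing.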

Maximin designs

• designs that have the best possible worst-case performance

• how?
  - for each set of β’s, there is a locally optimal design, with determinant |M|*_j for parameter set j
  - any other design will have a determinant smaller than |M|*_j for that set
  - how bad? measure the efficiency of the design for parameter set j:

$$\left(\frac{|M(\text{set } j)|}{|M|^*_j}\right)^{1/p}$$

    with p the number of model parameters
  - we compute this quantity for every set of β’s
  - we focus on the smallest / worst value and maximize that value

Our example

        β                     (locally) opt. design      opt. determ. |M|*_j
set 1   β0 = −2, β1 = +2      x1 = 0.228, x2 = 1.772     |M|*_1 = 0.0501
set 2   β0 = −2, β1 = +3      x1 = 0.152, x2 = 1.181     |M|*_2 = 0.0223

find the design with information matrix M that maximizes

$$\min\left\{ \left(\frac{|M(-2,2)|}{|M|^*_1}\right)^{1/2},\ \left(\frac{|M(-2,3)|}{|M|^*_2}\right)^{1/2} \right\}$$

Our example

• maximin binary.xls

• maximin design: x1 = 0.18, x2 = 1.436

• this design is 94.4% efficient for both sets of β’s

• this means that

$$\left(\frac{|M(-2,2)|}{|M|^*_1}\right)^{1/2} = \left(\frac{|M(-2,3)|}{|M|^*_2}\right)^{1/2} = 0.944$$
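A minimal sketch that recomputes the two efficiencies of the maximin design. The locally optimal determinants are recomputed from the rounded locally optimal designs given in the slides, so the results match the reported 0.944 only up to rounding. The function names are hypothetical.

```python
import math

def det_information(design, beta0, beta1):
    m00 = m01 = m11 = 0.0
    for x in design:
        e = math.exp(beta0 + beta1 * x)
        w = e / (1.0 + e) ** 2
        m00 += w
        m01 += w * x
        m11 += w * x * x
    return m00 * m11 - m01 ** 2

def d_efficiency(design, local_design, beta0, beta1, p=2):
    """(|M(design)| / |M*|)^(1/p) with p the number of parameters."""
    ratio = (det_information(design, beta0, beta1)
             / det_information(local_design, beta0, beta1))
    return ratio ** (1.0 / p)

maximin = [0.18, 1.436]
eff_set1 = d_efficiency(maximin, [0.228, 1.772], -2.0, 2.0)
eff_set2 = d_efficiency(maximin, [0.152, 1.181], -2.0, 3.0)
print(round(eff_set1, 3), round(eff_set2, 3), min(eff_set1, eff_set2))
```

At the maximin design the two efficiencies are (nearly) equal: raising one of them further would lower the other, so the worst case cannot be improved.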

Sequential optimal design

• idea
  1. start with a small design and collect some data
  2. update your knowledge of the model’s parameters
  3. create a new design that uses the improved knowledge
  4. repeat steps 2 and 3 as often as possible/desirable

• avoids constructing a large design based on poor prior knowledge

• this approach usually performs very well

• not always feasible

Other considerations

• logistic regression models belong to the class of generalized linear models

• maximum likelihood estimation

• for some models, maximum likelihood theory cannot be used to derive an information matrix

• this is what the next slides are about

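Nonlinear regression models

• general model (just one θ)

$$Y = \eta(x,\theta) + \epsilon, \qquad E(Y) = \eta(x,\theta)$$

• Taylor series expansion

$$E(Y) = \eta(x,\theta) = \eta(x,\theta_0) + (\theta-\theta_0)\left.\frac{\partial\eta(x,\theta)}{\partial\theta}\right|_{\theta=\theta_0} + \dots$$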

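Nonlinear regression models

• rewrite as

$$\underbrace{E(Y) - \eta(x,\theta_0)}_{\text{some response}} = \underbrace{(\theta-\theta_0)}_{\text{parameter}}\ \underbrace{\left.\frac{\partial\eta(x,\theta)}{\partial\theta}\right|_{\theta=\theta_0}}_{\text{function of exp. var.}}$$

$$Y^* = \beta f(x)$$

• nonlinear model with several θ’s

$$Y^* = \beta^T f(x)$$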

Information matrix

• information matrix for such a model

$$M = \sum_{i=1}^{n} f(x_i)\, f^T(x_i)$$

• here

$$f(x) = \left.\frac{\partial\eta(x,\theta)}{\partial\theta}\right|_{\theta=\theta_0}$$

• so the information matrix depends on the unknown parameters

• thus, optimal designs depend on the unknown parameters

An example: a chemical reaction

$$A \xrightarrow{\theta_1} B \xrightarrow{\theta_2} C$$

$$Y_i = \frac{\theta_1}{\theta_1-\theta_2}\left(e^{-\theta_2 t_i} - e^{-\theta_1 t_i}\right)$$

• Y_i = concentration of substance B

• t_i = time = explanatory variable

• θ1 > θ2

• e.g. O2 → H2O2 → H2O

• suppose n = 4, so you have to choose 4 time points t1, t2, t3, t4 at which to measure the presence of substance B

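Model matrix X

• dimension 4×2

• what should be in the columns? ∂η/∂θ1 and ∂η/∂θ2, here: ∂Y/∂θ1 and ∂Y/∂θ2

• first column:

$$\frac{\partial Y}{\partial\theta_1} = \frac{1}{(\theta_1-\theta_2)^2}\left(\left(\theta_2 + \theta_1(\theta_1-\theta_2)t_i\right)e^{-\theta_1 t_i} - \theta_2 e^{-\theta_2 t_i}\right)$$

• second column:

$$\frac{\partial Y}{\partial\theta_2} = \frac{1}{(\theta_1-\theta_2)^2}\left(\left(\theta_1 - \theta_1(\theta_1-\theta_2)t_i\right)e^{-\theta_2 t_i} - \theta_1 e^{-\theta_1 t_i}\right)$$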
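A minimal sketch that checks the two columns of the model matrix against central finite differences of the response function. Differentiating Y with respect to θ2 by the product rule gives a minus sign in front of the θ1(θ1 − θ2)t term, and the numerical check below is consistent with that. The test times, the step size and the function names are illustrative assumptions.

```python
import math

def eta(t, th1, th2):
    """Expected concentration of B at time t."""
    return th1 / (th1 - th2) * (math.exp(-th2 * t) - math.exp(-th1 * t))

def d_eta_d_theta1(t, th1, th2):
    d = th1 - th2
    return ((th2 + th1 * d * t) * math.exp(-th1 * t)
            - th2 * math.exp(-th2 * t)) / d ** 2

def d_eta_d_theta2(t, th1, th2):
    d = th1 - th2
    return ((th1 - th1 * d * t) * math.exp(-th2 * t)
            - th1 * math.exp(-th1 * t)) / d ** 2

def central_difference(func, x, h=1e-6):
    return (func(x + h) - func(x - h)) / (2.0 * h)

th1, th2 = 0.7, 0.2
max_error = 0.0
for t in (0.5, 1.0, 2.0, 5.0, 10.0):  # illustrative time points
    num1 = central_difference(lambda a: eta(t, a, th2), th1)
    num2 = central_difference(lambda b: eta(t, th1, b), th2)
    max_error = max(max_error,
                    abs(num1 - d_eta_d_theta1(t, th1, th2)),
                    abs(num2 - d_eta_d_theta2(t, th1, th2)))
print("largest discrepancy:", max_error)
```

A discrepancy near the rounding level of the finite-difference approximation indicates that the analytic columns are consistent with the model.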

Locally optimal design

• you need some idea about θ1 and θ2 before you can start

• e.g. θ1 = 0.7, θ2 = 0.2, so

$$\frac{\partial Y}{\partial\theta_1} = (0.8 + 1.4\,t_i)\,e^{-0.7 t_i} - 0.8\,e^{-0.2 t_i}$$

$$\frac{\partial Y}{\partial\theta_2} = (2.8 - 1.4\,t_i)\,e^{-0.2 t_i} - 2.8\,e^{-0.7 t_i}$$

• see nonlinear.xls
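A minimal sketch of a locally D-optimal search for the n = 4 time points with the guess θ1 = 0.7, θ2 = 0.2. It uses a simple coordinate-wise grid search, which is an assumption made here for illustration (the slides use a spreadsheet); the time grid from 0.05 to 20, the starting design and the function names are also illustrative choices.

```python
import math

THETA1, THETA2 = 0.7, 0.2  # prior guess from the slide

def gradient(t, th1=THETA1, th2=THETA2):
    """Row of the model matrix: (dY/dtheta1, dY/dtheta2) at time t."""
    d = th1 - th2
    e1, e2 = math.exp(-th1 * t), math.exp(-th2 * t)
    f1 = ((th2 + th1 * d * t) * e1 - th2 * e2) / d ** 2
    f2 = ((th1 - th1 * d * t) * e2 - th1 * e1) / d ** 2
    return f1, f2

def det_information(times):
    """|M| with M the sum of f(t_i) f(t_i)^T over the time points."""
    a = b = c = 0.0
    for t in times:
        f1, f2 = gradient(t)
        a += f1 * f1
        b += f1 * f2
        c += f2 * f2
    return a * c - b * b

def coordinate_search(start, grid):
    """Replace one time point at a time by the best grid value until no gain."""
    times = list(start)
    best = det_information(times)
    improved = True
    while improved:
        improved = False
        for i in range(len(times)):
            for t in grid:
                old = times[i]
                times[i] = t
                value = det_information(times)
                if value > best + 1e-12:
                    best, improved = value, True
                else:
                    times[i] = old
    return sorted(times), best

start = [2.0, 4.0, 6.0, 8.0]                  # arbitrary starting design
grid = [0.05 * i for i in range(1, 401)]      # assumed region: 0.05 to 20
start_det = det_information(start)
design, best_det = coordinate_search(start, grid)
print("design:", design)
print("|M| start:", start_det, "|M| found:", best_det)
```

A coordinate-wise search of this kind can stop at a local optimum, so in practice it would be restarted from several starting designs.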
