3. Statistical considerations

3.1 Outline
The statistical methods used to analyse results of regulatory ecotoxicology studies must be consistent with regulatory frameworks; they must also be statistically robust and maximize efficiency in terms of animal use, time and costs. National and regional risk assessment schemes have often been developed to balance these factors in a variety of ways. For example, in Europe the environmental hazard assessments of industrial chemicals or biocides focus on the calculation of predicted no effect concentration (PNEC) values. These are typically based on either acute effects concentration (ECx) type studies or chronic no observed effect concentration (NOEC) type studies, using safety factors as appropriate (EU 2003). For pesticides in Europe, either acute ECx type studies or chronic NOEC studies are used to derive toxicity exposure ratios that are employed for risk assessment (Directive 91/414; EU 1991). In the United States, the regulatory terminology implies the use of chronic NOEC data as a basis for the calculation of hazard quotients used in the risk assessment of pesticides (US EPA 2004). In contrast, chronic exposure studies in aquatic algae (e.g. OECD TG 201) are typically analysed by calculation of an EC10 and EC20, which are often used alongside aquatic fish studies in the assessment of risk. Also, some probabilistic risk assessment schemes might require information on the slope and confidence limits of the dose-response curve.
Regulatory needs, test designs and statistical methods cannot be considered independently, and the impact of a change in one of these factors on the others must be considered carefully. For example, in Chapter 4 it is stated that for endocrine screens (OECD TG 229 and OECD TG 230) “the definitive test exposes fish to a suitable range of concentrations maximizing the likelihood of observing the effect. The important distinction being that achieving a NOEC is not the purpose of the test”. This statement accurately reflects the original basis for the design of the test. Yet, as the basis for the original test design fades in memory, there appears to be a tendency to expect the calculation of both ECx and NOEC values based on the results of the screens. It cannot be assumed that the design of this test can support adequate estimates of EC10 and EC20 values without adequate studies of the accuracy and precision of those estimates. Alternative testing methods may have novel strengths and limitations in terms of statistical power compared to standard guidelines (e.g. testing based on the upper threshold concept). Statistical methods must be capable of detecting, or modelling, the smallest effects that are biologically meaningful (for discussion, see below). A key issue in the interpretation of fish toxicity tests, as they grow in complexity, is to distinguish between biologically important effects caused by the test chemical and statistically detectable differences. This aspect of ecotoxicology test guideline data interpretation is identical to the principles developed for many mammalian test guidelines in recent years (Länge et al. 2002, Williams et al. 2007). Against this background, a key element of this chapter is to illustrate key principles of data interpretation (e.g. importance of historical control values for endpoints of interest, adequate replication, etc.) that can be used as required for different fish test guidelines.
Toxicological endpoints should not be interpreted in isolation from other information relevant to the test. For example, it is usually assumed that the responses will follow an underlying monotone concentration-response pattern (i.e. there is a general tendency for the effect to increase as concentration increases) in the absence of compelling evidence to assume otherwise. Use of the knowledge that responses are likely to follow such a pattern can lead to better statistical tests and allow variations not related to treatment to be identified. For example, this assumption makes the Jonckheere-Terpstra trend test a powerful tool for the calculation of NOEC values and forms the entire basis for the calculation of ECx values.
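As an illustration of such a trend test, the following is a minimal sketch of the asymptotic Jonckheere-Terpstra statistic (with a simple tie adjustment in the statistic, no tie correction in the variance, and without the step-down procedure applied in practice); it is a teaching sketch under those assumptions, not a validated implementation.

```python
# Minimal sketch of the asymptotic Jonckheere-Terpstra trend test.
# Groups are ordered by increasing concentration; a large statistic
# relative to its null mean indicates an increasing trend.
from itertools import combinations
from statistics import NormalDist

def jonckheere_terpstra(groups):
    """Return (JT statistic, z score, one-sided upper-tail p value)."""
    jt = 0.0
    for gi, gj in combinations(groups, 2):     # all ordered group pairs i < j
        for x in gi:
            for y in gj:
                jt += (y > x) + 0.5 * (y == x)  # pairs supporting a trend
    sizes = [len(g) for g in groups]
    total = sum(sizes)
    mean = (total * total - sum(k * k for k in sizes)) / 4.0
    # Null variance without tie correction:
    var = (total * total * (2 * total + 3)
           - sum(k * k * (2 * k + 3) for k in sizes)) / 72.0
    z = (jt - mean) / var ** 0.5
    return jt, z, 1.0 - NormalDist().cdf(z)

# Perfectly ordered toy data: JT attains its maximum of 27.
jt, z, p = jonckheere_terpstra([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

For real analyses the step-down version described in OECD (2006) would be applied concentration by concentration.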
There is some controversy on the question of whether hypothesis testing (NOEC/LOEC [no/lowest observed effect concentration]) or regression (ECx) is the better way to evaluate toxicity data (e.g. Chapman et al. 1996, Dhaliwal et al. 1997). It is not the intention to replay that debate here. The intention of this chapter is to indicate how best to do each type of analysis and to indicate the types of data and experimental designs under which each type of analysis can be done with a reasonable expectation of useful results. Therefore, requirements for the different approaches are considered.
The OECD guidance document no. 54 (OECD 2006) describes current approaches to the statistical analysis of ecotoxicity data and should be consulted. However, the recent development of new fish test guidelines (e.g. draft Fish Sexual Development Test) has raised additional specific issues worthy of discussion in addition to some general considerations.
3.1.1 Biological versus statistical significance
The question of what magnitude of effect is biologically important to detect, or what effects concentration (ECx) to determine, is not a statistical issue. This issue is not unique to fish tests, but applies equally to other ecotoxicity test species such as Daphnia and algae. Scientific judgment, grounded in repeated observation of the same response in the same species under the same conditions (i.e. an understanding of historical control data), is required to specify this. Statistics provides a means to determine the magnitude of effect that a given experimental design can quantify. To put this another way, once an effect size of biological importance has been determined by subject matter scientists, it is possible to design an experiment that has a high likelihood of producing the desired information (i.e. whether an effect of the indicated size occurs at some test concentration, or what concentration produces the specified effect).
The relationship between biological significance and statistical significance can be understood in terms of the magnitude of effect that can be detected statistically. For a continuous response, such as growth or fecundity, this in turn depends on the relative magnitude of the between-replicate and within-replicate variances. The standard error of the sample control mean response is given by:

SE = SQRT[Var(Rep)/r + Var(ERR)/(r∙n)] = σ∙SQRT[R/r + 1/(r∙n)],
where σ is the within-replicate or error standard deviation, R is the ratio of the between-replicate variance to the within-replicate variance, r is the number of replicates in the control group, n is the number of fish per replicate, Var(Rep) is the between-replicate variance and Var(ERR) is the within-replicate or fish-to-fish variance.
The 95 % confidence interval for the mean is, approximately, Mean ± 2∙SE. It is often convenient to express 2∙SE as a percent of the mean, so the 95 % confidence interval for the mean can be expressed as Mean ± P %, where P = 200∙SE/Mean. The true mean is statistically indistinguishable from any value in this confidence interval for the sample mean. This means that the smallest treatment effect that can be distinguished statistically is P % of the control mean. This holds for both the NOEC and ECx approaches.
It is then incumbent on the study director to determine the magnitude of effect, Q, that is judged biologically important. For a given experimental design and endpoint, Q is compared to P: if Q > P, then the experiment is suited to the purpose; otherwise it is not.
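The comparison of Q and P can be computed directly from the standard error formula above. The function below is a small sketch; the argument names follow the symbols defined in the text, and the example values are hypothetical.

```python
# Smallest statistically distinguishable effect, P, as a percent of the
# control mean, using SE = sigma*sqrt(R/r + 1/(r*n)) from the text.
def min_detectable_pct(sigma, R, r, n, mean):
    se = sigma * (R / r + 1.0 / (r * n)) ** 0.5
    return 200.0 * se / mean

# Hypothetical illustration: sigma = 10, R = 0.5, 4 replicates of 8 fish,
# control mean 100 gives P of roughly 7.9 %, so only effects Q of about
# 8 % or more could be addressed by such a design.
p_pct = min_detectable_pct(10.0, 0.5, 4, 8, 100.0)
```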
For example, in a recent fish full life-cycle (FFLC) test, the control mean for percent male offspring was 69 %, with a standard error of the mean of 19 %. Thus, the smallest effect that can be found statistically significant is 19 %, and ECx for x < 19 cannot be reliably estimated. Another way of considering this is to observe that the lower bound of the 95 % confidence interval for EC10 and EC20 is 0. Also, the NOEC is a concentration at which a > 19 % effect was observed. (NOTE: The confidence interval for the difference between the control mean and a treatment mean is actually greater than 19 % by a factor of √2.) Vitellogenin (VTG) is another highly variable response, and only large effects, around 40 %, can be expected to be statistically significant in a practical experiment. Equivalently, EC40 might reasonably be estimated, but not EC25.
At the other extreme, the standard error of a growth measurement for Daphnia is often 2-3 % of the control mean, so very small effects can be found statistically significant. For this response, it is quite feasible to estimate EC5. It is a matter of scientific (non-statistical) judgment whether such small effects are biologically meaningful. Similar findings hold for avian eggshell thickness measurements.
From the formula for the standard error (SE), it will be evident that there is a trade-off between the number of fish per replicate and the number of replicates per control or test concentration. For example, from the formula, it is evident that if the number of replicates is doubled and the number of fish per replicate is reduced by 50 %, then the second term in brackets is unchanged, but the first term is reduced by 50 %. This might indicate a preference for more replicates of fewer subjects per replicate. However, if Var(Rep) is already relatively small, not much is gained by such an approach. Instead, if we cut the number of replicates by 50 % but increase the number of subjects per replicate by a factor of 4, then the first term remains small but the second term is reduced by 50 %. Thus, whether it is better to have a few replicates with many subjects in each, or many replicates with few subjects in each, depends on the relative magnitude of the two variances. A general rule is that if the ratio R of variances exceeds 0.5, then more emphasis is given to the number of replicates; otherwise, more emphasis is given to the number of subjects within each replicate. For example, shoot height of some emergent crops (e.g. oat, tomato, rape) tends to have variance ratios exceeding 0.5 (John Green, pers. comm.).
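The arithmetic behind this trade-off can be checked directly. The sketch below evaluates the two terms of SE² for the design variants described above, using illustrative (not measured) values of σ and R.

```python
# Between-replicate and within-replicate terms of SE^2 for a design with
# r replicates of n fish; sigma and R follow the notation in the text.
def se_terms(sigma, R, r, n):
    return (sigma ** 2 * R / r, sigma ** 2 / (r * n))

base      = se_terms(1.0, 0.5, r=2, n=8)   # reference design
more_reps = se_terms(1.0, 0.5, r=4, n=4)   # double replicates, halve fish
more_fish = se_terms(1.0, 0.5, r=1, n=32)  # halve replicates, 4x fish
```

With the replicate-heavy variant the first (between-replicate) term halves and the second is unchanged; with the fish-heavy variant the second term halves at the cost of doubling the first, exactly as the text describes.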
As a further illustration, recent experiments conducted for the OECD found VTG measurements to be very highly variable, with the within-replicate variance ranging between two and ten times the between-replicate variance. Thus, good experimental design would call for relatively few replicates with numerous fish in each. In the case of medaka (Oryzias latipes), it was found that a control and three test concentrations, with two replicates per control and test concentration and five fish per replicate, were adequate to give 80 % power to detect a 60 % effect. For fathead minnow (Pimephales promelas), four replicates of four fish each in the control and each test concentration were required to give 80 % power to detect a 94 % effect. While these effect sizes may seem large and the numbers of fish small, there were practical constraints on the number of replicates and fish that could be accommodated. Furthermore, effects of this size were observed at the high test concentrations in these validation experiments.
A complication arises when there are multiple endpoints to be analysed in a single experiment. If the experimental design is optimal for one response, it may be sub-optimal for another. This may mean that only very large effects can be estimated or detected statistically for one endpoint, while very small, biologically unimportant effects may be found statistically significant for another. It is important to understand this in interpreting the data: a biologically important effect may be missed in the first instance, which should not be interpreted to mean the chemical in question has no effect on that response, while a sound study might be rejected because of a tiny effect found statistically significant. Any statistical result should be interpreted in light of the biologically important effects determined before the experiment was conducted.
To design an experiment with a high likelihood of detecting a P % effect (or estimating a meaningful ECP), it suffices to find r and n so that P = 200∙SE/Mean. It is a simple matter to construct a table showing 200∙SE/Mean for various values of r and n, based on historical control estimates of the two variances, and to seek the most practical combination to define the design.
3.2 NOEC/LOEC
For the purpose of determining an NOEC or an LOEC, it is important to design the experiment so as to have a reasonable chance of finding a biologically relevant effect statistically significant, while minimizing the chance of finding biologically irrelevant effects statistically significant. These two objectives are somewhat incompatible, and judgment will be useful in reaching appropriate regulatory conclusions. A fish study will have a water control group (dilution water control), a solvent control if a solvent is used, and at least one test concentration. Unless the design is for a limit test, there will usually be three or more test concentrations approximately equally spaced on a log scale. With very few exceptions, the control(s) and test concentrations should be replicated. Replicate here refers to the test vessel, not to individual animals, unless they are housed individually in a test vessel. The trade-off between the number of fish per replicate and the number of replicates per test concentration and control will vary according to the response and species, and a power calculation may be needed to determine the best design. Since multiple responses are usually tested from the same experiment, it will often not be possible to design an experiment that is optimal for all responses. Judgment is needed to decide on the most important response(s), and experiments should be designed to provide adequate power (75-80 %) to detect biologically relevant changes in those responses. Power simply refers to the probability of finding statistically significant an effect of a given true magnitude, taking into account variability in the response of interest and variability arising from sampling. It is also important to quantify the effect size likely to be found significant in all other responses. This may indicate a need to rethink the objectives of the experiment. There should be no surprises at the end of the study about what can be analysed, by what test, and with what ability to detect effects.
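As a rough planning aid, the power of a single one-sided pairwise comparison can be approximated with the normal distribution; this sketch ignores multiplicity adjustment and the trend structure, and reuses the SE formula given earlier, so it is indicative only.

```python
# Approximate power of a one-sided z-type comparison of one treatment
# group against the control; delta is the true mean difference.
from statistics import NormalDist

def approx_power(delta, sigma, r, n, R=0.0, alpha=0.05):
    se_group = sigma * (R / r + 1.0 / (r * n)) ** 0.5
    se_diff = (2.0 ** 0.5) * se_group      # SE of the difference in means
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    return NormalDist().cdf(delta / se_diff - z_crit)

# With alpha = 0.025 (a two-sided 5 % test), an effect of about 2.8
# standard errors of the difference yields roughly 80 % power.
power_rule_of_thumb = approx_power(2.8 * 2 ** 0.5, 1.0, 1, 1, alpha=0.025)
```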
Most responses are analysed using 2-sided tests, unless there is a clear biological reason to expect, or be concerned only with, changes in one direction (e.g. an increase). Furthermore, for most responses there is an expectation that the concentration-response is approximately monotone. The effect may be measured as an increase or a decrease in some measurement (e.g. weight might decrease, mortality might increase). That being true, a test that is designed for a monotone trend is more powerful than one that simply compares each test group to the control independently of effects in other test groups. Thus, there is a preference for a step-down Williams or Jonckheere-Terpstra test over a Dunnett, Dunn, Mann-Whitney or other pairwise test, provided the data are consistent with a monotone concentration-response. All tests referenced in this chapter are discussed in detail in OECD (2006).
All statistical procedures are based on some data requirements. In addition to the monotonicity requirement for Williams and Jonckheere-Terpstra for continuous responses and Cochran-Armitage for quantal responses, there are additional requirements. The Williams and Dunnett tests require normally distributed data with homogeneous variances. While these tests are robust against mild violations of these requirements, they are not impervious, and some checking of these requirements is appropriate. A visual check from a scatter plot may be sufficient to assess monotonicity and variance homogeneity, and even normality. There are also formal tests for all three, and OECD (2006) provides details.
Where normality or variance homogeneity are violated, a transformation of the data to achieve these requirements can be sought, or non-parametric methods employed, which have fewer requirements or are much less sensitive to violations. Be mindful that different agencies may have different requirements on how, or whether, data should be transformed. Contrary to widely held opinion, non-parametric tests are not always inferior to parametric tests. For example, the power properties of the step-down Jonckheere-Terpstra test are very similar to those of the step-down Williams test when the data are normally distributed with homogeneous variances, and are superior to Williams when those conditions are violated. On the other hand, for datasets with few replicates, the power properties of the Mann-Whitney and Dunn tests are worse, sometimes much worse, than those of Dunnett’s test. Fig. 3.1 indicates a typical comparison of these tests.
Fig. 3.1 shows the power of seven tests for an experiment with three positive test concentrations and a single control. The horizontal axis shows the percent change from the control, and the vertical axis shows the probability that an effect of that size will be found statistically significant. The red curve with diamonds is the step-down Jonckheere-Terpstra (JT) test (standard asymptotic version), orange with triangles is the exact permutation version of JT, the dark black curve with asterisks is the Williams test, blue with circles is Dunnett’s test, cyan with asterisks and green with squares are the standard (asymptotic) and exact permutation versions of the Mann-Whitney (also known as the Wilcoxon) test, and grey dots mark Dunn’s test. In the top row, the design called for two replicates of eight fish in the control and each test concentration, while the bottom row is for two replicates of two fish each. The left column is for a design following the square-root allocation rule (see below), and the right column is for a design with equal replication in the control and all treatment groups. On the left, the grey Dunn power curve is hidden by the green and cyan Mann-Whitney curves. The data generated were normally distributed with homogeneous variances, with variability as observed for VTG in some OECD validation experiments for fish endocrine screening studies. It is clear that the power of the Jonckheere test is greater than that of the Williams test on the left, whereas the Williams test sometimes has slightly greater power on the right, and both tests exceed in power all the pairwise tests (Dunnett, Dunn, Mann-Whitney). A striking feature of these results is that Mann-Whitney has zero power to detect effects, regardless of magnitude, in either design, whereas Dunn’s has zero power under the square-root rule and low power under equal allocation. This knowledge is clearly important in deciding on design and test selection.
3.3 ECx
Standard regression models also depend on data meeting certain requirements. Among these, a key prerequisite is that the observations are mutually independent. This requirement is violated, for example, if all responses are divided by the control mean in an attempt to “normalize” the data or reduce variability. While it is not impossible to model correlated responses, specialized models are required to do so. For a continuous response (e.g. where the proportion of males is analysed as a continuous response), the data are assumed to be normally distributed with homogeneous variances. There are modifications to accommodate heterogeneous variances, such as alternative variance-covariance structures or weighting. It is also possible to accommodate some types of non-normality.
Fig. 3.1: Power of various tests to detect a VTG effect at dose 3, from a fish experiment with a control and three positive doses. Each of the four panels plots power (0-100 %) against percent change from control (log-transformed, 0-200 %): the top row uses two replicates of eight fish, the bottom row two replicates of two fish; the left column follows the square-root allocation rule, the right column uses equal allocation.
It is recognized that regression is robust against mild deviations from normality and variance homogeneity, but it can be adversely affected by serious violations of these requirements. Thus, some checking of the distribution of responses is in order, either visually from a scatter plot or through formal testing. For quantal data, normality is not required, but typically the data are assumed to follow a binomial distribution within a given treatment group. Quantal data should be checked for extra-binomial variance (more variation than can be accounted for by the simple binomial distribution), the quantal analogue of the homogeneous variance requirement for continuous responses. If extra-binomial variance is observed, there are statistical test methods which take this into account (e.g. Rao-Scott, as described in OECD 2006). Finally, attention should be paid to the goodness-of-fit of the model to the data. There are several ways to assess goodness-of-fit. Among the simplest are (1) visual comparison of the responses predicted by the model to the observed responses, and (2), where replicates are available, comparison of the residual mean square from the model against the pure error mean square: if the residual mean square is significantly larger than the pure error mean square, then the model does not fit the data well, although with the small datasets typical in this field this may not be a powerful test. (3) Confidence bounds on the model predictions are very important to show whether the model predictions have any meaning. It should be understood that typical confidence bounds do not capture model uncertainty, since they are constructed on the assumption that the model is correct; this is one reason for conducting other goodness-of-fit assessments such as items (1) and (2). If no confidence bounds can be computed, or they are very wide, then predictions from the model are scientifically unreliable, regardless of how well the model appears to fit the data on visual inspection. It is also possible to compute an R-square value to judge the proportion of the total sum of squares that is accounted for by the model. While R-square is a useful measure for linear models, it is an unreliable guide for the non-linear models most often used to model ecotoxicity responses. For comparing two models for the same data, Akaike’s information criterion (AIC) can be useful.
In more general terms, a search of OECD TGs 204, 210, 212, 215, 229 and 230 was made to determine the procedures specified for ECx calculation. Only OECD TGs 212 and 215 describe ECx procedures. OECD TG 212 describes a normalization procedure, but does not specify fitting a regression curve to the normalized percentages from which to estimate ECx. In contrast, OECD TG 215 describes two test designs, one for ECx and one for NOEC determination, acknowledging the necessary differences: “a design which is optimal (makes best use of resources) for use with one method of statistical analysis is not necessarily optimal for another. The recommended design for the estimation of a LOEC/NOEC would not therefore be the same as that recommended for analysis by regression.”
If ECx designs are to be more widely used, consideration of the following prerequisites will be necessary:
- General guidance should be given on how the need to estimate ECx affects the optimum spacing of concentrations, number of treatments, and number of replicates to be used. This guidance will probably suggest test designs quite different from the minimum designs currently described in the OECD TGs optimized for NOEC determination.
- Different endpoints may elicit responses at very different concentration levels; a strategy is therefore required to handle this within one test.
- What constitutes a meaningful ECx estimate, and what is the implication if a meaningful estimate cannot be obtained for one or all endpoints?
Validation of the draft Fish Sexual Development Test raised particular concerns about the use of regression analysis for the determination of effects on sex ratio. These issues are explored in detail in Appendix 1. From these discussions, it is clear that the ECx approach is inappropriate for the sex ratio endpoint and a NOEC design should be carried forward.
3.4 NOEC versus ECx designs
It is stated in OECD (2008) that: “[92.] In summary, ANOVA designs for fish testing appear inferior to regression designs, and the latter are considered to show more promise for fish life-cycle tests given the generally large inherent variability in egg production (fecundity) between individuals, which inevitably reduces the power of the ANOVA approach. Final decisions on which design strategy to use should be made on a case-by-case basis, taking into account factors such as the known variability in reproductive output of the species in question.”
The example datasets and analyses discussed in Chapter 3.x (Appendix) should serve as a caution against overstating the advantages of regression over what is referred to as the ANOVA approach. While regression has always been an important tool for statisticians, it is not appropriate for some datasets and can suggest a level of precision that is not supported by the data. Regression analysis calls for different experimental designs than those for which ANOVA methods are intended. Just as ANOVA methods call for designs with adequate power to detect biologically relevant adverse effects, regression methods call for designs that are capable of providing reliable or meaningful estimates of an x % effects concentration, and this requires designing around the specific x or percent effect to be estimated. Basic requirements include the following: (1) There should be test concentrations on both sides of ECx; the zero-concentration control does not figure into this requirement (but see the following paragraph). (2) If the 95 % confidence interval for the control response is of the form Mean ± P %, then estimates of ECx are meaningful only for x > P. For example, if the control mean is estimated only with 20 % error, then it is meaningless to estimate EC10. (3) If the confidence interval for ECx is very wide, perhaps spanning several test concentrations, then there can be little or no confidence in the ECx estimate. These basic requirements are important to keep in mind because, once a regression model is fit to the data, it is a simple mathematical exercise to use the resulting equation to estimate ECx for any percent x, and yet not all such values of x lead to plausible or meaningful estimates. A mathematical equation is not a substitute for valid interpretation of data. This is akin to the requirement in the NOEC approach of adequate power to detect an effect of a magnitude deemed biologically important.
It can be appropriate to determine both an NOEC (provided there is sufficient power to detect biologically relevant effects) and an ECx (provided x is beyond the range of control variability). Ideally, x should lie between two tested concentrations. It is permissible to extrapolate modestly beyond the range of tested concentrations, provided this does not violate restriction (2) in the previous paragraph. However, such extrapolation necessarily comes with increased uncertainty, needs to be justified, and assumes that the model fit is valid beyond the range of tested concentrations, something that is untestable from the data. The uncertainty increases the further one extrapolates beyond the experimental range.
Although it is not generally recommended to combine the NOEC and ECx approaches in the same study, there may be compelling reasons to do so. For certain existing regulatory frameworks, it might be appropriate to focus on NOEC test designs for fish chronic endpoints (e.g. FIFRA: US EPA 1996), whereas future regulatory frameworks could require both ECx and NOEC determinations in fish chronic studies (e.g. draft Sanco 2010 review document). The latter, however, has serious implications for experimental design, time and costs, as well as for ethical and statistical interpretation. It might not be practical to design tests with multiple endpoints to determine both the NOEC and ECx values for all endpoints of interest.
3.5 Alternate designs (e.g. square-root allocation rule)
Several factors affect the power of a given test: the experimental design (e.g. number of replicates per control and treatment group, number of fish per replicate, number of treatment groups), the shape of the concentration-response, and the inherent variability of the response of interest. One simple but important decision is whether the control and treatment groups should be equally replicated or whether more replicates should be allocated to the control. The argument for the latter is two-fold: first, it gives a better measure of the undisturbed population against which all treatment groups are compared, and, second, it tends to increase the power of the test, in part by increasing the degrees of freedom for the test statistics.
Dunnett (1955) showed that the power of his test is optimized using what is called the square-root allocation rule, which provides a specific formula for the number of replicates in the control and all treatment groups. Details are given in OECD (2006). Further published theoretical work and extensive power simulation studies have shown that this same rule (or a minor modification) also maximizes the power of the Williams and Jonckheere tests and usually increases the power of the Mann-Whitney and Dunn tests.
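In its commonly quoted form (an assumption here; see OECD 2006 for the exact formula and modifications), the rule allocates to the control about √k times the replicates of each of the k treatment groups:

```python
# Square-root allocation sketch: with k treatment groups of r replicates
# each, the control gets approximately r * sqrt(k) replicates.
import math

def control_replicates(r_treatment, k_groups):
    return round(r_treatment * math.sqrt(k_groups))
```

For instance, with four treatment groups of three replicates each, the control would receive about six replicates.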
3.6 Solvent/carrier control
One of the issues brought up in the first version of the OECD Fish Sexual Development Test Re-5
view, and which is a consideration on all of the test guidelines, is how the two controls (dilution 6
water and solvent controls) should be used when there is a solvent used in the treatment groups. 7
There are advantages to pooling the two controls to test for treatment effects: (1) By doubling the 8
number of control replicates, the power of the tests for treatment effects is increased, achieving 9
at least part of the advantages of the square-root allocation rule described above. (2) All the data 10
are used and the pooled control provides the best estimate of the background population from the 11
experiment. Permissible solvents are those which have been well-established in fish experiments 12
and have been found to have no practical effect on fish at the concentrations used. A preferable 13
alternative to always pooling the controls is to compare them statistically and pool them, if no 14
significant difference is found, and otherwise use only the solvent control to test for treatment 15
effects. The justification for the latter is that solvent is in all the treatment groups at approxi-16
mately the same concentration as in the solvent control, so that one compares solvent plus treat-17
ment to solvent, the difference being the treatment effect. This is a plausible hypothesis based on 18
the apparent additivity of effects in most aquatic chemical mixtures that is supported by concen-19
tration addition. References on this include Belden et al. (2007), Backhaus et al. (2010), as well 20
as Kortenkamp et al. (2007). The last communication addresses endocrine disrupting chemicals 21
specifically as well as other classes of chemicals. 22
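The compare-then-pool strategy described above can be automated. The sketch below is our own minimal illustration, not a prescribed procedure: it assumes approximately normal replicate (tank) means, uses a two-sample t-test at a conventional α = 0.05, and the function name and example data are hypothetical.

```python
from scipy import stats

def choose_control(dilution_water, solvent, alpha=0.05):
    """Compare the two controls; pool them if they do not differ
    significantly, otherwise use the solvent control alone."""
    # Two-sample t-test on replicate (tank) means of the two controls
    t_stat, p_value = stats.ttest_ind(dilution_water, solvent)
    if p_value >= alpha:
        # No evidence of a solvent effect: pool for more power
        return list(dilution_water) + list(solvent), "pooled"
    # Solvent effect suspected: compare treatments to the solvent control only
    return list(solvent), "solvent only"

# Hypothetical tank-mean data for the two controls
water = [10.2, 9.8, 10.5, 10.1]
solv = [10.0, 10.3, 9.9, 10.4]
control, label = choose_control(water, solv)
```

Note that failing to reject here is weak evidence of "no solvent effect", which is one reason the pooling question remains a judgment call.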
Currently, there is a lack of harmonisation amongst regulatory authorities on what the control for statistical analysis should be (dilution water control or solvent control, and whether they should be pooled or not). A definitive answer to this question cannot be provided at present, but it has been recommended that a working group be established to progress this issue. A topic that might be addressed by such a working group is the reduction in the number of animals that could arise from eliminating one of the controls.

3.7 Power
In the design stage, the primary use of power analysis in toxicity studies is to demonstrate adequate power to detect effects that are large enough to be deemed important. If our methods have sufficient power and we find no statistically significant effect at a given concentration, we can have some confidence that there is no effect of concern at that concentration. Failure to achieve adequate power can result in large effects being found statistically insignificant. On the other hand, a test can also be so powerful that it finds statistically significant effects of little practical importance.
Deciding what effect size is large enough to be important is difficult. In some cases, the effect size may be selected by regulatory agencies or specified in guidelines.
For design purposes, the background variance can be taken to be the pooled within-experiment variance from a moving frame of reference covering a sufficiently long period of historical control data with the same species and experimental conditions. The time-window covered by the moving frame of reference should be long enough to average out noise without being so long that undetected experimental drift is reflected in the current average. If available, a three-to-five year moving frame of reference might be appropriate. When experiments must be designed using more limited information on variance, it may be prudent to assume a slightly higher value than has been observed. Power calculations used in designs for quantal endpoints must take the expected background incidence rate for the given endpoint into account, as both the Fisher exact and Cochran-Armitage tests are sensitive to this background rate, with the highest power achieved at a zero background incidence rate. The background incidence rate can be taken to be the incidence rate in the same moving frame of reference already mentioned.
At the design stage, power must, of necessity, be based on historical control data for initial variance estimates. It may also be worthwhile to do a post-hoc power analysis to determine whether the actual experiment is consistent with the criteria used at the design stage. If the observed variance is significantly higher (e.g. based on a chi-square or F-test) than that used in planning, then the assumptions made at design time may need to be reassessed. Care must be taken in evaluating post-hoc power against design power: experiment-to-experiment variation is expected, and variance estimates are more variable than means. The power determination based on historical control data for the species and endpoint being studied should be reported.

Power Example
Suppose we want to determine the NOEC for mortality in an experiment with rainbow trout (Oncorhynchus mykiss), where past experience with this species suggests that the background mortality rate at the relevant age and test duration is near zero. We want to be able to detect a 20 % mortality rate and, based on preliminary range-finding experiments, we have decided on an experiment with five test substance concentrations at 50, 100, 200, 400, and 800 ppm, plus a single (non-solvent) control. Furthermore, suppose previous experience suggests that extra-binomial variance and within-tank correlations of responses are unlikely, so a standard Cochran-Armitage test can be done treating all fish within a concentration equally (i.e. ignoring any tank or replicate effect). How many fish per concentration should we plan?

First, consider designs with the same number, n, of fish in each concentration as in the control. The power of the Cochran-Armitage test depends on the shape of the concentration-response curve, which we do not know. Powers have been simulated for numerous shapes. Based on an examination of the various power plots, a reasonable choice for design purposes is the linear concentration-response shape. In addition, the power depends on the threshold of toxicity; for design purposes, we will assume that it is zero. The following plots will help (Figs. 3.2 and 3.3).

Fig. 3.2: Power versus maximum change of 100∙Delta % for n = 5 fish per concentration.
Fig. 3.2 shows that 5 fish per concentration would give very low power (about 25 %) to detect a 20 % change in the high concentration. There is little point in conducting the experiment for this purpose.
Consider a design with 20 fish per concentration: this sample size gives a power of 82 % to detect a 20 % mortality rate in the 800 ppm concentration (Fig. 3.3), which may well be adequate. What is the power to detect a 20 % mortality rate at lower concentrations? Fortunately, we do not lose much power as we step down: the power to detect a 20 % mortality rate at 400 ppm is 80 %, at 200 ppm it is 78 %, and at 100 ppm it is 76 %. Notice, however, that if the background incidence rate were 10 %, the power to detect an increase in mortality rate of 20 % would drop to around 40 %, which would be inadequate for most purposes.
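Power figures of this kind can be reproduced by simulation. The sketch below is our own illustration (not the simulation used to produce Figs. 3.2 and 3.3): it assumes a linear concentration-response rising to 20 % mortality at the top concentration, zero background, 20 fish per group, equally spaced scores, and a one-sided Cochran-Armitage trend test at α = 0.05.

```python
import math
import random

def cochran_armitage_z(counts, n_per_group, scores):
    """One-sided Cochran-Armitage trend statistic for quantal data."""
    N = n_per_group * len(counts)
    p_bar = sum(counts) / N
    if p_bar in (0.0, 1.0):  # no variation: no trend detectable
        return 0.0
    s_bar = sum(scores) / len(scores)
    num = sum((x - n_per_group * p_bar) * d for x, d in zip(counts, scores))
    var = p_bar * (1 - p_bar) * n_per_group * sum((d - s_bar) ** 2 for d in scores)
    return num / math.sqrt(var)

def simulated_power(probs, n_per_group=20, n_sim=5000, z_crit=1.645, seed=1):
    """Fraction of simulated experiments in which the trend test rejects."""
    rng = random.Random(seed)
    scores = list(range(len(probs)))
    hits = 0
    for _ in range(n_sim):
        counts = [sum(rng.random() < p for _ in range(n_per_group)) for p in probs]
        if cochran_armitage_z(counts, n_per_group, scores) > z_crit:
            hits += 1
    return hits / n_sim

# Control plus five concentrations, mortality rising linearly to 20 %
probs = [0.0, 0.04, 0.08, 0.12, 0.16, 0.20]
power = simulated_power(probs)
```

With these assumptions the simulated power comes out close to the 82 % quoted above, though the exact value depends on the scoring and on whether a step-down procedure is applied.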

Fig. 3.3: Power versus maximum change of 100∙Delta % for n = 20 fish per concentration.

3.8 Replicates
Decisions on the number of fish per tank and the number of tanks per group should be based on power calculations using historical control data to estimate the relative magnitudes of within- and among-tank variation and correlation. If there is only one tank per test concentration, then there is no way to distinguish housing effects from concentration effects; neither between- nor within-group variances or correlations can be estimated, nor is it possible to apply any of the statistical tests described for continuous responses to tank means. Thus, a minimum of two tanks per concentration is recommended; three tanks are much better than two, and four are better than three. Some non-parametric tests (though not the Mann-Whitney) require a minimum of four replicates. The improvement in modelling falls off substantially as the number of tanks increases beyond four. (This can be understood on the following grounds: the modelling is improved if we get better estimates of both among- and within-tank variances. The quality of a variance estimate improves as the number of observations on which it is based increases. Either sample variance will have, at least approximately, a chi-squared distribution. The quality of a variance estimate can be measured by the width of its confidence interval, and a look at a chi-squared table will verify the statements made.) For further discussion and other tests (parametric and non-parametric), see OECD (2006).
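The chi-squared argument can be made concrete: the width of a confidence interval for a variance shrinks rapidly as replicate tanks are added, but with sharply diminishing returns beyond about four. A small sketch of our own, using the usual chi-squared interval for a normal-theory variance:

```python
from scipy.stats import chi2

def variance_ci_ratio(n_tanks, alpha=0.05):
    """Ratio of upper to lower 95 % confidence limit for a variance
    estimated from n_tanks replicate tanks (df = n_tanks - 1)."""
    df = n_tanks - 1
    # CI for sigma^2 is [df*s2/chi2_upper, df*s2/chi2_lower]; the ratio
    # of the two limits does not depend on the observed s2
    return chi2.ppf(1 - alpha / 2, df) / chi2.ppf(alpha / 2, df)

# Interval-width ratio for 2 to 6 tanks per group
ratios = {k: variance_ci_ratio(k) for k in (2, 3, 4, 5, 6)}
```

The ratio drops by orders of magnitude between two and four tanks, and only modestly thereafter, which is the point made in the parenthetical remark above.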
The number of tanks per concentration and the number of fish per tank should be chosen to provide adequate power to detect an effect of a magnitude judged important. This power determination should be based on historical control data for the species and endpoint being studied.
Since the control group is used in every comparison of treatment to control, consider allocating more fish to the control group than to the treatment groups in order to optimize power for a given total number of fish. The optimum allocation depends on the statistical test to be used. A widely used allocation rule was given by Dunnett (1955): for a total of N fish and k treatments to be compared to a common control, if the same number, n, of fish is allocated to every treatment group, then the number, n0, to allocate to the control so as to optimize power is determined by the so-called square-root rule. By this rule, the value of n is (the integer part of) the solution of the equation N = kn + n√k, and n0 = N - kn. [It is almost equivalent to say n0 = n√k.] This allocation has been shown to optimise power for Dunnett's test. It is used, often without formal justification, for other pairwise tests, such as the Mann-Whitney and Fisher exact tests. Williams (1972) showed that the square-root rule may be somewhat sub-optimal for his test, and that optimum power is achieved when k in the above equation is replaced by a value between 1.1k and 1.4k. Computer simulations show that for the step-down Jonckheere-Terpstra and Cochran-Armitage tests, power gains of up to 25 % can be realized under the square-root rule compared with equal sample sizes.
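The rule can be applied mechanically; the helper below is our own small sketch (the function name is hypothetical):

```python
import math

def square_root_allocation(N, k):
    """Square-root rule: n fish per treatment group, where n is the
    integer part of N / (k + sqrt(k)), and n0 = N - k*n in the control."""
    n = int(N / (k + math.sqrt(k)))
    n0 = N - k * n
    return n, n0

# The worked power example in the text: 120 fish, 5 treatment groups
n, n0 = square_root_allocation(120, 5)  # gives n = 16, n0 = 40
```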

Power example, continued
What if we used the square-root rule in the above power example? Based on the above, we will examine the case where a total of 120 fish is used (20 per concentration and control in the above design). Under the square-root rule, we solve 120 = 5n + n√5 for n to get n = 16. Then n0 = 120 - 5 ∙ 16 = 40. The following power plot is based on this allocation (Fig. 3.4). Note that the power to detect a 20 % increase in mortality rate in the 800 ppm group is now 92 %. So, with the same number and spacing of concentrations and the same total number of fish, the power to detect a 20 % increase in mortality rate has increased from 82 % to 92 % by using the square-root allocation rule instead of equal sample sizes. An alternative way to use the square-root rule would be to reduce the total number of fish required without loss of power. Indeed, power curves for a nominal sample size of n = 15 under the square-root rule show that the power to detect a 20 % increase in mortality is 86 %. Thus, with a smaller total number of fish allocated optimally, the power to detect a 20 % increase is actually increased. This result underscores the importance of good experimental design and test selection.

Fig. 3.4: For details, see text.

In experiments where two controls (dilution water and solvent controls) are used and the controls are combined for further testing, a doubling of the control sample size is already achieved. Since experience suggests that most experiments will find no significant difference between the two controls, the optimum strategy for allocating fish is not necessarily immediately clear. This would of course not be a consideration if a practice of pooling controls is not followed.
The reported power increases from the square-root rule do not consider the effect of any increase in variance as concentration increases. One alternative is to add additional fish to the control group without subtracting from the treatment groups. There are practical reasons for considering this, since a study is much more likely to be considered invalid when there is loss of information in the controls than in the treatment groups.
The square-root allocation rule holds little, if any, advantage for regression analysis, because the curve fitting is done only once, using all the data; there is no special consideration or use of the controls.

3.9 Appendix 1: Detailed consideration of regression analysis for sex ratio endpoints
3.9.1 Comparison of alternative models
When using regression models based on treating the proportion male or female as a continuous response, there will be a need to select one model from among several candidate models fit to the data. There are both formal and informal selection criteria appropriate for the class of models that will be used for sex ratio data. The simplest approach to comparing models, visual inspection of the fits to the data, should be used in all cases, even when formal tools will also be applied. It is important to identify regions of concentrations where each model provides a poor fit. Next, the widths of the confidence bounds about the fitted curves should be compared. Generally, a model that gives narrower confidence bounds is preferred, but this does not outweigh the fit of the model to the data; there are situations where narrow confidence bounds are obtained for a model that clearly does not fit the data. Next, confidence bounds for all estimated parameters should be examined. If the confidence interval for a model parameter contains zero, then the model is suspect, as that parameter is evidently not required. Beyond that, preference is given to the model where the confidence intervals for the parameters are smallest. Where replicate data are available, the residual mean square from the model should be compared to the pure error mean square, which can be obtained from an ANOVA. Finally, Akaike's or Schwarz's information criterion can be used. The preferred form of Akaike's information criterion is given by the formula below. A good discussion of AICc is given in Motulsky and Christopoulos (2004).

AICc = n ∙ Ln(RSS/n) + 2kn/(n - k - 1),
where RSS is the residual sum of squares from the model, n is the total number of observations, and k is the number of parameters estimated for the model.
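In code the criterion is a one-liner; the sketch below simply implements the formula above (the example values of RSS, n and k are arbitrary):

```python
import math

def aicc(rss, n, k):
    """Corrected Akaike information criterion for a least-squares fit:
    AICc = n*ln(RSS/n) + 2*k*n/(n - k - 1)."""
    return n * math.log(rss / n) + 2 * k * n / (n - k - 1)

# Example: RSS = 10 from n = 20 observations with k = 3 fitted parameters
value = aicc(rss=10.0, n=20, k=3)  # about -6.36
```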
In general, the model with the smaller AICc is preferred. If the values of AICc are close for two models, it is helpful to compute the probability that model A is better than model B using the following formula:

Prob = e^(D/2) / (1 + e^(D/2)), where D = AICc(B) - AICc(A).

Here AICc(B) denotes the value of AICc for model B. If the probability is high, then model A is favoured. The following plot of these probabilities may be helpful (Fig. 3.5).
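The probability calculation can be sketched directly (the function name is our own):

```python
import math

def prob_a_better(aicc_a, aicc_b):
    """Probability that model A is the better model, based on the
    AICc difference D = AICc(B) - AICc(A)."""
    d = aicc_b - aicc_a
    return math.exp(d / 2) / (1 + math.exp(d / 2))

# A difference of 10 in favour of model A
p = prob_a_better(100.0, 110.0)  # about 0.993: A is almost certainly better
```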
Fig. 3.5: Probability of correct model selection
It will be observed that if AICc(B) - AICc(A) ≥ 10, model A is almost certainly better than model B. This criterion is limited, however, where weighted fits are used, as two models can be compared with this criterion only if they use the same weights. So, in comparing two unweighted model fits (i.e. weight = 1), the criterion is sound. For weighted models where the weights depend on the function being estimated, the criterion is not appropriate, and comparing an unweighted to a weighted model fit is certainly not appropriate. Some discussion of weights in using AICc is given in http://www.boomer.org/Manual/ch05.html and http://www.micromath.com/products.php?p=scientist&m=statistical_analysis.

3.9.2 The meaning of an x % effect
The third issue concerns the meaning of an x % effect. For incidence data (such as percent males), there are three distinct concepts that are sometimes confused:
Absolute risk is when x % of the population is affected.
Additional risk is when x % above the "background" is affected, so that if the background incidence rate is c %, then the total risk is (x + c) %.
Relative risk is when x % of the population that would "normally" not be affected is affected, so that the total risk is c % + (1 - c/100) ∙ x %.

To illustrate the difference, consider the meaning of EC50 when the background incidence is 20 %, i.e. C = 0.2 (background incidence rate).
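The three definitions give different total incidences for the same nominal x = 50 % effect over a C = 0.2 background, as the figures below illustrate. A direct calculation (our own sketch, with proportions on a 0-1 scale):

```python
def total_incidence(x, c, kind):
    """Total affected fraction for an effect x over background c,
    under the three risk definitions in the text."""
    if kind == "absolute":
        return x                    # x of the whole population
    if kind == "additional":
        return c + x                # x on top of the background
    if kind == "relative":
        return c + (1 - c) * x      # x of the normally unaffected part
    raise ValueError(kind)

x, c = 0.5, 0.2
totals = {k: total_incidence(x, c, k)
          for k in ("absolute", "additional", "relative")}
# absolute: 0.5, additional: 0.7, relative: 0.6
```

These are exactly the 50 %, 70 % and 60 % totals shown in Figs. 3.6 to 3.8.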
Fig. 3.6: Absolute risk: 50 % of the population is affected (only 30 % above background)

Fig. 3.7: Added risk: 70 % is affected (total = 20 % + 50 %)

Fig. 3.8: Extra (or relative) risk: 60 % is affected (total = 20 % + 50 ∙ (1 - 0.2) % = 20 % + 40 % = 60 %)

Probit analysis of incidence data is based on the concept of relative risk, and care must be taken to arrive at the correct ECx estimate if there is background incidence. For the sex ratio, if there is no background proportion male (i.e. no males in the control) and there are only males and females, then the EC50 for males is the same as the EC50 for females. However, if there is a background incidence of males, the two approaches are not equivalent, because it is important in probit analysis to analyse an increasing proportion when there is background incidence. Probit analysis can handle background incidence for an increasing function, but becomes thoroughly confused accounting for background in a decreasing function: what does a "background" incidence rate of 70 % mean when the incidence rate in a treatment group is 40 %?
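The relative-risk relationship is easy to invert: given background c, an observed incidence p corresponds to a treatment-only effect x = (p - c)/(1 - c), i.e. Abbott's formula. A minimal sketch of our own, including the trick recommended above of analysing the increasing proportion when the endpoint of interest decreases:

```python
def extra_risk(p_observed, c):
    """Treatment-only effect under the relative-risk model:
    p = c + (1 - c) * x  =>  x = (p - c) / (1 - c)  (Abbott's formula)."""
    return (p_observed - c) / (1 - c)

# Observed 60 % incidence over a 20 % background is a 50 % treatment effect,
# matching the relative-risk illustration in Fig. 3.8
x = extra_risk(0.6, 0.2)

# For a decreasing endpoint (e.g. proportion female falling as the
# proportion male rises), analyse the increasing complement instead
prop_female = [0.8, 0.7, 0.5, 0.3]
prop_male = [1 - f for f in prop_female]  # increasing, suitable for probit
```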

Glossary
NOEC – Definition
ECx – Definition
Chi-square – Definition and reference to the statistical test
Cochran-Armitage – Definition and reference to the statistical test
Dunn – Definition and reference to the statistical test
Dunnett – Definition and reference to the statistical test
Fisher exact test – Definition and reference to the statistical test
Jonckheere-Terpstra – Definition and reference to the statistical test
Mann-Whitney – Definition and reference to the statistical test
Rao-Scott – Definition and reference to the statistical test
Williams – Definition and reference to the statistical test

References
Backhaus, T., Blanck, H., Faust, M. (2010) Hazard and risk assessment of chemical mixtures under REACH – State of the art, gaps and options for improvement. Swedish Chemicals Agency, Sundbyberg.
Belden, J.B., Gilliom, R.J., Lydy, M.J. (2007) How well can we predict the toxicity of pesticide mixtures to aquatic life? Integr. Environ. Assess. Manag. 3:364-372.
Chapman, P., Crane, M., Wiles, J., Noppert, F., McIndoe, E. (1996) Improving the quality of statistics in regulatory ecotoxicity tests. Ecotoxicology 5:169-186.
Dhaliwal, B., Dolan, R., Batts, C., Kelly, J., Smith, R., Johnson, S. (1997) Warning: Replacing NOECs with point estimates may not solve regulatory contradictions. Environ. Toxicol. Chem. 16:124-126.
Dunnett, C.W. (1955) A multiple comparison procedure for comparing several treatments with a control. J. Am. Stat. Assoc. 50:1096-1121.
EU (1991) Council Directive 91/414/EEC of 15 July 1991 concerning the placing of plant protection products on the market.
EU (2003) EU Technical Guidance Document in support of Commission Directive 93/67/EEC on risk assessment for new notified substances, Commission Regulation (EC) No 1488/94 on risk assessment for existing substances and Directive 98/8/EC of the European Parliament and of the Council concerning the placing of biocidal products on the market.
Kortenkamp, A., Faust, M., Scholze, M., Backhaus, T. (2007) Low-level exposure to multiple chemicals: reason for human health concerns? Environ. Health Persp. 115 Suppl. 1:106-114.
Länge, R., Hutchinson, T.H., Croudace, C.P., Siegmund, F., Schweinfurth, H., Hampe, P., Panter, G.H., Sumpter, J.P. (2001) Effects of the synthetic estrogen 17α-ethinylestradiol on the life-cycle of the fathead minnow (Pimephales promelas). Environ. Toxicol. Chem. 20:1216-1227.
Motulsky, H., Christopoulos, A. (2004) Fitting Models to Biological Data Using Linear and Nonlinear Regression: A Practical Guide to Curve Fitting. Oxford University Press, Oxford.
OECD (2006) Current Approaches in the Statistical Analysis of Ecotoxicity Data: A Guidance to Application. OECD Series on Testing and Assessment, Guidance Document No. 54. Organisation for Economic Cooperation and Development, Paris, 146 pp.
OECD (2008) Detailed Review Paper on Fish Life-Cycle Tests. OECD Series on Testing and Assessment No. 95, ENV/JM/MONO(2008)22. Organisation for Economic Cooperation and Development, Paris.
Sanco (2010) Draft SANCO 11802/2010 working document. Revision July 2010.
US EPA (1996) Federal Insecticide, Fungicide, and Rodenticide Act, 7 U.S.C. §136 et seq., http://agriculture.senate.gov/Legislation/Compilations/Fifra/FIFRA.pdf.
US EPA (2004) Overview of the ecological risk assessment process in the Office of Pesticide Programs, U.S. Environmental Protection Agency. Endangered and threatened species effects determinations. Office of Prevention, Pesticides and Toxic Substances, Office of Pesticide Programs, Washington, D.C., January 23, 2004.
Williams, D.A. (1972) The comparison of several dose levels with a zero dose control. Biometrics 28:519-531.
Williams, T.D., Caunter, J.E., Lillicrap, A.D., Hutchinson, T.H., Gillings, E.G., Duffell, S. (2007) Evaluation of the reproductive effects of tamoxifen citrate in partial and full life-cycle studies using fathead minnows (Pimephales promelas). Environ. Toxicol. Chem. 26:695-707.