3. Statistical considerations

3.1 Outline
The statistical methods used to analyse results of regulatory ecotoxicology studies must be consistent with regulatory frameworks; they must also be statistically robust and maximize efficiency in terms of animal use, time and costs. National and regional risk assessment schemes have often been developed to balance these factors in a variety of ways. For example, in Europe the environmental hazard assessments of industrial chemicals or biocides focus on the calculation of predicted no effect concentration (PNEC) values. These are typically based on either acute effects concentration (ECx) type studies or chronic no observed effect concentration (NOEC) type studies, using safety factors as appropriate (EU 2003). For pesticides in Europe, either acute ECx type studies or chronic NOEC studies are used to derive toxicity exposure ratios that are employed for risk assessment (Directive 91/414; EU 1991). In the United States, the regulatory terminology implies the use of chronic NOEC data as a basis for the calculation of hazard quotients used in the risk assessment of pesticides (US EPA 2004). In contrast, chronic exposure studies in aquatic algae (e.g. OECD TG 201) are typically analysed by calculation of an EC10 and EC20, which are often used alongside aquatic fish studies in the assessment of risk. Also, some probabilistic risk assessment schemes might require information on the slope and confidence limits of the dose-response curve.
Regulatory needs, test designs and statistical methods cannot be considered independently, and the impact of a change in one of these factors on the others must be considered carefully. For example, in Chapter 4 it is stated that for endocrine screens (OECD TG 229 and OECD TG 230) “the definitive test exposes fish to a suitable range of concentrations maximizing the likelihood of observing the effect. The important distinction being that achieving a NOEC is not the purpose of the test”. This statement accurately reflects the original basis for the design of the test. Yet, as the basis for the original test design fades in memory, there appears to be a tendency to expect the calculation of both ECx and NOEC values based on the results of the screens. It cannot be assumed that the design of this test can support adequate estimates of EC10 and EC20 values without adequate studies of the accuracy and precision of those estimates. Alternative testing methods may have novel strengths and limitations in terms of statistical power compared to standard guidelines (e.g. testing based on the upper threshold concept). Statistical methods must be capable of detecting, or modelling, the smallest effects that are biologically meaningful (for discussion, see below). A key issue in the interpretation of fish toxicity tests, as they grow in complexity, is to distinguish between biologically important effects caused by the test chemical and statistically detectable differences. This aspect of ecotoxicology test guideline data interpretation is identical to the principles developed for many mammalian test guidelines in recent years (Länge et al. 2002, Williams et al. 2007). Against this background, a key element of this chapter is to illustrate key principles of data interpretation (e.g. importance of historical control values for endpoints of interest, adequate replication, etc.) that can be used as required for different fish test guidelines.
Toxicological endpoints should not be interpreted in isolation from other information relevant to the test. For example, it is usually assumed that the responses will follow an underlying monotone concentration-response pattern (i.e. there is a general tendency for the effect to increase as concentration increases) in the absence of compelling evidence to assume otherwise. Use of the knowledge that responses are likely to follow such a pattern can lead to better statistical tests and allow variations not related to treatment to be identified. For example, this assumption makes the Jonckheere-Terpstra trend test a powerful tool for the calculation of NOEC values and forms the entire basis for the calculation of ECx values.
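As an illustration of such a trend test, the following is a minimal sketch of the asymptotic Jonckheere-Terpstra statistic (with a simple tie adjustment in the statistic, no tie correction in the variance, and without the step-down procedure applied in practice); it is a teaching sketch under those assumptions, not a validated implementation.

```python
# Minimal sketch of the asymptotic Jonckheere-Terpstra trend test.
# Groups are ordered by increasing concentration; a large statistic
# relative to its null mean indicates an increasing trend.
from itertools import combinations
from statistics import NormalDist

def jonckheere_terpstra(groups):
    """Return (JT statistic, z score, one-sided upper-tail p value)."""
    jt = 0.0
    for gi, gj in combinations(groups, 2):     # all ordered group pairs i < j
        for x in gi:
            for y in gj:
                jt += (y > x) + 0.5 * (y == x)  # pairs supporting a trend
    sizes = [len(g) for g in groups]
    total = sum(sizes)
    mean = (total * total - sum(k * k for k in sizes)) / 4.0
    # Null variance without tie correction:
    var = (total * total * (2 * total + 3)
           - sum(k * k * (2 * k + 3) for k in sizes)) / 72.0
    z = (jt - mean) / var ** 0.5
    return jt, z, 1.0 - NormalDist().cdf(z)

# Perfectly ordered toy data: JT attains its maximum of 27.
jt, z, p = jonckheere_terpstra([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

For real analyses the step-down version described in OECD (2006) would be applied concentration by concentration.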
There is some controversy on the question of whether hypothesis testing (NOEC/LOEC [no/lowest observed effect concentration]) or regression (ECx) is the better way to evaluate toxicity data (e.g. Chapman et al. 1996, Dhaliwal et al. 1997). It is not the intention to replay that debate here. The intention of this chapter is to indicate how best to do each type of analysis and to indicate the types of data and experimental designs under which each type of analysis can be done with a reasonable expectation of useful results. Therefore, requirements for the different approaches are considered.
The OECD guidance document no. 54 (OECD 2006) describes current approaches to the statistical analysis of ecotoxicity data and should be consulted. However, the recent development of new fish test guidelines (e.g. draft Fish Sexual Development Test) has raised additional specific issues worthy of discussion in addition to some general considerations.
3.1.1 Biological versus statistical significance
The question of what magnitude of effect is biologically important to detect, or what effects concentration (ECx) to determine, is not a statistical issue. This issue is not unique to fish tests, but applies equally to other ecotoxicity test species such as Daphnia and algae. Scientific judgment, grounded in repeated observation of the same response in the same species under the same conditions (i.e. an understanding of historical control data), is required to specify this. Statistics provides a means to determine the magnitude of effect that a given experimental design can quantify. To put this another way, once an effect size of biological importance has been determined by subject matter scientists, it is possible to design an experiment that has a high likelihood of producing the desired information (i.e. whether an effect of the indicated size occurs at some test concentration, or what concentration produces the specified effect).
The relationship between biological significance and statistical significance can be understood in terms of the magnitude of effect that can be detected statistically. For a continuous response, such as growth or fecundity, this in turn depends on the relative magnitude of the between-replicate and within-replicate variances. The standard error of the sample control mean response is given by:

SE = SQRT[Var(Rep)/r + Var(ERR)/(r∙n)] = σ∙SQRT[R/r + 1/(r∙n)],
where σ is the within-replicate or error standard deviation, R is the ratio of the between-replicate variance to the within-replicate variance, r is the number of replicates in the control group, n is the number of fish per replicate, Var(Rep) is the between-replicate variance and Var(ERR) is the within-replicate or fish-to-fish variance.
The 95 % confidence interval for the mean is, approximately, Mean ± 2∙SE. It is often convenient to express 2∙SE as a percent of the mean, so the 95 % confidence interval for the mean can be expressed as Mean ± P %, where P = 200∙SE/Mean. The true mean is statistically indistinguishable from any value in this confidence interval for the sample mean. This means that the smallest treatment effect that can be distinguished statistically is P % of the control mean. This holds for both the NOEC and ECx approaches.
It is then incumbent on the study director to determine the magnitude of effect, Q, that is judged biologically important. For a given experimental design and endpoint, Q is compared to P: if Q > P, then the experiment is suited to the purpose; otherwise it is not.
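The comparison of Q and P can be computed directly from the standard error formula above. The function below is a small sketch; the argument names follow the symbols defined in the text, and the example values are hypothetical.

```python
# Smallest statistically distinguishable effect, P, as a percent of the
# control mean, using SE = sigma*sqrt(R/r + 1/(r*n)) from the text.
def min_detectable_pct(sigma, R, r, n, mean):
    se = sigma * (R / r + 1.0 / (r * n)) ** 0.5
    return 200.0 * se / mean

# Hypothetical illustration: sigma = 10, R = 0.5, 4 replicates of 8 fish,
# control mean 100 gives P of roughly 7.9 %, so only effects Q of about
# 8 % or more could be addressed by such a design.
p_pct = min_detectable_pct(10.0, 0.5, 4, 8, 100.0)
```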
For example, in a recent fish full life-cycle (FFLC) test, the control mean for percent male offspring was 69 %, with a standard error of the mean of 19 %. Thus, the smallest effect that can be found statistically significant is 19 %, and ECx for x < 19 cannot be reliably estimated. Another way of considering this is to observe that the lower bound of the 95 % confidence interval for EC10 and EC20 is 0. Also, the NOEC is a concentration at which a > 19 % effect was observed. (NOTE: The confidence interval for the difference between the control mean and a treatment mean is actually greater than 19 % by a factor of √2.) Vitellogenin (VTG) is another highly variable response, and only large effects, around 40 %, can be expected to be statistically significant in a practical experiment. Equivalently, EC40 might reasonably be estimated, but not EC25.
At the other extreme, the standard error of a growth measurement for Daphnia is often 2-3 % of the control mean, so very small effects can be found statistically significant. For this response, it is quite feasible to estimate EC5. It is a matter of scientific (non-statistical) judgment whether such small effects are biologically meaningful. Similar findings hold for avian eggshell thickness measurements.
From the formula for the standard error (SE), it will be evident that there is a trade-off between the number of fish per replicate and the number of replicates per control or test concentration. For example, from the formula, it is evident that if the number of replicates is doubled and the number of fish per replicate is reduced by 50 %, then the second term in brackets is unchanged, but the first term is reduced by 50 %. This might indicate a preference for more replicates of fewer subjects per replicate. However, if Var(Rep) is already relatively small, not much is gained by such an approach. Instead, if we cut the number of replicates by 50 % but increase the number of subjects per replicate by a factor of 4, then the first term remains small but the second term is reduced by 50 %. Thus, whether it is better to have a few replicates with many subjects in each, or many replicates with few subjects in each, depends on the relative magnitude of the two variances. A general rule is that if the ratio R of variances exceeds 0.5, then more emphasis is given to the number of replicates; otherwise, more emphasis is given to the number of subjects within each replicate. For example, shoot height of some emergent crops (e.g. oat, tomato, rape) tends to have variance ratios exceeding 0.5 (John Green, pers. comm.).
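The arithmetic behind this trade-off can be checked directly. The sketch below evaluates the two terms of SE² for the design variants described above, using illustrative (not measured) values of σ and R.

```python
# Between-replicate and within-replicate terms of SE^2 for a design with
# r replicates of n fish; sigma and R follow the notation in the text.
def se_terms(sigma, R, r, n):
    return (sigma ** 2 * R / r, sigma ** 2 / (r * n))

base      = se_terms(1.0, 0.5, r=2, n=8)   # reference design
more_reps = se_terms(1.0, 0.5, r=4, n=4)   # double replicates, halve fish
more_fish = se_terms(1.0, 0.5, r=1, n=32)  # halve replicates, 4x fish
```

With the replicate-heavy variant the first (between-replicate) term halves and the second is unchanged; with the fish-heavy variant the second term halves at the cost of doubling the first, exactly as the text describes.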
As a further illustration, recent experiments conducted for the OECD found VTG measurements to be very highly variable, with the within-replicate variance ranging between two and ten times the between-replicate variance. Thus, good experimental design would call for relatively few replicates with numerous fish in each. In the case of medaka (Oryzias latipes), it was found that a control and three test concentrations, with two replicates per control and test concentration and five fish per replicate, were adequate to give 80 % power to detect a 60 % effect. For fathead minnow (Pimephales promelas), four replicates of four fish each in the control and each test concentration were required to give 80 % power to detect a 94 % effect. While these effect sizes may seem large and the numbers of fish small, there were practical constraints on the number of replicates and fish that could be accommodated. Furthermore, effects of this size were observed at the high test concentrations in these validation experiments.
A complication arises when there are multiple endpoints to be analysed in a single experiment. If the experimental design is optimal for one response, it may be sub-optimal for another. This may mean that only very large effects can be estimated or detected statistically for one endpoint, while very small, biologically unimportant effects may be found statistically significant for another. It is important to understand this in interpreting the data: a biologically important effect may be missed in the first instance, which should not be interpreted to mean the chemical in question has no effect on that response, while a sound study might be rejected because of a tiny effect found statistically significant. Any statistical result should be interpreted in light of the biologically important effects determined before the experiment was conducted.
To design an experiment with a high likelihood of detecting a P % effect (or estimating a meaningful ECP), it suffices to find r and n so that P = 200∙SE/Mean. It is a simple matter to construct a table showing 200∙SE/Mean for various values of r and n, based on historical control estimates of the two variances, and to seek the most practical combination to define the design.
3.2 NOEC/LOEC
For the purpose of determining an NOEC or an LOEC, it is important to design the experiment so as to have a reasonable chance of finding a biologically relevant effect statistically significant, while minimizing the chance of finding biologically irrelevant effects statistically significant. These two objectives are somewhat incompatible, and judgment will be useful in reaching appropriate regulatory conclusions. A fish study will have a water control group (dilution water control), a solvent control if a solvent is used, and at least one test concentration. Unless the design is for a limit test, there will usually be three or more test concentrations approximately equally spaced on a log scale. With very few exceptions, the control(s) and test concentrations should be replicated. Replicate here refers to the test vessel, not to individual animals, unless they are housed individually in a test vessel. The trade-off between the number of fish per replicate and the number of replicates per test concentration and control will vary according to the response and species, and a power calculation may be needed to determine the best design. Since multiple responses are usually tested from the same experiment, it will often not be possible to design an experiment that is optimal for all responses. Judgment is needed to decide on the most important response(s), and experiments should be designed to provide adequate power (75-80 %) to detect biologically relevant changes in those responses. Power simply refers to the probability of finding statistically significant an effect of a given true magnitude, taking into account variability in the response of interest and variability arising from sampling. It is also important to quantify the effect size likely to be found significant in all other responses. This may indicate a need to rethink the objectives of the experiment. There should be no surprises at the end of the study about what can be analysed, by what test, and with what ability to detect effects.
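As a rough planning aid, the power of a single one-sided pairwise comparison can be approximated with the normal distribution; this sketch ignores multiplicity adjustment and the trend structure, and reuses the SE formula given earlier, so it is indicative only.

```python
# Approximate power of a one-sided z-type comparison of one treatment
# group against the control; delta is the true mean difference.
from statistics import NormalDist

def approx_power(delta, sigma, r, n, R=0.0, alpha=0.05):
    se_group = sigma * (R / r + 1.0 / (r * n)) ** 0.5
    se_diff = (2.0 ** 0.5) * se_group      # SE of the difference in means
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    return NormalDist().cdf(delta / se_diff - z_crit)

# With alpha = 0.025 (a two-sided 5 % test), an effect of about 2.8
# standard errors of the difference yields roughly 80 % power.
power_rule_of_thumb = approx_power(2.8 * 2 ** 0.5, 1.0, 1, 1, alpha=0.025)
```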
Most responses are analysed using 2-sided tests, unless there is a clear biological reason to expect, or be concerned only with, changes in one direction (e.g. an increase). Furthermore, for most responses there is an expectation that the concentration-response is approximately monotone. The effect may be measured as an increase or a decrease in some measurement (e.g. weight might decrease, mortality might increase). That being true, a test that is designed for a monotone trend is more powerful than one that simply compares each test group to the control independently of effects in other test groups. Thus, there is a preference for a step-down Williams or Jonckheere-Terpstra test over a Dunnett, Dunn, Mann-Whitney or other pairwise test, provided the data are consistent with a monotone concentration-response. All tests referenced in this chapter are discussed in detail in OECD (2006).
All statistical procedures are based on some data requirements. In addition to the monotonicity requirement for Williams and Jonckheere-Terpstra for continuous responses and Cochran-Armitage for quantal responses, there are additional requirements. The Williams and Dunnett tests require normally distributed data with homogeneous variances. While these tests are robust against mild violations of these requirements, they are not impervious, and some checking of these requirements is appropriate. A visual check from a scatter plot may be sufficient to assess monotonicity and variance homogeneity, and even normality. There are also formal tests for all three, and OECD (2006) provides details.
Where normality or variance homogeneity are violated, a transformation of the data to achieve these requirements can be sought, or non-parametric methods employed, which have fewer requirements or are much less sensitive to violations. Be mindful that different agencies may have different requirements on how, or whether, data should be transformed. Contrary to widely held opinion, non-parametric tests are not always inferior to parametric tests. For example, the power properties of the step-down Jonckheere-Terpstra test are very similar to those of the step-down Williams test when the data are normally distributed with homogeneous variances, and are superior to Williams when those conditions are violated. On the other hand, for datasets with few replicates, the power properties of the Mann-Whitney and Dunn tests are worse, sometimes much worse, than those of Dunnett’s test. Fig. 3.1 indicates a typical comparison of these tests.
Fig. 3.1 shows the power of seven tests for an experiment with three positive test concentrations and a single control. The horizontal axis shows the percent change from the control, and the vertical axis shows the probability that an effect of that size will be found statistically significant. The red curve with diamonds is the step-down Jonckheere-Terpstra (JT) test (standard asymptotic version), orange with triangles is the exact permutation version of JT, the dark black curve with asterisks is the Williams test, blue with circles is Dunnett’s test, cyan with asterisks and green with squares are the standard (asymptotic) and exact permutation versions of the Mann-Whitney (also known as the Wilcoxon) test, and grey dots mark Dunn’s test. In the top row, the design called for two replicates of eight fish in the control and each test concentration, while the bottom row is for two replicates of two fish each. The left column is for a design following the square-root allocation rule (see below), and the right column is for a design with equal replication in the control and all treatment groups. On the left, the grey Dunn power curve is hidden by the green and cyan Mann-Whitney curves. The data generated were normally distributed with homogeneous variances, with variability as observed for VTG in some OECD validation experiments for fish endocrine screening studies. It is clear that the power of the Jonckheere test is greater than that of the Williams test on the left, whereas the Williams test sometimes has slightly greater power on the right, and both tests exceed in power all the pairwise tests (Dunnett, Dunn, Mann-Whitney). A striking feature of these results is that Mann-Whitney has zero power to detect effects, regardless of magnitude, in either design, whereas Dunn’s has zero power under the square-root rule and low power under equal allocation. This knowledge is clearly important in deciding on design and test selection.
3.3 ECx
Standard regression models also depend on data meeting certain requirements. Among these, a key prerequisite is that the observations are mutually independent. This requirement is violated, for example, if all responses are divided by the control mean in an attempt to “normalize” the data or reduce variability. While it is not impossible to model correlated responses, specialized models are required to do so. For a continuous response (e.g. where the proportion of males is analysed as a continuous response), the data are assumed to be normally distributed with homogeneous variances. There are modifications to accommodate heterogeneous variances, such as alternative variance-covariance structures or weighting. It is also possible to accommodate some types of non-normality.
Fig. 3.1: Power of various tests to detect a VTG effect at dose 3, from a fish experiment with a control and three positive doses. Each of the four panels plots power (0-100 %) against percent change from control (log-transformed, 0-200 %): the top row uses two replicates of eight fish, the bottom row two replicates of two fish; the left column follows the square-root allocation rule, the right column uses equal allocation.
It is recognized that regression is robust against mild deviations from normality and variance homogeneity, but it can be adversely affected by serious violations of these requirements. Thus, some checking of the distribution of responses is in order, either visually from a scatter plot or through formal testing. For quantal data, normality is not required, but typically the data are assumed to follow a binomial distribution within a given treatment group. Quantal data should be checked for extra-binomial variance (more variation than can be accounted for by the simple binomial distribution), the quantal analogue of the homogeneous variance requirement for continuous responses. If extra-binomial variance is observed, there are statistical test methods which take this into account (e.g. Rao-Scott, as described in OECD 2006). Finally, attention should be paid to the goodness-of-fit of the model to the data. There are several ways to assess goodness-of-fit. Among the simplest are (1) visual comparison of the responses predicted by the model to the observed responses, and (2), where replicates are available, comparison of the residual mean square from the model against the pure error mean square: if the residual mean square is significantly larger than the pure error mean square, then the model does not fit the data well, although with the small datasets typical in this field this may not be a powerful test. (3) Confidence bounds on the model predictions are very important to show whether the model predictions have any meaning. It should be understood that typical confidence bounds do not capture model uncertainty, since they are constructed on the assumption that the model is correct; this is one reason for conducting other goodness-of-fit assessments such as items (1) and (2). If no confidence bounds can be computed, or they are very wide, then predictions from the model are scientifically unreliable, regardless of how well the model appears to fit the data on visual inspection. It is also possible to compute an R-square value to judge the proportion of the total sum of squares that is accounted for by the model. While R-square is a useful measure for linear models, it is an unreliable guide for the non-linear models most often used to model ecotoxicity responses. For comparing two models for the same data, Akaike’s information criterion (AIC) can be useful.
In more general terms, a search of OECD TGs 204, 210, 212, 215, 229 and 230 was made to determine the procedures specified for ECx calculation. Only OECD TGs 212 and 215 describe ECx procedures. OECD TG 212 describes a normalization procedure, but does not specify fitting a regression curve to the normalized percentages from which to estimate ECx. In contrast, OECD TG 215 describes two test designs, one for ECx and one for NOEC determination, acknowledging the necessary differences: “a design which is optimal (makes best use of resources) for use with one method of statistical analysis is not necessarily optimal for another. The recommended design for the estimation of a LOEC/NOEC would not therefore be the same as that recommended for analysis by regression.”
If ECx designs are to be more widely used, consideration of the following prerequisites will be necessary:
- General guidance should be given on how the need to estimate ECx affects the optimum spacing of concentrations, number of treatments, and number of replicates to be used. This guidance will probably suggest test designs quite different from the minimum designs currently described in the OECD TGs optimized for NOEC determination.
- Different endpoints may elicit responses at very different concentration levels; a strategy is therefore required to handle this within one test.
- What constitutes a meaningful ECx estimate, and what is the implication if a meaningful estimate cannot be obtained for one or all endpoints?
Validation of the draft Fish Sexual Development Test raised particular concerns about the use of regression analysis for the determination of effects on sex ratio. These issues are explored in detail in Appendix 1. From these discussions, it is clear that the ECx approach is inappropriate for the sex ratio endpoint and a NOEC design should be carried forward.
3.4 NOEC versus ECx designs
It is stated in OECD (2008) that: “[92.] In summary, ANOVA designs for fish testing appear inferior to regression designs, and the latter are considered to show more promise for fish life-cycle tests given the generally large inherent variability in egg production (fecundity) between individuals, which inevitably reduces the power of the ANOVA approach. Final decisions on which design strategy to use should be made on a case-by-case basis, taking into account factors such as the known variability in reproductive output of the species in question.”
The example datasets and analyses discussed in Chapter 3.x (Appendix) should serve as a caution against overstating the advantages of regression over what is referred to as the ANOVA approach. While regression has always been an important tool for statisticians, it is not appropriate for some datasets and can suggest a level of precision that is not supported by the data. Regression analysis calls for different experimental designs than those for which ANOVA methods are intended. Just as ANOVA methods call for designs with adequate power to detect biologically relevant adverse effects, regression methods call for designs that are capable of providing reliable or meaningful estimates of an x % effects concentration, and this requires designing around the specific x or percent effect to be estimated. Basic requirements include the following: (1) There should be test concentrations on both sides of ECx; the zero-concentration control does not figure into this requirement (but see the following paragraph). (2) If the 95 % confidence interval for the control response is of the form Mean ± P %, then estimates of ECx are meaningful only for x > P. For example, if the control mean is estimated only with 20 % error, then it is meaningless to estimate EC10. (3) If the confidence interval for ECx is very wide, perhaps spanning several test concentrations, then there can be little or no confidence in the ECx estimate. These basic requirements are important to keep in mind because, once a regression model is fit to the data, it is a simple mathematical exercise to use the resulting equation to estimate ECx for any percent x, and yet not all such values of x lead to plausible or meaningful estimates. A mathematical equation is not a substitute for valid interpretation of data. This is akin to the requirement in the NOEC approach of adequate power to detect an effect of a magnitude deemed biologically important.
It can be appropriate to determine both an NOEC (provided there is sufficient power to detect biologically relevant effects) and an ECx (provided x is beyond the range of control variability). Ideally, x should lie between two tested concentrations. It is permissible to extrapolate modestly beyond the range of tested concentrations, provided this does not violate restriction (2) in the previous paragraph. However, such extrapolation necessarily comes with increased uncertainty, needs to be justified, and assumes that the model fit is valid beyond the range of tested concentrations, something that is untestable from the data. The uncertainty increases the further one extrapolates beyond the experimental range.
Although it is not generally recommended to combine the NOEC and ECx approaches in the same study, there may be compelling reasons to do so. For certain existing regulatory frameworks, it might be appropriate to focus on NOEC test designs for fish chronic endpoints (e.g. FIFRA: US EPA 1996), whereas future regulatory frameworks could require both ECx and NOEC determinations in fish chronic studies (e.g. draft Sanco 2010 review document). The latter, however, has serious implications for experimental design, time and costs, as well as for ethical and statistical interpretation. It might not be practical to design tests with multiple endpoints to determine both the NOEC and ECx values for all endpoints of interest.
3.5 Alternate designs (e.g. square-root allocation rule)
Several factors affect the power of a given test: the experimental design (e.g. number of replicates per control and treatment group, number of fish per replicate, number of treatment groups), the shape of the concentration-response, and the inherent variability of the response of interest. One simple but important decision is whether the control and treatment groups should be equally replicated or whether more replicates should be allocated to the control. The argument for the latter is two-fold: first, it gives a better measure of the undisturbed population against which all treatment groups are compared, and, second, it tends to increase the power of the test, in part by increasing the degrees of freedom for the test statistics.
Dunnett (1955) showed that the power of his test is optimized using what is called the square-root allocation rule, which provides a specific formula for the number of replicates in the control and all treatment groups. Details are given in OECD (2006). Further published theoretical work and extensive power simulation studies have shown that this same rule (or a minor modification) also maximizes the power of the Williams and Jonckheere tests and usually increases the power of the Mann-Whitney and Dunn tests.
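In its commonly quoted form (an assumption here; see OECD 2006 for the exact formula and modifications), the rule allocates to the control about √k times the replicates of each of the k treatment groups:

```python
# Square-root allocation sketch: with k treatment groups of r replicates
# each, the control gets approximately r * sqrt(k) replicates.
import math

def control_replicates(r_treatment, k_groups):
    return round(r_treatment * math.sqrt(k_groups))
```

For instance, with four treatment groups of three replicates each, the control would receive about six replicates.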
3.6 Solvent/carrier control
One of the issues brought up in the first version of the OECD Fish Sexual Development Test Re-5
view, and which is a consideration on all of the test guidelines, is how the two controls (dilution 6
water and solvent controls) should be used when there is a solvent used in the treatment groups. 7
There are advantages to pooling the two controls to test for treatment effects: (1) By doubling the 8
number of control replicates, the power of the tests for treatment effects is increased, achieving 9
at least part of the advantages of the square-root allocation rule described above. (2) All the data 10
are used and the pooled control provides the best estimate of the background population from the 11
experiment. Permissible solvents are those which have been well-established in fish experiments 12
and have been found to have no practical effect on fish at the concentrations used. A preferable 13
alternative to always pooling the controls is to compare them statistically and pool them, if no 14
significant difference is found, and otherwise use only the solvent control to test for treatment 15
effects. The justification for the latter is that solvent is in all the treatment groups at approxi-16
mately the same concentration as in the solvent control, so that one compares solvent plus treat-17
ment to solvent, the difference being the treatment effect. This is a plausible hypothesis based on 18
the apparent additivity of effects in most aquatic chemical mixtures that is supported by concen-19
tration addition. References on this include Belden et al. (2007), Backhaus et al. (2010), as well 20
as Kortenkamp et al. (2007). The last communication addresses endocrine disrupting chemicals 21
specifically as well as other classes of chemicals. 22
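The compare-then-pool strategy described above can be automated. The sketch below is our own minimal illustration, not a prescribed procedure: it assumes approximately normal replicate (tank) means, uses a two-sample t-test at a conventional α = 0.05, and the function name and example data are hypothetical.

```python
from scipy import stats

def choose_control(dilution_water, solvent, alpha=0.05):
    """Compare the two controls; pool them if they do not differ
    significantly, otherwise use the solvent control alone."""
    # Two-sample t-test on replicate (tank) means of the two controls
    t_stat, p_value = stats.ttest_ind(dilution_water, solvent)
    if p_value >= alpha:
        # No evidence of a solvent effect: pool for more power
        return list(dilution_water) + list(solvent), "pooled"
    # Solvent effect suspected: compare treatments to the solvent control only
    return list(solvent), "solvent only"

# Hypothetical tank-mean data for the two controls
water = [10.2, 9.8, 10.5, 10.1]
solv = [10.0, 10.3, 9.9, 10.4]
control, label = choose_control(water, solv)
```

Note that failing to reject here is weak evidence of "no solvent effect", which is one reason the pooling question remains a judgment call.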
Currently, there is a lack of harmonisation amongst regulatory authorities on what the control for statistical analysis should be (dilution water control or solvent control, and whether they should be pooled or not). A definitive answer to this question cannot be provided at present, but it has been recommended that a working group be established to progress this issue. A topic that might be addressed by such a working group is the reduction in the number of animals that could arise from eliminating one of the controls.

3.7 Power
In the design stage, the primary use of power analysis in toxicity studies is to demonstrate adequate power to detect effects that are large enough to be deemed important. If our methods have sufficient power and we find no statistically significant effect at a given concentration, we can have some confidence that there is no effect of concern at that concentration. Failure to achieve adequate power can result in large effects being found statistically insignificant. On the other hand, a test can also be so powerful that it finds statistically significant effects of little practical importance.
Deciding what effect size is large enough to be important is difficult. In some cases, the effect size may be selected by regulatory agencies or specified in guidelines.
For design purposes, the background variance can be taken to be the pooled within-experiment variance from a moving frame of reference covering a sufficiently long period of historical control data with the same species and experimental conditions. The time-window covered by the moving frame of reference should be long enough to average out noise without being so long that undetected experimental drift is reflected in the current average. If available, a three-to-five year moving frame of reference might be appropriate. When experiments must be designed using more limited information on variance, it may be prudent to assume a slightly higher value than has been observed. Power calculations used in designs for quantal endpoints must take the expected background incidence rate for the given endpoint into account, as both the Fisher exact and Cochran-Armitage tests are sensitive to this background rate, with the highest power achieved at a zero background incidence rate. The background incidence rate can be taken to be the incidence rate in the same moving frame of reference already mentioned.
At the design stage, power must, of necessity, be based on historical control data for initial variance estimates. It may also be worthwhile to do a post-hoc power analysis to determine whether the actual experiment is consistent with the criteria used at the design stage. If the observed variance is significantly higher (e.g. based on a chi-square or F-test) than that used in planning, then the assumptions made at design time may need to be reassessed. Care must be taken in evaluating post-hoc power against design power: experiment-to-experiment variation is expected, and variance estimates are more variable than means. The power determination based on historical control data for the species and endpoint being studied should be reported.

Power Example
Suppose we want to determine the NOEC for mortality in an experiment with rainbow trout (Oncorhynchus mykiss), where past experience with this species suggests that the background mortality rate at the relevant age and test duration is near zero. We want to be able to detect a 20 % mortality rate and, based on preliminary range-finding experiments, we have decided on an experiment with five test substance concentrations at 50, 100, 200, 400, and 800 ppm, plus a single (non-solvent) control. Furthermore, suppose previous experience suggests that extra-binomial variance and within-tank correlations of responses are unlikely, so a standard Cochran-Armitage test can be done treating all fish within a concentration equally (i.e. ignoring any tank or replicate effect). How many fish per concentration should we plan?

First, consider designs with the same number, n, of fish in each concentration as in the control. The power of the Cochran-Armitage test depends on the shape of the concentration-response curve, which we do not know. Powers have been simulated for numerous shapes. Based on an examination of the various power plots, a reasonable choice for design purposes is the linear concentration-response shape. In addition, the power depends on the threshold of toxicity; for design purposes, we will assume that it is zero. The following plots will help (Figs. 3.2 and 3.3).

Fig. 3.2: Power versus maximum change of 100∙Delta % for n = 5 fish per concentration.
Fig. 3.2 shows that 5 fish per concentration would give very low power (about 25 %) to detect a 20 % change in the high concentration. There is little point in conducting the experiment for this purpose.
Consider a design with 20 fish per concentration: this sample size gives a power of 82 % to detect a 20 % mortality rate in the 800 ppm concentration (Fig. 3.3), which may well be adequate. What is the power to detect a 20 % mortality rate at lower concentrations? Fortunately, we do not lose much power as we step down: the power to detect a 20 % mortality rate at 400 ppm is 80 %, at 200 ppm it is 78 %, and at 100 ppm it is 76 %. Notice, however, that if the background incidence rate were 10 %, the power to detect an increase in mortality rate of 20 % would drop to around 40 %, which would be inadequate for most purposes.
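Power figures of this kind can be reproduced by simulation. The sketch below is our own illustration (not the simulation used to produce Figs. 3.2 and 3.3): it assumes a linear concentration-response rising to 20 % mortality at the top concentration, zero background, 20 fish per group, equally spaced scores, and a one-sided Cochran-Armitage trend test at α = 0.05.

```python
import math
import random

def cochran_armitage_z(counts, n_per_group, scores):
    """One-sided Cochran-Armitage trend statistic for quantal data."""
    N = n_per_group * len(counts)
    p_bar = sum(counts) / N
    if p_bar in (0.0, 1.0):  # no variation: no trend detectable
        return 0.0
    s_bar = sum(scores) / len(scores)
    num = sum((x - n_per_group * p_bar) * d for x, d in zip(counts, scores))
    var = p_bar * (1 - p_bar) * n_per_group * sum((d - s_bar) ** 2 for d in scores)
    return num / math.sqrt(var)

def simulated_power(probs, n_per_group=20, n_sim=5000, z_crit=1.645, seed=1):
    """Fraction of simulated experiments in which the trend test rejects."""
    rng = random.Random(seed)
    scores = list(range(len(probs)))
    hits = 0
    for _ in range(n_sim):
        counts = [sum(rng.random() < p for _ in range(n_per_group)) for p in probs]
        if cochran_armitage_z(counts, n_per_group, scores) > z_crit:
            hits += 1
    return hits / n_sim

# Control plus five concentrations, mortality rising linearly to 20 %
probs = [0.0, 0.04, 0.08, 0.12, 0.16, 0.20]
power = simulated_power(probs)
```

With these assumptions the simulated power comes out close to the 82 % quoted above, though the exact value depends on the scoring and on whether a step-down procedure is applied.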

Fig. 3.3: Power versus maximum change of 100∙Delta % for n = 20 fish per concentration.

3.8 Replicates
Decisions on the number of fish per tank and the number of tanks per group should be based on power calculations using historical control data to estimate the relative magnitudes of within- and among-tank variation and correlation. If there is only one tank per test concentration, then there is no way to distinguish housing effects from concentration effects; neither between- nor within-group variances or correlations can be estimated, nor is it possible to apply any of the statistical tests described for continuous responses to tank means. Thus, a minimum of two tanks per concentration is recommended; three tanks are much better than two, and four are better than three. Some non-parametric tests (though not the Mann-Whitney) require a minimum of four replicates. The improvement in modelling falls off substantially as the number of tanks increases beyond four. (This can be understood on the following grounds: the modelling is improved if we get better estimates of both among- and within-tank variances. The quality of a variance estimate improves as the number of observations on which it is based increases. Either sample variance will have, at least approximately, a chi-squared distribution. The quality of a variance estimate can be measured by the width of its confidence interval, and a look at a chi-squared table will verify the statements made.) For further discussion and other tests (parametric and non-parametric), see OECD (2006).
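The chi-squared argument can be made concrete: the width of a confidence interval for a variance shrinks rapidly as replicate tanks are added, but with sharply diminishing returns beyond about four. A small sketch of our own, using the usual chi-squared interval for a normal-theory variance:

```python
from scipy.stats import chi2

def variance_ci_ratio(n_tanks, alpha=0.05):
    """Ratio of upper to lower 95 % confidence limit for a variance
    estimated from n_tanks replicate tanks (df = n_tanks - 1)."""
    df = n_tanks - 1
    # CI for sigma^2 is [df*s2/chi2_upper, df*s2/chi2_lower]; the ratio
    # of the two limits does not depend on the observed s2
    return chi2.ppf(1 - alpha / 2, df) / chi2.ppf(alpha / 2, df)

# Interval-width ratio for 2 to 6 tanks per group
ratios = {k: variance_ci_ratio(k) for k in (2, 3, 4, 5, 6)}
```

The ratio drops by orders of magnitude between two and four tanks, and only modestly thereafter, which is the point made in the parenthetical remark above.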
The number of tanks per concentration and the number of fish per tank should be chosen to provide adequate power to detect an effect of a magnitude judged important. This power determination should be based on historical control data for the species and endpoint being studied.
Since the control group is used in every comparison of treatment to control, consider allocating more fish to the control group than to the treatment groups in order to optimize power for a given total number of fish. The optimum allocation depends on the statistical test to be used. A widely used allocation rule was given by Dunnett (1955): for a total of N fish and k treatments to be compared to a common control, if the same number, n, of fish is allocated to every treatment group, then the number, n0, to allocate to the control so as to optimize power is determined by the so-called square-root rule. By this rule, the value of n is (the integer part of) the solution of the equation N = kn + n√k, and n0 = N - kn. [It is almost equivalent to say n0 = n√k.] This allocation has been shown to optimise power for Dunnett's test. It is used, often without formal justification, for other pairwise tests, such as the Mann-Whitney and Fisher exact tests. Williams (1972) showed that the square-root rule may be somewhat sub-optimal for his test, and that optimum power is achieved when k in the above equation is replaced by a value between 1.1k and 1.4k. Computer simulations show that for the step-down Jonckheere-Terpstra and Cochran-Armitage tests, power gains of up to 25 % can be realized under the square-root rule compared with equal sample sizes.
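The rule can be applied mechanically; the helper below is our own small sketch (the function name is hypothetical):

```python
import math

def square_root_allocation(N, k):
    """Square-root rule: n fish per treatment group, where n is the
    integer part of N / (k + sqrt(k)), and n0 = N - k*n in the control."""
    n = int(N / (k + math.sqrt(k)))
    n0 = N - k * n
    return n, n0

# The worked power example in the text: 120 fish, 5 treatment groups
n, n0 = square_root_allocation(120, 5)  # gives n = 16, n0 = 40
```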

Power example, continued
What if we used the square-root rule in the above power example? Based on the above, we will examine the case where a total of 120 fish is used (20 per concentration and control in the above design). Under the square-root rule, we solve 120 = 5n + n√5 for n to get n = 16. Then n0 = 120 - 5 ∙ 16 = 40. The following power plot is based on this allocation (Fig. 3.4). Note that the power to detect a 20 % increase in mortality rate in the 800 ppm group is now 92 %. So, with the same number and spacing of concentrations and the same total number of fish, the power to detect a 20 % increase in mortality rate has increased from 82 % to 92 % by using the square-root allocation rule instead of equal sample sizes. An alternative way to use the square-root rule would be to reduce the total number of fish required without loss of power. Indeed, power curves for a nominal sample size of n = 15 under the square-root rule show that the power to detect a 20 % increase in mortality is 86 %. Thus, with a smaller total number of fish allocated optimally, the power to detect a 20 % increase is actually increased. This result underscores the importance of good experimental design and test selection.

Fig. 3.4: For details, see text.

In experiments where two controls (dilution water and solvent controls) are used and the controls are combined for further testing, a doubling of the control sample size is already achieved. Since experience suggests that most experiments will find no significant difference between the two controls, the optimum strategy for allocating fish is not necessarily immediately clear. This would of course not be a consideration if a practice of pooling controls is not followed.
The reported power increases from the square-root rule do not consider the effect of any increase in variance as concentration increases. One alternative is to add additional fish to the control group without subtracting from the treatment groups. There are practical reasons for considering this, since a study is much more likely to be considered invalid when there is loss of information in the controls than in the treatment groups.
The square-root allocation rule holds little, if any, advantage for regression analysis, because the curve fitting is done only once, using all the data; there is no special consideration or use of the controls.

3.9 Appendix 1: Detailed consideration of regression analysis for sex ratio endpoints
3.9.1 Comparison of alternative models
When using regression models based on treating the proportion male or female as a continuous response, there will be a need to select one model from among several candidate models fit to the data. There are both formal and informal selection criteria appropriate for the class of models that will be used for sex ratio data. The simplest approach to comparing models, visual inspection of the fits to the data, should be used in all cases, even when formal tools will also be applied. It is important to identify regions of concentrations where each model provides a poor fit. Next, the widths of the confidence bounds about the fitted curves should be compared. Generally, a model that gives narrower confidence bounds is preferred, but this does not outweigh the fit of the model to the data; there are situations where narrow confidence bounds are obtained for a model that clearly does not fit the data. Next, confidence bounds for all estimated parameters should be examined. If the confidence interval for a model parameter contains zero, then the model is suspect, as that parameter is evidently not required. Beyond that, preference is given to the model where the confidence intervals for the parameters are smallest. Where replicate data are available, the residual mean square from the model should be compared to the pure error mean square, which can be obtained from an ANOVA. Finally, Akaike's or Schwarz's information criterion can be used. The preferred form of Akaike's information criterion is given by the formula below. A good discussion of AICc is given in Motulsky and Christopoulos (2004).

AICc = n ∙ Ln(RSS/n) + 2kn/(n - k - 1),
where RSS is the residual sum of squares from the model, n is the total number of observations, and k is the number of parameters estimated for the model.
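In code the criterion is a one-liner; the sketch below simply implements the formula above (the example values of RSS, n and k are arbitrary):

```python
import math

def aicc(rss, n, k):
    """Corrected Akaike information criterion for a least-squares fit:
    AICc = n*ln(RSS/n) + 2*k*n/(n - k - 1)."""
    return n * math.log(rss / n) + 2 * k * n / (n - k - 1)

# Example: RSS = 10 from n = 20 observations with k = 3 fitted parameters
value = aicc(rss=10.0, n=20, k=3)  # about -6.36
```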
In general, the model with the smaller AICc is preferred. If the values of AICc are close for two models, it is helpful to compute the probability that model A is better than model B using the following formula:

Prob = e^(D/2) / (1 + e^(D/2)), where D = AICc(B) - AICc(A).

Here AICc(B) denotes the value of AICc for model B. If the probability is high, then model A is favoured. The following plot of these probabilities may be helpful (Fig. 3.5).
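The probability calculation can be sketched directly (the function name is our own):

```python
import math

def prob_a_better(aicc_a, aicc_b):
    """Probability that model A is the better model, based on the
    AICc difference D = AICc(B) - AICc(A)."""
    d = aicc_b - aicc_a
    return math.exp(d / 2) / (1 + math.exp(d / 2))

# A difference of 10 in favour of model A
p = prob_a_better(100.0, 110.0)  # about 0.993: A is almost certainly better
```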
Fig. 3.5: Probability of correct model selection
It will be observed that if AICc(B) - AICc(A) ≥ 10, model A is almost certainly better than model B. This criterion is limited, however, where weighted fits are used, as two models can be compared with this criterion only if they use the same weights. So, in comparing two unweighted model fits (i.e. weight = 1), the criterion is sound. For weighted models where the weights depend on the function being estimated, the criterion is not appropriate, and comparing an unweighted to a weighted model fit is certainly not appropriate. Some discussion of weights in using AICc is given in http://www.boomer.org/Manual/ch05.html and http://www.micromath.com/products.php?p=scientist&m=statistical_analysis.

3.9.2 The meaning of an x % effect
The third issue concerns the meaning of an x % effect. For incidence data (such as percent males), there are three distinct concepts that are sometimes confused:
Absolute risk is when x % of the population is affected.
Additional risk is when x % above the "background" is affected, so that if the background incidence rate is c %, then the total risk is (x + c) %.
Relative risk is when x % of the population that would "normally" not be affected is affected, so that the total risk is c % + (1 - c/100) ∙ x %.

To illustrate the difference, consider the meaning of EC50 when the background incidence is 20 %, i.e. C = 0.2 (background incidence rate).
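The three definitions give different total incidences for the same nominal x = 50 % effect over a C = 0.2 background, as the figures below illustrate. A direct calculation (our own sketch, with proportions on a 0-1 scale):

```python
def total_incidence(x, c, kind):
    """Total affected fraction for an effect x over background c,
    under the three risk definitions in the text."""
    if kind == "absolute":
        return x                    # x of the whole population
    if kind == "additional":
        return c + x                # x on top of the background
    if kind == "relative":
        return c + (1 - c) * x      # x of the normally unaffected part
    raise ValueError(kind)

x, c = 0.5, 0.2
totals = {k: total_incidence(x, c, k)
          for k in ("absolute", "additional", "relative")}
# absolute: 0.5, additional: 0.7, relative: 0.6
```

These are exactly the 50 %, 70 % and 60 % totals shown in Figs. 3.6 to 3.8.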
Fig. 3.6: Absolute risk: 50 % of the population is affected (only 30 % above background)

Fig. 3.7: Added risk: 70 % is affected (total = 20 % + 50 %)

Fig. 3.8: Extra (or relative) risk: 60 % is affected (total = 20 % + 50 ∙ (1 - 0.2) % = 20 % + 40 % = 60 %)

Probit analysis of incidence data is based on the concept of relative risk, and care must be taken to arrive at the correct ECx estimate if there is background incidence. For the sex ratio, if there is no background proportion male (i.e. no males in the control) and there are only males and females, then the EC50 for males is the same as the EC50 for females. However, if there is a background incidence of males, the two approaches are not equivalent, because it is important in probit analysis to analyse an increasing proportion when there is background incidence. Probit analysis can handle background incidence for an increasing function, but becomes thoroughly confused accounting for background in a decreasing function: what does a "background" incidence rate of 70 % mean when the incidence rate in a treatment group is 40 %?
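The relative-risk relationship is easy to invert: given background c, an observed incidence p corresponds to a treatment-only effect x = (p - c)/(1 - c), i.e. Abbott's formula. A minimal sketch of our own, including the trick recommended above of analysing the increasing proportion when the endpoint of interest decreases:

```python
def extra_risk(p_observed, c):
    """Treatment-only effect under the relative-risk model:
    p = c + (1 - c) * x  =>  x = (p - c) / (1 - c)  (Abbott's formula)."""
    return (p_observed - c) / (1 - c)

# Observed 60 % incidence over a 20 % background is a 50 % treatment effect,
# matching the relative-risk illustration in Fig. 3.8
x = extra_risk(0.6, 0.2)

# For a decreasing endpoint (e.g. proportion female falling as the
# proportion male rises), analyse the increasing complement instead
prop_female = [0.8, 0.7, 0.5, 0.3]
prop_male = [1 - f for f in prop_female]  # increasing, suitable for probit
```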

Glossary
NOEC – Definition
ECx – Definition
Chi-square – Definition and reference to the statistical test
Cochran-Armitage – Definition and reference to the statistical test
Dunn – Definition and reference to the statistical test
Dunnett – Definition and reference to the statistical test
Fisher exact test – Definition and reference to the statistical test
Jonckheere-Terpstra – Definition and reference to the statistical test
Mann-Whitney – Definition and reference to the statistical test
Rao-Scott – Definition and reference to the statistical test
Williams – Definition and reference to the statistical test

References
Backhaus, T., Blanck, H., Faust, M. (2010) Hazard and risk assessment of chemical mixtures under REACH – State of the art, gaps and options for improvement. Swedish Chemicals Agency, Sundbyberg.
Belden, J.B., Gilliom, R.J., Lydy, M.J. (2007) How well can we predict the toxicity of pesticide mixtures to aquatic life? Integr. Environ. Assess. Manag. 3:364-372.
Chapman, P., Crane, M., Wiles, J., Noppert, F., McIndoe, E. (1996) Improving the quality of statistics in regulatory ecotoxicity tests. Ecotoxicology 5:169-186.
Dhaliwal, B., Dolan, R., Batts, C., Kelly, J., Smith, R., Johnson, S. (1997) Warning: Replacing NOECs with point estimates may not solve regulatory contradictions. Environ. Toxicol. Chem. 16:124-126.
Dunnett, C.W. (1955) A multiple comparison procedure for comparing several treatments with a control. J. Am. Stat. Assoc. 50:1096-1121.
EU (1991) Council Directive 91/414/EEC of 15 July 1991 concerning the placing of plant protection products on the market.
EU (2003) EU Technical Guidance Document in support of Commission Directive 93/67/EEC on risk assessment for new notified substances, Commission Regulation (EC) No 1488/94 on risk assessment for existing substances and Directive 98/8/EC of the European Parliament and of the Council concerning the placing of biocidal products on the market.
Kortenkamp, A., Faust, M., Scholze, M., Backhaus, T. (2007) Low-level exposure to multiple chemicals: reason for human health concerns? Environ. Health Persp. 115 Suppl. 1:106-114.
Länge, R., Hutchinson, T.H., Croudace, C.P., Siegmund, F., Schweinfurth, H., Hampe, P., Panter, G.H., Sumpter, J.P. (2001) Effects of the synthetic estrogen 17α-ethinylestradiol on the life-cycle of the fathead minnow (Pimephales promelas). Environ. Toxicol. Chem. 20:1216-1227.
Motulsky, H., Christopoulos, A. (2004) Fitting Models to Biological Data Using Linear and Nonlinear Regression: A Practical Guide to Curve Fitting. Oxford University Press, Oxford.
OECD (2006) Current Approaches in the Statistical Analysis of Ecotoxicity Data: A Guidance to Application. OECD Series on Testing and Assessment, Guidance Document No. 54. Organisation for Economic Cooperation and Development, Paris, 146 pp.
OECD (2008) Detailed Review Paper on Fish Life-Cycle Tests. OECD Series on Testing and Assessment No. 95, ENV/JM/MONO(2008)22. Organisation for Economic Cooperation and Development, Paris.
Sanco (2010) Draft SANCO 11802/2010 working document. Revision July 2010.
US EPA (1996) Federal Insecticide, Fungicide, and Rodenticide Act, 7 U.S.C. §136 et seq., http://agriculture.senate.gov/Legislation/Compilations/Fifra/FIFRA.pdf.
US EPA (2004) Overview of the ecological risk assessment process in the Office of Pesticide Programs, U.S. Environmental Protection Agency. Endangered and threatened species effects determinations. Office of Prevention, Pesticides and Toxic Substances, Office of Pesticide Programs, Washington, D.C., January 23, 2004.
Williams, D.A. (1972) The comparison of several dose levels with a zero dose control. Biometrics 28:519-531.
Williams, T.D., Caunter, J.E., Lillicrap, A.D., Hutchinson, T.H., Gillings, E.G., Duffell, S. (2007) Evaluation of the reproductive effects of tamoxifen citrate in partial and full life-cycle studies using fathead minnows (Pimephales promelas). Environ. Toxicol. Chem. 26:695-707.