Running head: Assessing equating transformations
Statistical assessment of estimated transformations in observed-score equating
Abstract
Equating methods make use of an appropriate transformation function to map the scores of
one test form into the scale of another so that scores are comparable and can be used
interchangeably. The equating literature shows that the ways of judging the success of an
equating (i.e., the score transformation) might differ depending on the adopted framework.
Rather than targeting different parts of the equating process and aiming to evaluate the
process from different aspects, this paper views the equating transformation as a standard
statistical estimator and discusses how this estimator should be assessed in an equating
framework. For the kernel equating framework, a numerical illustration shows the potential
of viewing the equating transformation as a statistical estimator as opposed to assessing it
using equating-specific criteria. A discussion on how this approach can be used to compare
other equating estimators from different frameworks is also included.
Keywords: equating transformation estimates, statistical evaluation, equating-specific
evaluation criteria, simulation
Acknowledgement
The research in this article by Marie Wiberg was funded by the Swedish Research Council
grant 2014-578. Jorge González was partially funded by the FONDECYT grant 1150233.
Statistical assessment of estimated transformations in observed-score equating
Equating methods are statistical tools used to ensure that scores from different test forms
are comparable and can be used interchangeably. The comparability is obtained using an
appropriate transformation function that maps the scores of one test form into the scale of the
other. To set up the problem, we let X and Y be random variables denoting the scores from test
forms X and Y with cumulative distribution functions (cdfs) $F_X(x)$ and $F_Y(y)$, respectively. We
assume that scores on X are to be transformed to the Y scale. A general transformation
function for the comparison of any two samples or distribution of random variables can be
defined as
$$\varphi_Y(x) = F_Y^{-1}(F_X(x)) \qquad (1)$$
(Wilk & Gnanadesikan, 1968). In the equating literature, this function is known as the
equipercentile transformation (e.g., Braun & Holland, 1982), and it has been shown that all
equating transformations are particular cases of it. Different equating frameworks lead to
parametric, semiparametric, and nonparametric estimators of ϕ (González & von Davier,
2013). Examples of such frameworks include traditional equating methods (e.g., Kolen &
Brennan, 2014), observed-score kernel equating methods (von Davier, Holland, & Thayer,
2004; von Davier, 2013), item response theory methods (Lord, 1980; Kolen & Brennan,
2014), local equating methods (van der Linden, 2011), or combinations of these as in Wiberg,
van der Linden, and von Davier (2014). Because of the large number of possible equating
transformations, we need statistical tools to assess which equating estimators should be used
in different situations.
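As a minimal sketch of Equation 1, the following Python snippet evaluates the equipercentile transformation when both score distributions are taken to be normal. The two distributions and their parameters are hypothetical and serve only as an illustration; under this assumption the transformation reduces to a linear function of x, which the printed comparison makes visible.

```python
from statistics import NormalDist


def equipercentile(x, dist_x, dist_y):
    """Map a score x on form X to the Y scale via F_Y^{-1}(F_X(x))."""
    return dist_y.inv_cdf(dist_x.cdf(x))


# Hypothetical score distributions, chosen only for illustration.
form_x = NormalDist(mu=50.0, sigma=10.0)
form_y = NormalDist(mu=55.0, sigma=12.0)

for score in (40.0, 50.0, 60.0):
    equated = equipercentile(score, form_x, form_y)
    # With two normal cdfs, the transformation is linear in the score.
    linear = form_y.mean + form_y.stdev * (score - form_x.mean) / form_x.stdev
    print(f"x = {score:5.1f}  equated = {equated:7.3f}  linear = {linear:7.3f}")
```

In practice the cdfs are unknown and must be estimated, which is what turns the transformation into an estimator whose quality can be assessed.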
Although it might appear evident from this setup that the main object of inference is the
equating transformation ϕ, equating methods within their respective frameworks have
typically been evaluated using what we call equating-specific evaluation measures. For
example, in kernel equating, one measure that has been used is the percent relative error
(PRE), which essentially compares the moments in the observed and the equated score
distributions (see, e.g., von Davier et al., 2004; Jiang, von Davier, & Chen, 2012). In contrast,
studies using traditional equating methods have reported the so-called “difference that
matters” (DTM), which was originally defined as any difference between equated scores and
scale scores that is larger than half of a reported score unit (Dorans & Feigenbaum, 1994).
Equating-specific evaluation measures target different parts of the equating process (e.g.,
moments of a distribution) and aim to evaluate it based on different aspects.
Another usual practice for the evaluation of equating transformations is the use of
summary indices (e.g., Han, Kolen, & Pohlmann, 1997). These indices make use of a
particular equating to be the standard against which other equatings are compared. They
measure discrepancies between equivalent scores for two different equating methods. We call
these measures equating evaluation criteria. For a summary of traditional equating evaluation
criteria and how they have been implemented, see Harris and Crouse (1993) and Kolen and
Brennan (2014). Harris and Crouse (1993) pointed out that there is no single criterion that is
preferable, but rather several can be used. Although their review is now quite old, their
conclusions are still true today as noted in Kolen and Brennan (2014, p. 325).
The viewpoint taken in this paper is that the equating transformation is the parameter of
interest for making statistical inferences and that the estimations of the transformation need to
be statistically assessed using measures such as bias, standard errors (SE), mean square error
(MSE), and root mean square error (RMSE) as would be standard for any evaluation of an
estimator of an unknown parameter. Although the use of these measures in equating has been
criticized due to the fact that a true value of the parameter is needed for the comparison
(Skaggs & Lissitz, 1986), in this paper it is explicitly shown how it is possible to define a true
equating transformation against which the estimated equating transformation can be
evaluated. Another important viewpoint proposed in this paper is that if we want to compare
different equating transformations we might need to examine several scenarios as the
definition of what constitutes a true equating function may differ depending on the choice of
the equating method.
The aim of this paper is to propose how to perform a statistical evaluation of equating
transformation parameter estimates within an observed-score equating framework. Thus, this
paper is an important step towards viewing the equating transformation as a statistical
estimator, which enriches the statistical theory viewpoint of equating (in line with von Davier,
2011a, 2011b; van der Linden, 2011; González & von Davier, 2013). Although we will give
general directions on how to handle a statistical assessment of equating transformations in
various observed-score equating frameworks, the emphasis will be on kernel equating. This
choice is mainly based on the fact that in this case a "true" value of the equating
transformation (i.e., the parameter of interest) can easily be defined. Contributions in this
paper include a new evaluation measure, a novel way to use PREs when assessing equating
transformations, and the observation that several competing scenarios should be examined if
we want to make a fair comparison of equating transformations.
The structure of the paper is as follows. We first briefly summarize the kernel equating
framework, including a description of equating-specific evaluation measures within this
approach. In the next sections, the general strategy for the statistical evaluation of equating
transformation parameter estimates is presented and discussed when used within three
different equating frameworks, including kernel equating. Measures of statistical evaluation
criteria to evaluate test scores are then described, and this is followed by a numerical
illustration. The final section contains some concluding remarks.
The kernel equating framework
Let $x_j$ and $y_k$ denote the possible score values that the random variables X and Y can
take with $j = 1, \ldots, J$ and $k = 1, \ldots, K$, respectively. For test takers randomly selected from the
target population T scoring $X = x_j$ and $Y = y_k$, we can define the score probabilities
$r_j = \Pr\{X = x_j; T\}$ and $s_k = \Pr\{Y = y_k; T\}$, respectively. Kernel equating is an observed-score
framework whose goal is to find an optimal equating transformation between the test scores X
and Y for a target population T, and the process consists of five steps: 1) presmoothing, 2)
estimation of score probabilities, 3) continuization, 4) equating, and 5) the calculation of the
standard errors of equating (von Davier et al., 2004; von Davier, 2013). The equating
transformation, obtained in step 4, is defined as
$$\hat{\varphi}_Y(x) = \varphi_Y(x; \hat{r}, \hat{s}) = F_{h_Y}^{-1}(F_{h_X}(x; \hat{r}); \hat{s}) = \hat{F}_{h_Y}^{-1}(\hat{F}_{h_X}(x)), \qquad (2)$$
where $\hat{r}$ and $\hat{s}$ are vectors of the estimated score probabilities (step 2) obtained from the
estimated score distributions (step 1), and $h_X$ and $h_Y$ are bandwidth parameters controlling
the degree of smoothness for the continuization (step 3). A Gaussian kernel is usually utilized
for continuization of the cdf, and for X this is defined as
$$F_{h_X}(x) = \sum_j r_j \Phi(R_{jX}(x)), \qquad (3)$$
where $\Phi(\cdot)$ denotes the standard normal cdf,
$$R_{jX}(x) = \frac{x - a_X x_j - (1 - a_X)\mu_{XT}}{a_X h_X},$$
$\mu_{XT} = \sum_j x_j r_j$, $\sigma_{XT}^2 = \sum_j (x_j - \mu_{XT})^2 r_j$, and $a_X^2 = \sigma_{XT}^2 / (\sigma_{XT}^2 + h_X^2)$.
Other choices besides the Gaussian kernel are also possible (e.g., a logistic or a uniform
kernel). The most common way to find an optimal bandwidth is to minimize a penalty
function comprising two components. One component accounts for the distance between the
estimated score probabilities and the corresponding estimated density function, and the other
acts as a smoothness penalty term that avoids rapid fluctuations in the approximated density
(for details, see von Davier, 2011). Alternatives to the penalty function approach have been
described in Häggström and Wiberg (2014), Liang and von Davier (2014), and Andersson and
von Davier (2014).
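The Gaussian kernel continuization in Equation 3 can be sketched in a few lines of Python. The score values, score probabilities, and bandwidth below are hypothetical and are not taken from any of the cited studies; the function name is likewise only illustrative.

```python
import math


def gaussian_kernel_cdf(x, scores, probs, h):
    """Continuized cdf F_h(x) = sum_j r_j * Phi(R_j(x)) using a Gaussian kernel."""
    mu = sum(xj * rj for xj, rj in zip(scores, probs))
    var = sum((xj - mu) ** 2 * rj for xj, rj in zip(scores, probs))
    a = math.sqrt(var / (var + h ** 2))  # a^2 = sigma^2 / (sigma^2 + h^2)
    total = 0.0
    for xj, rj in zip(scores, probs):
        z = (x - a * xj - (1.0 - a) * mu) / (a * h)
        # Standard normal cdf written with the error function.
        total += rj * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return total


# Hypothetical symmetric score distribution on the scores 0..4.
scores = [0, 1, 2, 3, 4]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
h = 0.6

for x in (1.0, 2.0, 3.0):
    print(f"F_h({x}) = {gaussian_kernel_cdf(x, scores, probs, h):.4f}")
```

Smaller bandwidths keep the continuized cdf close to the discrete step function, whereas larger bandwidths smooth it more strongly, which is why the bandwidth choice matters for the resulting equating transformation.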
The (asymptotic) standard error of equating (SEE) due to random sampling from the
target population T is defined as:
$$\mathrm{SEE}_Y(x) = \hat{\sigma}_Y(x) = \sqrt{\mathrm{Var}(\hat{\varphi}_Y(x))}, \qquad (4)$$
where $\hat{\varphi}_Y(x)$ is defined in Equation 2 and the delta method is used to calculate the variance.
Interestingly, the definition of SEE in Equation 4 is equivalent to that given in statistical
inference for an estimated parameter of interest (in this case, the equating transformation).
However, as will be seen later in this paper, the mathematical form of the quantity under
the square root sign varies according to the adopted equating framework. In this sense, the
SEE could also be considered as an equating-specific evaluation measure.
Equating-specific evaluation measures within the kernel-equating framework
The PRE has been used for evaluation in the kernel equating framework. If we denote the
$p$th moment of the distribution of test scores Y and the equated scores $\varphi_Y(X)$ as
$\mu_p(Y) = \sum_k (y_k)^p s_k$ and $\mu_p(\varphi_Y(X)) = \sum_j (\varphi_Y(x_j))^p r_j$, respectively, then the PRE is defined
as
$$\mathrm{PRE}(p) = 100 \, \frac{\mu_p(\varphi_Y(X)) - \mu_p(Y)}{\mu_p(Y)} \qquad (5)$$
(von Davier et al., 2004). Another commonly used equating-specific measure not only in
kernel equating, but also in traditional methods of equating, is the DTM. DTM has been used
as a criterion to decide between two transformations that differ in some respect. Originally,
this meant any differences between equated scores and scale scores that are larger than half of
a reported score unit. In this paper, DTM will be used in connection with the measure of
differences of two equating estimators, which is defined formally in the numerical illustration
section.
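To make Equation 5 concrete, the PRE can be computed directly from the score probabilities and the equated values. The sketch below uses made-up numbers (a three-point score scale and equated values that are 10% too large) purely for illustration; the function names are hypothetical.

```python
def moment(values, probs, p):
    """p-th moment of a discrete distribution: sum of value^p * probability."""
    return sum(v ** p * w for v, w in zip(values, probs))


def percent_relative_error(equated_x, r, y_scores, s, p):
    """PRE(p) = 100 * (mu_p(phi_Y(X)) - mu_p(Y)) / mu_p(Y)."""
    mu_equated = moment(equated_x, r, p)
    mu_y = moment(y_scores, s, p)
    return 100.0 * (mu_equated - mu_y) / mu_y


# Hypothetical data: form Y scores with probabilities s, and equated X scores
# (with probabilities r) that overshoot the Y scale by 10%.
y_scores = [1.0, 2.0, 3.0]
s = [0.2, 0.5, 0.3]
r = [0.2, 0.5, 0.3]
equated_x = [1.1, 2.2, 3.3]

for p in (1, 2, 3):
    print(f"PRE({p}) = {percent_relative_error(equated_x, r, y_scores, s, p):.2f}")
```

Because the PRE only compares moments, a value near zero indicates that the equated score distribution matches the target distribution in that moment, not that the transformation itself is close to a true transformation.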
These two evaluations share the common feature that they are especially developed to
evaluate an equating, although they handle different aspects of the performed equating. These
evaluations are quite different from general statistical measures used to assess different
estimators because they do not rely on comparing a true value with an estimated value of a
parameter of interest. It should, however, be noted that statistical indices such as root mean
square difference (RMSD), mean absolute difference (MAD), and mean signed difference
(MSD) have been used previously to evaluate an equating (Harris & Crouse, 1993), although
not in the way that will be implemented here. As mentioned before, we will define a true
equating transformation against which the estimated equating transformation will be
evaluated. Bias and MSE have also been used to evaluate equating functions, sometimes with
several replications as noted in Kolen and Brennan (2014, p. 311). They do not, however,
emphasize the comparison of an estimated equating with its true equating but rather compare
a criterion equating against an estimated equating function. Additionally, previous studies
have typically used large populations from which replicated samples have been drawn and
used to calculate some of the above-mentioned measures. This study differs from previous
studies in that a probability model is utilized to generate replicated data so that Monte Carlo
methods can be used to calculate the evaluation measures, as is usually done in statistical
assessment studies.
Statistical evaluation of equating transformation parameter estimates
Because a point estimator is defined as any function of the sample, measures of the
quality of an estimator have been developed in order to choose between different estimators.
Most of these measures are based on the deviation of the estimated value of a parameter from
its true value. We will describe three of these measures when using the equating
transformation as the unknown parameter of interest. It is important to emphasize that the test
scores X and Y are random variables and that the equating transformation, as for any other
estimator in statistics, is just a mathematical rule that becomes a random variable when
observed data (i.e., the realizations of random variables) are used to obtain an estimation. It is
thus necessary to use expectations to evaluate the quality of this estimator.
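Let $\hat{\varphi}(x)$ denote an estimator of $\varphi(x)$. The definitions of bias, MSE, and RMSE are well
known and are shown here explicitly for the case of equating transformations:
$$\mathrm{Bias}(\hat{\varphi}) = E_{f_{\varphi}}[\hat{\varphi}] - \varphi, \qquad (6)$$
$$\mathrm{MSE}(\hat{\varphi}) = E_{f_{\varphi}}[(\hat{\varphi} - \varphi)^2], \qquad (7)$$
and
$$\mathrm{RMSE}(\hat{\varphi}) = \sqrt{E_{f_{\varphi}}[(\hat{\varphi} - \varphi)^2]}, \qquad (8)$$
where the expectation is taken over the distribution of the equating function estimates $f_{\varphi}$. For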
similar definitions within the local equating framework, see Wiberg and van der Linden
(2011). Because these expectations are not generally available in closed form, randomly
generated score data are used to calculate them in practice using Monte Carlo simulation.
Likewise, the definition for the SE is
$$\mathrm{SE}(\hat{\varphi}_Y(x)) = \sqrt{\mathrm{Var}(\hat{\varphi}_Y(x))}, \qquad (9)$$
and as mentioned before it coincides exactly with the definition of the SEE. Note that the SE can
be obtained from the MSE and the bias because the MSE decomposes into the squared bias plus
the variance.
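The relation between bias, MSE, and SE can be checked numerically with a small Monte Carlo sketch. The setup is a toy assumption and not the design used in this paper: scores on both forms are drawn from hypothetical normal distributions, the true equated value at x = 60 is then 67, and the estimator plugs sample means and standard deviations into the resulting linear transformation.

```python
import math
import random
from statistics import fmean, pstdev


def linear_equate(x, sample_x, sample_y):
    """Plug-in estimate of F_Y^{-1}(F_X(x)), assuming both score distributions are normal."""
    return fmean(sample_y) + pstdev(sample_y) * (x - fmean(sample_x)) / pstdev(sample_x)


random.seed(2014)
x0, true_value = 60.0, 67.0  # true equated value for X ~ N(50, 10), Y ~ N(55, 12)

estimates = []
for _ in range(500):  # Monte Carlo replications
    sample_x = [random.gauss(50.0, 10.0) for _ in range(200)]
    sample_y = [random.gauss(55.0, 12.0) for _ in range(200)]
    estimates.append(linear_equate(x0, sample_x, sample_y))

bias = fmean(estimates) - true_value
mse = fmean([(e - true_value) ** 2 for e in estimates])
rmse = math.sqrt(mse)
se = math.sqrt(mse - bias ** 2)  # SE recovered from the MSE and the squared bias

print(f"bias = {bias:.4f}, MSE = {mse:.4f}, RMSE = {rmse:.4f}, SE = {se:.4f}")
```

The SE obtained this way agrees with the standard deviation of the replicated estimates, up to floating-point error.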
As pointed out earlier, there are also equating evaluation criteria or statistical indices that
have been used to evaluate equatings. Three commonly used indices are MSD, MAD, and
RMSD (Han, Kolen, & Pohlmann, 1997; Harris & Crouse, 1993). These measures have been
shown to be particularly useful for comparing two different equating estimators within a
framework (e.g., one estimator that uses a Gaussian kernel and the other using a logistic
kernel within the kernel equating framework). Harris and Crouse (1993) pointed out that most
of these summary measures have been (and will continue to be) used due to their prominence
in the literature and not based on their applicability in a particular situation. We will explore
modified versions of these indices to compare a true equating transformation against an
estimated equating transformation, in line with the adopted viewpoint that the equating
transformation is a standard statistical estimator.
Assessing $\hat{\varphi}(x)$ within an observed-score equating framework
In order to use any of the statistical evaluation measures described in Equations 6–9 and
the modified versions of the equating evaluation criteria, we need to define a true equating
transformation. A (unique) true equating transformation is not known, except for the case of
simulations. Here, an explicit probability model will be used to generate score data so that the
calculation of true equated values is possible. Depending on the adopted framework, the
mathematical form of the transformation ϕ might change. Whichever framework is adopted,
the true value of ϕ is obtained for the equating framework using a particular definition of
equating. Here, Braun and Holland’s (1982) definition stated in Equation 1 is used. As long as
the interest is to assess the equating transformation within an observed-score framework, this
should not yield too many difficulties because the mathematical form of the true parameter
and the estimator is the same.
Once the true transformation is defined, simulated data are used to evaluate different
estimators. Simulations are typically used to approximate the sampling distribution of the
estimator under a particular set of conditions. Examining different situations through
simulations is very useful when assessing the behavior of different estimators before they are
applied to real data.
Next, the statistical assessment of $\hat{\varphi}(x)$ will be discussed for the three observed-score
equating frameworks of kernel equating, item response theory (IRT) observed-score equating,
and local equating. For all of these cases, the assessment of the equating transformation
follows the same procedure:
1. Create simulated score data for the two test versions from a known probability model.
2. Obtain the true equating transformation.
3. Estimate the equating transformations for the different scenarios or different estimators
ϕ of interest.
Although the assessment strategy is common for all methods, both the mathematical
definition of the equating transformation $\varphi(x)$ and the data-generating mechanism used to
generate replicated data will differ.
In order to make a fair comparison that does not favor any of the assessed estimators, the
estimated transformations will be compared against the true equating transformation as
defined using one of the competing models. This means that if two transformations are
compared, two sets of true equated values will be generated and compared with the two
assessed estimators. This way it is possible to check robustness to model misspecification,
for example, for models that include a certain feature in the estimation that the one used to produce
the true equated values does not have.
Assessing $\hat{\varphi}(x)$ within observed-score kernel equating
To assess the estimated equating transformation $\hat{\varphi}(x)$ within the kernel equating framework,
we need to define a true equating transformation $\varphi(x)$. The true transformation will depend
on what we are interested in examining, for example, different bandwidth selection methods,
different kernels, or how the equating transformation behaves if we have a symmetric or a
skewed test score distribution. Because the equating transformation might depend on more
than one parameter, one has the flexibility to decide which parameter should be fixed and
which should be estimated when defining the true equating transformation. For the
assessment of the equating transformation, the three steps outlined above are followed. We
define the estimated equating transformation as
$$\hat{\varphi}_Y(x) = \varphi_Y(x; F_{h_X}, F_{h_Y}, h_X, h_Y, \hat{r}, \hat{s}) = \hat{F}_{h_Y}^{-1}(\hat{F}_{h_X}(x)), \qquad (10)$$
where $h_X$, $h_Y$, $\hat{r}$, and $\hat{s}$ are defined as above, and $F_{h_X}$ and $F_{h_Y}$ are used to make explicit the
dependence of ϕ on the kernel used. Suppose the interest is in assessing the use of different
kernels. Because using different kernels has an impact on the estimation of the bandwidths
$h_X$ and $h_Y$, these values need to be estimated in each replication for the different scenarios.
The same is true for the score probabilities r and s. If the interest instead is to compare
different bandwidth selection methods, one could let the score probabilities be fixed and
estimate the bandwidths using different bandwidth selection methods. These are just two
examples of possible comparisons, and the first will be illustrated later in this paper.
Additionally, in order to make a fair comparison that does not favor any of the assessed
estimators, the estimated transformations will be compared against the true equating
transformation as defined using either of the competing models (e.g. models defined using
different kernels).
Assessing $\hat{\varphi}(x)$ within IRT observed-score equating
In IRT observed-score equating, an IRT model is used to produce an estimated distribution of
observed number-correct scores on each test form, and these are then used to equate scores
using equipercentile methods. The equating transformation is defined as in Equation 1 where
the involved cdfs are defined as
$$F_Z(z) = \int F(z \mid \theta)\, g(\theta)\, d\theta, \quad Z = X, Y,$$
where $g(\theta)$ is the distribution of θ (Kolen & Brennan, 2014). Possible scenarios that can be
evaluated include different ability distributions $g(\theta)$, different IRT models, and different
ways to obtain the conditional distributions of scores $F(X \mid \theta)$ and $F(Y \mid \theta)$ (González,
Wiberg, & von Davier, in press). For the assessment to be fair, if the interest is in comparing
equating transformations obtained from using either the two (2PL) or the three (3PL)
parameter logistic IRT models, one needs to examine two possible true equating
transformations provided that the data are suitable for being modeled with both of these IRT
models. The true equating transformation will thus consist of Equation 1 with either the 2PL
or the 3PL. Also, for the assessment to be fair, we need to simulate data following both a 2PL
and a 3PL model.
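One possible way to obtain the marginal cdf used in IRT observed-score equating is sketched below. It is an illustration under several assumptions that are not taken from this paper: a 2PL model, hypothetical item parameters, a standard normal ability distribution approximated on an equally spaced grid, and a standard item-by-item recursion for the conditional number-correct distribution. All function names are hypothetical.

```python
import math


def prob_correct_2pl(theta, a, b):
    """2PL probability of a correct response to an item with parameters (a, b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def conditional_score_dist(theta, items):
    """P(score = z | theta) for z = 0..len(items), built up one item at a time."""
    dist = [1.0]
    for a, b in items:
        p = prob_correct_2pl(theta, a, b)
        new = [0.0] * (len(dist) + 1)
        for z, mass in enumerate(dist):
            new[z] += mass * (1.0 - p)  # item answered incorrectly
            new[z + 1] += mass * p      # item answered correctly
        dist = new
    return dist


def marginal_score_cdf(items, n_points=61, lo=-6.0, hi=6.0):
    """Approximate F_Z(z) = integral of F(z | theta) g(theta) d(theta) with g = N(0, 1)."""
    step = (hi - lo) / (n_points - 1)
    thetas = [lo + i * step for i in range(n_points)]
    weights = [math.exp(-0.5 * t * t) for t in thetas]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalized grid weights
    pmf = [0.0] * (len(items) + 1)
    for t, w in zip(thetas, weights):
        for z, mass in enumerate(conditional_score_dist(t, items)):
            pmf[z] += w * mass
    cdf, running = [], 0.0
    for mass in pmf:
        running += mass
        cdf.append(running)
    return cdf


# Hypothetical (a, b) item parameters for a four-item form.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]
cdf_x = marginal_score_cdf(items)
print([round(v, 4) for v in cdf_x])
```

Applying the same function to the items of form Y gives the second cdf needed for Equation 1; swapping the response function for a 3PL version would produce the competing true transformation discussed above.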
Assessing $\hat{\varphi}(x)$ within local equating
A more recent equating framework is local equating (van der Linden, 2011; van der
Linden & Wiberg, 2010; Wiberg et al., 2014). Instead of using the marginal distributions of
scores, this method utilizes the conditional distributions of scores given ability or any simple
classification of it, and this leads to a family of equating transformations of the form
$$\varphi(x; \theta) = F_{Y \mid \theta}^{-1}(F_{X \mid \theta}(x)), \quad \theta \in \mathbb{R}. \qquad (11)$$
Assessing $\hat{\varphi}(x) = \varphi(x; \hat{\theta})$ within local equating will be similar to IRT observed-score
equating if we use the local equating methods that rely on the assumption that the data fit an
IRT model. We can then proceed and obtain estimates of the equating transformations
similarly as in IRT observed-score equating. Although it is possible to use various evaluation
measures, the assessment has typically been based on bias measures (e.g., van der Linden &
Wiberg, 2010; Wiberg et al., 2014). In this framework, possible scenarios that can be
evaluated are different estimation methods for θ , different IRT models, and different ways to
obtain the conditional distributions.
Measures of statistical assessment when equating test scores
In order to practically evaluate the statistical measures in Equations 6-9, the Monte Carlo
method will be used with replicated data generated from a known probability model. For each
assessment measure, the true and estimated equated values are compared for each test score.
Let $x_i$ denote a specific test score, where $i = 0, \ldots, n$ and n is the number of possible score
values. The simplest evaluation measure is the bias, which for an equated value $\varphi_Y(x_i)$ over
1000 replications where each replicate is denoted by l, is defined as
$$\mathrm{Bias}(\hat{\varphi}_Y(x_i)) = \frac{1}{1000} \sum_{l=1}^{1000} \left(\hat{\varphi}_Y^{(l)}(x_i) - \varphi_Y(x_i)\right), \qquad (12)$$
followed by the MSE defined as
$$\mathrm{MSE}(\hat{\varphi}_Y(x_i)) = \frac{1}{1000} \sum_{l=1}^{1000} \left(\hat{\varphi}_Y^{(l)}(x_i) - \varphi_Y(x_i)\right)^2, \qquad (13)$$
and the RMSE defined as
$$\mathrm{RMSE}(\hat{\varphi}_Y(x_i)) = \sqrt{\frac{1}{1000} \sum_{l=1}^{1000} \left(\hat{\varphi}_Y^{(l)}(x_i) - \varphi_Y(x_i)\right)^2}, \qquad (14)$$
where $\hat{\varphi}_Y^{(l)}(x_i)$ is the estimated equated score for the lth replication. As mentioned before,
$\mathrm{SE}(\hat{\varphi}_Y(x_i))$ can be calculated by subtracting the squared bias from the MSE and taking the square
root. We will also use modifications of the previously mentioned indices (MSD, MAD, and
RMSD) that compare a true equating transformation against an estimated equating
transformation using the full set of replications. For that aim, we adjusted the indices to be
used with replications and defined them as the following average measures:
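$$\mathrm{AMSD}(\varphi_Y) = \frac{1}{1000} \sum_{l=1}^{1000} \mathrm{MSD}_l, \qquad (15)$$
$$\mathrm{AMAD}(\varphi_Y) = \frac{1}{1000} \sum_{l=1}^{1000} \mathrm{MAD}_l, \qquad (16)$$
and
$$\mathrm{ARMSD}(\varphi_Y) = \frac{1}{1000} \sum_{l=1}^{1000} \mathrm{RMSD}_l, \qquad (17)$$
where, for replication l,
$$\mathrm{MSD}_l = \frac{1}{n} \sum_{i=0}^{n} \left[\hat{\varphi}_Y^{(l)}(x_i) - \varphi_Y(x_i)\right],$$
$$\mathrm{MAD}_l = \frac{1}{n} \sum_{i=0}^{n} \left|\hat{\varphi}_Y^{(l)}(x_i) - \varphi_Y(x_i)\right|,$$
and
$$\mathrm{RMSD}_l = \sqrt{\frac{1}{n} \sum_{i=0}^{n} \left[\hat{\varphi}_Y^{(l)}(x_i) - \varphi_Y(x_i)\right]^2}.$$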
These definitions are in line with how Chen (2012) used these measures, although no
formulas were given in that study. Note that MSD, MAD, and RMSD produce a single
number that corresponds to an average over the total number of scores. We have also explored
the possibility of redefining these measures in order to use the full range of score values rather
than averaging over them. In this case, we started from the versions of these measures given
in Han et al. (1997) and extended them into average at each score point using the replicated
data. A proposed measure that we will use is the average point absolute difference (APAD),
and this is defined as
$$\mathrm{APAD}(\hat{\varphi}_Y(x_i)) = \frac{1}{1000} \sum_{l=1}^{1000} \left|\hat{\varphi}_Y^{(l)}(x_i) - \varphi_Y(x_i)\right|. \qquad (18)$$
Note that although it is also possible to define an average point signed difference (APSD), the
resulting formula becomes mathematically equivalent to the bias (Equation 12) and thus is
excluded here. Also, an average point root square difference (APRSD) where the absolute
value function in APAD is exchanged with a square root and the difference is raised to the
power two is mathematically equivalent to the formal definition of APAD and is excluded
here.
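The average measures in Equations 15–17 and the APAD in Equation 18 can be sketched as follows. The function names and the tiny data set (three score points, two replications) are hypothetical, and the averages within a replication are taken over the number of score points in the list, which is one reading of the definitions above.

```python
import math
from statistics import fmean


def replication_indices(estimated, true):
    """MSD, MAD, and RMSD for a single replication, averaged over score points."""
    diffs = [e - t for e, t in zip(estimated, true)]
    msd = fmean(diffs)
    mad = fmean([abs(d) for d in diffs])
    rmsd = math.sqrt(fmean([d ** 2 for d in diffs]))
    return msd, mad, rmsd


def average_indices(replications, true):
    """AMSD, AMAD, and ARMSD: the single-replication indices averaged over replications."""
    per_rep = [replication_indices(est, true) for est in replications]
    return tuple(fmean([rep[k] for rep in per_rep]) for k in range(3))


def apad(replications, true):
    """Average point absolute difference: one value per score point."""
    return [
        fmean([abs(est[i] - true[i]) for est in replications])
        for i in range(len(true))
    ]


# Hypothetical true equated values and two replicated estimates.
true_values = [0.0, 1.0, 2.0]
replications = [[0.1, 1.0, 1.9], [-0.1, 1.2, 2.1]]

amsd, amad, armsd = average_indices(replications, true_values)
print(f"AMSD = {amsd:.4f}, AMAD = {amad:.4f}, ARMSD = {armsd:.4f}")
print("APAD =", [round(v, 4) for v in apad(replications, true_values)])
```

The first three measures collapse each replication to a single number before averaging, whereas APAD keeps one value per score point and therefore shows where on the score scale the estimator deviates from the true transformation.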
Numerical illustration
In order to illustrate the statistical assessment of the equating transformations, simulated
data were used with a large number of replications to evaluate kernel equating using the
Gaussian and logistic kernels. The definition of the equating transformation using a standard
logistic kernel is similar to that used for the Gaussian kernel in Equation 3, but replacing $\Phi(\cdot)$
with
$$K(v) = \frac{1}{1 + \exp(-v)},$$
where the random variable V has mean of 0 and a variance of $\sigma_V^2 = \pi^2 / 3$ (Lee & von Davier,
2011). A comparison between these two kernels using SEE and cumulants was presented in
Lee and von Davier (2011) where data from chapter 7 in von Davier et al. (2004) are used
with the equivalent group design. For the numerical illustration, the reported values of $r$, $s$,
$h_X$, and $h_Y$ in these studies were used as true parameter values to obtain true equated scores.
The score data came from two 20-item test forms X and Y where M = 1455 and N = 1453 test
takers were administered test forms X and Y, respectively. For the true parameter values of
the Gaussian kernel, the optimal bandwidths were $h_X = 0.622$ and $h_Y = 0.579$, and for the
logistic kernel the optimal bandwidths were $h_X = 0.512$ and $h_Y = 0.446$. Given the true
parameter values for $r_j$ and $s_k$, which are given in Table 7.4 (von Davier et al., 2004), 1000
instances of score frequencies $\mathbf{n} = (n_1, \ldots, n_J)$ and $\mathbf{m} = (m_1, \ldots, m_K)$ were generated from the
multinomial distributions $\mathrm{Mult}(n, r_1, \ldots, r_J)$ and $\mathrm{Mult}(m, s_1, \ldots, s_K)$ with the same sample
sizes as in the original data. For each replication, we estimated $h_X$ and $h_Y$ as well as the $r_j$
and $s_k$ values. The first step in kernel equating is presmoothing. For simplicity, we used the
same loglinear model that originated from the "true" score probabilities parameters to
presmooth the simulated data as was given in von Davier et al. (2004)
$$\log(r_j) = \alpha_r + \sum_{t=1}^{T_r} \beta_{rt} (x_j)^t \quad \text{and} \quad \log(s_k) = \alpha_s + \sum_{t=1}^{T_s} \beta_{st} (y_k)^t,$$
with $T_r = 2$ and $T_s = 3$. Other possibilities are to either automatically select the loglinear
models (using e.g. Akaike Information Criterion (AIC) or Bayesian Information Criterion
(BIC), a chi square test, or a likelihood ratio test) or to skip the presmoothing step and argue
that the same mistake is made in all replications. This could also have been done here,
although we have chosen to include the presmoothing step for illustration reasons because it is
part of the five steps of kernel equating.
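The data-generating step described above (drawing replicated score frequencies from a multinomial distribution with known score probabilities) can be sketched with the Python standard library. The score probabilities below are hypothetical placeholders and not the values from Table 7.4, and the raw relative frequencies are used as estimates, that is, the presmoothing step is skipped in this sketch.

```python
import random
from collections import Counter


def simulate_frequencies(probs, sample_size, rng):
    """Draw one multinomial vector of score frequencies for the given score probabilities."""
    draws = rng.choices(range(len(probs)), weights=probs, k=sample_size)
    counts = Counter(draws)
    return [counts.get(j, 0) for j in range(len(probs))]


rng = random.Random(7)
true_r = [0.05, 0.15, 0.30, 0.30, 0.15, 0.05]  # hypothetical "true" score probabilities
M = 1455                                        # sample size for form X, as in the text
n_replications = 1000

replications = [simulate_frequencies(true_r, M, rng) for _ in range(n_replications)]

# Unsmoothed estimates of the score probabilities in each replication.
r_hat = [[n_j / M for n_j in freq] for freq in replications]
mean_r_hat = [fmean_j / n_replications for fmean_j in
              [sum(rep[j] for rep in r_hat) for j in range(len(true_r))]]
print([round(v, 4) for v in mean_r_hat])
```

Each replicated frequency vector would then be passed through the remaining kernel equating steps to produce one estimated equating transformation per replication.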
From the generated data, we estimated equated values for both the Gaussian kernel and
the logistic kernel. The estimated equating values in both of these scenarios were then
compared with the true equated values from both the Gaussian and logistic kernel to yield a
total of four comparisons or scenarios. This way it is possible to check the robustness of
misspecified models, for example, those that use a kernel in the estimation that is different
from the one used to produce the true equated values. The four different scenarios were
examined with all of the previously described statistical evaluation measures (bias, MSE,
RMSE, AMSD, AMAD, ARMSD, APAD) given in Equations 12–18. Additionally, adjusted
versions of the PRE and SEE that use the full set of replications as well as an overall criterion
of equating differences (EDIFF) were used and are described in the next subsection. To
perform the numerical illustration, either of the R packages kequate (Andersson, Bränberg, &
Wiberg, 2013) or SNSequate (González, 2014) could be used. In this case we used
SNSequate.
Adjusting the equating-specific evaluation measures
To take advantage of the full set of replications, we adjusted the previously described
equating measures PRE and SEE. To adjust SEE, for each score value $x_i$, the SEE is obtained
by calculating the average over 1000 replications as
$$\mathrm{SEE}_Y^{a}(x_i) = \frac{1}{1000} \sum_{l=1}^{1000} \mathrm{SEE}_Y^{(l)}(x_i), \qquad (19)$$
where $\mathrm{SEE}_Y(x_i)$ is defined as in Equation 4. Similarly, to adjust the PRE we average over
replications, and define $\mathrm{PRE}^{a}$ as
$$\mathrm{PRE}^{a}(p) = \frac{1}{1000} \sum_{l=1}^{1000} \mathrm{PRE}^{(l)}(p), \qquad (20)$$
where $\mathrm{PRE}^{(l)}(p)$ is the value of PRE as defined in Equation 5 for the lth replicate. Finally, the
DTM, or modifications of it, has been used in several equating studies (e.g., Yang, Bontya, &
Moses, 2011; Brossman, 2010). Here, DTM is used as a criterion to decide between
differences of two equating estimators measured by EDIFF defined as
$$\mathrm{EDIFF} = \frac{1}{1000} \sum_{l=1}^{1000} \left|\hat{\varphi}_{Y_1}^{(l)}(x_i) - \hat{\varphi}_{Y_2}^{(l)}(x_i)\right|, \qquad (21)$$
where $\hat{\varphi}_{Y_1}$ and $\hat{\varphi}_{Y_2}$ represent two different equating estimators (Liang & von Davier, 2014).
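A minimal sketch of EDIFF in Equation 21, combined with a DTM threshold, is given below. The replicated estimates are invented numbers for two hypothetical estimators at three score points, and the threshold of 0.5 assumes that scores are reported in whole units.

```python
from statistics import fmean


def ediff(estimates_1, estimates_2):
    """Average absolute difference between two estimators at each score point."""
    n_points = len(estimates_1[0])
    return [
        fmean([abs(e1[i] - e2[i]) for e1, e2 in zip(estimates_1, estimates_2)])
        for i in range(n_points)
    ]


DTM = 0.5  # half of a reported score unit, assuming whole-number score units

# Hypothetical replicated estimates (rows: replications, columns: score points).
gauss = [[0.2, 1.1, 2.0], [0.4, 1.3, 2.2]]
logis = [[0.9, 1.2, 2.1], [1.1, 1.2, 2.1]]

values = ediff(gauss, logis)
flags = [v > DTM for v in values]  # True where the difference exceeds the DTM
print([round(v, 3) for v in values], flags)
```

Score points flagged in this way are those where the choice between the two estimators would change a reported score.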
Results
The bias, MSE, and RMSE for the different equated values for the Gaussian and logistic
kernels using either of them as the true equating transformation are shown in Figures 1 and 2.
The suffixes G and L in these figures indicate which kernel (Gaussian or logistic) was used to
obtain true equated values, and the words gauss and logis indicate the kernel used for the
estimations. From Figure 1 it is evident that the bias at the endpoint is larger for misspecified
models (i.e., gauss.L and logis.G) as expected, and the bias also differs depending on which
kernel is used. It should also be noted that the size of the bias is slightly different depending
on whether the Gaussian or the logistic kernel is used as the true equating transformation.
Looking at the lower extreme (until score 7), the bias for gauss.G appears to be smaller, and
for both correctly specified models (i.e., gauss.G and logis.L) the bias curves converge from the
middle to the upper part of the score range.
Figure 1: Bias for equated values over 1000 replications. G and L indicate that data were
generated using the Gaussian and logistic models, respectively.
The MSE, shown in the left panel of Figure 2, was almost identical in the four scenarios,
except for slight differences at the extremes of the score scale. This is not surprising because
both the bias and the SE (right part of Figure 3) only displayed small differences across the
score scale. The RMSE also yielded the same pattern for all scenarios and was never larger
than .30 for any of the scores.
Figure 2. MSE to the left and RMSE to the right for equated values (over 1000 replications).
G and L indicate that data were generated under the Gaussian and logistic model,
respectively.
The results for the average loss measures are shown in Table 1. The AMSD values were close
to zero in all scenarios when comparing Gaussian and logistic kernels, which indicates a small
average loss regardless of the kernel that was used. For all three measures, the values for the
misspecified models were larger in absolute value than those for the correctly specified
models. The AMAD values indicated that the Gaussian model performed better because it
outperformed the logistic model when the model was correctly specified and it was also more
robust when the model was misspecified. The ARMSD values were all very similar across
simulation scenarios.
Table 1. Average loss measures for the four examined scenarios.
          gauss.G     logis.G     gauss.L     logis.L
AMSD     −0.00031     0.00361    −0.00419    −0.00026
AMAD      0.15866     0.16074     0.16040     0.15923
ARMSD     0.18147     0.18296     0.18283     0.18135
Note. The suffixes (G and L) indicate which kernel (Gaussian or logistic) was used to obtain true equated values, whereas the words gauss and logis indicate the kernel used for the estimations.
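The comparisons drawn from Table 1 can be checked directly from the reported values. The short sketch below only re-reads the table; the variable and function names are hypothetical, and the AMSD comparison is made in absolute value because that measure is signed.

```python
table1 = {
    "AMSD": {"gauss.G": -0.00031, "logis.G": 0.00361,
             "gauss.L": -0.00419, "logis.L": -0.00026},
    "AMAD": {"gauss.G": 0.15866, "logis.G": 0.16074,
             "gauss.L": 0.16040, "logis.L": 0.15923},
    "ARMSD": {"gauss.G": 0.18147, "logis.G": 0.18296,
              "gauss.L": 0.18283, "logis.L": 0.18135},
}
correct = ("gauss.G", "logis.L")       # estimation kernel matches the true kernel
misspecified = ("logis.G", "gauss.L")  # estimation kernel differs from the true kernel


def misspecified_always_larger(row):
    """True if every misspecified value exceeds every correct one in absolute value."""
    smallest_mis = min(abs(row[s]) for s in misspecified)
    largest_cor = max(abs(row[s]) for s in correct)
    return smallest_mis > largest_cor


checks = {name: misspecified_always_larger(row) for name, row in table1.items()}
# Gaussian versus logistic kernel on AMAD, under correct and incorrect specification.
gauss_better_when_correct = table1["AMAD"]["gauss.G"] < table1["AMAD"]["logis.L"]
gauss_more_robust = table1["AMAD"]["gauss.L"] < table1["AMAD"]["logis.G"]
```

All five comparisons evaluate to true for the values in Table 1, although the differences are small in every case.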
The results for the proposed APAD are given in the left part of Figure 3. The curves for the
correctly specified models are clearly consistent across the full range of scores. The opposite
holds for the misspecified models, neither of which appears to be robust, particularly at the
extremes of the score scale.
The SE in the right part of Figure 3 yielded very similar results regardless of the kernel
used. To compare the SE obtained using our approach with the SEE reported in Lee and von
Davier (2011), we have added the SEE they reported for both a Gaussian (gauss.b) and a
logistic (logis.b) kernel. Interestingly, the SE plot displays essentially the same results as
those shown in Lee and von Davier (2011, Figure 10.3) for the SEE even though they did not
use replicates. This means that the simulations support the analytical results. Note that we
can only compare the SE with the SEE in Lee and von Davier (2011), as they did not use any
of the other measures proposed here in their evaluation. It is notable that the SEE curves in
Figure 3 are almost identical to the RMSE curves shown in Figure 2. This might be because
the average of the 1000 replicates, used as the criterion for the former, was very close to the
true equating value used for the latter.
Figure 3. APAD (over 1000 replications) for the four examined scenarios (left panel) and SE
(over 1000 replications) using either a logistic or a Gaussian kernel together with the
obtained results by Lee and von Davier (2011) (right panel). G and L indicate that data were
generated under the Gaussian and logistic model, respectively.
The results in Figure 3 are in line with what Figure 4 shows for the EDIFF measure.
Although none of the differences are larger than a DTM, it can be seen that the largest
differences between the Gaussian and logistic models occur at the extremes of the score scale.
Figure 5 shows plots for the adjusted SEE (left panel) and PRE (right panel). The
adjusted SEE gives results almost identical to those reported in Lee and von Davier (2011).
The plot of the adjusted PRE values for the first 10 moments of the distributions shows that
the average PRE is lower if a Gaussian kernel rather than a logistic kernel is used, at least from
the third moment on. Note that because $\mu_p(Y)$ in Equation 5 is the same regardless of the
estimated values, there are only two curves in the right panel of Figure 5, compared with Figures
1, 2, and 3, where four curves, one for each scenario, are shown.
Figure 4. The equating difference (EDIFF) over the different test scores (over 1000
replications) when comparing the Gaussian kernel against the logistic kernel.
Figure 5. The adjusted SEE (over 1000 replications) and the SEE from Lee and von Davier
(2011) when comparing the Gaussian kernel against the logistic kernel (left panel); and the
average PRE (over 1000 replications) using either a logistic or a Gaussian kernel (the right
panel).
Concluding remarks
This paper was motivated by the insight that test scores are random variables and thus
an equating transformation is nothing more than a statistical estimator. This implies that the
obtained equating transformation, and thus the equated scores, should be treated as estimates
produced by a statistical estimator.
The aim of this paper was to propose a statistical evaluation of equating transformation
parameter estimates within an observed-score equating framework. The proposed approach to
assessing equating transformations has interesting features. First, it offers a statistical
perspective on equating that allows the use of general statistical tools such as probability
models used for replications. Second, it emphasizes that it could be fairer to use several true
equating models instead of just one. Third, it gives us the possibility to use statistical
evaluation methods that are already developed.
If we want to assess an equating transformation, we propose to proceed as when working
with a standard statistical inference problem. Our approach differs from other equating
assessment studies in that simulated data from a known probability model are used for
evaluations. This way, it is possible to assess different components of the equating
transformations (e.g., different kernels) as well as to examine different equating methods in
different (simulated) situations (e.g., it allows us to determine which equating method is most
suitable if the score distributions are symmetric or non-symmetric). These ideas were
explicitly illustrated for kernel equating using a multinomial probability distribution, and
their implementation within either the IRT observed-score or local equating frameworks was
discussed. In all of these cases, an explicit probability model (i.e., the multinomial and the
underlying Bernoulli model for binary IRT data) was available. A more challenging part is to
assess the equating transformation within equipercentile observed-score equating because one
has to decide which probability model can be used to generate the data. If we use a
multinomial distribution, we already have discrete test scores, but if we decide to use, for
example, a normal distribution we need to first discretize the scores. In any case, it is unclear
which probability model better represents the true situation. When parametric inference is
impossible or requires complicated formulas for the calculation of standard errors, the
bootstrap (Efron & Tibshirani, 1993) would be a valuable alternative.
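To make the bootstrap alternative concrete, the following is a minimal sketch of a nonparametric bootstrap standard error for an equated score. It assumes an equivalent-groups design and uses linear equating as a simple stand-in for the equating transformation; it is not the kernel equating procedure used in the illustration. All function names, the simulated binomial scores, and the number of bootstrap samples are hypothetical choices.

```python
import random
import statistics


def linear_equate(x, x_scores, y_scores):
    """Linear equating of score x from form X to the scale of form Y."""
    mx, my = statistics.mean(x_scores), statistics.mean(y_scores)
    sx, sy = statistics.pstdev(x_scores), statistics.pstdev(y_scores)
    return my + sy / sx * (x - mx)


def bootstrap_se(x, x_scores, y_scores, n_boot=500, seed=1):
    """Bootstrap standard error of the equated value of score x."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        # Resample examinees with replacement, independently in each group.
        xb = rng.choices(x_scores, k=len(x_scores))
        yb = rng.choices(y_scores, k=len(y_scores))
        reps.append(linear_equate(x, xb, yb))
    return statistics.stdev(reps)


# Simulated number-correct scores on two 20-item forms (hypothetical data).
rng = random.Random(0)
x_scores = [sum(rng.random() < 0.6 for _ in range(20)) for _ in range(300)]
y_scores = [sum(rng.random() < 0.5 for _ in range(20)) for _ in range(300)]
se_at_10 = bootstrap_se(10, x_scores, y_scores)
```

Replacing `linear_equate` with any other equating function that takes the two samples as input would leave the resampling loop unchanged, which is what makes the bootstrap attractive when analytical standard errors are unavailable.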
The comparison of the standard errors obtained using the proposed approach with those
that would be obtained using bootstrap standard errors is a topic of future research. In the
future one should elaborate on comparing different approaches to equating in different
situations. In these future situations the analytical standard error of equating may not always
be available and thus a comparison with bootstrap standard errors will be of great importance.
All the evaluation measures used in this paper consider a point-wise comparison of each
score value in the score scale. In this sense, the equating transformation is being considered as
a multivariate functional parameter $\boldsymbol{\varphi}(\mathbf{x}) = (\varphi(0), \varphi(1), \ldots, \varphi(n))'$ so that each equated score is a
component in the vector equating parameter. If one is interested in globally assessing the
equating transformation, measures such as the MSE and bias should be reformulated in terms
of $\boldsymbol{\varphi}(\mathbf{x})$. This multivariate setting has been considered in Rijmen, Qu, and von Davier (2011)
and Andersson and Wiberg (2016) where the equating function is treated as a multivariate
parameter and the asymptotic covariance matrices of equating functions are derived for
hypothesis testing. Investigating the implications of this approach on the proposed evaluation
method described in this paper is a subject of future research.
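One possible global reformulation, given here only as an illustration and not as the authors' proposal, is the standard multivariate MSE. Writing the vector of estimated equated scores as $\hat{\boldsymbol{\varphi}}(\mathbf{x})$ and the vector of true equated scores as $\boldsymbol{\varphi}(\mathbf{x})$, the usual decomposition into a variance part and a squared-bias part reads:

```latex
\mathrm{MSE}\bigl(\hat{\boldsymbol{\varphi}}(\mathbf{x})\bigr)
  = E\Bigl[\bigl\lVert \hat{\boldsymbol{\varphi}}(\mathbf{x})
      - \boldsymbol{\varphi}(\mathbf{x}) \bigr\rVert^{2}\Bigr]
  = \operatorname{tr}\Bigl(\operatorname{Cov}\bigl(\hat{\boldsymbol{\varphi}}(\mathbf{x})\bigr)\Bigr)
  + \bigl\lVert E\bigl[\hat{\boldsymbol{\varphi}}(\mathbf{x})\bigr]
      - \boldsymbol{\varphi}(\mathbf{x}) \bigr\rVert^{2}
```

Under this definition the global MSE is simply the sum of the point-wise MSE values over the score points, so it ignores the covariances between equated scores that the asymptotic covariance matrices would capture.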
In this paper, only simulations were used but the outlined approach can also be used with
real data. If there is not an explicit probability model to generate data, a statistical model (e.g.,
a polynomial loglinear model or an IRT model) can be fit to the real data. Provided that the
model assumptions are fulfilled, the best possible model is chosen using a goodness-of-fit test,
AIC, or BIC, and it is treated as the true model. The true parameters that are needed to
obtain the true equated scores can be obtained from this true model, and these are then used to
obtain sample replicates from the large population of test data. This approach has been used in
some studies as mentioned in Kolen and Brennan (2014, p. 313) and references therein.
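The procedure described above can be sketched as follows. This is only an outline under strong simplifying assumptions: additive smoothing of the observed score frequencies stands in for fitting and selecting a loglinear or IRT model, which is not implemented here, and the resulting score probabilities are then treated as the true multinomial model. The function names, the smoothing constant, and the observed scores are hypothetical.

```python
import random
from collections import Counter


def fit_smoothed_probs(scores, max_score, alpha=0.5):
    """Score probabilities with additive smoothing (stand-in for a fitted model)."""
    counts = Counter(scores)
    total = len(scores) + alpha * (max_score + 1)
    return [(counts.get(s, 0) + alpha) / total for s in range(max_score + 1)]


def draw_replicates(probs, sample_size, n_rep, seed=0):
    """Draw n_rep samples of test scores from the model treated as true."""
    rng = random.Random(seed)
    support = list(range(len(probs)))
    return [rng.choices(support, weights=probs, k=sample_size) for _ in range(n_rep)]


# Hypothetical observed scores on a test with possible scores 0 to 10.
observed = [3, 4, 4, 5, 5, 5, 6, 6, 7, 2]
probs = fit_smoothed_probs(observed, max_score=10)
reps = draw_replicates(probs, sample_size=len(observed), n_rep=1000)
```

Each replicate can then be equated and compared with the equated scores implied by `probs`, which play the role of the true values.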
The focus in this paper has been on how to assess the equating transformation within an
observed-score equating framework. An area for future research is how to make comparisons
of equating transformations between different equating frameworks. Comparing across
different equating frameworks is, however, more complicated and difficult, if not impossible,
because the true value of the equating parameter differs between frameworks. This problem is
not shared with regular statistics where the parameter that indexes a probability distribution is
a unique abstract element with no specific mathematical form. How to proceed with a fair
comparison in this setting will depend on which frameworks the methods come from. If we
use, for example, IRT observed-score equating or local equating, the data need to fit an IRT
model. But does that automatically mean that this requirement favors the IRT observed-score
framework or the related local equating over, for example, kernel equating or equipercentile
equating? If we want to compare either of the estimators in the IRT observed-score
framework with the kernel equating framework, we have to decide which probability model is
the fairest to generate data from. One could argue, for example, that generating data with a
multinomial model will suit kernel equating better. Alternatively, such a decision might be
avoided if we use two equating models and compare the resulting scenarios of using two true
models, as we did in the illustration. The same issues will arise when more than two equating
transformations from different frameworks are to be compared because there might be a
number of possible candidates for the true equating transformation making the comparison
quite complicated. A recent reflection related to this is given in Chen (2012), who stated
that observed differences between equating methods can come both from the framework used
for equating and from the particular equating estimator. Thus, an important point of our
approach is to include all possible scenarios when comparing competing equating
transformations to assure that none are favored over any others. Chen (2012) concluded that
more research is needed to determine the best equating practice and that there is a need to find
a good and practical criterion to identify the most appropriate equating method. This paper
can be seen as one step in that direction. However, we still have to work with the two most
important questions of how one should define the true $\varphi(x)$ and how one should generate
data to avoid an unfair comparison of methods. These questions are ongoing challenges as
new methods are proposed and new comparisons are examined.
References
Andersson, B., Bränberg, K., & Wiberg, M. (2013). Performing the kernel method of test
equating using the package kequate. Journal of Statistical Software, 55, 1-25.
Andersson, B., & von Davier, A. A. (2014). Improving the bandwidth selection in kernel
equating. Journal of Educational Measurement, 51(3), 223-238.
Andersson, B., & Wiberg, M. (2016). Item response theory observed-score kernel equating.
Manuscript submitted for publication.
Braun, H. I., & Holland, P. W. (1982). Observed-score test equating: A mathematical analysis
of some ETS equating procedures. In P. W. Holland & D. B. Rubin (Eds.), Test equating
(pp. 9-49). New York: Academic Press.
Brossman, B. G. (2010). Observed score and true score equating procedures for
multidimensional item response theory. Doctoral thesis, University of Iowa.
Chen, H. (2012). A comparison between linear IRT observed-score equating and Levine
observed-score equating under the generalized kernel equating framework. Journal of
Educational Measurement, 49(3), 269-284.
Dorans, N. J., & Feigenbaum, M. D. (1994). Equating issues engendered by changes to the
SAT and PSAT/NMSQT (ETS Research Memorandum No. RM-94-10). Princeton, NJ:
ETS.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Chapman & Hall.
González, J. (2014). SNSequate: Standard and Nonstandard Statistical Models and Methods
for Test Equating. Journal of Statistical Software, 59(7), 1-30.
González, J., & von Davier, M. (2013). Statistical models and inference for the true equating
transformation in the context of local equating. Journal of Educational Measurement,
50(3), 315-320.
González, J., Wiberg, M., & von Davier, A. A. (in press). A note on the Poisson's binomial
distribution in item response theory. Applied Psychological Measurement.
Han, T., Kolen, M. J., & Pohlmann, J. (1997). A comparison among true- and observed-score
equating and traditional equipercentile equating. Applied Measurement in Education, 10,
105-121.
Harris, D. J., & Crouse, J. D. (1993). A study of criteria used in equating. Applied
Measurement in Education, 6, 195-240.
Häggström, J., & Wiberg, M. (2014). Optimal bandwidth in observed-score kernel equating.
Journal of Educational Measurement, 51(2), 201-211.
Jiang, Y., von Davier, A. A., & Chen, H. (2012). Evaluating equating results: percent relative
error for chained kernel equating. Journal of Educational Measurement, 49(1), 39-58.
Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and
practices (3rd ed.). New York: Springer.
Lee, Y.-H., & von Davier, A. A. (2011). Equating through alternative kernels. In A. A. von
Davier (Ed.), Statistical models for test equating, scaling, and linking (Chapter 10,
pp. 159-173). New York: Springer.
Liang, T., & von Davier, A. A. (2014). Cross-validation: An alternative bandwidth-selection
method in kernel equating. Applied Psychological Measurement, 38(4), 281-295.
Lord, F. M. (1980). Applications of item response theory to practical testing problems.
Hillsdale, NJ: Erlbaum.
Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile
observed-score equatings. Applied Psychological Measurement, 8, 452-461.
R Core Development Team (2014). R: A language and environment for statistical
computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
Rijmen, F., Qu, Y., & von Davier, A. A. (2011). Hypothesis testing of equating differences
in the kernel equating framework. In A. A. von Davier (Ed.), Statistical models for test
equating, scaling, and linking (Chapter 19, pp. 317-326). New York: Springer.
Skaggs, G., & Lissitz, R. W. (1986). IRT test equating: Relevant issues and a review of
recent research. Review of Educational Research, 56(4), 495-529.
van der Linden, W. J. (2011). Local observed-score equating. In A. A. von Davier (Ed.),
Statistical models for test equating, scaling, and linking (Chapter 13, pp. 201-223).
New York: Springer.
van der Linden, W. J., & Wiberg, M. (2010). Local observed-score equating with anchor-test
designs. Applied Psychological Measurement, 34, 620-640.
von Davier, A. A. (2011a). Statistical models for test equating, scaling, and linking. New
York: Springer.
von Davier, A. A. (2011b). A statistical perspective on equating test scores. In A. A. von
Davier (Ed.), Statistical models for test equating, scaling, and linking (Chapter 1,
pp. 1-17). New York: Springer.
von Davier, A. A. (2013). Observed-score equating: An overview. Psychometrika, 78(4),
605-623.
von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test
equating. New York: Springer.
Wiberg, M., & van der Linden, W. J. (2011). Linear local observed-score equating. Journal of
Educational Measurement, 48, 229-254.
Wiberg, M., van der Linden, W. J., & von Davier, A. A. (2014). Local observed-score kernel
equating. Journal of Educational Measurement, 51(1), 57-74.
Wilk, M. B., & Gnanadesikan, R. (1968). Probability plotting methods for the analysis of
data. Biometrika, 55, 1-17.
Yang, W.-L., Bontya, A. M., & Moses, T. P. (2011). Repeater effects on score equating for a
graduate admissions exam. Research Report ETS RR-11-17.