
PSYCHOMETRIKA — VOL. 58, NO. 3, 395–415, SEPTEMBER 1993

A DYNAMIC GENERALIZATION OF THE RASCH MODEL

N. D. VERHELST AND C. A. W. GLAS

NATIONAL INSTITUTE OF EDUCATIONAL MEASUREMENT (CITO), ARNHEM, THE NETHERLANDS

In the present paper a model for describing dynamic processes is constructed by combining the common Rasch model with the concept of structurally incomplete designs. This is accomplished by mapping each item on a collection of virtual items, one of which is assumed to be presented to the respondent dependent on the preceding responses and/or the feedback obtained. It is shown that, in the case of subject control, no unique conditional maximum likelihood (CML) estimates exist, whereas marginal maximum likelihood (MML) proves a suitable estimation procedure. A hierarchical family of dynamic models is presented, and it is shown how to test special cases against more general ones. Furthermore, it is shown that the model presented is a generalization of a class of mathematical learning models, known as Luce's beta-model.

Key words: Rasch model, missing data, incomplete designs, dynamic models, mathematical learning theory.

Introduction

Perhaps the most outstanding feature of IRT models is that interindividual differences in behavior are explained at the model level, implying that subjects need not be considered as statistical replicates of each other, and it is not necessary to sample multiple observations from the same subject under assumedly constant conditions. This is accomplished by introducing so-called incidental parameters, one parameter associated with each individual. Although these parameters can be treated formally as any other parameter in the model, a major problem is associated with their presence in parameter estimation. Maximizing the likelihood, for example, will not in general yield consistent estimates of the incidental parameters, nor of the other (sometimes called structural) parameters of the model (Neyman & Scott, 1948).

There are several ways to cope with this problem. Two estimation methods are often used in IRT. In the first method, sufficient statistics for the incidental parameters (if they exist) are considered constants, and the conditional likelihood, given these statistics, is maximized. This conditional likelihood is by definition independent of the incidental parameters, and Andersen (1973) proved that this method yields consistent estimates of the structural parameters. This method is commonly labeled conditional maximum likelihood (CML). In the other approach, the incidental parameters are no longer considered fixed constants, but are treated as realizations of a (nonobserved or latent) random variable, whose distribution is assumed to belong to a certain family of distributions. In many applications this family is parameterized by a finite number of parameters (for example, the family of normal distributions). Since the latent variable is not observed, it is integrated out of the likelihood function, yielding the so-called marginal likelihood function. This function is maximized with respect to the structural model parameters and the parameters of the distribution jointly, yielding marginal maximum likelihood (MML) estimates, which, by a seminal paper of Kiefer and Wolfowitz (1956), are proved to be consistent under very mild regularity conditions.

Requests for reprints should be sent to N. D. Verhelst, Cito PO Box 1034, 6801 MG Arnhem, THE NETHERLANDS.

0033-3123/93/0900-9135$00.75/0 © 1993 The Psychometric Society



The use of CML is restricted to models that have minimal sufficient statistics for the incidental parameters, and among the IRT models commonly applied very few are in this class (see Verhelst & Eggen, 1989, for a general characterization). The most famous example is the Rasch model, and both estimation methods have been successfully applied to this model (for CML, see Fischer, 1974, and Rasch, 1960; for MML, see Glas, 1989, and Thissen, 1982).

Several authors (Fischer, 1981; Mislevy, 1984) noted the possibility of obtaining consistent estimates of the structural parameters in the Rasch model from partially incomplete data or data collected in an incomplete design. It is the purpose of the present paper to investigate the applicability of the Rasch model in a dynamic context by manipulating the missing data concept on a set of complete data. This manipulation is convenient for getting around the restrictions implied by the central axiom of local stochastic independence common to most latent trait models. For one thing, this axiom implies that a correct response by a given subject is independent of the responses on previous items or trials, thus seemingly excluding the so-called subject controlled learning models, where the probability of a correct answer depends explicitly on the response pattern given thus far (Sternberg, 1963). Although there exist generalizations of the Rasch model where this dependence can be modeled explicitly (Jannarone, 1986; Kelderman, 1984), parameter estimation is difficult and limited to cases with a rather restricted number of parameters. Special mention needs to be made of the dynamic test model by Kempf (1974, 1977a, 1977b), which, in the sequel, will be contrasted with the model of the present paper. The approach used here combines the Rasch model with the missing data concept and with linear restrictions on the parameter space, yielding a wide range of dynamic models. The basic approach consists in conceiving of an item (or trial in a learning experiment) as a collection of "virtual" or "technical" items, one of which is supposed to be administered to each subject, depending, for example, on the pattern of previous responses. This is possible because of the basic symmetry in the Rasch model, where a change in the latent ability can equally well be considered a complementary change in the item difficulty. The details of this approach are the subject matter of the next section.

The idea presented here is not new; in fact, it has been considered by Fischer (1972, 1983). Working in the context of CML estimation, Fischer rejected this approach because the model thus constructed was not estimable. In the present paper, it will be shown that the models constructed are in general estimable under MML, and that statistical tests of special cases of the model against more general ones can easily be constructed. Finally, it will be shown that a class of mathematical learning models, known as Luce's beta-model, and generalizations thereof, are special cases of our model.

The Dynamic Rasch Model

In mathematical learning theory (see Sternberg, 1963, for a general introduction), the control of change in behavior is generally attributed to two classes of events: one is the behavior of the responding subject itself; the other comprises all events that occur independently of the subject's behavior, but which are assumed to change that behavior. Models that only allow for the former class are called "subject controlled"; if only external control is allowed, the model is "experimenter controlled"; and models where both kinds of control are allowed are labeled "mixed models". As an example of "experimenter control", assume that during test taking, the correct answer is given to the respondent after each response. If it is assumed that learning takes place under the influence of feedback (generally referred to as reinforcement) independent of the correctness of the given response, the model is experimenter controlled; if it is assumed,


however, that learning takes place only if the subject gave a wrong answer, the model is mixed. In the sequel it will be assumed that all controlling events can be binary coded, that the subject control can be modeled through the correctness of the responses on past items, and that experimenter control expresses itself at the level of the item. Let $\mathbf{X} = (X_1, \ldots, X_i, \ldots, X_k)$ be the vector of response variables, taking the value 1 for a correct answer and zero otherwise, and let $\mathbf{Z} = (Z_1, \ldots, Z_i, \ldots, Z_k)$ be the binary vector representing the reinforcement events occurring after the responses have been given. The value $Z_i$ is assumed to be independent of $\mathbf{X}$. The prototype of a situation where this independence is realized is the so-called prediction experiment, where the subject has to predict a sequence of independent Bernoulli events, indexed by a constant parameter $\pi$ (e.g., the prediction of the outcomes of coin tosses with a biased coin). The outcome itself is independent of the prediction, and it is observed frequently that in the long run the relative frequency of predicting each outcome approximately matches the objective probability of the outcomes (Estes, 1972). Therefore, it can be assumed that the outcome itself acts as a reinforcer and is the main determinant of the change in behavior. Notice that in the example given, $Z_i$ is independent of the vector of response variables $\mathbf{X}$. Therefore, any model that assumes control over the responses only through the outcomes is experimenter controlled.

Define the partial response vector $\mathbf{X}^i$ ($i > 1$) as

$$\mathbf{X}^i = (X_1, \ldots, X_{i-1}), \tag{1a}$$

and the partial reinforcement vector $\mathbf{Z}^i$ ($i > 1$) as

$$\mathbf{Z}^i = (Z_1, \ldots, Z_{i-1}). \tag{1b}$$

The model that will be considered is in its most general form given by

$$\Pr(X_i = 1 \mid \theta, \delta_i, \mathbf{x}^i, \mathbf{z}^i) = \frac{\exp[\theta + \delta_i + f_i(\mathbf{x}^i) + g_i(\mathbf{z}^i)]}{1 + \exp[\theta + \delta_i + f_i(\mathbf{x}^i) + g_i(\mathbf{z}^i)]}, \tag{2}$$

where $\theta$ is the latent variable, $\delta_i$ the "easiness" parameter of item $i$, $\mathbf{x}^i$ and $\mathbf{z}^i$ represent realizations of $\mathbf{X}^i$ and $\mathbf{Z}^i$, respectively, and $f_i(\cdot)$ and $g_i(\cdot)$ are arbitrary real-valued functions. Since the domain of these functions is discrete and finite, the possible function values can be conceived of as parameters. It is clear that the general model is not identified, because the number of parameters outgrows by far the number of possible response patterns. So some suitable restrictions will have to be imposed.

A general restriction, frequently applied in mathematical learning theory, is the requirement that the functions $f_i$ and/or $g_i$ are symmetric in their arguments, yielding models with commutative operators. Since the arguments are binary vectors, symmetry implies that the domain of the functions $f_i$ and $g_i$ can be restricted to the sum of the elements in the vectors $\mathbf{X}^i$ and $\mathbf{Z}^i$, respectively. Defining the variables $R_i$ and $S_i$ as

$$R_i = \begin{cases} \sum_{j=1}^{i-1} X_j, & (i > 1), \\ 0, & (i = 1), \end{cases} \tag{3a}$$

and

$$S_i = \begin{cases} \sum_{j=1}^{i-1} Z_j, & (i > 1), \\ 0, & (i = 1), \end{cases} \tag{3b}$$

with realizations $r_i$ and $s_i$, respectively, and assuming symmetry of the functions $f_i$ and $g_i$, (2) reduces to

$$\Pr(X_i = 1 \mid \theta, \delta_i, r_i, s_i) = \frac{\exp[\theta + \delta_i + \beta_i(r_i) + \gamma_i(s_i)]}{1 + \exp[\theta + \delta_i + \beta_i(r_i) + \gamma_i(s_i)]}, \tag{4}$$

where $\beta_i(0)$ and $\gamma_i(0)$ are defined to be zero for all $i$. If all $\beta$'s and $\gamma$'s are equal to zero, no transfer takes place, and (4) reduces to the common Rasch model; if all $\beta$'s are zero and at least one of the $\gamma$'s is not, the resulting model is experimenter controlled; if all $\gamma$'s are zero and at least one of the $\beta$'s is not, the model is subject controlled; and in the other cases a mixed model results. Notice that (4) implies that no forgetting occurs: an equal number of correct responses and/or an equal number of positive reinforcements has the same influence on behavior, irrespective of their temporal distance to the actual response. This somewhat unrealistic assumption is the price to be paid for the elegant mathematical implications of commutative operators. In the sequel, however, a model will be discussed where this symmetry is at least partially abandoned.
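To make (4) concrete, the following minimal Python sketch (not part of the original paper; all parameter values are hypothetical) evaluates the success probability for the subject-controlled special case in which every prior success adds a constant amount $\beta$ to the logit:

```python
import math

def p_correct(theta, delta_i, beta_term, gamma_term):
    """Success probability under model (4): a logistic function of
    theta + delta_i + beta_i(r_i) + gamma_i(s_i)."""
    return 1.0 / (1.0 + math.exp(-(theta + delta_i + beta_term + gamma_term)))

# Subject-controlled special case: all gammas are zero and
# beta_i(r_i) = r_i * beta, a constant gain per prior success.
beta = 0.5
for r in range(4):  # r = number of correct responses given so far
    print(r, round(p_correct(theta=0.0, delta_i=0.0,
                             beta_term=r * beta, gamma_term=0.0), 3))
```

With $\theta = \delta_i = 0$ and $\beta = 0.5$, the success probability climbs from .500 ($r = 0$) to about .818 ($r = 3$), which is exactly the kind of transfer effect the model is meant to capture.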

Using the concept of incomplete data, it can be illustrated how the models discussed above fit into the ordinary Rasch model. Only the subject controlled subcase of (4) will be considered in detail; that is, it is assumed that all $\gamma$'s are zero. Notice that it is assumed that the transfer taking place does not depend on the initial ability $\theta$ of the respondent, so any change in the ability (the increment $\beta_i(r_i)$) can be translated into a change of the difficulty of the items. Thus, the difficulty of an item is conceived of as being composed of an "intrinsic" parameter $\delta_i$ and a dynamic component $\beta_i(r_i)$. The latter is not constant, but depends on the rank number of the item in the sequence (hence, the subscripted $r$), as well as on the specific capability of the item to induce learning effects (hence, the subscripted $\beta$). These two effects can in principle be disentangled by experimental design, but, so as not to overcomplicate the model, it will be assumed that the order of presentation is the same for all subjects.

Let a "real" item i be associated with a collection of "virtual" items, denoted by the ordered pair (i, j ) , j = 0, . . . , i - 1. The virtual item (i, j ) is assumed to be presented to all subjects who gave exactly j correct responses to the i - 1 preceding "real" items. Associated with response pattern X is a design vector D(X), with ele- ments D(X)ij (i = I, . . . , k; j = O, . . . , i - 1) defined by

{ l o i f R i = J , D(X)e = otherwise. (5)

Response pattern X is transformed into a response pattern Y(X) with elements Y(X)ij ( i = 1 , . . . , k ; j = 0 . . . . , i - 1) u s i n g

f ! if D(X)ij = 1 and X i --- I,

y(x)0 = if D(X)ij I and Xi = 0,

if O(X)ij O,

(6)


TABLE 1
The Transformation from Real to Virtual Items for k = 3

   real items                 virtual items
   1  2  3    (1,0) (2,0) (2,1) (3,0) (3,1) (3,2)   sum score
   1  1  1      1     *     1     *     *     1         3
   1  1  0      1     *     1     *     *     0         2
   1  0  1      1     *     0     *     1     *         2
   1  0  0      1     *     0     *     0     *         1
   0  1  1      0     1     *     *     1     *         2
   0  1  0      0     1     *     *     0     *         1
   0  0  1      0     0     *     1     *     *         1
   0  0  0      0     0     *     0     *     *         0

where $c$ is an arbitrary constant. In Table 1 the set of all possible response patterns $\mathbf{X}$ and the associated vectors $\mathbf{Y}(\mathbf{X})$ are displayed for $k = 3$; the constant $c$ is replaced by a star. It is easily verified that $\mathbf{Y}(\mathbf{X})$ and $\mathbf{D}(\mathbf{X})$ are a unique transformation of $\mathbf{X}$.
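Since the transformation from $\mathbf{X}$ to $\mathbf{D}(\mathbf{X})$ and $\mathbf{Y}(\mathbf{X})$ is purely mechanical, a small Python sketch may clarify it (this is an illustration, not code from the paper; the string "*" stands in for the arbitrary constant $c$):

```python
def virtual_coding(x, c="*"):
    """Map a response pattern x = (x_1, ..., x_k) to the design
    indicators D(X) of (5) and the recoded responses Y(X) of (6).
    Virtual item (i, j) is 'administered' when exactly j of the
    i - 1 preceding real items were answered correctly."""
    d, y, r = {}, {}, 0            # r tracks R_i, the prior successes
    for i in range(1, len(x) + 1):
        for j in range(i):         # j = 0, ..., i - 1
            d[(i, j)] = 1 if r == j else 0
            y[(i, j)] = x[i - 1] if r == j else c
        r += x[i - 1]
    return d, y

d, y = virtual_coding((1, 0, 1))
print([y[key] for key in sorted(y)])
# [1, '*', 0, '*', 1, '*'] -- the '1 0 1' row of Table 1
```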

The probability of an observed response pattern $\mathbf{x}$, that is, $\mathbf{y}(\mathbf{x})$ and $\mathbf{d}(\mathbf{x})$, can now be derived as follows. Let $\boldsymbol{\xi}$ be a $k(k + 1)/2$ dimensional vector with elements $\xi_{ij}$ ($i = 1, \ldots, k$; $j = 0, \ldots, i - 1$), where $\xi_{ij} = \delta_i + \beta_i(j)$. Then,

$$\Pr(\mathbf{x} \mid \theta, \boldsymbol{\xi}) = \Pr(x_1 \mid \theta, \boldsymbol{\xi}) \prod_{i>1} \Pr(x_i \mid \mathbf{x}^i, \theta, \boldsymbol{\xi}) = \Pr(x_1 \mid \theta, \boldsymbol{\xi}) \prod_{i>1,\,j} \Pr(y(\mathbf{x})_{ij} \mid d(\mathbf{x})_{ij}, \theta, \boldsymbol{\xi}) = \prod_{i,j} \frac{\exp[y(\mathbf{x})_{ij}\, d(\mathbf{x})_{ij}\, (\theta + \xi_{ij})]}{[1 + \exp(\theta + \xi_{ij})]^{d(\mathbf{x})_{ij}}}. \tag{7}$$

The second line of the derivation is motivated by the fact that the response $x_i$ is recoded as $y(\mathbf{x})_{ij}$ and the item administration variable $d(\mathbf{x})_{ij}$ is determined by the previous responses $\mathbf{x}^i$. The next line of the derivation follows from the assumption that, if $d(\mathbf{x})_{ij} = 1$, the Rasch model holds, and that $\Pr(y(\mathbf{x})_{ij} = c \mid d(\mathbf{x})_{ij} = 0, \theta, \boldsymbol{\xi}) = 1$. Notice that (7) is a reparametrization of the original model. The advantage of writing the model in this fashion is that (7) is a simple generalization of the likelihood function for the Rasch model to incomplete designs (see Fischer, 1981). Of course, a similar derivation may be made for experimenter controlled models, and a straightforward generalization yields a similar result for mixed models. Although the likelihood function for the experimenter controlled case is formally identical to (7), with the design vector being a function of $\mathbf{Z}$ instead of $\mathbf{X}$, the estimation problems are quite different, as discussed in the sequel.
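A hedged numerical illustration of (7): the Python sketch below (hypothetical parameters, with $\xi_{ij} = \delta_i + j\beta$, $\delta_i = 0$ and $\beta = 0.5$, as in restriction (28) further on) computes $\Pr(\mathbf{x} \mid \theta, \boldsymbol{\xi})$ by visiting only the virtual items that are administered; the probabilities of all $2^k$ patterns sum to one, confirming that (7) defines a proper distribution:

```python
import math
from itertools import product

def pattern_probability(x, theta, xi):
    """Pr(x | theta, xi) as in (7): a Rasch probability for each
    administered virtual item (i, r_i), multiplied over items."""
    prob, r = 1.0, 0
    for i in range(1, len(x) + 1):
        p = math.exp(theta + xi[(i, r)]) / (1.0 + math.exp(theta + xi[(i, r)]))
        prob *= p if x[i - 1] == 1 else 1.0 - p
        r += x[i - 1]
    return prob

k = 3
xi = {(i, j): 0.5 * j for i in range(1, k + 1) for j in range(i)}
print(sum(pattern_probability(x, 0.0, xi) for x in product((0, 1), repeat=k)))
# prints 1.0 (up to rounding)
```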


Estimation in the Dynamic Rasch Model

Glas (1988) has investigated the estimation problems in the Rasch model for so-called multi-stage testing procedures, that is, designs where the sequence of tests administered is controlled by the sequence of test scores of the respondent. On the level of virtual items, the design used in the present paper to analyze data under a subject controlled model can be viewed as a limiting case of a multi-stage testing design, where all tests consist of one item only, and the next test to be administered depends on the sum score $R_i$ obtained in the preceding tests. The main result of Glas is the conclusion that in the case of a multi-stage design, the CML estimation equations have no unique solution, while MML, in general, does yield consistent estimates.

With respect to CML estimation, the fact that the equations have no unique solution directly follows from the analogy between the multi-stage testing design considered by Glas and the present design. However, since the latter design is an extreme case of the former, the problem of obtaining CML estimates emerges even more profoundly in the present case, as the following argument may reveal. The test administration design and the sum score jointly are a sufficient statistic for $\theta$. CML estimation amounts to maximizing the conditional likelihood function, where the condition is the sum score and the design jointly (see Glas, 1988, for details). From inspection of Table 1, it is easily verified that, given the design (i.e., the location of the stars) and the sum score, the response vector $\mathbf{Y}(\mathbf{X})$ is completely determined. This means that the conditional sample space is a singleton, its likelihood is trivially identical to 1, and it cannot be used to estimate the structural parameters, or any functions of the structural parameters. The situation is totally different for experimenter controlled models, where the control over the design vector is completely independent of the responses. In that case, a sum score $r$ obtained on a subtest of $k$ items can originate from $\binom{k}{r}$ different response patterns. So conditioning on the design implies no restriction whatsoever on the sample space of $\mathbf{X}$, and CML can be applied straightforwardly. It should be stressed that the well-known theorems on the existence and uniqueness of CML estimates by Fischer (1981, 1983) only apply to situations where, given the design, there is no restriction on the set of possible response patterns.

The applicability of the MML procedure also follows from recognizing the analogy between the multi-stage testing design considered by Glas and the present design. In the framework of the assumption that the ability parameter is a stochastic variable, Glas (1988) shows that the dependency between the design and the responses does not result in a likelihood function that is different from the usual likelihood function for the Rasch model. So, translated to the present problem, the likelihood function to be maximized is given by

$$\ln L(\boldsymbol{\xi}, \boldsymbol{\varphi} \mid \mathbf{X}) = \sum_n \ln \int \Pr(\mathbf{x}_n \mid \theta_n, \boldsymbol{\xi})\, g(\theta_n; \boldsymbol{\varphi})\, d\theta_n, \tag{8}$$

where $\mathbf{X}$ stands for the data, $\mathbf{x}_n$ is the response pattern of person $n$, and $g(\cdot\,; \boldsymbol{\varphi})$ is the probability density function of $\theta_n$, indexed by the parameter vector $\boldsymbol{\varphi}$. The equivalence of (8) and the usual item response log-likelihood function with incomplete data makes it possible to directly use the estimation procedures for the latter family of models, such as MML estimation procedures for models with a normal distribution of ability (Bock & Aitkin, 1981; Mislevy, 1984), and nonparametric MML estimation procedures for models where the distribution function of ability is represented by a step function (de Leeuw & Verhelst, 1986; Follmann, 1988). To avoid misunderstanding, it is important to emphasize that the missing data conception of the model is introduced to create a connection with the theory of the Rasch model. However, the model can also be studied without the missing data concept, by noticing that, if for all possible response patterns $\mathbf{x}$, $n_{\mathbf{x}}$ is the count of all respondents producing $\mathbf{x}$, the counts $n_{\mathbf{x}}$ have a parametric multinomial distribution with parameters $N$ and $\Pr(\mathbf{x}; \boldsymbol{\xi}, \boldsymbol{\varphi})$.

The estimation procedure for the Rasch model with incomplete data and a normal ability distribution is described in detail by Glas and Verhelst (1989) and needs no further comment. In the next section, the nonparametric version of the MML estimation procedure for the dynamic Rasch model will be studied.

A Nonparametric MML Estimation Procedure

Before deriving the actual estimation equations, an identity by Louis (1982) will be applied to IRT models. The identity proves very useful in the framework of MML estimation in general. Since the log likelihood given in (8) must be maximized, the estimation equations are given by

$$\Lambda(\boldsymbol{\xi}) \equiv \frac{\partial}{\partial \boldsymbol{\xi}} \ln L(\boldsymbol{\xi}, \boldsymbol{\varphi} \mid \mathbf{X}) = \mathbf{0}, \tag{9a}$$

and

$$\Lambda(\boldsymbol{\varphi}) \equiv \frac{\partial}{\partial \boldsymbol{\varphi}} \ln L(\boldsymbol{\xi}, \boldsymbol{\varphi} \mid \mathbf{X}) = \mathbf{0}. \tag{9b}$$

The vector of partial derivatives $\Lambda(\boldsymbol{\xi})$ can be expanded as

$$\Lambda(\boldsymbol{\xi}) = \sum_n \frac{\partial}{\partial \boldsymbol{\xi}} \ln \int \Pr(\mathbf{x}_n \mid \theta_n; \boldsymbol{\xi})\, g(\theta_n; \boldsymbol{\varphi})\, d\theta_n = \sum_n \frac{\int \left[\dfrac{\partial}{\partial \boldsymbol{\xi}} \ln \Pr(\mathbf{x}_n \mid \theta_n; \boldsymbol{\xi})\right] \Pr(\mathbf{x}_n \mid \theta_n; \boldsymbol{\xi})\, g(\theta_n; \boldsymbol{\varphi})\, d\theta_n}{\int \Pr(\mathbf{x}_n \mid \theta_n; \boldsymbol{\xi})\, g(\theta_n; \boldsymbol{\varphi})\, d\theta_n}, \tag{10}$$

where it is assumed that the order of integration and differentiation may be interchanged. Notice that

$$f(\theta_n \mid \mathbf{x}_n; \boldsymbol{\xi}, \boldsymbol{\varphi}) = \frac{\Pr(\mathbf{x}_n \mid \theta_n; \boldsymbol{\xi})\, g(\theta_n; \boldsymbol{\varphi})}{\int \Pr(\mathbf{x}_n \mid \theta_n; \boldsymbol{\xi})\, g(\theta_n; \boldsymbol{\varphi})\, d\theta_n} \tag{11}$$

is the density of $\theta_n$ given response pattern $\mathbf{x}_n$. Using

$$b(\boldsymbol{\xi}) \equiv \frac{\partial}{\partial \boldsymbol{\xi}} \ln L(\boldsymbol{\xi}, \boldsymbol{\varphi} \mid \boldsymbol{\theta}, \mathbf{X}) = \frac{\partial}{\partial \boldsymbol{\xi}} \sum_n \ln \Pr(\mathbf{x}_n \mid \theta_n; \boldsymbol{\xi}), \tag{12}$$

(10) can also be written as

$$\Lambda(\boldsymbol{\xi}) = E(b(\boldsymbol{\xi}) \mid \mathbf{X}; \boldsymbol{\xi}, \boldsymbol{\varphi}). \tag{13}$$

By an analogous argument, it can also be shown that


$$\Lambda(\boldsymbol{\varphi}) = E(b(\boldsymbol{\varphi}) \mid \mathbf{X}; \boldsymbol{\xi}, \boldsymbol{\varphi}), \tag{14}$$

with

$$b(\boldsymbol{\varphi}) \equiv \frac{\partial}{\partial \boldsymbol{\varphi}} \ln L(\boldsymbol{\xi}, \boldsymbol{\varphi} \mid \boldsymbol{\theta}, \mathbf{X}) = \frac{\partial}{\partial \boldsymbol{\varphi}} \sum_n \ln g(\theta_n; \boldsymbol{\varphi}). \tag{15}$$

In general, if $\boldsymbol{\lambda}$ is the vector of all parameters in the model, that is, $\boldsymbol{\lambda}' = (\boldsymbol{\xi}', \boldsymbol{\varphi}')$, (13) and (14) can be summarized as

$$\Lambda(\boldsymbol{\lambda}) = E(b(\boldsymbol{\lambda}) \mid \mathbf{X}; \boldsymbol{\lambda}), \tag{16}$$

with $b(\boldsymbol{\lambda}) = \partial/\partial\boldsymbol{\lambda}\, \ln L(\boldsymbol{\lambda} \mid \boldsymbol{\theta}, \mathbf{X})$, and the MML estimation equations are given by $\Lambda(\boldsymbol{\lambda}) = \mathbf{0}$. Identity (16), which was first derived by Louis (1982) in the framework of the EM algorithm, is important for the derivation of MML estimation equations because, as will become clear in the sequel, $b(\boldsymbol{\lambda})$ is usually easy to derive. The interpretation of $b(\boldsymbol{\lambda})$ is simple: it is the vector of first derivatives of the log-likelihood function given that $\mathbf{X}$ as well as $\boldsymbol{\theta}$ are observed. The corresponding vector of derivatives given only $\mathbf{X}$ is the expected value of $b(\boldsymbol{\lambda})$ in the conditional distribution of $\boldsymbol{\theta}$ given $\mathbf{X}$.

The general idea of nonparametric estimation is to leave the function $g(\theta)$ as unspecified as possible. For the simple Rasch model (de Leeuw & Verhelst, 1986; Follmann, 1988), but also for more general models (Laird, 1978), it turns out that the distribution function $G(\theta)$ may be chosen as a step function with a finite number, $m$, of steps or nodes. This means that $\theta$ is considered a discrete random variable that can assume only a finite number of values with nonzero probability. In model estimation, the nodes and their associated probabilities are treated as parameters. For the present model no formal proof is available that the likelihood can be maximized with respect to all possible distributions by this procedure. Nevertheless, the assumption of an unspecified discrete distribution is less restrictive than the assumption of a normal distribution. The formal derivation of the estimation equations is independent of the existence theorem. The only problem is the determination of $m$, the number of nodes. This will be discussed in the sequel.
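As a sketch of what the nonparametric approach amounts to computationally, the marginal log likelihood of (8) with a step distribution reduces to a finite sum over the $m$ nodes. The Python fragment below (illustrative only; it reuses the sequential virtual-item bookkeeping introduced earlier) evaluates this discrete marginal log likelihood for given nodes and weights:

```python
import math

def marginal_loglik(patterns, counts, xi, nodes, weights):
    """Discrete analogue of (8): theta equals nodes[q] with
    probability weights[q], so the integral is a sum over q."""
    total = 0.0
    for x, n_x in zip(patterns, counts):
        marg = 0.0
        for theta, w in zip(nodes, weights):
            like, r = 1.0, 0
            for i in range(1, len(x) + 1):
                p = 1.0 / (1.0 + math.exp(-(theta + xi[(i, r)])))
                like *= p if x[i - 1] == 1 else 1.0 - p
                r += x[i - 1]          # update the partial sum score
            marg += w * like
        total += n_x * math.log(marg)
    return total
```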

To derive the actual estimation equations, assume that the random variable $\theta$ can take only $m$ values $\theta_1, \ldots, \theta_m$ with probabilities $\pi_1, \ldots, \pi_m$, respectively, such that $\sum_{q=1}^{m} \pi_q = 1$. To avoid the bounded parameters $\pi_q$, the following reparametrization is used:

$$\pi_q = \frac{\exp(\omega_q)}{1 + \sum_{p=2}^{m} \exp(\omega_p)}. \tag{17}$$

Formula (17) is a simple reparametrization mapping the $m - 1$ free $\pi$-values onto $\mathbb{R}^{m-1}$ and, at the same time, keeping the sum of the probabilities equal to one by the implied definition $\omega_1 \equiv 0$. The random variable $T_{nq}$ is defined to be 1 if the $\theta$-value of the $n$-th subject equals $\theta_q$, and 0 otherwise. Furthermore, define $\mathbf{T}_n = (T_{n1}, \ldots, T_{nm})'$ and $\mathbf{T} = (t_{nq})$. Notice that the vector $\boldsymbol{\varphi}$, denoting the set of parameters describing the $\theta$-distribution, is now defined as $\boldsymbol{\varphi} = (\theta_1, \ldots, \theta_m, \pi_2, \ldots, \pi_m)'$ or, equivalently, $\boldsymbol{\varphi} = (\theta_1, \ldots, \theta_m, \omega_2, \ldots, \omega_m)'$. Let $\ln g(\mathbf{t}_n; \boldsymbol{\varphi}) = \sum_{q=1}^{m} t_{nq} \ln \pi_q$. Using the shorthand notation $d_{nij}$ for $d(\mathbf{x}_n)_{ij}$ and $y_{nij}$ for $y(\mathbf{x}_n)_{ij}$, the log likelihood given the observed data $\mathbf{X}$ and the unobserved data $\mathbf{T}$ is


$$\ln L(\boldsymbol{\xi}, \boldsymbol{\varphi} \mid \mathbf{X}, \mathbf{T}) = \sum_n [\ln \Pr(\mathbf{x}_n \mid \mathbf{t}_n; \boldsymbol{\xi}) + \ln g(\mathbf{t}_n; \boldsymbol{\varphi})] = \sum_n \sum_q t_{nq} \left\{ \sum_{i,j} d_{nij} [y_{nij} \ln \psi_{ijq} + (1 - y_{nij}) \ln(1 - \psi_{ijq})] + \ln \pi_q \right\}, \tag{18}$$

with

$$\psi_{ijq} = \frac{\exp(\theta_q + \xi_{ij})}{1 + \exp(\theta_q + \xi_{ij})}. \tag{19}$$

In the following, it will prove convenient to use the shorthand notation

$$F_{nq} \equiv \Pr(T_{nq} = 1 \mid \mathbf{x}_n; \boldsymbol{\xi}, \boldsymbol{\varphi}) = \frac{\Pr(\mathbf{x}_n \mid T_{nq} = 1; \boldsymbol{\xi}, \theta_q)\, \pi_q}{\sum_p \Pr(\mathbf{x}_n \mid T_{np} = 1; \boldsymbol{\xi}, \theta_p)\, \pi_p}. \tag{20}$$

Notice that (20) is the discrete analogue of (11). The estimation equations are given by

$$N \pi_q = \sum_n F_{nq}, \qquad q = 1, \ldots, m, \tag{21}$$

$$\sum_n \sum_{i,j} y_{nij}\, d_{nij}\, F_{nq} = \sum_n \sum_{i,j} d_{nij}\, \psi_{ijq}\, F_{nq}, \qquad q = 1, \ldots, m, \tag{22}$$

$$\sum_n y_{nij}\, d_{nij} = \sum_n \sum_q d_{nij}\, \psi_{ijq}\, F_{nq}, \qquad \text{for all } i, j. \tag{23}$$

Proof. To derive the estimation equations for the weights (17), notice that from (18) it follows that

$$b(\omega_q) = \frac{\partial}{\partial \omega_q} \sum_n \sum_p t_{np} \ln \pi_p = \sum_n (t_{nq} - \pi_q),$$

and applying (16) results in $E(b(\omega_q) \mid \mathbf{X}; \boldsymbol{\xi}, \boldsymbol{\varphi}) = \sum_n E(t_{nq} - \pi_q \mid \mathbf{x}_n; \boldsymbol{\xi}, \boldsymbol{\varphi}) = \sum_n (F_{nq} - \pi_q)$. Using (18), the equations for the nodes $\theta_q$ follow from

$$b(\theta_q) = \frac{\partial}{\partial \theta_q} \sum_n t_{nq} \sum_{i,j} d_{nij} [y_{nij} \ln \psi_{ijq} + (1 - y_{nij}) \ln(1 - \psi_{ijq})] = \sum_n t_{nq} \sum_{i,j} d_{nij} (y_{nij} - \psi_{ijq}),$$

and from (16) it follows that

$$E(b(\theta_q) \mid \mathbf{X}; \boldsymbol{\xi}, \boldsymbol{\varphi}) = \sum_n F_{nq} \sum_{i,j} d_{nij} (y_{nij} - \psi_{ijq}).$$

Finally, the equations for the item parameters follow from

$$b(\xi_{ij}) = \frac{\partial}{\partial \xi_{ij}} \ln L(\boldsymbol{\xi}, \boldsymbol{\varphi} \mid \mathbf{X}, \mathbf{T}) = \sum_n d_{nij} \Big( y_{nij} - \sum_q t_{nq}\, \psi_{ijq} \Big)$$

and $E(b(\xi_{ij}) \mid \mathbf{X}; \boldsymbol{\xi}, \boldsymbol{\varphi}) = \sum_n d_{nij}\, y_{nij} - \sum_n d_{nij} \sum_q \psi_{ijq}\, F_{nq}$. □

Equations (21) through (23) may be solved in different ways. A fast method is the Newton–Raphson (NR) algorithm, which, however, requires good initial estimates. A slower but much safer algorithm is the EM algorithm. In the E-step, the functions $F_{nq}$ are evaluated at the current estimates; in the M-step, (21) through (23) are solved while keeping the functions $F_{nq}$ fixed; the solution yields the estimates for the next E-step. A combination of the two algorithms, starting with EM for a few iterations and then switching to NR, proved to be a fruitful approach. For the derivation of second-order derivatives, for use in the NR algorithm, and for the evaluation of the asymptotic covariance matrix of the estimates, which is consistently estimated by inverting the observed information matrix, relatively simple expressions can be derived using the theory of Louis (1982). Let the observed information matrix be given by

$$I(\boldsymbol{\lambda}, \boldsymbol{\lambda}) = -\frac{\partial^2 \ln L(\boldsymbol{\lambda} \mid \mathbf{X})}{\partial \boldsymbol{\lambda}\, \partial \boldsymbol{\lambda}'}. \tag{24}$$

Then from Louis (1982), it follows that

$$I(\boldsymbol{\lambda}, \boldsymbol{\lambda}) = -E(B(\boldsymbol{\lambda}, \boldsymbol{\lambda}) \mid \mathbf{X}; \boldsymbol{\lambda}) - E(b(\boldsymbol{\lambda})\, b(\boldsymbol{\lambda})' \mid \mathbf{X}; \boldsymbol{\lambda}) + \Lambda(\boldsymbol{\lambda})\, \Lambda(\boldsymbol{\lambda})', \tag{25}$$

with

$$B(\boldsymbol{\lambda}, \boldsymbol{\lambda}) = \frac{\partial^2 \ln L(\boldsymbol{\lambda} \mid \mathbf{X}, \boldsymbol{\theta})}{\partial \boldsymbol{\lambda}\, \partial \boldsymbol{\lambda}'}. \tag{26}$$

Notice that at the MML estimate of $\boldsymbol{\lambda}$, $\Lambda(\boldsymbol{\lambda}) = \mathbf{0}$. Explicit expressions of (25) for the present model are derived in the Appendix.
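To fix ideas, here is a schematic Python sketch (an illustration under the notation of (20) and (21), not the authors' implementation) of one EM cycle for the discrete model: the E-step computes the posterior node probabilities $F_{nq}$, and the M-step for the weights is the closed-form update implied by (21); the nodes and item parameters of (22) and (23) would be updated numerically inside the same loop:

```python
import math

def e_step(patterns, xi, nodes, weights):
    """Posterior node probabilities F_nq of (20), one row per pattern."""
    F = []
    for x in patterns:
        liks = []
        for theta in nodes:
            like, r = 1.0, 0
            for i in range(1, len(x) + 1):
                p = 1.0 / (1.0 + math.exp(-(theta + xi[(i, r)])))
                like *= p if x[i - 1] == 1 else 1.0 - p
                r += x[i - 1]
            liks.append(like)
        norm = sum(w * l for w, l in zip(weights, liks))
        F.append([w * l / norm for w, l in zip(weights, liks)])
    return F

def m_step_weights(F, counts):
    """Weight update implied by (21): N * pi_q = sum over persons of F_nq."""
    N = float(sum(counts))
    return [sum(c * F_row[q] for c, F_row in zip(counts, F)) / N
            for q in range(len(F[0]))]
```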

Linear Restrictions on the Parameter Space

Starting with the model defined by (7) and (8), it is possible to derive several interesting special cases by imposing linear restrictions on the parameters $\boldsymbol{\xi}$. Let $\boldsymbol{\eta}$ be an $m$-dimensional vector, $m < k(k + 1)/2$, such that $\boldsymbol{\xi} = \mathbf{Q}\boldsymbol{\eta}$, with $\mathbf{Q}$ a (constant) matrix of rank $m$. Notice that the dimension of $\boldsymbol{\eta}$ is less than the number of virtual items. It can easily be verified that the estimation equations for this class of models are given by

$$\mathbf{Q}'\Lambda(\boldsymbol{\xi}) = \mathbf{0}, \quad \text{that is,} \quad \mathbf{Q}'E(b(\boldsymbol{\xi}) \mid \mathbf{X}; \boldsymbol{\xi}, \boldsymbol{\varphi}) = \mathbf{0}.$$

The following models are identified; a sketch of the $\mathbf{Q}$ matrix for one of them is given after the list.

i. The amount of learning depends on the number of successes on the preceding items, yielding the restrictions

$$\xi_{ij} = \begin{cases} \delta_i & \text{if } j = 0, \\ \delta_i + \beta_j & \text{if } j > 0. \end{cases} \tag{27}$$

ii. As a further restriction of the previous model, one may suppose that the amount of learning after each success is constant, yielding

$$\xi_{ij} = \begin{cases} \delta_i & \text{if } j = 0, \\ \delta_i + j\beta & \text{if } j > 0. \end{cases} \tag{28}$$


iii. A two-operator model can be constructed by assuming that the amount of change in latent ability is not only a function of the number of previous successes, but also of previous failures. The most general version of a two-operator model with commutative operators is given by

$$\xi_{ij} = \begin{cases} \delta_i & \text{if } i = 1, \\ \delta_i + \beta_j & \text{if } i > 1 \text{ and } j = i - 1, \\ \delta_i + \varepsilon_{i-1} & \text{if } i > 1 \text{ and } j = 0, \\ \delta_i + \beta_j + \varepsilon_{i-j-1} & \text{otherwise.} \end{cases} \tag{29}$$

It can easily be checked that (29) is a full-rank reparametrization of (7).

iv. Analogously to Case ii, one can assume that the amount of transfer is independent of the item, specializing Case iii further by imposing $\beta_j = j\beta$ and $\varepsilon_j = j\varepsilon$.

v. It can be assumed that the effect of an incorrect response is just the opposite of the effect of a correct response, by imposing the extra restriction $\beta = -\varepsilon$ on the model defined in Case iv.

vi. Finally, one could assume the limiting case where the amount of learning is the same irrespective of the correctness of the preceding responses. Formally, this can be modeled as a further restriction on Case iv by putting $\beta = \varepsilon$. In this case, however, the model is no longer identified if the rank order of presentation is the same for each respondent, because the parameter of each virtual item $(i, j)$ is given by $\delta_i + (i - 1)\beta$, and the value of $\beta$ can be freely chosen, because adding an arbitrary constant $c$ to $\beta$ can be compensated for by subtracting $c(i - 1)$ ($i > 1$) from $\delta_i$. Besides, in this case there is no subject control anymore. If, for example, the learning effect is caused by giving feedback after each response, or by the item text itself, or, in generic terms, by some reinforcer not under the control of the respondent, the above model is also a limiting case of experimenter control, in the sense that there is no variability in the reinforcement schedule. So the solution to the identification problem is simple: introducing variability in the reinforcement schedule will solve the problem. For this restricted model, where the amount of learning increases linearly with the rank number of the item in the sequence, it suffices to administer the test in two different orders of presentation to two equivalent samples of subjects. Let $i$ in the ordered pair $(i, j)$ (the virtual item) represent the identification label of the item and $j$ the rank number in the presentation; then $\xi_{ij} = \delta_i + (j - 1)\beta$, and the model is identified if there is at least one $i$ such that $(i, j)$ and $(i, j')$, $j \neq j'$, are virtual items.
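As announced above, the $\mathbf{Q}$ matrix for such restrictions is easy to write down explicitly. The Python sketch below (illustrative, with hypothetical helper names) builds $\mathbf{Q}$ for Case ii, restriction (28): the basic parameter vector is $\boldsymbol{\eta} = (\delta_1, \ldots, \delta_k, \beta)'$, and the row of $\mathbf{Q}$ for virtual item $(i, j)$ reproduces $\xi_{ij} = \delta_i + j\beta$:

```python
def q_matrix_case_ii(k):
    """Rows of Q, one per virtual item (i, j), such that xi = Q @ eta
    with eta = (delta_1, ..., delta_k, beta)' as in restriction (28)."""
    virtual_items = [(i, j) for i in range(1, k + 1) for j in range(i)]
    Q = []
    for (i, j) in virtual_items:
        row = [0.0] * (k + 1)
        row[i - 1] = 1.0          # coefficient of delta_i
        row[k] = float(j)         # coefficient of beta
        Q.append(row)
    return virtual_items, Q

items, Q = q_matrix_case_ii(3)
for item, row in zip(items, Q):
    print(item, row)              # e.g., (3, 2) [0.0, 0.0, 1.0, 2.0]
```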

Testing the Model

Testing the validity of the models presented above can be accomplished within the framework of parametric multinomial models. As has been mentioned above, this is justified by the fact that the counts $n_{\mathbf{x}}$ of respondents producing response pattern $\mathbf{x}$ have a multinomial distribution with parameters $N$ and $\Pr(\mathbf{x}; \boldsymbol{\lambda})$. Consequently, Pearson's chi-square statistic and the likelihood-ratio statistic can be directly applied. The degrees of freedom are given by the number of possible response patterns, minus the number of parameters to be estimated, minus one. The use of Pearson's chi-square statistic is rather limited because the number of response patterns grows exponentially with the number of items, so that even with a moderate number of items, many patterns will tend to have a very small expected frequency.

In many instances, one may be interested in testing the validity of a more restricted model against a more general case. Let $\boldsymbol{\eta}_2 = \mathbf{Q}_2\boldsymbol{\xi}$ represent the general case, and let $\boldsymbol{\eta}_1 = \mathbf{Q}_1\boldsymbol{\eta}_2 = \mathbf{Q}_1\mathbf{Q}_2\boldsymbol{\xi}$, with $m_1 = \operatorname{rank}(\mathbf{Q}_1) < \operatorname{rank}(\mathbf{Q}_2) = m_2$, represent the restricted model. Again, from the fact that both models are parametric multinomial models, standard asymptotic theory applies, and the likelihood-ratio statistic based on the two models has an asymptotic chi-square distribution with $m_2 - m_1$ degrees of freedom.
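Computationally, such a nested comparison is a one-liner once the two maximized log likelihoods are available. A sketch using SciPy (assumed available; the log-likelihood values and the degrees of freedom are placeholders to be supplied by the estimation step):

```python
from scipy.stats import chi2

def lr_test(loglik_restricted, loglik_general, df):
    """Likelihood-ratio test of a restricted model against a more
    general one; df = m2 - m1, the number of parameters set free."""
    lr = 2.0 * (loglik_general - loglik_restricted)
    return lr, chi2.sf(lr, df)   # statistic and asymptotic p-value

# e.g., testing restriction (28) against (27) with 5 freed parameters:
# lr, p = lr_test(ll_restricted, ll_general, df=5)
```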

The asymptotic theory of parametric multinomial models also applies if the distribution of the latent variable is discrete with a finite number $m$ of nodes, except if the model is overparametrized. Suppose the (absolute) maximum of the likelihood is attainable with a discrete distribution with $m - 1$ nodes. Estimating with $m$ nodes will yield the same function value, but the degrees of freedom must be calculated using $m - 1$ rather than $m$. Fortunately, overparametrization shows itself in one of two possible ways: either the probability associated with a node is zero, or two nodes are equal. Too small a number of nodes only means a restriction on the model space: one cannot claim to find the maximum of the likelihood over all possible distributions, but the asymptotic theory remains valid with respect to the restricted model space.

Examples

To illustrate some aspects of the model, an artificial example and an example with real data will be given. In the first example, data were generated for a test of 5 real items, which gives rise to 15 virtual items. The resulting 32 possible response patterns are shown in Table 2. The data were generated using the model given by (28) and a standard normal distribution of ability. All item parameters $\delta_i$ were set equal to zero; the learning parameter $\beta$ was set equal to 0.5.

In the column labeled "FREQUENCY", the numbers of observations on the response patterns are given for a study with 4000 respondents. Two MML estimation procedures were carried out: the first using a normal ability distribution, the second using three points on the latent continuum. The choice of the number of points is motivated by the upper bound given by de Leeuw and Verhelst (1986) for the Rasch model. It must be emphasized that this upper bound does not directly apply to the model with learning parameters; it only applies if all learning parameters are set equal to zero. In the absence of a more general theory for the number of points, the number for the Rasch model has been chosen. This is motivated by the fact that, in many instances, the hypothesis of no learning will be among the hypotheses to be tested. If the number of points is different for the Rasch model and a model with one or more learning parameters, it may be difficult to decide whether a difference in model fit is due to the additional parameters or due to the difference in the number of points. On the other hand, it must be admitted that adding points to models beyond the Rasch model, such as the model with a learning parameter, does lead to some gain in likelihood; the optimal strategies and theoretical implications of nonparametric estimation of IRT models beyond the Rasch model remain a point of study that is beyond the scope of the present paper.

The parameter estimates are given in Table 3. In the first column, the true values of the parameters are given. The next two columns contain the estimates and standard errors obtained using the normal assumption. The last two columns contain the estimates and standard errors for the nonparametric version of the estimation procedure. In the rows labeled "Point", the estimates of the parameters $\theta_q$ are given; the third point was fixed to identify the model. The rows labeled "Weight" contain the estimates of the parameters $\pi_q$. Information on model fit is summarized in Table 2, where the last four columns contain the expected frequencies and the contributions to Pearson's chi-square statistic, which, together with the likelihood-ratio statistic, is given at the bottom of the table.


TABLE 2
Expected and Observed Response Patterns and Model Fit with Artificial Data

                                                    normal            three-point
     response pattern y(x)                  freq.   expect  dev.      expect  dev.
 1   0 0 * 0 * * 0 * * * 0 * * * *           405    405.43   .00      405.76   .00
 2   1 * 0 * 0 * * 0 * * * 0 * * *            62     68.62   .63       67.85   .50
 3   0 1 * * 0 * * 0 * * * 0 * * *            97     87.25  1.08       86.82  1.19
 4   1 * 1 * * 0 * * 0 * * * 0 * *            21     25.45   .77       24.99   .63
 5   0 0 * 1 * * * 0 * * * 0 * * *            90     97.82   .62       98.66   .76
 6   1 * 0 * 1 * * * 0 * * * 0 * *            31     30.92   .00       30.07   .02
 7   0 1 * * 1 * * * 0 * * * 0 * *            28     40.92  4.08       39.81  3.50
 8   1 * 1 * * 1 * * * 0 * * * 0 *            22     22.03   .00       22.11   .00
 9   0 0 * 0 * * 1 * * * * 0 * * *           142    122.16  3.22      125.86  2.06
10   1 * 0 * 0 * * 1 * * * * 0 * *            37     41.89   .57       40.69   .33
11   0 1 * * 0 * * 1 * * * * 0 * *            53     55.74   .13       54.07   .02
12   1 * 1 * * 0 * * 1 * * * * 0 *            37     32.41   .64       31.96   .79
13   0 0 * 1 * * * 1 * * * * 0 * *            68     65.43   .10       63.86   .26
14   1 * 0 * 1 * * * 1 * * * * 0 *            42     41.31   .01       40.34   .06
15   0 1 * * 1 * * * 1 * * * * 0 *            55     57.38   .09       56.18   .02
16   1 * 1 * * 1 * * * 1 * * * * 0            70     65.14   .36       70.82   .00
17   0 0 * 0 * * 0 * * * 1 * * * *           159    157.49   .01      166.28   .31
18   1 * 0 * 0 * * 0 * * * 1 * * *            53     58.61   .53       57.08   .29
19   0 1 * * 0 * * 0 * * * 1 * * *            92     78.44  2.34       76.12  3.30
20   1 * 1 * * 0 * * 0 * * * 1 * *            51     49.13   .07       47.77   .21
21   0 0 * 1 * * * 0 * * * 1 * * *           107     92.58  2.24       90.25  3.10
22   1 * 0 * 1 * * * 0 * * * 1 * *            69     62.89   .59       60.51  1.18
23   0 1 * * 1 * * * 0 * * * 1 * *            79     87.75   .87       84.58   .36
24   1 * 1 * * 1 * * * 0 * * * 1 *           104    105.20   .01      111.45   .49
25   0 0 * 0 * * 1 * * * * 1 * * *           101    121.81  3.55      120.33  3.10
26   1 * 0 * 0 * * 1 * * * * 1 * *            82     89.65   .65       86.16   .20
27   0 1 * * 0 * * 1 * * * * 1 * *           111    125.73  1.72      120.96   .82
28   1 * 1 * * 0 * * 1 * * * * 1 *           178    161.42  1.70      168.17   .57
29   0 0 * 1 * * * 1 * * * * 1 * *           155    155.51   .00      151.11   .09
30   1 * 0 * 1 * * * 1 * * * * 1 *           225    216.11   .36      223.11   .01
31   0 1 * * 1 * * * 1 * * * * 1 *           326    316.64   .27      327.27   .00
32   1 * 1 * * 1 * * * 1 * * * * 1           848    860.98   .19      848.84   .00

CHI-SQUARE                                           27.53             24.34
DF                                                      24                21
PROB                                                   .28               .28
LIKELIHOOD RATIO STATISTIC                           28.29             24.78
DF                                                      24                21
PROB                                                   .25               .26

It can be seen that both models result in an acceptable fit. Since the data were generated using a normal distribution of ability, this is not unexpected. However, with real data the normal assumption may obstruct model fit. This will be illustrated with an example of seven items from an examination of language comprehension. The number of observations was 3968. Four models were considered: Model 1, $\xi_{ij} = \delta_i$, which is the Rasch model; Model 2, $\xi_{ij} = \delta_i + j\beta$, that is, the model where every correct response produces the same amount of learning; Model 3, $\xi_{ij} = \delta_i + \beta_j$, that is, the model where the amount of learning differs as a function of the number of correct responses given; and Model 4, where $\xi_{ij}$ is entirely free. The results are summarized in Table 4. In the first


TABLE 3
Parametric and Nonparametric MML Estimation with Artificial Data

                          normal assumption    three-point distribution
Parameter   true value    estimate   se        estimate   se
Item 1        .000          -.083   .039         -.104   .126
Item 2        .000          -.013   .074          .026   .142
Item 3        .000          -.069   .134          .034   .164
Item 4        .000          -.026   .200          .147   .215
Item 5        .000           .037   .266          .281   .268
Learning      .500           .481   .113          .367   .106
St.dev.      1.000          1.021   .154
Point 1                                        -1.8384   .212
Point 2                                         -.5617   .223
Point 3                                         1.0000   (fixed)
Weight 1                                         .1487   .072
Weight 2                                         .3782   .062
Weight 3                                         .4731   .059

TABLE 4
Parameter Estimation and Model Testing with Examination Data

Estimation with normal assumption:
Parameter   estimate   se
Item 1        .054    .038
Item 2       -.325    .038
Item 3      -1.780    .048
Item 4      -2.163    .053
Item 5      -1.759    .048
Item 6       -.232    .038
Item 7      -1.958    .050
St.dev.       .983    .036

Estimation using three-step function:
Parameter   estimate   se
Item 1       -.023    .040
Item 2       -.408    .060
Item 3      -1.852    .062
Item 4      -2.224    .065
Item 5      -1.831    .062
Item 6       -.314    .060
Item 7      -2.025    .063
Point 1     -1.091    .038
Point 2      -.155    .031
Point 3      1.000    (fixed)
Weight 1     .4090    .044
Weight 2     .1906    .049
Weight 3     .4004    .044

Model fit (normal assumption):
Model        l.r. statistic    df    prob.
(1)              150.65       119    .026
(2)              148.76       118    .029
(3)              129.71       113    .135
(4)               92.88        98    .627
(1) vs (2)         1.89         1    .170
(2) vs (3)        19.05         5    .002
(3) vs (4)        36.83        15    .001

Model fit (three-step function):
Model        l.r. statistic    df    prob.
(1)              137.28       116    .086
(2)              114.61       115    .493
(3)              107.59       110    .547
(4)               93.84        95    .527
(1) vs (2)        22.68         1    .000
(2) vs (3)         7.02         5    .219
(3) vs (4)        14.17        15    .512


part of the table, the parameter estimates for the Rasch model are given. It can be seen that the different assumptions do not produce substantially different results. However, the evaluation of model fit, which was carried out using the likelihood-ratio statistic, provides a different picture. Using a 5% significance level, the overall tests show that Models 1 and 2 are not acceptable under the normal assumption, while they are in the nonparametric approach. In the next three rows, specific models are tested against more general models using the likelihood-ratio statistics. It can be seen that under the normal assumption, Model 3 produces a significant result when compared to Model 4, so that only the most general model, Model 4, appears to be acceptable. In the case of nonparametric estimation, only the restriction of Model 1 versus Model 2 has to be rejected, yielding a much more parsimonious description of the data.

The Relationship with Mathematical Learning Theory

Mathematical learning theory is an area that gained much attention among mathematical psychologists in the early 1960s. Chapters 9, 10, and 17 of the Handbook of Mathematical Psychology (Luce, Bush, & Galanter, 1963) were entirely devoted to formal learning models and contain many interesting results. To get a good impression of the scope of these models, and of the problems that were recognized to be difficult, an example will be given. In the avoidance training experiment, an animal is placed in a box, where it can avoid an electric shock by jumping over a barrier within 10 seconds after the occurrence of a conditioned stimulus (a buzzer sound). In a simple learning model, it is assumed that (a) learning (i.e., a change in the tendency to avoid the threatening situation before being shocked) occurs only on escape trials, that is, when the animal escapes after being shocked; (b) the "inherent" difficulty of the situation is constant; and (c) there are no initial differences between the animals in the initial tendency to avoid shocks. Of course this theory implies subject control. If the trials are identified with "real items", then (b) and (c) imply that $\delta_i$ is constant, say $\delta_i = \delta$, and that the initial ability $\theta$ is constant. Letting an escape be a success, the probability of a success on trial $i$, given $j$ successes on previous trials, is given by

$$\Pr(X_i = 1 \mid v, R_i = j) = \frac{v\alpha^j}{1 + v\alpha^j}, \tag{30}$$

where $v = \exp(\theta - \delta)$ and $\alpha = \exp(\beta)$, which is known as Luce's (1959) one-operator beta model. Notice that (30) is a special instance of the above Case ii, defined by (28). If it is assumed that there is some learning following a failure (an avoidance), then

$$\Pr(X_i = 1 \mid v, R_i = j) = \frac{v\alpha_1^j \alpha_2^{i-j-1}}{1 + v\alpha_1^j \alpha_2^{i-j-1}}, \tag{31}$$

with $\alpha_1 = \exp(\beta)$ and $\alpha_2 = \exp(\varepsilon)$, which is equivalent to Luce's two-operator beta model. So this model is just a special instance of Case iv discussed above.
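A short simulation sketch (Python; all parameter values hypothetical) shows the dynamics of the two-operator beta model (31): the logit of a success starts at $\ln v$ and gains $\ln \alpha_1$ for every previous success and $\ln \alpha_2$ for every previous failure:

```python
import math
import random

def beta_model_trajectory(v, a1, a2, trials, rng=random.Random(1)):
    """Simulate success probabilities under (31): on trial i with j
    prior successes, p = v * a1**j * a2**(i - j - 1) / (1 + ...)."""
    probs, j = [], 0
    for i in range(1, trials + 1):
        num = v * a1 ** j * a2 ** (i - j - 1)
        p = num / (1.0 + num)
        probs.append(p)
        j += rng.random() < p     # Bernoulli draw; j counts successes
    return probs

print([round(p, 3) for p in beta_model_trajectory(
    v=1.0, a1=math.exp(0.5), a2=math.exp(0.1), trials=5)])
```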

The assumptions of no variability in the difficulty parameters or in the initial $\theta$ are characteristic of the many learning models developed roughly between 1955 and 1970. The lack of variability in the difficulty parameters may be attributed mainly to the fact that most applications concerned experiments with constant conditions over trials, while the assumed constancy in initial ability was recognized as a problem: ". . . in most applications of learning models, it is assumed that the same values of the initial probability . . . characterize all the subjects in an experimental group . . . It is convenience, not theory, that leads to the homogeneity assumption" (Sternberg, 1963, p. 99).


The convenience has to be understood as the lack of tools at that time to incorporate individual differences as a genuine model component, and the rather crude estimation procedures used. Maximum likelihood methods were used, although rarely, and only in conjunction with Luce's beta-model; but this model was by far less popular than the family of linear models introduced by Bush and Mosteller (1951), where the probability of a success can be expressed as a linear difference equation, while Luce's model can be written as a linear difference equation in the logit of the success probability. Most estimation methods in the linear model were modified moment methods, frequently yielding problems, because it was clearly acknowledged that suppression of interindividual variability would underestimate the variance of almost every statistic: ". . . unless we are interested specifically in testing the homogeneity assumption, it is probably unwise to use an observed variance as a statistic for estimation, and this is seldom done" (Sternberg, 1963, p. 99).

The next example that will be given serves a triple purpose: (i) it is an example of a mixed model with noncommuting operators, that is, (2) applies but not (4); (ii) it illustrates the rich imagination of the older learning theorists and at the same time their struggle to handle rather complex mathematical equations; and (iii) it yields a nice suggestion for constructing a statistical test of the axiom of local stochastic independence in the Rasch model. The model is the 'logistic' variant of the one-trial perseveration model of Sternberg (1959). The model was developed because the autocorrelation of lag 1 in the response pattern $\mathbf{X}$ was larger than predicted by the theory of the one-operator linear model, suggesting that apparently there was a tendency to repeat the preceding response (the experiment was a binary choice experiment, where one choice was systematically rewarded by a food pellet; the subjects were rats). Defining the choice of the nonrewarded response as a success, the model Sternberg proposed is given by

$$p_i = (1 - b)a^{i-1}p_1 + bX_{i-1}, \qquad (i \geq 2,\ 0 < a, b < 1), \tag{32}$$

where $p_i = \Pr(X_i = 1)$, $a$ is the parameter expressing the learning rate, and $b$ is the perseveration parameter, expressing the tendency to repeat the previous response. The logistic analogue, mentioned in Sternberg (1963, p. 36) but not analyzed, is given by

$$\operatorname{logit} p_i = \theta + (i - 1)\gamma + \beta X_{i-1}, \qquad (i > 1), \tag{33}$$

where $\theta = \operatorname{logit} p_1$ is treated as a constant. Equation (33) is readily recognized as a special case of (2), where $\delta_i = 0$, $g_i(\mathbf{z}^i) = (i - 1)\gamma$, and $f_i(\mathbf{x}^i) = \beta x_{i-1}$. Notice that $f_i(\cdot)$ is not symmetric, so (33) is not a special case of (4). Notice further that (33) is more flexible than (32): by the restrictions put on the perseveration parameter $b$, tendencies to alternate the response require another model, while in (33) a positive $\beta$ expresses a perseveration tendency, and a negative $\beta$ expresses a tendency to alternate.

It is immediately clear that (33) violates the assumption of local stochastic independence. Now suppose data collected with an attitude questionnaire are to be analyzed with the common Rasch model, but there is some suspicion of response tendencies, in the sense of, for example, a tendency to alternate responses. Model (33) is readily adapted to this situation: set $\gamma = 0$, and allow variation in the "difficulty" parameters $\delta_i$ and in the latent variable $\theta$. There are $2k - 1$ virtual items: $(i, 0)$ and $(i, 1)$ for $i > 1$, and $(1, 1) \equiv (1, 0)$, the second member of the ordered pair being equal to the previous response. The assumption of local stochastic independence can be tested with a likelihood-ratio test, as explained in the previous section, using the restricted model where $\beta = 0$; but this is nothing else than the common Rasch model.


It may be interesting to contrast the models discussed above with a dynamic item response model developed by Kempf (1974, 1977a, 1977b), in which it is assumed that

$$\Pr(X_i = 1 \mid \mathbf{x}^i, \zeta, \tau_{x^i}, \kappa_i) = \frac{\zeta + \tau_{x^i}}{\zeta + \kappa_i}, \tag{34}$$

where $\zeta$, $\zeta > 0$, is an ability parameter, $\kappa_i$, $\kappa_i > 0$, is the difficulty parameter of item $i$, and $\tau_{x^i}$ represents the transfer as a result of the partial response vector $\mathbf{x}^i$, with $0 \leq \tau_{x^i} < \kappa_i$. Defining $\sigma_i = \kappa_i^{-1}$ and dividing the numerator and the denominator of (34) by $\kappa_i$ yields

$$\Pr(X_i = 1 \mid \mathbf{x}^i, \zeta, \tau_{x^i}, \sigma_i) = \frac{(\zeta + \tau_{x^i})\sigma_i}{1 + \zeta\sigma_i}, \tag{35}$$

which is equivalent to the Rasch model if all parameters $\tau_{x^i}$ are equal to zero, that is, if no transfer exists. Of course, the model defined in (35), which associates a new transfer parameter with each partial response vector, is hardly interesting, but Kempf (1977b) demonstrates that the more restricted model defined by

$$\Pr(X_i = 1 \mid r_i, \zeta, \tau_{r_i}, \kappa_i) = \frac{\zeta + \tau_{r_i}}{\zeta + \kappa_i}, \tag{36}$$

with $r_i$ the partial sum score as defined in (3a), has sufficient statistics for the person parameters, allowing for CML estimation of the item and transfer parameters. Kempf (1977a, p. 306) notes that in the neighborhood of the maximum, the likelihood surface may be very flat, causing instability and large standard errors of the parameter estimates. Apart from the numerical difficulties of the model, it again proves interesting to connect this model with mathematical learning theory. Let the assumption that the difficulty of the situation does not change be translated into $\kappa_i = \kappa$, and let the assumption of no initial differences be translated into $\zeta = 0$. Further, let

$$\tau_{r_i} = \begin{cases} \tau_r, & \text{if trial } i - 1 \text{ was a success}, \\ \tau_{r-1}, & \text{if trial } i - 1 \text{ was a failure}. \end{cases}$$

If it is assumed that $\tau_r = a\tau_{r-1} + (1 - a)\kappa$, $(0 < a < 1)$, and $p_i$ and $p_{i-1}$ are the probabilities of a success on trial $i$ and $i - 1$, respectively, it follows that

$$p_i = \begin{cases} a p_{i-1} + (1 - a), & \text{if trial } i - 1 \text{ was a success}, \\ p_{i-1}, & \text{if trial } i - 1 \text{ was a failure}. \end{cases}$$

This shows that Kempf's model is easily specialized to the linear operator model of Bush and Mosteller (1951). Thus, the two basic approaches to mathematical learning theory, the linear operator model and the beta model, find natural generalizations in Kempf's model and the dynamic test model of the present paper, respectively.
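The recursion for $p_i$ displayed above is easy to trace numerically. A minimal sketch (illustrative initial probability $p_1$ and operator parameter $a$) of the implied Bush–Mosteller one-operator process:

```python
import random

def linear_operator_path(p1, a, trials, rng):
    """Trace p_i = a * p_{i-1} + (1 - a) after a success and
    p_i = p_{i-1} after a failure, starting from p_1."""
    p, path = p1, [p1]
    for _ in range(trials - 1):
        success = rng.random() < p          # Bernoulli draw with current p
        p = a * p + (1.0 - a) if success else p
        path.append(p)
    return path

print([round(v, 3) for v in linear_operator_path(0.2, 0.8, 6, random.Random(0))])
```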

Conclusion

The model proposed in this paper is in fact nothing else than the common Rasch model. Its power to handle dynamic data stems from the manipulation of the concept of incomplete designs by the introduction of "virtual items". The general model defined by (2), however, is much too general to be identified. In the present paper a subclass of


models, the class with commuting operators, is studied extensively. It is shown that simple likelihood-ratio tests are applicable, so model testing is straightforward. The approach, however, is not restricted to commuting operators, as is shown by the construction of a statistical test for the basic axiom of local stochastic independence. As to parameter estimation, it is shown that the result of Glas (1988) also applies in the case of subject controlled or mixed models: CML is not applicable for the present class of models, and recourse has to be taken to MML.

The comparison with a large class of mathematical learning theories, namely Luce's beta-model and all "logistic" variants, such as (33), which does not follow from Luce's choice axiom, revealed that these learning models are all special cases of (2) that are restricted in two respects: they do not allow variability in the "inherent" difficulty of the item or trial and, most important, they are not able to explain individual differences. At the time mathematical learning theories were flourishing, important papers, nowadays acknowledged as the founding papers of IRT (Ferguson, 1942; Lawley, 1943; Lord, 1952; Rasch, 1960), were available. The contact seemingly never took place. The model presented here does not have the two aforementioned restrictions: the variability of the latent variable is almost the defining characteristic of IRT models, and by allowing items or trials to have their own specific parameters, learning theories can find applications outside the psychological laboratories, for example, in educational testing.

Not all problems are solved, however. A very hard one was hidden in an ellipsis in the first quotation of Sternberg. The full text reads: ". . . in most applications of learning models, it is assumed that the same values of the initial probability and other parameters characterize all the subjects in an experimental group . . ." (1963, p. 99). The italicized text refers to the possibility that learning effects may also show individual differences. These, however, are not easily built into the general model: it is easy to attach a subscript to some parameter symbols, but not easy to estimate individual learning parameters.

Appendix: Derivation of the Observed Information Matrix

For two arbitrary parameters $\lambda_i$ and $\lambda_j$, the element $I(\lambda_i, \lambda_j)$ can be written as

$$I(\lambda_i, \lambda_j) = -E\left[\frac{\partial^2 \ln L(\boldsymbol{\lambda})}{\partial \lambda_i\, \partial \lambda_j}\right] - E\left[\frac{\partial \ln L(\boldsymbol{\lambda})}{\partial \lambda_i}\, \frac{\partial \ln L(\boldsymbol{\lambda})}{\partial \lambda_j}\right] + E\left[\frac{\partial \ln L(\boldsymbol{\lambda})}{\partial \lambda_i}\right] E\left[\frac{\partial \ln L(\boldsymbol{\lambda})}{\partial \lambda_j}\right], \tag{A1}$$

(A1)

where $L(\boldsymbol{\lambda})$ is used as shorthand for $L(\boldsymbol{\lambda} \mid \mathbf{X}, \mathbf{T})$. The expected value is to be taken over the conditional distribution $f(\mathbf{T} \mid \mathbf{X}; \boldsymbol{\lambda})$, which by virtue of the experimental independence can be written as

$$f(\mathbf{T} \mid \mathbf{X}; \boldsymbol{\lambda}) = \prod_n f(\mathbf{t}_n \mid \mathbf{x}_n; \boldsymbol{\lambda}). \tag{A2}$$

Using (A2) and $L_n(\boldsymbol{\lambda})$ as shorthand for $L(\boldsymbol{\lambda} \mid \mathbf{x}_n, \mathbf{t}_n)$, (A1) can be rewritten as

$$I(\lambda_i, \lambda_j) = E\left[-\frac{\partial^2 \ln L(\boldsymbol{\lambda})}{\partial \lambda_i\, \partial \lambda_j}\right] - \sum_n \operatorname{cov}\left(\frac{\partial \ln L_n(\boldsymbol{\lambda})}{\partial \lambda_i}, \frac{\partial \ln L_n(\boldsymbol{\lambda})}{\partial \lambda_j} \,\middle|\, \mathbf{x}_n\right), \tag{A3}$$

since the cross-products of the derivatives of $\ln L_n$ and $\ln L_{n'}$ for $n \neq n'$ cancel against the product of the expectations.

The right-hand side of (A3) is of general use for expressing the observed information matrix when some of the data are missing. The first term is the expected value of the observed information matrix given the full data $(\mathbf{X}, \mathbf{T})$, with respect to the conditional distribution of $\mathbf{T}$ given $\mathbf{X}$, and is analogous to (16). The second term gives explicitly the loss of information due to the nonobservation of $\mathbf{T}$.

To simplify the notation, a single index will be used to denote the item. The log likelihood in incomplete designs given the observed data $\mathbf{X}$ and the unobserved data $\mathbf{T}$ can then be written as

$$\ln L(\boldsymbol{\lambda} \mid \mathbf{X}, \mathbf{T}) = \sum_n \sum_q t_{nq} \left\{ \sum_i d_{ni} [x_{ni} \ln \psi_{iq} + (1 - x_{ni}) \ln(1 - \psi_{iq})] + \ln \pi_q \right\}, \tag{A4}$$

with $\pi_q$ defined by (17) and

$$\psi_{iq} = \frac{\exp(\theta_q + \xi_i)}{1 + \exp(\theta_q + \xi_i)}. \tag{A5}$$

Application of (A3) to the Rasch model with a discrete distribution of the latent variable is straightforward but tedious algebra. The results are listed below in a format which reflects the structure of the general formula (A3), by enclosing the two terms in brackets.

$$I(\xi_i, \theta_q) = \left[\sum_n d_{ni}\, F_{nq}\, \psi_{iq}(1 - \psi_{iq})\right] - \left[\sum_n F_{nq}\, \Lambda_{nq}\, d_{ni} \Big((x_{ni} - \psi_{iq}) - \sum_p F_{np}(x_{ni} - \psi_{ip})\Big)\right], \tag{A8}$$

where

$$\Lambda_{nq} = \sum_i d_{ni}(x_{ni} - \psi_{iq}); \tag{A9}$$

$$I(\theta_q, \theta_q) = \left[\sum_n F_{nq} \sum_i d_{ni}\, \psi_{iq}(1 - \psi_{iq})\right] - \left[\sum_n F_{nq}(1 - F_{nq})\, \Lambda_{nq}^2\right]; \tag{A11}$$

$$I(\theta_q, \theta_p) = [\,0\,] - \left[-\sum_n F_{nq}\, F_{np}\, \Lambda_{nq}\, \Lambda_{np}\right], \qquad (p \neq q). \tag{A12}$$

Notice that the above formulas yield minus the matrix of second derivatives of the log likelihood given that only X is observed. During estimation they can be evaluated at the current estimate for use in the NR-algorithm. At the ML estimates, they yield the observed information matrix, which, after inversion, gives a consistent estimate of the variance-covariance matrix of the parameter estimates.

Let $M$ be the number of parameters, and suppose $M - m$ independent linear restrictions are imposed on the parameter space. Then there exists a matrix $\mathbf{B}$ of rank $m$ such that $\boldsymbol{\eta} = \mathbf{B}\boldsymbol{\lambda}$. It is easy to show that

$$I(\boldsymbol{\eta}, \boldsymbol{\eta}) = \mathbf{B}\, I(\boldsymbol{\lambda}, \boldsymbol{\lambda})\, \mathbf{B}'. \tag{A17}$$

Usually, the inverse transformation is given (i.e., the $M$ model parameters are expressed as linear combinations of a more restricted set of $m$ basic parameters, or, equivalently, a matrix $\mathbf{Q}$ is defined such that $\boldsymbol{\lambda} = \mathbf{Q}\boldsymbol{\eta}$). The above matrix $\mathbf{B}$ is then given by $\mathbf{B} = (\mathbf{Q}'\mathbf{Q})^{-1}\mathbf{Q}'$.
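A direct transcription of (A17) in Python with NumPy (a sketch; Q is whatever design matrix defines the restriction $\boldsymbol{\lambda} = \mathbf{Q}\boldsymbol{\eta}$, and the information matrix of the unrestricted parameters is assumed to have been assembled from (A8) through (A12)):

```python
import numpy as np

def restricted_information(info_lambda, Q):
    """I(eta, eta) = B I(lambda, lambda) B' with B = (Q'Q)^{-1} Q',
    as in (A17), for the linear restriction lambda = Q @ eta."""
    B = np.linalg.solve(Q.T @ Q, Q.T)   # evaluates (Q'Q)^{-1} Q'
    return B @ info_lambda @ B.T
```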


References

Andersen, E. B. (1973). Conditional inference and models for measuring. Unpublished dissertation, Mentalhygienisk Forskningsinstitut, Copenhagen.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM algorithm. Psychometrika, 46, 443–459.
Bush, R. R., & Mosteller, F. (1951). A mathematical model for simple learning. Psychological Review, 58, 313–323.
de Leeuw, J., & Verhelst, N. D. (1986). Maximum likelihood estimation in generalized Rasch models. Journal of Educational Statistics, 11, 183–196.
Estes, W. K. (1972). Research and theory on the learning of probabilities. Journal of the American Statistical Association, 67, 81–102.
Ferguson, G. A. (1942). Item selection by the constant process. Psychometrika, 7, 19–29.
Fischer, G. H. (1972). A step towards dynamic test theory (Research Bulletin 24). Vienna: University of Vienna, Institute of Psychology.
Fischer, G. H. (1974). Einführung in die Theorie Psychologischer Tests [Introduction to the theory of psychological tests]. Bern: Huber.
Fischer, G. H. (1981). On the existence and uniqueness of maximum likelihood estimates in the Rasch model. Psychometrika, 46, 59–77.
Fischer, G. H. (1983). Logistic latent trait models with linear constraints. Psychometrika, 48, 3–26.
Follmann, D. (1988). Consistent estimation in the Rasch model based on nonparametric margins. Psychometrika, 53, 553–562.
Glas, C. A. W. (1988). The Rasch model and multi-stage testing. Journal of Educational Statistics, 13, 45–52.
Glas, C. A. W. (1989). Contributions to estimating and testing Rasch models. Arnhem: Cito.
Glas, C. A. W., & Verhelst, N. D. (1989). Extensions of the partial credit model. Psychometrika, 54, 635–659.
Jannarone, R. J. (1986). Conjunctive item response theory kernels. Psychometrika, 51, 357–373.
Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika, 49, 223–245.
Kempf, W. (1974). Dynamische Modelle zur Messung sozialer Verhaltenspositionen [Dynamic models for measuring social relationships]. In W. Kempf (Ed.), Probabilistische Modelle in der Sozialpsychologie [Probabilistic models in social psychology] (pp. 13–55). Bern: Huber.
Kempf, W. (1977a). A dynamic test model and its use in the micro-evaluation of instrumental material. In H. Spada & W. Kempf (Eds.), Structural models for thinking and learning (pp. 295–318). Bern: Huber.
Kempf, W. (1977b). Dynamic models for the measurement of 'traits' in social behavior. In W. Kempf & B. H. Repp (Eds.), Mathematical models for social psychology (pp. 14–58). New York: Wiley.
Kiefer, J., & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887–903.
Laird, N. (1978). Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association, 73, 805–811.
Lawley, D. N. (1943). On problems connected with item selection and test construction. Proceedings of the Royal Society of Edinburgh, 61, 273–287.
Lord, F. M. (1952). A theory of test scores. Psychometric Monograph No. 7, 17(4, Pt. 2).
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B, 44, 226–233.
Luce, R. D. (1959). Individual choice behavior. New York: Wiley.
Luce, R. D., Bush, R. R., & Galanter, E. (Eds.). (1963). Handbook of mathematical psychology (3 volumes). New York: Wiley.
Mislevy, R. J. (1984). Estimating latent distributions. Psychometrika, 49, 359–381.
Neyman, J., & Scott, E. L. (1948). Consistent estimates, based on partially consistent observations. Econometrica, 16, 1–32.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
Sternberg, S. H. (1959). A path dependent linear model. In R. R. Bush & W. K. Estes (Eds.), Studies in mathematical learning theory (pp. 308–339). Stanford: Stanford University Press.
Sternberg, S. H. (1963). Stochastic learning theory. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology, Vol. II (pp. 1–120). New York: Wiley.
Thissen, D. (1982). Marginal maximum likelihood estimation in the one-parameter logistic model. Psychometrika, 47, 175–186.
Verhelst, N. D., & Eggen, T. J. H. M. (1989). Psychometrische en statistische aspecten van peilingsonderzoek [Psychometric and statistical aspects of assessment research] (PPON-rapport 4). Arnhem: Cito.

Manuscript received 6/21/91
Final version received 7/7/92