Journal of Educational and Behavioral Statistics-2015-Liang-5-34

Embed Size (px)

Citation preview

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    1/30

    A Quasi-Parametric Method for Fitting FlexibleItem Response Functions

    Longjuan Liang

     Educational Testing Service

    Michael W. Browne

    Ohio State University

     If standard two-parameter item response functions are employed in the analysis

    of a test with some newly constructed items, it can be expected that, for some

    items, the item response function (IRF) will not fit the data well. This lack of fit 

    can also occur when standard IRFs are fitted to personality or psychopathology

    items. When investigating reasons for misfit, it is helpful to compare item

    response curves (IRCs) visually to detect outlier items. This is only feasible if   

    the IRF employed is sufficiently flexible to display deviations in shape from the

    norm. A quasi-parametric IRF that can be made arbitrarily flexible by increas-

    ing the number of parameters is proposed for this purpose. To take capitaliza-

    tion on chance into account, the use of Akaike information criterion or Bayesian

    information criterion goodness of approximation measures is recommended for 

     suggesting the number of parameters to be retained. These measures balance

    the effect on fit of random error of estimation against systematic error of   

    approximation. Computational aspects are considered and efficacy of the

    methodology developed is demonstrated.

    Keywords: item response theory; flexible item response function; monotonic polynomial 

    1. Introduction

    When dichotomous items of an ability test are being analyzed, the most

    widely employed item response functions (IRFs) have two parameters. Although

    these IRFs have been found to be useful in general, they lack flexibility and there

    are situations where they fail to fit some items. When this happens, it could be

    either that the items have flaws or the data have characteristics that cannot be

    handled by the IRF. In this situation, it is helpful to have access to a flexible IRF

    that yields an item response curve (IRC) that will display differences in shape

     between items.

     Journal of Educational and Behavioral Statistics

    2015, Vol. 40, No. 1, pp. 5–34

     DOI: 10.3102/1076998614556816 

    # 2014 AERA. http://jebs.aera.net 

    5

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    2/30

    A number of articles on flexible IRFs have appeared. For example, Drasgow,

    Levine, Williams, McLaughlin, and Candell (1989) describe and illustrate the

    use of multilinear formula score theory for nonparametric IRFs; Ramsay and 

    Winsberg (1991) use monotonic spline basis functions and calculate maximum

    marginal likelihood (MML) estimates for the item parameters; Meijer and 

    Baneke (2004) discuss the use of nonparametric methods when analyzing psy-

    chopathology and personality items. Other approaches for improving goodness

    of fit of item response models are also possible. For example, Woods and Thissen

    (2006) employ the two-parameter logistic (2PL) function for the IRF in all items

     but replaced the standard normal ability distribution by a spline-based density. A

    Bayesian approach to nonparametric item response modeling has been recently

    developed by Duncan and MacEachern (2008, 2013). Although very good esti-

    mates of nonstandard IRCs are produced, a considerable amount of computation

    is required.

    In his seminal article on nonparametric item response theory, Ramsay (1991)

    suggested use of a nonparametric regression of item scores on normalized ability

    surrogates using kernel smoothing as IRC. This approach is simple and robust

    and allows the shape of the IRC to vary freely from one item to another. It has

    made a substantial impact. Kernel smoothing does not constrain the IRF to be

    monotonic, so that it can provide option response curves for incorrect options.

    When the correct option of ability items is being analyzed, however, it is desir-

    able to be able to constrain the IRCs to be monotonic. Lee (2007) investigated 

    this matter and used isotonic regression in conjunction with Ramsay’s approach

    to obtain monotonic IRFs. A disadvantage of these nonparametric approaches is

    that the IRF is not readily portable to scores of examinees that are not in the cali-

     bration sample so that scoring test results for future examinees is difficult.

    This article presents an IRF that is ‘‘quasi-parametric’’ (Ramsay, 1991, p. 613)

    in the sense that it employs parameters that are intended solely for the provision of 

    a graphical representation of the IRC and not for interpretation in terms of some

    underlying psychological process. Monotonicity can be guaranteed with this

    approach that permits flexibility of the IRC and facilitates the use of the existing

     parametric Bayesian Expected A Posteriori (EAP) estimates for the stochastic abil-

    ity parameter. Our proposed filtered monotonic polynomial (FMP) IRF is the com-

     position of a logistic function and a monotonic polynomial. Results concerning

    cumulative distribution functions (cdfs) given by Elphinstone (1983) show that the

    FMP IRF may be used to approximate any IRF with a continuous derivative arbi-

    trarily closely by increasing the number of parameters in the monotonic polyno-

    mial. Thus, this FMP IRF not only is flexible but is formulated as an algebraic

    expression that is easily portable to future examinees not in the calibration sample.

    When the FMP function is required to approximate some ‘‘population’’ IRF,

    with few parameters it can happen that the approximating FMP IRF will require

    more parameters than the (usually unknown) population IRF. When samples are

    very large, and sampling error can be disregarded, the necessity for many FMP

     A Quasi-Parametric Method for Fitting 

    6

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    3/30

     parameters would not matter. In smaller samples where sampling error needs to

     be considered, an attempt to use many FMP parameters could result in appreci-

    able capitalization on chance. In general, it is preferable to retain fewer para-

    meters in smaller samples than in large samples (see Browne, 2000; Cudeck &

    Henly, 1991). We shall implement this principle by using the Akaike information

    criteria (AIC; Akaike, 1973) or the Bayesian information criteria (BIC; Schwarz,

    1978) as guides when choosing the number of FMP parameters for an item. These

    are goodness of approximation measures that make no assumption of an exactly

    correct model in the population, take sample size into account, limit the number 

    of parameters when the sample size is small, and allow more parameters as the

    sample size increases.

    The FMP approach involves no assumption that the number of item para-

    meters is the same for all items, so that the shape of the IRC can vary from one

    item to another, as is the case with the nonparametric regression approaches.

    Because the usual two PL (2PL) IRF is a special case of the FMP family of IRFs,

    it can be fitted at the same time as more flexible IRFs for comparative purposes.

    Furthermore, the FMP requirement of monotonicity for the IRF may be discarded 

    to result in a filtered unconstrained polynomial (FUP) procedure that can assist

    the diagnosis of nonmonotonic items. Although this article concentrates on

    extensions to the computationally convenient 2PL IRF, basic theory is presented 

    in a manner that can be extended to other IRF families.

    Unlike the 2PL, the one PL (1PL) IRF, or Rasch model, constrains an item para-

    meter (discrimination) to equality across items (cf. Thissen & Orlando, 2001, p. 76,

    equation 3). Consequently, the 1PL does not fit into the FMP computational frame-

    work that estimates item parameters successively, 1 item at a time, rather than con-

    currently for all items. Furthermore, the fundamental philosophy of the FMP

    approach is to seek a model that fits given data as well as possible and contradicts

    that of the Rasch model which requires that data should fit a given model to satisfy

    mandatory measurement requirements (cf. Thissen & Orlando, 2001, pp. 90–91).

    The following section gives a brief review of the parametric and nonpara-

    metric approaches for estimating the IRF, including the joint maximum like-

    lihood (JML) and MML parametric estimation methods. Thereafter, we

    introduce the filtered polynomial IRF estimation method and consider the

    choice of the number of parameters using the AIC information theoretical

    approach. Subsequently, we present results from simulation studies and an

    example with actual data. A summary of findings and conclusions of the

    research is provided in the closing part of the article. Details concerning

     parameter estimation are given in Online Appendices A and B.

    2. Item Response Theory

    Consider a N  n data matrix, Y , with typical element y si which represents theresponse of examinee s, to item i with y si ¼ 1 if the response is correct and  y si ¼ 0

     Liang and Browne

    7

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    4/30

    otherwise. The responses of all examinees to item  i  are contained in the  N  1vector, y#i, formed from column i of  Y . A row s of  Y  provides the response patternfor examinee s  and will be denoted by the 1 n vector, y0 s!  with ith element y si.Thus, a column of  Y  represents scores of  N  examinees on an item and a row rep-resents scores of an examinee on  n  items.

    We assume that there is a single latent trait,  y, that influences an examinee’s

    response to each item. The IRF for item  i

     PiðyÞ ¼ Probð yi ¼ 1jyÞ ð1Þgives the probability that an examinee with ability   y   will give the ‘‘correct’’

    answer to item   i. Because   Pi(y) represents a probability, it must be bounded 

     below by 0 and above by 1. With ability or achievement tests, also, it makes sense

    to assume that the probability of passing an item increases as  y  increases, so thatthe IRF will be monotonically increasing, bounded below by 0 and above by 1.

    Any IRF Pi(y) that decreases as y  increases would be symptomatic of an unusual

    item.

    The vectors,   y s!, are regarded as independent realizations of a randomvector  y.  For each examinee  s, there corresponds a realization  y s  of the latent

    trait,  y, that represents examinee ability. We assume local independence, that

    is, that conditionally on   y ¼   y s   the elements of   y  are independently distrib-uted. Consequently, the probability of a specific response pattern   y s!   given

    y s   is

    Probð y s!jy ¼ y sÞ ¼Yni¼1

     P y si si   1  P sið Þ 1 y sið Þ;   ð2Þ

    where

     P si ¼  Piðy sÞ:   ð3Þ

    2.1. Parametric IRFs

    In addition to the abilities,   y s, parametric IRFs involve additional item-specific parameters. As examples, we shall consider two well-known IRFs each

    with two parameters, an item discrimination parameter  ai and an item difficulty

     parameter  bi. The normal ogive IRF for item   i   is given (e.g., Lord & Novick,

    1968, p. 366) by

     Pi   yð Þ ¼Z   miðyÞ

    11 ffiffiffiffiffiffi2

    p    expð z 2Þdz    ð4Þ

    and the two PL IRF (Birnbaum, 1968) by

     Pi   yð Þ ¼   11 þ expfmi   yð Þg

      ð5Þ

     A Quasi-Parametric Method for Fitting 

    8

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    5/30

    where mi(y) is a linear function of  y:

    mi   yð Þ ¼ ai   y bið Þ:   ð6ÞBoth IRFs are monotonic increasing if  ai  > 0 and are bounded by 0 and 1.

    They are similar but differ in the scale of  ai. This difference between the two

    IRFs can be reduced substantially by replacing the function  mi(y) in Equation

    5 by the rescaled function 1:702mi   yð Þ:1Early in the development of item response theory, the normal ogive IRF

    was used predominantly but was replaced subsequently by the 2PL which is com-

     putationally more convenient. Two methods are best known for obtaining para-

    meter estimate using the 2PL. The first was originally suggested by Birnbaum

    (1968). The unobservable ability variables,  y s, are regarded as parameters to be

    estimated rather than as realizations of a latent variable with a prespecified normal

    distribution. A likelihood function is maximized jointly with respect to the item

     parameters   ai; bi; i ¼ 1; . . . ; n   and the ability parameters   y s; s ¼ 1; . . . ; N   usingan alternating iterative algorithm. This method of estimation is referred to as JML

    estimation. Consistency of the estimates has never been proved (e.g., Baker, 1992,

     pp. 104–105) and ‘‘tuning’’ of the algorithm is necessary (Baker, 1992, p. 112).

    The second method of estimation, introduced by Bock and Lieberman (1970)

    for the 2PL, treated ability as a latent trait with a specified normal distribution

    and maximized the marginal likelihood for item parameters alone integrating out

    the ability variable,   y. A Newton–Raphson algorithm was proposed and gave

    acceptable results for a small number of items but was not practical for many

    items. Significant improvements were provided by Bock and Aitkin (1981) who

    approximated the density of  y  by a step function that facilitated the use of an

    expectation-maximization algorithm (EM algorithm). This method of estimation

    is known as MML and is now frequently employed.

    The MML estimation method has advantages over JML in that it can obtain

    estimates of the item parameters without estimating ability parameters,

    y s; s ¼ 1; . . . ; N   at the same time. Estimates for the latent traits,   y s, may beobtained subsequently using a Bayesian method.

    2.2. Ramsay’s Nonparametric IRF 

    Ramsay (1991, 2000) introduced a nonparametric approach to estimating an

    IRC using kernel smoothing. This approach requires a surrogate ability value,~y s, for each examinee. All examinees are ranked according to total test score and ~yr  is defined to be the estimated quantile of the standard normal distribution cor-

    responding to rank  r . A smoothed estimate of the IRF is given by:

     b PiðyÞ ¼ P N r ¼1 K    ~yr yh  yriP N 

    r ¼1 K   ~yr y

    h

      ;   ð7Þ

     Liang and Browne

    9

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    6/30

    where yri is the score on item i  of the examinee with rank  r . The symmetric non-

    negative weighting function

     K z 

    ð Þ ¼ ð2

    Þ1=2

    exp

    ð z 2=2

    Þwhere z 

    ¼  ~yr 

    y =h   ð8Þ

    is known as the Gaussian kernel. It will have a maximum when   z  ¼   0 and decrease toward zero as j z j increases. An increase in the bandwidth h  will resultin slower changes of the function b PiðyÞ  but also increase bias of the function.Rapid changes or wiggles due to sampling fluctuation decrease as  N  increases,

    so that bias can be reduced by reducing h as N  increases. In the computer program

    TestGraf (Ramsay, 2000), the default of  h ¼ 1:1 N 0:2 is employed, so that h  is afunction of  N  alone and decreases as  N  increases.

    It is possible to use the nonparametric IRF defined by Equation 7 to compute

    maximum likelihood estimates of the ability parameters  y s; s ¼ 1; . . . ; N ;   feed them back into the process for reranking the examinees, obtaining new surrogate

    variables  ~y s; s ¼ 1; . . . ; N , and carrying out an iterative procedure. In the Test-Graf (Ramsay, 2000) program, this can be done, but a manual intervention at

    each iteration is required. This precludes use of the iterative procedure in random

    sampling experiments. If no iterations are carried out, the original surrogate vari-

    able values,  ~y s, are output as estimates of the ability variables.

    3. Quasi-parametric IRFs

    The extension of the IRF for the 2PL in Equation 5 to yield IRFs that are

    simultaneously both flexible and parametric will now be considered.

    Elphinstone (1983, 1985) proposed a monotonic polynomial–based approach

    for estimating an unknown univariate distribution function. Sinnott (1997) sub-

    sequently named it the ‘‘filtered polynomial’’ distribution estimation method and 

    extended it to a multivariate setting. Here, the general methodology provided by

    Elphinstone (1983) will be adapted to estimate an IRF of unknown functional

    form. The likelihood function appropriate here for estimating an unknown IRFis different to that used by Elphinstone (1985) for estimating a distribution func-

    tion of unknown functional form.

    3.1. Filtered Polynomials

    The IRF   Pi(y) yields the probability that an examinee with ability   y   will

    answer a specified item, i, correctly. Unless otherwise stated, each IRF to be con-

    sidered here is assumed (i) to be monotonic increasing, (ii) to be bounded by 0

    and 1, and (iii) to have a continuous first derivative with respect to  y  implyingthat the IRF is also continuous. Suppose that the functional form of some ‘‘true’’

    IRF   ~ Pi   yð Þ   is not known, but a known scalar valued function,  H (m), of a scalar 

     A Quasi-Parametric Method for Fitting 

    10

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    7/30

    valued argument,  m, satisfies the three requirements of an IRF specified previ-

    ously: for example, either the logistic function

     H m

    ð Þ ¼  1

    1 þ expðmÞ;

      ð9

    Þor normal ogive

     H mð Þ ¼Z   m

    11 ffiffiffiffiffiffi2

    p    expð z 2=2Þdz ;   ð10Þ

    would be suitable.

    It is known (e.g., Elphinstone, 1983, p. 167) that there exists at least one con-

    tinuous monotonic function   ~mi   yð Þ such that~ Pi   yð Þ ¼  H    ~mi   yð Þð Þ:   ð11Þ

    This monotonic function,   ~mi   yð Þ is, in general, not of a known functional form.It may, however, be approximated arbitrarily closely by a polynomial  mi   yð Þ  of odd degree, 2k i þ 1, where  k i  0, if  k i  is made sufficiently large (Elphinstone,1983, section 4). Thus,

    mi   yð Þ ¼ b0i þ b1iy þ b2iy2 þ þ b2k þ1;iy2k iþ1   ~mi   yð Þ;   ð12Þ

    with 2k i

    þ2 parameters represented by the vector  b

    0i

     ¼  b0i; b1i; . . . ; b2k 

    þ1;i

    . A

    reparameterization of  bi  that is used to ensure that the polynomial in Equation12 is monotonic will be described in subsection 3.2.

    Any population IRF   ~ Pi   yð Þ  in Equation 11 that is of an unknown functionalform may be approximated arbitrarily closely by the IRF of known functional

    form

     Pi   yð Þ ¼  H mi   yð Þð Þ;   ð13Þif  k i is sufficiently large. That is, the IRF is the composition of the filter with the

    monotonic polynomial,   P

    ¼ H 

     m. Thus, the ‘‘filter’’ H 

    ðÞ  transforms the

    unbounded monotonic polynomial, mi   yð Þ, in Equation 12 into a monotonic IRC, Pi   yð Þ; that is bounded by 0 and 1. (This terminology is motivated by an analogoussituation in signal processing in which a potentially unbounded signal is trans-

    formed into a bounded signal through a device known as a ‘‘filter.’’) The IRF

    defined in Equation 13 will be consequently referred to as an FMP model. If 

    no constraints are imposed on the coefficients in Equation 12 to ensure that mi(y)

    is monotonic, the filtered function in Equation 13 will still be bounded by 0 and 1

     but need not be monotonic. The resulting model will then be referred to as a fil-

    tered unconstrained polynomial (FUP) model.

    The logistic function in Equation 9 will be used henceforth as a filter because

    it is algebraically convenient to do so. Use of the normal ogive as a filter would 

    give essentially the same results but leads to algebraic expressions that are more

     Liang and Browne

    11

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    8/30

    complicated and less easily evaluated. Substitution of the polynomial in Equation

    12 into the logistic filter in Equation 9 yields the IRF:

     Pi   y

    ð Þ ¼ P   y

    jbi

    ð Þ ¼  1

    1 þ exp    b0i þ b1iy þ b2iy2 þ þ b2k iþ1;iy2k iþ1 ;   ð14Þwhich applies to both the FUP and the FMP models. The difference is that the

    coefficient vector,   bi, is unconstrained for the FUP model and constraints are

    applied to bi  to ensure monotonicity of the IRC for the FMP model. These con-

    straints are applied by means of reparameterizations that will be described in

    Subsection 3.2.

    Because   k i   can vary from one item to another, the shapes of the IRC for 

    different items may be different. When   k i¼   0, the IRF in Equation 14 isequivalent to the 2PL IRF of Equations (5) and (6) with   b0i ¼ aibi   and b1i ¼ ai. The filter,   H ðÞ, may be any monotonic function that is bounded 

     by zero and one and has a continuous first derivative. It is also desirable that

     H ðÞ   should have the same domain as the domain hypothesized for theunknown   ~ Pi   yð Þ, so that the approximating IRF   Pi(y) and the approximated IRF   ~ Pi   yð Þ   have domains that match. The filter has the same mathematical

     properties as a statistical cdf, so that alternative cdfs to those in Equations

    9 and 10 could be tried as a filter.

    Consequently in situations where it is plausible to restrict y to the nonnegative

    real line, the gamma ogive could be tried as a filter. If  y   is assumed to be con-tained in a closed interval, the beta ogive could be employed. It is worth bearing

    in mind that filters that are close in shape to that of the unknown IRF   ~ Pi   yð Þ willneed a lower degree for the monotonic polynomial than those that are more dis-

    similar. The choice of filter is not always critical, however, because one can com-

     pensate for an inappropriate filter to some extent by increasing the degree of the

    monotonic polynomial. There are, however, practical limits to the degree of the

    monotonic polynomial because computational instabilities are associated with

     polynomial models of high degree.

    3.2. Monotonicity Constraints

    A necessary condition for the polynomial,  mi(y), given in Equation 12 to be

    monotonic is that it be of odd degree, 2k i þ 1. Here, we shall employ a parame-terization of an odd-degree polynomial that ensures that it is monotonic. The key

    ideas were contained in a single formula that was presented by Ramsay (1977,

     p. 108) in the context of monotonic transformations to additivity. These were

    developed in detail by Elphinstone (1983, section 4) in the context of distribution

    estimation.

    A necessary and sufficient condition for  mi(y) to be monotonic is that its first

    derivative be a nonnegative polynomial

     A Quasi-Parametric Method for Fitting 

    12

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    9/30

     pi   yð Þ ¼   d d y

    mi   yð Þ ¼ a0i þ a1iy þ þ a2k i;iy2k i 0 for all y   ð15Þ

    and must consequently be of even degree, 2k i. This polynomial, pi(y), has 2k i

    þ1

    coefficients that will be represented by the vector  a0i ¼   a0i; a1i; . . . ; a2k i;i .Given the nonnegative polynomial   pi(y) in Equation 15, the corresponding

    monotonic polynomial   mi(y) in Equation 12 is obtained from the indefinite

    integral

    mi   yð Þ ¼  i þZ 

      pi   yð Þd y;   ð16Þ

    where  i  is the constant of integration. Consequently, the relationships betweenthe coefficients of  mi(y) and those of  pi(y) are given by:

    b0i ¼  i   and  b j ;i ¼a j 1;i

     j for   j ¼ 1; 2; . . . ; 2k i þ 1:   ð17Þ

    The polynomial   pi(y) in Equation 15 needs to be evaluated subject to the

    requirement that  pi   yð Þ 0 for all admissible  y. This may be accomplished byusing the following reparameterization of  pi(y) (Elphinstone, 1983, p. 173):

     pþi   y

    ð Þ ¼  i Q

    k i

     j ¼11

    2a j ;iy

    þ ða2 j ;i

    þb j ;i

    Þy2

    h i;   k i  >  0

    i;   k i ¼ 0:

    8>>>: ð

    18

    Þ

    The 2k i þ 1 coefficients   i; a1;i;b1;i; . . . ;ak i ; bk i

      of   pþi   yð Þ   are required tosatisfy the  k i þ 1 inequality constraints

    i  0;   and  b j   0; j ¼ 1; . . . ; k i:   ð19ÞThen, given the parameter vector 

     γ 0i

     ¼ ð i; i;a1;i;b1;i; . . . ;ak i ;bk i

    Þ ð20

    Þfor  pþi   yð Þ;  the procedure described in Online Appendix A may be used to com-

     pute the corresponding parameter vector 

    a0i ¼   a0i; a1i; . . . ; a2k i;i

      ð21Þfor  pi(y) that will ensure that  pi   yð Þ ¼  pþi   yð Þ >  0. This procedure makes use of arecurrence relation. When   ai   has been obtained, Equation 17 may be used to

    obtain the parameter vector 

    b

    0

    i ¼   b0i; b1i; . . . ; b2k iþ1;i   ð22Þthat ensures that  mi(y) in Equation 12 will be monotonic increasing in  y. Thus,

    the complicated inequality constraints on  bi   that are required for monotonicity

     Liang and Browne

    13

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    10/30

    of the polynomial in Equation 12 are imposed by means of a double reparame-

    terization: The parameter vector  bi is a function (17) of  ai, which in turn is a func-

    tion of the parameter vector   γ i   that satisfies the simple linear inequality

    constraints in Equation 19.

    As an alternative to the reparameterization approach employed here, an

    approach due to Hawkins (1994) for dealing with a monotonic polynomial by

    applying equality constraints at judiciously chosen values of  y  would be worth

    investigation.

    3.3. Parameter Estimation

    A two-stage estimation method based on Ramsay’s (1991) procedure will be

    employed to estimate the item parameters and the abilities. Stage 1 is to obtain sur-

    rogate values,  ~y s; s ¼ 1; . . . ; N , for the examinees’ abilities,  y s. In Ramsay’s pro-cedure, these surrogates are the quantiles of a standard normal distribution based 

    on ranked total test scores. A problem with ranking test scores is that ties can occur 

    very frequently especially for a short test with many examinees. In Ramsay’s Test-

    Graf, ranks are randomly assigned to the tied test scores. To avoid this need for 

    random rank assignment, first principal component scores are used here to assign

    ranks. Component scores are consequently obtained from the left singular vector 

    corresponding to the largest singular value of the centered data matrix   Y 1y0Þð   .If the sum of elements of the corresponding right singular vector is negative, both

    the left and right singular vector are reflected. In addition to eliminating the occur-

    rence of tied ranks, the first principal component scores optimally summarizes the

    data matrix Y  in one dimension. The principal component score ranks are trans-

    formed to the quantiles, qi, of a standard normal distribution to yield the N  1 vec-tor, ~; of ability surrogates. This normalization of surrogate ability scores providesan identification constraint (Ramsay, 1991, p. 614) required for the model.

    In the second stage, after the vector,   ~;   of normalized ability surrogates is

    available, the conditional maximum likelihood estimates,

     b γ i; i ¼ 1; . . . ; n, of the

    item parameter vectors, given ~

    , are obtained. Because of the assumption of localindependence, these estimates may be obtained 1 item at a time by minimizing

    the scaled negative log-likelihood objective function:

     F i ¼  N 1ln L   γ ijy#i; ~ ¼  N 1 X N 

     s¼1 y si ln ð P siÞ þ   1  y sið Þ ln 1  P sið Þf g;   ð23Þ

    where P si ¼  Pi   ~ s

    : (The scaling by N 1 in Equation 23 is convenient because itavoids dependence of the magnitude of  F i on sample size.) Computational details

    are given in Online Appendix B.

    This procedure may be viewed as a modified version of the JML estimation

    method that is truncated after the first iteration. In the initial stages of our 

    research, a full JML iterative process for jointly estimating the FMP item

     A Quasi-Parametric Method for Fitting 

    14

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    11/30

     parameters, γ , and abilities, θ, by maximum likelihood was tried out on data that

    were randomly generated according to an FMP model. After obtaining the con-

    ditional maximum likelihood estimates

     b γ  ¼

      b γ 01; . . . ; b

     γ 0n

    0 by minimizing Equa-

    tion 23, each examinee’s ability parameter  y s s ¼ 1; . . . ; N ; was estimated, one ata time, by minimizing the scaled conditional negative log-likelihood objective

    function

     N 1ln L   y sjy s!; b γ ð Þ ¼ Xni¼1

     y si lnð P siÞ þ   1  y sið Þ ln 1  P sið Þf g; s ¼ 1; . . . ; N ;   ð24Þ

    with respect to  y s. The estimates obtained were then ranked and normalized to

    replace the surrogates and iterative cycles were continued until convergence.

    This procedure was not found satisfactory in the present context and was con-

    sequently discarded. During iteration, item parameter estimates often drifted away

    from the known values. This tendency increased as  k  was increased. Concurrently,

    there was a tendency for the ability estimates, by s, to drift away from the randomlygenerated, and therefore known, y s, as the cycling procedure continued whether or 

    not convergence occurred. Thus, the iterated JML estimate of  g  was less satisfac-

    tory than the currently used surrogate-based estimate. Further evidence that this

    type of iterative algorithm is unsatisfactory will be found in Subsection 4.2.

    Rather than regarding the abilities,  y s, as parameters estimated by minimizing

    Equation 24, they are therefore regarded here as realizations of a random variableand the EAP approach (cf. Bock & Moustaki, 2007, Subsection 5.3) is used to

    obtain Bayesian estimates. To be consistent with the normalization of surrogates

    in the first stage, the standard normal density  j(y) is employed for  y, so that the

    expected value of the a posteriori distribution of abilities is given by:

     E ðyjys; γ Þ ¼R 11

    Qni¼1 PiðyÞ y si 1  PiðyÞf g1 y si yj yð Þd yR 1

    1Qn

    i¼1 PiðyÞ y si 1  PiðyÞf g1 y si jðyÞd yð25Þ

    This expected value is estimated by replacing item parameters by estimates in

    Equation 25 and approximating the two integrals involved using rectangular quadrature to obtain

     by s ¼  "̂   yjy s!; ^ γ ð Þ ¼PQ

    r ¼1Qn

    i¼1½ Pið€yr Þ y si 1  Pið€yr Þn o1 y si

    jð€yr Þ€yr PQr ¼1

    Qni¼1½ Pið€yr Þ y si 1  Pið€yr Þ

    n o1 y sijð€yr Þ

    ; s ¼ 1; . . . ; N ;   ð26Þ

    where  €yr ; r ¼ 1; . . . ; Q are equally spaced points on the closed interval [4, 4].

    3.4. Choice of the Number of Parameters

    In Subsection 3.1, a hypothetical ‘‘true’’ IRF,   ~ Pi   yð Þ;   for item   i  is specified.Because its functional form is unknown, it cannot be estimated directly but can

     Liang and Browne

    15

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    12/30

     be approximated arbitrarily closely by an IRF, Pi(y), of known functional form

    (Equation 13) by using a sufficient number of parameters. In this situation, it

    is not possible to provide a goodness-of-fit test with a null hypothesis involving

    an algebraic specification for a ‘‘true’’ model,  ~

     Pi   yð Þ. The AIC (Akaike, 1973;Burnham & Anderson, 2004, pp. 266–268) is helpful under these circumstances,however. It may be regarded as an estimate of an expected cross-validation cri-

    terion using the Kullback–Leibler measure of the distance between two distribu-

    tions (De Leeuw, 1992) and is based on information theory (Burnham &

    Anderson, 2004, section 2, pp. 264–266) rather than on classical statistical infer-

    ence. The AIC is evaluated for each set of candidate models for Item  i where each

    model is obtained by varying the number of parameters in the approximating

    IRF,  PiðyÞ. The candidate model yielding the smallest AIC tentatively suggeststhe number of parameters to be employed. No statistical test, null hypothesis, or 

    significance level is involved.

    The computing procedure described in Online Appendix B produces a

    sequence of nested FMP models with k i ¼ 0; 1; 2; . . . ; k max because the final iter-ated parameter values for one model are employed in the definition of good start-

    ing values for the next model. This sequence of models also provides a

    convenient candidate set for the AIC. Because the unknown ‘‘true’’ model (11)

    cannot be included in this candidate set, the aim of the analysis can only be

    ‘‘model approximation’’ and not ‘‘model verification.’’

    The AIC is defined as AIC¼

    2 ln Lþ

    2q (e.g., Burnham & Anderson, 2004,

     p. 268) where L  is the likelihood function and  q  represents the number of estim-

    able parameters for a model in the candidate set. Only the rank order of models

    according to the AIC is employed in the selection process. Consequently, all val-

    ues of the AIC in the candidate set may be multiplied by the same positive con-

    stant without affecting any conclusions. The AIC increases without bound as  N 

    increases, but this problem may be corrected by multiplying the AIC for each

    candidate model by  N 1. The scaled AIC for item  i then is

    AICi ¼

    2 N 1 ln L b γ ijy#i;  ~ þ

    2qi

     N  ¼ 2  L

    þ 2

     N qi;

      ð27

    Þwhere    L ¼  N 1 P N 

     s¼1 L si;

     L si ¼  L b γ ij y si;   ~y s ¼   y si   ln  P si þ   1  y sið Þ ln 1  P sið Þf g > 0  s ¼ 1; . . . ; N ;   ð28Þ

    is the contribution of examinee  s to the log likelihood,  P si ¼  Pið~y sÞ and 

    qi ¼ 2k i þ 2;   ð29Þis the number of parameters. Because    L is the mean of the identically distributed 

     L si, s ¼ 1; . . . ; N , its expected value  E ð  LÞ remains constant as  N  increases.

     A Quasi-Parametric Method for Fitting 

    16

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    13/30

    The negative log likelihood,  N 1 ln L b γ ijy#i; ~ , is minimized with respectto the parameter vector  γ i   so that it decreases as the number of parameters,  qi,

    increases. Given N , therefore, the first term in Equation 27 decreases and the sec-

    ond term, (2/ N )qi, increases as  qi increases and, as a result, acts as a penalty onAICi. If sample size,  N , is very large, however, the effect of an increase of  qi on

    the penalty (2/ N )qi  will be negligibly small and the candidate model with the

    largest number of parameters will yield the smallest AICi. Thus, the AIC tends

    to favor IRFs with few parameters when samples are small, thereby avoiding

    overfitting, and to favor IRFs with many parameters in large samples when over-

    fitting is not an issue. Use of the AIC is not intended to provide an estimate of 

    some correct number of parameters in a population but rather to lead to a model,

     possibly with few parameters, that will predict optimally outside the calibration

    sample. Examples of the effect of sample size on the AIC in the analysis of cov-ariance structures are given in Browne (2000, Subsection 4.8).

    The theoretical justification for the penalty term, (2/ N )qi, of AIC involves an

    assumption that the likelihood function is correctly specified, so that the item

     parameter estimates in b γ  are maximum likelihood. This is not the case in the pres-ent situation because item parameter estimates are obtained by regarding the

    latent ability variables  y s   as observed quantities, whereas in practice, they are

    unobservable and replaced by surrogates   ~y s. Consequently, the item parameter 

    estimates may only be regarded as some sort of pseudo-maximum likelihood. For 

    this reason, it is best to regard AICi  in Equation 27 as a pseudo-AIC, having the

    same formula as a legitimate AIC but being applied under other assumptions.

    This pseudo-AIC still has the property of penalizing models with many para-

    meters when the sample size is small, but the value of the penalty may not be

    optimal.

    The BIC proposed by Schwarz (1978) is similar to the AIC but has a different

     penalty term. Burnham and Anderson (2004) compare the AIC and BIC and point

    out that the BIC is not related to information theory. After scaling by  N 1, the

    BIC becomes

    BICi ¼ 2 L þ ln N  N 

    qi:   ð30Þ

    Again the BIC penalty term ðln N = N Þqi   tends toward zero as   N   increases.Also, as used here, the BIC is in effect a pseudo-BIC. Characteristics of the AIC

    and BIC that are shared by the pseudo-AIC and pseudo-BIC are as follows.

    Both the AIC and BIC favor a small number of parameters in ‘‘small’’ samples

     but can favor many parameters in ‘‘large’’ samples. If  N  8, the number of para-meters indicated by the BIC will not be greater than that indicated by the AIC.

     Neither the AIC nor the BIC is intended to suggest a ‘‘correct’’ model. Rather,

    they give an indication of the number of parameters to use in order to give a good 

    approximation to an unspecified ‘‘true’’ IRF taking sample size into account. In

     Liang and Browne

    17

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    14/30

    small samples where coefficient estimates are contaminated by error, fewer poly-

    nomial coefficients should be used than in large samples where estimates will be

    more accurate (cf. Browne, 2000).

    The choice of the number of parameters  q ¼

     2k  þ

     2 using the AIC or BIC

     plays a similar role when using the FMP or FUP models as the choice of the

     bandwidth h in Ramsay’s nonparametric IRF. However,  q and  h operate in oppo-

    site directions. Flexibility of the FMP IRC increases as the positive integer   q

    increases, whereas flexibility of the nonparametric IRC increases as the positive

    real number  h decreases toward zero. The use of  h ¼ 1:1 N 0:2 for nonparametricIRC smoothing and  k  for flexibility of the FMP IRC have similar aims. Both are

    intended to guard against using overflexible IRCs if sample sizes are small so that

    random sampling fluctuations are large and overfitting can occur.

    4. Numerical Studies

    Two simulation studies are reported to illustrate properties of the FMP

    approach in comparison with other approaches. These are followed by a numer-

    ical illustration of the effect of alternative identification conditions for the FMP

    model.

    4.1. Simulation Study Design

    This section deals with notation and with common aspects of the simulations.FMP_ k  will represent an FMP model with index  k  yielding a monotonic poly-

    nomial of degree 2k þ 1 in Equation 12 and an IRF with  q ¼ 2k þ 2 parametersin Equation 14. In particular, the IRF for FMP_ 0 is a reparameterization of the

    IRF for the 2PL, so that the two models are equivalent, even if estimation meth-

    ods differ.

    In both of the simulation studies, 100 random samples were generated. Each

    sample consisted of 2,000 examinees’ responses on 20 items. Sets of 20 items

    had IRF of the same algebraic form. Population parameter values were chosen

     by generating them from specified distributions. Appropriate details will begiven in subsections 4.2 and 4.3.

    Ability variables, y s, generated for Subsections 4.2 and 4.3, were independently

    distributed according to the normal distribution with mean 0 and variance 1. Given

    an IRF, P(y), and an examinee ability value, y s, the examinee’s response,  y s, was

    computed by drawing a random number, u s, from a uniform distribution U [0,1] and 

    defining the response by  y s ¼ 1 when  u s  <  P   y sð Þ and  y s ¼ 0 otherwise.Performance of the models was evaluated from the following two perspec-

    tives: (i) for each item i, the closeness of the estimated IRC, b Pi   yð Þ, to the chosen

     population IRC, Pi   yð Þ, and (ii) closeness of the N  estimated abilities, by s estimated from Equation 26 to the actual randomly generated abilities,  y s, employed to

     produce the data. The root integrated mean square error (RIMSE; Ramsay,

     A Quasi-Parametric Method for Fitting 

    18

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    15/30

    1991, p. 621) was used as a measure of the closeness of the estimated IRC, b Pi   yð Þ, to the IRC,  Pi   yð Þ, used for data generation of item   i; i ¼ 1; . . . ; n. Thisis defined as

    RIMSEðiÞIRC ¼

    P Rr ¼1 ^ Pi   €yr 

     Pi   €yr  2jð€yr ÞP Rr ¼1 jð€yr Þ

    264375

    12

    ;   ð31Þ

    where the   €yr ,   r ¼ 1; 2; . . . ; R, are evaluation points that are equally spaced on½3:5; 3:5, and  j ð Þ represents the density function of a standard normal distri-

     bution. Closeness of the estimates, by s, to the actually generated abilities, y s, wasevaluated using the root mean square error 

    RMSEy ¼P N 

     s¼1 by s y s 2 N 

    264375

    12

    ;   ð32Þ

    where N  is the number of examinees.

    Because the rank order of ability estimates are often regarded as more impor-

    tant than their actual values, the Spearman rank correlation coefficient,   rŷ;y,

     between the estimates,

     by s, and the actually generated ability variables,  y s, was

    also used as a measure of equivalence of the estimated and actual ability

    variables.

    4.2. Simulation Study 1

    Comparison of FMP_0 With MML and JML for the 2PL Model 

    The 2PL and FMP_0 IRCs are equivalent. This section compares our FMP_0

    estimates of this IRC with two alternative estimates: the MML estimates

    obtained using MULTILOG (Thissen, Chen, & Bock, 2003) and the JML esti-

    mates obtained using the TESTAT module of the SYSTAT software package

    (Version 10.2). MML estimates were chosen as a gold standard for comparison

    with FMP_0 estimates because they appear to be the most widely employed. As

     pointed out in Subsection 3.3, the FMP_0 estimation procedure may be regarded 

    as a JML estimation procedure truncated after the first iteration. Although JML

    estimates are often regarded less favorably than MML estimates, JML estimates

     provided by an independently written commercial program were also included to

    demonstrate that the FMP_0 estimates, although related, do not have the same

    suboptimal performance as the JML estimates.

    Data for this simulation study were generated using the parameterization of 

    the 2PL IRF defined by Equations 5 and 6. Population parameter values for each

    of the 20 items were chosen randomly. Discrimination parameters,   a j ,

     j ¼ 1; . . . ; 20 were drawn from a uniform distribution   a U ½1:1; 1:8   and the

     Liang and Browne

    19

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    16/30

    difficulty parameters,  b  j , from a normal distribution,  b Nð0; 1Þ, truncated at2:5 and þ2:5.

    In Figure 1, three different estimates of the same IRC are plotted for 4 selected 

    items in one of the samples. The FMP_0 estimate of the IRC was obtained using

    our FMP computer program, the MML estimate of the IRC with MULTILOG

    Version 7 (Thissen et al., 2003), and the JML estimate of the IRC with the TES-

    TAT module of the SYSTAT (Version 10.2) software package. For each item, the

     population IRC is also shown. In general, the MML estimated curve almost coin-

    cides with the population curve, and the FMP_0 estimated curve is very slightly

    further away. This suggests that the FMP_0 surrogate-based IRF estimates (k  ¼0) are almost as good as the MML estimates. In all four diagrams in Figure 1, the

    JML estimated curve is clearly further away from the population curve than is the

    FMP_0 estimated curve. This indicates superiority of the FMP_0 item parameter 

    estimates over the JML estimates. In view of the fact that the FMP procedure is

    a JML algorithm terminated after the first iteration, it appears that the further itera-

    tion is harmful rather than helpful. This finding is concordant with comments in

    Subsection 3.3 and is not surprising because of difficulties associated with maxi-

    mum likelihood estimation when the number of parameters increases as the num-

     ber of examinees increases (Neyman & Scott, 1948).

    true

    FMP_0

    MML

    JML

    true

    FMP_0

    MML

    JML

    true

    FMP_0

    MML

    JML

    true

    FMP_0

    MML

    JML

    Item 1 Item 2

    Item 3

          0 .      0

          0 .      2

          0 .      4

          0 .      6

          0 .      8

          1 .      0

    -4 -2 0 2 4

          0 .      0

          0 .      2

          0 .      4

          0 .      6

          0 .      8

          1 .      0

    -4 -2 0 2 4

    Item 4

    ability, θ

          P     r     o      b     a      b      i      l      i      t     y

    FIGURE 1. Comparisons of estimated IRCs among FMP, MML, and JML. IRC  ¼ itemresponse curve; FMP  ¼   filtered monotonic polynomial; JML ¼   joint maximumlikelihood.

     A Quasi-Parametric Method for Fitting 

    20

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    17/30

    Plots of pairs of measures of closeness of estimated abilities, by s, s ¼ 1; . . . ; 100, to the actual randomly generated abilities,   y s, are given inFigure 2. In this figure, the left plot compares FMP_0 with MML and JML

    in terms of RMSEy  and the right plot compares the rank correlations. MML

    estimates have very slightly smaller (better) RMSEy’s than the FMP_0 esti-mates obtained using Equation 26. The rank correlations from FMP_0 esti-

    mates and from MML estimates are very close. The FMP_0 ability

    estimates produce smaller (better) RMSEy   values and higher (better) rank 

    correlation values than the JML estimates. This finding is in agreement with

    the discussion in Subsection 3.3.

    Means and standard deviations (in parentheses) of accuracy measures over the

    100 generated samples are shown in Table 1. As an overall measure of accuracy

    of estimated IRCs, the average RIMSEIRC  (see Equation 31)

    RIMSEIRC ¼   120

    X20i¼1

    RIMSEðiÞIRC   ð33Þ

    was used. Mean accuracy measures, RIMSEIRC, are shown in the first row. The

    RIMSEIRC measures for FMP_0 and MML are quite close; the measure for MML

     being smaller (better), as can be expected. The RIMSEIRC  measure for JML is

    clearly inferior to (higher than) those of FMP_0 and MML. This observation is

    concordant with the trends visible in Figure 1.

    The second and third rows give the mean RMSEy  and rank correlation mea-sures of accuracy of the ability variable estimates, by, provided by the three esti-mation procedures. Accuracy as measured mean RMSEy is essentially the same

    0.36 0.38 0.40 0.42

       0 .   3

       6

       0 .   3

       8

       0 .   4

       0

       0 .   4

       2

    FMP_0 (RMSE for abilities)

       M   M   L  o  r   J   M   L

    MMLJML

    0.91 0.92 0.93 0.94

       0 .   9

       1

       0 .   9

       2

       0 .   9

       3

       0 .   9

       4

    FMP_0 (rank correlations for abilities)

       M   M   L  o  r   J   M   L

    MMLJML

    FIGURE 2. Comparisons of RMSE y’s for FMP_0, MML, and JML. RMSE  ¼ root mean square error; FMP ¼  filtered monotonic polynomial; JML ¼  joint maximum likelihood.

     Liang and Browne

    21

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    18/30

    for FMP_0 and MML and is somewhat inferior for JML. The mean rank correla-

    tion is essentially the same for the three methods.The overall impression given by this simulation study is that the FMP_0 esti-

    mates are nearly as accurate as MML estimates and are clearly more accurate

    than JML despite the fact that FMP_0 may be regarded as JML truncated after 

    one iteration (also see subsection 3.3).

    4.3. Simulation Study 2

    This simulation study represents the type of situation for which the FMP

    model is intended (see subsection 3.1). The true IRF,   ~ Pi   yð Þ, is unknown to theuser and is approximated by the FMP in Equation 13.

    In this simulation study, the true IRF was chosen to be the cdf of a mixture of 

    two normal distributions:

    ~ P   yj;m1;s1;m2;s2ð Þ ¼ F yjm1;s1ð Þ þ ð1 ÞF yjm2;s2ð Þ ð34Þwhere   is the selection probability and  F yjm; sð Þ represents the cdf of a normaldistribution with mean  m  and variance  s. Values of these parameters for each

    of the   n

     ¼  20 items were generated randomly using the distributions:

    U ½0:3; 0:7,   m1   N  1:5; 0:1ð Þ;s1  N   1; 0:1ð Þ,   m2  N   1:0; 0:1ð Þ, and  s2  N   0:4; 0:1ð Þ.

    The Ramsay TestGraf model, with the default bandwidth   h ¼ 1:120000:2 ¼ 0:24, and FMP_  k  models with k  ¼ 0; . . . ; 4 were fitted to 100 ran-dom samples with  N  ¼ 2,000 and  n ¼ 20. Thus, the simplest FMP model wasFMP_0 (2PL) with 2 parameters and the most complex FMP_4 with 10 para-

    meters, while the ‘‘true’’ model, treated as unknown in all analyses, had 5

     parameters.

    Figure 3 shows IRCs for 4 of the items estimated from one of the samples. Values

    for  k AIC (k  suggested by AIC) are shown in the lower right-hand corner. All  k AICturned out to be equal to 1, not only in the 4 selected items but also in the remaining

    16 items. The true curve, TestGraf curve, and the FMP curves for k AIC are shown. It

    TABLE 1.

     Means and Standard Deviations of Accuracy Measures.

    FMP_0 MML JML

    RIMSEIRCð 0Þ   0.024 (0.001) 0.014 (0.001) 0.076 (0.001)RMSEyð 0Þ   0.382 (0.011) 0.379 (0.011) 0.403 (0.011)Rank Corr   yð Þð 1Þ   0.928 (0.006) 0.928 (0.006) 0.924 (0.006) Note. FMP ¼ filtered monotonic polynomial; MML ¼ maximum marginal likelihood; JML ¼  jointmaximum likelihood; RIMSE ¼ root integrated mean square error; RMSE ¼ root mean square error.

     A Quasi-Parametric Method for Fitting 

    22

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    19/30

    is difficult to compare the TestGraf and FMP curves because they use different

    quantities (h and k ) to control smoothness. In this study, however, TestGraf tends

    to fit the straight lower part of the true curve better and FMP the sharp curve in the

    upper half.

    Figure 4 compares the RMSEy   fit measures for estimated ability by between TestGraf and FMP_ k AIC. The top figure shows that the RMSEy’sof FMP_ k AIC  are better (closer to zero) than those of TestGraf in this simu-

    lation study. In the bottom figure, the rank correlations for FMP_ k AIC   are

    again better (closer to 1) than those for TestGraf. It should be borne in mind,

    however, that the default choice of normalized total test scores for ability

    estimates was used in TestGraf. The alternative iterative facility requires a

    user intervention at each iteration and therefore is not practical in simulation

    experiments. A possible explanation for the poorer results of TestGraf is that

    sample size is  N 

    ¼2,000 and number of items is  n

     ¼20 which would lead to

    many ties in the total scores used to provide ranks for the normalization pro-

    cess. These ties are resolved in TestGraf by generating random orderings

    within ties (cf. subsection 2.2).

    Item 13

    TrueTestGraf 

    FMP_k1_AIC

    TrueTestGraf 

    FMP_k1_AIC

    TrueTestGraf 

    FMP_k1_AIC

    TrueTestGraf 

    FMP_k1_AIC

    Item 14

    Item 15

       0 .   0

       0 .   2

       0 .   4

       0 .   6

       0 .   8

       1 .   0

    -2 -1 0 1 2

       0 .   0

       0 .   2

       0 .   4

       0 .   6

       0

     .   8

       1 .   0

    -2 -1 0 1 2

    Item 16

    ability, θ

       P  r  o   b  a   b   i   l   i   t  y

    FIGURE 3. Comparisons of the estimated IRCs (  N ¼ 2,000). IRC ¼ item response curve.

     Liang and Browne

    23

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    20/30

    Table 2 summarizes accuracy measures for TestGraf, FMP_ k AIC, and 

    FMP_ k AIC. Entries are means and standard deviations (in parentheses) calculated 

    over items and random samples (i.e., 20 100 ¼ 2; 000 observations). The firstrow shows that the three estimation methods yielded essentially the same

    RIMSEIRC, so that there was little to choose in overall accuracy of the three

    methods for approximating the chosen true IRCs. When abilities,   y, are esti-

    mated, the situation changes. It can be seen from row 2 that the mean RMSEy

    was essentially the same for FMP_ k AIC and FMP_ k BIC, but these were noticeably better (smaller) than that for TestGraf. Again row 3 shows that the FMP_ k AIC and 

    FMP_ k BIC yielded essentially the same Rank_Corr (y) which was noticeably bet-

    ter (larger) than that for TestGraf.

    To evaluate how the FMP model performs with a smaller sample size, the

    FMP IRF was also fitted to the first 300 of the 2,000 simulated examinees for 

    each of the 100 simulated samples. Table 3 summarizes the same information

    as in Table 2, but with a sample size of  N ¼ 300 instead of  N ¼ 2,000.As can be expected, the IRC fit measures, RIMSEIRC, in Tables 2 and 3 indi-

    cate less accuracy of estimates when sample size drops from 2,000 to 300. On theother hand, interpretation of IRCs is hardly affected by the reduction of sample

    size. Figure 5 shows IRCs based on   N  ¼   300 for the same 4 items plotted in

    0.45 0.50 0.55 0.60 0.65 0.70 0.75

       0 .   4

       5

       0 .   5

       5

       0 .   6

       5

       0 .   7

       5

     AIC selected model (RMSE for abilities)

       T   E   S

       T   G   R   A   F

    0.70 0.75 0.80 0.85

       0 .   7

       0

       0 .   7

       5

       0 .   8   0

       0 .   8

       5

     AIC selected model (rank correlation for abilities)

       T   E   S   T   G   R

       A   F

    FIGURE 4.  Comparisons between TestGraf and FMP_ k  AIC  of accuracy measures of   

     by.

     FMP

    ¼ filtered monotonic polynomial.

     A Quasi-Parametric Method for Fitting 

    24

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    21/30

    Figure 3 for  N ¼ 2,000. Comparison of Figures 5 and 3 suggests that conclusionsdrawn from IRCs based on  N  ¼ 300 do not differ much from those drawn fromthe corresponding IRCs based on  N ¼ 2,000.

    Differences in ability fit measures, RMSEy   and Rank_Corr (y), between

    Tables 2 and 3 seem sufficiently small to be disregarded.

    4.4. A Numerical Experiment to Investigate the Assumption of a Normal 

     Distribution for  y

    As pointed out by Ramsay (1991, p. 614, equation 6), a change of distributionfor  y  does not affect model fit, provided that it is accompanied by an appropriate

    change in the IRF. Thus, the assumption of a normal distribution for  y is an iden-

    tification condition for the data generation process when the IRFs are uncon-

    strained. The distribution of   y   cannot be estimated unless constraints are

    imposed on the functional form of the IRFs (cf. Woods & Thissen, 2006). It is

    not possible to simultaneously estimate the density of  y  and the item IRFs.

    The FMP methodology proposed here for estimating an IRF specifies a

     N ð0; 1

    Þdistribution of  y  for identification purposes and uses normalized surro-

    gate abilities (see subsection 3.3). Furthermore, when the EAP procedure (Equa-

    tion 26) for obtaining ability estimates, by, is employed, a N ð0; 1Þ   is assumed again as the prior distribution for  y.

    TABLE 2.

     Means and Standard Deviations of RMSEs for TestGraf and FMP.

    TestGraf FMP_  k AIC   FMP_ k BIC

    RIMSEIRC  0ð Þ   0.041 (0.003) 0.042 (0.003) 0.042 (0.004)RMSEy  0ð Þ   0.707 (0.048) 0.481 (0.012) 0.482 (0.012)Rank Corr ðyÞ 1ð Þ   0.769 (0.037) 0.834 (0.012) 0.835 (0.012) Note.  N  ¼ 2,000. FMP ¼ filtered monotonic polynomial; RIMSE ¼ root integrated mean squareerror; RMSE ¼ root mean square error.

    TABLE 3.

     Means and Standard Deviations of RMSEs for TestGraf and FMP.

    TestGraf FMP_  k AIC   FMP_ k BIC

    RIMSEIRC  0ð Þ   0.069 (0.008) 0.064 (0.009) 0.075 (0.007)RMSEy  0ð Þ   0.695 (0.076) 0.492 (0.021) 0.492 (0.021)(Rank Corr ðyÞ 1ð Þ   0.763 (0.050) 0.828 (0.027) 0.829 (0.027) Note. N ¼ 300. FMP ¼ filtered monotonic polynomial; RIMSE ¼ root integrated mean square error;RMSE ¼ root mean square error.

     Liang and Browne

    25

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    22/30

    We shall demonstrate by means of a numerical example that if 

    i. in an artificially constructed population, the generation distribution used for  y  is

    nonnormal (e.g., bimodal) and simultaneously all IRFs are generated as 2PL

    ii. the identification condition that y  is normal is used in the FMP estimation proce-

    dure by normalizing the surrogates then

    iii. the resulting unconstrained estimates of the item IRFs are not 2PL.

    This result is stated at the population level but is investigated here using two

    finite, but very large ( N  ¼   100,000) data sets, regarded as finite pseudo- populations. These are used to demonstrate the effect of changing the distribution

    chosen for y without changing the IRCs for  n ¼ 20 items. In one data set, referred to as DS-B, the distribution used for generating y  is chosen to be symmetric and 

    strongly bimodal with a mean of 0 and a standard deviation of 1. This bimodal

    distribution is generated by the mixture of a N ð2; 51=2Þ   and an independent N ð2; 51=2Þ with a probability of .5 for selecting each component distribution.For the other data set, referred to as DS-N, the distribution used for generating  y

    Item 13

    True

    TestGraf 

    FMP_k1_AIC

    True

    TestGraf 

    FMP_k2_AIC

    True

    TestGraf 

    FMP_k1_AIC

    True

    TestGraf 

    FMP_k1_AIC

    Item 14

    Item 15

       0 .   0

       0 .   2

       0 .   4

       0 .   6

       0 .   8

       1 .   0

    -2 -1 0 1 2

       0 .   0

       0 .   2

       0 .   4

       0 .   6

       0 .   8

       1 .   0

    -2 -1 0 1 2

    Item 16

    ability, θ

       P  r  o   b  a   b   i   l   i   t  y

    FIGURE 5. Comparisons of the estimated IRCs (  N ¼ 300). IRC ¼ item response curve.

     A Quasi-Parametric Method for Fitting 

    26

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    23/30

    is N ð0; 1Þ. Item scores, y s, for the two data sets are generated as described in sub-section 4.1 using a 2PL (FMP_0) IRF defined by Equations 5 and 6 for each item.

    Item parameter values are equal across the two data sets for each of the 20

    items. The only difference in the generation process for the two data sets is that

     bimodal y’s are used for DS-B and normal  y’s for DS-N. Superimposed kernel-

    smoothed density functions for  y  in the two data sets are shown in Figure 6.

    Both data sets are then analyzed in the same way using the FMP method 

    described in Subsection 3.3. For item parameter estimation in both DS-B and 

    DS-N, k i ¼ 2 is chosen for all items to yield equally flexible IRCs in the two datasets. Thus, the y distribution identification condition used when generating DS-N

    matches the y distribution identification condition made in its analysis. However,

    the  y  distribution identification condition used when generating DS-B conflicts

    with the y  distribution identification condition made in its analysis. Because the

    two data sets employ the same items and are analyzed in exactly the same man-

    ner, any differences in estimated IRFs can be attributed to the conflict of identi-

    fication conditions for the distribution of  y  in the analysis of DS-B.

    Superimposed IRCs obtained from the two data sets are shown for 4 of the items

    in Figure 7. In all four figures, the estimated B-IRC does not coincide with the esti-

    mated N-IRC although the same IRF was used at the generation stage. This is due

    to the conflict in identification conditions on the distribution of  y  in DS_B at the

    generation stage with those at the estimation stage. There is no such conflict in

    DS-N. The distortion of B-IRC in the two figures in the first row occurs with items

    of medium difficulty and is hardly noticeable. In the second row, the distortion is

    more visible and occurs with items of high and of low difficulty. Thus, the differ-

    ence in identification conditions on  y  employed at the generation and estimation

    stages can affect different types of items in different ways.

     Ability, θ

       D  e  n  s   i   t  y

    Bimodal θ

    Normal θ

    -4 -2 0 2 4

       0 .   0

       0 .   1

       0 .   2

       0 .   3

       0 .   4

       0 .   5

    FIGURE 6. Superimposed normal and bimodal densities for  y.

     Liang and Browne

    27

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    24/30

    Without knowledge of the generation process, the two B-IRCs in the second 

    row could easily be misinterpreted as indicating that a 2PL IRF is inappropriate

    for the data. There is, however, a way of detecting (without prior knowledge) a

    difference between the   y  distribution at the generation stage from the known

    assumption of a normal distribution at the estimation stage. This is to obtain EAP

    estimates, by s, of the abilities using Equation 26 and estimate their density usingkernel smoothing. Figure 8 shows kernel smoothed plots of densities for these

    ability estimates obtained from data sets N and B. The density of  by  from DS-Bis clearly bimodal, although not as noticeably as that in Figure 6, and the densityof  by  from DS-N is essentially normal.

    In summary, Figures 7 and 8 indicate that a conflict of distribution assump-

    tions affects both the estimated IRFs and the distribution of ability estimates.

    5. An Example Using Actual Data

    In the previous section, the FMP approach was shown to be useful by means of 

    simulation studies. Here, FMP and FUP models will be applied to an actual data

    set that is included with the TestGraf distribution (Ramsay, 2000). The FMP and 

    FUP results will be compared with those from the TestGraf program that does not

    impose monotonicity requirements on the estimated IRCs. The data set consists

    Bimodal θ

    Normal θ

    Bimodal θ

    Normal θ

    Bimodal θ

    Normal θ

       0 .   0

       0 .   2

       0 .   4

       0 .

       6

       0 .   8

       1 .   0

    -4 -2 0 2 4

       0 .   0

       0 .   2

       0 .   4

       0 .   6

       0 .   8

       1 .   0

    -4 -2 0 2 4

    Bimodal θ

    Normal θ

     Ability, θ

       P  r  o   b  a   b   i   l   i   t  y

    k= 2

    FIGURE 7. Examples of estimated IRCs when density of   y  is either bimodal or normal.

     IRC ¼ item response curve.

     A Quasi-Parametric Method for Fitting 

    28

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    25/30

    of 379 students’ responses to an examination with 100 four-option multiple-

    choice questions which was given for Psychology 101, an introductory psychol-

    ogy course.The data in the original file were recoded dichotomously with a ‘‘1’’ for a

    correct response and a ‘‘0’’ otherwise. Missing responses were treated as incor-

    rect responses. To decide on the degree of the polynomial, models were fitted 

    sequentially with  k  ¼ 0; 1; . . . ; 4 yielding corresponding polynomials of degree1, 3, . . . , 9 and the optimal values of  k  suggested by both the AIC and BIC were

    recorded. This was done independently for the FMP and FUP. The default value

    h ¼ 1:1 3790:2 ¼ 0.34 of the bandwidth was employed for TestGraf.FMP, FUP, and TestGraf IRFs may all be regarded as different regressions

    with the probability of passing an item as independent variable on the surrogateability score, ~y, as independent variable. In order to provide a graphical represen-

    tation of the relationship between data and the IRCs, reference points were

     plotted on the same graph as the IRCs. To obtain these points, a truncated ability

    range of [3, 3] was first divided into 12 intervals of length .5. Corresponding toeach interval, a single reference point (y, p) was obtained with y equal to the mid-

     point of the interval and  p  equal to the proportion of examinees with surrogate

    abilities,

     by s, in the interval who correctly answered the item. If any interval was

    empty, the corresponding reference point was omitted. These reference points are

    valid for the FMP and FUP IRCs for all k  because the same surrogate abilities areused. For convenience, these reference points could also be used for the IRC from

    TestGraf that uses different surrogate values.

    k=2

    EAP θ^

       D  e  n  s   i   t  y

    Bimodal θ

    Normal θ

    -4 -2 0 2 4

       0 .   0

       0 .   1

       0 .   2

       0 .   3

       0 .   4

       0 .   5

    FIGURE 8. Superimposed densities of estimates by from DS-B and DS-N.

     Liang and Browne

    29

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    26/30

    IRC plots for some selected non-2PL items from the introductory psychology test

    are shown in Figure 9. For each item, three IRCs are plotted: (i) TestGraf, (ii)

    FMP_ k AICwhere k AICyields the lowest AIC for FMP, and (iii) FUP_ k AIC, where k AICyields the lowest AIC for FUP. (In the legend, FMP-k1 stands for FMP_ k AIC¼ 1and so on.) For each item, the reference points are represented by small circles.

    The following observations may be made from Figure 9. The unconstrained 

    FUP_ k AIC and TestGraf curves tend to be similar, but the FUP curves tend to undu-

    late more smoothly and the TestGraf curves to wiggle more. It is difficult to say

    whether or not this difference is due to inherent properties of the two fitting meth-

    ods or to the different criteria,  k  and  h, for controlling flexibility in FUP and Test-

    Graf (cf. Items 69 and 96). Also it is of interest to inspect closeness of monotonic

    FMP_ k AIC curves to nonmonotonic FUP_ k AIC and TestGraf curves. Note that Item

    96 appears to be a problematic item. The IRCs from both TestGraf and FUP show

    that the probability of correctly answering this item decreases as ability increases.

    With the constraint of monotonicity, the FMP IRC comes out as a flat line.

    TestGraf FMP-k0FUP-k1

    Item 3

    TestGraf FMP-k1FUP-k1

    Item 5

    TestGraf FMP-k1FUP-k2

    Item 13

    TestGraf FMP-k0FUP-k1

    Item 22

    TestGraf FMP-k1FUP-k0

    Item 24

    TestGraf FMP-k0FUP-k2

    Item 39

    TestGraf FMP-k1

    FUP-k1

    Item 69

    TestGraf FMP-k1

    FUP-k1

    Item 74

    TestGraf FMP-k0

    FUP-k2

       0 .   0

       0 .   2

       0 .   4

       0 .   6

       0 .   8

       1 .   0

       0 .   0

       0 .   2

       0 .   4

       0 .   6

       0 .   8

       1 .   0

    -3 -2 -1 0 1 2 3   0 .   0

       0 .   2

       0 .   4

       0 .   6

       0 .   8

       1 .   0

    -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3

    Item 96

    ability

       P  r  o   b  a   b   i   l   i   t  y

    FIGURE 9. Estimated IRCs for Psychology 101 data. IRC  ¼   item response curve.

     A Quasi-Parametric Method for Fitting 

    30

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    27/30

    Figure 10 provides plots of the estimates by of abilities from TestGraf againstthose from FMP_ k AIC and FMP_ k BIC. The two plots are similar and in both cases,

    estimated abilities from TestGraf are slightly lower than those from FMP at low

    values and slightly higher at high values. In both cases, the TestGraf  y  estimates

    are close to the FMP  y  estimates.

    6. Summary and Conclusions

    General filtered polynomial (FMP/FUP) approaches for constructing a flex-

    ible IRF have been developed. The model is quasi-parametric because the para-

    meters involved are not intended for interpretation. Their main function is to

    define a flexible IRF that simultaneously (i) produces graphical displays of 

    deviations from the usually assumed S-shape and (ii) is easily portable to future

    examinees not present in the calibration sample. Although the usual property of 

    monotonicity of an IRF is imposed in FMP, the monotonicity constraints are dis-

    carded in FUP to provide a filtered unconstrained polynomial IRC that need not

     be monotonic but is still bounded by 0 and 1.

    The IRCs developed are intended for visual inspection to obtain diagnostic

    information about deviant items. This will be helpful for detecting unsatisfactory

    items when constructing ability tests. Another potential application will be in the

    analysis of psychopathology scales (Meijer & Baneke, 2004) where the usual

    assumptions made for ability tests are no longer applicable. Furthermore, the

    FUP facility will be useful for providing option response curves for incorrect

    options in multioption tests.

    Computational procedures have been developed for estimation purposes and a

    computer program, FMP, written in FORTRAN 90.2 Monotonicity constraints

    are imposed by means of a reparameterization. This methodology has been tried 

    out in two simulation studies and on an actual example and found to compare

    favorably with existing methods. In Simulation Study 1 where the true IRC was

    FIGURE 10. Comparison of    by0 s from TestGraf and FMP (Psychology 101 data).

     Liang and Browne

    31

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    28/30

    a 2PL (or, equivalently, FMP_0), the FMP IRCs were very close to those from

    the gold standard MML and were clearly superior to those from JML. This is

    reassuring because the FMP_0 algorithm may be regarded as a first iteration

    of JML and difficulties with JML are recognized. In Simulation Study 2, where

    a nonstandard IRF was used for the generating model, the FMP approach yielded 

    as good an approximation to the actual generating IRF as the well-known non-

     parametric method, implemented in the program TestGraf, and clearly more

    accurate estimates of the abilities y s. In the actual example, the current approach

    compares favorably with TestGraf but has the additional advantages of being

    able to produce either monotonic increasing or nonmonotonic IRCs as well as

    easily portable IRFs. Although the current article has concentrated on the use

    of a logistic filter, the theory presented can easily be adapted to the use of other 

    filters such as those derived from normal, beta, or gamma ogives.

    Acknowledgments

    The authors are grateful to Michael Edwards, Steven MacEachern, the editor, and the

    reviewers for their thought provoking comments and helpful suggestions.

    Declaration of Conflicting Interests

    The author(s) declared no potential conflicts of interest with respect to the research,

    authorship, and/or publication of this article.

    FundingThe author(s) disclosed receipt of the following financial support for the research, author-

    ship, and/or publication of this article: This research was supported in part by NSF grant

    SES-0437251. It was carried out in partial fulfillment of the requirements for the first

    author’s PhD degree in quantitative psychology at the Ohio State University with the

    second author as advisor.

    Notes

    1. Hayley (1952) suggested multiplication of the logit by  D ¼ 1.702 to approxi-mate the Normal Ogive.

    2. The program, FMP, is being prepared for distribution on the Internet. Please

    address all inquiries to the first author.

    Supplementary Material

    The online appendices are available at http:/jeb.sagepub.com/supplemental.

    References

    Akaike, H. (1973). Information theory and an extension of the maximum likelihood prin-

    ciple. In B. N. Petrox & F. Caski (Eds.),  Second international symposium on informa-tion theory   (pp. 267–281). Budapest, Hungary: Akademiai Kiado.

    Baker, F. B. (1992).  Item response theory: Parameter estimation techniques. New York,

     NY: Marcel Dekker.

     A Quasi-Parametric Method for Fitting 

    32

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    29/30

    Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s

    ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores

    (pp. 399–402). Reading MA: Addison-Wesley.

    Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item para-

    meters: Application of an EM algorithm.  Psychometrika,  46 , 443–459.Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously

    scored items.  Psychometrika,  35, 179–197.

    Bock, R. D., & Moustaki, I. (2007). Item response theory in a general framework. In C. R.

    Rao & S. Sinharay (Eds.),   Handbook of statistics, volume 26: Psychometrics   (pp.

    469–514). Amsterdam, The Netherlands: North-Holland.

    Browne, M. W. (2000). Cross-validation methods. Journal of Mathematical Psychology,

    44, 108–132.

    Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: Understanding AIC

    and BIC in model selection.   Sociological Methods and Research, 33, 261–304.

    Cudeck, R., & Henly, S. J. (1991). Model selection in covariance structures analysis

    and the ‘‘problem’’ of sample size: A clarification.   Psychological Bulletin,   109,

    512–519.

    De Leeuw, J. (1992). Introduction to Akaike (1973) information theory and an extension

    of the maximum likelihood principle. In S. Kotz & N. L. Johnson (Eds.),   Break-

    throughs in statistics (Vol. 1, pp. 599–609). London, England: Springer-Verlag.

    Drasgow, F., Levine, M. V., Williams, B., McLaughlin, M. E., & Candell, G. L. (1989).

    Modeling incorrect responses to multiple-choice items with multilinear formula score

    theory. Applied Psychological Measurement ,  13, 285–299.

    Duncan, K. A., & MacEachern, S. N. (2008). Nonparametric Bayesian modeling for item

    response. Statistical Modeling , 8, 41–66.

    Duncan, K. A., & MacEachern, S. N. (2013). Nonparametric Bayesian modeling for item

    response with a three parameter logistic prior mean. In M. C. Edwards & R. C.

    MacCallum (Eds.),   Current topics in the theory and application of latent variable

    methods. New York, NY: Routledge.

    Elphinstone, C. D. (1983). A target distribution model for nonparametric density estima-

    tion.  Communications in Statistics—Theory and Methods,  12, 161–198.

    Elphinstone, C. D. (1985). A method of distribution and density estimation (Unpublished 

    dissertation). University of South Africa, Pretoria, South Africa.

    Hayley, D.C. (1952). Estimation of the dosage mortality relationship when the dose is sub- ject to error . (Technical Report No. 15). Stanford, CA: Stanford University, Applied 

    Mathematics and Statistics Laboratory.

    Hawkins, D. M. (1994). Fitting monotonic polynomials to data.  Computational Statistics,

    9, 233–247.

    Lee, Y.-S. (2007). A comparison of methods for nonparametric estimation of item char-

    acteristic curves for binary items. Applied Psychological Measurement , 31, 121–134.

    Lord, F. M., & Novick, M. R. (1968).  Statistical theories of mental test scores. Reading,

    MA: Addison Wesley.

    Meijer, R. R., & Baneke, J. J. (2004). Analyzing psychopathology items: A case for non-

    metric item response theory modeling. Psychological Methods,  9, 354–368. Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent

    observations. Econometrika,  16 , 1–32.

     Liang and Browne

    33

     at Alexandru Ioan Cuza on February 8, 2015http://jebs.aera.netDownloaded from 

    http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/http://jebs.aera.net/

  • 8/9/2019 Journal of Educational and Behavioral Statistics-2015-Liang-5-34

    30/30

    Ramsay, J. O. (1977). Monotonic weighted power transformations to additivity. Psycho-

    metrika,  42, 83–109.

    Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic

    curve estimation. Psychometrika,  56 , 611–630.

    Ramsay, J. O. (2000). TestGraf: A program for the graphical analysis of multiple choicetest and questionnaire data  [Computer program and manual]. Retrieved from http://

    www.psych.mcgill.ca/faculty/ramsay/ramsay.html

    Ramsay, J. O., & Winsberg, S. (1991). Maximum marginal likelihood estimation for semi-

     parametric item analysis. Psychometrika,  56 , 365–379.

    Schwarz, G. (1978). Estimating the dimension of a model.   The Annals of Statistics,   6 ,

    461–464.

    Sinnott, L. T. (1997). Filtered polynomial density approximations and their application to

    discriminant analysis (MS Thesis). The Ohio State University, Columbus, OH.

    Thissen, D., Chen, W.-H, & Bock, R. D. (2003). Multilog (version 7) [Computer soft-

    ware]. Lincolnwood, IL: Scientific Software International.

    Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two cate-

    gories. In D. Thissen & H. Wainer (Eds.),  Test scoring  (pp. 73–140). Mahwah, NJ:

    Lawrence Erlbaum.

    Woods, C. M., & Thissen, D. (2006). Item response theory with estimation of the latent

     population distribution using spline-based densities. Psychometrika,  71, 281–301.

    Authors

    LONGJUAN LIANG is a psychometric manager at Educational Testing Service,Rosedale Rd, Princeton, NJ 08822; e-mail: [email protected]. Her research interests