

Psychometrika, vol. 83, no. 3, 515–537, September 2018. https://doi.org/10.1007/s11336-018-9629-6

HYPOTHESIS TESTING OF THE Q-MATRIX

Yuqi Gu

UNIVERSITY OF MICHIGAN

Jingchen Liu

COLUMBIA UNIVERSITY

Gongjun Xu

UNIVERSITY OF MICHIGAN

Zhiliang Ying

COLUMBIA UNIVERSITY

The recent surge of interest in cognitive assessment has led to the development of cognitive diagnosis models. Central to many such models is a specification of the Q-matrix, which relates items to latent attributes that have natural interpretations. In practice, the Q-matrix is usually constructed subjectively by the test designers. This could lead to misspecification, which could result in lack of fit of the underlying statistical model. To test possible misspecification of the Q-matrix, traditional goodness of fit tests, such as the Chi-square test and the likelihood ratio test, may not be applied straightforwardly due to the large number of possible response patterns. To address this problem, this paper proposes a new statistical method to test the goodness of fit of the Q-matrix, by constructing test statistics that measure the consistency between a provisional Q-matrix and the observed data for a general family of cognitive diagnosis models. Limiting distributions of the test statistics are derived under the null hypothesis and can be used to obtain the test p-values. Simulation studies as well as a real data example are presented to demonstrate the usefulness of the proposed method.

Key words: Q-matrix, diagnostic classification models, hypothesis testing.

1. Introduction

Cognitive diagnosis models (CDMs), also called diagnostic classification models, have recently gained prominence in educational assessment, psychiatric evaluation, and many other disciplines (Rupp and Templin, 2008b). Instead of an overall test score, a cognitive diagnostic test provides each subject with a profile detailing the concepts and skills (often called attributes) that he/she has mastered. For instance, teachers want to identify students’ mastery of different skills (attributes) based on their answers (responses) to exam questions (items). Such feedback could have a significant impact on the learning process by providing students and teachers with detailed information on students’ strengths and weaknesses.

A central component of CDMs is the Q-matrix, in which each row corresponds to an item and each column corresponds to an attribute. The Q-matrix specifies the item–attribute relationship by defining the entry in the $j$th row and $k$th column as 1 if item $j$ requires attribute $k$ and 0 otherwise. By incorporating the latent attributes with a specific test construction, the Q-matrix plays an important role in the development and analysis of a cognitive diagnosis test. Different CDMs have been built based on the Q-matrix. A short list includes the conjunctive DINA and NIDA models (Junker and Sijtsma, 2001; Tatsuoka, 2002; de la Torre and Douglas, 2004), the reparameterized unified/fusion model (RUM) (DiBello et al., 1995; Hartz, 2002; Templin et al., 2003), the compensatory DINO and NIDO models (Templin and Henson, 2006; Templin, 2006), the rule space method (Tatsuoka, 1985, 2009), the attribute hierarchy method (Leighton et al., 2004), and Generalized DINA models (de la Torre, 2011); see also von Davier (2005, 2008), Henson et al. (2009), Rupp et al. (2010) for more developments and approaches to cognitive diagnosis.

Electronic supplementary material: The online version of this article (https://doi.org/10.1007/s11336-018-9629-6) contains supplementary material, which is available to authorized users.

Correspondence should be made to Gongjun Xu, Department of Statistics, University of Michigan, 456 West Hall, 1085 South University, Ann Arbor, MI 48109, USA. Email: [email protected]

© 2018 The Psychometric Society

In applications, a Q-matrix is constructed along with the test items by the test designers. Based on such a prespecified matrix, there is a growing literature on the statistical inference of CDMs that addresses the issues of item parameter estimation (see, e.g., Rupp, 2002; Henson and Templin, 2005; Roussos et al., 2007; Stout, 2007). Having a correctly specified Q-matrix is crucial both for item parameter estimation and for the assessment of subjects’ underlying attributes. However, a prespecified Q-matrix is sometimes inaccurate. A misspecified matrix could lead to substantial lack of fit and, consequently, erroneous assessment (Rupp and Templin, 2008a; de la Torre and Douglas, 2004). Therefore, it is desirable to develop a validation method, which is the focus of this work.

A problem related to this study is the estimation of the Q-matrix. Liu et al. (2013) provide the theoretical foundation by proving the identifiability of the Q-matrix based on the data for the DINA model. Chen et al. (2015) and Xu and Shang (2018) further extend the theoretical results to more general settings. Estimation methods have also been developed (Liu et al., 2012, 2013). Computationally more efficient methods based on regularized likelihood are proposed in Chen et al. (2015) and Xu and Shang (2018). Other recent developments of Q-matrix misspecification detection methods can be found in de la Torre (2008), DeCarlo (2012), Chiu (2013), de la Torre and Chiu (2016) and many others.

This paper considers the hypothesis testing problem with a provisional Q-matrix, denoted by $Q_0$. This matrix may be prespecified by the test maker or estimated based on the data. Of interest is whether $Q_0$ along with a parametric diagnostic class model fits the data well. Statistically, this problem can be formulated as a pure significance test of the null hypothesis

$$H_0 : Q = Q_0. \tag{1}$$

There is no specific alternative hypothesis. This test could serve multiple purposes in practice. In the case that $Q_0$ is specified based on subjective knowledge, this test verifies whether the knowledge is consistent with the data. Furthermore, most estimation methods mentioned in the previous paragraph are based on some local search optimization or Monte Carlo algorithms. Such algorithms (such as local greedy search and Markov chain Monte Carlo) may be trapped at some local optimal points; see, for instance, Sect. 3.1 of Liu et al. (2012). For a $Q_0$ obtained from one of those algorithms, this test would help identify whether $Q_0$ fits the data well and provide the uncertainty level of accepting or rejecting $Q_0$. It also assesses the appropriateness of the entire model parameterization, such as the compensatory or non-compensatory relationship among attributes.

Testing (1) under general CDMs is a challenging problem. Traditional goodness of fit tests (such as the Chi-square test and likelihood ratio test) cannot be directly applied. This is mostly due to the slow convergence of the distributions of the test statistics to their asymptotic distributions. Such asymptotic distributions are developed based on the asymptotic normality of the multinomial distribution. In the context of cognitive diagnosis, the number of possible response patterns is often much larger than the sample size. Thus, the large sample theories of the test statistics are often practically inapplicable. For instance, for a test of 20 items, there are $2^{20} \approx 1.0 \times 10^6$ possible item response vectors. The sample sizes of most empirical studies are on the order of a few thousand or less, for which the usual large sample theory fails to apply. In the literature of contingency table analysis, this issue has been known and studied by various researchers (e.g., Bartholomew and Tzamourani, 1999; Tollenaar and Mooijaart, 2003; Maydeu-Olivares, 2001; Maydeu-Olivares and Joe, 2005; Cai et al., 2006).

This paper proposes a test statistic that is based on the theoretical framework proposed in Liu et al. (2012, 2013), Xu (2017) and Xu and Shang (2018) to study the identifiability of the model parameters and the Q-matrix under various CDMs. The proposed method can be applied to all CDMs in the literature, without requiring the model parameters to be known in advance. The test statistic also shares a similar spirit with the limited information statistics (Maydeu-Olivares and Joe, 2005; Cai et al., 2006) by using the residuals of margins. Unlike the previous studies, this paper focuses on testing the Q-matrix of the diagnostic classification models, and the test statistic directly provides an objective function for detecting misspecifications of the Q-matrix (Liu et al., 2012). The asymptotic distribution of the test statistic under the null hypothesis is derived, and the test algorithm is provided with explicit formulas to calculate the limiting distribution, which is applicable to reasonably small samples. Simulation studies are also presented to assess the performance of the proposed test procedure under various CDM settings. It is worth pointing out that our procedure is applicable to general CDMs with both complete and incomplete Q-matrices and thus has broad applicability.

The remainder of this paper is organized as follows. Section 2 provides a brief introduction to CDMs. Section 3 introduces the testing procedure and the asymptotic distribution of the test statistic. Section 4 presents simulation results to help assess the performance of the proposed procedure. The fraction subtraction data are analyzed in Sect. 5. A discussion is given in Sect. 6, and finally, technical derivations and additional simulation results are provided in the online supplementary material.

2. Model Setup

Consider the situation that $N$ subjects (examinees) take a test consisting of $J$ items (questions). Assume that the responses are binary, so that the data can be expressed as an $N \times J$ matrix with entries 0 or 1. A diagnostic model envisions $K$ attributes that are related to both the subjects and the items. The Q-matrix establishes a link between the items and the attributes. In particular, $Q = (q_{jk})_{J \times K}$ is a $J \times K$ matrix with binary entries. For each pair of $j$ and $k$, $q_{jk} = 1$ indicates that item $j$ requires attribute $k$, and $q_{jk} = 0$ otherwise. Let $\mathbf{q}_j$ denote the $j$th row of $Q$.

For a subject, let $\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_K)^\top$ be the vector of attributes (attribute profile), where $\alpha_k = 1$ or $0$, indicating the mastery or non-mastery of the $k$th attribute. In addition, let $\mathbf{R} = (R_1, \ldots, R_J)^\top$ denote the vector of responses to the $J$ test items. Given a specific subject’s attribute profile $\boldsymbol\alpha$, the response $R_j$ to item $j$ under the corresponding model follows a Bernoulli distribution with parameter $\theta_{j,\boldsymbol\alpha} := P(R_j = 1 \mid Q, \boldsymbol\alpha)$, i.e., $\theta_{j,\boldsymbol\alpha}$ is the probability of providing a positive response to item $j$ for a subject with attribute profile $\boldsymbol\alpha$. Note that the specific form of $\theta_{j,\boldsymbol\alpha}$ depends on the Q-matrix, the item parameter vector $\boldsymbol\theta$, and the model parameterization. Conditional on $\boldsymbol\alpha$, we assume that $R_1, \ldots, R_J$ are jointly independent.

Let $\boldsymbol\alpha_i = (\alpha_{i,1}, \ldots, \alpha_{i,K})^\top$ and $\mathbf{R}_i = (R_{i,1}, \ldots, R_{i,J})^\top$ be the $i$th subject’s attribute profile and response vector, respectively. The observed data are $\mathbf{R}_1, \ldots, \mathbf{R}_N$ but not the latent $\boldsymbol\alpha_1, \ldots, \boldsymbol\alpha_N$. The subjects’ attribute profiles are further assumed to independently follow a categorical distribution with $p_{\boldsymbol\alpha} = P(\boldsymbol\alpha_i = \boldsymbol\alpha)$ and $\mathbf{p} = (p_{\boldsymbol\alpha} : \boldsymbol\alpha \in \{0,1\}^K)^\top$. Let the first element of $\mathbf{p}$ be $p_{\mathbf{0}} = P(\boldsymbol\alpha_i = \mathbf{0})$ and define the rest of $\mathbf{p}$ as $\mathbf{p}^*$, that is, $\mathbf{p} = (p_{\mathbf{0}}, (\mathbf{p}^*)^\top)^\top$. Let $\mathbf{0}$ and $\mathbf{1}$ denote the zero vector and the one vector, i.e., $(0, \ldots, 0)^\top$ and $(1, \ldots, 1)^\top$, respectively. The following examples illustrate the above model construction.

Example 1. (DINA Model and DINO Model) The DINA model was proposed by Junker and Sijtsma (2001). For each item $j$ and attribute vector $\boldsymbol\alpha$, define the ideal response

$$\xi^{DINA}_{j,\boldsymbol\alpha}(Q) = \prod_{k=1}^{K} (\alpha_k)^{q_{jk}} = \mathbf{1}(\alpha_k \ge q_{jk} \text{ for all } k), \tag{2}$$

that is, whether $\boldsymbol\alpha$ has all the attributes required by item $j$. For each item, there are two additional parameters $s_j$ and $g_j$ that are known as the slipping and guessing parameters. The positive response probability $\theta_{j,\boldsymbol\alpha}$ takes the form

$$\theta^{DINA}_{j,\boldsymbol\alpha} = (1 - s_j)^{\xi^{DINA}_{j,\boldsymbol\alpha}(Q)} \, g_j^{\,1 - \xi^{DINA}_{j,\boldsymbol\alpha}(Q)}.$$

The item parameter vector is $\boldsymbol\theta = \{s_j, g_j : j = 1, \ldots, J\}$. The DINA model assumes a conjunctive (non-compensatory) relationship among attributes, meaning that it is necessary to possess all the attributes indicated by the Q-matrix to be capable of providing a positive response.

The DINO model was proposed by Templin and Henson (2006), and it assumes a compensatory relationship among attributes by defining the ideal response as $\xi^{DINO}_{j,\boldsymbol\alpha}(Q) = \mathbf{1}(\alpha_k \ge q_{jk} \text{ for at least one } k)$. With slipping and guessing parameters $s_j$ and $g_j$, the positive response probability under DINO takes the form

$$\theta^{DINO}_{j,\boldsymbol\alpha} = (1 - s_j)^{\xi^{DINO}_{j,\boldsymbol\alpha}(Q)} \, g_j^{\,1 - \xi^{DINO}_{j,\boldsymbol\alpha}(Q)}.$$
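The two response functions in Example 1 can be sketched in a few lines of Python. This is only an illustration: the function names, the item `q`, and the slipping and guessing values are assumptions made for the sketch. The DINO indicator is implemented in its usual reading, namely that the subject masters at least one attribute that the item actually requires.

```python
from itertools import product

def ideal_dina(alpha, q):
    """1 if alpha masters every attribute required by q, else 0 (Eq. 2)."""
    return int(all(a >= r for a, r in zip(alpha, q)))

def ideal_dino(alpha, q):
    """1 if alpha masters at least one attribute required by q, else 0."""
    return int(any(a == 1 and r == 1 for a, r in zip(alpha, q)))

def positive_prob(xi, slip, guess):
    """theta = (1 - s)^xi * g^(1 - xi) for an ideal response xi in {0, 1}."""
    return (1 - slip) ** xi * guess ** (1 - xi)

q = (1, 1)                # hypothetical item requiring both attributes
slip, guess = 0.2, 0.2    # illustrative parameter values
for alpha in product((0, 1), repeat=len(q)):
    th_dina = positive_prob(ideal_dina(alpha, q), slip, guess)
    th_dino = positive_prob(ideal_dino(alpha, q), slip, guess)
    print(alpha, th_dina, th_dino)
```

For the profile (1, 0), the DINA ideal response is 0 while the DINO ideal response is 1, which is the conjunctive versus compensatory distinction described above.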

Example 2. (ACDM and GDINA Model) The additive cognitive diagnosis model (ACDM) and the generalized DINA (GDINA) model were proposed in de la Torre (2011). ACDM models the main effects of the required attributes per item, while GDINA models all the possible interaction effects among them. Under these two models, the positive response probability of attribute profile $\boldsymbol\alpha$ to item $j$ takes the following forms:

$$\theta^{ACDM}_{j,\boldsymbol\alpha} = \beta_{j0} + \sum_{k=1}^{K} \beta_{jk}\,(q_{jk}\alpha_k);$$

$$\theta^{GDINA}_{j,\boldsymbol\alpha} = \beta_{j0} + \sum_{k=1}^{K} \beta_{jk}\, q_{jk}\alpha_k + \sum_{k=1}^{K-1} \sum_{k'=k+1}^{K} \beta_{jkk'}\,(q_{jk}\alpha_k)(q_{jk'}\alpha_{k'}) + \cdots + \beta_{j12\cdots K} \prod_{k=1}^{K} (q_{jk}\alpha_k).$$

Note that not all $\beta$’s in the above equation are included in the model. For instance, when $\mathbf{q}_j \neq \mathbf{1}^\top$, we do not need the parameter $\beta_{j12\cdots K}$ since $\prod_k (q_{jk}\alpha_k) = 0$. The intercept $\beta_{j0}$ represents the probability of a positive response when none of the required attributes is present; when $q_{jk} = 1$, $\beta_{jk}$ is included in the model and it shows the main effect of mastering a single attribute $\alpha_k$; when $q_{jk} = q_{jk'} = 1$, $\beta_{jkk'}$ is in the model and it shows the interaction effect due to the mastery of both $\alpha_k$ and $\alpha_{k'}$; similarly, when $\mathbf{q}_j = \mathbf{1}^\top$, $\beta_{j12\cdots K}$ represents the interaction effect of mastery of all the required attributes.
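As a rough sketch of the two formulas in Example 2, the code below stores the $\beta$ coefficients of one item in a dictionary keyed by tuples of zero-based attribute indices, with the empty tuple holding the intercept. The storage scheme, the function names, and the numerical values are assumptions made for illustration. A term contributes only when every attribute in its index set is both required by the item and mastered by the subject, which is exactly when the corresponding product of $q_{jk}\alpha_k$ equals 1.

```python
from itertools import combinations

def gdina_prob(alpha, q, beta):
    """Sum beta[S] over every subset S of attributes that are both
    required by the item (q_k = 1) and mastered (alpha_k = 1).
    beta maps a tuple of attribute indices to an effect; () is the intercept."""
    active = [k for k, (a, r) in enumerate(zip(alpha, q)) if a == 1 and r == 1]
    total = 0.0
    for size in range(len(active) + 1):
        for subset in combinations(active, size):
            total += beta.get(subset, 0.0)  # effects not in the model count as zero
    return total

def acdm_prob(alpha, q, beta):
    """ACDM keeps only the intercept and the main effects."""
    main_only = {S: b for S, b in beta.items() if len(S) <= 1}
    return gdina_prob(alpha, q, main_only)

q = (1, 1, 0)   # hypothetical item requiring attributes 1 and 2 only
beta = {(): 0.1, (0,): 0.3, (1,): 0.2, (0, 1): 0.3}   # illustrative values

for alpha in [(0, 0, 1), (1, 0, 1), (1, 1, 1)]:
    print(alpha, gdina_prob(alpha, q, beta), acdm_prob(alpha, q, beta))
```

Because the third attribute is not required by this item, mastering it never changes the probability, and no coefficient involving it is needed.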

The slow convergence of the likelihood ratio test. Consider the significance test of the null hypothesis $H_0 : Q = Q_0$, with the goal of checking the consistency between the observed data and the current model formulation with a prespecified $Q_0$. The asymptotic reference distributions of traditional goodness of fit tests, such as the Chi-square test and the likelihood ratio test, are derived based on the asymptotic normality of the multinomial distribution, an approximation that typically converges slowly when the number of cells is large. In particular, under the current setting with $2^J$ response patterns to the $J$ items, it is necessary to have a sample size growing exponentially fast in $J$ for the convergence to hold. In practice, the sample size is very often much smaller than required. The following example illustrates the slow convergence.


Figure 1. Q–Q plots of p-values for the likelihood ratio test. (a) $N = 10^4$; (b) $N = 10^5$.

Example 3. Consider a test of 10 items with 2 attributes given as follows:

$$Q_0^\top = \begin{pmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{pmatrix}_{2 \times 10}.$$

For different sample sizes $N$, data sets are generated under DINA with $s_j = g_j = 0.2$, $j = 1, \ldots, 10$, and $p_{\boldsymbol\alpha} = 1/4$ for all $\boldsymbol\alpha \in \{0,1\}^2$. The response pattern $\mathbf{R}$ follows a multinomial distribution with $2^{10} = 1024$ possible outcomes. Consider the likelihood ratio test under the true Q-matrix (see, e.g., Chapter 12, Rupp et al., 2010). The asymptotic reference distribution is $\chi^2$ with $2^{10} - 24$ degrees of freedom. However, the simulation study shows that the type I error is very different from the nominal level $\alpha = 0.05$ even when $N$ is as large as $10^4$. The proportion of rejections under the 0.05 significance level is 0.552 for $N = 10^4$ and 0.049 for $N = 10^5$. To further illustrate this, Fig. 1 shows the Q–Q plots of the p-values of the likelihood ratio test out of 1000 independent simulations against the uniform distribution, which is the distribution of the p-value under the null hypothesis. Generally speaking, the necessary sample size for the Chi-square distribution to be approximately valid grows exponentially with the number of items ($J$). This example shows that traditional test procedures that use the asymptotic Chi-square limiting distribution to test the Q-matrix may fail with the usual sample size.
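The degrees of freedom in Example 3 can be checked with a short calculation: the number of cells minus one, minus the free parameters (two per item plus $2^K - 1$ class proportions). The sketch below also enumerates the exact cell probabilities under the stated DINA setting and counts cells whose expected count falls below 5. The threshold of 5 is a common rule of thumb for Chi-square approximations, not something taken from the paper, and the variable names are illustrative.

```python
from itertools import product

J, K = 10, 2
slip = guess = 0.2
# Items 1, 3, 5, ... require attribute 1; items 2, 4, 6, ... require attribute 2.
Q0 = [(1, 0) if j % 2 == 0 else (0, 1) for j in range(J)]
profiles = list(product((0, 1), repeat=K))

def theta(q, alpha):
    """DINA positive-response probability for one item and one profile."""
    ideal = all(a >= r for a, r in zip(alpha, q))
    return 1 - slip if ideal else guess

def cell_prob(r):
    """P(R = r) when every attribute profile has probability 1/4."""
    total = 0.0
    for alpha in profiles:
        prob = 1.0
        for r_j, q in zip(r, Q0):
            t = theta(q, alpha)
            prob *= t if r_j == 1 else 1 - t
        total += prob / len(profiles)
    return total

n_cells = 2 ** J
n_free = 2 * J + (2 ** K - 1)   # item parameters plus free class proportions
df = n_cells - 1 - n_free       # equals 2^10 - 24

probs = [cell_prob(r) for r in product((0, 1), repeat=J)]
for N in (10**4, 10**5):
    sparse = sum(1 for pr in probs if N * pr < 5)
    print(N, df, min(probs) * N, sparse)
```

Under these assumptions a sizeable share of the 1024 cells is sparsely populated at $N = 10^4$, while none is at $N = 10^5$, which is in line with the rejection rates reported in the example.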

Remark 1. Note that the likelihood ratio test may still be used with the parametric bootstrap method to construct a reference distribution of the likelihood ratio test (e.g., Tollenaar and Mooijaart, 2003), especially with the help of modern parallel computation techniques. However, in certain non-standard problems, it is known that the usual bootstrap method might not be consistent (e.g., Sen et al., 2010; Sen and Xu, 2015). It would be of interest to further investigate the theoretical validity of the bootstrap method when testing the discrete Q-matrix.

3. Test Procedure

3.1. Useful Concepts

The test statistic is constructed based on the theoretical framework of the $T$-matrix that was proposed in Liu et al. (2012, 2013), Xu (2017) and Xu and Shang (2018) to study the identifiability of the CDMs. In order to define the test statistic, we introduce the following important quantities from Xu (2017) and Xu and Shang (2018), which are defined for general CDMs.

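$T$-matrix. The $T$-matrix $T(Q, \boldsymbol\theta)$ is a key quantity for constructing the test statistic, and it serves as a connection between the observed response distribution and the attribute distribution. Next we specify the row vectors of the $T$-matrix for a general diagnostic model. Rows of $T(Q, \boldsymbol\theta)$ are indexed by the $2^J$ possible response patterns $\mathbf{r} = (r_1, \ldots, r_J)^\top \in \{0,1\}^J$ and columns of $T(Q, \boldsymbol\theta)$ are indexed by the $2^K$ attribute profiles $\boldsymbol\alpha \in \{0,1\}^K$, while the $(\mathbf{r}, \boldsymbol\alpha)$th entry of $T(Q, \boldsymbol\theta)$, denoted by $t_{\mathbf{r},\boldsymbol\alpha}(Q, \boldsymbol\theta)$, represents the marginal probability that subjects with attribute profile $\boldsymbol\alpha$ provide positive responses to the set of items $\{j : r_j = 1\}$, namely

$$t_{\mathbf{r},\boldsymbol\alpha}(Q, \boldsymbol\theta) = P(\mathbf{R} \succeq \mathbf{r} \mid \boldsymbol\theta, \boldsymbol\alpha) = \prod_{j=1}^{J} \theta_{j,\boldsymbol\alpha}^{\,r_j},$$

where $\mathbf{R} \succeq \mathbf{r}$ means $R_j \ge r_j$ for every $j$. Denote the $\mathbf{r}$th row vector of the $T$-matrix by $T_{\mathbf{r},\cdot}(Q, \boldsymbol\theta)$. Let $\mathbf{e}_j$ denote the $J$-dimensional unit vector with the $j$th element being one and all the other elements zero; then any response pattern $\mathbf{r}$ can be written as a sum of some $\mathbf{e}$-vectors, namely $\mathbf{r} = \sum_{j : r_j = 1} \mathbf{e}_j$. The $\mathbf{r}$th element of the $2^J$-dimensional vector $T(Q, \boldsymbol\theta)\mathbf{p}$ is

$$\{T(Q, \boldsymbol\theta)\mathbf{p}\}_{\mathbf{r}} = T_{\mathbf{r},\cdot}(Q, \boldsymbol\theta)\mathbf{p} = \sum_{\boldsymbol\alpha \in \{0,1\}^K} t_{\mathbf{r},\boldsymbol\alpha}(Q, \boldsymbol\theta)\, p_{\boldsymbol\alpha} = P(\mathbf{R} \succeq \mathbf{r} \mid \boldsymbol\theta).$$

For each item $j$, one can write $P(R_j = 1 \mid Q, \mathbf{p}, \boldsymbol\theta) = \sum_{\boldsymbol\alpha} p_{\boldsymbol\alpha}\theta_{j,\boldsymbol\alpha} = T_{\mathbf{e}_j,\cdot}(Q, \boldsymbol\theta)\mathbf{p}$.

Similarly, for each pair of items $j_1$ and $j_2$, the probability of responding positively to both items is $P(R_{j_1} = 1, R_{j_2} = 1 \mid Q, \mathbf{p}, \boldsymbol\theta) = \sum_{\boldsymbol\alpha} p_{\boldsymbol\alpha}\theta_{j_1,\boldsymbol\alpha}\theta_{j_2,\boldsymbol\alpha} = T_{\mathbf{e}_{j_1}+\mathbf{e}_{j_2},\cdot}(Q, \boldsymbol\theta)\mathbf{p}$, where $T_{\mathbf{e}_{j_1}+\mathbf{e}_{j_2},\cdot}(Q, \boldsymbol\theta)$ is a $2^K$-dimensional row vector containing the probabilities $\theta_{j_1,\boldsymbol\alpha}\theta_{j_2,\boldsymbol\alpha}$ for all $\boldsymbol\alpha$. Each element of $T_{\mathbf{e}_{j_1}+\mathbf{e}_{j_2},\cdot}(Q, \boldsymbol\theta)$ is the product of the corresponding elements of $T_{\mathbf{e}_{j_1},\cdot}(Q, \boldsymbol\theta)$ and $T_{\mathbf{e}_{j_2},\cdot}(Q, \boldsymbol\theta)$. With an analogous construction, for any combination of distinct items $\{j_1, \ldots, j_l\}$, $P(R_{j_1} = 1, \ldots, R_{j_l} = 1 \mid Q, \mathbf{p}, \boldsymbol\theta) = T_{\mathbf{e}_{j_1}+\cdots+\mathbf{e}_{j_l},\cdot}(Q, \boldsymbol\theta)\mathbf{p}$. Similarly, $T_{\mathbf{e}_{j_1}+\cdots+\mathbf{e}_{j_l},\cdot}(Q, \boldsymbol\theta)$ is the element-wise product of $T_{\mathbf{e}_{j_1},\cdot}(Q, \boldsymbol\theta), \ldots, T_{\mathbf{e}_{j_l},\cdot}(Q, \boldsymbol\theta)$. Each row vector of the $T$-matrix is one of the vectors $T_{\mathbf{r},\cdot}(Q, \boldsymbol\theta)$ indexed by $\mathbf{r}$, for $\mathbf{r} \in \{0,1\}^J$. We order the rows of $T$ such that the top ones correspond to lower-order item combinations. From the above definition of the $T$-matrix, we have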

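$$T(Q, \boldsymbol\theta)\mathbf{p} = \begin{pmatrix} T_{\mathbf{e}_1,\cdot}(\boldsymbol\theta) \\ \vdots \\ T_{\mathbf{e}_J,\cdot}(\boldsymbol\theta) \\ T_{\mathbf{e}_1+\mathbf{e}_2,\cdot}(\boldsymbol\theta) \\ \vdots \end{pmatrix} \mathbf{p} = \begin{pmatrix} \sum_{\boldsymbol\alpha} p_{\boldsymbol\alpha}\theta_{1,\boldsymbol\alpha} \\ \vdots \\ \sum_{\boldsymbol\alpha} p_{\boldsymbol\alpha}\theta_{J,\boldsymbol\alpha} \\ \sum_{\boldsymbol\alpha} p_{\boldsymbol\alpha}\theta_{1,\boldsymbol\alpha}\theta_{2,\boldsymbol\alpha} \\ \vdots \end{pmatrix} = \begin{pmatrix} P(R_1 = 1 \mid Q, \boldsymbol\theta, \mathbf{p}) \\ \vdots \\ P(R_J = 1 \mid Q, \boldsymbol\theta, \mathbf{p}) \\ P(R_1 = 1, R_2 = 1 \mid Q, \boldsymbol\theta, \mathbf{p}) \\ \vdots \end{pmatrix}, \tag{3}$$

which is a vector containing the corresponding probabilities associated with the matrix $Q$ and a particular set of parameters $(\boldsymbol\theta, \mathbf{p})$.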

From the computational point of view, one only needs to construct the $T_{\mathbf{e}_j,\cdot}(Q, \boldsymbol\theta)$’s for each individual item $j$ and then take element-wise products to obtain the corresponding combinations. Conceptually, one may include all the combinations of items in the $T$-matrix, which results in a $T$-matrix of $2^J - 1$ rows. A $T$-matrix is called saturated by Liu et al. (2012) if it contains all the possible combinations of items. However, from a practical point of view, constructing a matrix of $2^J$ rows typically induces a substantial computational overhead. With this in mind, it is recommended that some of the high-order combinations not be included in the $T$-matrix (Liu et al., 2012, 2013). In the simulation study presented in Sect. 4, different combinations of items are studied, and it is observed that a $T$-matrix including the single items and the two-way combinations generally performs well empirically in testing the goodness of fit of a given Q-matrix. In what follows, let $n$ be the number of rows in $T(Q, \boldsymbol\theta)$. Then for the $T$-matrix with the single items and the two-way combinations, $n = J + \binom{J}{2} = \frac{J(J+1)}{2}$. The DINA model is used to illustrate the construction of the $T$-matrix.

Example 4. Consider a test concerning two attributes. The population is naturally divided into four latent classes following distribution $\mathbf{p} = (p_{00}, p_{10}, p_{01}, p_{11})^\top$. The test consists of two problems and admits the Q-matrix in (4). Also given in (4) is a $T$-matrix with $n = 3$, that is, $T$ is a $3 \times 4$ matrix. The four columns of $T(Q_0, \boldsymbol\theta)$ correspond to attribute profiles $(0,0)$, $(1,0)$, $(0,1)$, and $(1,1)$, respectively. The first two rows of $T(Q_0, \boldsymbol\theta)$ correspond to items one and two, and the third row corresponds to the combination of items one and two. Under the DINA model in Example 1, the $T$-matrix with $n = 3$ takes the form

$$Q_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad T(Q_0, \boldsymbol\theta) = \begin{pmatrix} g_1 & 1 - s_1 & g_1 & 1 - s_1 \\ g_2 & g_2 & 1 - s_2 & 1 - s_2 \\ g_1 g_2 & (1 - s_1) g_2 & g_1 (1 - s_2) & (1 - s_1)(1 - s_2) \end{pmatrix}. \tag{4}$$

$\boldsymbol\gamma$-vector. Define $\boldsymbol\gamma$ to be an $n \times 1$ vector containing the empirical probabilities corresponding to those in (3). That is,

$$\boldsymbol\gamma = \left( \frac{1}{N}\sum_{i=1}^{N} I(R_{i,1} = 1), \; \ldots, \; \frac{1}{N}\sum_{i=1}^{N} I(R_{i,J} = 1), \; \frac{1}{N}\sum_{i=1}^{N} I(R_{i,1} = 1, R_{i,2} = 1), \; \ldots \right)^\top.$$

For Example 4, $\boldsymbol\gamma = (N_1/N, N_2/N, N_{1\wedge 2}/N)^\top$, where $N_j$ is the total number of correct responses to item $j$ and $N_{1\wedge 2}$ is the number of subjects who respond correctly to both items 1 and 2.
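The construction of the $T$-matrix and the $\boldsymbol\gamma$-vector can be sketched as follows for a two-item DINA test with $Q_0$ equal to the identity matrix. The function names, the parameter values, and the five response vectors are made up for illustration. The sketch builds the single-item rows, obtains the pair row by element-wise multiplication, and computes the observed proportions in the same row order.

```python
from itertools import combinations

def dina_theta(Q, slip, guess, profiles):
    """theta[j][a]: DINA probability of a positive response to item j
    for the a-th attribute profile."""
    theta = []
    for q, s, g in zip(Q, slip, guess):
        row = []
        for alpha in profiles:
            ideal = all(a >= r for a, r in zip(alpha, q))
            row.append(1 - s if ideal else g)
        theta.append(row)
    return theta

def two_way_T(theta):
    """Rows for single items, then element-wise products for item pairs."""
    rows = [list(row) for row in theta]
    for j1, j2 in combinations(range(len(theta)), 2):
        rows.append([x * y for x, y in zip(theta[j1], theta[j2])])
    return rows

def gamma_vector(responses):
    """Observed proportions in the same row order as two_way_T."""
    N, J = len(responses), len(responses[0])
    gamma = [sum(r[j] for r in responses) / N for j in range(J)]
    for j1, j2 in combinations(range(J), 2):
        gamma.append(sum(r[j1] * r[j2] for r in responses) / N)
    return gamma

Q0 = [(1, 0), (0, 1)]
profiles = [(0, 0), (1, 0), (0, 1), (1, 1)]   # column order (0,0), (1,0), (0,1), (1,1)
slip, guess = [0.1, 0.2], [0.2, 0.3]           # illustrative item parameters
p = [0.25, 0.25, 0.25, 0.25]                   # illustrative class proportions

T = two_way_T(dina_theta(Q0, slip, guess, profiles))
expected = [sum(t * w for t, w in zip(row, p)) for row in T]

responses = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0)]   # made-up data, N = 5
gamma = gamma_vector(responses)
residual = [gm - e for gm, e in zip(gamma, expected)]
print(T, gamma, residual)
```

The vector `residual` is the difference between observed and model-implied margins on which the test statistic of Sect. 3 is built.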

3.2. An Example of Ideal Case

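The test statistic is constructed based on the $T$-matrix and $\boldsymbol\gamma$-vector. To better illustrate the idea, we first consider the ideal case that the true model parameters $\boldsymbol\theta_0$ and $\mathbf{p}_0$ are known. By the law of large numbers, if $Q_0$ is the true Q-matrix, the response vector $\boldsymbol\gamma$ converges almost surely as $N \to \infty$:

$$\boldsymbol\gamma = \begin{pmatrix} \frac{1}{N}\sum_{i=1}^{N} I(R_{i,1} = 1) \\ \vdots \\ \frac{1}{N}\sum_{i=1}^{N} I(R_{i,J} = 1) \\ \frac{1}{N}\sum_{i=1}^{N} I(R_{i,1} = 1, R_{i,2} = 1) \\ \vdots \end{pmatrix} \xrightarrow{\;H_0\;} \begin{pmatrix} P(R_1 = 1 \mid Q_0, \boldsymbol\theta_0, \mathbf{p}_0) \\ \vdots \\ P(R_J = 1 \mid Q_0, \boldsymbol\theta_0, \mathbf{p}_0) \\ P(R_1 = 1, R_2 = 1 \mid Q_0, \boldsymbol\theta_0, \mathbf{p}_0) \\ \vdots \end{pmatrix} = T(Q_0, \boldsymbol\theta_0)\,\mathbf{p}_0.$$

That is, if $Q_0$ is correct, $\boldsymbol\gamma - T(Q_0, \boldsymbol\theta_0)\,\mathbf{p}_0 \to \mathbf{0}$ almost surely as $N \to \infty$.

On the other hand, if $Q_0$ is a misspecified Q-matrix, it is expected that $|\boldsymbol\gamma - T(Q_0, \boldsymbol\theta_0)\,\mathbf{p}_0|$ is bounded away from 0 (Liu et al., 2013; Xu, 2017), where $|\cdot|$ is the Euclidean norm. The difference between the observed response vector $\boldsymbol\gamma$ and the expected response vector $T(Q_0, \boldsymbol\theta_0)\,\mathbf{p}_0$ under the null hypothesis then provides a measure of the goodness of fit between $Q_0$ and the observed response data.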

Based on the above observations, under the ideal case with known $\boldsymbol\theta_0$ and $\mathbf{p}_0$, the test statistic is taken as $S(Q_0) = |T(Q_0, \boldsymbol\theta_0)\mathbf{p}_0 - \boldsymbol\gamma|^2$. It is the squared $L_2$ distance between the observed data frequency and that suggested by $Q_0$. To construct the critical region for rejecting the null hypothesis, one needs to compute the asymptotic distribution of $S(Q_0)$ under the null hypothesis $H_0 : Q = Q_0$. A direct application of the central limit theorem yields that if $Q_0$ is correct,

$$\sqrt{N}\,\bigl(\boldsymbol\gamma - T(Q_0, \boldsymbol\theta_0)\mathbf{p}_0\bigr) \xrightarrow{\;d\;} \mathcal{N}(\mathbf{0}, \Sigma_\gamma), \tag{5}$$

as $N \to \infty$, where $\xrightarrow{d}$ denotes convergence in distribution, and $\Sigma_\gamma = (\sigma_{i,j})_{n \times n}$ is an $n \times n$ covariance matrix specified as follows. Suppose that the $i$th element of $\boldsymbol\gamma$ corresponds to the combination of items $(i_1, \ldots, i_k)$ and the $j$th element of $\boldsymbol\gamma$ corresponds to items $(j_1, \ldots, j_l)$. Denote the sets $\{i_1, \ldots, i_k\}$ and $\{j_1, \ldots, j_l\}$ by $S_i$ and $S_j$, respectively. Then the element $\sigma_{i,j}$ in the $i$th row and $j$th column of $\Sigma_\gamma$ is given by

$$\sigma_{i,j} = T_{\sum_{h \in S_i \cup S_j} \mathbf{e}_h,\cdot}(Q_0, \boldsymbol\theta_0)\mathbf{p}_0 - \Bigl\{T_{\sum_{h \in S_i} \mathbf{e}_h,\cdot}(Q_0, \boldsymbol\theta_0)\mathbf{p}_0\Bigr\} \cdot \Bigl\{T_{\sum_{h \in S_j} \mathbf{e}_h,\cdot}(Q_0, \boldsymbol\theta_0)\mathbf{p}_0\Bigr\}.$$

By definition, $T_{\sum_{h \in S_i} \mathbf{e}_h,\cdot}(Q_0, \boldsymbol\theta_0)\mathbf{p}_0$ is the probability of providing correct answers to all items in the set $S_i = \{i_1, \ldots, i_k\}$, $T_{\sum_{h \in S_j} \mathbf{e}_h,\cdot}(Q_0, \boldsymbol\theta_0)\mathbf{p}_0$ is that for the set $S_j$, and $T_{\sum_{h \in S_i \cup S_j} \mathbf{e}_h,\cdot}(Q_0, \boldsymbol\theta_0)\mathbf{p}_0$ is that for the set $S_i \cup S_j$. Thus, the asymptotic distribution is obtained as

$$N \cdot S(Q_0) = N\,|\boldsymbol\gamma - T(Q_0, \boldsymbol\theta_0)\mathbf{p}_0|^2 \xrightarrow{\;d\;} \sum_{l=1}^{n} \lambda_l Z_l^2, \tag{6}$$

where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $\Sigma_\gamma$ and $Z_1, \ldots, Z_n$ are i.i.d. standard normal random variables. The test procedure is constructed based on the above asymptotic distribution, and the null hypothesis is rejected if $S(Q_0)$ is large enough. Specifically, one rejects $H_0 : Q = Q_0$ if $N \cdot S(Q_0) > v_\alpha$, where $v_\alpha$ is the $(1 - \alpha)$th quantile of the limiting distribution given by the right-hand side of (6).

3.3. The Test Statistic

In practice, the true parameters $(\boldsymbol\theta_0, \mathbf{p}_0)$ are unknown. To construct a valid test statistic that successfully controls the type I error, we propose to replace the unknown parameters with the corresponding maximum likelihood estimator (MLE) $(\hat{\boldsymbol\theta}, \hat{\mathbf{p}})$ derived under the null hypothesis $H_0 : Q = Q_0$. Thus, the test statistic $S(Q_0)$ becomes

$$S_{\hat{\boldsymbol\theta},\hat{\mathbf{p}}}(Q_0) = \bigl|T(Q_0, \hat{\boldsymbol\theta})\hat{\mathbf{p}} - \boldsymbol\gamma\bigr|^2. \tag{7}$$

Note that the asymptotic distribution of $S_{\hat{\boldsymbol\theta},\hat{\mathbf{p}}}(Q_0)$ is different from that of $S(Q_0)$ in Sect. 3.2 due to the additional estimation uncertainty from the MLE $(\hat{\boldsymbol\theta}, \hat{\mathbf{p}})$. We next derive the asymptotic null distribution of $S_{\hat{\boldsymbol\theta},\hat{\mathbf{p}}}(Q_0)$ under $H_0$, i.e., when $Q_0$ is the true Q-matrix.

In Eq. (7), the estimators $\hat{\boldsymbol\theta}$ and $\hat{\mathbf{p}}$ appear in the construction of the test statistic only through the term $T(Q_0, \hat{\boldsymbol\theta})\hat{\mathbf{p}}$. To establish the asymptotic null distribution of $S_{\hat{\boldsymbol\theta},\hat{\mathbf{p}}}(Q_0)$, we need to have identifiability of $T(Q_0, \hat{\boldsymbol\theta})\hat{\mathbf{p}}$ such that it is estimable under $Q_0$. Note that the identifiability of $T(Q_0, \hat{\boldsymbol\theta})\hat{\mathbf{p}}$ is a weaker requirement than that of all parameters $(\hat{\boldsymbol\theta}, \hat{\mathbf{p}})$. We next discuss the identifiability issue in two cases, following the discussions in Gu and Xu (2018). The first case concerns the Two-Parameter CDMs, for which each item is associated with exactly two item parameters. By definition, the Two-Parameter family includes the DINA and DINO models. The second case is for the Multi-Parameter CDMs, where each item is allowed to have multiple item parameters. This Multi-Parameter family includes the main-effect CDMs such as the ACDM and the interaction-effect CDMs such as the GDINA (see Example 2). The identifiability results regarding these two families of CDMs are summarized as follows.

Two-Parameter CDMs. We first introduce some terminology. Under the Two-Parameter CDMs, two attribute profiles are called Q-equivalent if they have identical ideal responses to all the items under the Q-matrix. For example, under the DINA model, if the Q-matrix lacks a single-attribute item measuring the $k$th attribute, then the two attribute profiles $\boldsymbol\alpha = \mathbf{0}$ and $\boldsymbol\alpha' = \mathbf{e}_k$ are Q-equivalent since $\xi^{DINA}_{j,\boldsymbol\alpha}(Q) = \xi^{DINA}_{j,\boldsymbol\alpha'}(Q)$ for all $j$. Based on this equivalence relation induced by a Q-matrix, the set of all attribute profiles, i.e., $\{0,1\}^K$, can be partitioned into disjoint equivalence classes $\mathcal{A}_1, \ldots, \mathcal{A}_C$, where attribute profiles in the same $\mathcal{A}_i$ have identical ideal responses while those in different classes have different ideal responses. We further denote $\nu_{\mathcal{A}_i} := \sum_{\boldsymbol\alpha \in \mathcal{A}_i} p_{\boldsymbol\alpha}$ for each $i = 1, \ldots, C$, and $\boldsymbol\nu_Q := (\nu_{\mathcal{A}_1}, \ldots, \nu_{\mathcal{A}_C})$. The equivalence classes were introduced in Zhang et al. (2013) and Gu and Xu (2018) to study the identifiability of the Two-Parameter CDMs when the Q-matrix is incomplete. Here a Q-matrix is called complete if it can differentiate all latent attribute profiles, and incomplete otherwise (Chiu et al., 2009). Note that for a complete Q-matrix, each attribute profile forms its own equivalence class and $\boldsymbol\nu_Q = \mathbf{p}$; on the other hand, for an incomplete Q-matrix, $\boldsymbol\nu_Q \neq \mathbf{p}$. Gu and Xu (2018) showed that attribute profiles in the same equivalence class are not distinguishable under the Two-Parameter model, and the parameters that one is able to identify are $\boldsymbol\theta$ and $\boldsymbol\nu_Q$.

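To study the asymptotic null distribution of $S_{\hat{\boldsymbol\theta},\hat{\mathbf{p}}}(Q_0)$, the next proposition shows that we only need to focus on estimating the parameters $(\boldsymbol\theta, \boldsymbol\nu_Q)$, instead of $(\boldsymbol\theta, \mathbf{p})$, when the Q-matrix is incomplete. Its proof is given in Section A.3 of the Supplementary Material.

Proposition 1. For any $(\boldsymbol\theta, \mathbf{p})$, $T(Q_0, \boldsymbol\theta)\mathbf{p}$ depends on $\mathbf{p}$ only through $\boldsymbol\nu_Q$. Specifically, $T(Q, \boldsymbol\theta)\mathbf{p} = T^{eq}(Q, \boldsymbol\theta)\boldsymbol\nu_Q$, where $T^{eq}(Q, \boldsymbol\theta)$ is a sub-matrix of $T(Q, \boldsymbol\theta)$ with the form specified in Supplementary Material A.3.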

To ensure the identifiability and consistent estimation of (θ, νQ), Gu and Xu (2018) proposedeasily checkable conditions on the form of the Q-matrix, which are restated in Section A.4 of theSupplementary Material. The conditions do not require the completeness of the Q-matrix and canbe applied to various practical designs. For instance, the following Q-matrices are incompleteunder Two-Parameter CDMs, but they give identifiable (θ, νQ):

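$$
Q_1 = \begin{pmatrix} 1&1&1\\ 1&0&1\\ 0&1&1\\ 1&1&0\\ 1&0&1\\ 0&1&1 \end{pmatrix}, \quad
Q_2 = \begin{pmatrix} 1&1&0\\ 1&0&1\\ 0&1&1\\ 1&1&0\\ 1&0&1\\ 0&1&1 \end{pmatrix}, \quad
Q_3 = \begin{pmatrix} 1&0&0\\ 1&0&1\\ 0&1&1\\ 1&1&0\\ 1&0&1\\ 0&1&1 \end{pmatrix}.
$$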
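To make the equivalence classes concrete, the following minimal sketch enumerates them for the matrices displayed above. It assumes the DINA model's conjunctive ideal response (an item is answered ideally only when every required attribute is mastered); the function names are illustrative and not taken from the paper.

```python
from itertools import product

def dina_ideal_response(q_row, alpha):
    """DINA ideal response: 1 if alpha masters every attribute the item requires."""
    return int(all(a >= q for q, a in zip(q_row, alpha)))

def equivalence_classes(Q):
    """Group all attribute profiles in {0,1}^K by their ideal response vectors."""
    K = len(Q[0])
    classes = {}
    for alpha in product((0, 1), repeat=K):
        key = tuple(dina_ideal_response(row, alpha) for row in Q)
        classes.setdefault(key, []).append(alpha)
    return list(classes.values())

# Q1 and Q3 as read from the display above; the identity matrix is a complete Q-matrix.
Q1 = [(1, 1, 1), (1, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1)]
Q3 = [(1, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1)]
I3 = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

for name, Q in [("Q1", Q1), ("Q3", Q3), ("I3", I3)]:
    print(name, "has", len(equivalence_classes(Q)), "equivalence classes")
```

Under this assumption, Q1 yields C = 5 classes and Q3 yields C = 6, both fewer than the 2^3 = 8 profiles, whereas the identity matrix separates all 8 profiles.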

Multi-parameter CDMs. As for the main-effect models such as the ACDM or interaction-effect models such as the GDINA, Gu and Xu (2018) also proposed mild conditions on the form of the Q-matrix that lead to generic identifiability, which ensures that the model parameters $(\theta, p)$ are identifiable almost everywhere in the restricted parameter space except a Lebesgue measure zero set. In particular, from their Theorem 7, if a Q-matrix takes the following form up to row permutations

$$
Q = \begin{pmatrix} Q_1 \\ Q_2 \\ Q' \end{pmatrix}; \qquad
Q_i = \begin{pmatrix}
1 & * & \cdots & * \\
* & 1 & \cdots & * \\
\vdots & \vdots & \ddots & \vdots \\
* & * & \cdots & 1
\end{pmatrix}_{K \times K}, \quad i = 1, 2, \tag{8}
$$

where each "∗" can be either 1 or 0 and each attribute is required by at least one item in $Q'$, then the main-effect and interaction-effect CDMs are generically identifiable and, with the true parameters ranging almost everywhere in the parameter space, the MLE $(\hat\theta, \hat p)$ is statistically consistent.

Based on the above discussions, we assume the following regularity condition:


C1. Under the null hypothesis that $Q_0$ is the true Q-matrix, the corresponding MLE $(\hat\theta, \hat\nu)$ (under Two-Parameter CDMs) or $(\hat\theta, \hat p)$ (under Multi-Parameter CDMs) is consistent and asymptotically normal, with covariance matrix being the inverse of the Fisher information matrix evaluated at the true parameters $(\theta_0, \nu_0)$ or $(\theta_0, p_0)$.

The consistency of the MLE in C1 is satisfied if $Q_0$ satisfies the above identifiability conditions from Gu and Xu (2018). For instance, the Q-matrix of the widely analyzed fraction subtraction data, which is presented in Table 7 in Sect. 5, satisfies the identifiability conditions in Gu and Xu (2018) under the DINA, DINO, main-effect (such as ACDM), and interaction-effect models (such as GDINA). In addition, if the Fisher information matrix evaluated at the true model parameters is non-singular, the asymptotic normality of the MLE would also hold (e.g., Van der Vaart, 2000). Such an assumption is usually required explicitly or implicitly in practice when making inference on the estimated model parameters.

Under the null hypothesis, the asymptotic distribution of $S_{\hat\theta,\hat p}(Q_0)$ is given in the following theorem.

Theorem 1. Under the null hypothesis $H_0 : Q = Q_0$ and Condition C1, as $N \to \infty$, the test statistic in (7) has the limiting distribution

$$
N S_{\hat\theta,\hat p}(Q_0) \xrightarrow{d} \sum_{l=1}^{n} \lambda_l Z_l^2,
$$

where $n$ is the number of rows of the T-matrix, $Z_1, \ldots, Z_n$ are i.i.d. standard normal random variables, and $\lambda_1 \ge \cdots \ge \lambda_n$ are the eigenvalues of an $n \times n$ matrix $\Sigma$ taking the form $\Sigma = A \Sigma_\beta A^\top$, where the matrix $\Sigma_\beta$ is the covariance of $\beta$ with $\beta = (I(R = r),\ r \in \{0, 1\}^J \setminus \{\mathbf{0}\})^\top$, and the matrix $A$ is given in equation (A.5) in the Supplementary Material.
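As a quick sanity check on the scale of this limit (a standard fact about weighted sums of independent chi-square variables, not a result stated in the paper), the first two moments follow from $E[Z_l^2] = 1$ and $\mathrm{Var}(Z_l^2) = 2$:

$$
E\Big[\sum_{l=1}^{n} \lambda_l Z_l^2\Big] = \sum_{l=1}^{n} \lambda_l = \mathrm{tr}(\Sigma),
\qquad
\mathrm{Var}\Big[\sum_{l=1}^{n} \lambda_l Z_l^2\Big] = 2\sum_{l=1}^{n} \lambda_l^2 = 2\,\mathrm{tr}(\Sigma^2).
$$

Here $\Sigma$ denotes the symmetric matrix of the theorem, so its trace and the trace of its square equal the sum of its eigenvalues and of their squares, respectively.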

For the test statistic in (7), the computation of $(\hat\theta, \hat p)$ (or $(\hat\theta, \hat\nu)$) can be done efficiently via the EM algorithm (Dempster et al., 1977). The covariance matrix $\Sigma$ depends on the unknown parameters $(\theta_0, p_0)$ (or $(\theta_0, \nu_0)$). In applications, $(\theta_0, p_0)$ (or $(\theta_0, \nu_0)$) is replaced by the MLE based on $Q_0$. For a specific CDM, $A$ and $\Sigma_\beta$ depend on the probabilities $\theta_{j,\alpha} = P(R_j = 1 \mid Q, \alpha, \theta)$. Section A.5 of the Supplementary Material gives a detailed derivation of these quantities for the GDINA model. The proposed test procedure is summarized as follows.

Test Procedure. We perform the following steps:

1. Compute the MLE $(\hat\theta, \hat p)$ under the null hypothesis $H_0 : Q = Q_0$. Calculate the test statistic $S_{\hat\theta,\hat p}(Q_0) = \|T(Q_0, \hat\theta)\hat p - \gamma\|^2$.

2. Compute $v_\alpha$, the $(1 - \alpha)$ quantile of the asymptotic distribution in Theorem 1. Since Theorem 1 gives the limit of $N S_{\hat\theta,\hat p}(Q_0)$, reject the null hypothesis $H_0 : Q = Q_0$ if $N S_{\hat\theta,\hat p}(Q_0) > v_\alpha$; otherwise accept $H_0$.
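Step 2 requires tail probabilities of a weighted sum of chi-square variables, which has no simple closed form. One way to approximate the p-value is by Monte Carlo, as in the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the eigenvalues are taken as given inputs (in practice they would come from the estimated matrix in Theorem 1), the statistic passed in is assumed to be already scaled by N, and the function name is hypothetical.

```python
import random

def weighted_chisq_pvalue(stat, eigenvalues, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(sum_l lambda_l * Z_l^2 > stat), Z_l i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        draw = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in eigenvalues)
        if draw > stat:
            exceed += 1
    return exceed / n_draws

# Hypothetical eigenvalues and a hypothetical scaled statistic N * S.
p_value = weighted_chisq_pvalue(stat=4.2, eigenvalues=[1.5, 0.8, 0.3])
print("approximate p-value:", p_value)
```

Rejecting when the estimated p-value falls below the significance level is equivalent, up to Monte Carlo error, to comparing the scaled statistic with the quantile of the limiting distribution.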

Remark 2. The test procedure can also be applied to the case when the distribution of $\alpha$ is unbalanced (the attributes are highly correlated) or even when some $\alpha$ categories are empty with $p_\alpha = 0$. When some categories of $\alpha$ are empty, unconstrained estimators would fulfill assumption C1 and lead to valid inference based on our theorem. Without knowing the distribution of $\alpha$, the asymptotic distribution in Theorem 1 incorporates the estimation error of $\hat p_\alpha$ into the final limiting distribution. In particular, note that the $\lambda$'s in (6) are eigenvalues of $\Sigma_\gamma$ while those in Theorem 1 are eigenvalues of a more complicated matrix $\Sigma$, and the difference between the $\lambda$'s in (6) and those in Theorem 1 is due to the estimation uncertainty from $(\hat\theta, \hat p)$ (or $(\hat\theta, \hat\nu)$).

More generally, one can introduce a nonnegative definite weight matrix $W$ and use $S_{\hat\theta,\hat p,W}(Q_0) = (T(Q_0, \hat\theta)\hat p - \gamma)^\top W (T(Q_0, \hat\theta)\hat p - \gamma)$ as the test statistic. When $W$ is the identity matrix, $S_{\hat\theta,\hat p,W}(Q_0)$ becomes (7). The following corollary gives the asymptotic distribution of $S_{\hat\theta,\hat p,W}(Q_0)$ under the null hypothesis.


Corollary 1. Under the null hypothesis $H_0 : Q = Q_0$ and Condition C1, as $N \to \infty$, $N S_{\hat\theta,\hat p,W}(Q_0) \xrightarrow{d} \sum_{l=1}^{n} \lambda_l Z_l^2$, where $Z_1, \ldots, Z_n$ are i.i.d. standard normal random variables, and $\lambda_1 \ge \cdots \ge \lambda_n$ are the eigenvalues of an $n \times n$ matrix $\Sigma_W$ taking the form $\Sigma_W = W^{-1/2} A \Sigma_\beta A^\top W^{-1/2}$.

Remark 3. This paper focuses on the Euclidean distance of the residuals (Theorem 1), which can more directly provide local misfit information when the null hypothesis is rejected. In particular, when the provisional Q-matrix is rejected (i.e., $Q_0$ is not consistent with the observed responses), we may use the test statistic as an objective function to estimate the misspecification of the Q-matrix, which corresponds to the Q-matrix estimation procedure in Liu et al. (2012, 2013). Consistency of such a procedure was studied in Liu et al. (2013). When $W$ is taken as the (generalized) inverse of the $\Sigma$ matrix in Theorem 1, the test statistic becomes similar to the limited-information statistics in the literature (Maydeu-Olivares and Joe, 2005; Cai et al., 2006). It is of interest to study which $W$ gives higher testing power when the null hypothesis $Q_0$ is incorrect; however, the choice of the optimal $W$ depends on the underlying true Q-matrix and the specific diagnosis model assumption. Therefore, similarly to many composite hypothesis testing problems in statistics, the uniformly most powerful test might be difficult to obtain (Van der Vaart, 2000; Van der Vaart, 2006).

4. Simulations

The performance of the test procedure is illustrated through three simulation studies.

Study 1: DINA, ACDM, and GDINA under balanced attributes

We first consider the Q-matrices Q11, Q21 and Q31 in (9) with J = 20 items and K = 3, 4, 5 attributes. The attribute profiles are generated from a discrete uniform distribution $p_\alpha = 2^{-K}$ for any $\alpha \in \{0, 1\}^K$. Sample sizes of N = 500, 1000 and 2000 are considered in the simulation.

$$
Q_{11} = \begin{pmatrix}
1&0&0\\ 0&1&0\\ 0&0&1\\ 1&1&0\\ 1&0&1\\
0&1&1\\ 1&0&0\\ 0&1&0\\ 0&0&1\\ 1&1&1\\
1&0&0\\ 0&1&0\\ 0&0&1\\ 1&0&0\\ 0&1&0\\
0&0&1\\ 1&1&0\\ 0&1&1\\ 1&0&1\\ 1&1&1
\end{pmatrix}, \quad
Q_{21} = \begin{pmatrix}
1&0&0&0\\ 0&1&0&0\\ 0&0&1&0\\ 0&0&0&1\\ 1&1&0&0\\
1&0&1&0\\ 1&0&0&1\\ 0&1&1&0\\ 0&1&0&1\\ 0&0&1&1\\
1&0&0&0\\ 0&1&0&0\\ 0&0&1&0\\ 0&0&0&1\\ 1&1&1&0\\
1&1&0&1\\ 1&0&1&1\\ 0&1&1&1\\ 1&1&1&1\\ 1&1&1&1
\end{pmatrix}, \quad
Q_{31} = \begin{pmatrix}
1&0&0&0&0\\ 0&1&0&0&0\\ 0&0&1&0&0\\ 0&0&0&1&0\\ 0&0&0&0&1\\
1&1&0&0&0\\ 1&0&1&0&0\\ 0&1&0&1&0\\ 0&0&1&0&1\\ 0&0&0&1&1\\
1&0&0&0&0\\ 0&1&0&0&0\\ 0&0&1&0&0\\ 0&0&0&1&0\\ 0&0&0&0&1\\
0&1&1&0&0\\ 1&0&0&1&0\\ 1&0&0&0&1\\ 0&1&0&0&1\\ 0&0&1&1&0
\end{pmatrix}. \tag{9}
$$


Table 1. Type I error studies: proportions of rejections for testing true Q-matrices.

          Q11 (K = 3)              Q21 (K = 4)              Q31 (K = 5)
N       DINA   ACDM   GDINA     DINA   ACDM   GDINA     DINA   ACDM   GDINA
500     0.072  0.042  0.054     0.036  0.054  0.054     0.062  0.048  0.044
1000    0.058  0.040  0.056     0.056  0.058  0.062     0.046  0.060  0.060
2000    0.054  0.062  0.052     0.032  0.048  0.046     0.062  0.050  0.042

Three CDMs are considered: the DINA model, the additive CDM (ACDM) and the GDINA model. For the DINA model, the slipping parameters and the guessing parameters are set to $s_j = g_j = 0.2$ for $j = 1, \ldots, 20$. For the ACDM and the saturated GDINA model, data are generated following the setting in de la Torre and Chiu (2016). In particular, for the ACDM, the baseline probability is $p_0 = 0.2$ and the probability for possessing all attributes is $p_1 = 0.8$. The probability of a correct response increases by $(p_1 - p_0)/K_j^*$ for mastering each required attribute, where $K_j^*$ is the total number of required attributes of item $j$. For the GDINA model, similarly, the baseline probability is $p_0 = 0.2$ and the probability for possessing all attributes is $p_1 = 0.8$. The main effects and interaction effects are taken to be equal. 500 independent data sets are generated under each of the above settings, and we use the proposed testing procedure to test whether the true Q-matrices are accepted. The T-matrix includes all single items and all two-way combinations of items. The simulation results given in Table 1 show that the asymptotic distribution approximates the "true" reference distribution well for these models and that the type I errors are close to the prespecified significance level of 5%.
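The item response probabilities described in this setting can be written down directly. The sketch below is a minimal illustration with hypothetical function names; it covers only the DINA and ACDM settings stated above (the GDINA parameterization is not spelled out in the text and is omitted).

```python
def dina_prob(q_row, alpha, slip=0.2, guess=0.2):
    """DINA: 1 - slip if all required attributes are mastered, otherwise guess."""
    mastered_all = all(a >= q for q, a in zip(q_row, alpha))
    return 1.0 - slip if mastered_all else guess

def acdm_prob(q_row, alpha, p0=0.2, p1=0.8):
    """ACDM setting above: start at p0 and add (p1 - p0) / K_j* per mastered required attribute."""
    k_star = sum(q_row)  # number of attributes the item requires
    mastered = sum(q * a for q, a in zip(q_row, alpha))
    return p0 + (p1 - p0) * mastered / k_star

# An item requiring attributes 1 and 2, answered by a subject who masters attributes 1 and 3.
print(dina_prob((1, 1, 0), (1, 0, 1)))
print(acdm_prob((1, 1, 0), (1, 0, 1)))
```

For this item and subject, the DINA probability stays at the guessing level because one required attribute is missing, while the ACDM probability sits halfway between the baseline and the full-mastery probability.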

In addition, we present the Q–Q plots of the p-values against the uniform distribution. It is known that under the null hypothesis (i.e., the provisional Q-matrix is correct), the p-value from the "true" distribution of the test statistic should follow the uniform distribution (e.g., Lehmann and Romano, 2006). The Q–Q plots of the p-values from the simulation study are presented in Fig. 2 for N = 500, corresponding to the first row of Table 1. The Q–Q plots show that the p-values are approximately uniformly distributed and therefore further illustrate the good approximation of the asymptotic distribution in Theorem 1 to the "true" distribution.
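The coordinates of such a Q–Q plot are simple to compute. The sketch below pairs the sorted p-values with uniform quantiles; the plotting positions (i − 0.5)/n are an assumption, since the paper does not state which convention was used, and the function name is hypothetical.

```python
def uniform_qq_points(p_values):
    """Pair each sorted p-value with the uniform quantile (i - 0.5) / n, i = 1, ..., n."""
    n = len(p_values)
    ordered = sorted(p_values)
    return [((i + 0.5) / n, p) for i, p in enumerate(ordered)]

# Hypothetical p-values from five replications.
for theoretical, observed in uniform_qq_points([0.62, 0.08, 0.91, 0.35, 0.47]):
    print(round(theoretical, 2), observed)
```

Points lying close to the 45-degree line indicate agreement with the uniform distribution; points piled up near zero on the observed axis indicate frequent rejection.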

We further check the power of the test procedure. Consider the case when only one entry is misspecified, which is the "most difficult" situation. Data sets are generated from the matrices Q11, Q21, Q31 under the same settings as in Table 1. The proposed method is applied to test the misspecified Q-matrices Q12, Q22, and Q32, which are constructed from Q11, Q21 and Q31, respectively, by altering one entry. In particular, for Q12, we alter the fourth item to be (1 0 0) and all other 19 items have the same row vectors as those in Q11; for Q22, we alter the fifth item to be (1 0 0 0) and all others are the same as in Q21; for Q32, we set the sixth item to be (1 0 0 0 0) and all others are the same as in Q31. They are given as follows:

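$$
Q_{12} = \begin{pmatrix}
1&0&0\\ 0&1&0\\ 0&0&1\\ 1&0&0\\ \vdots&\vdots&\vdots
\end{pmatrix}, \quad
Q_{22} = \begin{pmatrix}
1&0&0&0\\ 0&1&0&0\\ 0&0&1&0\\ 0&0&0&1\\ 1&0&0&0\\ \vdots&\vdots&\vdots&\vdots
\end{pmatrix}, \quad
Q_{32} = \begin{pmatrix}
1&0&0&0&0\\ 0&1&0&0&0\\ 0&0&1&0&0\\ 0&0&0&1&0\\ 0&0&0&0&1\\ 1&0&0&0&0\\ \vdots&\vdots&\vdots&\vdots&\vdots
\end{pmatrix}. \tag{10}
$$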

Figure 2. Q–Q plots of p-values for testing true Q-matrices with N = 500. (a) DINA: Q11, K = 3 (b) DINA: Q21, K = 4 (c) DINA: Q31, K = 5 (d) ACDM: Q11, K = 3 (e) ACDM: Q21, K = 4 (f) ACDM: Q31, K = 5 (g) GDINA: Q11, K = 3 (h) GDINA: Q21, K = 4 (i) GDINA: Q31, K = 5.

Based on 500 independent simulations, proportions of rejections are given in Table 2 for different sample sizes. Generally the power decreases as K increases, but for K as large as 5 and sample size as small as 500, the testing procedure still has power around 70% for the ACDM and GDINA models, and 80% for the DINA model. When the sample size is as large as 2000, the testing procedure achieves 100% power for all the models and all the considered Q-matrices.

The Q–Q plots of the p-values against the uniform distribution are presented in Fig. 3 for N = 500, corresponding to the first row of Table 2. When the null hypothesis is incorrect (i.e., the provisional Q-matrix is misspecified), a test with high power is expected to produce a right-skewed distribution of p-values, concentrated near 0. We can see from the Q–Q plots that even for a relatively small sample size N = 500, in each of the simulation settings considered, most p-values are very close to 0 and thus most misspecified Q-matrices are successfully rejected by the proposed procedure.

Moreover, when K = 5, to further study the impact of the maximal number of attributes required by the Q-matrix, the last five rows of Q31 are changed to require four attributes. Similar results are observed and are therefore not presented here.

Figure 3. Q–Q plots of p-values for testing misspecified Q-matrices with N = 500. (a) DINA: Q12, K = 3 (b) DINA: Q22, K = 4 (c) DINA: Q32, K = 5 (d) ACDM: Q12, K = 3 (e) ACDM: Q22, K = 4 (f) ACDM: Q32, K = 5 (g) GDINA: Q12, K = 3 (h) GDINA: Q22, K = 4 (i) GDINA: Q32, K = 5.

Table 2. Power studies: proportions of rejections for testing misspecified Q-matrices.

          Q12 (K = 3)              Q22 (K = 4)              Q32 (K = 5)
N       DINA   ACDM   GDINA     DINA   ACDM   GDINA     DINA   ACDM   GDINA
500     0.990  0.964  0.972     0.856  0.774  0.816     0.806  0.682  0.746
1000    1.000  1.000  1.000     0.996  0.990  0.992     0.990  0.992  0.990
2000    1.000  1.000  1.000     1.000  1.000  1.000     1.000  1.000  1.000

We further study the effect of the highest combination order of items of the T-matrix, which is denoted by D. We consider the DINA model with the first 10 items of Q11, Q21 and Q31 and K = 3, 4, 5 attributes. The new Q-matrices considered here are denoted by Q10, Q20 and Q30. We study the cases when D ranges from 2 to 10. More specifically, D = 2 means that only single items and two-way combinations of items are included in constructing the test statistic and computing the asymptotic distribution, and D = 10 means that item combinations of every order up to ten, i.e., all combinations of the 10 items, are included. The significance level α is set to be 0.05, and 500 independent data sets are generated under each of the above settings.
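To get a feel for how quickly the number of item combinations grows with D, the short sketch below counts them. It assumes that each combination of at most D distinct items contributes one row to the T-matrix; the function name is hypothetical.

```python
from math import comb

def num_item_combinations(J, D):
    """Number of non-empty combinations of at most D items chosen from J items."""
    return sum(comb(J, d) for d in range(1, D + 1))

print(num_item_combinations(10, 2))   # single items and pairs of 10 items
print(num_item_combinations(10, 10))  # every non-empty combination of 10 items
print(num_item_combinations(20, 2))   # single items and pairs of 20 items
```

With 10 items the count rises from 55 at D = 2 to 1023 at D = 10, which is one reason low-order combinations are computationally attractive.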

Figure 4 (a), (c), and (e) shows the proportions of rejections versus D for testing the true Q-matrices with different sample sizes. The size of the test is close to the nominal level 0.05 under all settings, showing that the asymptotic distribution approximates the reference distribution well for all D's.

Figure 4. Proportions of rejections versus D for testing true and misspecified Q-matrices. (a) Testing true Q10. (b) Misspecifying one item in Q10. (c) Testing true Q20. (d) Misspecifying one item in Q20. (e) Testing true Q30. (f) Misspecifying one item in Q30.

Additionally, when the provisional Q0 differs from the true Q-matrix by one entry, as in (10), Fig. 4 (b), (d), and (f) shows that the rejection rates stay quite stable for all considered settings when D increases from 2 to 10, indicating that the test procedure has stable power performance for different orders of item combinations. Note that the larger the sample size is, the higher the test power would be. On the other hand, we observe that when the sample size is relatively small (N = 500), the test power seems to have a slight decrease (if any) as D increases. This might be because, with N = 500, if a large order such as D = 10 is used, then the number of observed responses in many of the high-order combination categories will be small or even zero, which would lead to a slight loss of power. Therefore, in practice, we recommend using only single items and two-way combinations (D = 2), because of the computational advantage over higher-order combinations.

Table 3. Type I error studies: proportions of rejections for testing true Q-matrices with correlated attributes.

                     Q11 (K = 3)             Q21 (K = 4)             Q31 (K = 5)
         ρ =       0.25   0.50   0.75     0.25   0.50   0.75     0.25   0.50   0.75
DINA   N = 500     0.044  0.048  0.068    0.058  0.044  0.056    0.054  0.060  0.072
       N = 1000    0.042  0.044  0.060    0.050  0.058  0.048    0.054  0.050  0.060
       N = 2000    0.054  0.054  0.054    0.064  0.038  0.042    0.038  0.036  0.052
ACDM   N = 500     0.078  0.040  0.062    0.058  0.052  0.050    0.048  0.064  0.042
       N = 1000    0.066  0.052  0.050    0.048  0.050  0.036    0.062  0.066  0.038
       N = 2000    0.056  0.048  0.048    0.074  0.048  0.036    0.064  0.060  0.034
GDINA  N = 500     0.050  0.050  0.048    0.062  0.038  0.050    0.046  0.042  0.050
       N = 1000    0.058  0.044  0.038    0.048  0.034  0.046    0.034  0.038  0.036
       N = 2000    0.046  0.044  0.044    0.056  0.050  0.034    0.058  0.060  0.046

Study 2: DINA, ACDM, and GDINA under correlated attributes

To examine the performance under the setting with a non-uniform distribution of $\alpha$, a multivariate probit model is used to simulate the $\alpha$'s. In particular, for each subject, let $\phi = (\phi_1, \ldots, \phi_K)$ be the underlying ability following a multivariate normal distribution $N(0, \Sigma)$, where the covariance matrix $\Sigma$ has unit variances and common correlation $\rho$ taking values of 0.25, 0.5 and 0.75. Then the attribute profile $\alpha = (\alpha_1, \ldots, \alpha_K)$ is determined by

$$
\alpha_k = \begin{cases} 1 & \text{if } \phi_k \ge 0, \\ 0 & \text{otherwise.} \end{cases}
$$
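The probit construction above can be sketched with the standard library alone. For a common nonnegative correlation, one convenient way to draw from the equicorrelated normal (swapped in here for a general multivariate normal sampler) is a one-factor representation: each ability is a weighted sum of a shared standard normal and an independent one. The function name and defaults are hypothetical.

```python
import math
import random

def simulate_profiles(n, K, rho, seed=0):
    """Draw n attribute profiles by thresholding equicorrelated normals at 0.

    Uses phi_k = sqrt(rho) * z0 + sqrt(1 - rho) * z_k, which has unit variance
    and pairwise correlation rho (requires 0 <= rho <= 1).
    """
    rng = random.Random(seed)
    shared_weight = math.sqrt(rho)
    own_weight = math.sqrt(1.0 - rho)
    profiles = []
    for _ in range(n):
        z0 = rng.gauss(0.0, 1.0)  # factor shared by all K abilities
        phi = [shared_weight * z0 + own_weight * rng.gauss(0.0, 1.0) for _ in range(K)]
        profiles.append(tuple(1 if value >= 0 else 0 for value in phi))
    return profiles

profiles = simulate_profiles(n=1000, K=3, rho=0.5)
print(profiles[:5])
```

Each attribute is mastered with probability 1/2 marginally, while larger values of ρ push more mass onto profiles whose attributes agree, which is the source of the unbalanced profile distribution in this study.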

The other settings are similar to those of simulation study 1. 500 independent data sets are generated under each of the above settings, and single items and two-way combinations of items are used to compute the test statistics. Table 3 shows that the asymptotic distribution controls the type I error at the prespecified 5% significance level. Furthermore, the Q–Q plots of the p-values are presented in the supplementary material; they are consistent with the uniform distribution and further justify the good performance of the proposed procedure.
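When reading rejection proportions based on 500 replications, it helps to keep the Monte Carlo error in mind. The sketch below computes the binomial standard error at the nominal level and the corresponding approximate 95% band; this is a standard calculation added for orientation and is not part of the paper's analysis.

```python
import math

def rejection_rate_band(level=0.05, n_reps=500, z=1.96):
    """Approximate 95% band for an empirical rejection rate when the true rate equals level."""
    se = math.sqrt(level * (1.0 - level) / n_reps)
    return level - z * se, level + z * se

low, high = rejection_rate_band()
print(round(low, 3), round(high, 3))
```

The band is roughly 0.031 to 0.069, so most entries of Table 3 are compatible with a true size of 0.05, and the few entries slightly above the band (such as 0.078) are not surprising among 81 reported values.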

We also check the power of the test procedure in the case of correlated attributes. Similarly to the setting of balanced attributes, the data are generated from Q-matrices Q11, Q21, Q31 with correlation ρ = 0.25, 0.5, 0.75 for the three considered CDMs, and then the proposed method is applied to test the null Q-matrices Q12, Q22, and Q32, respectively. Table 4 shows that the power of the test decreases as the number of attributes K increases, and as the correlation ρ increases. However, as long as the sample size is large enough, e.g., 2000, the testing power is at least 98.5% in all the studied settings. The Q–Q plots of the p-values are also presented in the supplementary material, which as expected show the right-skewed distributions of the p-values under these misspecified Q-matrices.

Study 3: DINA, ACDM, and GDINA under Q not containing $I_K$

Next we consider hypothesis testing under three Q-matrices that do not contain an identity matrix $I_K$ after any row permutation. As is shown below, Qin,1 lacks the single-attribute row (0 0 1), Qin,2 lacks two single-attribute rows, (0 1 0) and (0 0 1), and Qin,3 lacks all three single-attribute rows (1 0 0), (0 1 0) and (0 0 1). The model parameters are taken to be the same as those in study 1.


Table 4. Power studies: proportions of rejections for testing misspecified Q-matrices Q12, Q22, and Q32 with correlated attributes.

                     Q12 (K = 3)             Q22 (K = 4)             Q32 (K = 5)
         ρ =       0.25   0.50   0.75     0.25   0.50   0.75     0.25   0.50   0.75
DINA   N = 500     0.985  0.985  0.825    0.870  0.830  0.535    0.820  0.715  0.440
       N = 1000    1.000  1.000  0.995    1.000  1.000  0.840    0.995  0.980  0.815
       N = 2000    1.000  1.000  1.000    1.000  1.000  1.000    1.000  1.000  1.000
ACDM   N = 500     0.955  0.910  0.705    0.780  0.695  0.420    0.765  0.645  0.395
       N = 1000    1.000  0.995  0.990    0.995  0.965  0.785    0.995  0.965  0.740
       N = 2000    1.000  1.000  1.000    1.000  1.000  0.990    1.000  1.000  1.000
GDINA  N = 500     0.985  0.955  0.745    0.770  0.720  0.435    0.765  0.720  0.420
       N = 1000    1.000  1.000  0.995    1.000  0.980  0.780    0.990  0.985  0.725
       N = 2000    1.000  1.000  1.000    1.000  1.000  0.985    1.000  1.000  0.995

$$
Q_{in,1} = \begin{pmatrix}
1&0&0\\ 1&0&0\\ 1&0&0\\ 0&1&0\\ 0&1&0\\
0&1&0\\ 1&1&0\\ 1&0&1\\ 0&1&1\\ 1&1&1\\
1&0&0\\ 1&0&0\\ 1&0&0\\ 0&1&0\\ 0&1&0\\
0&1&0\\ 1&1&0\\ 1&0&1\\ 0&1&1\\ 1&1&1
\end{pmatrix}, \quad
Q_{in,2} = \begin{pmatrix}
1&0&0\\ 1&0&0\\ 1&0&0\\ 1&1&0\\ 1&1&0\\
1&0&1\\ 1&0&1\\ 0&1&1\\ 0&1&1\\ 1&1&1\\
1&0&0\\ 1&0&0\\ 1&0&0\\ 1&1&0\\ 1&1&0\\
1&0&1\\ 1&0&1\\ 0&1&1\\ 0&1&1\\ 1&1&1
\end{pmatrix}, \quad
Q_{in,3} = \begin{pmatrix}
1&1&0\\ 1&1&0\\ 1&1&0\\ 1&0&1\\ 1&0&1\\
1&0&1\\ 0&1&1\\ 0&1&1\\ 0&1&1\\ 1&1&1\\
1&1&0\\ 1&1&0\\ 1&1&0\\ 1&0&1\\ 1&0&1\\
1&0&1\\ 0&1&1\\ 0&1&1\\ 0&1&1\\ 1&1&1
\end{pmatrix}. \tag{11}
$$

Table 5 records the proportions of rejections out of 500 independent simulations when testing the true Qin,1, Qin,2 and Qin,3, respectively. We can see that the proposed testing procedure is valid under these Q-matrices. The Q–Q plots of the p-values are presented in the supplementary material; they show consistency with the uniform distribution and further justify the good performance of the test method.

Table 5. Type I error studies: proportions of rejections for testing true Q-matrices.

             Qin,1                    Qin,2                    Qin,3
N       DINA   ACDM   GDINA     DINA   ACDM   GDINA     DINA   ACDM   GDINA
500     0.054  0.054  0.042     0.052  0.058  0.042     0.040  0.048  0.046
1000    0.066  0.046  0.062     0.062  0.056  0.058     0.060  0.038  0.034
2000    0.060  0.052  0.070     0.048  0.058  0.050     0.040  0.062  0.048

Table 6. Power studies: proportions of rejections for testing misspecified Q-matrices.

             Qin,4                    Qin,5                    Qin,6
N       DINA   ACDM   GDINA     DINA   ACDM   GDINA     DINA   ACDM   GDINA
500     1.000  0.995  1.000     0.975  0.565  0.670     0.925  0.275  0.395
1000    1.000  1.000  1.000     1.000  0.930  0.960     1.000  0.710  0.775
2000    1.000  1.000  1.000     1.000  1.000  1.000     1.000  0.965  0.995

We also check the power of the test for these Q-matrices. For each of Qin,1, Qin,2 and Qin,3, we change one item of (1 1 0) to (1 0 0) and denote the new matrices by Qin,4, Qin,5 and Qin,6. Data are generated from Qin,1, Qin,2 and Qin,3 and tested under null matrices Qin,4, Qin,5 and Qin,6, respectively. Table 6 records the proportions of rejections. One can observe that the power decreases as the Q-matrix lacks more types of single-attribute items. Another observation is that the test for the DINA model has the highest power, the GDINA model comes second, and the ACDM has the lowest. In general, however, when the sample size is as large as 2000, the testing procedure has power above 95% for all the models and all the Q-matrices considered. The Q–Q plots of the p-values are also presented in the supplementary material, which show the expected non-uniform distribution under the misspecified Q-matrices.

5. Real Data Analysis

This section uses the widely analyzed fraction subtraction data as an example. The data set contains 536 middle school students' responses to 20 fraction subtraction items. The original items, attributes, and response data were conceived and collected by K. Tatsuoka, and the original Q-matrix was specified in Tatsuoka (1990). Table 7 presents the Q-matrix used in de la Torre and Douglas (2004), which contains eight attributes (K = 8): (α1) convert a whole number to a fraction; (α2) separate a whole number from a fraction; (α3) simplify before subtracting; (α4) find a common denominator; (α5) borrow from whole number part; (α6) column borrow to subtract the second numerator from the first; (α7) subtract numerators; (α8) reduce answers to simplest form. Based on the proposed test statistic, the p-values of the Q-matrix under the DINA model and under the GDINA model are both less than 0.05, suggesting that the Q-matrix does not fit the data well.

Table 7. The Q-matrix Q20×8 specified in de la Torre and Douglas (2004).

Item ID  Content               α1  α2  α3  α4  α5  α6  α7  α8
1        5/3 − 3/4             0   0   0   1   0   1   1   0
2        3/4 − 3/8             0   0   0   1   0   0   1   0
3        5/6 − 1/9             0   0   0   1   0   0   1   0
4        3 1/2 − 2 3/2         0   1   1   0   1   0   1   0
5        4 3/5 − 3 4/10        0   1   0   1   0   0   1   1
6        6/7 − 4/7             0   0   0   0   0   0   1   0
7        3 − 2 1/5             1   1   0   0   0   0   1   0
8        2/3 − 2/3             0   0   0   0   0   0   1   0
9        3 7/8 − 2             0   1   0   0   0   0   0   0
10       4 4/12 − 2 7/12       0   1   0   0   1   0   1   1
11       4 1/3 − 2 4/3         0   1   0   0   1   0   1   0
12       11/8 − 1/8            0   0   0   0   0   0   1   1
13       3 3/8 − 2 5/6         0   1   0   1   1   0   1   0
14       3 4/5 − 3 2/5         0   1   0   0   0   0   1   0
15       2 − 1/3               1   0   0   0   0   0   1   0
16       4 5/7 − 1 4/7         0   1   0   0   0   0   1   0
17       7 3/5 − 4/5           0   1   0   0   1   0   1   0
18       4 1/10 − 2 8/10       0   1   0   0   1   1   1   0
19       4 − 1 4/3             1   1   1   0   1   0   1   0
20       4 1/3 − 1 5/3         0   1   1   0   1   0   1   0

Many previous works have analyzed the fraction subtraction data and the Q-matrix as specified in Table 7, including Tatsuoka (2002), Tatsuoka (2005), DeCarlo (2011), and Chiu (2013). Specifically, Tatsuoka (2002) and Tatsuoka (2005) used multiple strategies, instead of a single strategy, to model certain items. A strategy there refers to a specific set of attributes required to solve an item. In particular, the author found that for items 5 and 18 of the fraction subtraction data, two strategies could be identified to solve them, one including attributes 1, 2, 7, and the other including attributes 2, 3, 7. Better estimates of the item parameters were observed when multiple strategies were incorporated in the model. This is in line with our hypothesis testing result that, when fitted to the data, the single-strategy Q-matrix is rejected. DeCarlo (2011) used posterior mode estimation to perform classification using this Q-matrix under the DINA model. The author found that the particular specification of the Q-matrix, namely its incompleteness under the DINA model, leads to problems with classification. Chiu (2013) proposed a Q-matrix refinement approach based on a nonparametric classification method of the examinees. As recognized by the author, the validity of the refinement method relies on the correct classification of the examinees into the classes they belong to. The author applied the method multiple times to get two different refinement results under the DINA model, and both results indicate that some entries in the original Q-matrix are misspecified, which matches our rejection of the Q-matrix under the DINA model. The proposed procedure takes into account the incompleteness of the Q-matrix in the parameter estimation stage, and consistent MLEs are obtained to construct the test statistic, so it is robust to the incompleteness of the Q-matrix.
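The incompleteness mentioned here can be checked mechanically. Under the DINA model, a Q-matrix is complete exactly when every attribute has at least one item that requires it alone (Chiu et al., 2009). The sketch below applies that check to the rows of Table 7 as transcribed here; the variable and function names are hypothetical.

```python
# Rows of the 20 x 8 Q-matrix in Table 7 (items 1 to 20, attributes alpha1 to alpha8).
Q_FS = [
    (0, 0, 0, 1, 0, 1, 1, 0), (0, 0, 0, 1, 0, 0, 1, 0), (0, 0, 0, 1, 0, 0, 1, 0),
    (0, 1, 1, 0, 1, 0, 1, 0), (0, 1, 0, 1, 0, 0, 1, 1), (0, 0, 0, 0, 0, 0, 1, 0),
    (1, 1, 0, 0, 0, 0, 1, 0), (0, 0, 0, 0, 0, 0, 1, 0), (0, 1, 0, 0, 0, 0, 0, 0),
    (0, 1, 0, 0, 1, 0, 1, 1), (0, 1, 0, 0, 1, 0, 1, 0), (0, 0, 0, 0, 0, 0, 1, 1),
    (0, 1, 0, 1, 1, 0, 1, 0), (0, 1, 0, 0, 0, 0, 1, 0), (1, 0, 0, 0, 0, 0, 1, 0),
    (0, 1, 0, 0, 0, 0, 1, 0), (0, 1, 0, 0, 1, 0, 1, 0), (0, 1, 0, 0, 1, 1, 1, 0),
    (1, 1, 1, 0, 1, 0, 1, 0), (0, 1, 1, 0, 1, 0, 1, 0),
]

def attributes_with_single_attribute_items(Q):
    """Return the 1-based labels of attributes measured alone by at least one item."""
    covered = set()
    for row in Q:
        if sum(row) == 1:
            covered.add(row.index(1) + 1)
    return covered

def is_complete_under_dina(Q):
    """Completeness under DINA: every attribute has a single-attribute item."""
    return attributes_with_single_attribute_items(Q) == set(range(1, len(Q[0]) + 1))

print(attributes_with_single_attribute_items(Q_FS))
print(is_complete_under_dina(Q_FS))
```

Only attributes 2 and 7 are measured in isolation (by items 9, 6 and 8), so this Q-matrix is incomplete under the DINA model, and the identifiable quantities are the class proportions $\nu_Q$ rather than all $2^8$ profile proportions.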

We further perform a simulation study to evaluate the type I error of testing this relatively large Q-matrix Q20×8 shown in Table 7. Most analyses of this dataset and this Q-matrix in the literature were based on the DINA model, such as those in DeCarlo (2011) and Chiu (2013), so we also base our simulations for this Q-matrix on the DINA model. We simulate data from the DINA model with the slipping and guessing parameters equal to 0.2 and generate attribute profiles following several different scenarios: (1) "Uniform", $p_\alpha = 1/2^K$ for all $\alpha \in \{0, 1\}^K$; (2) $|\rho| \le 0.25$; (3) $|\rho| \le 0.5$; and (4) $|\rho| \le 0.75$, where $\rho$ is the parameter used to generate the attribute profiles as in simulation study 2. Specifically, in each simulation run under settings (2), (3), or (4), we first randomly generate a covariance matrix $\Sigma$ with unit variances and with all off-diagonal elements' absolute values less than the threshold 0.25, 0.5, or 0.75, then use $N(0, \Sigma)$ to generate subjects' attribute profiles as in simulation study 2. The proportion of rejections out of 500 simulation runs per scenario is shown in the upper panel of Table 8. We can see that even given the relatively large number of attributes K = 8, the proposed approach still performs quite well in that type I errors are well controlled under different scenarios.

Table 8. Type I error studies: proportions of rejections for testing Q20×8, DINA model.

N       Uniform           |ρ| ≤ 0.25        |ρ| ≤ 0.5         |ρ| ≤ 0.75
500     0.004             0.006             0.022             0.030
1000    0.032             0.020             0.036             0.026
2000    0.046             0.044             0.030             0.042

N       |supp(p)| ≤ 156   |supp(p)| ≤ 56    |supp(p)| ≤ 20    |supp(p)| ≤ 10
500     0.006             0.008             0.024             0.028
1000    0.028             0.032             0.028             0.022
2000    0.036             0.052             0.034             0.034

Considering the large number of attribute patterns $2^K = 256$ under Q20×8, to further look into how the number of attribute patterns existing in the sample impacts the performance of the proposed procedure, we specifically design a simulation controlling the support size of the proportion parameters $p = (p_\alpha, \alpha \in \{0, 1\}^K)$. Denote the number of nonzero elements in the vector $p$, namely the support size of $p$, by $|\mathrm{supp}(p)|$. Under the scenario $|\mathrm{supp}(p)| \le B$, in each simulation run, we randomly set $2^K - B$ of the elements of $p$ to zero and let each of the remaining attribute patterns have proportion $1/B$ in generating the data. This guarantees that there are at most $B$ attribute patterns in the sample. The bottom panel of Table 8 presents the simulation results when $|\mathrm{supp}(p)|$ is bounded by four different values, B = 156, 56, 20, and even 10. We observe that the proposed testing procedure has reasonably good performance even when the number of attribute patterns in the sample is much smaller than $2^K$.
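The sparse-support design can be sketched as follows: pick B of the $2^K$ profiles at random and give each of them proportion 1/B. This is an illustrative reading of the description above, with a hypothetical function name.

```python
import random
from itertools import product

def sparse_proportions(K, B, seed=0):
    """Return a dict mapping each profile in {0,1}^K to its proportion, with support size B."""
    rng = random.Random(seed)
    profiles = list(product((0, 1), repeat=K))
    support = set(rng.sample(profiles, B))  # the B profiles that keep positive mass
    return {alpha: (1.0 / B if alpha in support else 0.0) for alpha in profiles}

p = sparse_proportions(K=8, B=10)
print(sum(1 for value in p.values() if value > 0), "profiles carry all the mass")
```

Subjects' profiles can then be drawn from the B supported patterns with equal probability, so that no more than B distinct patterns can appear in a sample.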

In summary, the simulation results with Q20×8 support the rejection of this Q-matrix on the fraction subtraction dataset. Table 8 shows that even given a quite small number of attribute patterns in the sample, the testing procedure still enjoys very low type I error rates.

Therefore, the rejection of the original Q-matrix under the DINA model with the fraction subtraction data is less likely due to the incompleteness of the Q-matrix or the relatively large K = 8 than due to the misfit of the Q-matrix.

To further study the power performance, we generate data from the original Q-matrix in Table 7 and test a misspecified Q-matrix obtained by altering some entries of several randomly chosen items. In this "Misspecify" scenario, for sample sizes N = 500, 1000 and 2000, when the number of misspecified items is one, the values of testing power in our simulations are 0.430, 0.684 and 0.856, while when the number of misspecified items is set to two, we observe power values of 0.456, 0.810 and 0.936. These are reasonably high powers given the total number of latent attribute profiles, $2^8 = 256$.

In addition to the 20 × 8 Q-matrix, we also test the 11 × 4 Q-matrix that was studied in de la Torre and Chiu (2016) under the GDINA model. The p-value corresponding to this Q-matrix is 0.15 under the DINA model and 0.89 under the GDINA model. This suggests that the Q-matrix fits the data well under both the DINA and the GDINA models. Please see Section B of the Supplementary Material for more details.

6. Discussion

Amisspecified Q-matrixmay lead tomodel lack of fit and erroneous parameter estimation. Toaddress this concern, this paper proposes a hypothesis testing procedure to validate a prespecifiedQ-matrix. Asymptotic distribution of the test statistic is derived, and results from the simulationstudy confirm the applicability of the asymptotic approximation in various settings. Furthermore,the proposed approach provides a rigorous way to control the statistical testing errors and it can befurther used to calculate the required sample size when designing tests. The technical assumptionC1 is often satisfied in practice. Specifically, Gu and Xu (2018) established sufficient conditionsfor strict and generic identifiability of model parameters for the DINA, DINO, main-effect modelssuch as the ACDM and all-effect models such as the GDINA.

When the null hypothesis is rejected, i.e., the prespecified Q-matrix is not consistent withthe observed responses, we can further use diagnostic tools or estimation procedures to obtain abetter fitted Q-matrix. For instance, using the same objective function, Liu et al. (2012) proposean estimation procedure by minimizing S over different Q-matrices. A more efficient method hasrecently been proposed in Chen et al. (2015) and Xu and Shang (2018).

The proposed method gives a global measure of fit for a given Q-matrix, but it has thelimitation that it cannot indicate if the problem is due to the Q-matrix, the assumed form of theitem response functions, or the violation of the CDM assumptions, such as the conjunctive andcompensatory assumptions or the local independence assumption.

As suggested by one reviewer, more local information about the items and the causes ofmisfit could be obtained with standardized residuals, i.e., the marginal residuals divided by thecorresponding standard errors, which can be used as a second-stage diagnosis tool and would helpto further distinguish the misspecificatons of the Q-matrix and the item response functions. Inaddition, Tatsuoka (2002) discussed “global” versus “local” diagnostics and found that throughanalyzing patterns of posterior probabilities of latent class membership and item parameter esti-mates in CDMs, guidance for specific ways to modify the Q-matrix can be obtained.

In this paper we study random-effect CDMs, in which the population proportion parameters p_α quantify the population distribution of attribute profiles. In practice, if one has a small sample but a large number of attribute profiles, then some attribute profiles may have only a few or even no representatives in the data, and the corresponding p_α's may not be well estimated due to the small sample size. In such cases, we would recommend a two-stage approach: first select those attribute profiles that have significantly large proportions, which can be thought of as “clinically significant” attribute profiles, and then follow the proposed procedure to construct a test statistic that depends only on these “significant” attribute profiles. The number of “significant” attribute profiles might be much smaller than 2^K for relatively small N and large K, and this would benefit the testing procedure by reducing the parameter estimation uncertainty. It would be interesting to study the theoretical properties and numerical performance of such a two-stage approach, and we leave this for future study.
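
The first stage of the two-stage approach could be sketched as follows in Python. The paper does not specify how “significantly large” proportions are determined, so the fixed cutoff `threshold`, the function name, and the renormalization of the retained proportions are all hypothetical choices made only to illustrate the idea.

```python
def select_significant_profiles(proportions, threshold=0.05):
    """Keep attribute profiles whose estimated proportion is at least `threshold`.

    proportions : dict mapping an attribute profile (tuple of 0/1 entries)
                  to its estimated population proportion.
    threshold   : hypothetical cutoff; the paper does not prescribe one.

    Returns a dict of the retained profiles with proportions rescaled to
    sum to one (the rescaling is an assumption of this sketch).
    """
    kept = {alpha: p for alpha, p in proportions.items() if p >= threshold}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no attribute profile meets the threshold")
    return {alpha: p / total for alpha, p in kept.items()}
```

For example, with K = 2 and estimated proportions 0.5, 0.02, 0.08, and 0.4 for the profiles (0,0), (0,1), (1,0), and (1,1), a cutoff of 0.05 would drop only the profile (0,1); the test statistic would then be built from the three remaining profiles.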

Acknowledgments

The authors thank the Editor, the Associate Editor, and four reviewers for many helpful and constructive comments. This work is partially supported by National Science Foundation (Grant No. SES-1659328, DMS-1712717, IIS-1633360, MMS-1826540), Institute of Education Sciences (Grant No. R305D160010), and Army Grant (Grant No. W911NF-15-1-0159).

References

Bartholomew, D. J., & Tzamourani, P. (1999). The goodness of fit of latent trait models in attitude measurement. Sociological Methods & Research, 27(4), 525–546.

Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited-information goodness-of-fit testing of item response theory models for sparse 2^p tables. British Journal of Mathematical and Statistical Psychology, 59(1), 173–194.

Chen, Y., Liu, J., Xu, G., & Ying, Z. (2015). Statistical analysis of Q-matrix based diagnostic classification models. Journal of the American Statistical Association, 110, 850–866.

Chiu, C.-Y. (2013). Statistical refinement of the Q-matrix in cognitive diagnosis. Applied Psychological Measurement, 37, 598–618.

Chiu, C., Douglas, J., & Li, X. (2009). Cluster analysis for cognitive diagnosis: Theory and applications. Psychometrika, 74(4), 633–665.

de la Torre, J. (2008). An empirically-based method of Q-matrix validation for the DINA model: Development and applications. Journal of Educational Measurement, 45, 343–362.

de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76(2), 179–199.

de la Torre, J., & Chiu, C.-Y. (2016). A general method of empirical Q-matrix validation. Psychometrika, 81(2), 253–273.

de la Torre, J., & Douglas, J. (2004). Higher order latent trait models for cognitive diagnosis. Psychometrika, 69, 333–353.

DeCarlo, L. T. (2011). On the analysis of fraction subtraction data: The DINA model, classification, latent class sizes, and the Q-matrix. Applied Psychological Measurement, 35, 8–26.

DeCarlo, L. T. (2012). Recognizing uncertainty in the Q-matrix via a Bayesian extension of the DINA model. Applied Psychological Measurement, 36(6), 447–468.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via EM algorithm. Journal of the Royal Statistical Society Series B-Methodological, 39(1), 1–38.

DiBello, L., Stout, W., & Roussos, L. (1995). Unified cognitive psychometric assessment likelihood-based classification techniques. In P. D. Nichols, S. F. Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–390). Hillsdale, NJ: Erlbaum.

Gu, Y., & Xu, G. (2018). Partial identifiability of restricted latent class models. arXiv preprint arXiv:1803.04353.

Hartz, S. (2002). A Bayesian framework for the unified model for assessing cognitive abilities: Blending theory with practicality. Doctoral Dissertation, University of Illinois, Urbana-Champaign.

Henson, R., & Templin, J. (2005). Hierarchical log-linear modeling of the skill joint distribution. Technical report, External Diagnostic Research Group.

Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74(2), 191–210.

Junker, B., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272.

Lehmann, E. L., & Romano, J. P. (2006). Testing statistical hypotheses. Berlin: Springer.

Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy model for cognitive assessment: A variation on Tatsuoka’s rule-space approach. Journal of Educational Measurement, 41, 205–237.

Liu, J., Xu, G., & Ying, Z. (2012). Data-driven learning of Q-matrix. Applied Psychological Measurement, 36(7), 548–564.

Liu, J., Xu, G., & Ying, Z. (2013). Theory of self-learning Q-matrix. Bernoulli, 19(5A), 1790–1817.

Maydeu-Olivares, A. (2001). Limited information estimation and testing of Thurstonian models for paired comparison data under multiple judgment sampling. Psychometrika, 66(2), 209–227.

Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association, 100(471), 1009–1020.

Roussos, L. A., Templin, J. L., & Henson, R. A. (2007). Skills diagnosis using IRT-based latent class models. Journal of Educational Measurement, 44, 293–311.

Rupp, A. (2002). Feature selection for choosing and assembling measurement models: A building-block-based organization. Psychometrika, 2, 311–360.

Rupp, A., & Templin, J. (2008a). Effects of Q-matrix misspecification on parameter estimates and misclassification rates in the DINA model. Educational and Psychological Measurement, 68, 78–98.

Rupp, A., & Templin, J. (2008b). Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art. Measurement: Interdisciplinary Research and Perspective, 6, 219–262.

Rupp, A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York City: Guilford Press.

Sen, B., Banerjee, M., Woodroofe, M., et al. (2010). Inconsistency of bootstrap: The Grenander estimator. The Annals of Statistics, 38(4), 1953–1977.

Sen, B., & Xu, G. (2015). Model based bootstrap methods for interval censored data. Computational Statistics & Data Analysis, 81, 121–129.

Stout, W. (2007). Skills diagnosis using IRT-based continuous latent trait models. Journal of Educational Measurement, 44, 313–324.

Tatsuoka, K. (1985). A probabilistic model for diagnosing misconceptions in the pattern classification approach. Journal of Educational Statistics, 12, 55–73.

Tatsuoka, K. (1990). Toward an integration of item-response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453–488).

Tatsuoka, C. (2002). Data-analytic methods for latent partially ordered classification models. Applied Statistics (JRSS-C), 51, 337–350.

Tatsuoka, C. (2005). Corrigendum: Data analytic methods for latent partially ordered classification models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(2), 465–467.

Tatsuoka, K. (2009). Cognitive assessment: An introduction to the rule space method. Boca Raton: CRC Press.

Templin, J. (2006). CDM: Cognitive diagnosis modeling with Mplus. Available from http://jtemplin.myweb.uga.edu/cdm/cdm.html.

Templin, J., He, X., Roussos, L., & Stout, W. (2003). The pseudo-item method: A simple technique for analysis of polytomous data with the fusion model. Technical report, External Diagnostic Research Group.

Templin, J., & Henson, R. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11, 287–305.

Tollenaar, N., & Mooijaart, A. (2003). Type I errors and power of the parametric bootstrap goodness-of-fit test: Full and limited information. British Journal of Mathematical and Statistical Psychology, 56(2), 271–288.

Van der Vaart, A. W. (2000). Asymptotic statistics (Vol. 3). Cambridge: Cambridge University Press.

von Davier, M. (2005). A general diagnosis model applied to language testing data. Research report, Educational Testing Service.

von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61, 287–307.

Xu, G. (2017). Identifiability of restricted latent class models with binary responses. The Annals of Statistics, 45(2), 675–707.

Xu, G., & Shang, Z. (2018). Identifying latent structures in restricted latent class models. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2017.1340889.

Zhang, S. S., DeCarlo, L. T., & Ying, Z. (2013). Non-identifiability, equivalence classes, and attribute-specific classification in Q-matrix based cognitive diagnosis models. ArXiv e-prints.

Manuscript Received: 30 JUL 2015
Published Online Date: 11 JUL 2018