
MULTIVARIATE BEHAVIORAL RESEARCH 439

Multivariate Behavioral Research, 39 (3), 439-478. Copyright © 2004, Lawrence Erlbaum Associates, Inc.

Evaluating Small Sample Approaches for Model Test Statistics in Structural Equation Modeling

Jonathan Nevitt
Department of Agricultural and Applied Economics
Virginia Polytechnic Institute and State University

Gregory R. Hancock
Department of Measurement, Statistics and Evaluation
University of Maryland, College Park

Through Monte Carlo simulation, small sample methods for evaluating overall data-model fit in structural equation modeling were explored. Type I error behavior and power were examined using maximum likelihood (ML), Satorra-Bentler scaled and adjusted (SB; Satorra & Bentler, 1988, 1994), residual-based (Browne, 1984), and asymptotically distribution free (ADF; Browne, 1982, 1984) test statistics. To accommodate small sample sizes the ML and SB statistics were adjusted using a k-factor correction (Bartlett, 1950); the residual-based and ADF statistics were corrected using modified χ² and F statistics (Yuan & Bentler, 1998, 1999). Design characteristics included model type and complexity, ratio of sample size to number of estimated parameters, and distributional form. The k-factor-corrected SB scaled test statistic was especially stable at small sample sizes with both normal and nonnormal data. Methodologists are encouraged to investigate its behavior under a wider variety of models and distributional forms.

Structural equation modeling (SEM) has become a versatile and widely used data analytic method for evaluating causal and predictive hypotheses in the behavioral sciences. Historically it has been a “large sample” technique, with minimum sample size guidelines ranging from five to ten cases per estimated model parameter (e.g., Bentler & Chou, 1987), depending upon the method of estimation employed. Unfortunately, practitioners are often unable to obtain sufficient numbers of cases to meet such minimum guidelines, let alone satisfy an estimation method’s distributional assumptions. Thus, there has been increased demand for methods that perform optimally at smaller sample sizes and under varied distributional conditions. The methodological community has responded accordingly, with several major SEM software

Correspondence concerning this article should be directed to the first author at Department of Agricultural and Applied Economics, Virginia Tech, Blacksburg, VA 24061-0401, or e-mail to [email protected].


programs incorporating techniques that could be more viable under sub-optimal conditions. While practitioners might be eager to employ such “robust” methods, their inclusion in software packages is often intended initially to facilitate methodologists’ scrutiny. The purpose of the current investigation is to provide just such scrutiny within the context of a factorially designed Monte Carlo investigation.

Background

For a system of p measured variables [yielding p* = p(p + 1)/2 unique variances and covariances], let Xᵢ = (xᵢ₁, ..., xᵢₚ)′ for i = 1, ..., n be a sample from X = (x₁, ..., xₚ)′, with sample mean vector X̄, sample covariance matrix S, population mean vector μ, and population covariance matrix Σ₀. Then, a covariance structure model represents the elements of Σ₀ as functions of q free model parameters in vector θ, with null hypothesis H₀: Σ₀ = Σ(θ). A hypothesized model may be fit to a p × p sample covariance matrix, and for any vector of model parameter estimates (θ̂) the hypothesized model can be used to evaluate the model-implied covariance matrix, Σ(θ̂) = Σ̂. The goal in parameter estimation is to obtain a vector of parameter estimates such that a function of the discrepancy between Σ̂ and S is minimized. The maximum likelihood (ML) function is the most commonly employed discrepancy function and is defined as:

(1) F̂_ML = ln|Σ̂| – ln|S| + tr(SΣ̂⁻¹) – p,

with associated test statistic

(2) T_ML = (n – 1)F̂_ML,

asymptotically distributed as a central χ² with p* – q degrees of freedom (df) under multivariate normality and when H₀ is true (see, e.g., Hayduk, 1987).
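To make Equations 1 and 2 concrete, here is a minimal NumPy sketch (hypothetical helper names `f_ml` and `t_ml`, not the EQS/GAUSS code used in the study) that computes the discrepancy and test statistic from a sample covariance matrix S and a model-implied matrix Σ̂:

```python
import numpy as np

def f_ml(S, Sigma_hat):
    """ML discrepancy (Equation 1): ln|Sigma_hat| - ln|S| + tr(S Sigma_hat^-1) - p."""
    p = S.shape[0]
    _, logdet_sigma = np.linalg.slogdet(Sigma_hat)
    _, logdet_s = np.linalg.slogdet(S)
    return logdet_sigma - logdet_s + np.trace(S @ np.linalg.inv(Sigma_hat)) - p

def t_ml(S, Sigma_hat, n):
    """Test statistic (Equation 2): T_ML = (n - 1) * F_ML."""
    return (n - 1) * f_ml(S, Sigma_hat)
```

When Σ̂ reproduces S exactly the discrepancy is zero; any misfit makes T_ML positive, to be referred to a central χ² distribution with p* – q df.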

With respect to issues of sample size and normality in SEM, empirical research has generated a large body of literature and an understanding of the behavior of ML estimation (and its associated test statistic) with realistic data forms (Bentler & Yuan, 1999; Boomsma, 1982, 1985; Curran, West, & Finch, 1996; Fouladi, 1998, 1999, 2000; Gerbing & Anderson, 1985; Hu & Bentler, 1993, 1995; Hu, Bentler, & Kano, 1992; Tanaka, 1984, 1987; Yuan & Bentler, 1997, 1998, 1999). Key findings indicate that at small sample sizes: (a) rates of non-convergence and/or improper solutions can be high; (b) parameter estimates exhibit only marginal bias; (c) parameter standard errors become attenuated; (d) T_ML, and hence Type I error rates based on T_ML, become inflated; and (e) nonnormality exacerbates inflation in Type I error rates based on T_ML. Findings also commonly suggest that sample size adequacy is best measured by the ratio of subjects-to-estimated parameters (n:q) rather than by sample size in an absolute sense. To combat distortion in the model test statistic, two fundamentally divergent approaches have emerged and are examined in this investigation: abandoning ML for distribution-free estimation methods, and using ML for parameter estimation but then adjusting the test statistic to account for the effects of sample size and nonnormality. Each of these approaches is discussed in turn below.¹

Browne’s (1982, 1984) asymptotically distribution-free (ADF) method estimates model parameters by minimizing

(3) F̂_ADF = (s – σ̂)′Γ̂⁻¹(s – σ̂),

where s = vech(S) (i.e., a p* × 1 column vector of the unique elements of S), where σ̂ = vech(Σ̂), and where

(4) Γ = Cov{vech[(X – μ)(X – μ)′]},

is a symmetric p* × p* population fourth-order moment weight matrix. Browne (1982) proposed estimating Γ using only the sample data [requiring estimation of p*(p* + 1)/2 unique elements in the weight matrix]. Let

(5) Yᵢ = vech[(Xᵢ – X̄)(Xᵢ – X̄)′].

Then an estimator for Γ is

(6) Γ̂ = S_Y = Cov(Yᵢ),

the sample covariance matrix of the Yᵢ. Under the null hypothesis, the associated test statistic

(7) T_ADF = (n – 1)F̂_ADF,

asymptotically follows a central χ² distribution with p* – q df. Monte Carlo experiments have demonstrated that at large sample sizes (e.g., n ≥ 5,000) T_ADF yields Type I error rates at the nominal level (Chou, Bentler, & Satorra,

¹ Bootstrap resampling is another approach for obtaining robust statistics in SEM. While bootstrapping appears to be a viable alternative under nonnormal data conditions, it is not directly intended to address the problem of small samples and thus was not pursued in this investigation.


1991; Curran et al., 1996; Hu et al., 1992; Muthén & Kaplan, 1992; Tanaka, 1984). However, with large models and small to moderate sample sizes ADF is problematic, leading to high rates of non-convergence and improper solutions, and to inflated Type I error rates associated with inflated T_ADF values when models do converge.

ADF’s ability to yield a correct test statistic under nonnormal conditions at large sample sizes inspired Yuan and Bentler (1997, 1999) to develop corrections to T_ADF for small sample sizes. They noted that in the regression literature cross products of model residuals are often used when estimating asymptotic covariances, and proposed an estimator for the weight matrix in F̂_ADF as

(8) Γ̂ = [1/(n – 1)] Σᵢ₌₁ⁿ (Yᵢ – σ̂)(Yᵢ – σ̂)′.

Incorporating this estimator into F̂_ADF is tantamount to rescaling the original ADF test statistic such that

(9) T_ADF1 = {1/[1 + T_ADF/(n – 1)]}T_ADF,

which follows a central χ² distribution with p* – q df (when H₀ is true).
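Computationally, Equations 8 and 9 reduce to a residual cross-product matrix and a scalar rescaling. A sketch under the assumption that `Y` holds the n rows Yᵢ and `sigma_hat` holds vech(Σ̂) (hypothetical names, not from the article):

```python
import numpy as np

def gamma_hat_residual(Y, sigma_hat):
    """Equation 8: [1/(n-1)] * sum_i (Y_i - sigma_hat)(Y_i - sigma_hat)'."""
    R = Y - sigma_hat              # n x p* matrix of residuals Y_i - sigma_hat
    return (R.T @ R) / (len(Y) - 1)

def t_adf1(t_adf, n):
    """Equation 9: T_ADF1 = T_ADF / (1 + T_ADF / (n - 1))."""
    return t_adf / (1.0 + t_adf / (n - 1))
```

Note that the rescaling in Equation 9 always shrinks T_ADF, which is the direction needed to offset its small-sample inflation.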

Further motivated to improve small sample performance associated with T_ADF, Yuan and Bentler (1999) proposed another modification to the ADF test statistic, appealing to Fisher’s F distribution. They offered a transformation of T_ADF based upon the logic of the transformation applied to Hotelling’s T² statistic in MANOVA (see Stevens, 1996, p. 155, for a review of the T² statistic). Observing that T² is a quadratic form, similar in structure to the ADF fit function, they proposed rescaling T_ADF to an F-distributed statistic,

(10) T_ADF2 = {[n – (p* – q)]/[(n – 1)(p* – q)]}T_ADF,

with numerator and denominator df of p* – q and n – (p* – q), respectively.

Yuan and Bentler (1999) used Monte Carlo simulation to investigate the small sample performance of T_ADF, T_ADF1, and T_ADF2, systematically varying distributional forms and sample sizes. They reported that over the range of sample sizes investigated T_ADF1 and T_ADF2 maintained adequate control of Type I error rates as compared to T_ADF, and yielded adequate power (for rejecting misspecified models) with both normal and nonnormal data. Inspired by these limited, yet promising, results, the adjusted ADF test statistics have been incorporated into recent releases of SEM software: T_ADF1 is currently available in EQS 5.7 (Bentler, 1996), and the F-distributed statistic T_ADF2 will be incorporated into EQS 6.0.
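Equation 10 is a pure rescaling, so a sketch needs only arithmetic (a hypothetical helper that returns the statistic together with its numerator and denominator df):

```python
def t_adf2(t_adf, n, p_star, q):
    """Equation 10: rescale T_ADF to an F statistic with (p* - q, n - (p* - q)) df."""
    d = p_star - q                          # numerator df
    stat = (n - d) * t_adf / ((n - 1) * d)
    return stat, d, n - d
```

The resulting value is referred to an F distribution rather than a χ² distribution, which builds the sample size into the reference distribution itself.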


Another strategy to control for nonnormality and potentially small samples is to estimate model parameters using ML and then assess data-model fit using a test statistic that has been corrected. These corrections take two basic forms: (a) adjusting T_ML for sample size and nonnormality, and (b) constructing distribution-free test statistics based on ML-estimated model parameters. Both approaches are reviewed as follows.

The most well known corrections to T_ML were developed by Satorra and Bentler (1988, 1994), with two modifications to T_ML (generically referred to as SB statistics here) that make adjustments based on the degree of nonnormality in the sample data. Define σ̂_θ as the p* × q matrix of partial derivatives of the p* elements in σ̂ with respect to the q model parameters (i.e., the Jacobian matrix), evaluated at the final model parameter estimates. Let W be the symmetric p* × p* matrix of unique fourth-order moments obtained by Σ̂⁻¹ ⊗ Σ̂⁻¹, and let

(11) Û = W – Wσ̂_θ(σ̂_θ′Wσ̂_θ)⁻¹σ̂_θ′W,

be the residual weight matrix of those inverted fourth-order moments. Then,

(12) T_SB1 = [(p* – q)/tr(ÛS_Y)]T_ML.

The asymptotic distribution of T_SB1 is generally unknown; however, when H₀ is true its first moment matches a central χ² distribution with p* – q df. The T_SB2 statistic adjusts the model df to yield

(13) d′ = [tr(ÛS_Y)]²/tr(ÛS_YÛS_Y),

for the model df. Then as with T_SB1, T_ML is scaled as

(14) T_SB2 = [d′/tr(ÛS_Y)]T_ML.

Like T_SB1, the asymptotic distribution of T_SB2 is generally unknown; however, both its first and second moments match that of a central χ² distribution with d′ df (when H₀ is true). Simulation research has indicated that both SB adjustments to T_ML can be effective in controlling Type I error rates under some experimental conditions (Chou et al., 1991; Curran et al., 1996; Fouladi, 1998, 1999, 2000; Hu et al., 1992; Nevitt & Hancock, 2001; Yuan & Bentler, 1998). The scaled T_SB1 is currently available in EQS 5.7 (Bentler, 1996) and LISREL 8.3 (Jöreskog & Sörbom, 1996), while variants of T_SB1 and T_SB2 have been incorporated into Mplus 1.0 (Muthén & Muthén, 1998).
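Equations 11 through 14 can be assembled in a few matrix operations. A hypothetical NumPy sketch, assuming the weight matrix `W`, the Jacobian `jac` (σ̂_θ), and the fourth-order moment matrix `S_Y` are already available:

```python
import numpy as np

def sb_statistics(t_ml, W, jac, S_Y):
    """Satorra-Bentler scaled/adjusted statistics (Equations 11-14 sketch).
    W: p* x p* weight matrix; jac: p* x q Jacobian; S_Y: p* x p* 4th-moment matrix."""
    WJ = W @ jac
    U = W - WJ @ np.linalg.inv(jac.T @ W @ jac) @ WJ.T   # Equation 11
    US = U @ S_Y
    tr1, tr2 = np.trace(US), np.trace(US @ US)
    p_star, q = jac.shape
    t_sb1 = (p_star - q) / tr1 * t_ml                    # Equation 12: scaled
    d_adj = tr1 ** 2 / tr2                               # Equation 13: adjusted df
    t_sb2 = d_adj / tr1 * t_ml                           # Equation 14: adjusted
    return t_sb1, t_sb2, d_adj
```

T_SB1 is referred to a χ² distribution with p* – q df, and T_SB2 to a χ² distribution with d′ df.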


Fouladi (1998, 1999, 2000) proposed another correction to T_ML, combining a small-sample correction developed by Bartlett (1950) with the SB corrections for nonnormality. Within the context of exploratory factor analysis, Bartlett (1950) offered a k-factor correction to the ML test statistic (where k represents the number of latent factors in the model), suggesting that at small sample sizes

(15) T = (n – p/3 – 2k/3 – 11/6)F̂_ML,

more closely follows a central χ² distribution (with p* – q df) than the usual T_ML statistic. This adjusted statistic is equivalent to applying a multiplicative correction to T_ML (or to any test statistic) of the form

(16) c = 1 – [(2p + 4k + 5)/6(n – 1)].

Fouladi (1998, 2000) applied this correction factor to T_ML to improve its small sample performance with normal data. She also investigated the k-factor correction applied to T_SB1 and T_SB2 with nonnormal data and reported that this scaling correction can be effective in controlling Type I error rates under some experimental conditions (Fouladi, 1999). Interestingly, for all three studies she only examined the measured variable path analysis case, which has no latent factors (i.e., k = 0).
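Equations 15 and 16 describe the same adjustment: multiplying a statistic by c is identical to replacing the (n – 1) multiplier in Equation 2 with n – p/3 – 2k/3 – 11/6. A small sketch with hypothetical helper names:

```python
def k_factor_c(n, p, k):
    """Equation 16: Bartlett-style multiplicative correction factor."""
    return 1.0 - (2 * p + 4 * k + 5) / (6.0 * (n - 1))

def apply_k_factor(t_stat, n, p, k):
    """Shrink any test statistic by c; with k = 0 this is the path-analysis case."""
    return k_factor_c(n, p, k) * t_stat
```

Since c < 1 for any admissible n, p, and k, the correction always pulls the statistic downward, counteracting the small-sample inflation of T_ML.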

While ADF estimation and T_ML scaling adjustments appear to provide some protection against distortion in data-model fit statistics when modeling nonnormal data at small sample sizes, both approaches demonstrate either practical or theoretical limitations. For example, ADF estimation can be carried out only when sample sizes are above an absolute lower bound of n = p* (see, e.g., Bentler & Yuan, 1999). With respect to T_ML scaling, the SB adjustments to T_ML have been criticized for lacking known asymptotic results with nonnormal data; the k-factor correction as proposed by Fouladi (1998, 1999, 2000) is mostly uninvestigated.

To address the potential shortcomings of ADF estimation and T_ML scaling adjustments, Yuan and Bentler (1998) turned to a mostly ignored residual-based test statistic offered by Browne (1984), noting that there exists a p* × (p* – q) matrix σ̂_θc whose columns are orthogonal to σ̂_θ (and to each other). Then the residual-based test statistic (asymptotically distributed as a central χ² with p* – q df when H₀ is true) proposed by Browne (1984) is

(17) T_B = (n – 1)ê′σ̂_θc(σ̂_θc′S_Yσ̂_θc)⁻¹σ̂_θc′ê,


where ê = s – σ̂ is a p* × 1 column vector of residual variances and covariances. Yuan and Bentler (1998) contended that T_B could be applied to any consistent estimator for θ, even when data are not normally distributed, and argued that ML is a prudent choice. The T_B test statistic is available to practitioners within LISREL 8.3 (Jöreskog & Sörbom, 1996).

The advantage to T_B is that it is an asymptotically distribution-free test statistic with theoretical elegance not found with T_SB1, T_SB2, or the k-factor corrected test statistics in that it follows a known sampling distribution without assuming multivariate normality of the data. Moreover, Bentler and Yuan (1999) indicated that T_B has smaller minimum sample size requirements than T_ADF, with a lower bound of n = df + 1. Unfortunately, these authors found that T_B has a propensity to over-reject true models at n < 5,000, thus showing Type I error performance remarkably similar to that of the unadjusted T_ADF test statistic.
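Equation 17 can be sketched directly: an orthogonal complement of the Jacobian is obtainable from the full SVD (a convenient stand-in for the Gill, Murray, and Wright routine the authors cite for this purpose), after which T_B is a quadratic form in the residual vector. A hypothetical NumPy sketch:

```python
import numpy as np

def t_b(e, jac, S_Y, n):
    """Browne's residual-based statistic (Equation 17 sketch).
    e: p* residual vector s - sigma_hat; jac: p* x q Jacobian; S_Y: p* x p* weights."""
    # Columns of the left singular vectors beyond rank q span the orthogonal
    # complement of the Jacobian's column space.
    U_svd, _, _ = np.linalg.svd(jac, full_matrices=True)
    comp = U_svd[:, jac.shape[1]:]               # p* x (p* - q), orthonormal columns
    middle = np.linalg.inv(comp.T @ S_Y @ comp)
    r = comp.T @ e                               # residuals projected onto complement
    return (n - 1) * r @ middle @ r
```

Intuitively, only the part of ê lying outside the column space of the Jacobian (i.e., misfit the model cannot absorb through its parameters) contributes to T_B.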

The poor performance of T_B at small samples led Yuan and Bentler (1998) to propose corrections to the test statistic that are analogous to the small sample corrections to adjust T_ADF. First, they proposed replacing the fourth-order moment matrix in T_B (i.e., S_Y) with the improved weight matrix given in Equation 8. Incorporating this estimator into T_B is equivalent to rescaling it such that

(18) T_B1 = {1/[1 + nT_B/(n – 1)²]}T_B,

with T_B1 following a central χ² distribution with p* – q df when H₀ is true.

Yuan and Bentler (1998) also noted that the equation for T_B takes on a quadratic form also reminiscent of Hotelling’s T² statistic. They proposed rescaling T_B to an F-distributed statistic,

(19) T_B2 = {[n – (p* – q)]/[(n – 1)(p* – q)]}T_B,

with numerator and denominator df of p* – q and n – (p* – q), respectively. The residual-based test statistics will be incorporated into EQS 6.0.
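Both residual-based corrections are again scalar rescalings of T_B; a sketch of Equations 18 and 19 (hypothetical helpers):

```python
def t_b1(tb, n):
    """Equation 18: T_B1 = T_B / (1 + n * T_B / (n - 1)^2); chi-square, p* - q df."""
    return tb / (1.0 + n * tb / (n - 1) ** 2)

def t_b2(tb, n, p_star, q):
    """Equation 19: F-distributed rescaling, df = (p* - q, n - (p* - q))."""
    d = p_star - q
    return (n - d) * tb / ((n - 1) * d)
```

Note that Equation 19 applies to T_B exactly the same multiplier that Equation 10 applies to T_ADF.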

Yuan and Bentler (1998) examined the residual-based statistics in a small Monte Carlo experiment, varying combinations of distributional form and sample size. T_B showed a strong tendency to over-reject true models at all but the largest sample size, with the F-distributed T_B2 demonstrating the best overall performance. The strong performance of T_B2 evidenced by Yuan and Bentler (1998) at the smallest sample size investigated (n = 150, n:q = 4.5:1) led Bentler and Yuan (1999) to conduct an experiment designed to push the test statistics to their sample size lower bounds. Some important findings from their study are: (a) consistent with previous investigations, T_ML yielded


inflated Type I error rates with normally distributed data at reduced sample sizes (even at their largest sample size of n = 120, yielding n:q = 3.6:1); (b) T_B and T_B1 performed extremely poorly — under every condition T_B yielded 100% model rejections while T_B1 yielded 0% model rejections; (c) T_SB1 was not competitive under the study conditions, yielding inflated Type I error rates ranging from 10-37%; and (d) T_B2 performed quite well over the range of experimental conditions. Based on these findings, Bentler and Yuan recommended that practitioners turn to the T_B2 F-distributed statistic regardless of distributional form when sample sizes are at least n ≥ df + 1.

The Current Study

While the studies reviewed above yielded promising results, there is clearly the need to characterize more fully the nature of small sample test statistics in SEM. First, no studies have investigated the k-factor correction with latent variable systems (for which k > 0), and thus it is unknown if the improvement afforded to T_ML for normal data and the improvements to T_SB1 and T_SB2 with nonnormal data using the k-factor multiplier will hold under such models. If an improvement with latent variable models is realized using this correction, then determining a lower bound on sample size requirements for the effective use of the k-factor correction could yield improved recommendations to practitioners. Second, much remains to be learned regarding the behavior of the ADF and residual-based adjusted test statistics, T_ADF1, T_ADF2, and T_B1, T_B2. Bentler and Yuan (1999) noted that much of the characterization of these test statistics focuses only on a specialized model (a three-factor confirmatory factor model) under a very narrow range of conditions. They called for continued research, investigating these test statistics under a greater variety of conditions, including variations in the number of variables in the system and the type of model examined. Third, for sample sizes in which ADF estimation is possible, no research exists that directly compares the relative performance of T_ADF1 and T_ADF2 against their respective residual-based analogs, T_B1 and T_B2, leaving practitioners with no clear choice among the ADF-based and residual-based ML test statistics. This current investigation seeks to address these important issues, contributing to our understanding of small sample test statistics in SEM.


Method

Test Statistics Examined

This study examined test statistics for evaluating global fit in SEM that, based on previous investigations, have shown promise for practitioner use under commonly encountered data conditions. The following statistics were investigated: T_SB1, T_SB2, T_ML-k, T_SB1-k, T_SB2-k (a k suffix denotes the k-factor correction was applied), T_ADF1, T_ADF2, T_B1, and T_B2. The uncorrected T_ML, T_ADF, and T_B test statistics were also analyzed under all conditions (given sample size constraints), providing benchmarks against which to judge their adjusted forms.

Population Models

Two population models were developed from which simulated samples of data were drawn. Population Model A (Figure 1) is a five-factor latent variable path (LVP) model with three indicators per factor; Population Model

Figure 1. Latent Variable Path (LVP) Population Model A


B (Figure 2) is a seven-factor confirmatory factor analysis (CFA) model, also with three indicators per factor. We chose models that would replicate previous methodological findings as well as extend knowledge to different model types.

Model B includes features that have not been examined in previous methodological investigations — a substantially larger model requiring large numbers of parameters to be estimated, and a model with correlated error terms, two aspects that are commonly found in applied modeling scenarios. We established non-zero error covariances among all residuals within three groups: among all factors’ first indicators, among all factors’ second indicators, and among all factors’ third indicators. This yielded 63 non-zero error covariances, each set to a population value of 0.1. Three additional non-zero error covariances, also set to 0.1, were included between residuals ε₁ and ε₂, between ε₁₀ and ε₁₁, and between ε₁₉ and ε₂₀. Thus, a total of 66 non-zero error covariances were established in Population Model B. This population model reflects applied scenarios such as when constructs have indicators that are parallel questionnaire items differing only in content, such as those assessing self-concept in multiple subject matter domains (e.g., mathematics, science, and reading).

Figure 2. Confirmatory Factor Analysis (CFA) Population Model B


Model Specifications for Type I Error Analyses

When fitting simulated samples of data drawn from each population, model specifications (all properly specified) were implemented that examined both constrained and unconstrained models. Again, we incorporated design elements that would replicate previous simulation studies, and examine previously unanalyzed model specifications found in applied modeling scenarios. Complete details for model specifications are given in Appendix A (using LISREL notation). Specifications A1 and B1 were unconstrained versions of Models A and B, respectively, which required larger numbers of parameters to be estimated. Specifications A2 and B2 had loading and error variance constraints imposed on the measurement portion of Models A and B, respectively, and thus required fewer parameters to be estimated (see Appendix A). Such constraints are similar to those found in longitudinal models in which the same constructs are measured at multiple time points. When the equality forced by these constraints holds true in the populations, as is the case in the current simulations, then these constrained models are properly specified as well (see, e.g., Yuan & Bentler, 1998). Table 1 summarizes, for each model specification, p, q, and df, as well as sample size information (discussed in the next section).

Table 1
Simulation Conditions for Properly Specified Models

                                      Sample Size Condition
Model     p     p*      q     df     1:1     2:1     5:1    10:1   df + 1      p*
A1       15    120     35     85      35      70    175b    350b      86a    120b
A2       15    120     18    102      18      36      90    180b     103a    120b
B1       21    231    129    102    129a    258b    645b   1290b     103a    231b
B2       21    231     35    196      35      70     175    350b     197a    231b

Note. Sample size conditions reflect subject-to-estimated parameter ratios, with the exception of the last two columns, which are the sample size minima for the residual-based and ADF test statistics, respectively.
a indicates sample size sufficiently large to compute residual-based test statistics. b indicates sample size sufficiently large to compute residual-based and ADF test statistics.
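The entries in Table 1 follow mechanically from p and q; a hypothetical helper reproduces, for example, the Model A1 row:

```python
def design_row(p, q, ratios=(1, 2, 5, 10)):
    """Compute p* = p(p+1)/2, df = p* - q, the n:q sample sizes, and the
    minimum n for the residual-based (df + 1) and ADF (p*) statistics."""
    p_star = p * (p + 1) // 2
    df = p_star - q
    return {"p*": p_star, "df": df,
            "n_by_ratio": [r * q for r in ratios],
            "n_min_residual": df + 1, "n_min_adf": p_star}
```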


Sample Size Conditions

For all four models, ratios of n:q = 1:1, 2:1, 5:1, and 10:1 were examined. In addition, each model was also inspected under a sample size condition at which the residual-based and ADF test statistics are at their respective minima. Thus, each model was examined under a total of six sample size conditions. Test statistics based on T_ML are not limited to any sample size constraints per se and were examined across all six conditions. The ADF and residual-based test statistics were only examined under sample size conditions that were sufficiently large to support these methods. Table 1 presents each combination of model and sample size condition and the associated absolute sample size for that cell in the design of the experiment.

Data Generation and Modeling; Computation and Verification of Test Statistics

Three multivariate distributions were examined, each established through the manipulation of univariate skew and kurtosis of the measured variables. All manifest variables were drawn from the same univariate distribution for each data condition. Distribution 1 is multivariate normal with univariate skew and kurtosis both set equal to 0. (Note that normality is defined here, as is commonly done, by using a shifted kurtosis value of 0 rather than a value of 3.) Distribution 2 represents an elliptical distribution — data are nonnormal but symmetric with univariate skew of 0 and kurtosis of 6.0. Distribution 3 is nonnormal and asymmetric with univariate skew of 3.0 and kurtosis of 21.0. Simulated data matrices were generated in GAUSS (Aptech Systems, 1996) using the programming described by Nevitt and Hancock (1999) that follows the algorithm developed by Vale and Maurelli (1983).

All modeling of simulated data was performed using EQS 5.7b (Bentler, 1996). The T_ML and T_ADF test statistics were captured directly from EQS, as were the Jacobian matrix of partial derivatives and the residual covariance matrix. All other test statistics were calculated using GAUSS. The duplication matrix requisite for calculating the SB test statistics was constructed as described by Magnus and Neudecker (1986, pp. 171-176); the algorithm for obtaining the Jacobian complement requisite for the residual-based test statistics followed Gill, Murray, and Wright (1981, pp. 37-40, 162-163). The fourth-order moment weight matrix, S_Y, requisite for both the SB and the residual-based test statistics, was constructed from the unbiased sample variances and covariances of the Yᵢ data (i.e., an n – 1 divisor was used when computing the variances and covariances).


Because most of the test statistics in the present investigation were constructed outside of EQS, careful steps were taken to verify the accuracy of intermediate matrices and test statistics. The (biased) weight matrix S_Y (before computing the unbiased S_Y with n – 1 as divisor), and the T_SB1 and T_ADF1 test statistics were verified using EQS 5.7b (Bentler, 1996). Accuracy of the residual-based test statistic T_B was verified using LISREL 8.3 (Jöreskog & Sörbom, 1996); T_ADF2 statistics were matched with output from a pre-release of EQS 6.0.

Design and Execution

For the investigation of Type I error rates, the study fully crossed four model specifications with six sample size conditions with three distributional forms, yielding 72 “between” cells. For each cell, independent data sets were drawn from the associated population model, distributional form, and sample size, and fit to the appropriate model specification using EQS. The same simulated data sets were not used to fit different model specifications (i.e., across Models A1 and A2, or B1 and B2) even though they might have been drawn from the same population, distributional form, and sample size. Such an approach was adopted to prevent any potential dependencies among the results across Models A1 and A2, or across Models B1 and B2. The number of iterations to convergence was set at 200 for ML estimation; EQS 5.7b (Bentler, 1996) limits the maximum number of iterations for ADF estimation to 30.

Simulated data samples were subjected to parameter estimation using ML and ADF to calculate the test statistics. Start values were assigned using the true population parameter values. For each data set all test statistics were computed (given adequate n). If a given estimation method failed, or a test statistic could not be computed (i.e., a matrix was not invertible), then only those test statistics associated with that sample of data that failed were discarded and replaced, not the sample of data itself. Other test statistics that could be constructed from that sample of data were used in the replication count. The SB and residual-based test statistics were flagged as failing (and discarded) whenever their necessary product matrices could not be inverted. Additionally, the residual-based test statistics were flagged and discarded whenever T_B exceeded a clearly inappropriate value of 900,000. For each cell in the design we obtained 2000 successful replications; this number of replications yields high power for detecting a departure from Bradley’s (1978) liberal criterion of robustness at the α = .05 level of significance (Robey & Barcikowski, 1992), as described in the next section.
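The statistic-by-statistic discard-and-replace scheme can be sketched as follows. This is a toy illustration of the bookkeeping only, not the study's GAUSS/EQS pipeline; here `None` stands in for any computational failure (non-convergence, a non-invertible matrix, or an aberrant value):

```python
import random

def replicate_until(compute_stats, draw_sample, n_target, rng):
    """Draw fresh samples until every statistic has n_target successes.

    compute_stats maps a statistic's name to a function returning a value,
    or None on failure. A failed statistic is simply re-attempted on new
    data, while other statistics computed from the same sample still count
    toward their own replication totals.
    """
    results = {name: [] for name in compute_stats}
    while any(len(v) < n_target for v in results.values()):
        data = draw_sample(rng)
        for name, fn in compute_stats.items():
            if len(results[name]) >= n_target:
                continue                    # this statistic's cell is full
            value = fn(data)
            if value is not None:
                results[name].append(value)
    return results

# Toy illustration: one "statistic" always succeeds, one fails half the time
rng = random.Random(7)
stats = {
    "always": lambda x: x,
    "flaky": lambda x: x if x < 0.5 else None,
}
out = replicate_until(stats, lambda r: r.random(), 50, rng)
```

In the study the target was 2000 successful replications per cell, with failure rates for each statistic tracked separately (Appendix B).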


Summarizing Results and Defining Test Statistic Robustness

Study results were analyzed in terms of empirical Type I error rates across each cell’s 2000 replicates. As Yuan and Bentler (1998, 1999) noted, although adherence to the referenced sampling distribution is important, the primary concern for hypothesis testing is with respect to the tail behavior of the test statistic (i.e., in terms of model rejection rates). As such, the percentage of model rejections across each cell’s replications is reported for each study condition and is used to gauge test statistic robustness. All hypothesis testing was performed at the .05 level. Type I error robustness is evaluated using both an expected 95% confidence interval (CI) about α = .05 and Bradley’s liberal criterion (Bradley, 1978). Given the 2000 replications per condition in this investigation and α = .05, the 95% CI is [.05 ± 1.96(.05 × .95/2000)^1/2] × 100% = (4.04%, 5.96%); the robustness interval corresponding to Bradley’s liberal criterion is (.5α, 1.5α) × 100% = (2.5%, 7.5%).
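Both intervals can be reproduced directly; a quick check of the arithmetic:

```python
import math

alpha, n_reps = 0.05, 2000

# 95% CI for an empirical rejection rate when the true rate is alpha
half_width = 1.96 * math.sqrt(alpha * (1 - alpha) / n_reps)
ci = (100 * (alpha - half_width), 100 * (alpha + half_width))   # in percent

# Bradley's (1978) liberal robustness criterion: (.5*alpha, 1.5*alpha)
bradley = (100 * 0.5 * alpha, 100 * 1.5 * alpha)

print(f"95% CI: ({ci[0]:.2f}%, {ci[1]:.2f}%)")             # (4.04%, 5.96%)
print(f"Bradley: ({bradley[0]:.1f}%, {bradley[1]:.1f}%)")  # (2.5%, 7.5%)
```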

Power Analysis

While the Type I error propensity of the test statistics investigated is the central focus in our study, the power of a test statistic to reject an improperly specified model also is considered.² This investigation takes a two-step approach to evaluating test statistic utility. In the first step, Type I error propensity is evaluated. Using Bradley’s (1978) liberal criterion, test statistics under study conditions that yielded empirical Type I error rates above the upper bound of 7.5% were eliminated from subsequent power analyses under those same conditions. Such screening is deemed reasonable because liberal Type I error rates are indicative of inflated test statistics, which in turn would be expected under nonnull conditions to precipitate power estimates that are artificially inflated and thus not comparable with power estimates from test statistics that do maintain reasonable Type I error control. Test statistics with model rejection rates within the robustness interval, or below the lower bound of 2.5%, were examined in a follow-up power analysis. Test statistics that yielded empirical Type I error rates below the lower bound of robustness were retained for power analysis, rather than being eliminated, because these test statistics (albeit at a seeming disadvantage) could potentially yield acceptable power with respect to rejecting misspecified models.

² The objective of the power analyses is to provide secondary information regarding the relative performance of the test statistics in this study, allowing test statistics to be compared against one another. These analyses are not meant to estimate the power of the test statistics investigated in any absolute sense.


For each test statistic that yielded an observed Type I error rate that was within or below the robustness interval, a new series of 1000 replicates was generated using the same population model for that cell in the study, and at the same distributional form and sample size condition. These new data sets were then fit to an improperly specified model. Sample data drawn from the five-factor LVP population model were fit to a three-factor LVP model; sample data drawn from the seven-factor CFA population model were fit to a four-factor CFA model. Complete model (mis)specifications are given in Appendix A. These model misspecifications are reflective of situations in which models are under-parameterized, an aspect common in applied modeling scenarios. The relative degree of noncentrality across all misspecified models (under multivariate normality) was fairly consistent, as indicated by the population root mean-square error of approximation (ε) values (see, e.g., Browne & Cudeck, 1993). Specifically, fitting misspecified models to the population covariance matrices yielded a nonzero F_ML for each model, from which ε = (F_ML/df)^1/2 is determined: ε_A1 = .144, ε_A2 = .145, ε_B1 = .117, and ε_B2 = .122.³ Because the misspecified models did not perfectly match in model df against their companion properly specified models it was not possible to maintain the exact same absolute sample sizes and n:q ratios in the cells of the power analysis as compared to the Type I error analysis. Thus, the absolute sample sizes examined in the Type I error analysis were preserved in the power analysis (rather than the n:q ratios).
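The population ε computation is a one-liner. A sketch with a hypothetical F_ML value, since the article reports only the resulting ε values and not the fit-function minima themselves:

```python
import math

def rmsea(f_ml, df):
    """Population RMSEA: epsilon = sqrt(F_ML / df)."""
    return math.sqrt(f_ml / df)

# Hypothetical round-trip: an F_ML chosen to imply epsilon = .144 at the
# misspecified Model A1's 87 df (illustrative only; not a reported value).
f_ml = 0.144 ** 2 * 87
eps = rmsea(f_ml, 87)
```

A perfectly fitting model has F_ML = 0 and hence ε = 0; larger ε values indicate greater misspecification per degree of freedom.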

Results

Non-convergence, Improper Solutions, and Test Statistic Failure Rates

Rates of non-convergence and improper solutions for the ML and ADF estimation methods, as well as failure rates for the SB and residual-based test statistics, were tracked for all cells in the Type I error portion of the study. Improper solution and test statistic failure rates are based only on those data sets that yielded converged solutions. Non-convergence, improper solutions, and test statistic failure rates are summarized here briefly; complete tables are given in Appendix B. Overall, for the properly specified models, non-convergence and improper solutions for ML and ADF were a function of distributional form and sample size, with increasing departure from multivariate normality and decreasing sample size systematically leading to increasing rates of non-convergence and improper solutions.

³ While population ε values may be used to determine power to reject misspecified models (e.g., MacCallum, Browne, & Sugawara, 1996), such power determinations are not reported here as they only apply under multivariate normality and with normal theory estimators.


For ML estimation, non-convergence was rare, with the highest rate at 2%. With respect to improper solutions, ML was problematic only at n:q ≤ 1:1, with rates as high as 62%. At all larger n:q ratios, improper solution rates for ML estimation were usually not more than 20%. Unlike ML, ADF estimation frequently yielded high rates of model non-convergence, most particularly at the sample size lower bound n = p*, where non-convergence rates ranged from 20-40%. At n > p*, ADF yielded non-convergence rates around 1-15%. ADF was also more problematic than ML estimation, with rates of improper solutions that ranged from 13-87% at n = p*, were as high as 32% at n:q = 5:1, and were up to 13% for some cells at n:q = 10:1.

With respect to test statistics, the SB test statistics never failed, delivering 0% failure rates across the board (i.e., the triple-product matrix requisite for computing the SB-based test statistics was successfully inverted every time). The T_B test statistic also never failed with respect to inversion of its necessary product matrix; however, there were replicates at the lower bound n = df + 1 sample size in which computed values of T_B exceeded a cut-off value of 900,000 and were discarded and replaced. Such failure rates ranged from 7-44% and appeared to be somewhat model-dependent, with unconstrained models showing minimal failure (7-9%) while constrained models tended to yield aberrant values for T_B more frequently (10-44%).

Type I Error Rates

Results for the unconstrained LVP model (A1) are given in Table 2. With normal data at small sample size conditions T_ML delivered inflated model rejection rates that monotonically decreased with increasing sample size, as expected. Using Bradley’s (1978) liberal criterion, T_ML was robust only at n:q = 10:1. The Bartlett k-factor correction to T_ML showed improvement with normal data, with Type I error rates within the robustness interval at sample size conditions of n:q ≥ 5:1. With nonnormal data, one finds in Table 2 (as expected) that Type I error rates associated with T_ML and T_ML–k are inflated, although the k-factor correction to T_ML did provide some improvement when data were symmetric nonnormal.

The first SB test statistic, T_SB1, provided some control over Type I error rates, demonstrating performance with asymmetric nonnormal data that was noticeably superior to T_ML. However, Type I error rates fell within the robustness range only at the largest sample size condition examined. Applying the k-factor correction to T_SB1 yielded a test statistic that was robust at all combinations of sample sizes and distributional forms investigated, with model rejection rates even within the 95% CI for most sample size ratios above n:q = 2:1. In contrast to the inflated rejection rates


Table 2
Type I Error Rates (%) for Model A1 (85 df)

condition:      1:1     2:1     5:1    10:1    df+1      p*
n:               35      70     175     350      86     120

           D
T_ML       1   44.0    17.0     9.7    6.1a    14.0    10.0
           2   48.0    24.0    14.0    10.0    19.0    16.0
           3   75.0    63.0    62.0    70.0    62.0    61.0
T_ML–k     1    1.1     2.3    4.5b    3.9a    2.9a    3.1a
           2    1.7    5.0b     7.7    7.5a    5.5b    6.0b
           3   18.0    37.0    52.0    65.0    39.0    47.0
T_SB1      1   59.0    23.0    11.0    6.4a    17.0    12.0
           2   64.0    26.0    11.0    6.7a    18.0    14.0
           3   78.0    34.0    13.0     8.2    26.0    18.0
T_SB1–k    1   2.9a    3.8a    5.1b    4.3b    4.2b    3.7a
           2   2.7a    4.8b    4.8b    4.4b    4.8b    4.6b
           3   5.1b    6.8a    6.0b    5.4b    6.7a    6.8a
T_SB2      1   10.0    3.9a    4.5b    3.7a    3.8a    2.9a
           2   5.2b     1.6     1.4     1.6     1.6     1.4
           3   2.5a     0.5     0.2     0.5     0.3     0.1
T_SB2–k    1    0.0     0.5     1.4    2.7a     0.5     1.1
           2    0.0     0.0     0.3     1.2     0.3     0.3
           3    0.0     0.0     0.0     0.2     0.1     0.0
T_B        1                   99.0    67.0   100.0   100.0
           2                   99.0    66.0   100.0   100.0
           3                   99.0    65.0   100.0   100.0
T_B1       1                    1.9    4.6b     0.0     0.0
           2                    0.9    3.2a     0.0     0.0
           3                    0.3     1.2     0.0     0.0
T_B2       1                   12.0     8.5     0.0    12.0
           2                   6.2a    5.7b     0.0    6.5a
           3                   3.9a    3.0a     0.0    6.6a
T_ADF      1                   99.0    67.0           100.0
           2                   99.0    64.0           100.0
           3                   99.0    60.0            99.0
T_ADF1     1                    0.7    4.8b             0.0
           2                    0.0    2.7a             0.1
           3                    0.0     0.7             0.2
T_ADF2     1                   7.0a     8.7             0.5
           2                    1.6    5.0b             0.2
           3                    0.2     1.6             0.6

Note. Type I error rates are computed across 2000 successful replications at the .05 level of significance. Empty cells indicate sample sizes below the test statistic’s minimum. D = distributional form; D = 1, multivariate normal; D = 2, elliptically symmetric nonnormal; D = 3, asymmetric nonnormal.
a indicates rejection rate within the robustness interval. b indicates rejection rate within the 95% CI.


associated with T_SB1, model rejection rates for the second SB test statistic, T_SB2, tended to be attenuated. Applying the k-factor correction to T_SB2 pushed model rejection rates even lower, to near zero under most of the conditions examined.

For the residual-based and ADF test statistics in Table 2, performance of the unadjusted T_B and T_ADF mirrored one another closely, with near 100% Type I error rates at all sample sizes and distributional forms investigated. The χ² adjustments T_B1 and T_ADF1 also were close in performance to one another, with near-zero rejection rates at n:q < 10:1 and showing monotonically decreasing Type I error rates with increasing departure from multivariate normality. The F-distributed adjustments, T_B2 and T_ADF2, mirrored one another in performance at the test statistics’ minimum sample sizes with Type I error rates at or near zero; at larger sample size conditions T_B2 generally maintained robustness while T_ADF2 yielded suppressed model rejection rates.

Type I error rates under the constrained LVP model (A2) are shown in Table 3. Overall test statistic performance was similar to that seen under the unconstrained model, with the caveat that performance associated with all of the test statistics examined under the constrained model deteriorated to some extent. Applying the Bartlett k-factor correction to these test statistics led to improved Type I error performance in the constrained model, although at the smallest sample size conditions T_ML–k and T_SB1–k exerted too much correction, yielding Type I error rates below the 2.5% lower bound of the robustness criterion. The behavior of T_SB2 and T_SB2–k was again similar to the unconstrained model; the correction pushed Type I error rates to zero or near-zero values (in particular with nonnormal distributional forms).

The T_B and T_ADF test statistics, and their adjusted forms, paralleled one another with respect to Type I error propensities in the constrained model. Table 3 shows 100% model rejection rates for T_B and T_ADF, with zero or near-zero Type I error rates for all of the adjusted forms of T_B and T_ADF at their minimum sample size conditions. At other sample size conditions, model rejection rates were less than 2% for the χ²-distributed T_B1 and T_ADF1; Type I error rates for the F-distributed T_B2 and T_ADF2 were within or slightly above the robustness interval.

Table 4 presents Type I error results for the unconstrained CFA model (B1). With multivariate normal data, T_ML was robust at n:q = 5:1 and 10:1; at smaller n:q ratios model rejection rates were inflated. The k-factor corrected form T_ML–k maintained Type I error rates (with normal data) to within the robustness interval at sample size conditions of n:q = 1:1 and larger. T_ML and T_ML–k became inflated with nonnormal distributional forms, with Type I error rates climbing as high as 98%.


Table 3
Type I Error Rates (%) for Model A2 (102 df)

condition:      1:1     2:1     5:1    10:1    df+1      p*
n:               18      36      90     180     103     120

           D
T_ML       1   99.0    43.0    14.0     9.2    13.0    12.0
           2  100.0    80.0    62.0    59.0    63.0    63.0
           3  100.0    98.0    99.0    99.0    99.0    99.0
T_ML–k     1    0.1     0.9    2.9a    4.5b    3.6a    3.2a
           2    1.3    13.0    37.0    47.0    41.0    43.0
           3   20.0    75.0    95.0    99.0    96.0    97.0
T_SB1      1  100.0    59.0    18.0    11.0    16.0    14.0
           2  100.0    61.0    18.0    11.0    18.0    15.0
           3  100.0    68.0    22.0    15.0    20.0    17.0
T_SB1–k    1    1.7    3.0a    3.7a    5.5b    4.7b    4.5b
           2    1.5     2.4    5.0b    5.7b    4.8b    5.6b
           3    2.3    4.4b    6.7a     8.5     8.0     7.6
T_SB2      1   72.0    7.0a    2.6a    3.6a    3.2a    2.8a
           2   53.0     1.7     0.6     1.2     0.5     1.0
           3   34.0     0.5     0.3     0.4     0.3     0.2
T_SB2–k    1    0.0     0.0     0.4     1.6     0.4     0.8
           2    0.0     0.0     0.1     0.6     0.2     0.2
           3    0.0     0.0     0.1     0.2     0.1     0.1
T_B        1                          100.0   100.0   100.0
           2                          100.0   100.0   100.0
           3                          100.0   100.0   100.0
T_B1       1                            1.7     0.0     0.0
           2                            0.7     0.0     0.0
           3                            0.5     0.0     0.0
T_B2       1                           16.0     0.0    12.0
           2                           10.0     0.0     8.6
           3                           10.0     0.0    12.0
T_ADF      1                          100.0           100.0
           2                          100.0           100.0
           3                          100.0           100.0
T_ADF1     1                            1.9             0.0
           2                            0.4             0.2
           3                            0.2             0.3
T_ADF2     1                           14.0             1.0
           2                           7.3a             0.4
           3                           4.6b             0.7

Note. Type I error rates are computed across 2000 successful replications at the .05 level of significance. Empty cells indicate sample sizes below the test statistic’s minimum. D = distributional form; D = 1, multivariate normal; D = 2, elliptically symmetric nonnormal; D = 3, asymmetric nonnormal.
a indicates rejection rate within the robustness interval. b indicates rejection rate within the 95% CI.


Table 4
Type I Error Rates (%) for Model B1 (102 df)

condition:      1:1     2:1     5:1    10:1    df+1      p*
n:              129     258     645    1290     103     231

           D
T_ML       1   12.0     9.3    6.3a    6.2a    14.0     7.8
           2   24.0    17.0    17.0    15.0    22.0    19.0
           3   86.0    92.0    96.0    98.0    84.0    91.0
T_ML–k     1   2.9a    4.2b    4.7b    5.4b     1.6    3.4a
           2   7.0a     9.6    14.0    14.0    5.4b    10.0
           3   72.0    87.0    95.0    98.0    66.0    86.0
T_SB1      1   15.0    10.0    6.8a    6.3a    18.0     9.2
           2   19.0    10.0    7.0a    5.6b    20.0    11.0
           3   31.0    17.0     9.3    7.3a    41.0    19.0
T_SB1–k    1   3.7a    4.8b    4.8b    5.6b     2.4    3.9a
           2   5.1b    4.2b    5.1b    4.4b    3.6a    5.4b
           3    9.2     8.7    7.3a    6.0b     8.4     9.1
T_SB2      1   5.3b    5.4b    4.8b    5.6b    4.1b    4.4b
           2    2.4     2.1    2.9a    2.8a     2.2     2.0
           3    0.4     0.3     0.6     0.9     0.7     0.3
T_SB2–k    1    0.7     2.1    3.8a    4.7b     0.3     1.6
           2    0.1     1.1     1.8    2.5a     0.2     0.7
           3    0.1     0.1     0.3     0.7     0.1     0.1
T_B        1  100.0    95.0    40.0    18.0   100.0    98.0
           2  100.0    96.0    39.0    15.0   100.0    99.0
           3  100.0    98.0    39.0    15.0   100.0    99.0
T_B1       1    0.0     1.8    4.4b    4.6b     0.0     0.9
           2    0.0     0.8    2.6a    3.4a     0.0     0.5
           3    0.0     0.3     2.0     2.0     0.0     0.2
T_B2       1   7.0a    6.4a    5.8b    5.5b     0.0    6.1a
           2   4.1b    4.2b    4.1b    3.7a     0.0    4.1b
           3   6.5a    2.9a    3.3a    3.1a     0.0    2.6a
T_ADF      1           94.0    39.0    18.0            96.0
           2           94.0    37.0    15.0            93.0
           3           96.0    34.0    13.0            90.0
T_ADF1     1            0.9    4.3b    4.5b             0.5
           2            0.1     2.5    3.4a             0.2
           3            0.0     1.2     1.7            6.5a
T_ADF2     1           3.4a    5.8b    5.5b             2.1
           2            0.8    3.6a    3.5a             1.7
           3            0.3     2.2     2.3            13.0

Note. Type I error rates are computed across 2000 successful replications at the .05 level of significance. Empty cells indicate sample sizes below the test statistic’s minimum. D = distributional form; D = 1, multivariate normal; D = 2, elliptically symmetric nonnormal; D = 3, asymmetric nonnormal.
a indicates rejection rate within the robustness interval. b indicates rejection rate within the 95% CI.


T_SB1 showed robustness with most distributional forms at n:q = 5:1 and 10:1 (with the exception of n:q = 5:1 and asymmetric nonnormal data), but yielded inflated Type I error rates at smaller n:q ratios. The k-factor corrected form, T_SB1–k, maintained Type I error rates to within the robustness interval at sample size conditions of n:q = 1:1 and larger, although error rates exceeded the upper bound of the robustness interval with asymmetric nonnormal data and n:q < 5:1. The T_SB2 test statistic showed robustness with normal data, but then yielded attenuated model rejection rates with nonnormal distributional forms; the k-factor adjusted T_SB2–k over-corrected, delivering Type I error rates near zero.

Model rejection rates for T_B and T_ADF, T_B1 and T_ADF1, and T_B2 and T_ADF2 again mostly mirrored one another, with some local distinctions. The unadjusted test statistics yielded inflated Type I error rates that monotonically decreased with increasing sample size; note that even at the n:q = 10:1 ratio model rejection percentages associated with T_B and T_ADF were at 18%. The adjusted forms of T_B and T_ADF performed optimally under several of the sample size conditions investigated but suppressed model rejections with increasing nonnormality and decreasing sample size. The residual-based F-distributed statistic, T_B2, maintained robustness under all conditions for this model, with the exception of the n = df + 1 sample size minimum for this test statistic, in which model rejection rates were at zero.

Results for the constrained CFA model (B2) are presented in Table 5. T_ML produced inflated Type I error rates at every sample size and distributional form combination examined (even at the largest sample size condition and multivariate normal data). T_SB1 yielded inflated Type I error rates that decreased with increasing sample size, approaching the 7.5% upper bound of the robustness criterion at the largest sample size conditions. Applying the k-factor correction to these test statistics improved Type I error control for T_ML–k with normal data and for T_SB1–k with all distributional forms, albeit evidencing some degree of over-correction at the smallest sample size conditions. The T_SB2 statistic controlled Type I error rates with normal data at all but the smallest sample size condition; error rates with nonnormal distributional forms were mostly attenuated. Applying the Bartlett k-factor correction, T_SB2–k yielded near-zero model rejection rates for all conditions investigated.

Results for T_B and T_ADF in Table 5 are similar to the patterns seen under the constrained LVP model. The unadjusted forms of the test statistics yielded 100% model rejection rates; the χ²-adjusted T_B1 and T_ADF1 were robust only at the largest sample size condition and with normal data; the F-distributed statistics T_B2 and T_ADF2 yielded inflated Type I error rates ranging from 10-100% across the conditions examined here.


Table 5
Type I Error Rates (%) for Model B2 (196 df)

condition:      1:1     2:1     5:1    10:1    df+1      p*
n:               35      70     175     350     197     231

           D
T_ML       1   92.0    38.0    12.0     8.1    12.0     9.4
           2   99.0    94.0    88.0    88.0    88.0    89.0
           3  100.0   100.0   100.0   100.0   100.0   100.0
T_ML–k     1    0.2     0.6     2.5    3.8a    3.2a    2.9a
           2    8.9    45.0    74.0    82.0    76.0    78.0
           3   91.0   100.0   100.0   100.0   100.0   100.0
T_SB1      1   97.0    50.0    15.0     9.5    14.0    11.0
           2   96.0    46.0    15.0     9.8    14.0    12.0
           3   94.0    43.0    15.0     9.9    14.0    13.0
T_SB1–k    1    0.9     1.6    3.3a    4.5b    3.9a    3.7a
           2    0.8     1.8    3.5a    5.1b    4.1b    5.0b
           3    1.5     2.3    4.6b    5.6b    5.6b    5.5b
T_SB2      1   29.0    4.3b    2.6a    3.1a    3.1a     2.2
           2    8.0     0.5     0.4     0.8     0.2     0.3
           3    2.2     0.1     0.1     0.0     0.1     0.1
T_SB2–k    1    0.0     0.0     0.3     1.2     0.4     0.4
           2    0.0     0.0     0.0     0.4     0.0     0.1
           3    0.0     0.0     0.0     0.0     0.0     0.0
T_B        1                          100.0   100.0   100.0
           2                          100.0   100.0   100.0
           3                          100.0   100.0   100.0
T_B1       1                           2.7a     0.0     0.0
           2                           13.0     0.0     0.0
           3                          100.0     0.0     0.0
T_B2       1                           18.0     0.0    15.0
           2                           45.0     0.0    34.0
           3                          100.0     0.0   100.0
T_ADF      1                          100.0           100.0
           2                          100.0           100.0
           3                          100.0           100.0
T_ADF1     1                           3.5a             0.0
           2                           15.0             0.0
           3                          100.0             0.1
T_ADF2     1                           19.0             9.6
           2                           46.0            18.0
           3                          100.0           100.0

Note. Type I error rates are computed across 2000 successful replications at the .05 level of significance. Empty cells indicate sample sizes below the test statistic’s minimum. D = distributional form; D = 1, multivariate normal; D = 2, elliptically symmetric nonnormal; D = 3, asymmetric nonnormal.
a indicates rejection rate within the robustness interval. b indicates rejection rate within the 95% CI.


Empirical Power

Results for follow-up power analyses are given in Tables 6-9, with each table paralleling one of the four models investigated in the Type I error analysis. As noted earlier, empirical power estimates presented in these tables are not considered absolute, but rather are relative estimates for the purpose of comparing performance of the test statistics investigated. Power was not analyzed for the residual-based and ADF test statistics at their respective sample size minima. The high frequency of non-convergence, improper solutions, and test statistic failure rates — and the poor Type I error performance associated with these test statistics at their sample size lower bounds — logically precluded them from further investigation. A general pattern is evidenced across all of the power tables, whereby power systematically increased with increasing sample size (as would be expected) and uniformly decreased with increasing departure from multivariate normality.

Table 6 presents empirical power estimates for the misspecified unconstrained LVP model. Note that for n = 350 the power estimate associated with every test statistic under all distributional forms was 100%, and as such is not useful for discriminating test statistic performance. At n < 350, however, test statistic performance with respect to power diverged. The Bartlett k-factor correction to T_ML, T_ML–k, yielded empirical power at 81-100% with multivariate normal data. The SB test statistics and their k-factor-corrected forms generally yielded power at or above 80%. Of note is the power associated with T_SB1–k, which delivered model rejection rates near 90% at n = 35. In contrast, T_SB2 and T_SB2–k delivered low power at the smallest sample size conditions — power fell to about 5% for T_SB2–k at the n = 35 sample size. Empirical power estimates for the residual-based and ADF test statistics in Table 6 suggest some noteworthy features. The residual-based test statistics generally exhibited greater power than the ADF test statistics; the F-distributed forms of T_B and T_ADF (T_B2 and T_ADF2) delivered greater power as compared to the χ²-corrected forms of these test statistics (T_B1 and T_ADF1).

Under the misspecified constrained LVP model, Table 7 reports empirical power estimates that are noticeably lower than those found in Table 6. For the T_ML–k test statistic empirical power was reported at 20% at the smallest sample size condition with multivariate normal data; at all larger sample sizes power was at or near 100%. At n = 18 (with all distributional forms) empirical power associated with T_SB1–k dropped off, and went to near-zero for T_SB2–k. Notice the 0% model rejection rate with all distributional forms for T_B1 at n = 120. Again, the F-distributed correction to ADF yielded greater power than the χ²-corrected form (e.g., 14% versus 80% with asymmetric nonnormal data).


Table 6
Empirical Power (%) for Model A1 (87 df)

                                     n
           D     35      70     175     350      86     120
T_ML       1                         100.0a
           2
           3
T_ML–k     1   81.0   100.0  100.0a  100.0a  100.0a  100.0a
           2   84.0  100.0a          100.0a  100.0a  100.0a
           3
T_SB1      1                         100.0a
           2                         100.0a
           3
T_SB1–k    1  87.0a  100.0a  100.0a  100.0a  100.0a  100.0a
           2  88.0a  100.0a  100.0a  100.0a  100.0a  100.0a
           3  89.0a  100.0a  100.0a  100.0a  100.0a  100.0a
T_SB2      1         100.0a  100.0a  100.0a  100.0a  100.0a
           2  77.0a   100.0   100.0   100.0   100.0   100.0
           3  50.0a    80.0   100.0   100.0    90.0    97.0
T_SB2–k    1   23.0   100.0   100.0  100.0a   100.0   100.0
           2   14.0    95.0   100.0   100.0   100.0   100.0
           3    4.6    56.0    99.0   100.0    78.0    94.0
T_B        1
           2
           3
T_B1       1           98.0  100.0a              0.0
           2           74.0  100.0a              0.0
           3           53.0   100.0              0.2
T_B2       1
           2          98.0a  100.0a            54.0a
           3          91.0a  100.0a            53.0a
T_ADF      1
           2
           3
T_ADF1     1           93.0  100.0a
           2           30.0  100.0a
           3            8.9   100.0
T_ADF2     1         100.0a
           2           79.0  100.0a
           3           49.0   100.0

Note. All power estimates are computed across 1000 successful replications at the .05 level of significance. Empty cells indicate conditions eliminated from the power analysis (Type I error rate above 7.5%) or sample sizes at which the statistic was not analyzed. D = distributional form; D = 1, multivariate normal; D = 2, elliptically symmetric nonnormal; D = 3, asymmetric nonnormal.
a indicates corresponding Type I error rate within the robustness interval.


Table 7
Empirical Power (%) for Model A2 (110 df)

                                     n
           D     18      36      90     180     103     120
T_ML       1
           2
           3
T_ML–k     1   20.0    92.0  100.0a  100.0a  100.0a  100.0a
           2   47.0
           3
T_SB1      1
           2
           3
T_SB1–k    1   50.0   96.0a  100.0a  100.0a  100.0a  100.0a
           2   29.0    79.0  100.0a  100.0a  100.0a  100.0a
           3   22.0   49.0a   93.0a
T_SB2      1          92.0a  100.0a  100.0a  100.0a  100.0a
           2           39.0    92.0    99.0    96.0    98.0
           3            5.5    36.0    79.0    46.0    53.0
T_SB2–k    1    0.1    32.0   100.0   100.0   100.0   100.0
           2    0.0     2.3    89.0    99.0    93.0    96.0
           3    0.1     0.1    21.0    76.0    34.0    44.0
T_B        1
           2
           3
T_B1       1                           96.0             0.0
           2                           63.0             0.0
           3                           14.0             0.0
T_B2       1
           2
           3
T_ADF      1
           2
           3
T_ADF1     1                           97.0
           2                           65.0
           3                           14.0
T_ADF2     1
           2                          98.0a
           3                          80.0a

Note. All power estimates are computed across 1000 successful replications at the .05 level of significance. Empty cells indicate conditions eliminated from the power analysis (Type I error rate above 7.5%) or sample sizes at which the statistic was not analyzed. D = distributional form; D = 1, multivariate normal; D = 2, elliptically symmetric nonnormal; D = 3, asymmetric nonnormal.
a indicates corresponding Type I error rate within the robustness interval.


Results in Table 8 show a predominance of 100% rejection rates for most of the test statistics analyzed under the misspecified unconstrained CFA model. Most noteworthy is the relative performance of the adjusted residual-based (T_B1 and T_B2) and ADF (T_ADF1 and T_ADF2) test statistics at the small and moderate sample sizes. The residual-based test statistics were consistently more powerful than their ADF analogs; the F-distributed statistics delivered greater power than the χ²-corrected statistics. Note that for the T_B2 test statistic at the n = 129 sample size, its corresponding Type I error rates were held to within the robustness criterion (Table 4); however, the observed power estimate associated with T_B2 at this sample size was low, at 16-26%.

Empirical power estimates for the misspecified constrained CFA model are given in Table 9. Power was high for T_ML–k at all sample sizes, dipping below 100% only at the smallest sample size of n = 35. The T_SB1–k test statistic also showed strong power at 90-100% over most of the study conditions examined; power did however become degraded at the smallest sample size conditions with nonnormal distributions. The T_SB2 and T_SB2–k test statistics generally yielded high empirical power with normal and elliptically symmetric distributional forms, with power becoming attenuated with asymmetric nonnormal data and across all distributional forms at the n = 35 sample size. Little can be said regarding the performance of the residual-based and ADF test statistics here, as these statistics were only analyzed under a few conditions. Again, notice that the χ² test statistic T_B1 yielded 0% power with all distributional forms at the n = 231 sample size, a sample size close to its minimum lower bound of n = 218.

Discussion

Model estimation failure rates, Type I error propensity, and power in the present investigation closely follow previous findings while providing new insight into the behavior of SEM test statistics at reduced sample sizes and with nonnormal distributional forms. Beginning with rates of non-convergence and improper solutions, results in the present investigation lead to a caution that modeling data with sample size ratios at n:q ≤ 1:1 is generally unwise. While non-convergence was not problematic, rates of improper solutions at this smallest sample size ratio were often high, especially with nonnormal data. Such results are consistent with findings reported by Boomsma (1982, 1985), and suggest that it is unlikely that practitioners will be able to obtain usable model estimates at or below a sample size-to-parameter ratio of 1:1.

Considering model rejection behavior, TML was not robust to departures from multivariate normality in the present investigation, yielding inflated Type


J. Nevitt and G. Hancock

MULTIVARIATE BEHAVIORAL RESEARCH 465

Table 8
Empirical Power (%) for Model B1 (117 df)

                       n
D        129     258     645     1290    103     231

TML–k
1        100.0a  100.0a  100.0a  100.0a  100.0   100.0a

TSB1–k
1        100.0a  100.0a  100.0a  100.0a  100.0   100.0a
2        100.0a  100.0a  100.0a  100.0a  100.0a  100.0a

TSB2
1        100.0a  100.0a  100.0a  100.0a  100.0a  100.0a
2        100.0   100.0   100.0a  100.0a  100.0   100.0
3        94.0    99.0    100.0   100.0   90.0    99.0

TSB2–k
1        100.0   100.0   100.0a  100.0a  100.0   100.0
2        100.0   100.0   100.0   100.0a  99.0    100.0
3        88.0    99.0    100.0   100.0   73.0    99.0

TB1
1        0.0     100.0   100.0a  100.0a          100.0
2        0.0     97.0    100.0a  100.0a          87.0
3        0.0     94.0    100.0   100.0           80.0

TB2
1        26.0a   100.0a  100.0a  100.0a          100.0a
2        16.0a   100.0a  100.0a  100.0a          99.0a
3        26.0a   99.0a   100.0a  100.0a          97.0a

[Rows and cells not shown could not be recovered unambiguously from this copy; blank cells correspond to conditions that were not analyzed.]

Note. All power estimates are computed across 1000 successful replications at the .05 level of significance. D = distributional form; D = 1, multivariate normal; D = 2, elliptically symmetric nonnormal; D = 3, asymmetric nonnormal. a indicates corresponding Type I error rate within the robustness interval.
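The "a" superscripts mark conditions in which a statistic's empirical Type I error rate fell inside the robustness interval. A minimal sketch of such a screen, assuming Bradley's (1978) liberal criterion of 0.5α to 1.5α; the interval actually used in the study is defined in its Method section, so the bounds and helper names below are illustrative.

```python
# Hypothetical helper illustrating the robustness screen behind the "a"
# superscripts, assuming Bradley's (1978) liberal criterion: an empirical
# Type I error rate counts as robust when it falls in [0.5*alpha, 1.5*alpha].
def type1_error_rate(rejections: int, replications: int) -> float:
    """Empirical Type I error rate across successful replications."""
    return rejections / replications

def is_robust(rate: float, alpha: float = 0.05) -> bool:
    """True if the observed rejection rate sits inside the liberal interval."""
    return 0.5 * alpha <= rate <= 1.5 * alpha

# 41 rejections of a true model in 1000 replications at alpha = .05:
print(is_robust(type1_error_rate(41, 1000)))   # True  (.041 lies in [.025, .075])
print(is_robust(type1_error_rate(120, 1000)))  # False (.120 is inflated)
```

At the nominal .05 level this criterion accepts rates between .025 and .075, which is why, for example, markedly attenuated near-zero rejection rates fail the screen just as inflated ones do.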


Table 9
Empirical Power (%) for Model B2 (217 df)

                       n
D        35      70      175     350     197     231

TML–k
1        80.0    100.0   100.0   100.0a  100.0a  100.0a

TSB1–k
1        89.0    100.0   100.0a  100.0a  100.0a  100.0a
2        62.0    99.0    100.0a  100.0a  100.0a  100.0a
3        25.0    69.0    99.0a   100.0a  99.0a   100.0a

TSB2
3        3.5     6.6     60.0    94.0    70.0    80.0

TSB2–k
1        2.3     100.0   100.0   100.0   100.0   100.0
2        0.1     39.0    98.0    100.0   100.0   100.0
3        0.0     1.0     50.0    94.0    63.0    76.0

TB1
1                                100.0a          0.0
2                                                0.0
3                                                0.0

TADF1
1                                                100.0a

[Rows and cells not shown could not be recovered unambiguously from this copy; blank cells correspond to conditions that were not analyzed.]

Note. All power estimates are computed across 1000 successful replications at the .05 level of significance. D = distributional form; D = 1, multivariate normal; D = 2, elliptically symmetric nonnormal; D = 3, asymmetric nonnormal. a indicates corresponding Type I error rate within the robustness interval.


I error rates even at the largest sample sizes or n:q ratios examined. With normal data, TML also was inflated under most of the sample size conditions examined, most notably at n:q = 5:1 and 10:1, ratios that meet or exceed the 5:1 guideline for using ML estimation that has become pervasive in the SEM community. Under some of the models examined in the present investigation, even the upper limit of n:q = 10:1 was not sufficiently large for TML to perform correctly, casting doubt on the current guidelines offered to the SEM community.

Findings associated with the SB test statistics were mixed, with TSB1 delivering inflated Type I error rates at most n:q < 10:1. This pattern is consistent with findings reported by Curran et al. (1996), Fouladi (1998, 1999), and Bentler and Yuan (1999). Unlike TSB1, Type I error rates corresponding to the TSB2 test statistic were mostly attenuated across nonnormal data conditions in our investigation (with associated empirical power often below practical standards). Based on our results we are reluctant to recommend the TSB2 test statistic for use in applied modeling situations.

Type I error propensity and power associated with the Bartlett k-factor-corrected forms of TML, TSB1, and TSB2 were given considerable attention in this investigation. Applying the k-factor correction to TSB2 in this study was not effective, with TSB2–k over-correcting and thus delivering low model rejection rates in both Type I error and power analyses. Overall results here for TML–k and TSB1–k are encouraging for small sample research and generally parallel those reported by Fouladi (1998, 1999). At small sample sizes TML–k and TSB1–k maintained Type I error rates with normal data; TSB1–k also generally performed well with nonnormal data. With multivariate normal data TML–k was robust and relatively powerful at n:q ≥ 5:1 for all models, and frequently delivered acceptable Type I error rates at n:q ≥ 2:1. The performance of TSB1–k in the present investigation is even more remarkable, with the combination of SB scaling and the k-factor correction yielding the most stable test statistic investigated here. Type I error rates associated with TSB1–k were close to the 5% level under nearly all conditions investigated, and empirical power for the misspecified models was generally superior to that of the other test statistics.
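The idea behind the k-factor is to multiply a chi-square statistic by a factor slightly below 1 that counteracts its small sample inflation. A sketch in Python, assuming the classical exploratory-factor-analysis form of Bartlett's (1950) multiplier for p observed variables and m factors; the exact k applied to TML and the SB statistics in this study is defined in its Method section, so the formula and helper names below are illustrative rather than a restatement of the authors' correction.

```python
# Bartlett-style k-factor correction (Bartlett, 1950), assuming the classical
# exploratory-factor-analysis form of the multiplier for n cases, p observed
# variables, and m factors. Illustrative, not the authors' exact formula.
def bartlett_k(n: int, p: int, m: int) -> float:
    """Multiplier shrinking an ML chi-square statistic at small n."""
    return (n - 1 - (2 * p + 5) / 6.0 - 2.0 * m / 3.0) / (n - 1)

def corrected_statistic(t_ml: float, n: int, p: int, m: int) -> float:
    """TML-k analogue: k approaches 1 as n grows, so the correction vanishes."""
    return bartlett_k(n, p, m) * t_ml

# With p = 21 indicators and m = 7 factors (as in Model B), the shrinkage is
# substantial at n = 35 but negligible at n = 1290.
print(round(bartlett_k(35, 21, 7), 3))    # 0.632
print(round(bartlett_k(1290, 21, 7), 3))
```

Because the multiplier converges to 1 as n grows, the corrected and uncorrected statistics agree asymptotically, which is consistent with the correction mattering only under the small sample conditions studied here.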

Previous research that offered an initial evaluation of the residual-based and ADF test statistics (under limited scope) provided us with incentive to continue their examination in this investigation under a wider array of conditions. Overall, findings in this study for these test statistics and their corrected forms match up with results reported by Bentler and Yuan (1999) and by Yuan and Bentler (1998, 1999). The inflated Type I error rates observed here for TB and TADF again demonstrate that these test statistics are not useful for assessing data-model fit in SEM at realistic sample sizes. The


χ2 and F-distributed forms of the residual-based and ADF statistics showed some promise in this investigation, maintaining Type I error rates and yielding strong power to reject misspecified models. Performance of these corrected test statistics was erratic, however, controlling Type I error rates under some conditions and failing under others. Moreover, TB1, TADF1, and TADF2 behaved unpredictably when they failed, becoming inflated under some study conditions and attenuated under others (with associated poor power performance).
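The F-distributed forms rescale a quadratic-form statistic so that it can be referred to an F rather than a chi-square distribution, trading some large sample power for better small sample calibration. A hedged sketch, by analogy with the Hotelling T-squared to F conversion; the exact definitions of TB2 and TADF2 are given by Yuan and Bentler (1999), so the function name and this form of the multiplier are illustrative, not a restatement of their derivation.

```python
# Hedged sketch: converting a quadratic-form statistic T (df degrees of
# freedom, computed from n cases) to an F-referenced form, by analogy with
# the Hotelling T-squared / F relationship. Illustrative only; see Yuan and
# Bentler (1999) for the exact definitions of TB2 and TADF2.
def f_form_statistic(t: float, df: int, n: int) -> float:
    """F = (n - df) * T / ((n - 1) * df), referred to F(df, n - df)."""
    return (n - df) * t / ((n - 1) * df)

# Example: a residual-based statistic of 150 with 117 df at n = 258.
print(round(f_form_statistic(150.0, 117, 258), 3))  # 0.703
```

Note that the denominator degrees of freedom, n - df, go to zero as n approaches df, which illustrates why these corrected statistics become unstable near the n = df + 1 lower bound discussed below.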

Our design provided the opportunity to compare the analogous forms of the residual-based and ADF test statistics against one another under a common set of experimental conditions. The unadjusted test statistics TB and TADF mirrored one another closely, as did the χ2-distributed forms TB1 and TADF1. The F-distributed statistics TB2 and TADF2 performed differentially, paralleling one another under some study conditions but diverging under others. Under the unconstrained models with nonnormal data (at sample sizes above their respective minima) TB2 showed better performance than TADF2; TB2 controlled Type I error rates and delivered strong power, while TADF2 yielded attenuated model rejection rates in terms of both Type I errors and power. Lastly, we note that none of the residual-based and ADF test statistics behaved properly at their respective sample size lower bounds. ADF estimation was especially problematic with respect to non-convergence and improper solutions at its lower bound of n = p*; it is doubtful that practitioners using ADF would be successful in obtaining usable solutions for their models at such a sample size. Further, even when models estimated using ADF do converge and yield proper solutions at n = p*, assessing global data-model fit with a test statistic is still tenuous at best. The uncorrected TADF yielded near 100% Type I error rates under every study condition at n = p*, while the χ2-distributed TADF1 unilaterally yielded near-zero Type I error rates, and the F-distributed statistic TADF2 was erratic, delivering either inflated or attenuated Type I error rates. A similar scenario exists for the residual-based test statistics at their lower bound sample size of n = df + 1. Under every condition examined in the present investigation with n = df + 1, the unadjusted TB test statistic yielded 100% Type I error rates, while both of the adjusted forms TB1 and TB2 yielded no Type I errors.
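The two lower bounds discussed above can be made concrete. A minimal sketch, assuming p observed variables and a model with df degrees of freedom: ADF estimation requires n ≥ p*, where p* = p(p + 1)/2 is the number of nonredundant elements in the covariance matrix, and the residual-based statistics require n ≥ df + 1; the helper names are ours, for illustration only.

```python
# Minimal sketch of the sample size lower bounds discussed above, assuming
# p observed variables and a model with df degrees of freedom. Helper names
# are illustrative, not from the article.
def p_star(p: int) -> int:
    """Number of unique variances and covariances among p variables."""
    return p * (p + 1) // 2

def lower_bounds(p: int, df: int) -> dict:
    """Minimum n for ADF estimation and for the residual-based statistics."""
    return {"adf": p_star(p), "residual_based": df + 1}

# Model B has p = 21 indicators, so p* = 231; the misspecified constrained
# version has 217 df, putting the residual-based bound at n = 218.
print(lower_bounds(21, 217))  # {'adf': 231, 'residual_based': 218}
```

Because p* grows quadratically in p while df + 1 grows only with model size, the ADF bound dominates quickly as models add indicators, which is one reason ML remains usable at far smaller samples.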

Concluding Remarks

Based on our findings it appears that practitioners with small to moderate sample sizes can effectively model their data and accurately assess global data-model fit using SEM techniques. Even with severely nonnormal data some of the test statistics investigated here yielded strong performance with


n < 100; some fit statistics even operated well with n ≤ 50 subjects. Such results bode well for SEM use with smaller sample sizes, but are conditional upon sample sizes that are sufficiently large to allow models to converge and yield proper solutions, and suggest that both model convergence tendencies and test statistic behavior must be considered jointly in formulating recommendations to practitioners. New findings here support the use of the Bartlett k-factor correction for controlling test statistic performance with small sample sizes, particularly when combined with the SB scaled form of TML (TSB1–k) under nonnormal data conditions.

The expected failure of TML in the present investigation at small sample sizes (and with nonnormal data) should again serve to alert practitioners that merely applying the default estimation method or test statistic in their SEM software package, without attending to issues such as sample size adequacy or distributional violation, is unwise. Rather, practitioners should make informed choices regarding the proper use of SEM techniques, and to that end, recommendations for successful covariance structure modeling with small samples are as follows. ML is our recommended estimation method; ML is not bound by the n = p* minimum that is implicit in ADF estimation, allowing for modeling at smaller sample sizes. To protect against potential problems with model non-convergence or improper solutions using ML, we recommend that practitioners obtain sample sizes large enough such that n:q ≥ 2:1. Evidence here also leads to an endorsement of the TSB1–k test statistic for assessing overall data-model fit with small samples.

One of the criticisms of methodological studies in SEM has been that only

a narrow scope of models has been examined. This investigation sought to address this limitation by examining diverse models and model specifications. Results reported here for the CFA models suggest that models with error covariances requiring estimation do not necessarily disrupt the performance of model test statistics in SEM. It is arguable that the CFA models in this investigation required estimation of more error covariance parameters than are typically found in applied settings. However, applications involving longitudinal and multidomain models tend to incorporate substantial numbers of correlated errors when relating residuals of parallel indicator variables. Numerous residual covariances also occur when measurement models have been respecified (see, e.g., Newcomb & Bentler, 1988, which had 77 residual covariances). Certainly, findings here represent an extreme scenario: if our models, in which we intentionally inflated the number of estimated error covariance terms, performed well, then one may surmise that models requiring only a few error covariance parameters to be estimated ought not to degrade test statistic performance. Moreover, our findings for the unconstrained CFA model might also suggest that models requiring estimation of large numbers of parameters in general might not necessarily disturb the performance of SEM model test statistics.

This study also evaluated test statistic performance for models that impose equality constraints upon parameter estimates. Our findings indicate that such constraints can negatively impact test statistic performance, an unexplained phenomenon also reported by Yuan and Bentler (1998). Certainly additional research is required to understand why test statistic performance becomes degraded under such conditions, but the implication is that test statistics can be sensitive to particular model specifications.

While ground has been gained in understanding the assessment of overall model fit at reduced sample sizes, other issues remain largely uninvestigated. First, although the k-factor correction has been endorsed here as a viable approach for protecting test statistics at small sample sizes, continued methodological research is warranted to profile more fully the behavior of k-factor-corrected model test statistics in SEM before unconditionally recommending their use in applied settings. A second issue is the impact of inferential test statistics (such as TML) on incremental and absolute model fit indices at small sample sizes and with nonnormal data. A natural extension of modifying test statistics is to incorporate them into other indices of model fit. In fact, some popular SEM software now offers robust fit indices; EQS 5.7b (Bentler, 1996) provides a "robust" comparative fit index (CFI) that is computed using TSB1 rather than TML. While some promising research has been conducted examining the behavior of such modified fit indices (Anderson, 1996; Nevitt & Hancock, 2000), there is clearly a need for further examination of modified fit indices that might be corrected for small samples as well as nonnormal distributional forms, incorporating, for example, the TSB1–k test statistic.

A third issue warranting further investigation is the quality of individual estimated model parameters at small sample sizes. While global data-model fit may be adequately assessed at small sample sizes, it remains to be seen how well individual model parameters can be estimated under such reduced sample size conditions. Moreover, accurately assessing the statistical significance of individual model parameters could be tenuous with small samples; parameter standard errors might not be stable under such conditions, leading to distorted z-statistics and incorrect inference. Still, through continued research such as the present investigation, SEM techniques continue to evolve, breaking sample size and distributional barriers and thus becoming more accessible and useful to practitioners.


References

Anderson, R. D. (1996). An evaluation of the Satorra-Bentler distributional misspecification correction applied to the McDonald fit index. Structural Equation Modeling: A Multidisciplinary Journal, 3, 203-227.

Aptech Systems. (1996). GAUSS system and graphics manual. Maple Valley, WA: Author.

Bartlett, M. S. (1950). Tests of significance in factor analysis. British Journal of Psychology, Statistical Section, 3, 77-85.

Bentler, P. M. (1996). EQS structural equations program manual. Encino, CA: Multivariate Software, Inc.

Bentler, P. M. & Chou, C.-P. (1987). Practical issues in structural modeling. Sociological Methods & Research, 16, 78-117.

Bentler, P. M. & Yuan, K.-H. (1999). Structural equation modeling with small samples: Test statistics. Multivariate Behavioral Research, 34, 181-197.

Boomsma, A. (1982). The robustness of LISREL against small sample sizes in factor analysis models. In K. G. Jöreskog & H. Wold (Eds.), Systems under indirect observation: Causality, structure, prediction (Part 1) (pp. 149-173). Amsterdam: North-Holland.

Boomsma, A. (1985). Nonconvergence, improper solutions, and starting values in LISREL maximum likelihood estimation. Psychometrika, 50, 229-242.

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144-152.

Browne, M. W. (1982). Covariance structures. In D. M. Hawkins (Ed.), Topics in applied multivariate analysis (pp. 72-141). Cambridge, England: Cambridge University Press.

Browne, M. W. (1984). Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37, 62-83.

Browne, M. W. & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA: Sage.

Chou, C.-P., Bentler, P. M., & Satorra, A. (1991). Scaled test statistics and robust standard errors for non-normal data in covariance structure analysis: A Monte Carlo study. British Journal of Mathematical and Statistical Psychology, 44, 347-357.

Curran, P. J., West, S. G., & Finch, J. F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1, 16-29.

Fouladi, R. T. (1998, April). Covariance structure analysis techniques under conditions of multivariate normality and nonnormality — modified and bootstrap based test statistics. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Fouladi, R. T. (1999, April). Model fit in covariance structure analysis under small sample conditions — modified maximum likelihood and asymptotically distribution free generalized least squares procedures. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.

Fouladi, R. T. (2000). Performance of modified test statistics in covariance and correlation structure analysis under conditions of multivariate nonnormality. Structural Equation Modeling: A Multidisciplinary Journal, 7, 356-410.

Gerbing, D. W. & Anderson, J. C. (1985). The effects of sampling error and model characteristics on parameter estimation for maximum likelihood confirmatory factor analysis. Multivariate Behavioral Research, 20, 255-271.


Gill, P. E., Murray, W., & Wright, M. H. (1981). Practical optimization. London, England: Academic Press.

Hayduk, L. A. (1987). Structural equation modeling with LISREL: Essentials and advances. Baltimore, MD: The Johns Hopkins University Press.

Hu, L.-T. & Bentler, P. M. (1993). Fit indexes in covariance structural equation modeling. Unpublished manuscript.

Hu, L.-T. & Bentler, P. M. (1995). Evaluating model fit. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues and applications (pp. 76-99). Thousand Oaks, CA: Sage.

Hu, L.-T., Bentler, P. M., & Kano, Y. (1992). Can test statistics in covariance structure analysis be trusted? Psychological Bulletin, 112, 351-362.

Jöreskog, K. G. & Sörbom, D. (1996). LISREL 8: User’s reference guide. Chicago: Scientific Software International.

MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.

Magnus, J. R. & Neudecker, H. (1986). Symmetry, 0-1 matrices and Jacobians. Econometric Theory, 2, 157-190.

Muthén, B. & Kaplan, D. (1992). A comparison of some methodologies for the factor analysis of non-normal Likert variables: A note on the size of the model. British Journal of Mathematical and Statistical Psychology, 45, 19-30.

Muthén, L. K. & Muthén, B. O. (1998). Mplus: The comprehensive modeling program for applied researchers. Los Angeles: Muthén & Muthén.

Nevitt, J. & Hancock, G. R. (1999). PWRCOEFF & NNORMULT: A set of programs for simulating multivariate nonnormal data. Applied Psychological Measurement, 23, 54.

Nevitt, J. & Hancock, G. R. (2000). Improving the root mean square error of approximation for nonnormal conditions in structural equation modeling. Journal of Experimental Education, 68, 251-268.

Nevitt, J. & Hancock, G. R. (2001). Performance of bootstrapping approaches to model test statistics and parameter standard error estimation in structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 8, 353-377.

Newcomb, M. D. & Bentler, P. M. (1988). Impact of adolescent drug use and social support on problems of young adults: A longitudinal study. Journal of Abnormal Psychology, 97, 64-75.

Robey, R. R. & Barcikowski, R. S. (1992). Type I error and the number of iterations in Monte Carlo studies of robustness. British Journal of Mathematical and Statistical Psychology, 45, 283-288.

Satorra, A. & Bentler, P. M. (1988). Scaling corrections for chi-square statistics in covariance structure analysis. American Statistical Association 1988 proceedings of the Business and Economics Sections (pp. 308-313). Alexandria, VA: American Statistical Association.

Satorra, A. & Bentler, P. M. (1994). Corrections to test statistics and standard errors in covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent variables analysis: Applications for developmental research (pp. 399-419). Thousand Oaks, CA: Sage.

Stevens, J. (1996). Applied multivariate statistics for the social sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Tanaka, J. S. (1984). Some results on the estimation of covariance structure models. Dissertation Abstracts International, 45, 924B.


Tanaka, J. S. (1987). “How big is enough?”: Sample size and goodness of fit in structural equation models with latent variables. Child Development, 58, 134-146.

Vale, C. D. & Maurelli, V. A. (1983). Simulating multivariate nonnormal distributions. Psychometrika, 48, 465-471.

Yuan, K.-H. & Bentler, P. M. (1997). Mean and covariance structure analysis: Theoretical and practical improvements. Journal of the American Statistical Association, 92, 767-774.

Yuan, K.-H. & Bentler, P. M. (1998). Normal theory based test statistics in structural equation modeling. British Journal of Mathematical and Statistical Psychology, 51, 289-309.

Yuan, K.-H. & Bentler, P. M. (1999). F tests for mean and covariance structure analysis. Journal of Educational and Behavioral Statistics, 24, 225-243.

Accepted October, 2003.

Appendix A

Model Specifications for Type I Error and Power Analyses

Population Model A is a 5-factor LVP model with 3 indicators per factor, as shown in Figure 1.

Properly specified Model A1 may be described as follows: Λx has nonzero free elements λx11, λx21, λx31, λx42, λx52, λx62; Λy has nonzero free elements λy11, λy21, λy42, λy52, λy73, λy83, and nonzero fixed elements λy31 = .80, λy62 = .70, λy93 = .75; Γ has nonzero free elements γ11, γ22; B has nonzero free elements β31, β32; Φ has nonzero free element φ21, and nonzero fixed elements φ11 = 1 and φ22 = 1; Ψ has nonzero free diagonal elements ψ11 to ψ33; Θδ has nonzero free diagonal elements θδ11 to θδ66; Θε has nonzero free diagonal elements θε11 to θε99.

Properly specified Model A2 may be described as follows: Λx has nonzero constrained elements λx11 = λx21 = λx31, λx42 = λx52 = λx62; Λy has nonzero constrained elements λy11 = λy21, λy42 = λy52, λy73 = λy83, and nonzero fixed elements λy31 = .80, λy62 = .70, λy93 = .75; Γ has nonzero free elements γ11, γ22; B has nonzero free elements β31, β32; Φ has nonzero free element φ21, and nonzero fixed elements φ11 = 1 and φ22 = 1; Ψ has nonzero free diagonal elements ψ11 to ψ33; Θδ has nonzero constrained diagonal elements θδ11 = θδ22 = θδ33, θδ44 = θδ55 = θδ66; Θε has nonzero constrained diagonal elements θε11 = θε22 = θε33, θε44 = θε55 = θε66, θε77 = θε88 = θε99.

Misspecified Model A1 may be described as follows: Λx has nonzero free elements λx11, λx21, λx31, λx41, λx51, λx62, λx72, λx82, λx92, λx10,2; Λy has nonzero free elements λy11, λy21, λy31, λy41, and nonzero fixed element λy51 = .75; Γ has nonzero free elements γ11, γ22; Φ has nonzero free element φ21, and nonzero fixed elements φ11 = 1 and φ22 = 1; Ψ has nonzero free diagonal element ψ11; Θδ has nonzero free diagonal elements θδ11 to θδ10,10; Θε has nonzero free diagonal elements θε11 to θε55.

Misspecified Model A2 may be described as follows: Λx has nonzero constrained elements λx11 = λx21 = λx31 = λx41 = λx51, λx62 = λx72 = λx82 = λx92 = λx10,2; Λy has nonzero constrained elements λy11 = λy21 = λy31 = λy41, and nonzero fixed element λy51 = .75; Γ has nonzero free elements γ11, γ22; Φ has nonzero free element φ21, and nonzero fixed elements φ11 = 1 and φ22 = 1; Ψ has nonzero free diagonal element ψ11; Θδ has nonzero constrained diagonal elements θδ11 = θδ22 = θδ33 = θδ44 = θδ55, θδ66 = θδ77 = θδ88 = θδ99 = θδ10,10; Θε has nonzero constrained diagonal elements θε11 = θε22 = θε33 = θε44 = θε55.

Population Model B is a 7-factor CFA model with 3 indicators per factor, as shown in Figure 2.

Properly specified Model B1 may be described as follows: Λx has nonzero free elements λ(3i–2, i), λ(3i–1, i), λ(3i, i) for i = 1 to 7; Φ has diagonal elements fixed to 1 and off-diagonal elements free; Θδ has nonzero free diagonal elements θ(3i–2, 3i–2), θ(3i–1, 3i–1), θ(3i, 3i) for i = 1 to 7, and the following nonzero off-diagonal free elements: θ21, θ11,10, θ20,19, and θi,j where i, j ∈ {1, 4, 7, 10, 13, 16, 19}, i, j ∈ {2, 5, 8, 11, 14, 17, 20}, i, j ∈ {3, 6, 9, 12, 15, 18, 21}.

Properly specified Model B2 may be described as follows: Λx has nonzero constrained elements λ(3i–2, i) = λ(3i–1, i) = λ(3i, i) for i = 1 to 7; Φ has diagonal elements fixed to 1 and off-diagonal elements free; Θδ has nonzero constrained diagonal elements θ(3i–2, 3i–2) = θ(3i–1, 3i–1) = θ(3i, 3i) for i = 1 to 7, and the following nonzero off-diagonal elements constrained to .1: θ21, θ11,10, θ20,19, and θi,j where i, j ∈ {1, 4, 7, 10, 13, 16, 19}, i, j ∈ {2, 5, 8, 11, 14, 17, 20}, i, j ∈ {3, 6, 9, 12, 15, 18, 21}.

Misspecified Model B1 may be described as follows: Λx has nonzero free elements λ(4i–3, i), λ(4i–2, i), λ(4i–1, i), λ(4i, i) for i = 1; λ(5i–5, i), λ(5i–4, i), λ(5i–3, i), λ(5i–2, i), λ(5i–1, i) for i = 2; λ(6i–8, i), λ(6i–7, i), λ(6i–6, i), λ(6i–5, i), λ(6i–4, i), λ(6i–3, i) for i = 3 to 4; Φ has diagonal elements fixed to 1 and off-diagonal elements free; Θδ has nonzero free diagonal elements θ(4i–3, 4i–3), θ(4i–2, 4i–2), θ(4i–1, 4i–1), θ(4i, 4i) for i = 1; θ(5i–5, 5i–5), θ(5i–4, 5i–4), θ(5i–3, 5i–3), θ(5i–2, 5i–2), θ(5i–1, 5i–1) for i = 2; θ(6i–8, 6i–8), θ(6i–7, 6i–7), θ(6i–6, 6i–6), θ(6i–5, 6i–5), θ(6i–4, 6i–4), θ(6i–3, 6i–3) for i = 3 to 4; and the following nonzero off-diagonal free elements: θ21, θ11,10, θ20,19, and θi,j where i, j ∈ {1, 4, 7, 10, 13, 16, 19}, i, j ∈ {2, 5, 8, 11, 14, 17, 20}, i, j ∈ {3, 6, 9, 12, 15, 18, 21}.

Misspecified Model B2 may be described as follows: Λx has nonzero constrained elements λ(4i–3, i) = λ(4i–2, i) = λ(4i–1, i) = λ(4i, i) for i = 1; λ(5i–5, i) = λ(5i–4, i) = λ(5i–3, i) = λ(5i–2, i) = λ(5i–1, i) for i = 2; λ(6i–8, i) = λ(6i–7, i) = λ(6i–6, i) = λ(6i–5, i) = λ(6i–4, i) = λ(6i–3, i) for i = 3 to 4; Φ has diagonal elements fixed to 1 and off-diagonal elements free; Θδ has nonzero constrained diagonal elements θ(4i–3, 4i–3) = θ(4i–2, 4i–2) = θ(4i–1, 4i–1) = θ(4i, 4i) for i = 1; θ(5i–5, 5i–5) = θ(5i–4, 5i–4) = θ(5i–3, 5i–3) = θ(5i–2, 5i–2) = θ(5i–1, 5i–1) for i = 2; θ(6i–8, 6i–8) = θ(6i–7, 6i–7) = θ(6i–6, 6i–6) = θ(6i–5, 6i–5) = θ(6i–4, 6i–4) = θ(6i–3, 6i–3) for i = 3 to 4; and the following nonzero off-diagonal elements constrained to .1: θ21, θ11,10, θ20,19, and θi,j where i, j ∈ {1, 4, 7, 10, 13, 16, 19}, i, j ∈ {2, 5, 8, 11, 14, 17, 20}, i, j ∈ {3, 6, 9, 12, 15, 18, 21}.
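The running-index notation above can be made concrete by rendering one loading matrix. A small sketch constructing the pattern of Λx for properly specified Model B1, in which indicator rows 3i–2, 3i–1, and 3i load on factor i; the value 1.0 merely marks free (nonzero) elements rather than being a population loading from the article, and the function name is ours.

```python
# Illustrative rendering of the Lambda-x pattern for properly specified
# Model B1: a 21 x 7 matrix with rows 3i-2, 3i-1, 3i loading on factor i.
# The 1.0 entries mark free elements only; they are not population loadings.
import numpy as np

def lambda_x_pattern(n_factors: int = 7, per_factor: int = 3) -> np.ndarray:
    pattern = np.zeros((n_factors * per_factor, n_factors))
    for i in range(n_factors):                      # factor i, 0-based
        for r in range(per_factor * i, per_factor * (i + 1)):
            pattern[r, i] = 1.0                     # free element lambda(r, i)
    return pattern

L = lambda_x_pattern()
print(L.shape)        # (21, 7)
print(int(L.sum()))   # 21 free loadings, three per factor
```

The same helper generalizes to the misspecified variants by changing the rows-per-factor grouping (4, 5, 6, 6 indicators on factors 1 through 4).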

Appendix B

Rates of Non-convergence (NC), Improper Solutions (IS), and Test Statistic Failure (F)

Properly Specified Model A1 (85 df)

                   ML        SB   RES       ADF
D  cond      n   NC    IS     F     F    NC    IS

1  1:1      35   0.0  26.6   0.0
   2:1      70   0.0   1.0   0.0
   5:1     175   0.0   0.0   0.0   0.0   1.5   3.6
   10:1    350   0.0   0.0   0.0   0.0   0.0   0.0
   df+1     86   0.0   0.4   0.0   7.1
   p*      120   0.0   0.0   0.0   0.0  26.0  48.2

2  1:1      35   0.4  30.0   0.0
   2:1      70   0.0   2.3   0.0
   5:1     175   0.0   0.0   0.0   0.0   7.9  13.9
   10:1    350   0.0   0.0   0.0   0.0   0.0   0.0
   df+1     86   0.0   0.8   0.0   7.7
   p*      120   0.0   0.0   0.0   0.0  36.0  68.2

3  1:1      35   2.1  62.1   0.0
   2:1      70   0.7  16.4   0.0
   5:1     175   0.1   0.4   0.0   0.0  13.6  31.9
   10:1    350   0.0   0.0   0.0   0.0   1.5   1.7
   df+1     86   0.5   9.5   0.0   7.2
   p*      120   0.2   3.0   0.0   0.0  40.0  84.6


Properly Specified Model A2 (102 df)

                   ML        SB   RES       ADF
D  cond      n   NC    IS     F     F    NC    IS

1  1:1      18   0.1  12.5   0.0
   2:1      36   0.0   0.3   0.0
   5:1      90   0.0   0.0   0.0
   10:1    180   0.0   0.0   0.0   0.0   0.6   0.6
   df+1    103   0.0   0.0   0.0   9.2
   p*      120   0.0   0.0   0.0   0.0  18.2  37.8

2  1:1      18   0.0  13.0   0.0
   2:1      36   0.0   0.6   0.0
   5:1      90   0.0   0.0   0.0
   10:1    180   0.0   0.0   0.0   0.0   2.9   4.1
   df+1    103   0.0   0.0   0.0  10.1
   p*      120   0.0   0.0   0.0   0.0  25.9  56.8

3  1:1      18   0.1  26.7   0.0
   2:1      36   0.0   4.4   0.0
   5:1      90   0.0   0.6   0.0
   10:1    180   0.0   0.1   0.0   0.0   6.1  10.9
   df+1    103   0.0   0.1   0.0  10.4
   p*      120   0.0   0.4   0.0   0.0  29.4  65.4


Properly Specified Model B1 (102 df)

                   ML        SB   RES       ADF
D  cond      n   NC    IS     F     F    NC    IS

1  1:1     129   0.0   8.4   0.0   0.0
   2:1     258   0.0   0.2   0.0   0.0   0.2   5.7
   5:1     645   0.0   0.0   0.0   0.0   0.0   0.0
   10:1   1290   0.0   0.0   0.0   0.0   0.0   0.0
   df+1    103   0.0  17.0   0.0   8.9
   p*      231   0.0   0.3   0.0   0.0  20.1  16.2

2  1:1     129   0.0   9.2   0.0   0.0
   2:1     258   0.0   0.3   0.0   0.0   0.9  14.7
   5:1     645   0.0   0.0   0.0   0.0   0.0   0.0
   10:1   1290   0.0   0.0   0.0   0.0   0.0   0.0
   df+1    103   0.0  19.3   0.0   8.3
   p*      231   0.0   0.4   0.0   0.0  12.3  28.2

3  1:1     129   0.0  30.1   0.0   0.0
   2:1     258   0.0   3.6   0.0   0.0   1.0  15.1
   5:1     645   0.0   0.0   0.0   0.0   0.0   0.2
   10:1   1290   0.0   0.0   0.0   0.0   0.0   0.0
   df+1    103   0.0  44.5   0.0   8.6
   p*      231   0.0   5.4   0.0   0.0  12.8  44.3


Properly Specified Model B2 (196 df)

                   ML        SB   RES       ADF
D  cond      n   NC    IS     F     F    NC    IS

1  1:1      35   0.0   0.7   0.0
   2:1      70   0.0   0.0   0.0
   5:1     175   0.0   0.0   0.0
   10:1    350   0.0   0.0   0.0   0.0   0.0   0.0
   df+1    197   0.0   0.0   0.0  18.5
   p*      231   0.0   0.0   0.0   0.0   0.0  13.2

2  1:1      35   0.0   0.9   0.0
   2:1      70   0.0   0.0   0.0
   5:1     175   0.0   0.0   0.0
   10:1    350   0.0   0.0   0.0   0.0   0.0   0.1
   df+1    197   0.0   0.0   0.0  20.6
   p*      231   0.0   0.0   0.0   0.0   0.2  45.4

3  1:1      35   0.0  10.1   0.0
   2:1      70   0.0   0.2   0.0
   5:1     175   0.0   0.0   0.0
   10:1    350   0.0   0.0   0.0   0.0   0.0  12.7
   df+1    197   0.0   0.0   0.0  43.6
   p*      231   0.0   0.0   0.0   0.0   1.7  86.9

Note. All rates are in percentage form. ML and ADF refer to model estimation methods; SB refers to the Satorra-Bentler test statistics; RES refers to the residual-based test statistics. D = distributional form; cond = sample size condition. IS and F rates are based on converged data sets only. D = 1, multivariate normal; D = 2, elliptically symmetric nonnormal; D = 3, asymmetric nonnormal.
