Chapter 5: Technical Properties

In this chapter, the technical properties of the UNIT are described, including the results of studies of test reliability, subtest and scale properties, and test validity. Throughout the development of the UNIT, an overriding aim was to meet and then to exceed existing standards for measures of cognitive and intellectual ability, including those described in the Standards for Educational and Psychological Testing (1985) and those proposed by well-respected authorities such as Nunnally (1978) and Anastasi and Urbina (1997). The standards for preschool tests (Bracken, 1987) were adapted for school-aged children and applied to the entire UNIT, with the addition of more rigorous standards for floors, ceilings, and difficulty gradients. The UNIT's technical properties are discussed in terms of these standards, which are presented in Table 5.1.

Although the UNIT was developed according to the highest psychometric standards, the principal value of superlative technical properties lies in their implications for clinical and educational decision making. Accordingly, many of the analyses reported here are analyses of the UNIT's technical properties that relate directly to applied assessment. The reliability of the UNIT for the special populations for which it is intended is demonstrated by its measurement precision for clinical diagnostic groups, for various exceptional groups, and at or near critical test score ranges. The utility of UNIT subtests and scales for assessing the intellectual functioning of individuals of very high and very low ability is also discussed. Evidence of validity includes studies of the UNIT's convergence with (and divergence from) widely known intelligence tests as well as investigations of the UNIT's use with the exceptionalities most commonly encountered by practitioners. Theories of test technical development are discussed extensively but only in relation to applied practice.

Reliability Studies

Reliability refers to the accuracy or precision of scores from a test or to the degree to which test scores are free from measurement error. Measurement precision is enhanced and error reduced by test elements that are internally consistent, temporally stable, accurately understood, and convergently scored by different observers and scorers. One major reason for careful standardization of test administration is to elicit reliable responses from examinees and, thereby, to maximize reliability and reduce systematic error. Test reliability is demonstrated most robustly by measurement precision that is consistent across age, sex, race/ethnicity, and the specific populations for which the test is intended. Therefore, reliability should be demonstrated across the full range of individual ability, especially the ranges of ability most relevant for clinical decision making. Several methods were used to examine the reliability of UNIT scores.

Internal Consistency

The internal consistency of a test refers to the uniformity and coherence of test content (i.e., items) and is a prerequisite of precise measurement of the psychological construct being assessed. Internal consistency is necessary for establishing that the variability in test scores is attributable to ability rather than to measurement error. Split-half correlations corrected by the Spearman–Brown formula were used as the index of subtest internal consistency. The reliability coefficients of the UNIT scales were computed with the formula for the reliability of linear combinations (Nunnally, 1978). The subtest and scale reliability coefficients were based on data from the normative standardization sample and thereby provide estimations of score precision that are based on representative population parameters. The average reliability coefficients for the standardization sample across age groups were computed with Fisher's z transformation. The reliability coefficients for clinical and exceptional populations were computed by similar methods so that results are directly comparable. The adequacy of the UNIT reliability coefficients was compared to the standards adapted from Bracken (1987; see Table 5.1).
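
These computations follow standard psychometric formulas. As a minimal sketch (not the publisher's scoring code, and with hypothetical sample values), the following shows how a half-test correlation is stepped up with the Spearman–Brown formula and how coefficients from several age groups can be averaged through Fisher's z transformation.

```python
import math

def spearman_brown(r_half: float) -> float:
    """Step up a split-half correlation to estimate full-length reliability."""
    return 2 * r_half / (1 + r_half)

def average_reliability(coefficients: list[float]) -> float:
    """Average reliability coefficients via Fisher's z transformation."""
    z_values = [math.atanh(r) for r in coefficients]   # r -> z
    mean_z = sum(z_values) / len(z_values)
    return math.tanh(mean_z)                           # z -> r

# Hypothetical split-half correlation for one subtest at one age level
print(round(spearman_brown(0.75), 2))                  # 0.86

# Hypothetical subtest reliabilities across several age groups
print(round(average_reliability([0.79, 0.83, 0.86, 0.91]), 2))
```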

The reliability estimates for the UNIT subtests and scales for the Abbreviated, Standard, and Extended batteries are reported in Table 5.2 for each age group. The median of the average subtest reliability coefficients across ages is .83 for the Standard Battery and .80 for the Extended Battery, with average subtest reliability coefficients across age groups ranging from .64 (Mazes) to .91 (Cube Design). The median reliability coefficients of the six subtests meet the standard of .80 and, for 8 of the 12 age groups, exceed that standard. For the remaining 4 age groups, the median reliability coefficient is .79. The subtest reliabilities for a combined clinical/exceptional sample consisting of individuals from the populations for which the UNIT is intended were also examined. The median average subtest reliability coefficient was .92 for both the Standard Battery and the Extended Battery. On the basis of these results, the UNIT approaches or meets the minimum reliability standards for both the standardization and clinical samples.

Composite scale reliability coefficients average .89 for the Standard Battery and .88 for the Extended Battery, with coefficients ranging from a low of .86 (Extended Battery Reasoning Quotient) to a high of .91 (Standard Battery Nonsymbolic Quotient). The scale reliability coefficients for the clinical/exceptional sample were substantially higher, averaging .96, with none lower than .95.

The FSIQ is the UNIT's estimate of general cognitive functioning and the principal score used for educational decision making. The FSIQ reliability coefficients average .91 for the Abbreviated Battery, .93 for the Standard Battery, and .93 for the Extended Battery. For the clinical/exceptional sample, reliability coefficients for the FSIQ for all three batteries are slightly higher. All of these coefficients exceed the standards of .90 for total test internal consistency and .80 for screening test internal consistency.

In the analyses of fairness reported in Chapter 6, the UNIT subtests and scales show comparably high internal consistencies for groups categorized by sex, race, and ethnicity.

Confidence Intervals and Standard Errors

A confidence interval provides a probability band around an individual's obtained score or estimated true score. The band represents the range of scores within which the individual's true score is likely to fall, based on measurement precision and the desired level of certainty. The confidence intervals for the UNIT are based on the estimated true score and the standard error of the estimate (SEE), although the computation of confidence intervals around the obtained score with the standard error of measurement (SEM) is also possible. Both methods are discussed here.

The standard error of measurement is an estimate of the amount of error in an individual's obtained score. Large standard errors of measurement indicate less precise measurement, whereas small standard errors of measurement imply greater accuracy and reduced error. The UNIT standard errors of measurement were calculated from the reliability coefficients according to the following formula (Nunnally, 1978):

SEM = SD √(1 − rxx)

As the formula shows, there is an inverse relationship between standard error and reliability: as the standard error decreases, the reliability of a test's scores increases. The standard errors of measurement for the UNIT subtests and scales by age are presented in Table 5.3 and are reported in the same standard score units as the respective subtest or scale scores. Because the subtest scores have a standard deviation of 3, compared to the standard deviation of 15 for the scale standard scores, the subtest standard errors of measurement are proportionally smaller than those for the scales.

The standard errors of measurement may be used to create a confidence interval around an individual's obtained score. Confidence intervals are calculated with the following formula and reflect the level of precision desired for describing the probability that an individual's true score falls within the confidence interval.

Confidence Interval = Obtained Score ± z(SEM),

where z = 1.00 at the 68% level of confidence; z = 1.65 at the 90% level; z = 1.96 at the 95% level; and z = 2.58 at the 99% level.

The following example illustrates construction of confidence intervals with this formula, using the average internal reliability coefficient for the UNIT Standard Battery FSIQ of .93 and a standard error of measurement of 3.99. If an individual obtained an FSIQ of 110, the chances are 90 in 100, or 90%, that his or her "true" FSIQ would be included in the range of scores between 103 and 117, that is, 110 ± [(1.65)(3.99)] with appropriate rounding.
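
As a minimal sketch (not the publisher's scoring software), the following reproduces the worked example above: the SEM is derived from the reported reliability of .93 and the scale standard deviation of 15, and a 90% confidence interval is placed around an obtained FSIQ of 110.

```python
import math

def sem(sd: float, rxx: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - rxx)."""
    return sd * math.sqrt(1 - rxx)

def confidence_interval(obtained: float, sem_value: float, z: float) -> tuple[float, float]:
    """Symmetric confidence interval around the obtained score."""
    return obtained - z * sem_value, obtained + z * sem_value

standard_error = sem(sd=15, rxx=0.93)                           # about 3.97, close to the reported 3.99
low, high = confidence_interval(110, standard_error, z=1.65)    # 90% level of confidence
print(round(standard_error, 2), round(low, 1), round(high, 1))  # ~3.97, ~103.5, ~116.5, i.e., 103 to 117 after rounding
```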

The confidence intervals included in the UNIT norms tables, however, are based on a methodology intended to enhance accuracy, that is, one derived from the average standard error of estimation (SEE). With this method, confidence intervals are centered on the estimated true score rather than on the obtained score. The standard error of estimation is derived by the following formula (Stanley, 1971):

SEE = SD √(rxx (1 − rxx)),

where SEE is the standard error of estimation, SD is the standard deviation of the score, and rxx is the reliability coefficient of the related scale. The standard error of estimation, therefore, corrects for regression to the mean and results in increasingly asymmetrical bands around obtained scores as scores deviate farther from the population mean. In practical terms, this procedure has little effect on scores near the population mean. For extreme scores, however, the effect is notable. For example, for the extremely low FSIQ of 65 on the Extended Battery, the confidence interval is from 57 to 73 based on the obtained score and standard error of measurement at the 95% level of confidence. In contrast, that same score has an asymmetrical confidence interval of 61–75 at the 95% confidence level based on the estimated true score and standard error of estimation, because the true score is likely to be closer to the mean than is the obtained score. The error subsumed in extreme scores tends to push scores away from the mean. Thus, the confidence interval around the estimated true score reflects a correction for true-score regression to the mean.
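
A minimal sketch of the estimated-true-score method is shown below, assuming the textbook relations: the estimated true score is mean + rxx(obtained − mean), and the band is built from the SEE. The reliability of .93 is the Extended Battery average reported earlier, so the resulting interval approximates, but will not exactly reproduce, the 61–75 band taken from the norms tables.

```python
import math

def true_score_interval(obtained: float, mean: float, sd: float,
                        rxx: float, z: float) -> tuple[float, float]:
    """Confidence interval centered on the estimated true score.

    Estimated true score = mean + rxx * (obtained - mean)
    SEE = sd * sqrt(rxx * (1 - rxx))
    """
    estimated_true = mean + rxx * (obtained - mean)
    see = sd * math.sqrt(rxx * (1 - rxx))
    return estimated_true - z * see, estimated_true + z * see

# Obtained FSIQ of 65; .93 is the average Extended Battery reliability reported above.
low, high = true_score_interval(65, mean=100, sd=15, rxx=0.93, z=1.96)
print(round(low), round(high))   # roughly 60 to 75: asymmetrical around the obtained score of 65
```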

Internal Consistency and Standard Errors for Clinical and Exceptional Samples

Although adequate measurement precision across a test's intended age range is important, the reliability, accuracy, and precision of scores for the clinical and exceptional populations for which the test is intended may be even more important for examinees from these groups.

The standard errors of measurement for the clinical/exceptional sample are included in Table 5.3 for purposes of comparison. Insofar as the clinical/exceptional standard errors of measurement are consistently smaller than the average standard errors of measurement across ages, the confidence intervals computed with the UNIT average standard errors of measurement tend to be conservative.

Table 5.4 presents the reliability coefficients for the UNIT subtests and scales for the Abbreviated, Standard, and Extended batteries for four clinical/exceptional samples: examinees with learning disabilities, examinees with mental retardation, examinees who are intellectually gifted, and examinees with speech and language impairments. (A more detailed description of the criteria for identifying these samples appears in the validity sections of this chapter.) The split-half method was used to calculate the subtest reliability coefficients, corrected with the Spearman–Brown formula.

All composite scores were calculated with the formula for the reliability of linear combinations (Nunnally, 1978). All of the reliability coefficients were corrected for restriction in range with the formula suggested by Gulliksen (1987). The standard errors of measurement for the four samples are also presented and are based on the corrected reliability coefficients.
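
The manual does not reproduce Gulliksen's formula here, but a common form of the range correction assumes that the standard error of measurement is the same in the sample and in the reference population, so the reliability implied for a population with an SD of 15 can be estimated from the sample's SD and its observed reliability. The sketch below uses that assumption with hypothetical sample values; it is an illustration, not the published computation.

```python
def correct_for_range(r_sample: float, sd_sample: float, sd_population: float = 15.0) -> float:
    """Estimate reliability in a reference population, assuming a constant SEM.

    SEM^2 = sd_sample^2 * (1 - r_sample) is carried over to the population:
    r_population = 1 - SEM^2 / sd_population^2
    """
    sem_squared = sd_sample ** 2 * (1 - r_sample)
    return 1 - sem_squared / sd_population ** 2

# Hypothetical clinical sample: observed reliability .90 with a wider-than-normal SD of 18.
print(round(correct_for_range(0.90, sd_sample=18.0), 2))   # about 0.86, pulled down by the expanded range
```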

The results show that obtained reliabilities meet the proposed standards. The obtained median subtest reliability coefficients for each group are all greater than .80, and the median composite reliability coefficients all exceed the .90 standard. Although corrected reliability coefficients tend to be slightly larger, in some instances they are smaller due to the high variability of these samples. Nevertheless, the corrected coefficients also meet and exceed the proposed standards. The results are evidence of adequate measurement precision for children and adolescents with learning disabilities, mental retardation, intellectual giftedness, and speech and language impairments.

Internal Consistency at Decision-Making Points

A test's reliability, accuracy, and precision within the range of scores where clinical and educational decision making occurs are critical. For example, how confident are psychologists that a score of 70, commonly used for educable mentally retarded (EMR) diagnosis, is as precise as a score of 100? Because reliability statistics for a normative sample are based primarily on scores near the population mean, an examination of the reliability of scores that are far from the population mean is necessary for ensuring that important educational decisions can be made with confidence. Not only does the reliability coefficient vary with the extent of individual differences in the sample, it may also vary between groups differing in average ability level (Anastasi & Urbina, 1997).

A common criterion for the classification of intellectual giftedness is an FSIQ equal to or greater than 130, that is, 2 SDs above the general population mean. An equally common cut score for the classification of mental retardation is an FSIQ equal to or less than 70, that is, 2 SDs below the general population mean. The reliability of the UNIT near these critical cut points was verified by calculating the reliability coefficients separately for Standard Battery FSIQs between −1.33 SD and −2.66 SD from the mean and between +1.33 SD and +2.66 SD from the mean. These ranges were selected to include in the reliability calculations about 99% of the examinees whose true score might have been at the cut point. The split-half method was again used to calculate the subtest reliability coefficients, corrected with the Spearman–Brown formula. All of the composite reliability coefficients were calculated with the formula for the reliability of linear combinations (Nunnally, 1978). All coefficients were also corrected for restriction or expansion in range (e.g., Gulliksen, 1987).

The obtained and corrected reliability coefficients for each UNIT subtest and scale for the Abbreviated, Standard, and Extended batteries, calculated for the two key clinical decision-making points, are reported in Table 5.5.

The sample included 471 individuals drawn from the standardization and clinical/exceptional samples who obtained scores in the respective critical ranges. With two exceptions, all of the median obtained and corrected subtest reliability coefficients are greater than .80, meeting the standards suggested by Bracken (1987). The two exceptions are the corrected and obtained coefficients for Mazes for the high-ability sample. The corrected reliability coefficients for the UNIT scales, including the Full Scale, all exceed the .90 standard, and the obtained coefficients for these composite scores are consistently near or above .80. Accordingly, the UNIT may be considered sufficiently reliable for the range of scores where clinical decision making occurs.

Test–Retest Score Stability

Test scores must be reasonably stable to have practical utility for making clinical and educational decisions and to be predictive of future performance. Typically, stability is reported in terms of test–retest stability coefficients, which are correlations of test performance at two points in time. Test stability also encompasses the average magnitude of increase or decrease in scores over a specified time, with change scores indicating the effects of practice, learning, maturation, and random error.

The stability of UNIT scores was assessed in a study of 197 participants (approximately 15 in each age group between 5 and 17) who took the UNIT twice over an interval of approximately 3 weeks. The mean test–retest interval was 20.3 days, with a range from 3 to 42 days. The racial composition of the sample was 76.1% White, 19.8% African American, 3% Asian, and 1% Other. Approximately 1% of the sample was Hispanic. Of the sample, 50.8% were female and 49.2% male. The parent education levels of the sample varied: 19.3% of parents had not finished high school, 30.5% were high school graduates or the equivalent, 18.8% had completed some college, and 31.5% had completed 4 or more years of college. For this study, adjacent ages were combined to form four age groups: 5–7, 8–10, 11–13, and 14–17.

The means and standard deviations of the scores for both testings, as well as the corrected and observed stability coefficients for the total sample and the four age groups, are reported in Table 5.6. The stability coefficients were corrected for the variability of the initial testing with the formula for the restriction or expansion of range provided by Guilford and Fruchter (1978).

Bracken (1987) recommended a total test stability coefficient of .90 or greater, and the UNIT Standard and Extended batteries approach or exceed this criterion for ages 8 and older. Average test–retest practice effects across ages are 7.2 points for the Abbreviated Battery, 5.0 points for the Standard Battery, and 4.8 points for the Extended Battery. Practice effects tend to peak at ages 8–10 and drop thereafter, with gains in the Standard and Extended FSIQs averaging about 3–5 points over the test–retest interval.

Object Memory and Mazes appear to be the least stable subtest scores, whereas the Cube Design score is the most stable. The smallest mean gains from test to retest were made on the Spatial Memory and Analogic Reasoning subtests, whereas the highest mean gains occurred on Symbolic Memory. According to the results of this study, the Reasoning Quotient is the most stable score.

Summary of Reliability Studies

The results from the analyses of the reliability of the UNIT's scores indicate a high level of measurement precision in terms of internal consistency and stability over time for both clinical and nonclinical populations. The UNIT's subtest and scale scores consistently approach or exceed proposed standards for reliability. Moreover, reliability is high for general clinical and exceptional populations, for specific exceptionalities, and for individuals with high or low cognitive abilities near clinical and educational decision-making points. As the results reported in Chapter 6 show, UNIT scores are also reliable for separate groups categorized by sex, race, and ethnicity.

Subtest and Full Scale Properties

The subtests and scales of a test must have adequate range (from floor to ceiling) and appropriate difficulty gradients for the test to be valid for clinical and educational decision making at extreme score ranges. As reported in Table 5.1, instruments intended for school-aged examinees should have floors sufficiently strong to differentiate the extreme lowest 3% of the population (i.e., individuals of low ability or with mental retardation) from the upper 97%. This recommendation may be extended for older school-aged examinees to include a ceiling that is sufficiently high to differentiate the extreme upper 3% (i.e., individuals with high ability or giftedness) from the lower 97%. It is also recommended that a test's item difficulty gradient should not be so steep that an increase or decrease of a single raw-score point results in a scaled score change of more than one-third standard deviation (0.33 SD). Likewise, an increase or decrease of one sum-of-scaled-scores point should not result in a total test standard score change of more than one-third standard deviation. Item gradients steeper than this criterion result in little differentiation of ability. In this section, the floors, ceilings, and difficulty gradients associated with the UNIT subtests and Full Scale are described.

Subtest Floors, Ceilings, and Difficulty Gradients

The results from analyses of average subtest floors and ceilings are reported for the Abbreviated Battery (Table 5.7), Standard Battery (Table 5.8), and Extended Battery (Table 5.9).

The percentage of examinees whose intellectual functioning is effectively assessed at each extreme of the distribution of scores was determined, with the adequacy of the floors and ceilings evaluated according to standards originally suggested by Bracken (1984). According to these standards, a floor or ceiling that fails to assess the functioning of 10.01% or more of the population at either end of the distribution is considered poor; omission of the extreme 7.01% to 10.00% of the population is considered fair; from 5.01% to 7.00% is considered good; from 3.01% to 5.00% is considered very good; and from 0.01% to 3.00% is considered excellent.
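
These bands translate directly into a simple classification rule. The helper below is illustrative only (it is not part of the published materials) and maps the percentage of the population missed at one end of the distribution onto Bracken's adequacy labels.

```python
def floor_ceiling_adequacy(percent_missed: float) -> str:
    """Classify a floor or ceiling by the % of the population it fails to assess."""
    if percent_missed > 10.00:
        return "poor"
    if percent_missed > 7.00:
        return "fair"
    if percent_missed > 5.00:
        return "good"
    if percent_missed > 3.00:
        return "very good"
    return "excellent"

# A floor at the 2nd percentile misses about 2% of the low end of the distribution.
print(floor_ceiling_adequacy(2.0))   # excellent
print(floor_ceiling_adequacy(7.5))   # fair
```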

The UNIT's subtest floors were evaluated across the entire age range, but special attention was paid to the adequacy of the easiest items for the youngest children for whom the test is appropriate. For age 5 years 0 months, the average subtest floor, that is, the subtest scaled score associated with a raw score of 1 for each subtest, is 5.50 (7th percentile rank) for the Abbreviated Battery, 4.00 (2nd percentile) for the Standard Battery, and 3.33 (1st percentile) for the Extended Battery. Accordingly, for age 5, the Abbreviated Battery has a good floor, but the interpretation of subtest raw scores of 0 or 1 as suggestive of possible delay should be cautious, and the child's functioning investigated further. The Standard and Extended batteries have excellent floors for age 5. For age 6 years 0 months and beyond, all three batteries have very good or excellent subtest floors, with average scaled scores of 4.50 (3rd percentile) for the Abbreviated Battery, 3.25 (1st percentile) for the Standard Battery, and 2.67 (1st percentile) for the Extended Battery. These results indicate that the UNIT subtests have more than adequate floors for assessing the functioning of examinees of low ability across the entire age range, with the exception of the Abbreviated Battery at the youngest ages.

The UNIT's average subtest ceilings were also evaluated across the entire age range, although emphasis was placed on the average ceiling for the oldest examinees for whom the test is intended. The ceiling of a test is determined by the extent to which there are sufficient difficult items to distinguish between examinees of average ability and examinees of above-average ability. For adolescents aged 17 years 11 months, the average UNIT subtest ceiling, that is, the subtest scaled score associated with a perfect raw score for each subtest, was 18.00 (99th percentile) for the Abbreviated Battery, 17.75 (99th percentile) for the Standard Battery, and 17.33 (99th percentile) for the Extended Battery. These results indicate that the UNIT has substantial and consistently excellent ceilings, even for the oldest and brightest examinees whose intellectual functioning may be assessed by the instrument.

Finally, item difficulty gradients for the UNIT were examined and are reported in Table 5.10. Difficulty gradients refer to how rapidly a standard score increases as a function of an examinee's success or failure on a single test item (Bracken, 1987). Tests and subtests with smaller increments in standard scores relative to single raw-score points are more effective, sensitive, and finely tuned as measures of an examinee's true ability.

For the UNIT, difficulty gradients were determined by dividing the total number of possible points that could be obtained from the subtest's floor to ceiling (i.e., the number of possible points from the highest raw score at the floor to the lowest raw score at the ceiling) by the number of standard deviations spanned by the subtest (i.e., the number of scaled score points from a raw score of 1 to the highest obtained raw score, divided by 3). For most subtests, which consist of dichotomously scored items, the number of possible points is equal to the number of items. For subtests with multiple scoring criteria or with bonus points (i.e., Cube Design and Mazes), the number of possible points exceeds the number of items. When difficulty gradients for subtests are averaged, any result meeting or exceeding 3 (i.e., 3 raw-score points associated with 1 SD) meets the recommended standards. Analyses revealed that the average item gradient of UNIT subtests equals or exceeds 3 for every age level. These results indicate that average UNIT subtest difficulty gradients are consistently adequate for detecting minor fluctuations in examinees' abilities. Only one subtest, Analogic Reasoning, has an item gradient less than 3.0 for ages 5.0–6.7 years; this result suggests that the subtest has relatively large increments of change for very young children. With older children, Analogic Reasoning effectively measures subtle differences in ability.
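
A minimal sketch of this computation follows; the subtest values are hypothetical and simply illustrate the arithmetic described above (raw-score points spanned, divided by the number of scaled-score standard deviations spanned, with 3 scaled-score points equal to 1 SD).

```python
def difficulty_gradient(raw_points_floor_to_ceiling: int,
                        scaled_points_floor_to_ceiling: int) -> float:
    """Raw-score points per standard deviation of scaled score.

    Subtest scaled scores have SD = 3, so the number of SDs spanned is the
    scaled-score range divided by 3. Values of 3.0 or more meet the standard
    (at least 3 raw-score points per 1 SD of change).
    """
    sds_spanned = scaled_points_floor_to_ceiling / 3
    return raw_points_floor_to_ceiling / sds_spanned

# Hypothetical subtest: 38 raw-score points between floor and ceiling,
# spanning 15 scaled-score points (5 SDs).
print(round(difficulty_gradient(38, 15), 2))   # 7.6 raw-score points per SD, well above 3
```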

In summary, UNIT subtests have good to excellent floors, consistently excellent ceilings, and consistently satisfactory item difficulty gradients across the ages and ability levels served by the test.

Full Scale Properties

The results from the analyses of the UNIT Full Scale are reported for the Abbreviated Battery (Table 5.11), Standard Battery (Table 5.12), and Extended Battery (Table 5.13). As with the analyses of subtest properties, the percentage of examinees whose intellectual functioning is effectively assessed at each extreme of the distribution of scores was determined, with the adequacy of the floors and ceilings evaluated according to standards originally suggested by Bracken (1984). Difficulty gradients are not reported, in view of the adequacy of the constituent subtest difficulty gradients.

Global scale floors represent the lowest possible standard scores (derived from the sum of subtest scaled scores) for individuals who obtained a raw score of 1 on all contributing subtests. Ceilings are the highest possible standard scores on scales, given the sum of possible subtest scaled scores. From floor to ceiling, the number of standard deviations covered by the UNIT FSIQ ranges from 5.4 to 7.2 for the Abbreviated Battery, from 6.5 to 7.9 for the Standard Battery, and from 7.0 to 7.9 for the Extended Battery. These results indicate that the UNIT Full Scale for all three batteries has very good to excellent floors and consistently excellent ceilings.

Summary of Subtest and Full Scale Properties

These studies of the UNIT subtests and Full Scale provide evidence of the test's effectiveness in assessing the functioning of individuals at both high- and low-ability levels relative to age expectations. The results also show that the UNIT's ceilings and floors are consistently effective for making the types of clinical decisions for which the test is intended. Moreover, the UNIT has sufficiently gradual increases in subtest and Full Scale difficulty to be sensitive to functioning at every level, from very delayed to very superior.

Validity Studies

The validity of a test concerns what the test measures and how well it does so (Anastasi & Urbina, 1997, p. 113). Test validation is an ongoing process, beginning with the construction of the test. Loevinger (1957) proposed that test validation during construction and development consists of three sequential components: substantive, structural, and external. For the UNIT, this approach was updated by the incorporation of concepts of validation including, among others, those by Jackson (1970), Wright and Stone (1979), and Millon, Davis, and Millon (1997). Anastasi and Urbina (1997) recently summarized the approach. Based on this approach, the validation of the UNIT began with studies undertaken during its development. These studies included substantive and structural (internal) aspects of validity and external evidence from numerous and varying sources. Underlying all of these studies was Guion's (1977) proposition that construct validity is the unifying concept of validity that integrates criterion and content considerations into a common framework. Guion observed that "all validity is at its base some form of construct validity . . . . It is the basic meaning of validity" (p. 410).

Internal Evidence of Validity

Evidence of test validity may be found within the test itself. If this evidence is largely independent from external sources of information, it constitutes internal evidence of validity. Content-description and structural evidence of validity are found within the test.

Content-Description Evidence of Validity

A systematic examination of test substance and content constitutes one source of evidence of test validity. The formulation of test items based on and consistent with a theory constitutes substantive validity (Loevinger, 1957). The relevance, representativeness, and coverage of item content is referred to as content validity (Messick, 1980). Taken together, these approaches constitute an internal source of test validity that contributes to the overall construct validity.

The substantive validity of the UNIT was ensured from the beginning of its development. From the generation of the initial pool of potential tasks, the test items and tasks were based on the theory underlying the UNIT. During pilot-testing stages, tasks were retained if participant performance appeared centrally related to the constructs under study (e.g., the ability to reason) and not to other factors (e.g., comprehension of instructions). Tasks were also retained if their nonverbal administration was sufficiently easy and effective. Tasks that did not meet these criteria included nonverbal adaptations of trail making, picture arrangement, and logical reasoning to predict natural outcomes.

In terms of content validity, the UNIT items and tasks were designed to be relevant to and representative of cognitive and intellectual abilities, with an emphasis on abilities and aptitudes rather than on achievement, previously acquired knowledge, or cultural experiences. The content of the UNIT subtests was developed to have relevance to the central dimensions of intelligence (memory and reasoning) as well as to the principal ways by which information is internally mediated (symbolically and nonsymbolically). The representativeness and the coverage of subtest content across intellectual domains were ensured by a varied and representative set of performance requirements and response modes. The UNIT subtests are unrepresentative in one way: they do not sample receptive or expressive language abilities. At the same time, the use of symbolic processing requires the active use of the internal mental processes associated with language but without the demands of verbal reception or expression.

Structural Evidence of Validity

Structural validation refers to an instrument's fidelity to its underlying theoretical model. Item properties and dimensionality, subtest and scale interrelationships, and subtest factorial composition all provide evidence of structural validity. Three investigations into the structural integrity of the UNIT were undertaken and are reported here: fit of individual items within a subtest, subtest intercorrelation studies, and factor analyses. Central to the factor analyses is the demonstration that the UNIT is a sound measure of general intellectual ability, or psychometric g.

Subtest Unidimensionality Studies

The evaluation of the structural fit of subtest items to the ability assessed by that subtest was based on the Rasch model of item response theory (Wright & Stone, 1979), and item fit was analyzed with the BIGSTEPS computer program (Wright & Linacre, 1992). The Rasch model provides a methodology for evaluating item properties in relation to the individual ability levels for which the item is most discriminating. Good fit statistics are evidence of the unidimensionality of items on a scale.
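
The Rasch model itself is compact: the probability that an examinee with ability theta passes an item of difficulty b depends only on the difference theta − b. The sketch below illustrates that relation with hypothetical values; it is not a reproduction of the BIGSTEPS fit analysis.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1 / (1 + math.exp(-(theta - b)))

# An item is most discriminating near its difficulty: ability equal to b gives P = .50.
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(rasch_probability(theta, b=0.0), 2))   # .27, .50, .73
```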

Every item from the standardization edition of the UNIT was examined for adequacy of fit. Items that demonstrated poor fit were removed from the final edition. Accordingly, findings for this final edition indicate that every item on every UNIT subtest demonstrated adequate fit to expected item characteristic curves.

The criterion of adequate fit was an observed mean square fit statistic that was less than three times the expected standard error from its expected value. Therefore, the items on each UNIT subtest may be interpreted as evidence of unidimensionality within the subtest. Additional studies of item fit were conducted across groups categorized by sex, race, ethnicity, and hearing impairments. Results from these studies are described in Chapter 6.

Subtest and Scale Intercorrelation Studies

The pattern of relationships between the UNIT subtests and scales was investigated through a series of intercorrelation analyses. These analyses provide information about convergent and discriminant subtest and scale patterns. Intercorrelation matrices for each age group are provided in Appendix E. Table 5.14 presents the intercorrelation matrix for the entire standardization sample, with the average intercorrelation across all 13 age groups computed with Fisher's z transformation.

Comparisons Between the Three UNIT Batteries

The comparability of the three UNIT batteries for the standardization sample was examined during the intercorrelation analyses. For the standardization sample, correlations between the FSIQs and between the scores on the other four scales across the Abbreviated, Standard, and Extended batteries were consistently high. The Extended FSIQ correlated very highly with both the Abbreviated FSIQ (.86) and the Standard FSIQ (.95). The four quotients of the Standard and Extended batteries (Memory, Reasoning, Symbolic, and Nonsymbolic) were also highly correlated, yielding obtained correlations of .94, .90, .93, and .91, respectively. These correlations are likely to be inflated because some subtests are common to both scales. Mean FSIQ differences (e.g., Extended FSIQ minus Abbreviated FSIQ, Extended FSIQ minus Standard FSIQ, Standard FSIQ minus Abbreviated FSIQ) obtained from the Abbreviated, Standard, and Extended batteries were negligible (less than 1.0 point) across the standardization sample. This finding is expected because the normative means were set at 100 for the standardization sample. For clinical and exceptional samples, the mean differences were also small. The largest mean FSIQ differences were 2.0 points (Abbreviated minus Extended) for the sample with learning disabilities, 5.5 points (Abbreviated minus Extended) for the sample with mental retardation, 1.1 points (Standard minus Extended) for the sample with giftedness, 2.8 points (Abbreviated minus Extended) for the sample with speech and language impairments, and 0.9 points (Abbreviated minus Standard) for the sample with serious emotional disturbance. With the exception of the sample with mental retardation, for which mean FSIQ differences are slightly larger than 1 SEM, these results support the validity of the Abbreviated, Standard, and Extended batteries as comparable measures.

Subtest Factor Analytic Investigations

Factor analyses are data-reduction procedures that yield information about the internal structure of the test that, in turn, serves as evidence of the test's construct validity. Based on the assumption that numerous subtest intercorrelations can be most parsimoniously explained by a reduced number of underlying dimensions, or factors, this methodology includes both exploratory and confirmatory forms.

Exploratory factor analyses allow for empirical derivation of the natural structure of an instrument, often without a priori expectations. The results are best interpreted according to the "psychological meaningfulness" of the dimensions or factors that emerge (e.g., Gorsuch, 1983). Confirmatory factor analyses permit the internal structure of a test to be compared against a series of specified models so that the model that best fits the data can be identified.

Exploratory Factor Analyses. Several extraction methods and data from several samples were used for the exploratory analyses. It was expected that findings that were consistent across methods and samples would yield the most stable factorial solution. Reed and McCallum (1995) suggested that although all UNIT subtests are highly and positively associated with a general intelligence factor, g, a two-factor solution with a higher-order g factor may be the most appropriate for conceptualizing the UNIT. This solution and others were investigated. The UNIT subtests composing the Standard Battery and the Extended Battery were subjected to both principal components and principal axis methods of extraction and to second-order analyses.

Initial analyses were run on data from the total standardization sample (N = 1,996, with cases missing any data excluded). The subtest scaled score intercorrelations for the Standard Battery and a principal components method with an oblique rotation were used. Components were not restricted to an orthogonal rotation, because strong correlations between the component cognitive abilities assessed by the UNIT were expected.

A large first eigenvalue of 2.33 was computed, followed by a substantially smaller second eigenvalue of .64. These findings support the presence of a strong first factor, which is commensurate with the interpretation of the UNIT FSIQ as an overall index of global cognitive ability. The two-factor solution, however, provided the most interpretable and meaningful solution, according to the recommendations by Gorsuch (1983) described previously. In this solution, subtests clustered into a memory factor (I) and a reasoning factor (II), the primary constructs that the UNIT was designed to assess. The factor pattern and structure coefficients for the two-factor solution are provided in Table 5.15. In brief, factor pattern coefficients (sometimes referred to as factor loadings) are analogous to regression beta weights and specify the relative weight of each subtest scaled score as it contributes to the determination of the factor standard score, controlled for any redundancies in the measured variables. Factor structure coefficients are analogous to regression structure coefficients and specify the variable's correlation with the factor, without the exclusion of the overlap between measured variables. The interpretation of pattern and structure coefficients is further explained by Thompson and Daniel (1996) and Thompson (1997). The two factors accounted for 77.5% of the total variance.
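
The eigenvalue pattern described above comes from a principal components decomposition of the subtest intercorrelation matrix. As a self-contained sketch (the correlation matrix below is invented, not the UNIT's Table 5.14), the eigenvalues of the correlation matrix are the component variances, and the proportion of variance retained by the first k components is the sum of the first k eigenvalues divided by the number of variables.

```python
import numpy as np

# Hypothetical intercorrelations among four subtests (not the UNIT's actual values).
R = np.array([
    [1.00, 0.45, 0.38, 0.30],
    [0.45, 1.00, 0.35, 0.28],
    [0.38, 0.35, 1.00, 0.42],
    [0.30, 0.28, 0.42, 1.00],
])

# Principal components of a correlation matrix: eigenvalues are component variances.
eigenvalues = np.linalg.eigvalsh(R)[::-1]     # sorted largest first
explained = eigenvalues / R.shape[0]          # proportion of total variance per component

print(np.round(eigenvalues, 2))
print(f"first two components explain {explained[:2].sum():.1%} of the variance")
```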

These analyses were repeated with the six-subtest Extended Battery. The two-factor (Memory–Reasoning) solution did emerge, as it had with the Standard Battery, but a third factor was added to the solution.

Mazes did not correlate with the Reasoning factor to the extent anticipated but, instead, defined a third factor; on that factor, only Mazes produced a salient pattern coefficient (greater than .40). This finding revealed Mazes as a special and somewhat unique form of reasoning, that is, planning. Results from the confirmatory factor analyses, discussed next, provide support for the inclusion of Mazes as a reasoning subtest. Mazes shares the smallest amount of common variance with the other UNIT subtests, and its reliable specific variance is relatively high. Accordingly, it captures information about the examinee's intellectual functioning that is not readily available from the other reasoning or memory subtests.

Because the UNIT was developed with a hierarchical theoretical model, a higher-order factor analysis was considered to be the next logical analytic approach (McClain, 1996). As Gorsuch (1983) emphasized, "Rotating obliquely in factor analysis implies that the factors do overlap and that there are, therefore, broader areas of generality than just a primary factor. Implicit in all oblique rotations are higher-order factors" (p. 255).

The second-order factor analytic methods proposed by Schmid and Leiman (1957) were employed with a covariance matrix from the Standard Battery data for both the standardization sample (N = 1,996) and the clinical/exceptional sample (N = 608). These results are presented in Table 5.16 and suggest that a g factor subsumes the subtests but also that unique variance commensurate with the UNIT's conceptual model exists in the subtest scores. First-order memory (I) and reasoning (II) factors emerge, even in the presence of the higher-order g factor. These findings appear stable for the standardization and clinical/exceptional samples.

A third set of factor analyses was conducted with the Standard Battery data. For these analyses, ages were combined to form four age groups: 5–7, 8–10, 11–13, and 14–17. Also for these analyses, a principal axis method was used, in part because the principal components method has sometimes been shown to produce spuriously inflated factor pattern coefficients (Gorsuch, 1983), although the interpretations of the factors tend to be stable across such fluctuations (Thompson & Daniel, 1996).

Again, factors were not restricted to an orthogonal rotation.

Results for each of the two factors across all four age groups appear in Tables 5.17 and 5.18. As before, the tables present both factor pattern coefficients and factor structure coefficients. For three of the four age groups, the largest subtest factor pattern coefficients occurred on the expected factors, with ages 11–13 representing the exception. This finding was further explored by analysis of the factor structure of the Standard Battery separately for each year within this age group (i.e., ages 11, 12, and 13) and for several adjacent ages on each side of the age group (i.e., ages 10, 14, 15, 16, and 17). The results supported the presence of dominant primary factors (Reasoning and Memory) at ages 10, 11, 15, 16, and 17 and the emergence of dominant secondary factors (Symbolic Mediation and Nonsymbolic Mediation) at ages 12, 13, and 14. Apparently, symbolic and nonsymbolic mediation may become more prominent during puberty, possibly due to neurocognitive maturation. Changes in the performance of various cognitive abilities at puberty have been noted for other nonverbal processes (e.g., Newcombe & Bandura, 1983; Young, 1986). In any case, these findings provide further support for both the primary and secondary constructs that the UNIT was developed to assess.

Confirmatory Factor Analyses. Confirmatory factor analyses were conducted as further investigation of the structure of the abilities measured by the UNIT. Prior investigations with confirmatory procedures (Reed & McCallum, 1995) suggested that the two-factor model of memory and reasoning provides a good explanation of the UNIT's subtest interrelationships. Data from the large standardization sample covariance matrix were initially analyzed, followed by examination of the structure for the four constituent age groups. All six UNIT subtests were included in these analyses to optimize the number of degrees of freedom available for the confirmatory factor analyses, as well as to provide a detailed examination of the UNIT Extended Battery.

Because standardization sample data and standard scores were used for these calculations, the stringent normality assumptions required for confirmatory factor analyses were met.

Several premises governed the confirmatory factor analysis of the UNIT standardization data. First, because more than one model can fit a given data set, it is helpful to test the fit of several plausible rival models. In this manner, the relative adequacy of models can be evaluated. Second, it is important to use several fit statistics because various fit statistics evaluate different aspects of a model's adequacy (Thompson & Daniel, 1996). Models were evaluated according to indexes including chi-square (χ2) and related statistics (chi-square divided by degrees of freedom [χ2/df] and the noncentrality–degrees-of-freedom ratio [NC/df]), the goodness-of-fit index (GFI), the goodness-of-fit index adjusted for degrees of freedom (AGFI), the comparative fit index (CFI), the root mean squared residual (RMSR), and the root mean squared error of approximation (RMSEA). These indexes are further described in Jöreskog and Sörbom (1993). Several indexes of model parsimony, including fit statistics multiplied by parsimony ratios, were also included.

Parsimony refers to the simplest acceptable model and is mathematically defined as the ratio of model degrees of freedom to available degrees of freedom. Third, it is helpful to test models that are potentially falsifiable (i.e., models that could be rejected mathematically). Fit may be spuriously high for models with insufficient degrees of freedom.

In a given confirmatory factor analysis, degrees of freedom are obtained as a function of the number of measured variables in the analysis. Each model parameter (e.g., factor pattern coefficient, factor correlation) that is estimated then costs one degree of freedom against this total. When models with zero degrees of freedom are tested, fit is assured. Instead, what is sought in confirmatory factor analysis is good fit for a parsimonious model that could be rejected as fitting, that is, fit for a model that still has relatively many degrees of freedom left unused.

For the present analysis, scores from the six subtests of the Extended Battery were used. The use of six measured variables yields 21 degrees of freedom ([6 × (6 + 1)] / 2). Thus, up to 21 parameters could be estimated. Three theoretical models were specified and analyzed:

Model 1 (One-Factor Model): In this model, all six subtests were predicted to define and to have the best fit on a single general factor, which may be conceptualized as the g factor of intelligence. All six measured variables are correlated with the factor, the factor variance is fixed to 1, and the error variances of the six measured variables are estimated. This model estimates 12 parameters, yielding a model with 9 degrees of freedom (21 − 12).

Model 2 (Two-Factor Memory–Reasoning Model): This model predicted that Symbolic Memory, Spatial Memory, and Object Memory would define a Memory factor, and that Cube Design, Analogic Reasoning, and Mazes would define a Reasoning factor. Factor variances are fixed to 1, and the factor correlation and the error variances of the six measured variables are estimated. This model estimates 13 parameters (6 + 1 + 6), yielding a model with 8 degrees of freedom (21 − 13).

Model 3 (Two-Factor Symbolic–Nonsymbolic Model): In this model, Symbolic Memory, Analogic Reasoning, and Object Memory were expected to define a Symbolic Mediation factor, whereas Cube Design, Spatial Memory, and Mazes were expected to define a Nonsymbolic Mediation factor. Factor variances are fixed to 1, and the factor correlation and the error variances of the six measured variables are estimated. This model estimates 13 parameters (6 + 1 + 6), yielding a model with 8 degrees of freedom (21 − 13).
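
As a small arithmetic check on the degrees-of-freedom bookkeeping above (an illustration, not part of the published analyses), the degrees of freedom for each model equal the number of unique variances and covariances among the measured variables minus the number of estimated parameters.

```python
def cfa_degrees_of_freedom(n_variables: int, n_estimated_parameters: int) -> int:
    """df = unique (co)variance elements minus estimated parameters."""
    unique_elements = n_variables * (n_variables + 1) // 2
    return unique_elements - n_estimated_parameters

# Six measured variables give 21 unique elements.
print(cfa_degrees_of_freedom(6, 12))   # Model 1: 6 loadings + 6 error variances -> 9 df
print(cfa_degrees_of_freedom(6, 13))   # Models 2 and 3: 6 loadings + 1 correlation + 6 error variances -> 8 df
```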

Table 5.19 presents various fit statistics for these models. Several patterns are evident from these results.

First, the fit statistics known to be affected by sample size are considerably inflated in the analysis of the standardization sample data because this sample is large (N = 1,996). Therefore, the χ² and NC/df are not interpreted here. Second, all three models are reasonably plausible. The models have GFI, AGFI, and CFI values that are high and greater than .90 (e.g., Bentler, 1992) as well as small RMSR and RMSEA values. Third, fit statistics tended to be slightly superior for the model positing a two-factor, Memory–Reasoning conceptualization. This model yielded the highest GFI, AGFI, and CFI and the lowest RMSR and RMSEA. The one-factor g model appears to provide the most parsimonious fit to the data.

Additional confirmatory factor analyses were conducted with data from the standardization sample for four separate age groups: 5–7, 8–10, 11–13, and 14–17. Results are presented in Table 5.20 and again support the plausibility of all three models, with Memory–Reasoning showing slightly better fit than the one-factor g model and the two-factor Symbolic–Nonsymbolic model. For every age group, the GFI, AGFI, and CFI of all three models are greater than .90, whereas the RMSR and RMSEA are less than .05. Once again, the one-factor g model appears to provide the most parsimonious fit to the data.
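
The screening rules cited above (GFI, AGFI, and CFI greater than .90; RMSR and RMSEA below .05) amount to a simple threshold check. A minimal sketch follows; the fit values would come from whatever structural equation modeling program produced Tables 5.19 and 5.20, and the example values are illustrative only.

```python
# Rule-of-thumb fit screening as described in the text.

def acceptable_fit(fit: dict) -> bool:
    return (all(fit[k] > 0.90 for k in ("GFI", "AGFI", "CFI"))
            and all(fit[k] < 0.05 for k in ("RMSR", "RMSEA")))

example = {"GFI": 0.98, "AGFI": 0.95, "CFI": 0.97, "RMSR": 0.03, "RMSEA": 0.04}
print(acceptable_fit(example))  # True
```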

In summary, confirmatory factor analyses of the data for the UNIT standardization sample and across four age groups support the interpretation of a single general intelligence factor as well as the primary and secondary scales. All models tend to show a good fit to the data, although the Memory–Reasoning conceptualization tends to be slightly better than the others. At the same time, the one-factor g model provides the most parsimonious explanation of the data. When these findings are interpreted in conjunction with the higher-order exploratory factor analytic results, the six UNIT subtests appear to behave much as expected, in view of the hierarchical theoretical model that guided test construction.


Subtest Variances. Factor analyses also permit the variance in a subtest to be divided into three categories: common (shared), specific (unique), and error. The pattern of variance among these three categories can be helpful for determining whether an individual's performance on a subtest is most likely attributable to a general ability shared among subtests, to a specific ability that is unique to that subtest, or simply to poor subtest precision and reliability. The methods described by Kaufman (1979, 1994), Sattler (1988), and Bracken, McCallum, and Crain (1993) were used to define the sources of variance in the UNIT subtests.

Common (shared) factor variance is that portion of a test's variance that overlaps when all factors are extracted. It is best estimated by the squared multiple correlation of each subtest with all other subtests in the battery. Common factor variance is most interpretable when it is associated with a widely understood construct such as general intelligence (g). In practical terms, g may be estimated from the unrotated pattern coefficient of a subtest on the first factor yielded by the factor analysis. Kamphaus (1993) used pattern coefficients on the first factor in principal components analysis to identify g loadings. By convention, factor pattern coefficients of .70 or greater define good measures of g; coefficients from .50 to .69 define fair measures; and coefficients less than .50 are usually considered poor (e.g., Kaufman, 1994).
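
Kaufman's (1994) convention summarized above is a simple three-way rule; the sketch below merely restates it. The loading values shown are illustrative, not those of Table 5.21.

```python
# Descriptive rating of a g loading, following the convention cited above.

def g_rating(loading: float) -> str:
    if loading >= 0.70:
        return "good"
    if loading >= 0.50:
        return "fair"
    return "poor"

print(g_rating(0.74))  # good
print(g_rating(0.55))  # fair
print(g_rating(0.45))  # poor (any loading below .50, the range reported for Mazes)
```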

As previously reported, the first factor extracted in the exploratory analyses yielded a large eigenvalue. The first unrotated factor accounted for 58.3% of the variance in the standardization sample and for 74.9% in the clinical/exceptional sample. The g loadings from the first unrotated factor for the standardization sample for each of the six UNIT subtests are reported in Table 5.21. Of the six UNIT subtests, five meet or exceed Kaufman's (1994) criterion as good g measures (≥ .70), whereas Mazes alone earned a rating of poor (< .50) as a measure of general intelligence.

Specific variance refers to the amount of variance unique to a subtest that is neither shared with other subtests nor attributable to error. It is computed by subtracting a subtest's squared multiple correlation coefficient from the subtest's average reliability coefficient across ages. Specific variance is separate from and may be compared with error variance. When subtest-specific variance exceeds error variance and is at least 25%, the subtest is considered to have adequate specificity (e.g., Kaufman, 1994; Sattler, 1988).

Finally, error variance is defined as a random source of variation remaining after common and specific variances are accounted for (e.g., Gorsuch, 1983). Kamphaus (1993) used the average reliability across ages (computed with Fisher's z transformation) to compute error variance.
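
Taken together, the three definitions above reduce to simple arithmetic on two quantities per subtest: its squared multiple correlation with the other subtests and its average reliability. The sketch below uses illustrative values rather than the actual entries of Table 5.21.

```python
# Variance decomposition for a single subtest, as described above: common
# variance is the squared multiple correlation (SMC) with the other subtests,
# specific variance is average reliability minus the SMC, and error variance
# is one minus the average reliability.

def decompose(smc: float, avg_reliability: float) -> dict:
    specific = avg_reliability - smc
    error = 1.0 - avg_reliability
    return {
        "common": smc,
        "specific": specific,
        "error": error,
        # Rule of thumb cited above: specific variance of at least .25 that
        # also exceeds error variance indicates adequate specificity.
        "adequate_specificity": specific >= 0.25 and specific > error,
    }

print(decompose(smc=0.55, avg_reliability=0.86))
# common ≈ .55, specific ≈ .31, error ≈ .14, adequate_specificity True
```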

As the data in Table 5.21 show, all of the UNIT subtests except Spatial Memory have adequate specific variance to justify interpretation if their scores differ significantly from overall performance on their respective scales. Performance on Spatial Memory should be interpreted somewhat more cautiously at the subtest level, because it is the most "g-loaded" of the UNIT subtests and, therefore, has the most common variance. The specific variance of the UNIT subtests consistently exceeds error variance; this criterion justifies interpretation of unique abilities in the subtests. The error variance of Mazes, however, is sufficiently high to suggest that performance on this subtest may be influenced to a moderate degree by factors other than ability.


External Evidence of Validity

Evidence of test score validity may also be accumulated through the study of relationships between test scores and variables external to the test. Insofar as external criteria constitute independent sources of information, they are particularly important in the process of validation. As originally suggested by Campbell and Fiske (1959), test scores should be expected to be related to measures of the same psychological construct (convergent evidence of validity) and comparatively unrelated to measures of different psychological constructs (discriminant evidence of validity). Test scores may be validated against criterion scores that are obtained at about the same time (concurrent evidence of validity) or against criterion scores that are obtained at some future date (predictive evidence of validity). The accumulation of external evidence of test score validity becomes most important when test results are generalized to increasingly varied situations and when the consequences of testing reach beyond the test's original intent.

The relationships between UNIT scores and external measures of intelligence, measures of academic achievement, clinical diagnosis, and educational exceptionality are described in this section. The chapter concludes with a description of ways in which the test may be appropriately used for clinical and educational decision making.

Correlational Studies With Other Measures of Intelligence

Correlational studies based on a concurrent research methodology were conducted to establish convergent evidence of UNIT validity. Specifically, UNIT performance was studied in relation to performance on other measures of intelligence, including the WISC–III (Wechsler, 1991), the Tests of Cognitive Ability of the Woodcock-Johnson Psycho-Educational Battery–Revised (WJ–R; Woodcock & Johnson, 1989/1990), the Batería Woodcock–Muñoz Pruebas de habilidad cognitiva–Revisada (Batería–R; Woodcock & Muñoz-Sandoval, 1996), the Kaufman Brief Intelligence Test (K–BIT; Kaufman & Kaufman, 1990), the Matrix Analogies Test (MAT; Naglieri, 1985a), the Standard Progressive Matrices (Raven's SPM; Raven, 1960), and the Test of Nonverbal Intelligence–Second Edition (TONI–2; Brown, Sherbenou, & Johnsen, 1990). Data from clinical samples as well as from samples receiving no special services were studied. The clinical samples included individuals with learning disabilities, mental retardation, hearing impairments, and intellectual giftedness. Several racial, ethnic, and minority groups were also sampled, including Native Americans, Ecuadorians, and individuals whose first language is not English. In each study, correlations were corrected to account for restriction or expansion in variance in both the predictor and criterion variables. Both obtained and corrected correlations are reported.
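
The correction for restriction or expansion of range is not reproduced as a formula in this chapter. For illustration only, the sketch below implements the commonly used single-variable correction (Thorndike's Case 2), assuming a population standard deviation of 15 for IQ-metric scores; the manual's corrections were applied to both the predictor and criterion variables and may differ in detail.

```python
import math

# Illustrative correction of an obtained correlation for restriction of range
# (Thorndike's Case 2). Not necessarily the exact formula used in the manual.

def correct_for_range(r_obtained: float, sd_sample: float, sd_population: float = 15.0) -> float:
    k = sd_population / sd_sample
    return (r_obtained * k) / math.sqrt(1.0 - r_obtained**2 + (r_obtained**2) * k**2)

# An obtained r of .60 in a sample whose SD (10) is restricted relative to the
# population SD (15) corrects upward to about .75.
print(round(correct_for_range(0.60, sd_sample=10.0), 2))  # 0.75
```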

Correlations With the WISC–III®

Correlations between the UNIT and the WISC–III (Wechsler, 1991) were obtained for four separate samples: examinees with learning disabilities (n = 61), examinees with mental retardation (n = 59), examinees classified as intellectually gifted (n = 43), and Native American examinees (n = 34). The 21 female and 40 male examinees with learning disabilities ranged in age from 7 to 16 years (M = 11.2, SD = 3.1). Of this sample, 56 were White, 4 African American, and 1 Hispanic. The 24 female and 35 male examinees with mental retardation ranged in age from 6 to 17 years (M = 12.0, SD = 3.4). Of this sample, 40 were White and 19 African American. The 20 female and 23 male intellectually gifted examinees ranged in age from 6 to 16 years (M = 10.9, SD = 2.7). The Native American sample consisted of 18 female and 16 male examinees ranging in age from 6 to 16 years (M = 11.2, SD = 2.1). Detailed selection criteria are provided in the section on studies with clinical and exceptional samples. The UNIT and the WISC–III were administered to each of the samples in counterbalanced order. One methodological limitation of these investigations must be noted: Scores on the WISC–III are likely to be closer to expected ranges (e.g., FSIQ ≤ 70 for individuals with mental retardation) because the WISC–III, in fact, served as a basis for classification. This phenomenon, sometimes called criterion contamination, refers to the circular logic inherent in comparison of the performance of individuals against the very measures with which they were originally identified.

Table 5.22 presents the correlations, means, and standard deviations for the UNIT Abbreviated, Standard, and Extended composite scores and the WISC–III Verbal IQ (VIQ), Performance IQ (PIQ), and FSIQ and index scores for all four samples. Obtained correlations were corrected for restriction or expansion in range for both the criterion and predictor variables.

Relationships between the UNIT and the WISC–III for the sample with learning disabilities consisted of strong, positive correlations and comparable mean scores. The mean WISC–III FSIQ of 92.44 is very comparable to the mean UNIT Abbreviated, Standard, and Extended FSIQs of 94.56, 91.69, and 91.44, respectively. The UNIT Abbreviated, Standard, and Extended FSIQs correlated strongly and positively with the WISC–III FSIQ: .78, .84, and .83, respectively.

For the examinees with mental retardation, reduced correlations due to truncated ranges of scores were expected. This sample obtained a lower mean FSIQ (53.54) on the WISC–III than on any of the three UNIT batteries. The mean Abbreviated FSIQ (64.98) was 11 points higher than the WISC–III FSIQ. The corrected correlation (.86) between these two scores demonstrated a strong positive relationship between the two instruments. Approximately 74% of the variance was shared between the two global scores. The mean UNIT Standard FSIQ (61.19) and Extended FSIQ (59.61) reduced the differences to 8 and 6 points, respectively. The Standard and Extended FSIQs also evidenced strong positive correlations with the WISC–III score (.84 and .88, respectively). Because the UNIT bears more resemblance to the WISC–III Performance Scale than to the Verbal Scale, mean UNIT FSIQs were compared to the WISC–III PIQ. The WISC–III PIQs were more closely aligned with the UNIT FSIQs than were the WISC–III VIQs, as anticipated. The differences between the UNIT and WISC–III scores were also reduced when the PIQ was used as the criterion measure. For example, the UNIT Standard FSIQ (61.19) and WISC–III PIQ (58.63) differed by only 2.56 points. Because the WISC–III Performance Scale is "language reduced" compared to the Verbal Scale, comparing the UNIT FSIQ and WISC–III PIQ is reasonable.

Because intellectually gifted individuals typically exhibit higher verbal than performance abilities, differences between the UNIT and WISC–III FSIQs in favor of the WISC–III were expected. Moreover, because the WISC–III is commonly used as a criterion measure for identifying giftedness, the issue of criterion contamination was prominent. As expected, this sample obtained a mean WISC–III FSIQ of 128.30 and mean UNIT Abbreviated, Standard, and Extended FSIQs of 119.42, 120.44, and 117.26, respectively. When the WISC–III PIQ instead of the FSIQ was used as the criterion measure, the differences between instruments were reduced, with differences ranging from 3 to 6 points. The UNIT Abbreviated, Standard, and Extended FSIQs correlated .75, .82, and .88, respectively, with the WISC–III FSIQ.

Because the UNIT was designed to reduce the effects of culture on cognitive ability scores, its relationship with the WISC–III was also studied with a sample of Native Americans. Mean FSIQs on the Abbreviated, Standard, and Extended batteries for this group were 100.65, 99.76, and 101.00, respectively. The mean WISC–III VIQ, PIQ, and FSIQ were 95.41, 103.06, and 98.85, respectively. These scores are very comparable, although the score reflecting the most verbal content (i.e., the VIQ) is lowest. The UNIT scores are very similar to the PIQ, which is consistent with expectations. Corrected correlation coefficients between the UNIT FSIQs and the WISC–III VIQ, PIQ, and FSIQ ranged from .82 to .87 for the Abbreviated Battery, from .70 to .86 on the Standard Battery, and from .57 to .68 on the Extended Battery. The correlations between the WISC–III PIQ and the three UNIT FSIQs (i.e., Abbreviated, Standard, and Extended) are higher than those between the VIQ and the UNIT FSIQs. These results again show more correspondence between the UNIT and the language-reduced portion of the WISC–III.

In summary, the FSIQs on the WISC–III and the UNIT Abbreviated, Standard, and Extended batteries correlated consistently from the .70s to the .80s, with the exception of a single correlation of .65 for the Native American sample. The consistency in the correlations between the WISC–III and the three UNIT batteries suggests that the UNIT batteries share significant overlap and that each can be used with confidence as a measure of global intelligence. This claim is supported further by the similarity in the mean scores produced by the three UNIT batteries across the different exceptional groups. The largest discrepancy across the three UNIT batteries was the 5-point difference between the Abbreviated and Extended FSIQs for the sample with mental retardation.


The discrepancies between the UNIT FSIQ and WISC–III FSIQ for the examinees with mental retardation and the intellectually gifted sample were not unexpected, in view of the nature of the two tests and the issue of criterion contamination. It has been pointed out that intellectually gifted individuals tend to score lower on performance than on verbal scales, whereas individuals with mental retardation tend to score higher on performance than on verbal scales (Wechsler, 1991). Also, because individuals with mental retardation and intellectually gifted individuals are commonly identified by their scores on the WISC–III, it was expected that WISC–III scores would more likely fall within the extreme ranges of ability. When the WISC–III PIQ was used as the criterion against which the UNIT FSIQ was compared, the differences between the two instruments ranged from 1 to 6 points for those with mental retardation and from 3 to 6 points for the intellectually gifted examinees.

Correlations With the WJ–R®

The UNIT and the WJ–R Tests of Cognitive Ability (Woodcock & Johnson, 1989/1990) were administered to 88 examinees in regular education classes. The 54 female and 34 male participants ranged in age from 6 to 16 years (M = 11.1, SD = 2.8). All of the participants were White and non-Hispanic. The correlations, means, and standard deviations for the UNIT composite scores and the WJ–R subtest and composite scores are presented in Table 5.23.

The corrected correlations between the UNIT FSIQs and the WJ–R Broad Cognitive Ability score are .80, .83, and .82, respectively, for the Abbreviated, Standard, and Extended batteries. Scores on the two tests were also similar, with the mean WJ–R score (105.31) slightly higher than the mean UNIT Abbreviated, Standard, and Extended FSIQs (102.52, 102.59, and 102.25, respectively). Correlations with the WJ–R subtests were not as strong as those with the broad composite scales, ranging from .18 between the Standard Battery Reasoning Quotient and Visual Matching to .79 between the Extended FSIQ and Analysis–Synthesis. These results suggest that composite scales from the WJ–R and the UNIT measure similar constructs, although specific subtests show weaker relationships.

Correlations With the Batería–R

The Batería–R (Woodcock & Muñoz-Sandoval, 1996) is the parallel Spanish version of the WJ–R, developed with calibrations and norms based on data gathered from monolingual Spanish-speaking individuals and individuals whose dominant language is Spanish. The relationship between the UNIT, as a nonverbal test, and the Batería–R, as a Spanish-language test, was investigated for two samples of examinees whose first language is Spanish. One sample consisted of 27 students in bilingual education classes. The other sample included 26 students receiving services for English as a second language (ESL). All of the students were native Spanish speakers whose English proficiency was either limited (bilingual) or high (ESL). The 14 female and 13 male examinees in the bilingual education sample ranged in age from 8 to 15 years (M = 10.9, SD = 1.9). The 11 female and 15 male ESL students ranged in age from 7 to 15 years (M = 10.6, SD = 2.0). The correlation coefficients, means, and standard deviations for the two samples are presented in Table 5.24. Correlation coefficients are corrected for restriction and expansion of range. The relationships between UNIT and Batería–R performance by these samples are discussed in terms of global scores and corrected coefficients.

For both samples, mean Batería–R scores are very low compared to UNIT mean scores. The Batería–R Broad Cognitive Ability Early Developmental and Broad Cognitive Ability (BCA) standard scores are 77.11 and 75.44 for the bilingual education sample and 69.42 and 62.81 for the ESL sample. In contrast, the mean UNIT FSIQs on the Abbreviated, Standard, and Extended batteries were 93.81, 92.30, and 93.41, respectively, for the bilingual sample and 95.19, 95.54, and 96.88 for the ESL sample. It should be noted that the Batería–R scores revealed considerable variability, with standard deviations greater than 15 (the general population standard deviation) for most of the Batería–R measures for both groups. In contrast and with one exception, the standard deviations of the UNIT scores were less than 15 on all the UNIT scales of the three batteries (i.e., Abbreviated, Standard, and Extended). When correlations were corrected for expansion or restriction in range, coefficients defining the relationship between the Batería–R Cognitive Ability Early Developmental Battery and the UNIT Abbreviated, Standard, and Extended FSIQs were .72, .39, and .55, respectively, for the bilingual education sample and .08, .17, and .20, respectively, for the ESL sample. Coefficients between the Batería–R Broad Cognitive Ability composite and the UNIT FSIQs showed a similar pattern for both samples but were weaker. The differences between the two samples in the magnitudes of the correlations (moderate to strong correlations for the bilingual education sample and negligible correlations for the ESL sample) suggest that the Batería–R is functioning differently for these two groups. Although Spanish is the first language for both samples, the ESL group has stronger English-language skills than the bilingually educated group. These stronger skills may interfere with performance on measures like the Batería–R, which was developed with monolingual or nearly monolingual Spanish-speaking examinees.

These coefficients suggest little overlap between the Batería–R and the UNIT; however, the Batería–R scores may not optimally reflect the ability of these examinees. As noted previously, the Batería–R scores are more variable than would be predicted from the population standard deviation reported in the Batería–R manual. In addition, the Batería–R scores are systematically and considerably lower than the UNIT scores. More research is needed to explore the use of nonverbal and translated tests for the assessment of cognitive ability of populations with limited English proficiency.

Correlations With the K–BIT

This study provided further evidence of the concurrent validity of the UNIT. The UNIT and K–BIT (Kaufman & Kaufman, 1990) were administered in a counterbalanced order to a sample of 31 examinees. The 16 female and 15 male examinees ranged in age from 5 to 14 years (M = 10.71, SD = 2.82). Of the sample, 19 were White and 12 were African American. The purpose of the study was to compare a language-loaded intellectual screening test, the K–BIT, to a nonverbal measure of intelligence, the UNIT. The correlations, means, and standard deviations for the UNIT and the K–BIT are presented in Table 5.25.

The correlation between the K–BIT and the UNIT Abbreviated Battery FSIQ, both intended as screeners of intellectual functioning, was .71; the correlations between the K–BIT and the UNIT Standard and Extended FSIQs were .82 and .79, respectively. Both instruments produced global mean scores that were within 4 points of each other. In instances in which the use of a language-loaded intellectual screener is not appropriate (e.g., for the assessment of intellectual functioning of individuals who are deaf or hearing impaired) or not desirable (e.g., when there are cultural or language limitations), the UNIT Abbreviated Battery correlations with the K–BIT suggest that the Abbreviated Battery may serve as a useful and appropriate screener of cognitive ability.

Correlations With Three Progressive Matrices

Although many nonverbal tests, such as Raven's SPM (Raven, 1960), the MAT (Naglieri, 1985a), and the TONI–2 (Brown et al., 1990), employ a simple matrix item format, the UNIT assesses intelligence in a more comprehensive manner, sampling a variety of intellectual abilities (i.e., reasoning, memory, and symbolic and nonsymbolic mediation) through a variety of markedly different item formats. For this investigation, the UNIT was compared with three typical matrix-type tests: Raven's SPM, the MAT, and the TONI–2.

Two samples participated in this study. With the first sample, the cross-cultural applications of the UNIT with native Spanish speakers were investigated. For this purpose, the comparative intellectual abilities of 27 examinees from Ecuador as assessed by the UNIT, the MAT, and Raven's SPM were examined. The 14 female and 13 male participants ranged in age from 8 to 12 years (M = 9.44, SD = 0.93). The second sample consisted of 13 female and 17 male examinees who were deaf or hearing impaired. Of this sample, 93% were White and 7% were African American. The UNIT and the TONI–2 were administered to this sample in counterbalanced order. The correlations, means, and standard deviations for both samples are presented in Table 5.26.

The UNIT, the MAT, and Raven's SPM all yielded very similar mean scores, ranging from 98.07 to 101.89 (after Raven's SPM T scores were converted to deviation quotients). The MAT total test score correlated strongly with the UNIT Abbreviated, Standard, and Extended FSIQs (.79, .83, and .82, respectively); the Raven's SPM total score correlations were lower (.50, .56, and .59, respectively). In view of the comparable mean scores and strong correlations between the UNIT and the MAT and Raven's SPM for this Ecuadorian sample, the UNIT appears to be reasonably comparable to both instruments. When a more comprehensive, multidimensional instrument is needed, the UNIT will yield additional clinical information.

The performance of the deaf and hearing-impaired sample on the UNIT and the TONI–2 was comparable. The mean UNIT Abbreviated FSIQ of 97.17 is close to the global TONI–2 score of 93.43. The Standard and Extended FSIQs of 91.07 and 91.03, respectively, are also of similar magnitude. The UNIT Abbreviated, Standard, and Extended FSIQs had correlations of .68, .63, and .56 with the TONI–2 Quotient. These coefficients indicate that both instruments assess some similar constructs, although the UNIT provides a broader measure.

Correlational Studies With Measures of Academic Achievement

The correlational studies were extended to provide evidence of the efficacy of the UNIT in predicting academic achievement, relative to the predictive power of other measures of intelligence. Specifically, UNIT results were studied as they relate to measures of academic achievement when language-loaded and nonverbal measures of intelligence had also been administered. For children with limited English proficiency, correlations between verbally loaded measures of intelligence and measures of academic achievement may be spuriously high because of the intervening influence of language. The measures of academic achievement examined in these studies were the Tests of Achievement of the WJ–R (Woodcock & Johnson, 1989/1990), the Spanish Form of the Woodcock Language Proficiency Battery–Revised (WLPB–R; Woodcock, 1991), the Wechsler Individual Achievement Test (WIAT; The Psychological Corporation, 1992), and the Peabody Individual Achievement Test–Revised (PIAT–R; Markwardt, 1989). Several populations were included in these studies: individuals with learning disabilities, mental retardation, intellectual giftedness, and deafness and hearing impairments. In each study, correlations were corrected to account for restriction or expansion in variance in both the predictor and criterion variables.

Correlations With the WJ–R

The Tests of Achievement of the WJ–R (Woodcock & Johnson, 1989/1990), along with the UNIT and the WISC–III (Wechsler, 1991), were administered to three samples. The purpose of the study was to compare the efficacy of the two intelligence tests in predicting academic achievement as assessed by a widely accepted measure. The three samples were individuals classified as intellectually gifted (N = 43), individuals with learning disabilities (N = 59), and individuals with mental retardation (N = 55). The 20 female and 23 male intellectually gifted examinees ranged in age from 6 to 16 years (M = 10.93, SD = 2.71). Of the sample, 39 were White and 4 were Asian American. The 21 female and 38 male examinees with learning disabilities ranged in age from 7 to 16 years (M = 11.14, SD = 3.11). Of the sample, 55 were White, 3 African American, and 1 Native American, and 1 was of Hispanic origin. The 22 female and 33 male examinees with mental retardation ranged in age from 6 to 17 years (M = 11.76, SD = 3.35). Of this sample, 36 were White and 19 were African American. Table 5.27 presents the means, standard deviations, and corrected and obtained correlations for the UNIT and WISC–III and the WJ–R Tests of Achievement for the three samples.

For the gifted sample, the UNIT FSIQs for all three batteries had high correlations with the WJ–R Broad Mathematics, Broad Knowledge, and Skills clusters but low correlations with the Broad Reading and Broad Written Language clusters. These correlations are similar to the WISC–III PIQ correlations with those same WJ–R composites. The mean WISC–III VIQ, PIQ, and FSIQ tend to be somewhat higher than the WJ–R composite scores, whereas the UNIT Abbreviated, Standard, and Extended FSIQs are closer in magnitude to the achievement scores. The sample with learning disabilities showed a similar pattern of correlations for the UNIT FSIQs. The WISC–III VIQ, PIQ, and FSIQ also have relatively low correlations with the WJ–R Broad Reading and Broad Written Language composite scores and relatively high correlations with the WJ–R Broad Mathematics, Broad Knowledge, and Skills composite scores. This pattern is probably due to the prevalence of reading disorders in this sample. Mean scores on both intelligence tests tend to be slightly higher than the achievement scores for the Broad Reading, Broad Written Language, and Skills clusters, and similar for the Broad Mathematics and Broad Knowledge clusters. The sample with mental retardation showed correlations from .40 to .63 between the UNIT and the WJ–R scores and from .37 to .70 between the WISC–III and WJ–R scores. Mean scores on both intelligence tests were substantially higher than the WJ–R scores.

Correlations With the WLPB–R

For this investigation, the relative prediction by the UNIT and the Batería–R (Woodcock & Muñoz-Sandoval, 1996) of reading ability as measured by the Spanish WLPB–R (Woodcock, 1991) was examined for two samples. The first sample comprised 27 examinees in bilingual education, and the other consisted of 26 examinees receiving ESL services. The 14 female and 13 male bilingual education examinees ranged in age from 8 to 15 years (M = 10.9, SD = 1.9). The ESL students ranged in age from 7 to 15 years (M = 10.6, SD = 2.0). Table 5.28 presents the means, standard deviations, and corrected and observed correlations between the UNIT and Batería–R and the WLPB–R.

The results show that both the UNIT and the Batería–R are comparatively better predictors of reading achievement for bilingually educated individuals, who have limited English proficiency, than for ESL individuals, who have higher English proficiency. As expected, the UNIT, with its nonverbal approach, had moderate corrected correlations (Standard FSIQ r = .39 and Extended FSIQ r = .55) with reading comprehension, compared to the Batería–R's language-loaded correlation (BCA Est. r = .91) with reading comprehension. As noted previously, it is possible that the ESL group's stronger second-language skills may interfere with mastery of the first language and thereby interfere with performance on cognitive and achievement measures. For both samples, mean scores on the Batería–R are much lower than mean WLPB–R scores, with differences ranging from 6.70 to 44.10 standard score points. Differences between the UNIT FSIQs and the WLPB–R standard scores range from 10 points above to 26.50 points below.

Notably, the Symbolic Quotient from the UNIT correlated substantially more strongly with the three WLPB–R reading subtests (range from .23 to .57) than did the Nonsymbolic Quotient (range from −.02 to .36) for the bilingual education sample. Negligible correlations between reading and intellectual ability were evident for the ESL sample. For the bilingual education sample, the UNIT Symbolic subtests assessed the ability of the examinees to use language, as defined by the two achievement subtests, better than did the UNIT Nonsymbolic subtests.

Correlations With the WIAT®

For this study, the comparative prediction of performance on the WIAT (The Psychological Corporation, 1992) by the UNIT and the K–BIT (Kaufman & Kaufman, 1990) was examined. The sample consisted of 16 female and 15 male participants between the ages of 5 and 14 years (M = 10.71, SD = 2.82). Of this sample, 19 were White and 12 were African American. The UNIT, K–BIT, and WIAT were administered in a counterbalanced order. Results are reported in Table 5.29.

As expected, the language-loaded intellectual screening test, the K–BIT, correlated more strongly with the WIAT than did the UNIT. However, both the UNIT and the K–BIT correlated strongly and positively with the WIAT. The K–BIT Composite score correlated at .86 with the WIAT Total Composite score, and the UNIT Abbreviated, Standard, and Extended FSIQs correlated at .53, .62, and .59, respectively, with that WIAT score. The UNIT FSIQs predicted performance on the WIAT Basic Reading and Mathematics Reasoning subtests better than they predicted the participants' Language or Writing skills. Correlations of the UNIT Abbreviated, Standard, and Extended FSIQs with Basic Reading were .66, .70, and .74, respectively, and with Mathematics Reasoning, .64, .71, and .65, respectively. Among the UNIT scales, the Reasoning Quotient and Symbolic Quotient were especially strong predictors.

Correlations With the PIAT–R

For this study, the ability of the UNIT and the TONI–2 (Brown et al., 1990) to predict Reading Comprehension and Mathematics performance on the PIAT–R (Markwardt, 1989) for deaf or hearing-impaired examinees was compared. All of the participants attended a school for deaf and hearing-impaired individuals. The UNIT, the TONI–2, and the PIAT–R were administered in counterbalanced order to 13 female and 17 male examinees ranging in age from 5 to 17 years. Of the sample, 28 were White and 2 were African American. Results of this study are presented in Table 5.30. Only global scores are presented, and, as in all previous studies, correlations were corrected for restriction and expansion in range.


All mean UNIT FSIQs were higher than the PIAT–R Reading Comprehension and Mathematics scores of 72.77 and 69.93, respectively. The academic achievement of this sample was lower than expected from the nonverbal ability scores; this pattern reflects the heavy influence of language on performance on tests of academic achievement. The correlation coefficients, corrected for restriction and expansion in range, reveal a stronger relationship between the UNIT and the PIAT–R than between the TONI–2 and the PIAT–R. For example, the correlations between the PIAT–R Reading Comprehension and Mathematics subtests and the Abbreviated FSIQ were .43 and .49, respectively; with the Standard FSIQ, .53 and .50, respectively; and with the Extended FSIQ, .49 and .47, respectively. In contrast, the coefficients between the TONI–2 full scale standard score and the PIAT–R Reading Comprehension (.18) and Mathematics (.27) subtests are considerably lower. The results indicate that the UNIT FSIQs explain more variance in academic achievement. The UNIT also appears to predict academic achievement for individuals with hearing impairments with considerably more precision than does the TONI–2. This increased predictive power may be due to the greater representation of cognitive abilities sampled (e.g., symbolic mediation) by the UNIT.

Studies With Clinical and Exceptional Samples

Evidence of the UNIT's validity for clinical and educational decision making was gathered from studies of test performance by examinees in several widely occurring diagnostic and exceptional groups. According to recent federal data (U.S. Department of Education, 1995), approximately 11.2% of children are served by federally supported programs for students with disabilities, with some of the most common disabilities including specific learning disabilities (5.7%), speech or language impairments (2.4%), mental retardation (1.3%), and serious emotional disturbance (1.0%). Because of the high representation (6.2%) of gifted students in educational settings (U.S. Department of Education NCES, 1996), students with this exceptionality were also included in these studies.

Individuals in each of these groups were administered the UNIT, and their performance was then compared to that of a demographically similar sample. The clinically diagnosed or exceptional examinees were matched, on a case-by-case basis, to an equal number of control examinees drawn from the standardization sample and with no known diagnosis, exceptionality, or special education services. Examinees in the target and control groups were matched according to age, sex, race, ethnicity, and parent education level. A random-number generator was used to select final matches when more than one examinee with the specified demographic characteristics was available in the standardization sample. Accordingly, UNIT performance by these two groups, which differed only on the basis of clinical diagnosis or exceptionality, could be readily compared. Group means, standard deviations, standard score differences, and effect sizes (a term describing the magnitude of the score differences occurring between groups) are presented for each of the comparisons. Cohen's d was selected as an index of effect size due to its widespread acceptance and simplicity. Computed from the difference between group means divided by the standardization sample standard deviation, an effect size of 0.2 is considered small, 0.5 is considered medium, and 0.8 and greater is considered large (Cohen, 1977/1987). In brief, Cohen's d is described as the average difference between groups in standard deviation units. For example, if the mean difference in performance on a UNIT subtest between a special group and a control group is 3 points, then the effect size is 1.0 (3 points difference ÷ 3 points per SD). Statistical significance testing is not reported because of its dependence on sample size (Thompson, 1996).
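
The effect-size computation described above involves only a subtraction and a division. A minimal sketch follows; the group means are hypothetical, and the denominator follows the text (the standardization-sample standard deviation, which is 3 points for a UNIT subtest in the worked example above).

```python
# Cohen's d as described above: mean difference divided by the
# standardization-sample standard deviation, with Cohen's (1977/1987)
# descriptive labels.

def cohens_d(mean_a: float, mean_b: float, sd_norm: float) -> float:
    return abs(mean_a - mean_b) / sd_norm

def describe(d: float) -> str:
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "negligible"

# The worked example from the text: a 3-point subtest difference with SD = 3.
d = cohens_d(10.0, 7.0, sd_norm=3.0)   # hypothetical subtest means
print(d, describe(d))                  # 1.0 large
```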

The accuracy of the decision-making classifications in two of the clinical/exceptional groups (examinees with mental retardation and examinees classified as intellectually gifted) was also examined. Cut scores for both the clinical/exceptional group and the matched control group were used to determine diagnostic sensitivity, diagnostic specificity, and the total hit rate. Sensitivity is the proportion of cases in which a clinical condition is detected when it is in fact present (true positive). Specificity is the proportion of cases for which a diagnosis is rejected when rejection is in fact warranted (true negative). The total hit rate is the percentage of correct classifications across both populations.
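
These three indices can be computed directly from counts of true and false positives and negatives. The sketch below is illustrative; the cell counts are reconstructed so that the output matches the Extended Battery figures reported later for the sample with mental retardation (78.6% sensitivity, 95.2% specificity, 86.9% total hit rate), since the actual counts are not tabled here.

```python
# Sensitivity, specificity, and total hit rate as defined above.

def classification_accuracy(tp: int, fn: int, tn: int, fp: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),            # condition present and detected
        "specificity": tn / (tn + fp),            # condition absent and rejected
        "total_hit_rate": (tp + tn) / (tp + fn + tn + fp),
    }

# Reconstructed counts for a clinical group and a matched control group of 84 each.
print(classification_accuracy(tp=66, fn=18, tn=80, fp=4))
# sensitivity ≈ .786, specificity ≈ .952, total_hit_rate ≈ .869
```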

In general, UNIT classification accuracy compares favorably with that of other nonverbal intellectual measures (e.g., Roid & Miller, 1997). However, data regarding classification accuracy should be interpreted conservatively (a) because, in practice, classification decisions should be based on the results of multiple sources of information rather than on test results from a single measure and (b) because the consequences of a classification decision can be far-reaching and long-lasting. A false-negative classification, meaning an individual is incorrectly classified as not needing special education services, could mean the denial of needed services to that individual. Alternatively, a false-positive classification, by which special services are recommended for a normally functioning individual, could result in unfair labeling of that individual.

Classification accuracy can be compromised by criterion contamination, that is, the tendency of the criterion used to identify children as exceptional (e.g., WISC–III FSIQ) to be less than perfectly valid. Criterion contamination can therefore contribute to the presumed error of any independent predictor variable. Classification accuracy can be judged only against the accuracy of the diagnostic/exceptional criteria themselves. The frequent use of instruments for classification sometimes results in performance on those instruments becoming the criteria for classification, rather than the original construct associated with the exceptionality (cf. the "bootstraps" effect as described by Cronbach & Meehl, 1955). For example, individuals who are identified as intellectually gifted on the basis of intelligence tests that are language-loaded tend to constitute a verbally precocious sample of gifted individuals. The use of language-loaded intelligence tests obfuscates the identification of nonverbally gifted individuals. Over time, a disproportionately high number of individuals with high verbal skills will be classified as gifted, whereas a disproportionately low number of individuals with high nonverbal skills will be identified.


Individuals With Speech and Language Impairments

A sample of 57 examinees who had been diagnosed with communication disorders and were receiving special education services for persons with speech and language impairments was administered the UNIT. The communication disorders were diagnosed according to criteria in the Diagnostic and Statistical Manual of Mental Disorders–Fourth Edition (DSM–IV; American Psychiatric Association, 1994) and included expressive language disorder, mixed receptive–expressive language disorder, and phonological disorder. The examinees in this sample tended to demonstrate deficiencies in language, articulation, voice, or fluency. Accordingly, examinees with either language impairment or speech impairment or both were tested. Most were receiving services from speech and language therapists for an average of 10.02 hours (SD = 8.95) per week. Individuals in this sample had been receiving special education services for an average of slightly over 2 years (M = 2.42, SD = 1.57). Individuals were excluded if they had received a diagnosis of mental retardation.

The average age of the 25 female and 32 male examinees in this sample was 7.84 years (SD = 2.58). Of the sample, 34 were White, 10 African American, and 13 Other; 1 was of Hispanic origin and 56 were of non-Hispanic origin. Parent education level was uniformly distributed from low to high educational levels (<HS, 25%; HS, 25%; Some College, 35%; and ≥4 Years College, 14%). The matched comparison group from the standardization sample was statistically similar in all demographic characteristics but had one more White individual and one fewer in the Other category. The performance results of the two samples are provided in Table 5.31.

The mean performance by the clinical and control samples differed by about 0.5 SD; all of the mean scores of the clinical sample were lower than those of the control sample. The Standard and Extended FSIQs of the two groups differed in the medium effect-size range; the effect size of the Abbreviated FSIQ difference was small (0.28). Differences of medium effect size were noted for the Reasoning, Symbolic, and Nonsymbolic scales for both the Standard and Extended batteries; Memory Quotients tended to show the smallest effect-size difference. Mean subtest scores differed in the medium effect-size range for Analogic Reasoning, Object Memory, and Mazes but in the small effect-size range for the other subtests.

These findings are commensurate with recent reports that individuals with language impairments perform at lower levels relative to unimpaired individuals on a variety of cognitive tasks, including but not limited to those requiring symbolic representation ability (Montgomery, Windsor, & Stark, 1991). A common symbolic deficit related to the generation and manipulation of various mental representations is considered by many investigators to underlie both the linguistic and nonlinguistic difficulties of individuals who are language impaired (Montgomery et al., 1991; Savich, 1984). It is unclear whether the presence of these subtle but sometimes pervasive cognitive deficits is the cause or the effect of language impairment, but nonverbal measures of intelligence are usually considered most appropriate for assessing the level of cognitive functioning of individuals with language impairments. Children with speech problems (e.g., developmental articulation disorder) but without language disorders are generally expected to show intellectual development within normal limits, provided no concomitant conditions are present (Cantwell & Baker, 1987).


Individuals With Learning Disabilities

The UNIT was administered to 205 examinees diagnosed with learning disabilities who were receiving special education services and to a matched control sample. The diagnoses were made by independent school psychologists and were based primarily on ability–achievement discrepancies as well as on achievement substantially below that expected for age and schooling (e.g., American Psychiatric Association, 1994). Individuals in this sample had been receiving special education services for an average of over 3 years (M = 3.19, SD = 2.48) and were currently spending more than one day each week in special education (mean hours of special education per week = 8.47, SD = 6.89).

The average age of the 84 female and 121 male examinees in the clinical sample was 11.13 years (SD = 2.88). Of the sample, 178 were White, 19 African American, and 8 Other; the sample had 16 Hispanic and 189 non-Hispanic participants. Parent education level was uniformly distributed from low to high educational levels (<HS, 23%; HS, 39%; Some College, 23%; and ≥4 Years College, 15%). The matched comparison group from the standardization sample was statistically similar in all of these demographic characteristics. The performance results of the two samples are presented in Table 5.32.

Score differences between the clinical sample and the control sample were salient; that is, there were medium to large effect sizes for nearly all UNIT indexes. Mean scores of the examinees with learning disabilities were lower than those of the control sample on every UNIT index. The Abbreviated FSIQ difference was in the medium effect-size range; large effect sizes were noted for the Standard and Extended FSIQ differences. Score differences on all of the scales had medium effect sizes, with the largest mean differences noted for the Memory Quotient and Symbolic Quotient. With the sole exception of Mazes, for which the mean score difference was in the small effect-size range, all subtest effect sizes were in the medium range.

The mean IQs on the UNIT and a traditional language-loaded intelligence test (i.e., the WISC–III; Wechsler, 1991) were compared for a subsample of the individuals with learning disabilities who were administered both tests. For the subsample of 61 participants with learning disabilities, the mean UNIT Standard FSIQ (91.69, SD = 12.55) and the WISC–III FSIQ (92.44, SD = 12.14) differed by 0.75 points. Differences were slightly greater for the Abbreviated and Extended batteries (mean UNIT Abbreviated FSIQ = 94.56, SD = 12.82; mean Extended FSIQ = 91.44, SD = 12.64). According to these findings, differences between the UNIT and the WISC–III scores are likely to be small for samples with learning disabilities.

Individuals With Mental Retardation

For this investigation, a sample of 84 examinees in special education programs for persons with mental retardation was administered the UNIT. Members of this sample were typically identified according to DSM–IV or American Association on Mental Retardation (AAMR; 1992) criteria, which require significantly subaverage intellectual functioning existing concurrently with related limitations in two or more areas of adaptive functioning. Examinees in this sample were receiving intensive services (mean hours per week = 24.98, SD = 8.59) and had a history of long-term participation in special education (M = 5.18 years, SD = 2.14).


The average age of the 35 female and 49 male examinees in this sample was 11.96 years (SD = 3.14). Of the sample, 62 were White, 20 African American, and 2 Other; the sample had 2 Hispanic and 82 non-Hispanic participants. Parent education level was predominantly high school graduate or less (<HS, 53%; HS, 31%; Some College, 14%; and ≥4 Years College, 2%). The matched control group from the standardization sample was statistically similar in all of these demographic characteristics. The performance results of the two samples are presented in Table 5.33.


The clinical sample performed substantially lower than the control sample on every UNIT subtest and scale, with all effect sizes greater than 1.0. Accordingly, the UNIT appears to be useful in the identification of individuals with mental retardation.

Analyses of the classification accuracy of the UNIT showed that an FSIQ cut score less than or equal to 70 yielded total hit rates above 80% for the clinical and control samples for both the Standard and Extended batteries. The Extended Battery appears to be slightly more accurate in identifying this population, with a sensitivity of 78.6%, a specificity of 95.2%, and a total hit rate of 86.9%, compared to 65.5% sensitivity, 98.8% specificity, and 82.1% total hit rate for the Standard Battery. Historically, individuals with mental retardation tend to score slightly better on nonverbal intelligence tests than on traditional language-loaded intelligence tests (Wechsler, 1991).

It is important to note that the high UNIT accuracy for identifying individuals with mental retardation would likely be even greater in situations where confounding variables (e.g., regression to the mean, criterion contamination) do not influence the outcome. In situations in which the UNIT is the selection variable, it is highly likely that other instruments (such as the Wechsler scales) would also show somewhat limited classification accuracy. That is, prediction accuracy is always, in part, a function of and limited by the criterion used in selection.

The mean IQs on the UNIT and the WISC–III (Wechsler, 1991) were compared for a subsample of individuals with mental retardation. Regional variation in classification was minimized by the imposition of an additional requirement: The traditional test results had to rank more than 2 SD below the general population mean (i.e., an FSIQ ≤ 70). For the subsample of 54 participants with mental retardation who took the UNIT and the WISC–III, the mean UNIT Standard FSIQ (60.04, SD = 12.77) and WISC–III FSIQ (51.67, SD = 8.78) differed by 8.37 points. Differences were slightly greater for the Abbreviated Battery and slightly smaller for the Extended Battery (mean UNIT Abbreviated FSIQ = 63.76, SD = 11.67; mean Extended FSIQ = 58.61, SD = 11.86).

Individuals in Gifted Programs

These studies included a sample of 160 examinees who were identified as intellectually gifted on the basis of state and federal definitions by multidisciplinary teams in the schools they attended. The teams used criteria such as teacher referral, intelligence test scores, achievement test scores, evidence of superior performance in one or more academic areas, and, in some instances, performance on measures of leadership and creativity. On average, the examinees in this sample had been placed in a gifted setting for several years (M = 2.64, SD = 1.94) and spent a substantial amount of time each week receiving services related to the exceptionality (M = 16.35 hours, SD = 13.20).

The average age of the 71 female and 89 male examinees in this sample was 11.18 years (SD = 3.25). Of the sample, 144 were White, 5 African American, and 11 Other; the sample had 6 Hispanic and 154 non-Hispanic participants. Parent education level was predominantly high (<HS, 2%; HS, 7%; Some College, 17%; and ≥4 Years College, 74%). The matched control group from the standardization sample was similar in all of these demographic characteristics with the exceptions that it included one more White participant and one fewer in the Other category, and one more whose parents were high school graduates and one fewer whose parents had attended some college. The performance results of the two samples are presented in Table 5.34.

The mean scores of the gifted sample were considerably higher than those of the matched control sample on every UNIT subtest and scale. Effect sizes were large for the three FSIQs and the Memory and Nonsymbolic quotients and medium to large for nearly all other scores, except Object Memory and Mazes, for which effect sizes were small. According to these results, the UNIT appears to be effective in identifying individuals who are intellectually gifted.

Traditionally, individuals classified as intellectually gifted tend to score higher on verbally loaded tests than on nonverbal tests (e.g., Wechsler, 1991); however, research with specific populations has often found higher nonverbal than verbal scores, for example among African American (Ryan, 1983), rural gifted (Spicker, 1992), and Hispanic children (Olmedo, 1981). Because verbally loaded intelligence tests constituted the primary criterion for identification of the gifted sample in this study, the UNIT classification accuracy is understandably reduced. A UNIT FSIQ cut score >125 yielded 28.8% sensitivity, 91.3% specificity, and a 60.0% total hit rate for the Standard Battery, and 19.4% sensitivity, 93.1% specificity, and a 56.3% total hit rate for the Extended Battery. An FSIQ cut score >120 yielded 45.0% sensitivity, 88.1% specificity, and a 66.6% total hit rate for the Standard Battery, and 34.4% sensitivity, 89.4% specificity, and a 61.9% total hit rate for the Extended Battery. The UNIT may be especially useful for identifying gifted individuals with known language-related or cultural differences, such as ESL, bilingual, African American, or deaf and hearing-impaired individuals.
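
The two cut scores reported above illustrate the usual tradeoff: lowering the cut point raises sensitivity while giving up some specificity. The minimal sketch below demonstrates that tradeoff under stated assumptions; the simulated score distributions (means, SDs, and the random seed) are hypothetical and are not the UNIT gifted or control data.

```python
import random

random.seed(0)
# Hypothetical distributions (not UNIT data): a gifted group identified largely by
# verbally loaded criteria, so its nonverbal mean sits only modestly above the controls'.
gifted = [random.gauss(121, 9) for _ in range(160)]
control = [random.gauss(103, 14) for _ in range(160)]

for cut in (120, 125):
    sensitivity = sum(score > cut for score in gifted) / len(gifted)
    specificity = sum(score <= cut for score in control) / len(control)
    hit_rate = (sensitivity + specificity) / 2  # groups are the same size
    print(cut, round(sensitivity, 2), round(specificity, 2), round(hit_rate, 2))
```

With the lower cut score, more truly gifted examinees clear the threshold, but more control examinees are misclassified as gifted, which is the pattern seen in the reported figures.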

It is important to note that the UNIT accuracy for identifying gifted examinees would likely be improved in situations where confounding variables (e.g., regression to the mean, criterion contamination) do not influence the outcome. In situations in which the UNIT is the selection variable, it is highly likely that language-loaded instruments would show somewhat limited classification accuracy, because prediction accuracy is always, in some part, a function of the criterion used in selection.

The mean IQs on the UNIT and the WISC–III (Wechsler, 1991) were compared for a subsample of the individuals classified as intellectually gifted. Regional variation in classification was minimized by the imposition of an additional requirement: The traditional test results had to rank more than 1.33 SD above the general population mean (i.e., an FSIQ > 120). Many states, in fact, require IQs more than 2 SD above the mean. For the subsample of 37 gifted participants who took the UNIT and the WISC–III, the mean UNIT Standard FSIQ (121.19, SD = 8.09) and the mean WISC–III FSIQ (130.51, SD = 7.31) differed by 9.32 points. Differences were slightly greater for the Abbreviated and Extended batteries (mean UNIT Abbreviated FSIQ = 120.27, SD = 11.67; mean Extended FSIQ = 118.11, SD = 8.82).


Individuals With Serious Emotional Disturbance

According to the Individuals with Disabilities Education Act of 1990 (IDEA; Public Law 101-476) and its 1997 reauthorization (Public Law 105-17), the educational exceptionality of serious emotional disturbance (SED) requires documentation of one or more of the following criteria: interpersonal problems; inappropriate behaviors and feelings under normal circumstances; pervasive mood problems, including depression; a tendency to develop physical symptoms and fears; and/or an inability to learn that cannot be explained by intellectual, sensory, or health factors. Moreover, the emotional disturbance must have an impact on educational achievement. A sample of examinees previously identified by multidisciplinary educational teams as having SED was administered the UNIT. On average, the examinees in this sample had been in special education for several years (M = 3.00 years, SD = 1.64) and were receiving several days of services each week (M = 14.39 hours per week, SD = 11.97). Because SED is, by definition, independent of intellectual impairment, performance deficits on the UNIT were expected to be negligible.

The average age of the 10 female and 13 male examinees in this sample was 13.6 years (SD = 3.1). Of the sample, 14 were White, 4 African American, and 5 Other; the sample had 1 Hispanic and 22 non-Hispanic participants. Parent education level was predominantly high school graduate or greater (<HS, 17%; HS, 35%; Some College, 26%; and ≥4 Years College, 22%). The matched control group from the standardization sample was statistically similar in all of these demographic characteristics. The performance results of the two samples are presented in Table 5.35.

No meaningful performance differences and only small effect sizes were found to differentiate the sample with SED from the matched control sample. The clinical sample’s scores on about two thirds of the UNIT indexes were slightly lower than those of the control sample, and on about one third, slightly higher. Serious emotional disturbance and intelligence are largely independent. Accordingly, the UNIT did not discriminate between the examinees with SED and the control sample of nondisturbed examinees, as expected.

Summary of Studies with Clinical and Exceptional Samples

In general, the results of these studies show that the UNIT can be used to identify individuals who are mentally retarded or intellectually gifted. Individuals with learning disabilities can be differentiated most easily on the basis of symbolic processing and overall intellectual functioning (FSIQ), with large effect sizes. The performance of populations with speech and language impairments (who usually have a number of concomitant cognitive impairments) differs considerably, with medium effect sizes, from that of demographically matched control examinees.


Summary of Validity Studies

Evidence from multiple sources and multiple methods supports the use of the UNIT as a structurally sound and meaningful measure of intelligence and its constituent constructs. The UNIT’s theory-based organization into primary and secondary scales with a hierarchical g is supported by the results of factor analyses of data from both normative and clinical/exceptional samples. The UNIT correlates well with other measures of intelligence across a number of samples and also offers substantial value in the prediction of academic achievement. Moreover, the UNIT is diagnostically useful and sensitive to common clinical and exceptional conditions.
