Equivalence Reliability Among the FITNESSGRAM® Upper-Body Tests of Muscular Strength and Endurance

Todd Sherman
Division of Physical Education and Dance
Oxford College of Emory University

J. P. Barfield
Department of Exercise Science, Physical Education, & Wellness
Tennessee Tech University

This study was designed to investigate the equivalence reliability between the suggested FITNESSGRAM® muscular strength and endurance test, the 90º push-up (PSU), and alternate FITNESSGRAM® tests of upper-body strength and endurance (i.e., modified pull-up [MPU], flexed-arm hang [FAH], and pull-up [PU]).

Children (N = 383) in Grades 3 to 6 were tested over a period of a week. Equivalence reliability for the PSU–MPU comparison was acceptable for boys across ages 8 to 10 (Percentage Agreement [Pa] = .74 to .78, Modified Kappa [Kq] = .48 to .56), with good agreement among boys age 11 (Pa = .86, Kq = .72). Equivalence reliability for girls was unacceptable across all ages (Pa = .48 to .59, Kq = –.04 to .18). Reliability estimates were also acceptable for boys on the PSU–FAH comparison across all ages (Pa = .72 to .80, Kq = .44 to .60). Similar results were not found for girls. Consistency of classification was not demonstrated between PSU–PU for boys ages 8 and 9 years; however, acceptable to good estimates were found for the girls’ PSU–PU comparison (Pa = .72 to .82, Kq = .44 to .64) with the exception of girls age 11 (Pa = .67, Kq = .34). Practitioners must recognize that using alternative FITNESSGRAM® strength and endurance items may result in a different healthy fitness zone classification for children than the recommended test, the PSU.

Key words: criterion-reference, children’s strength tests

MEASUREMENT IN PHYSICAL EDUCATION AND EXERCISE SCIENCE, 10(4), 241–254
Copyright © 2006, Lawrence Erlbaum Associates, Inc.

Correspondence should be sent to Todd Sherman, Oxford College of Emory University, Division of Physical Education and Dance, 100 Hamill Street, Oxford, GA 30054. E-mail: tsherma@learnlink.emory.edu

Adequate upper-body strength is necessary for performing functional and daily activities as well as preventing injury and osteoporosis (Kollath, Safrit, Zhu, & Gao, 1991; Pate, Burgess, Woods, Ross, & Baumgartner, 1993; Ross & Pate, 1987). In addition, physical educators use muscular fitness test scores to document health-related physical fitness and estimate levels that may yield benefits into adulthood (Cooper Institute for Aerobics Research [CIAR], 1999; Cureton & Warren, 1990). Because of the practicality and the importance of muscular strength and endurance testing, test developers make consistent efforts to include upper-body strength measures in test batteries (Engelman & Morrow, 1991).

The FITNESSGRAM® health-related physical fitness test battery was developed by the CIAR (1999) and is currently endorsed by the American Alliance for Health, Physical Education, Recreation and Dance. Unique to the FITNESSGRAM®, practitioners have the option of using any one of the following FITNESSGRAM® field tests to measure upper-body strength and endurance: (a) the traditional pull-up (PU), (b) the modified pull-up (MPU), (c) the 90º push-up (PSU), and (d) the flexed-arm hang (FAH). Although the practitioner may choose to use any of the tests, the PSU is recommended.

FITNESSGRAM® scores are evaluated against both norm-referenced and criterion-referenced standards. Criterion-referenced standards were established in the late 1970s and early 1980s to help indicate levels of physical fitness needed for good health (Cureton & Warren, 1990). The unique feature of the FITNESSGRAM® is that it allows the practitioner the option of administering any of the four FITNESSGRAM® upper-body strength tests; hence, a child should receive the same criterion classification (i.e., healthy or unhealthy fitness zone) regardless of the test administered. If tests are used interchangeably, tests must be equivalent (Zhu, 1998).

There have been numerous studies on norm-referenced reliability and validity evidence for field tests of upper-body strength and endurance (Cotten, 1990; DiNucci, McCune, & Shows, 1990; Engelman & Morrow, 1991; Jackson, Fromme, Plitt, & Mercer, 1994; Kollath et al., 1991; McManis, Baumgartner, & Wuest, 2000; Pate et al., 1993; Rutherford & Corbin, 1994); unfortunately, there is limited evidence supporting the consistency of classification across tests (Looney & Plowman, 1990). To this point, Romain and Mahar (2001) have published the only study addressing criterion-referenced equivalence reliability of FITNESSGRAM® upper-body strength and endurance items. These researchers evaluated consistency of classification between the PSU and MPU among children but limited the study to Grades 5 and 6 and excluded additional FITNESSGRAM® options (i.e., PU, FAH). If tests are not consistent in classification, problems can occur when using test scores to classify whether children are in a healthy fitness zone. Misclassification of a child may lead to an overestimation of appropriate physical activity or a discouragement in participation because the child feels the standard is unachievable. Both outcomes may affect the child’s development of an active lifestyle that is conducive to his or her health-related fitness (Cureton & Warren, 1990). Therefore, this study was designed to determine the consistency of classification, or equivalence reliability, between the PSU, the FITNESSGRAM’s® suggested upper-body muscular strength and endurance test, and other upper-body strength and endurance test options across elementary school ages.

METHODS

Participants

The participants were a convenience sample of 403 children from one elementary school in a metropolitan area. The children were between 7 and 13 years of age and were enrolled in Grades 3 to 6 physical education classes. Scores collected from children ages 7, 12, and 13 were not included in the analyses because of small sample sizes. Thus the total number in the sample was 383 (boys n = 201, girls n = 182). Permission from the principal, the director of schools, and the Institutional Review Board (IRB) was obtained prior to testing. Because fitness testing was a part of the children’s physical education curriculum, the IRB committee only required permission from the parents prior to data collection.

Instrument

The FITNESSGRAM® (CIAR, 1999) upper-body strength tests were administered to all participants. The tests included the MPU, the PU, the FAH, and the PSU. Test administration procedures were strictly followed as detailed in the FITNESSGRAM® test manual (CIAR, 1999, pp. 25–28).

Procedures

Eight test administrators, consisting of graduate students and faculty members, were utilized for this study. All administrators had prior testing experience; however, the principal investigator required each test administrator to review FITNESSGRAM® test administration protocol and to participate in the practice trials 1 week prior to data collection. These practice trials allowed for the identification and remedy of any procedural or scoring problems by the principal investigator and test administrators.

Fitness testing was part of the physical education curriculum and students were familiar with test items. Students were given additional instruction on correct performance on all tests and practice time on all test items prior to data collection. Testing was conducted during the 30-min physical education classes over 2 days. Classes met on a Monday–Wednesday or Tuesday–Thursday schedule. On Day 1, MPU and FAH were administered. As the children entered the gymnasium, students were divided into three groups of eight. Each group started either at the MPU, FAH, or the height and weight station. Student groups rotated at 9-min intervals. PU, PSU, and an activity station (i.e., jumping rope) were administered on Day 2. Each station group of students had approximately 2 to 3 min of rest collectively before performing at the next station. Because students remained in alphabetical order throughout testing, students rested, individually, approximately 8 min before performing the next test. The principal investigator returned the following week to collect five make-up test scores.

Analyses

Students were categorized as being in a healthy or unhealthy fitness zone based on the criterion-referenced standard for their age and gender (CIAR, 1999). Percentage Agreement (Pa) was used to determine equivalence reliability for the following comparisons: PSU–MPU, PSU–FAH, and PSU–PU. Equivalence reliability, a term more appropriate to the psychomotor domain, is sometimes called alternate forms reliability; the Pa reflects the extent to which two tests result in the same classification (Morrow, Jackson, Disch, & Mood, 2000). Looney (1989) indicated that reliability studies should include both Pa and Modified Kappa (Kq) when the proportion of masters is not fixed. Whereas Pa represents the proportion of individuals receiving the same fitness zone classification on two tests and is influenced by chance agreement, Kq represents the proportion of individuals receiving the same fitness zone classification after controlling for chance agreement. The following assumptions for Kq were met: (a) independence among objects to be categorized, and (b) independence and exclusivity of categories. All agreement statistics were calculated using SPSS for Windows (version 10.0) and a statistical Web site (Chuang, 2001).
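The two agreement statistics can be sketched in a few lines of Python. This is an illustration only, not the authors' SPSS procedure: the function names and the pass/fail lists are hypothetical, and the modified kappa is assumed to take the common form Kq = (Pa − 1/q) / (1 − 1/q), where q is the number of classification categories (q = 2 here, so Kq = 2Pa − 1). That form is consistent with the paired Pa and Kq values reported in Tables 4 and 5.

```python
def percentage_agreement(test_a, test_b):
    """Proportion of children given the same pass/fail classification by both tests."""
    if len(test_a) != len(test_b) or not test_a:
        raise ValueError("classification lists must be non-empty and equal in length")
    same = sum(1 for a, b in zip(test_a, test_b) if a == b)
    return same / len(test_a)


def modified_kappa(pa, q=2):
    """Chance-corrected agreement, assuming chance agreement equals 1/q."""
    chance = 1.0 / q
    return (pa - chance) / (1.0 - chance)


# Hypothetical classifications: True = healthy fitness zone, False = below it.
psu = [True, True, False, True, False, False, True, True, False, True]
mpu = [True, True, True, True, False, True, True, True, False, True]

pa = percentage_agreement(psu, mpu)
kq = modified_kappa(pa)
print(f"Pa = {pa:.2f}, Kq = {kq:.2f}")  # Pa = 0.80, Kq = 0.60
```

Under this assumed form, a Pa of .50 yields a Kq of 0, which is why a Pa that looks moderate can correspond to a very low chance-corrected estimate.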

RESULTS

Descriptive statistics for boys (n = 201) and girls (n = 182) are presented in Table 1. The weight and height of the boys and girls increased with each successive age group. This sample was above the national average for both height and weight (Ogden et al., 2002). Mean Body Mass Index (BMI) for boys and girls met the criteria for the healthy fitness zone as stated in the FITNESSGRAM® test manual (CIAR, 1999, pp. 40–41), with one exception: mean BMI for 11-year-old boys exceeded the healthy fitness zone criterion. Because some test items yielded many zero scores, Tables 2 and 3 include the number of zero scores and the 25th, 50th, and 75th percentiles for all the upper-body strength and endurance tests.

TABLE 1
Means and Standard Deviations for Eight- to Eleven-Year-Old Boys’ and Girls’ Height, Weight, and BMI

Variable        Age 8              Age 9              Age 10             Age 11
               M    SD    n       M    SD    n       M    SD    n       M    SD    n
Boys                     46                 50                 61                 44
  Weight    33.0   7.7         37.0  10.4         40.9  11.8         50.6  16.7
  Height   132.6   5.1        136.7   6.9        141.5   7.9        150.1   8.9
  BMI       18.6   3.4         19.6   4.2         20.1   4.2         22.0   5.2
Girls                    39                 56                 44                 43
  Weight    33.8  10.9         36.6   9.2         42.1  12.9         47.9  16.4
  Height   132.1   5.6        136.2   6.6        142.5   6.6        148.1   8.4
  BMI       19.2   5.0         19.6   3.9         20.5   4.8         21.4   5.3

Note. BMI = body mass index. Height is reported in centimeters, weight is reported in kilograms, and BMI is reported as weight in kilograms divided by height in meters squared.

TABLE 2
Boys’ 25th, 50th, and 75th Percentile Scores for FITNESSGRAM’s® Tests of Upper-Body Strength and Endurance

Age   Test Item   Number of     25th         50th         75th
                  Zero Scores   Percentile   Percentile   Percentile
8ᵃ    PSU         0             4.0          7.0          12.0
      MPU         0             6.0          10.5         15.3
      PU          28            0.0          0.0          1.3
      FAH         10            1.0          5.0          12.3
9ᵇ    PSU         2             4.0          8.0          13.0
      MPU         2             7.0          11.0         15.3
      PU          31            0.0          0.0          2.0
      FAH         6             3.0          5.0          13.5
10ᶜ   PSU         3             3.5          7.0          14.5
      MPU         1             4.5          10.0         15.0
      PU          39            0.0          0.0          1.0
      FAH         11            1.0          5.0          12.0
11ᵈ   PSU         2             5.0          9.0          17.0
      MPU         2             5.0          9.0          14.8
      PU          29            0.0          0.0          1.0
      FAH         17            0.0          3.5          9.75

Note. MPU = modified pull-up; PSU = 90º push-up; PU = pull-up; and FAH = flexed-arm hang in seconds.

ᵃn = 46. ᵇn = 50. ᶜn = 61. ᵈn = 44.

The 90º PSU test is recommended by the FITNESSGRAM®; therefore, Pa indexes were computed between PSU and all other tests of upper-body strength. Pa indexes between PSU and the other tests of upper-body strength for each age and gender are reported in Tables 4 and 5. Kq was used to correct for chance agreement. When evaluating equivalence reliability, it is important to note that Pa estimates for small sample sizes (N = 30) are unbiased relative to large sample values; however, less evidence is available regarding Kq (Looney, 1989). Therefore, Kq estimates were calculated for children ages 8 and 9 collectively and ages 10 and 11 collectively to increase sample size and improve the ability to generalize findings (Tables 4–5). Pa and Kq were evaluated by separate criteria. Pa values between .50 and 1.0 are acceptable, but values should be closer to 1.0 than to .50 to establish equivalence reliability (Baumgartner, Jackson, Mahar, & Rowe, 2003; Looney, 1989). Kq values greater than .75 are “excellent,” between .60 and .75 are “good,” and between .40 and .60 are “acceptable” estimates of equivalence reliability (Morrow et al., 2000). Student passing rates on test items are included in Tables 6 and 7.
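The Kq interpretation bands can be expressed as a small lookup, shown here as a hedged sketch rather than a procedure used in the study. The function name `interpret_kq` is hypothetical, the boundary handling at .60 follows the notes to Tables 4 and 5 (Good = .60 ≤ Kq ≤ .75, Acceptable = .40 ≤ Kq < .60), and labeling values below .40 as "unacceptable" is an assumption based on how such estimates are described in the text.

```python
def interpret_kq(kq):
    """Map a modified kappa estimate to the descriptive bands used in this article."""
    if kq > 0.75:
        return "excellent"
    if kq >= 0.60:
        return "good"
    if kq >= 0.40:
        return "acceptable"
    # Assumed label: the text refers to estimates below .40 as unacceptable.
    return "unacceptable"


# Boys' PSU-MPU Kq estimates by age, copied from Table 4.
boys_psu_mpu_kq = {8: 0.56, 9: 0.56, 10: 0.48, 11: 0.72}
for age, kq in boys_psu_mpu_kq.items():
    print(age, kq, interpret_kq(kq))
```

Applied to the Table 4 values above, the function returns "acceptable" for ages 8 to 10 and "good" for age 11.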

TABLE 3
Girls’ 25th, 50th, and 75th Percentile Scores for FITNESSGRAM’s® Tests of Upper-Body Strength and Endurance

Age   Test Item   Number of     25th         50th         75th
                  Zero Scores   Percentile   Percentile   Percentile
8ᵃ    PSU         3             2.0          4.0          8.0
      MPU         2             6.0          9.0          14.0
      PU          27            0.0          0.0          1.0
      FAH         13            0.0          4.0          8.0
9ᵇ    PSU         2             2.0          4.0          8.0
      MPU         0             5.0          8.0          11.8
      PU          44            0.0          0.0          0.0
      FAH         17            0.0          4.0          6.8
10ᶜ   PSU         6             2.0          5.0          8.5
      MPU         1             3.3          8.0          14.5
      PU          35            0.0          0.0          0.0
      FAH         14            0.0          3.0          8.0
11ᵈ   PSU         0             2.0          6.0          12.0
      MPU         1             5.0          10.0         13.0
      PU          34            0.0          0.0          0.0
      FAH         1             1.0          3.0          6.0

Note. MPU = modified pull-up; PSU = 90º push-up; PU = pull-up; FAH = flexed-arm hang in seconds.

ᵃn = 39. ᵇn = 56. ᶜn = 44. ᵈn = 43.

TABLE 4
Boys’ Percent Agreement and Modified Kappa Indexes Between the Push-Up Test and the FITNESSGRAM’s® Alternate Tests of Upper-Body Strength and Endurance

Age         Statistic   PSU–MPU   PSU–FAH   PSU–PU
8ᵃ          Pa          .78       .80       .61
            Kq          .56       .60       .22
9ᵇ          Pa          .78       .76       .62
            Kq          .56       .52       .24
10ᶜ         Pa          .74       .72       .74
            Kq          .48       .44       .48
11ᵈ         Pa          .86       .75       .70
            Kq          .72       .50       .40
8 and 9ᵉ    Kq          .39       .49       .29
10 and 11ᶠ  Kq          .55       .46       .47

Note. Pa = percent agreement; Kq = modified kappa; MPU = modified pull-up; PSU = 90º push-up; PU = pull-up; and FAH = flexed-arm hang. Age is represented in years. Excellent = Kq > .75, Good = .60 ≤ Kq ≤ .75, and Acceptable = .40 ≤ Kq < .60.

ᵃn = 46. ᵇn = 50. ᶜn = 61. ᵈn = 44. ᵉn = 96. ᶠn = 105.

TABLE 5
Girls’ Percent Agreement and Modified Kappa Indexes Between the Push-Up Test and the FITNESSGRAM’s® Alternate Tests of Upper-Body Strength and Endurance

Age         Statistic   PSU–MPU   PSU–FAH   PSU–PU
8ᵃ          Pa          .54       .56       .72
            Kq          .08       .12       .44
9ᵇ          Pa          .48       .64       .77
            Kq          –.04      .28       .54
10ᶜ         Pa          .59       .75       .82
            Kq          .18       .50       .64
11ᵈ         Pa          .58       .67       .67
            Kq          .16       .34       .34
8 and 9ᵉ    Kq          .11       .24       .46
10 and 11ᶠ  Kq          .25       .40       .44

Note. Pa = percentage agreement; Kq = modified kappa; MPU = modified pull-up; PSU = 90º push-up; PU = pull-up; and FAH = flexed-arm hang. Excellent = Kq > .75, Good = .60 ≤ Kq ≤ .75, and Acceptable = .40 ≤ Kq < .60. Age is represented in years.

ᵃn = 39. ᵇn = 56. ᶜn = 44. ᵈn = 43. ᵉn = 95. ᶠn = 87.

Boys

Using these criteria, equivalence reliability for the PSU–MPU comparison was acceptable for boys ages 8 to 10 (Pa = .74 to .78, Kq = .48 to .56), with good agreement for boys age 11 (Pa = .86; Kq = .72). Reliability estimates were also acceptable for the PSU–FAH comparison across all ages (Pa = .72 to .80, Kq = .44 to .60); however, consistency of classification was not demonstrated between PSU–PU. Estimates were unacceptable for boys ages 8 and 9 (Pa = .61 to .62; Kq = .22 to .24) and barely acceptable for boys ages 10 and 11 (Kq = .40 to .48).

Girls

Equivalence reliability for the PSU–MPU comparison was unacceptable across all ages (Pa = .48 to .59; Kq = –.04 to .18). Similar results were found for the PSU–FAH comparison (Table 5) except for 10-year-olds. Acceptable to good estimates were found for the PSU–PU comparison (Pa = .72 to .82, Kq = .44 to .64) with the exception of girls age 11 (Kq = .34).

TABLE 6
Passing Rates (%) on Upper-Body Strength and Endurance Items for Boys

Age   PSU   MPU   FAH   PU
8     70    87    63    39
9     68    90    76    38
10    53    75    61    36
11    56    68    39    34

Note. MPU = modified pull-up; PSU = 90º push-up; PU = pull-up; FAH = flexed-arm hang. Age is represented in years.

TABLE 7
Passing Rates (%) on Upper-Body Strength and Endurance Items for Girls

Age   PSU   MPU   FAH   PU
8     46    92    54    28
9     41    89    59    21
10    34    75    45    21
11    49    86    26    21

Note. MPU = modified pull-up; PSU = 90º push-up; PU = pull-up; FAH = flexed-arm hang. Age is represented in years.

DISCUSSION

The study was designed to determine the equivalence reliability, or alternate forms reliability, between the suggested FITNESSGRAM® upper-body strength test, the PSU, and alternative test choices. If high equivalence reliability exists among test items, practitioners may use test items interchangeably and feel confident that a child’s fitness zone classification will be consistent across tests. If low reliability exists, FITNESSGRAM® researchers may need to adjust healthy fitness zone criteria or reevaluate optional test items.

Boys

PSU–MPU. In this study, equivalence reliability estimates were acceptable for the PSU–MPU comparison for boys; however, these estimates were insufficient to conclude that tests can be used interchangeably. Looney (1989) indicated that a high percentage of masters and large sample variability increase classification agreement as described by the proportion of agreement. Although the PSU and MPU data had these characteristics, far too many boys were misclassified based on test choice. As reflected in the contingency tables, approximately 20% of boys were classified differently between tests at each age level, and the majority of misclassified boys passed the MPU standard but failed the PSU. When the influence of chance is removed, only 48% to 72% of boys, depending on age, would be expected to receive the same classification on both tests. Acceptable equivalency was noted but classification agreement can be improved (Table 4). Based on these findings, one cannot conclude that these tests are truly equivalent in terms of healthy fitness zone classification.

PSU–FAH. Reliability estimates for the PSU–FAH comparison were also statistically acceptable but insufficient to conclude that evaluation standards are equal. Contingency tables revealed that 25% of boys at each age level were classified differently between tests and that misclassification occurred in both directions. Similar to the Kq estimates for the PSU–MPU comparison, equivalency reliability was acceptable but can be improved (Kq = .44 to .60). Also, more “zero” scores were recorded for the FAH than the PSU or the MPU (Table 2). Therefore, practitioners may want to consider the usefulness of the FAH, in addition to classification consistency, because many boys will not be able to complete one attempt.

PSU–PU. Equivalence reliability estimates were unacceptable for the PSU–PU comparison (Table 4). The high percentage of zero scores on the PU (63% of all boys) contributed to low passing rates (Table 6), despite the minimal criterion (i.e., 1 PU). Poor performance on the PU has been documented elsewhere (Engelman & Morrow, 1991; Ross, Dotson, Gilbert, & Katz, 1985). We recommend that the PU be removed from future editions of the FITNESSGRAM® due to test difficulty.
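The 63% figure can be checked against the zero-score counts in Table 2. The short tally below is only an arithmetic sketch; the dictionary names are hypothetical, and the counts and sample sizes are copied from Table 2.

```python
# Boys' sample sizes and pull-up (PU) zero-score counts by age, from Table 2.
boys_n = {8: 46, 9: 50, 10: 61, 11: 44}
boys_pu_zero = {8: 28, 9: 31, 10: 39, 11: 29}

total_n = sum(boys_n.values())           # 201 boys
total_zero = sum(boys_pu_zero.values())  # 127 boys with a zero score
share = total_zero / total_n

print(f"{total_zero} of {total_n} boys scored zero on the PU ({share:.0%})")

# Per-age share of zero scores, to show the pattern holds at every age.
for age in sorted(boys_n):
    print(age, f"{boys_pu_zero[age] / boys_n[age]:.0%}")
```

The overall share is 127 of 201, which rounds to the 63% reported in the text.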

Minimal criterion-referenced reliability evidence for FITNESSGRAM® upper-body strength and endurance items is present in the literature (Kollath et al., 1991; Romain & Mahar, 2001), especially in terms of equivalence reliability. Romain and Mahar (2001) published the initial equivalence reliability study specific to FITNESSGRAM® upper-body test items. These researchers compared the classification consistency between the PSU and MPU in boys (n = 30) and girls (n = 32) in Grades 5 and 6. Among boys, Romain and Mahar reported Pa = .70 and Kq = .40 and suggested that these reliability estimates were not acceptable. Norm-referenced PSU and MPU mean scores were similar to those reported in this study, but our reliability estimates were slightly higher and extended the test comparisons across all alternative FITNESSGRAM® items. Nonetheless, we agree with Romain and Mahar that classification consistency must improve before test items are used interchangeably.

Girls

PSU–MPU. Equivalence reliability estimates were unacceptable for the PSU–MPU comparison across all ages (Table 5). Percentage agreement ranged from .48 to .59. Based on contingency tables, over 40% of girls in each group were classified differently between tests and the majority of misclassified girls passed the MPU but failed the PSU. If the influence of chance is removed, the equivalency reliability is unacceptable (Kq ≤ .20). Practitioners should not be encouraged to use the PSU and MPU interchangeably to evaluate criterion-referenced muscular strength and endurance performance for girls.

PSU–FAH. Reliability estimates for the PSU–FAH comparisons were also unacceptable across ages, with the exception of age 10 (Table 5). Pa statistics ranged from .56 to .64, and contingency tables revealed that 36% to 44% of girls ages 8, 9, and 11 were classified differently between tests with the majority passing the FAH and failing the PSU. If chance agreement is controlled, equivalency reliability is unacceptable. Additionally, an unacceptable number of girls received a zero score on the FAH (45 of 182). Similar to the recommendation for the PSU–MPU comparison, practitioners should not use the PSU and FAH interchangeably to assess upper-body strength.

PSU–PU. In contrast to boys, the highest classification agreement for girls was recorded for the PSU–PU comparison. Sixty-seven percent to 82% of girls received the same classification on both test items. Although Pa estimates were acceptable with the exception of age 11 (Table 5), Kq estimates indicated that less than 55% of girls would be classified the same if chance agreement was controlled. More importantly, the PU test was too difficult to complete. The majority of girls could not complete 1 PU (140 of 182 failed the test). In this case, acceptable equivalence reliability was due to test difficulty on both items and should not indicate that both items are appropriate evaluations of upper-body muscular strength. The authors recommend that the PU be eliminated as a test choice from the FITNESSGRAM®. Furthermore, it appears evident that healthy fitness zone criteria for the PSU are too high for girls and should be lowered to increase classification consistency with MPU and FAH tests.

The investigation by Romain and Mahar (2001) documented lower mean PSU and MPU scores than this sample and also documented unacceptable reliability (Pa = .69, Kq = .38) between the PSU and MPU. These researchers concluded that criterion consistency among strength items needed to be addressed. If the PSU remains the recommended FITNESSGRAM® muscular strength and endurance test item for girls, classification consistency with optional test items must be improved.

Application

Equivalence reliability between the recommended FITNESSGRAM® upper-body strength test, the PSU, and alternative muscular strength and endurance items needs improvement. This finding is consistent with the lack of agreement between the PSU and MPU documented by Romain and Mahar (2001). Although some reliability estimates were statistically “acceptable” across specific comparisons (Pa), most comparisons for both boys and girls were unacceptable for practical usage when chance agreement was considered (Kq). Consistency of classification across future recommended tests and criteria must improve if practitioners use FITNESSGRAM® battery data to assess health-related fitness (i.e., muscular fitness).

Validity. Equivalence reliability addresses one measurement consideration of FITNESSGRAM® muscular strength and endurance test items. Specifically, this study addresses the appropriateness of using test items interchangeably to classify a child’s score into the healthy or unhealthy fitness zone; however, criterion-referenced equivalence reliability estimates, whether high or low, should not dictate the inclusion or exclusion of muscular strength and endurance items within the battery. The validity of criterion-referenced standards is an additional, but interrelated, measurement issue. Although the authors of this study make specific recommendations for test item alterations, these recommendations must be considered within a larger theoretical framework relative to FITNESSGRAM® test items and evaluation criteria.

Appropriate criterion-referenced standards are difficult to establish (Morrow et al., 2000). It is disturbing, but not surprising, that equivalence reliability across FITNESSGRAM® items is inadequate. One explanation is that upper-body strength and endurance items vary in difficulty (Zhu, 1998), which makes it difficult to set equivalent standards. Lack of agreement could also be due to the varying muscle groups that each test emphasizes (Engelman & Morrow, 1991; Pate et al., 1993). For example, PSU emphasizes the pectoralis major and triceps whereas the PU emphasizes the latissimus dorsi and biceps. Engelman and Morrow (1991) documented a moderate relation between norm-referenced MPU and PU scores. These researchers reported correlation coefficients of .49 to .71 among boys and girls in Grades 3 to 5. Pate and colleagues (1993) reported correlations from .40 to .71 between the PSU and MPU. Although not specific to criterion-referenced standards, inadequate correlations among test items reinforce that various upper-body tests measure varying components of strength.

To this point, no empirical efforts have been made to validate criterion-referenced standards of strength and endurance test items. Looney and Plowman (1990) have reported one method to establish valid criterion standards. These researchers suggested that two groups of students, one active and one inactive, be tested on a specific item. The active group utilizes muscular strength and endurance for everyday use and possesses a suitable level of function whereas the inactive group possesses a distinctly lower level of function. The intersection of sample distributions is therefore selected as the appropriate standard. This research method would be useful in the future study and evaluation of FITNESSGRAM® test items.

Equating tests is an additional method that can be used to address the validity of criterion-referenced standards. Although a variety of statistical options are available, Zhu (1998) indicated that traditional equating is an appropriate method for two muscular strength tests (i.e., use of z scores to equate tests). Using this method, one can determine a score for an alternative test that corresponds to a specific score on the gold standard. For example, a PSU score of 7 may correspond to an MPU score of 10. FITNESSGRAM® healthy fitness zone criteria should parallel “equated” scores. Although this sample size is not sufficient to draw inferences, equated scores between the PSU and other test items for 10-year-old boys in this sample did not reflect the same health zone classifications. Although the purpose of this study is to address classification consistency across tests, one must consider the validity of each test when drawing conclusions from reliability estimates.
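The z-score approach to equating can be illustrated with a minimal sketch. It assumes simple linear equating, in which a score on one test is mapped to the score on the other test that has the same z score. The function name `linear_equate` and both score lists are hypothetical; they are not the study's data, and the study's own equated values are not reproduced here.

```python
from statistics import mean, stdev


def linear_equate(score_x, scores_x, scores_y):
    """Return the test-Y score with the same z score as score_x has on test X."""
    z = (score_x - mean(scores_x)) / stdev(scores_x)
    return mean(scores_y) + z * stdev(scores_y)


# Hypothetical repetition counts for the same five children on two tests.
psu_scores = [2, 4, 6, 8, 10]
mpu_scores = [4, 7, 10, 13, 16]

for psu in (4, 6, 8):
    equated = linear_equate(psu, psu_scores, mpu_scores)
    print(f"PSU {psu} corresponds to MPU {equated:.1f}")
```

With these made-up lists, a PSU score of 8 maps to an MPU score of about 13. If healthy fitness zone cut scores on two tests were set independently, comparing each cut score with its equated counterpart would show whether the two standards are equally demanding.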

Reliability. High criterion-referenced test–retest reliability coefficients have been documented for both the PSU (Pa = .97, Kq = .94) and MPU (Pa = .95, Kq = .90; Romain & Mahar, 2001) but have not been documented for either the PU or FAH. A major limitation of this study is that criterion-referenced reliability was not established for all test items either prior to or during testing. Indeed, equivalence reliability among test scores is influenced by test–retest consistency. Although norm-referenced test–retest reliability coefficients reported for the PU and FAH have been high (Cotten, 1990; Engelman & Morrow, 1991; Pate et al., 1993), it is difficult to conclude if classification inconsistency in this study resulted from error between tests, within participants, or both.

The FITNESSGRAM® is currently in its third edition. Results from this study suggest that upper-body muscular strength criterion standards should be investigated further for boys and girls in Grades 3 to 6 relative to equivalence reliability. The FITNESSGRAM® standards have been adjusted over time to reflect appropriate standards, and further study relative to equivalence reliability will enhance usefulness and application of the test battery.

REFERENCES

Baumgartner, T. A., Jackson, A. S., Mahar, M. T., & Rowe, D. A. (2003). Measurement for evaluation in physical education and exercise science (7th ed.). Boston: McGraw-Hill.

Chuang, J. H. (2001). Agreement between categorical measurements: Kappa statistics. Retrieved May 5, 2001, from Columbia University, Department of Medical Informatics Web site: http://www.dmi.columbia.edu/homePages/chuangj/kappa/

Cooper Institute for Aerobics Research. (1999). The Prudential FITNESSGRAM® test administration manual. Dallas, TX: Author.

Cotten, D. J. (1990). An analysis of the NCYFS II modified pull-up test. Research Quarterly for Exercise and Sport, 61, 272–274.

Cureton, K. J., & Warren, G. L. (1990). Criterion-referenced standards for youth health-related tests: A tutorial. Research Quarterly for Exercise and Sport, 61, 7–19.

DiNucci, J., McCune, D., & Shows, D. (1990). Reliability of a modification of the health-related physical fitness test for use with physical education majors. Research Quarterly for Exercise and Sport, 61, 20–25.

Engelman, M. E., & Morrow, J. R., Jr. (1991). Reliability and skinfold correlates for traditional and modified pull-ups in children grades 3–5. Research Quarterly for Exercise and Sport, 62, 88–91.

Jackson, A. W., Fromme, C., Plitt, H., & Mercer, J. (1994). Reliability and validity of a 1-minute 90º push-up test for young adults [Abstract]. Research Quarterly for Exercise and Sport, 65(Suppl. 1), A57–A58.

Kollath, J., Safrit, J., Zhu, W., & Gao, L. (1991). Measurement errors in modified pull-ups testing. Research Quarterly for Exercise and Sport, 62, 432–435.

Looney, M. (1989). Criterion-referenced measurement: Reliability. In M. Safrit & T. Wood (Eds.), Measurement concepts in physical education and exercise science (pp. 137–152). St. Louis, MO: Mosby.

Looney, M. A., & Plowman, S. A. (1990). Passing rates of American children and youth on the FITNESSGRAM® criterion-referenced physical fitness standards. Research Quarterly for Exercise and Sport, 61, 215–223.

McManis, B. G., Baumgartner, T. A., & Wuest, D. A. (2000). Objectivity and reliability of the 90º push-up test. Measurement in Physical Education and Exercise Science, 4, 57–67.

Morrow, J. R., Jackson, A. W., Disch, J. G., & Mood, D. P. (2000). Measurement and evaluation in human performance (2nd ed.). Champaign, IL: Human Kinetics.

Ogden, C. L., Kuczmarski, R. J., Flegal, K. M., Mei, Z., Guo, S., Wei, R., et al. (2002). Centers for Disease Control and Prevention 2000 growth charts for the United States: Improvements to the 1977 National Center for Health Statistics version. Pediatrics, 109, 45–60.

Pate, R., Burgess, M., Woods, J., Ross, J., & Baumgartner, T. (1993). Validity of field tests of upper body muscular strength. Research Quarterly for Exercise and Sport, 64, 17–24.

Romain, B. S., & Mahar, M. T. (2001). Norm-referenced and criterion-referenced reliability of the 90º push-up and modified pull-up. Measurement in Physical Education and Exercise Science, 5, 67–80.

Ross, J., Dotson, C., Gilbert, G., & Katz, S. (1985). New standards for fitness measurement. Journal of Physical Education, Recreation, and Dance, 56(1), 66–70.

Ross, J. G., & Pate, R. R. (1987). The national children and youth fitness study II: A summary of findings. Journal of Physical Education, Recreation, and Dance, 58(9), 51–56.

Rutherford, W. J., & Corbin, C. B. (1994). Validation of criterion-referenced standards for tests of arm and shoulder girdle strength and endurance. Research Quarterly for Exercise and Sport, 65, 110–119.

Zhu, W. (1998). Test equating: What, why, how? Research Quarterly for Exercise and Sport, 69, 11–23.
