

ADAPTED PHYSICAL ACTIVITY QUARTERLY, 2000, 17, 95-110 © 2000 Human Kinetics Publishers, Inc.

Interrater Reliability of the Bruininks-Oseretsky Test of Motor Proficiency-Long Form

Brenda N. Wilson, Bonnie J. Kaplan, Susan G. Crawford, and Deborah Dewey Alberta Children's Hospital Research Centre

To examine the reliability of the Bruininks-Oseretsky Test of Motor Proficiency-Long Form (BOTMP-LF), approximately 40 therapists completed a questionnaire on the administration and scoring of this test (72% response rate). A large degree of inconsistency between therapists was found. This prompted a study of interrater reliability of six therapists who received rigorous training on the BOTMP-LF. Results indicated that consistency of scoring between testers was statistically high for the battery, composite, and subtest scores. However, item-by-item agreement was low for many items, and agreement between raters on their diagnosis of the children as having motor problems was only fair to good. There was no difference in interrater reliability of the test for children with and without learning, attentional, or motor coordination problems. Some limitations of the BOTMP-LF are apparent from these studies.

The Bruininks-Oseretsky Test of Motor Proficiency (BOTMP; Bruininks, 1978) is commonly used in North America to assess children suspected of having developmental coordination disorder (DCD; American Psychiatric Association, 1994). Children with DCD have significant, unexplained deficits in motor skills and coordination that affect their academics, sports participation, and activities of daily living (Fox & Lent, 1996; Hoare, 1994). Burton (1992) has stated that "perhaps the most commonly used multipurpose (motor) assessment tool is the BOTMP" (p. 31). In numerous surveys, this statement has proven to be true for physical educators (Sherrill, 1998) and for pediatric occupational and physical therapists (Crowe, 1989; Gowland et al., 1991; Rodger, 1994; Yack, 1989). This test has wide clinical and educational acceptance because it measures skills important to children's development, it is perceived to have good psychometric properties, and, until recently, few other tests existed for the school-age child.

The BOTMP-LF consists of eight subtests, with a total of 46 test items. Four of these subtests combine to form the Gross Motor Composite Score, and three subtests combine to form the Fine Motor Composite Score. These seven subtests, together with a subtest of upper limb coordination, form the Battery Composite Score.

The authors are with the Behavioral Research Unit at Alberta Children's Hospital Research Centre, Calgary, AB T2T 5C7, Canada. Bonnie J. Kaplan and Deborah Dewey are also with the Department of Pediatrics at the University of Calgary, Calgary AB T2T 5C7, Canada.

Test-retest reliability coefficients of the long form, examined with 63 second graders and 63 sixth graders, all without disabilities (Bruininks, 1978), were low-to-moderate (0.68 to 0.88). Coefficients above 0.80 were reported for all subtests except Balance and Response Speed, suggesting that these two subtests should be interpreted with caution.

Interrater reliability on the BOTMP has been evaluated only for the individual test items of Visual Motor Control (Subtest 7). Reliability coefficients ranged from 0.79 to 0.97 for 5 raters with training and 0.63 to 0.97 for 3 graduate students without training (Bruininks, 1978). Median correlations for the total subtest score reached 0.98 for testers who had received training and 0.90 for those who used only the scoring instructions in the manual. Although Bruininks (1978) stated that Subtest 7 was difficult to score, most clinicians find that it is actually the easiest one to rate as it is the only subtest with a written record that can be referred to repeatedly. The seven other subtests all measure a child's actions directly, and scoring must be spontaneous and rapid. It may not be appropriate to assume that the interrater reliability of Subtest 7 will apply to the rest of the BOTMP. Absence of interrater reliability data on these seven other subtests is a major limitation of the BOTMP-LF.

Another concern regarding the standardization sample data of the BOTMP is that these data were gathered from children who did not have motor problems. Gowland et al. (1991) recommended caution in assuming that measures of reliability and validity would be the same for children with motor problems. Hattie and Edwards (1987) and Burton and Davis (1992) also questioned the confidence with which this test can be used in relation to both its reliability and validity. There is a need to examine the usefulness of this test with children who are known to have motor problems (Wilson, Polatajko, Mandich, & Mcnab, 1998).

This paper describes two studies: first, a questionnaire study that assessed how clinicians administered and scored specific items on the BOTMP-LF and second, a study that examined the interrater reliability of the test in a research setting.

Experiment 1

Method

Researchers used a questionnaire to assess the accuracy of clinicians' administration and scoring of specific items of the BOTMP-LF. The 30-item questionnaire (available from the first author) was developed by four testers who regularly used the BOTMP. Each multiple choice question related to an item where the administration and scoring could be difficult or confusing or where the testers suspected large variability in the scoring. An example of a typical question was:

For Subtest IV - Strength, Item 1 - Standing Broad Jump, if the child falls I would:

  • measure to the point that the feet touch the floor
  • measure to the point that the hands touch the floor
  • measure to the point closest to the start line
  • re-administer the item


Sixty questionnaires were sent to four pediatric facilities in Canada that employed physiotherapists (PTs) and occupational therapists (OTs). They were distributed to therapists who were most likely to use the BOTMP. Forty-three of the 60 were returned for a 72% response rate. Six respondents indicated that they rarely used the test (1 - 2 times per year), and their questionnaires were not used in the analysis. Of the remaining 37 respondents, 85% were OTs.

Respondents were asked to indicate how they administered and scored each of 30 selected test items without referring to the manual. This was known to be inconsistent with general practice, as 6% said they relied totally on the manual during administration, 50% said that they frequently referred to the manual during a test session, and 22% reported they referred occasionally to the manual. However, there were practical reasons for this request. First, it is impossible to administer the BOTMP while reading directly from the manual, so it is certain that therapists frequently rely on their memory of the details of at least some of the items, which is what we were evaluating. Second, our questionnaire would have been much less informative if respondents referred to the manual for the correct answers.

Results

Of the 30 questions, only 6 were answered correctly at least 80% of the time; the remainder were correct less than 80% of the time. In other words, 24 of the 30 items we surveyed were reported to be administered or scored differently from the manual, or the writers' interpretation of the manual, at least 20% of the time. Table 1 outlines some of the areas questioned, with percentages of correct and incorrect responses. Some were rather small points of disagreement, but some (e.g., on Response Speed) could result in a large difference in the subtest score. Several items revealed fairly dramatic disagreement on rules that are stated quite clearly in the manual. The results of this survey confirmed clinical impressions: many people have different ideas of how to administer and score this test. This prompted the initiation of the next study.

Experiment 2

The purposes of this study were the following:

1. Examine the interrater reliability on all eight of the subtests (i.e., the consistency of scores on the BOTMP-LF when measured by two independent testers).

2. Compare the interrater reliability of the test for children with and without learning disabilities and for children with and without motor problems.

3. Determine the extent to which specific items are more likely to be rated differently by two professionals.

Method

Participants. The sample consisted of 50 children and adolescents between the ages of 7 years, 1 month and 14 years, 5 months (M = 10.34 years, SD = 1.83). Twenty-six of the 50 children had known learning or attentional problems (LD), or both, as evidenced by their attendance at a special school or clinic designed to meet the needs of children with complex learning problems. Twenty-four of the children did not have any known learning problems (nonLD; Kaplan, Wilson, Dewey, & Crawford, 1998).

Table 1 Percentages of Correct and Incorrect Administration and Scoring of Items (Experiment 1)

| Subtest | Item | Correct | Incorrect |
|---|---|---|---|
| Subtest 1: Running Speed and Agility | | 40% place the block on its side, rather than its end | 60% place the block on its end |
| | | 54% would re-administer the trial | 46% would continue the trial even if the child fell |
| Subtest 3: Bilateral Coordination | Item 6 | 34% were administering/scoring this item correctly | 52% counted all claps; 14% counted only claps done in front of the face |
| | Items 1-5 | 78% counted each tap/jump as one | 22% counted each right-left pattern as one |
| Subtest 4: Strength | Item 1 | 48% measured correctly if child fell | 52% measured an incorrect point if child fell |
| | Items 2 & 3 | 61% started timing correctly | 34% begin to time when the child begins, rather than on "Go" |
| Subtest 5: Upper Limb Coordination | Items 1-5 | 57% counted it as a failure if child stepped off of the mat | 43% re-administer trial if the child stepped off of the mat |
| | Item 5 | 61% place the target level with eyes | 31% place the target level with chin |
| | Item 6 | 45% gave 2 practices | 48% gave only 1 practice |
| | Item 6 | 32% had the child hold pointed finger in front of face | 36% had the child hold pointed finger in front of chest |
| Subtest 6: Response Speed | | 57% position themselves to the left of a right handed child | 43% position themselves incorrectly |
| | | 33% place the tape below the shoulder level | 67% place the tape at the shoulder level |
| | | 57% record the number at the masking tape line | 27% record the number under the child's thumb |
| Subtest 7: Visual Motor Control | Items 5-8 | 15% use lead pencil and permit erasing | 47% use the red pencil; 80% do not permit erasing |
| Subtest 8: Upper Limb Speed and Dexterity | Item 2 | 45% count correct when both picked up and placed simultaneously | 19% accepted approximate picking and placing together |
| | Item 3 | 15% begin timing when the child touches the deck of cards | 83% began timing incorrectly |
| | Item 4 | 29% began timing when bead touched the lace | 71% began timing incorrectly |

Ten of the 38 children who received a full motor assessment were also classified as having DCD. The identification of DCD is commonly based on performance on a test of motor skills of one or more standard deviations below the mean (Henderson & Barnett, 1998). For the larger study in which these children participated, below average performance was defined as one standard deviation below the mean on at least two of six measures of motor skills. These included the Battery Composite, Gross Motor Composite, Fine Motor Composite, Short Form of the BOTMP, Movement Assessment Battery for Children (MABC; Henderson & Sugden, 1992), and the Developmental Coordination Disorder Questionnaire (Wilson, Kaplan, Crawford, Dewey, & Campbell, 1999). See Kaplan et al. (1998) for rationale of this decision.
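The "at least two of six measures" rule can be sketched in a few lines. This is a minimal illustration, assuming every measure has already been converted to a z-score against its own norms and oriented so that lower means poorer performance; the function name, the dictionary keys, the inclusive cutoff, and the example scores are all hypothetical and are not taken from the study.

```python
from typing import Mapping


def below_average_motor(z_scores: Mapping[str, float],
                        cutoff: float = -1.0,
                        min_measures: int = 2) -> bool:
    """Return True when at least `min_measures` z-scores are at or below `cutoff`.

    Assumption: a score exactly one SD below the mean counts as below average.
    """
    low_measures = [name for name, z in z_scores.items() if z <= cutoff]
    return len(low_measures) >= min_measures


# Hypothetical z-scores for one child on the six measures named in the text.
example_child = {
    "battery_composite": -1.3,
    "gross_motor_composite": -0.4,
    "fine_motor_composite": -1.1,
    "botmp_short_form": -0.8,
    "mabc": -0.2,
    "dcd_questionnaire": 0.1,
}

print(below_average_motor(example_child))
```

Requiring two low measures rather than one makes the classification less sensitive to a single poor score on any one instrument.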

Table 2 shows the distribution of the sample by age and gender, and performance on the BOTMP-LF by diagnosis. Using a one-way ANOVA and Chi-square analysis, the LD and nonLD groups did not differ with respect to age, F(1,48) = 0.08, or gender, χ²(1) = 2.88. Using a series of one-way ANOVAs, the children in the nonLD group scored significantly higher than the children in the LD group on the BOTMP Battery Composite, F(1,48) = 20.99, p < .001, the BOTMP Fine Motor Composite, F(1,48) = 12.24, p < .01, and the BOTMP Gross Motor Composite, F(1,48) = 18.55, p < .001. The proportion of children with DCD within the LD and nonLD groups was significantly different, χ²(1) = 9.87, p < .01; 10 children with LD also had DCD but none of the nonLD children did.
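As a rough check on how a two-group one-way ANOVA relates to the summary statistics in Table 2, the F ratio can be rebuilt from group sizes, means, and standard deviations alone. The sketch below uses the Battery Composite values for the LD and nonLD groups; the function name is hypothetical, and because the published means and SDs are rounded, the result should land close to the reported F(1,48) = 20.99 without matching it exactly.

```python
def one_way_f_two_groups(n1, mean1, sd1, n2, mean2, sd2):
    """F ratio for a one-way ANOVA with two groups, from summary statistics."""
    grand_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    # Between-groups sum of squares: group means around the grand mean.
    ss_between = n1 * (mean1 - grand_mean) ** 2 + n2 * (mean2 - grand_mean) ** 2
    # Within-groups sum of squares, recovered from the sample SDs.
    ss_within = (n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
    df_between = 1
    df_within = n1 + n2 - 2
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Battery Composite, Table 2: LD (n = 26) versus nonLD (n = 24).
f_battery, df1, df2 = one_way_f_two_groups(26, 46.38, 9.15, 24, 57.08, 7.14)
print(f"F({df1},{df2}) = {f_battery:.2f}")
```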

We compared the children classified as DCD and nonDCD on age and gender using one-way ANOVAs and Chi-squares. There was no significant group difference for age, F(1,36) = 0.02, and no association between group and gender, χ²(1) = 0.45.
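The two gender comparisons can be reproduced from the male and female counts in Table 2. The following is a minimal sketch of the Pearson chi-square statistic without a continuity correction, which is an assumption about how the reported values were obtained; the function and variable names are illustrative.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Rows are groups, columns are (males, females), taken from Table 2.
ld_by_gender = [[20, 6], [13, 11]]   # LD, nonLD
dcd_by_gender = [[6, 4], [20, 8]]    # DCD, nonDCD

chi2_ld = pearson_chi_square(ld_by_gender)
chi2_dcd = pearson_chi_square(dcd_by_gender)
print(f"LD by gender:  chi2(1) = {chi2_ld:.2f}")
print(f"DCD by gender: chi2(1) = {chi2_dcd:.2f}")
```

Both tables are 2 × 2, so each statistic has one degree of freedom.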

Instrumentation and Procedure. The BOTMP-LF was administered and scored according to standardized procedures described in the manual. The testers consisted of five OTs and one master's level experimental psychologist, all of whom had more than 5 years' experience working with children. Three of the testers had little or no experience using the BOTMP-LF before this study began; two had more than 5 years' experience with this test. All testers, including those with BOTMP-LF experience, completed a clearly defined training procedure, as required by the research laboratory that conducted the larger study from which most of the sample was drawn (Kaplan et al., 1998). Prior to data collection, testers were required to (a) observe the administration of the test by a more experienced tester, (b) "walk through" the administration of the test with the most experienced tester with no child present, (c) memorize test administration and instructions (cue cards for each item were also available during assessments), and (d) perform a practice test on another tester to identify potentially problematic areas of administration. If this step indicated the need for more study, another practice session on another tester was done after more study was completed. Testers were then required to (e) complete one or two assessments on children who were identified as being "easy to test" participants, and (f) test a child while being observed by an experienced tester, who then provided feedback.

Table 2 Composition of Sample (Experiment 2)

| Group | Males | Females | Total | Age (yrs), Mean (SD) | Battery composite | Gross motor composite | Fine motor composite |
|---|---|---|---|---|---|---|---|
| Children with LD | 20 | 6 | 26 | 10.41 (1.85) | 46.38 (9.15) | 47.23 (8.33) | 48.35 (8.45) |
| Children without LD | 13 | 11 | 24 | 10.26 (1.84) | 57.08 (7.14) | 54.67 (6.49) | 59.46 (9.78) |
| Children with DCD | 6 | 4 | 10 | 10.14 (2.19) | 42.70 (10.38) | 43.90 (7.16) | 46.30 (10.01) |
| Children without DCD | 20 | 8 | 28 | 10.25 (1.18) | 55.61 (8.46) | 54.21 (7.43) | 57.11 (11.06) |
| Entire sample | 33 | 17 | 50 | 10.34 (1.83) | 51.52 (9.79) | 50.80 (8.33) | 53.68 (10.62) |

This procedure is acknowledged to be much more intensive than that which would normally be provided in a clinical and educational setting. It was required due to the need for careful assessment of motor skills as part of the larger study mentioned, but may not have simulated real conditions of the test's use. Despite this rigorous training schedule, the testers found many differences in how each one administered and scored certain parts of the BOTMP-LF.

In examining the interrater reliability, the actual assessment session was used to rate the child. All raters acted as both testers and observers, assigned according to convenience. Data were collected by a tester who administered and scored the test as usual and by an observer who independently scored the child's responses. The tester and observer did not look at each other's score sheets or indicate the child's score by counting responses aloud or giving the child direct feedback. The tester was careful not to stop timing an item when the child had, in her opinion, correctly completed an item. Ending the item before the maximum time limit would have indicated to the observer that the child's response had been scored as a pass within the time limit; extending the time to the maximum enabled the observer to make her decision of the child's success independently. Testers and observers were usually blind as to whether participants had a diagnosis of LD, DCD, or neither diagnosis.

Design and Analysis. The standard scores of the Battery Composite, the Gross Motor Composite, the Fine Motor Composite, and each individual Subtest were used in the analyses. Pearson product-moment correlation coefficients tend "to ignore the magnitude of the discrepancy on individual pairs of ratings. As a result, testers can be far apart on individual measurements, but, as long as the trend of their ranking is similar, the extent of the correlation will be high, giving the impression of better agreement than actually exists" (Cicchetti & Conn, 1976, pp. 375-376). This bivariate statistic, therefore, has limitations as a measure of interrater agreement and, for this reason, intraclass correlation coefficients (ICCs) were computed. ICCs provide an estimate of reliability while accounting for changes in the standard deviation from one test to the next. They allow the assessment of how much of the variability among the six observers was due to the raters and how much was due to differences among the participants and their test scores.
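The limitation of the Pearson coefficient described above is easy to see with a small constructed case. In this sketch the scores are hypothetical: the observer scores every child exactly 8 points higher than the tester, so the two raters never agree, yet the correlation is perfect.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical standard scores: the observer is always 8 points higher.
tester = [40, 45, 50, 55, 60]
observer = [score + 8 for score in tester]

r = pearson_r(tester, observer)
exact_agreement = sum(a == b for a, b in zip(tester, observer)) / len(tester)
print(f"Pearson r = {r:.2f}, exact agreement = {exact_agreement:.0%}")
```

A constant offset leaves the ranking of the children untouched, which is all the Pearson coefficient responds to; an agreement-type ICC penalizes the offset.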

The form of ICC utilized is determined by the study design; we chose the ICC(2,1) model to measure the agreement of the judges (Shrout & Fleiss, 1979). In this model, judges were considered random effects and the research question was whether the judges could be considered interchangeable. Treating the judges as random effects was deemed appropriate since we wished to have an estimate of the reliability that could be expected among other randomly selected testers. SPSS for Windows Version 6.0 was used to calculate the appropriate mean squares, and the ICCs were calculated directly from this information.
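The step from mean squares to ICC(2,1) can be sketched as follows, using the two-way random effects formula from Shrout and Fleiss (1979). This is an illustration rather than the authors' SPSS procedure: the function name is hypothetical, and the example ratings (six targets rated by four judges) are illustrative values, not data from this study, where each child was rated by two raters.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per target (child), each row holding
    one score per judge (rater).
    """
    n = len(ratings)          # number of targets
    k = len(ratings[0])       # number of judges
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(col) / n for col in zip(*ratings)]

    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_targets = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_judges = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = ss_total - ss_targets - ss_judges

    bms = ss_targets / (n - 1)               # between-targets mean square
    jms = ss_judges / (k - 1)                # between-judges mean square
    ems = ss_error / ((n - 1) * (k - 1))     # residual mean square

    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Illustrative ratings only (6 targets x 4 judges).
example_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

print(f"ICC(2,1) = {icc_2_1(example_ratings):.2f}")
```

Because the between-judges mean square appears in the denominator, systematic differences between raters lower the coefficient, which is what makes this form suitable for asking whether judges are interchangeable.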

Scoring of individual items was also analyzed by printing out the actual point score from the tester and from the observer on each item for each child, and adding up the times that the tester and observer disagreed. If the tester and observer agreed on an item for every participant in this study, the percentage of disagreement was zero. If, however, they disagreed on the point score for 18 of the 50 participants, the percentage of disagreement was 18/50, or 36%.
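The item-level tally amounts to counting mismatched point scores across children. A minimal sketch, with a hypothetical function name and made-up point scores arranged to reproduce the 18-of-50 example from the text:

```python
def percent_disagreement(tester_points, observer_points):
    """Percentage of participants for whom the two raters' point scores differ."""
    if len(tester_points) != len(observer_points):
        raise ValueError("Both raters must score the same participants.")
    disagreements = sum(t != o for t, o in zip(tester_points, observer_points))
    return 100.0 * disagreements / len(tester_points)


# Hypothetical point scores on one item for 50 children:
# the raters differ on the first 18 and agree on the remaining 32.
tester_points = [3] * 50
observer_points = [2] * 18 + [3] * 32

pct = percent_disagreement(tester_points, observer_points)
print(f"{pct:.0f}% disagreement")
```

This count ignores how far apart the two scores are, which is why it can be high even when correlational statistics are also high.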

Analyzing the data another way, the raw proportion of observed agreement between testers corrected for chance (Kappa) was calculated for the Battery, Gross Motor and Fine Motor Composite Standard Scores, to examine the consistency of the diagnostic decisions made by all combinations of raters (Riggin, Ulrich, & Ozmun, 1990). The consistency of the diagnosis of DCD or nonDCD was calculated with a 2 × 2 contingency table containing information about the proportion of children classified as DCD or not by two raters. Values of Kappa over 0.75 indicate excellent agreement beyond chance, while values from 0.40 to 0.75 indicate fair to good agreement (SPSS, 1996).
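A chance-corrected agreement of this kind can be computed directly from the four cells of the 2 × 2 table. The sketch below uses the standard Cohen's kappa formula; the function names and the cell counts are hypothetical and are not the study's data, and the label for values below 0.40 is left generic because the text gives none.

```python
def cohen_kappa_2x2(a, b, c, d):
    """Cohen's kappa for two raters and two categories.

    a: both raters say nonDCD
    b: tester says DCD, observer says nonDCD
    c: tester says nonDCD, observer says DCD
    d: both raters say DCD
    """
    n = a + b + c + d
    p_observed = (a + d) / n
    tester_non, tester_dcd = (a + c) / n, (b + d) / n
    observer_non, observer_dcd = (a + b) / n, (c + d) / n
    # Agreement expected by chance from the two raters' marginal proportions.
    p_chance = tester_non * observer_non + tester_dcd * observer_dcd
    return (p_observed - p_chance) / (1 - p_chance)


def describe_kappa(kappa):
    """Bands quoted in the text: over 0.75 excellent, 0.40 to 0.75 fair to good."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "fair to good"
    return "below 0.40"


# Hypothetical counts for 50 children.
kappa = cohen_kappa_2x2(a=40, b=3, c=2, d=5)
print(f"kappa = {kappa:.2f} ({describe_kappa(kappa)})")
```

Raw agreement in this made-up table is 90%, but most of it is expected by chance because nonDCD classifications dominate, so kappa is considerably lower.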

Results

There were no significant differences in interrater agreement for the group of children with LD and those without, nor were there differences in interrater agreement for the group with DCD and those without motor problems. Therefore, the groups were combined and all analyses were done with the entire sample.

Table 3 shows the ICCs for the Gross Motor, Fine Motor, and Battery Composite Standard scores. Values are very high (0.82-0.96), indicating that the magnitude of the difference between tester scores was quite small and that there were no systematic multiplicative differences across testers or between raters acting as either testers or observers.

ICCs were also calculated for the Subtest Standard scores (Table 4). All subtests had an acceptable level of interrater reliability, but correlations were slightly lower than for Composite Standard Scores. The percentage of times that the scoring disagreement between the tester and the observer resulted in a difference in the Subtest Point score is also given. It is interesting to note that the Balance Subtest had the lowest ICCs and also a high percentage of disagreement, as would be expected. The Response Speed subtest had the highest correlations and the lowest percentage of disagreement. The highest percentage of disagreement, 74%, was found for Subtest 8 (Speed and Dexterity), but the overall correlation was acceptable.

Although the Composite Standard score and the Subtest Standard score correlation coefficients were high, there was a possibility that the large number of items in the test (46) may have minimized actual differences in the scoring done by two raters. We therefore calculated item-by-item disagreement between the testers. This was done by counting, for each item, the number of participants in the sample for whom the tester and the observer assigned different point scores.

Table 3 Intraclass Correlation Coefficients for Composite Standard Scores (Experiment 2)

Columns: Battery composite, Gross motor composite, Fine motor composite.
Rows: Entire sample, LD (n = 26), NonLD (n = 24), DCD (n = 10), NonDCD (n = 28).
[Coefficient values are not preserved in the source.]

Note. All correlations significant at p < .001.


Table 4 Intraclass Correlations and Percentage of Times Testers Disagreed for Subtest Standard Scores (Experiment 2)

Columns: ICCs, % of disagreement.
Rows:
Subtest 1: Running Speed and Agility
Subtest 2: Balance
Subtest 3: Bilateral Motor Coordination
Subtest 4: Strength
Subtest 5: Upper Limb Coordination
Subtest 6: Response Speed
Subtest 7: Visual Motor Control
Subtest 8: Speed and Dexterity
[ICC and percentage values are not preserved in the source.]

Note. All correlations significant at p < .001.

Table 5 shows that the percentage of disagreement for each item ranged from 0 to 74%, with 17 of the 46 test items (over one-third) being prone to disagreement between testers more than 20% of the time. This indicates that testers disagreed often, even though the magnitude of their disagreement (as analyzed by correlational statistics) was relatively small in most cases.

Table 6 shows the level of agreement between the two raters for the Battery, Gross Motor, and Fine Motor Composite Standard Scores, using Kappa (SPSS, 1996). Agreement corrected for chance was fair for the Battery Composite and Fine Motor Composite scores, and good for the Gross Motor Composite Score, but none met the accepted level of Kappa set at 0.75. Although correlations were statistically significant, agreement between raters was only fair when analyzed by this method.

The consistency of decisions made by testers regarding impairment or nonimpairment was also examined. Figure 1 reports the agreement between two testers in their assignment of children to the DCD or nonDCD categories, based on the Battery Composite Standard score. Inconsistent decisions were made four times in this sample of 48 children. Although there was a significant association between the testers' and the observers' classification, χ²(1) = 21.26, p < .001, the Kappa value (0.64) did not meet the accepted level of agreement (0.75). A Kappa of 0 would indicate random or chance assignment to the classification, and one of 0.75 would indicate that the measurement procedures are 75% better than random assignment. Agreement between raters on their classification of children's performance in this study was only 64% better than that which would occur by chance.

Discussion

The results of a questionnaire to OTs and PTs in Canada confirmed clinical impressions of the BOTMP-LF: many professionals are not administering and scoring the test according to the standardized procedures presented in the manual.


Table 5 Item-by-Item Disagreement Between Tester and Observer (Experiment 2)

[The "% of disagreement" column values are not preserved in the source.]

Subtest 1: Running Speed and Agilityᵃ

Subtest 2: Balance
Item 1: Standing on preferred leg on floor
Item 2: Standing on preferred leg on balance beam
Item 3: Standing on preferred leg on balance beam, eyes closedᵃ
Item 4: Walking forward on walking line
Item 5: Walking forward on balance beam
Item 6: Walking forward heel-to-toe on walking lineᵃ
Item 7: Walking forward heel-to-toe on balance beamᵃ
Item 8: Stepping over response speed stick on balance beam

Subtest 3: Bilateral Coordinationᵃ
Item 1: Tapping feet alternately while making circle with fingers
Item 2: Tapping, foot and finger on same side synchronized
Item 3: Tapping, foot and finger opposite side synchronized
Item 4: Jumping in place, leg and arm on same side synchronized
Item 5: Jumping in place, leg and arm on opposite side synchronized
Item 6: Jumping up and clapping hands
Item 7: Jumping up and touching heels with hand
Item 8: Drawing lines and crosses simultaneously

Subtest 4: Strengthᵃ
Item 1: Standing broad jump
Item 2: Sit-ups
Item 3: Knee push-up and full push-upsᵃ

Subtest 5: Upper Limb Coordinationᵃ
Item 1: Bouncing a ball and catching it with both hands
Item 2: Bouncing a ball and catching it with preferred hand
Item 3: Catching a tossed ball with both hands
Item 4: Catching a tossed ball with preferred handᵃ
Item 5: Throwing a ball at a target with preferred handᵃ
Item 6: Touching a swinging ball with preferred handᵃ
Item 7: Touching nose with index fingers
Item 8: Touching thumb to fingertips, eyes closed
Item 9: Pivoting thumb and index finger

Subtest 6: Response Speedᵃ

Subtest 7: Visual Motor Controlᵃ
Item 1: Cutting out a circle with preferred hand
Item 2: Drawing a line through a crooked path with preferred hand
Item 3: Drawing a line through a straight path with preferred hand
Item 4: Drawing a line through a curved path with preferred handᵃ
Item 5: Copying a circle with preferred hand
Item 6: Copying a triangle with preferred hand
Item 7: Copying a horizontal diamond with preferred hand
Item 8: Copying overlapping pencils with preferred handᵃ

Subtest 8: Upper Limb Speed and Dexterityᵃ
Item 1: Placing pennies in a box with preferred hand
Item 2: Placing pennies in two boxes with both handsᵃ
Item 3: Sorting shape cards with preferred hand
Item 4: Stringing beads with preferred handᵃ
Item 5: Displacing pegs with preferred handᵃ
Item 6: Drawing vertical lines with preferred handᵃ
Item 7: Making dots in circles with preferred handᵃ
Item 8: Making dots with preferred handᵃ

ᵃIndicates those items and subtests for which disagreement occurred more than 20% of the time.

Table 6 Agreement Between Testers as Measured by Kappa (Proportion of Observed Agreement Adjusted for Chance)

                 Battery composite    Gross motor composite    Fine motor composite
                 (n = 48)             (n = 48)                 (n = 50)

Agreement        0.64*                0.70*                    0.62*

**Above 0.75; excellent agreement. *0.4 to 0.75; fair to good agreement.
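The kappa values in Table 6 adjust the observed agreement for the agreement expected by chance. As a minimal sketch, assuming the standard two-rater Cohen's kappa, kappa = (po - pe) / (1 - pe), the calculation can be reproduced from the four cell counts reported in Figure 1 for the Battery Composite (a = 39, b = 2, c = 2, d = 5). The function name `cohen_kappa_2x2` is ours and is only illustrative; it is not taken from the study's analysis software.

```python
def cohen_kappa_2x2(a, b, c, d):
    """Return (po, pe, kappa) for a 2x2 agreement table of counts.

    a: both raters classify the child as nonDCD
    b: tester says DCD, observer says nonDCD
    c: tester says nonDCD, observer says DCD
    d: both raters classify the child as DCD
    """
    n = a + b + c + d
    # Observed agreement: proportion of children on the diagonal
    po = (a + d) / n
    # Marginal proportion of nonDCD classifications for each rater
    observer_non = (a + b) / n
    tester_non = (a + c) / n
    # Agreement expected by chance from the marginals
    pe = observer_non * tester_non + (1 - observer_non) * (1 - tester_non)
    if pe == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    kappa = (po - pe) / (1 - pe)
    return po, pe, kappa


# Cell counts from Figure 1 (Battery Composite, n = 48)
po, pe, kappa = cohen_kappa_2x2(39, 2, 2, 5)
print(f"po = {po:.3f}, pe = {pe:.3f}, kappa = {kappa:.3f}")
# po = 0.917, pe = 0.751, kappa = 0.666
```

Computed from the unrounded counts, kappa is about 0.67, slightly above the 0.64 reported in Table 6. The small difference may reflect rounding of the cell proportions before the calculation. Either value falls in the 0.4 to 0.75 band labelled fair to good agreement.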


                        Tester: nonDCD    Tester: DCD     Row total
Observer: nonDCD        a  39 (0.81)      b  2 (0.04)     41 (85%)
Observer: DCD           c  2 (0.04)       d  5 (0.10)      7 (15%)
Column total               41 (85%)          7 (15%)      48 (100%)

a = proportion of students classified as nonDCD by the tester and the observer
b = proportion of students classified as DCD by the tester and nonDCD by the observer
c = proportion of students classified as nonDCD by the tester and DCD by the observer
d = proportion of students classified as DCD by the tester and the observer

Figure 1 - Calculation of po and kappa for the Battery Composite score.

Subsequently, a study of the interrater reliability within a research setting was conducted. This demonstrated that the consistency between the composite scores obtained when two extensively trained testers observed and rated the same child was good from a statistical standpoint. However, the low item-by-item agreement of the point scores indicated that testers do not observe or rate the same behaviors in the same way. Different testers appear to disagree very often, although the magnitude of the difference may be small. The score appears to be influenced by the person testing the child and by the testers' individual standards and differing levels of severity of grading (Lunz & Stahl, 1990).

In a different type of analysis, the interrater reliability of the BOTMP-LF could only be considered moderate, as there was a great deal of inconsistency between professionals' ratings of children as DCD or nonDCD (average or below average). In a review of the BOTMP, Hattie and Edwards (1987) reached a similar conclusion and even stated that "the test has little value in providing dependable scores and any decisions based on the test are suspect" (p. 111). The use of different professionals to administer or readminister the test will likely further limit the valid interpretation of the results.

The analyses indicated that the large number of items in the BOTMP-LF may minimize differences between item point scores of different testers when the total composite score is calculated. However, if a child's specific performance is examined on individual items, as is often done in clinical and educational settings, reliability of individual item scores is lower. Consistency between testers is highest when the Composite Standard Scores are compared, but the level of agreement declines when the subtest scores are used and is further lowered when point (raw) scores are considered. However, individual subtests are used more than composite scores in many settings even though their reliability may not be adequate (King-Thomas & Hacker, 1987).

One factor contributing to the low item-by-item agreement between testers is that the criteria for some items are not adequately defined to allow testers to account for the variations that commonly occur between children. This is seen clearly in the Balance Subtest items, where some children can stand on one foot without "moving a muscle," while others achieve balance only with a large amount of effort and extraneous movement. Compared to some other tests of motor skills, such as the MABC, the criteria for administration and scoring of the BOTMP are sometimes ambiguous. As Burton and Miller (1998) note, ". . . if the procedures (for testing) are too vague, the variability between examiners may be a confounding factor" (p. 101). In addition, the time frames (whether an item is done in 10, 15, or 20 s) and specific conditions of administration (correct responses in a time period vs. amount of time required to correctly complete seven tasks) change frequently from one item to the next; this may introduce confusion and increase the opportunity for error.

The generalizability of this study is limited by the degree of training required by this research setting compared to what is usually available in a clinical or educational setting: (a) one experienced tester acted as a trainer for all other testers; (b) the training period and procedures were extensive; (c) early assessments by all testers were observed by more experienced testers; (d) the testers examined in this study frequently used the BOTMP-LF, which may be used more sporadically in a clinical and educational setting; and (e) all marking and calculations of the BOTMP scores were double checked by another tester. These factors very likely reduced differences between the testers, so agreement between the testers in this study is probably higher than would occur elsewhere. A similar study in a clinical and educational setting would probably result in much lower agreement between testers.

We found no evidence that interrater reliability was different for children with learning or motor problems, or both, compared to those without. This gives us greater confidence in using this test for children with motor coordination problems (Gowland et al., 1991).

Clinical and Educational Implications

One implication for educators and clinicians who use this test is that the test administration and evaluation should be reviewed carefully and regularly. The same professional should do both testings when the test is readministered. Wilson et al. (1995) have recommended that the point (raw) scores be examined to determine a child's progress over time. If subtest scores and performance on individual items are used in the reevaluation of a child's motor performance, it is even more imperative that the same tester be used. Implications specific to each subtest are summarized:


1. The Running Speed and Agility Subtest was frequently administered incorrectly and testers disagreed on the score 1/3 of the time. As there is only one item in this subtest, chances of error are high and the standard error of measurement (SEM) is large (Bruininks, 1978).

2. The Balance Subtest had the lowest interrater reliability and the highest percentage of disagreement; four of the subtest's eight items were problematic in their administration. With the low test-retest reliability and large SEM reported in the manual, many limitations for the interpretation of this subtest are apparent. Burton and Miller (1992) and King-Thomas and Hacker (1987) also consider this subtest to be of questionable validity.

3. The Bilateral Coordination Subtest had a fairly high level of disagreement for the subtest score (54%) but good agreement for each individual item in this subtest.

4. The Strength Subtest had a moderate level of disagreement (44%), and the push-up item was observed to have many inconsistencies in its administration.

5. The Upper Limb Coordination Subtest was similar to the Balance Subtest in its percentage of disagreement (62%) and in the high number of items that were problematic (six out of 8). It also had a large SEM.

6. The Response Speed Subtest (one item) had a surprisingly low percentage of disagreement considering the differences observed in how testers administer and score the one item of this subtest. Its reported SEM was very large.

7. The Visual Motor Control and Upper Limb Speed and Dexterity Subtests, which both involve visual motor integration and dexterity, were found to have the highest percentages of disagreement and some of the lowest levels of interrater agreement. Many of the items in each subtest were observed to have been administered or scored incorrectly, or both, especially relating to timing.

Further Research

This paper describes the first step in the establishment of the interrater reliability of the BOTMP-LF. While our findings in a research lab indicate some implications for the use of this test in a clinical and educational setting, the results of Experiment 2 likely overestimate the reliability. It is imperative that further examination of the consistency of scoring between professionals be done in clinical and school settings to allow professionals more confidence in interpreting the results of their testings. Further research could also examine reliability across testers from different professions.

Two groups of children were poorly represented in this study. The group of children with DCD was quite small. Although we found no difference in the reliability of the test when used with these children compared to age-matched controls, it is important to replicate this research with other samples of children. Sampling would be most relevant if the children were chosen from a group referred for treatment due to their motor problems, as opposed to our sampling, which was largely based on a group of children with learning or attention problems, or both, who may or may not have also had DCD. Another group excluded from this study was 5- and 6-year-olds. The reliability of the test with this age group is important to establish. Finally, the usefulness of this test in identifying a child with DCD needs to be evaluated further.


References

American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: American Psychiatric Association.

Bruininks, R.H. (1978). Bruininks-Oseretsky Test of Motor Proficiency: Examiner's manual. Circle Pines, MN: American Guidance Service.

Burton, A.W., & Davis, W.E. (1992). Assessing balance in adapted physical education: Fundamental concepts and applications. Adapted Physical Activity Quarterly, 9, 14-46.

Burton, A.W., & Miller, D.E. (1998). Movement skill assessment. Champaign, IL: Human Kinetics.

Cicchetti, D.V., & Conn, H.O. (1976). A statistical analysis of reviewer agreement and bias in evaluating medical abstracts. Yale Journal of Biology and Medicine, 49, 373-383.

Crowe, T.K. (1989). Pediatric assessments: A survey of their use by occupational therapists in northwestern school systems. The Occupational Therapy Journal of Research, 9, 273-286.

Fox, M.A., & Lent, B. (1996). Clumsy children: Primer on developmental coordination disorder. Canadian Family Physician, 42, 1965-1971.

Gowland, C., King, G., King, S., Law, M., Letts, L., MacKinnon, L., Rosenbaum, P., & Russell, D. (1991). Review of selected measures in neurodevelopmental rehabilitation. Hamilton, Ontario: Neurodevelopmental Clinical Research Unit.

Hattie, J., & Edwards, H. (1987). A review of the Bruininks-Oseretsky Test of Motor Proficiency. British Journal of Educational Psychology, 57, 104-113.

Henderson, S.E., & Barnett, A.L. (1998). The classification of specific motor coordination disorders in children: Some problems to be solved. Human Movement Science, 17, 449-470.

Henderson, S.E., & Sugden, D.A. (1992). Movement assessment battery for children. Kent, UK: The Psychological Corporation.

Hoare, D. (1994). Subtypes of developmental coordination disorder. Adapted Physical Activity Quarterly, 11, 158-169.

Kaplan, B.J., Wilson, B.N., Dewey, D.M., & Crawford, S.G. (1998). DCD may not be a discrete disorder. Human Movement Science, 17, 471-490.

King-Thomas, L., & Hacker, B.J. (1987). A therapist's guide to pediatric assessment. Boston: Little, Brown and Co.

Lunz, M.E., & Stahl, J.A. (1990). Judge consistency and severity across grading periods. Evaluation and the Health Professions, 13, 425-444.

Riggin, K.J., Ulrich, D.A., & Ozmun, J.C. (1990). Reliability and concurrent validity of a test of motor impairment - Henderson revision. Adapted Physical Activity Quarterly, 7, 249-258.

Rodger, S. (1994). A survey of assessments used by pediatric occupational therapists. Australian Occupational Therapy Journal, 41, 137-142.

Sherrill, C. (1998). Adapted physical activity, recreation and sport: Crossdisciplinary and lifespan (5th ed.). Boston: WCB/McGraw-Hill.

Shrout, P.E., & Fleiss, J.L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.

SPSS. (1996). SPSS base 7.0 applications guide. Chicago: SPSS Inc.

Wilson, B.N., Kaplan, B.J., Crawford, S.G., Dewey, D., & Campbell, A. (in press). Reliability and validity of a parent questionnaire on childhood motor skills. American Journal of Occupational Therapy.


Wilson, B.N., Polatajko, H.J., Kaplan, B.J., & Faris, P.D. (1995). Use of the Bruininks-Oseretsky Test of Motor Proficiency in occupational therapy. American Journal of Occupational Therapy, 49, 8-17.

Wilson, B.N., Polatajko, H.J., Mandich, A.D., & Mcnab, J.J. (June, 1998). Standardized measures: How well do they identify children and adolescents with DCD. Paper presented at the World Federation of Occupational Therapy, Montreal, Canada.

Yack, E. (1989). Sensory integration: A survey of its use in the clinical setting. Canadian Journal of Occupational Therapy, 56, 229-235.

Authors' Note

We wish to acknowledge the financial support of the Alberta Children's Hospital Foundation and Alberta Mental Health, and the assistance of Anne Robillard, OT(C), for coordination of parts of this study. We also thank the children and families who participated in this research.