
Evaluating EDI* Participant Reactions via Different Response Scales: A Technical Review _______________________ Keshav Gaur William A. Eckert

*The World Bank Institute (WBI) was formerly the Economic Development Institute (EDI), as reflected in the text of this publication.

WBI Evaluation Studies Number ES99-17 World Bank Institute The World Bank Washington, D.C.


Copyright © 1999 The International Bank for Reconstruction and Development/The World Bank 1818 H Street, N.W. Washington, D.C. 20433, U.S.A. The World Bank enjoys copyright under protocol 2 of the Universal Copyright Convention. This material may nonetheless be copied for research, educational, or scholarly purposes only in the member countries of The World Bank. Material in this series is subject to revision. The findings, interpretations, and conclusions expressed in this document are entirely those of the author(s) and should not be attributed in any manner to The World Bank, to its affiliated organizations, or the members of its Board of Directors or the countries they represent. If this is reproduced or translated, WBI would appreciate a copy.

Executive Summary

Donald L. Kirkpatrick¹ defines four levels of evaluation: evaluating reaction (level 1), evaluating learning (level 2), evaluating behavior (level 3), and evaluating results (level 4). The Evaluation Unit (EU) of the Economic Development Institute evaluates EDI's activities (seminars, conferences, workshops, courses, etc.) on these four levels. All activities are evaluated at level 1 and a selected few at higher levels. The Unit has evaluated, on average, about 200 activities at level 1 annually over the past five years. This paper deals with level 1 evaluations, which measure how participants react to the programs.

For level 1 evaluations, the Evaluation Unit has shifted to a 5-point Likert-type scale from the 6-point scale that was the norm until November 1997. This study addresses two main issues arising from that change: first, the appropriateness of a 5-point scale over a 6-point scale; second, the comparison and conversion of ratings obtained on the two different scales. In particular, we seek a suitable method for converting scores from one scale to the other, so that scores obtained on the 5-point scale can be compared with past scores obtained on the 6-point scale.

The methodology involved gleaning insights from past research on the issue and analyzing the results of an experiment conducted by the Evaluation Unit. The results of the experiment, together with further analysis of the EU's past data, show that a 5-point scale is more appropriate for evaluating EDI's activities and that there is a suitable method for converting and comparing scores across scales. An odd rather than an even number of response alternatives is preferable when the respondent can legitimately adopt a neutral position, and a 6-point scale provides no neutral alternative. We found that in EDI's past evaluations, based on 6-point scales, this constraint biased the overall scores. The overall opinion formed about a seminar influences a participant's responses to questions on which he/she is neutral: since a 6-point scale has no midpoint, he/she is forced to choose either "3" or "4" for such questions. A "4" is chosen more often when overall opinions are strongly favorable and a "3" more often when they are not, making good performance look better and bad performance look worse. There is clearly a need to offer participants a "neutral" alternative, which is precisely the objective of a 5-point scale.

When the same activity is measured on the two scales, the mean score on the 6-point scale is greater than the mean score on the 5-point scale; the variance, however, does not differ significantly when the scale shrinks or grows by one point (this may not hold if the scale changes by more than one point). This finding provides a basis for converting and comparing ratings on a 5-point scale to those on a 6-point scale by adjusting the means accordingly. We have attempted to find the best point estimate of the mean difference between scores on the two scales, for activities of the same quality and for questions falling into four categories: relevancy, course content/usefulness, meeting of objectives, and worthwhile use of time. This difference, 0.78, can be used as a benchmark for converting scores obtained on the 5-point scale for comparison with earlier scores obtained on the 6-point scale. For example, if the mean score on a 5-point scale for a question about a future activity is 4.5, the comparable figure on a 6-point scale is 4.5 + 0.78 = 5.28.

¹ Donald L. Kirkpatrick (1994). Evaluating Training Programs: The Four Levels.


Introduction

Background: Shift from a six-point to a five-point scale

The Economic Development Institute (EDI) conducts over 400 training activities annually, including seminars, conferences, workshops, and courses. These activities are evaluated by the Evaluation Unit (EU), a part of EDI. Evaluation activities conducted by the EU generally correspond to the four levels of evaluation defined by Kirkpatrick²: evaluating reaction (level 1), evaluating learning (level 2), evaluating behavior (level 3), and evaluating results (level 4). All EDI activities are evaluated at level 1 and a selected few at higher levels. The Evaluation Unit has evaluated, on average, about 200 level 1 activities annually for the past five years. This paper uses data from these level 1 evaluations, which measure how participants react to EDI-sponsored training. One important feature of these level 1 evaluations has been the use of a Likert³-type scale to measure participants' responses at the end of an activity. For most of its evaluative work, the Evaluation Unit used a 6-point scale, where 1 corresponded to "Not At All" and 6 corresponded to "Exceeded Expectations." Questions were framed so that a higher point corresponded to a higher assessed quality of the event. The mean scores from these responses are used as quantitative indicators of the quality of EDI activities, and the data/results are given to the course organizers, the Task Managers (TMs). Although the questions differed from event to event, the scale used was the same. A sample questionnaire using this 6-point scale appears in Appendix A. Such questionnaires and the 6-point scale were used for more than five years to evaluate EDI activities. In November 1997, the Evaluation Unit decided to shift to a 5-point scale and to phase out the 6-point scale by the end of August 1998. The decision was made by Evaluation Unit staff who believed that a 5-point scale would produce more valid responses, a view consistent with current research and practice. Two examples of questionnaires using the 5-point scale are shown in Appendix B.

The shift from a 6-point to a 5-point scale may be an issue of concern within EDI, particularly among Task Managers, who are the principal users of evaluation findings within EDI's divisions. Reported results not only give them a quantitative assessment of their activities; the same results also form a benchmark for comparisons, especially when tracking the performance of an activity over time. An important concern is that results from activities evaluated on a 5-point scale cannot easily be compared with past results obtained on a 6-point scale. This concern would be a strong argument against changing the scale unless there were some equally compelling reason for the change. Our paper addresses this concern. In the following sections, we provide evidence for making the change to a 5-point scale and a method for comparing results on the previously used 6-point scale with those obtained on the new 5-point scale.

² Donald L. Kirkpatrick (1994). Evaluating Training Programs: The Four Levels.
³ Likert, R. "A Technique for the Measurement of Attitudes." Archives of Psychology, no. 140, 1932.


Why use a five point scale?

Past research and theoretical framework

Past research has addressed how many points a scale should have and whether a midpoint for neutral/average responses makes a difference. The debate dates back to 1915, when Boyce (1915) reviewed the number of alternatives employed in scales used to evaluate the efficiency of teachers, and it has continued for over 80 years. A seminal article by Eli P. Cox III (1980) summarized the research literature on the optimal number of response alternatives for a scale. According to Cox,

"...as the number of response alternatives is increased beyond some minimum, the demands placed upon a respondent become sufficiently burdensome that an increasing number of discrepancies occur between the true scale value and the value reported by the respondent. Thus, although the information transmission capacity of a scale is improved by increasing the number of response alternatives, response error seems to increase concomitantly. Accordingly, one can hypothesize that the relationship between the amount of information actually transmitted by a scale and the number of response alternatives it provides is similar to that shown in Figure 1⁴. It can be argued that the optimal number of response alternatives is found at the point where the amount of transmitted information is maximized..."

[Figure: transmitted information (vertical axis) plotted against the number of response alternatives, 0 through 9 (horizontal axis).]

Figure 1. Relationship between the amount of information transmitted by a scale and the number of its response alternatives (adapted from Eli P. Cox III, 1980)

⁴ Likert's original monograph describing his technique, an important landmark in attitude measurement.


There is no definitive criterion or formula for deciding the optimal number of response alternatives under all circumstances, but past research suggests broad guidelines. Scales with two or three responses are generally inadequate for transmitting full information and may not give respondents sufficient alternatives, while the marginal returns from using more than nine responses are minimal. Fowler (1995), for example, finds that "Five to seven points appear sufficient for meaningful responses in most rating tasks." For subject-centered scales, Cox (1980) found that five alternatives appear adequate for individual items and suggested that energy is best spent on increasing the number of quality items constituting the composite scale. This same body of research also shows, however, that changing scales and labels may change the distribution of responses significantly (Schwarz et al., 1991).

The theoretical justification for an odd-point scale is that it has a specific midpoint denoting an "average" response, or a "neutral" response when moving from a negative to a positive range. On an even-point scale, by contrast, a participant who wants to respond "average" or "neutral" is forced to choose in a specific direction (for example, 3 or 4 on a 6-point scale), neither of which is exactly the midpoint. The explicit midpoint plays a crucial role. "An odd rather than an even number of response alternatives is preferable under circumstances in which the respondent can legitimately adopt a neutral position" (Cox, 1980). "Offering an explicit middle alternative in a forced-choice attitude item increases the proportion of respondents in that category. On most issues the increase is in the neighborhood of 10 to 20%, but it may be considerably larger" (Schuman & Presser, 1981). In other words, some participants prefer to remain neutral on various issues, and when they are offered a neutral response, roughly 10 to 20% of them select it. Conversely, an even-point scale may introduce bias, especially when overall responses tend to be very high or very low: lacking a neutral alternative, participants tend to select a response in the direction of their overall response level. On a 6-point scale, for example, a 4 is more likely to be selected when overall responses are high and a 3 when overall responses are low. This can bias results by making good results better and bad results worse.

Test for Bias in the EDI 6-point scale

A justification for moving from a 6-point to a 5-point scale in EDI would be the presence of the pattern of bias explained above. To determine whether such bias exists in the 6-point scale, and in what direction, we analyzed data for a five-year period (1993 to 1997), using 5,902 participants' observations from all 214 Senior Policy Seminars organized by EDI during that time. All seminars used the same basic questionnaire, shown in Appendix A. We identified 8 common questions and examined whether the absence of a midpoint on the 6-point scale introduced bias when a participant had no way to express a neutral opinion and was forced to choose either 3 or 4.

The methodology we used to test for the presence of bias was to identify participants who gave strongly favorable or not-so-strongly favorable responses to an activity, and then study their specific responses at points 3 and 4 on the 6-point scale. Out of the 8 common questions we selected (see Appendix A), if a participant responded "5" or "6" (on the 6-point scale) on more than 5 questions, we classified that respondent as holding a "strong favorable" opinion of that activity. Alternatively, if the participant responded "5" or "6" on only 2 or fewer questions, we classified him/her as holding a "not very strong favorable" opinion. After classifying the participants, we studied their responses to the remaining questions and counted how frequently they chose 3 or 4 on the 6-point scale.
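To make the tallying procedure concrete, the following is a minimal sketch of the classification-and-count step, assuming a hypothetical DataFrame `responses` with one row per participant and columns q1 through q8 holding the 6-point ratings for the eight common questions. The column names and data layout are illustrative, not EDI's actual files.

    import pandas as pd

    def tally_midpoint_choices(responses: pd.DataFrame) -> pd.DataFrame:
        """Group participants by favorability (times 5 or 6 chosen out of 8)
        and count how often the near-midpoint responses 3 and 4 were picked."""
        ratings = responses[[f"q{i}" for i in range(1, 9)]]
        per_person = pd.DataFrame({
            "favorable": (ratings >= 5).sum(axis=1),  # 0..8 per participant
            "n3": (ratings == 3).sum(axis=1),
            "n4": (ratings == 4).sum(axis=1),
        })
        out = per_person.groupby("favorable")[["n3", "n4"]].sum()
        total = out["n3"] + out["n4"]
        out["pct_3"] = (100 * out["n3"] / total).round(0)  # NaN where total is 0
        out["pct_4"] = (100 * out["n4"] / total).round(0)
        return out

Applied to data shaped this way, the returned table corresponds row for row to Table 1 below.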

Table 1 shows the results of this analysis of the 214 activities and 5,902 participants. Column 2 shows the number of times a participant selected a 5 or 6 response on the 8 questions: a "0" indicates that the participant did not reply favorably to any of the 8 questions, while a "7" indicates very favorable replies to 7 of the 8 questions asked on the questionnaire. Arranging the data this way lets us see the responses to alternatives "3" and "4" for participants who formed strong favorable or not-so-strong favorable opinions about an activity. Column 3 shows the number of times a response of "3" was chosen and Column 4 the number of times a response of "4" was chosen. The trend is clear: as a participant's opinion of an activity becomes more favorable, he/she is more and more likely to select 4 over 3, even where the total number of 3 and 4 responses remains approximately the same. The percentage of times 4 is chosen rises from 67% to 97% as we move from the least favorable category toward the "strong favorable" category. Conversely, as negative opinion about an activity grows, a respondent is more and more likely to choose 3: the percentage choosing 3 climbs steadily from 3% to 33% as the participant's opinion becomes less favorable.

Number of      Times 5 or 6     Times 3      Times 4      Total     % choosing   % choosing
participants   chosen (of 8)    chosen (A)   chosen (B)   (A+B)     3            4
   355              0              716          1472       2188      33%          67%
   307              1              401          1428       1829      22%          78%
   496              2              471          2208       2679      18%          82%
   646              3              403          2546       2949      14%          86%
   728              4              290          2319       2609      11%          89%
   908              5              207          2277       2484       8%          92%
   815              6               73          1417       1490       5%          95%
   707              7               17           596        613       3%          97%
   940              8                0             0          0        -            -

Table 1. Responses of 5,902 participants over five years (1993-1997) to the eight questions of the questionnaire in Appendix A

These results clearly show that the overall opinion formed about a seminar influences a participant's responses to EDI evaluation questions in the neutral range. Since there was no neutral midpoint, he/she was forced to choose either "3" or "4": a "3" was chosen more often when the overall opinion was not strongly favorable and a "4" more often when it was strongly favorable, making good performance look better and bad performance look worse. There is clearly a need to offer participants a "neutral" alternative, which is precisely the objective of a 5-point scale. This was supported further when we examined responses on the new 5-point scale. In one activity using the 5-point scale⁵, about 31% of the 29 participants responded "neutral" to one question, and more than 20% of responses were "neutral" on 5 of the 20 questions. There is thus a clear need for a midpoint option in EDI evaluation questionnaire responses.

⁵ Activity code 1F98FS4C


Issues of comparison and conversion

Results of our analysis show clear evidence that bias existed as a result of EDI's use of the 6-point scale, and the form of that bias is consistent with the direction specified by past research: it tends to exaggerate both positive and negative responses. With this established, there remains the question of how to compare and convert results between the 5-point and 6-point scales.

Experiment with different scales

The EDI Evaluation Unit conducted an experiment to observe directly how changing scales affects responses, and how this information could be used to convert scores between the two scales. Two EDI-sponsored workshops on "Partnership for Poverty Reduction" were used: one held in San Salvador, El Salvador, in November 1997 and one in Kingston, Jamaica, in January 1998. At the end of each workshop, participants were randomly and unobtrusively divided into two groups; one group filled out a questionnaire using a 6-point scale and the other a 5-point scale. The objective of the experiment was to produce two sets of results whose differences were caused only by the use of different scales. Results from the experiment are shown in Appendix C. These results were used to make inferences about the underlying populations of mean scores on 5-point and 6-point scales for activities of the same quality.

Both scales start at "1," which denotes the worst performance. The numbers increase with the performance level, and the last number (5 or 6) represents the best performance. Two important facts must be considered. First, where a 5-point scale has a specific midpoint of 3 (denoting a neutral/average response), the 6-point scale has no such provision: a participant who wants to respond neutrally is forced to choose either 3 or 4, neither of which is the true midpoint of the scale (3.5). Second, the scales try to capture a continuum on which the lowest position shows the worst performance, improving steadily as the scale increases. This is shown graphically in Figure 2.

[Figure: the points 1-5 and 1-6 of the two scales aligned on a common axis, with 3.5 marking the midpoint of the 6-point scale.]

Figure 2. The same performance measured on two different scales


Results from the experiment were used to establish two points with regard to the 5-point and 6-point scales. The first point was to determine if the mean scores using these two different scales were normally distributed. If it were found that these mean scores were normally distributed for both scales, the second point was to determine if there was a difference between the variances and mean values of the scales. If it could be established that (i) the mean scores were normally distributed, with equal variances, and (ii) the only difference between mean scores from the two scales were due to the difference between their population means, then the basis for a conversion would be to adjust for the simple difference between these population means. The results from this experiment are presented below.

Test for Normality

Appendix C gives the mean values of responses to the 8 questions asked in both seminars, on both the 5-point and 6-point scales. By the Central Limit Theorem (CLT), the mean scores are approximately normally distributed irrespective of the distribution of the individual responses. We confirmed this using the Kolmogorov-Smirnov test with the Lilliefors correction, which gave a significance of 20% for the mean scores on both scales; that is, the null hypothesis of normality cannot be rejected (p = 0.2). The approximate normality of the mean-score distributions gives us a way to compare the scores on the two scales, and it justifies the tests comparing means and variances that follow.
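As an illustration, the normality checks reported above can be reproduced outside SPSS. The sketch below applies the Lilliefors-corrected Kolmogorov-Smirnov test (implemented in statsmodels) and, for comparison, scipy's Shapiro-Wilk test to the 16 five-point mean scores from Appendix C; the six-point means can be checked the same way.

    from scipy.stats import shapiro
    from statsmodels.stats.diagnostic import lilliefors

    # Per-question mean scores on the 5-point scale, both workshops (Appendix C)
    means_5pt = [4.50, 4.00, 4.11, 4.00, 3.79, 4.32, 4.37, 3.78,
                 3.91, 3.83, 3.42, 3.58, 3.50, 4.58, 4.17, 4.09]

    ks_stat, ks_p = lilliefors(means_5pt, dist="norm")
    sw_stat, sw_p = shapiro(means_5pt)
    print(f"Lilliefors KS: D = {ks_stat:.3f}, p = {ks_p:.3f}")  # normality not rejected for large p
    print(f"Shapiro-Wilk:  W = {sw_stat:.3f}, p = {sw_p:.3f}")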

Test for variances and means

Once we established that the mean scores from both the 5-point and 6-point scales were normally distributed, we tested whether the variances and means of the two groups differed significantly. It is also desirable to test for equality of variances before comparing means with, for example, Student's t-test. Results are shown in Box 1 and in Tables 2 and 3. The null hypothesis that the variances of responses on the two scales are the same cannot be rejected: the high significance level (p = 0.521) indicates that the variances are statistically indistinguishable. Thus, merely increasing or decreasing the scale by one point does not significantly change the variance of the mean scores.

After finding equal variances of mean scores on both scales, the next step was to compare the overall means. If the overall means differ but the variances are the same, we can derive a correction factor for comparing results obtained on different scales. The results are shown in Box 2 and in Tables 2 and 3.

The distributions of mean scores on the two scales are normal but not identical. The tests for the distribution parameters (see Box 1 and Box 2) clearly show that, statistically, the variances on the two scales are the same but the population means are different; in fact, the overall mean on the 6-point scale is statistically higher than the overall mean on the 5-point scale. Note that in this experiment, variables such as the quality of the activity and the background of participants and trainers were controlled. The data come from the same program, conducted by the same task manager, so quality is not a variable here; and because participants were randomly divided into two groups, there can be no systematic bias in either group. The only variable in the experiment is the scale used in the questionnaires.


Group Statistics

SCALE    N     Mean     Std. Deviation   Std. Error Mean
5        16    3.9969   0.3432           0.0858
6        16    4.6075   0.4043           0.1011

Table 2. SPSS output: group statistics for comparing means and variances

Independent Samples Test (MEAN)

                               Levene's Test for        t-test for Equality of Means
                               Equality of Variances
                               F        Sig.      t        df       Sig.         Mean         Std. Error   95% CI of the Difference
                                                                    (2-tailed)   Difference   Difference   Lower       Upper
Equal variances assumed        .422     .521      -4.606   30       .000         -.6106       .1326        -.8814      -.3399
Equal variances not assumed                       -4.606   29.230   .000         -.6106       .1326        -.8817      -.3396

Table 3. SPSS output: independent samples test

Results from the test for equal variances clearly show that the population variances are the same; the significance level, p = 0.521, is far above any conventional rejection threshold. The test for equal means yields a p-value of effectively zero, indicating a significant difference between the means of the two populations. Thus, even though the population variances are statistically the same, the population means are not: the mean on the 6-point scale is higher than the mean on the 5-point scale by 0.61. The population distributions are shown in Figure 3.

[Figure: two normal curves with means µ5 and µ6, separated by 0.61.]

Figure 3. Population distributions of mean scores on the 5-point and 6-point scales (0.61 is the difference found in the experiment)


TEST FOR VARIANCES OF TWO DISTRIBUTIONS

Let σ5 = standard deviation of mean scores on the 5-point scale, σ6 = standard deviation of mean scores on the 6-point scale, n5 and n6 = the number of observations on each scale, and α = the level of significance. The values in our case (population standard deviations approximated by sample standard deviations) are:

    σ5 = 0.3432146, σ6 = 0.4042689, n5 = 16, and n6 = 16.

For a significance level of 10% (α = 0.10), the critical value is F(15, 15, 0.05) = 2.40. The null hypothesis is that the two population variances are the same:

    H0: σ5² = σ6²    against the two-sided alternative    H1: σ5² ≠ σ6².

The decision rule is to reject H0 if σ6²/σ5² > F(15, 15, 0.05). But σ6²/σ5² = 1.3874, which is clearly less than 2.40, so the null hypothesis cannot be rejected at a significance level of 10%. In fact, the significance level in Levene's test for equality of variances is 52.1%, clearly showing that the variance does not change significantly when the scale is increased or decreased by one point.

Box 1: Test for equality of variances

TEST FOR POPULATION MEANS OF TWO DISTRIBUTIONS

Let X̄5 = overall sample mean of the mean scores on the 5-point scale and X̄6 = overall sample mean of the mean scores on the 6-point scale.

The result of the above test enables us to test the (in)equality of the two population means. From the observed sample variances, an estimate of the common population variance is

    s² = [(n5 − 1)σ5² + (n6 − 1)σ6²] / (n5 + n6 − 2) = 0.1406.

Now we test the null hypothesis that, all else equal, the mean score on a 6-point scale equals the mean score on a 5-point scale:

    H0: µ6 = µ5    against the two-sided alternative    H1: µ6 ≠ µ5,

where µ5 = population mean score on a 5-point scale and µ6 = population mean score on a 6-point scale. The test uses the Student-t distribution, as the number of observations per group is less than 30. The decision rule, for a significance level of 10%, is to reject the null hypothesis if

    |X̄5 − X̄6| / ( s · √[(n5 + n6) / (n5 · n6)] ) > t(30, 0.05),

which gives 4.605 > 1.697. Thus the null hypothesis that the two means are equal is clearly rejected at a significance level of 10%; in fact, the p-value of the test is almost zero. See the SPSS output for the Independent Samples Test in Table 2 and Table 3.

Box 2: Test for means of two distributions
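The two tests in Boxes 1 and 2 can also be reproduced with standard library routines. The sketch below runs them with scipy.stats on the 16 per-question means from Appendix C; center="mean" makes Levene's test match the SPSS variant, and equal_var=True gives the pooled-variance t-test used above.

    from scipy.stats import levene, ttest_ind

    # Per-question mean scores from Appendix C (both workshops)
    means_5pt = [4.50, 4.00, 4.11, 4.00, 3.79, 4.32, 4.37, 3.78,
                 3.91, 3.83, 3.42, 3.58, 3.50, 4.58, 4.17, 4.09]
    means_6pt = [5.07, 5.06, 4.82, 4.53, 4.47, 4.88, 5.06, 4.35,
                 4.70, 4.91, 4.64, 3.91, 3.82, 4.75, 4.75, 4.00]

    # Levene's test for equality of variances (SPSS reported p = 0.521)
    w_stat, p_var = levene(means_5pt, means_6pt, center="mean")

    # Pooled-variance t-test for equality of means (SPSS: t = -4.606, p ~ 0.000)
    t_stat, p_mean = ttest_ind(means_5pt, means_6pt, equal_var=True)

    print(f"Levene: W = {w_stat:.3f}, p = {p_var:.3f}")
    print(f"t-test: t = {t_stat:.3f}, p = {p_mean:.5f}")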


Having established these conditions, we can derive a fairly accurate method of converting mean scores from a 6-point scale to a 5-point scale and vice versa. The problem reduces to finding the best point estimates of the two population means, since, as Figure 3 shows, the mean scores are distributed with the same variance but different means; so we can simply compare the means of the two populations. For the same activity, let

µ5 = Population mean score on a 5-point scale; and µ6 = Population mean score on a 6-point scale.

We have shown that µ6 > µ5. Given best point estimates of µ5 and µ6, we can convert (for the purpose of comparison) a mean score on the 5-point scale to a mean score on the 6-point scale by adding (µ6 − µ5) to it. For example, if the best estimate of this difference is 0.61, then a score on the 5-point scale should be increased by 0.61 before comparing it to a score on the 6-point scale.

Verifying results beyond the experiment

Results from the experiment show that, on average, the mean score of responses on a 6-point scale is 0.61 higher than the score on a 5-point scale. But this result is based on only one activity and may not be the best estimate of the difference between the two population means; it indicates only that the actual mean difference lies somewhere near 0.61. We attempted to verify the result and obtain a better estimate of the difference using data from more activities. The ideal approach would be to use data from multiple activities measured on the two scales simultaneously, but this was not practical: apart from this study, no such experiments had been conducted, nor had activities been compared using different scales on the same set of questions.

To overcome this constraint, we randomly selected activities presented in 1998 that had been evaluated on the different scales. Random selection was used to ensure that the overall average quality of the seminars in the two categories was the same. We also selected similar questions from these activities to compare. Since the questions were not identical, we divided them into four broad categories (see Appendix D). Questions on administration/logistics were excluded, as were questions asked before the activity's start; only data from end-of-course evaluation questionnaires were used. Every end-of-activity questionnaire contained questions about the relevancy of the activity, the quality of its content, its usefulness in improving knowledge, its success in meeting the activity's objectives, and whether the activity had been a worthwhile use of participants' time. We randomly selected 20 responses to these questions from different activities which were themselves selected randomly; a sketch of this sampling step follows below. A sample size of 20 was chosen to (i) keep the sample size constant and (ii) keep the sampling distribution of the mean approximately normal; moreover, most EDI activities have 20 or more participants. Using these similar questions and 20 responses, we calculated scores for the various activities evaluated on 5-point and 6-point scales (see Appendix D for details). The normality tests showed that sample means on both scales were normally distributed, with a Lilliefors significance of 20% (see Appendix E). We then compared the variances and means of these responses. Results from these comparisons are shown in Table 4 and Table 5.
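A minimal sketch of the sampling step, assuming a hypothetical long-format DataFrame `answers` with columns activity, category, and response (the column names are illustrative; the paper does not specify EDI's data layout):

    import pandas as pd

    def sampled_mean(answers: pd.DataFrame, activity: str, category: str,
                     n: int = 20, seed: int = 0) -> float:
        """Randomly draw a fixed-size sample of responses for one activity and
        question category, and return the mean score."""
        pool = answers.query("activity == @activity and category == @category")
        return pool["response"].sample(n=n, random_state=seed).mean()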


Group Statistics

SCALE    N     Mean     Std. Deviation   Std. Error Mean
5        20    4.2150   0.3990           0.0892
6        20    4.9950   0.2842           0.0636

Table 4. SPSS output: group statistics for comparing means and variances

Independent Samples Test (MEAN)

                               Levene's Test for        t-test for Equality of Means
                               Equality of Variances
                               F        Sig.      t        df       Sig.         Mean         Std. Error   95% CI of the Difference
                                                                    (2-tailed)   Difference   Difference   Lower        Upper
Equal variances assumed        1.881    .178      -7.120   38       .000         -.7800       .1095        -1.0018      -.5582
Equal variances not assumed                       -7.120   34.330   .000         -.7800       .1095        -1.0025      -.5575

Table 5. SPSS output: independent samples test

These results are consistent with our earlier findings from the experimental data. The variances are equal, as shown by the high significance level of Levene's test (p = 0.178), and the means are clearly different. The results show that the mean response to questions in the four categories (relevancy, course content/usefulness, meeting of objectives, and worthwhile use of time) is 0.78 higher on a 6-point scale than on a 5-point scale measuring the same activity. For example, if the mean score for a future activity on a 5-point scale is 4.5, the comparable figure on a 6-point scale is 4.5 + 0.78 = 5.28.
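The conversion rule itself is a single additive shift. A minimal helper using the 0.78 benchmark derived above:

    MEAN_OFFSET = 0.78  # best point estimate of (mu_6 - mu_5) for the four question categories

    def to_six_point(mean_5pt: float, offset: float = MEAN_OFFSET) -> float:
        """Shift a 5-point mean score for comparison with historical 6-point scores."""
        return mean_5pt + offset

    print(to_six_point(4.5))  # 5.28, matching the example in the text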

Conclusions

Conversion to a 5-point scale seems appropriate for EDI. The scale allows participants to select a valid "neutral/average" response when rating an activity and eliminates the response bias present with an even-point scale. A 6-point scale, having no neutral/average midpoint, forces participants to choose either 3 or 4, which can distort the overall rating of an activity by making good results better and bad results worse; this appeared to be the case in EDI when the 6-point scale was in use. The midpoint or neutral response also appears to be a valid choice when rating EDI activities: when participants are given a neutral/average midpoint, a considerable number (up to 30% in some activities) select it.

Our study also found that when the same activity is measured on the two scales, the mean scores on the 6-point scale will be greater than those on the 5-point scale, but the variance will not differ significantly when the scale is decreased or increased by one point (this may not hold if the scale changes by more than one point). These results provide the basis for converting and comparing ratings on a 5-point scale to those on a 6-point scale by adjusting the means accordingly. We have attempted to find the best point estimate of the mean difference between scores on the two scales, for activities of the same quality and for questions falling into four categories: relevancy, course content/usefulness, meeting of objectives, and worthwhile use of time. This difference, 0.78, can be used as a benchmark for converting scores obtained on the 5-point scale for comparison with earlier scores obtained on the 6-point scale. For example, if the mean score on a 5-point scale for a question about a future activity is 4.5, the comparable figure on a 6-point scale is 4.5 + 0.78 = 5.28. Task Managers now have a method for comparing scores from past activities evaluated on the 6-point scale with current activities evaluated on the 5-point scale.


References

Cox, Eli P., III (1980). "The Optimal Number of Response Alternatives for a Scale: A Review." Journal of Marketing Research, Vol. XVII (November 1980), 407-422.

DeVellis, Robert F. (1991). Scale Development: Theory and Applications. Applied Social Research Methods Series, Vol. 26. SAGE Publications.

Fowler, Floyd J. (1995). Improving Survey Questions: Design and Evaluation. Applied Social Research Methods Series, Vol. 38. SAGE Publications.

Kirkpatrick, Donald L. (1994). Evaluating Training Programs: The Four Levels. San Francisco: Berrett-Koehler Publishers.

Newbold, Paul (1994). Statistics for Business & Economics. NJ: Prentice-Hall, Inc.

Oxenham, John (1997). End-of-Activity Evaluations: A Three-Year Retrospective. Office Memorandum of The World Bank, June 9, 1997.

Schuman, Howard & Presser, Stanley (1981). Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording, and Context. New York: Academic Press, 161-177.

Schwarz, Norbert, Knauper, Barbel, Hippler, Hans-J., Noelle-Neumann, Elisabeth & Clark, Leslie (1991). "Rating Scales: Numeric Values May Change the Meaning of Scale Labels." Public Opinion Quarterly, 1991, 499-688.


APPENDIX C

Results of the experiment

                    Activity 4N97CA5C                        Activity 4J98CA5C
              Five-Point Scale    Six-Point Scale      Five-Point Scale    Six-Point Scale
Question      Mean   Responses    Mean   Responses     Mean   Responses    Mean   Responses
1             4.50      14        5.07      14         3.91      11        4.70      10
2a            4.00      19        5.06      17         3.83      12        4.91      11
2b            4.11      19        4.82      17         3.42      12        4.64      11
2c            4.00      19        4.53      17         3.58      12        3.91      11
2d            3.79      19        4.47      17         3.50      12        3.82      11
3             4.32      19        4.88      17         4.58      12        4.75      12
4             4.37      19        5.06      17         4.17      12        4.75      12
5             3.78      18        4.35      17         4.09      11        4.00      12

The following questions were asked:
1. To what degree do you feel we achieved our objective?
2a. Was the workshop personally useful in providing better information?
2b. Was the workshop personally useful in providing new or expanded concepts?
2c. Was the workshop personally useful in revealing a wider range of policy options?
2d. Was the workshop personally useful in enabling you better to assess policy alternatives?
3. To what degree has this seminar been relevant to your current official functions?
4. To what degree has this seminar been a worthwhile use of your time?
5. To what extent did the seminar materials contribute to the effectiveness of the seminar?


APPENDIX D

Mean scores for different categories of questions asked on two different scales

Activity 4F98PF1C (5-point scale)
  Relevancy: To what degree did the workshop focus on issues important to you? (4.65)
  Quality of contents/usefulness: To what degree are the materials useful to you? (4.6); To what degree were the presentations useful to you? (4.25)
  Meeting of objectives: To what degree do you feel the workshop achieved its objectives? (4.55)
  Worthwhile use of time: To what degree was the workshop a worthwhile use of your time? (4.7)

Activity 1F98FS4C (5-point scale)
  Relevancy: What I learned is relevant to my daily work. (4.65)
  Quality of contents/usefulness: The workshop content was well prepared. (4); I know much more about investigative journalism now. (4.05)
  Meeting of objectives: The workshop achieved its objectives. (4.2)
  Worthwhile use of time: The workshop was a worthwhile use of my time. (4.55)

Activity 1R98JA3C (5-point scale)
  Relevancy: To what extent was the course relevant to your current work or functions? (4.4)
  Quality of contents/usefulness: To what extent were the materials constructive? (4.2); To what extent did the course help you better assess the consequences of different policy alternatives? (3.35)
  Meeting of objectives: To what extent did the course achieve the stated objectives? (3.5)
  Worthwhile use of time: To what extent was the course a worthwhile use of your time? (3.95)

Activity 1R98KE5F (5-point scale)
  Relevancy: Was the consultation relevant to your country's needs? (4.45)
  Quality of contents/usefulness: Did the consultation treat the issues in sufficient depth? (3.55); Did the consultation include relevant and useful presentations? (4.25)
  Meeting of objectives: Did the consultation achieve the objectives you had in mind? (4)
  Worthwhile use of time: Was the consultation a worthwhile use of your time? (4.45)


Mean scores for different categories of questions asked on two different scales (continued)

Activity 1R98PE3C (6-point scale)
  Relevancy: Has the seminar been relevant to your official function? (5)
  Quality of contents/usefulness: Did the seminar focus on the most important issues? (5); Did the seminar enable you to be better informed? (4.75)
  Meeting of objectives: Did we achieve our objectives? (4.75)
  Worthwhile use of time: Has the seminar been a worthwhile use of your time? (4.95)

Activity 4R98EB5C (6-point scale)
  Relevancy: Was the seminar relevant to your current work or functions? (4.75)
  Quality of contents/usefulness: Did it focus on what you most needed to address to improve project design and implementation? (4.4); Do you feel better equipped to design and implement health sector projects? (4.6)
  Meeting of objectives: Did the seminar achieve its stated objectives? (4.8)
  Worthwhile use of time: Was the seminar a worthwhile use of your time? (5.25)

Activity 7J98AE4C (6-point scale)
  Relevancy: To what extent was the first week of this course relevant to your current work or functions? (4.95)
  Quality of contents/usefulness: To what extent was the first week of this course relevant to your country's needs? (5.1); To what extent did the first week of this course help you clarify the next steps to undertake? (4.8)
  Meeting of objectives: To what extent did the first week of this course help you improve your regulatory skills? (4.95)
  Worthwhile use of time: To what extent was the first week of this course a worthwhile use of your time? (5.35)

Activity 298RFI3C (6-point scale)
  Relevancy: To what degree has this seminar been relevant to your official functions? (5.35)
  Quality of contents/usefulness: To what extent did the seminar materials contribute to the effectiveness of the seminar? (5.5); Was the seminar personally useful in providing better information? (5.35)
  Meeting of objectives: To what degree do you feel we achieved our objectives? (5.1)
  Worthwhile use of time: To what degree has the seminar been a worthwhile use of your time? (5.2)


APPENDIX E

Tests of Normality

         Kolmogorov-Smirnov(a)            Shapiro-Wilk
         Statistic   df   Sig.            Statistic   df   Sig.
MEAN1    .135        20   .200*           .908        20   .062

* This is a lower bound of the true significance.
a. Lilliefors significance correction.

Table 1. Test of normality for mean scores using a five-point scale

[Figure: Normal Q-Q plot of MEAN1; observed values (3.2 to 5.0) plotted against expected normal quantiles (-2.0 to 2.0).]

Figure 1. Normal Q-Q plot of mean scores using a five-point scale


Tests of Normality

         Kolmogorov-Smirnov(a)            Shapiro-Wilk
         Statistic   df   Sig.            Statistic   df   Sig.
MEAN2    .104        20   .200*           .973        20   .785

* This is a lower bound of the true significance.
a. Lilliefors significance correction.

Table 2. Test of normality for mean scores using a six-point scale

[Figure: Normal Q-Q plot of MEAN2; observed values (4.2 to 5.6) plotted against expected normal quantiles (-2.0 to 2.0).]

Figure 2. Normal Q-Q plot of mean scores using a six-point scale