
Higher Education 24, 453-463, 1992. © 1992 Kluwer Academic Publishers. Printed in the Netherlands.

Student evaluations of teaching effectiveness: a Nigerian investigation

DAVID WATKINS¹ & ADEBOWALE AKANDE²
¹Department of Education, University of Hong Kong; ²Department of Psychology, Obafemi Awolowo University, Nigeria

Abstract. An investigation is reported which tests the applicability of two American instruments designed to assess tertiary students' evaluations of teaching effectiveness with 158 Nigerian undergraduates. The scales were found to have generally high internal consistency reliability coefficients, most of the items were seen to be appropriate, and every item was considered of importance by at least some of the students. In addition, all but the Workload/Difficulty items clearly differentiated between 'good' and 'poor' lecturers. Factor analysis found a strong main factor of teaching effectiveness plus a minor factor referring to course workload and difficulty. Further analysis generally supported the convergent and discriminant validity of those scales hypothesized to measure similar or dissimilar components of effective teaching. However, this analysis supported the factor analytic results as more overlap between aspects of teaching skill and enthusiasm was found than has been evident in Western studies. Thus there must be doubt about the cross-cultural validity of a multidimensional model of teaching effectiveness.

Key words: student evaluation of tertiary teaching; Nigerian students

Introduction

The literature on the value of student ratings of tertiary teaching is a massive one which provides evidence of considerable hostility and suspicion on the part of some U.S. faculty. While the early American evidence tended to support the reliability but cast doubt on the validity of student ratings, several recent reviews of research in this area are supportive of their value (Dunkin and Barnes, 1986; Marsh, 1987). Moreover, it appears that much of the inconsistency in research findings is due to inadequate measuring instruments (Frey, 1982; Marsh, 1987). In particular, it appears now that teaching effectiveness is multifaceted and that any instrument which focuses on a single overall score is likely to be inadequate. For example, a lecturer who is well organized may not be the best of oral communicators. Failure to separate these different components of effective teaching has led to conflicting research findings as well as inadequate information for diagnostic or decision-making purposes (e.g., some aspects of poor teaching may be subject to improvement through training, others may not).

The purpose of this research was to assess the reliability and validity of two American-developed measuring instruments - the Students' Evaluations of Educational Quality (SEEQ) developed by Marsh (1981a) and the Endeavor Instructional Rating Form devised by Frey (1978) - for use with Nigerian students. Although these instruments have been shown to be applicable in Spain (Marsh, Touron, and Wheeler, 1985), Australia (Marsh, 1981a; Marsh and Roche, 1991), and New Zealand (Watkins, Marsh, and Young, 1987), there is still doubt about their cross-cultural utility (Clarkson, 1984; Watkins and Thomas, 1991).

The SEEQ Instrument

The SEEQ (Students' Evaluation of Educational Quality) and the research that led to its development have recently been summarized (Marsh, 1982a, 1983, 1984). Numerous factor analyses have identified the nine SEEQ factors in responses from different populations of students (e.g., Marsh, 1982a, 1982b, 1983), and also in lecturer self-evaluations of their own teaching effectiveness when they were asked to complete the same instrument as their students (Marsh, 1982b; Marsh and Hocevar, 1983). The nine SEEQ factors are Learning/Value, Instructor Enthusiasm, Organization/Clarity, Group Interaction, Individual Rapport, Breadth of Coverage, Examinations/Grading, Assignments/Readings, and Workload/Difficulty.

Marsh (1982b, 1987) argued that students' evaluations, like the effective teaching they are designed to reflect, should be multidimensional. He supported this common-sense assertion with empirical results and also demonstrated that the failure to recognize this multidimensionality has led to confusion and misinterpretation in student-evaluation research.

The reliability of responses to the SEEQ, based upon correlations among items designed to measure the same factor and correlations among responses by students in the same course, is consistently high (Marsh, 1982a). To test the long-term stability of responses to the SEEQ, students from 100 classes were asked to re-evaluate teaching effectiveness several years after their graduation from their university program, and their retrospective evaluations correlated 0.83 with those the same students had given at the end of each class (Overall and Marsh, 1980). Ratings on the SEEQ have successfully been validated against the ratings of former students (Marsh, 1977), student achievement as measured by an objective examination in multisection courses (Marsh, Fleiner, and Thomas, 1975; Marsh and Overall, 1980), lecturers' evaluations of their own teaching effectiveness (Marsh, 1982b; Marsh, Overall, and Kesler, 1979), and affective course consequences such as applications of course materials and plans to pursue the subject further (Marsh and Overall, 1980). None of a set of 16 potential sources of bias (e.g., class size, expected grade, prior subject interest) could account for more than 5% of the variance in SEEQ ratings (Marsh, 1983), and many of the relationships were inconsistent with a simple bias explanation (e.g., harder, more difficult courses were evaluated more favorably).

The Endeavor Instrument

The Endeavor Instrument measures seven components of effective teaching - components that have been identified through the use of factor analysis in different settings (Frey, Leonard, and Beatty, 1975). The seven factors are Presentation Clarity, Workload, Personal Attention, Class Discussions, Organization-Planning, Grading, and Student Accomplishment. In validating the ratings obtained from this instrument, Frey has shown that the ratings on the Endeavor are correlated with student learning (Frey, 1973, 1987; Frey, Leonard, and Beatty, 1975). In these studies, as well as in similar research described below, student ratings are collected in large multisection courses (i.e., courses in which the large group of students is divided into smaller groups or sections and all instruction is delivered separately to each section). Each section of students in the same course is taught throughout by a different lecturer, but each is taught according to a similar course outline, has similar goals and objectives and, most important, is tested with the same standardized final examination at the end of the course (for further discussion see Cohen, 1981; Marsh, 1982a, 1984; Marsh and Overall, 1980). Frey concluded that those sections of students that rate teaching to be most effective are also the sections that learn the most as measured by performance on the final examination, thus supporting the validity of ratings on the Endeavor Instrument.

Inspection of the item content, supported by previous research (Marsh et al., 1985), revealed considerable overlap in the dimensions measured by these two instruments (see Table 1). One SEEQ factor, Organization/Clarity, seems to have been divided into two factors for the Endeavor instrument (Presentation Clarity and Organization/Planning). On the basis of earlier studies, 32 of the 34 SEEQ items (M1-M32) and 21 Endeavor items were classified into one of 16 dimensions (see Table 2). Items A1-A6 were added to try to better identify the factors involved.

Table 1. Pairs of corresponding factors in SEEQ and Endeavor

SEEQ factors               Endeavor factors
1. Learning/Value          1. Student Accomplishments
2. Group Interaction       2. Class Discussion
3. Individual Rapport      3. Personal Attention
4. Examinations/Grading    4. Grading
5. Workload/Difficulty     5. Workload
6. Organization/Clarity    6. Presentation Clarity
                           7. Organization/Planning

The Nigerian Context

Nigeria is a vast and diverse country with a population of some 110 million people representing some 250 cultural and linguistic groups. As in many third world countries, education has been seen as an important means of national development. The rate of growth of the Nigerian university system has been described as "phenomenal" (Adesola, 1991). In 1962/1963 there were only four universities with a total of 3,646 students, whereas by 1987/1988 enrolment had risen to 160,767 students and, with the opening of new universities, there were expected to be 29 universities by November 1990 (Adesola, 1991).


Such a massive expansion of higher education has stretched the pool of available academic personnel to the limit. There is a need therefore to monitor the performance of academic staff if a high quality of education is to be attained. These problems have been exacerbated by the decrease in expatriate lecturing staff and political unrest amongst the students which has led to frequent campus closures. Selection and promotion of academic staff has also been criticised as being influenced by political pressure, tribal allegiances, and bribery. So there is a clear need for the development of instruments which can be shown to provide objective evidence of teaching effectiveness in Nigeria.

The Present Study

The following aspects of the SEEQ and the Endeavor instruments were assessed here for Nigerian students:

(a) The internal consistency reliabilities of the scales;
(b) The ability of items to discriminate between "good" and "poor" lecturers;
(c) The perceived importance and appropriateness of the items;
(d) The validity of the underlying factor models;
(e) The convergent and discriminant validity of the scales;
(f) The relationship of scale scores to factors such as the size of the class, the grade achieved in the course, whether the course was a student's major or minor, and the age of the lecturer concerned.

Method

Subjects

The evaluation survey was administered to a total of 158 senior undergraduate social science students currently enrolled at the Obafemi Awolowo University in Ile-Ife, Nigeria. The subjects were guaranteed the confidentiality of their responses and were not asked to identify themselves in any way.

Statistical Analysis

Each item was initially tested in terms of (a) its ability to discriminate among the good and poor instructors; (b) its appropriateness (i.e., the lack of "not appropriate" responses); and (c) its importance (i.e., the number of "most important" nominations). Items were categorized as representing 10 dimensions on an a priori basis (support for these dimensions was found in the Australian study described by Marsh, 1981a, and confirmed in the New Zealand investigation of Watkins et al., 1987), and a factor analysis of responses to all items was used to test the ability of the responses to differentiate among these hypothesized components of teaching effectiveness underlying the SEEQ and Endeavor items.

All the statistical analyses were conducted with the commercially available SPSS-X statistical package (Hull and Nie, 1984). Principal axis factor analyses were performed with iterated communality estimates, a Kaiser normalization, and an oblique rotation, also using the SPSS-X procedure. The Scree test criterion was used as a guide to the number of factors to be rotated but a number of solutions were examined to find the best fit to the hypothesised factor structure. For purposes of this study, blank and "not appropriate" responses were considered to be missing values. Each of the factor analyses was performed on correlation matrices constructed with "pair-wise deletion" for missing data.
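The Scree criterion mentioned above works by inspecting how quickly the eigenvalues of the item correlation matrix fall away. As a rough illustration only (the ratings below are simulated, not the study's data), the eigenvalues can be obtained as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 158 students rating 6 items that share one strong common factor,
# mimicking the dominant general factor reported in this study.
common = rng.normal(size=(158, 1))
ratings = common + 0.5 * rng.normal(size=(158, 6))

corr = np.corrcoef(ratings, rowvar=False)          # 6 x 6 item correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Scree criterion: look for the "elbow" where the eigenvalues level off;
# here one large eigenvalue dominates and the rest are small.
print(np.round(eigenvalues, 2))
```

With a single strong common factor, the first eigenvalue is close to the number of items and the remainder are near zero, which is the pattern the Scree plot makes visible.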

Results

Scale Reliabilities

From Table 2 it can be seen that the estimated coefficients of reliability, α, ranged from 0.68 to 0.93 (the median α's were 0.92 and 0.91 for the SEEQ and Endeavor, respectively).
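The internal consistency coefficients reported here are Cronbach's α, which can be computed directly from an items-by-respondents score matrix. A minimal sketch, using invented ratings rather than the study's data:

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    items = np.asarray(item_scores, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the scale total
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 9-point ratings from four students on a three-item scale.
ratings = [[7, 6, 7],
           [3, 4, 3],
           [5, 5, 6],
           [2, 3, 2]]
print(round(cronbach_alpha(ratings), 2))  # → 0.96
```

Because the items move together across respondents, the scale-total variance greatly exceeds the sum of the item variances and α approaches 1, as with the SEEQ and Endeavor scales above.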

Evaluation of Items

It can be seen from Table 2 that all items, except for those on the Workload/Difficulty scales, discriminated well between "good" and "poor" lecturers. The discrimination is particularly clear for items on the Lecturer Enthusiasm, Organization/Clarity, Learning/Value, and Personal Attention dimensions. It is not surprising that the workload/difficulty-related items do not differentiate between "good" and "poor" lecturers - after all, a lecturer is surely "poor" if he or she provides material which is much too easy or much too difficult (similarly, far too great or too little quantity of work). None of the items were judged to be inappropriate by more than 32 (10%) of the students. Thus it seems that the items were perceived as relevant to the evaluation of the teaching received by the majority of the Nigerian students.

After completing the evaluation forms for a lecturer the students were asked to select as many as five of the questionnaire items that they felt were most important in describing their overall learning experience in the classes of both the "good" and "poor" lecturers. Each item received at least three "most important" nominations. Items assessing the degree to which the course increased the students' interest and knowledge of the subject matter; whether the lecturing style held one's interest; and the degree to which the presentation was clear and the grading fair were seen as being the most important.


Table 2. Paraphrased items and hypothesized factors for Marsh's (M) SEEQ and Frey's (F) Endeavor instruments, plus obtained α scale reliability estimates, mean item responses for each 'lecturer' type, and numbers of 'most important' nominations and 'not appropriate' responses

Scales and items                                   Mean responses for lecturers chosen as:
                                                         Good        Poor

1. Group interaction (SEEQ) α = 0.93
M13 Encouraged class discussion                          6.86        3.52
M14 Students invited to share knowledge/ideas            6.58        3.01
M15 Encouraged questions and gave answers                7.20        2.98
M16 Encouraged questioning of teacher's ideas            6.80        3.09

2. Learning (SEEQ) α = 0.93
M1 Course challenging and stimulating                    7.43        3.16
M2 Learned something valuable                            7.66        3.92
M3 Class increased subject interest                      7.28        2.99
M4 Learned and understood subject matter                 7.12        3.67

3. Workload/difficulty (SEEQ) α = 0.70
M32 Course difficulty (easy-hard)                        5.14        5.67
M33 Course workload (light-heavy)                        5.93        5.23
M34 Course pace (slow-fast)                              6.26        4.10

4. Examinations/grading (SEEQ) α = 0.86
M25 Examination feedback valuable                        6.56        3.71
M26 Evaluation methods fair/appropriate                  7.03        3.54
M27 Tested course content as emphasized                  6.97        3.74

5. Individual rapport (SEEQ) α = 0.93
M17 Lecturer friendly to individual students             7.00        3.49
M18 Lecturer welcomed students seeking advice            7.28        2.79
M19 Lecturer interested in individual students           6.81        2.76
M20 Lecturer accessible to individual students           6.83        3.04

6. Organization/clarity (SEEQ) α = 0.93
M9 Lecturer explanations clear                           7.40        3.10
M10 Course materials well explained and prepared         7.31        3.11
M11 Course objectives stated and pursued                 6.99        3.32
M12 Lectures facilitated taking notes                    7.07        3.17

7. Lecturer enthusiasm (SEEQ) α = 0.92
M5 Enthusiastic about teaching                           7.26        3.42
M6 Dynamic and energetic                                 7.54        3.23
M7 Enhanced presentation with humour                     6.85        3.05
M8 Teaching style held your interest                     7.27        2.84

8. Breadth of coverage (SEEQ) α = 0.88
M21 Contrasted various theories                          6.42        3.84
M22 Gave background of ideas/concepts                    6.89        3.33
M23 Gave different points of view                        7.07        3.13
M24 Discussed current developments                       7.08        3.47

9. Readings (SEEQ) α = 0.68
M28 Readings/texts were valuable                         6.29        4.01
M29 They contributed to understanding                    6.70        3.76

Overall rating items (SEEQ)
M30 Overall course rating                                7.38        3.67
M31 Overall lecturer rating                              7.81        3.24

[The 'most important' nomination and 'not appropriate' response counts for the SEEQ items are not recoverable from the scanned table.]

Table 2 - continued

Scales and items (columns: mean response for "good" lecturer, mean response for "poor" lecturer, number of 'most important' nominations, number of 'not appropriate' responses)

10. Class discussion (Endeavor) α = 0.92
F10 Class discussion was welcome                         6.86   3.40   22   11
F11 Students encouraged to participate                   6.73   3.22   13    9
F12 Encouraged students to express ideas                 7.03   2.97   19    3

11. Student accomplishments (Endeavor) α = 0.92
F19 Understood the advanced material                     6.95   3.13    9    8
F20 Course improved ability to analyze issues            7.48   3.58   17    7
F21 Course increased knowledge and competence            7.49   3.71   52    9

12. Workload (Endeavor) α = 0.79
F4 Students had to work hard                             7.19   5.60   16    7
F5 Course required a lot of work                         6.51   4.90    8    4
F6 Course workload was heavy                             6.14   4.65   17    2

13. Grading/examinations (Endeavor) α = 0.91
F16 Grading fair and impartial                           7.27   3.74   51   11
F17 Grading reflected student performance                7.09   3.65   18    9
F18 Grading indicative of accomplishments                6.85   3.60    8   11

14. Personal attention (Endeavor) α = 0.91
F7 Lecturer listened and was willing to help             7.23   3.03   18    6
F8 Students able to get personal attention               6.63   3.08   21   15
F9 Lecturer concerned about student difficulties         6.88   2.95   24    7

15. Presentation clarity (Endeavor) α = 0.91
F1 Presentation clarified materials                      7.18   3.09   26    6
F2 Presented clearly and summarized                      7.39   3.31   23    3
F3 Made good use of examples                             7.44   3.55   22    5

16. Planning/objectives (Endeavor) α = 0.87
F13 Presentations planned in advance                     7.28   3.75    7   17
F14 Provided detailed course schedule                    7.22   3.74    8    6
F15 Class activities orderly scheduled                   6.91   3.24    8    5

Factor Analysis

Factor analysis of the 62 combined inventory items indicated that there was one large underlying factor with an eigenvalue of 38.8. The next seven eigenvalues ranged from 2.93 to 0.75. The Scree test supported either a one- or two-factor solution. Nevertheless all solutions up to 8 factors were examined. Unfortunately the 4-8 factor solutions failed to converge in 25 iterations and the 3-factor solution only showed two distinct factors. However, the two-factor solution was clearly interpretable. Factor II (4.0% of the variance) had 5 main loadings ranging from 0.49 to 0.70, all from items referring to course workload or difficulty. All the other items had loadings on Factor I (62.1% of the variance) ranging from 0.49 to 0.94 (median loading = 0.82). Thus the major factor combined evaluations of various facets of teaching effectiveness while the minor factor reflected course difficulty. There was only a moderate correlation, 0.31, between these factors.


Convergent and Discriminant Validity

The convergent and discriminant validity of the nine SEEQ and seven Endeavor scales were assessed in a modified multitrait-multimethod (MTMM) matrix, where the scales of teaching effectiveness are the multiple traits and the two different instruments correspond to the multiple methods (see Table 3).

Table 3. 'Multitrait-multimethod' matrix of correlations among SEEQ and Endeavor scales (n = 316 sets of ratings)

Scale                           1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
SEEQ scales
 1. Group interaction
 2. Learning/value            0.81
 3. Workload/difficulty       0.23 0.25
 4. Examinations/grading      0.77 0.78 0.27
 5. Individual rapport        0.86 0.81 0.22 0.78
 6. Organization/clarity      0.81 0.86 0.25 0.78 0.82
 7. Enthusiasm                0.83 0.87 0.23 0.77 0.87 0.87
 8. Breadth of coverage       0.76 0.78 0.33 0.75 0.79 0.79 0.80
 9. Assignments/readings      0.66 0.66 0.31 0.68 0.65 0.69 0.64 0.66
Endeavor scales
10. Class discussion          0.89* 0.78 0.19 0.77 0.84 0.77 0.79 0.75 0.65
11. Student accomplishments   0.79 0.91* 0.26 0.76 0.78 0.86 0.85 0.79 0.68 0.78
12. Workload                  0.44 0.50 0.53* 0.49 0.41 0.49 0.46 0.52 0.59 0.44 0.50
13. Grading                   0.75 0.76 0.26 0.79* 0.76 0.79 0.75 0.75 0.66 0.76 0.76 0.49
14. Personal attention        0.83 0.80 0.22 0.78 0.86* 0.79 0.82 0.75 0.66 0.84 0.80 0.46 0.77
15. Presentation clarity      0.79 0.84 0.25 0.78 0.82 0.89* 0.86 0.78 0.69 0.80 0.84 0.47 0.76 0.81
16. Organization/planning     0.75 0.78 0.34 0.78 0.79 0.85* 0.79 0.77 0.71 0.76 0.79 0.53 0.79 0.76 0.83

Note: Convergent validities are marked with an asterisk.

Convergent validity refers to the correlations between SEEQ and Endeavor scales that are hypothesized to measure the same construct (see Table 1), while discriminant validity refers to the distinctiveness of the different dimensions and provides a test of the multidimensionality of the ratings. With only minor modifications, the criteria developed by Campbell and Fiske (1959) can be applied to these data as follows:

1. Convergent validities, correlations between SEEQ and Endeavor scales that are hypothesized to match, should be substantial. Here convergent validities ranged from 0.53 to 0.91 (median 0.80).

2. One criterion of discriminant validity is that correlations between these matching scales should be higher than the correlations between non-matching SEEQ and Endeavor scales in the same row or column of the rectangular submatrix. This test is met by all but 1½ of the 96 comparisons (counting half when the correlations are equal) and thus this criterion is well satisfied.


3. Another criterion of discriminant validity is that each convergent validity should be higher than the correlations of the corresponding SEEQ and Endeavor scales with the other eight SEEQ and six Endeavor scales, respectively. This test is met for all but 4 of the 98 such comparisons and, once again, this criterion is generally satisfied. However, there is clearly considerable overlap amongst the Student Accomplishments, Presentation Clarity, Lecturer Enthusiasm, and Individual Rapport scales.

4. The pattern of correlations among SEEQ scales should be similar to the pattern of correlations among Endeavor scales (e.g., because the two SEEQ scales of Group Interaction and Individual Rapport are highly correlated, so should be the two corresponding Endeavor factors of Class Discussion and Personal Attention). Inspection of the correlations in Table 3 generally indicates a similarity in pattern of correlations.
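These criteria are mechanical enough to check programmatically. The sketch below applies the first two to a small hypothetical SEEQ-by-Endeavor block; the values are invented for illustration and are not those of Table 3:

```python
import numpy as np

# Hypothetical 3 x 3 block of an MTMM matrix: rows are SEEQ scales, columns
# are the matching Endeavor scales, and the diagonal holds the convergent
# validities (same trait, different instrument).
block = np.array([
    [0.89, 0.79, 0.44],   # e.g. Group interaction
    [0.78, 0.91, 0.50],   # e.g. Learning/value
    [0.19, 0.26, 0.53],   # e.g. Workload/difficulty
])

convergent = np.diag(block)

# Criterion 1: convergent validities should be substantial.
assert convergent.min() > 0.40

# Criterion 2 (discriminant validity): each convergent validity should exceed
# every other correlation in its own row and column of the block.
violations = 0
n = block.shape[0]
for i in range(n):
    for j in range(n):
        if i != j:
            if block[i, i] <= block[i, j]:   # comparison within row i
                violations += 1
            if block[j, j] <= block[i, j]:   # comparison within column j
                violations += 1
print("discriminant-validity violations:", violations)
```

In this toy block every convergent validity beats its row and column rivals, so no violations are counted; applying the same loops to a full 9 x 7 block reproduces the comparison counts described in the text.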

Variables Related to Ratings

The students were asked to supply the following information about the classes taught by the "good" and the "poor" lecturer: whether the class was in their major subject area; the approximate number of students in the class; the grade they received in the course; and the gender and approximate age of each teacher. The mean responses on these variables, which might distinguish between "good" and "poor" lecturers, are reported in Table 4. Given the exploratory nature and doubts about the generalizability of this section of the study, no statistical tests were performed. It would seem that there is a trend for students to rate as "good" lecturers who taught in their major area (48% of the "poor" ratings were in a minor as against 18% of the "good" ratings) and who taught courses in which these students obtained higher grades (mean grades of 3.44 and 5.99, where A+ = 1, in the "good" and "poor" lecturers' courses, respectively). Higher ratings were not associated either with larger class sizes or with the students' perceptions of their teachers' ages.

Table 4. Means of classroom aspects possibly related to ratings of lecturers (N = 158)

Item                                                  "Good" lecturer   "Poor" lecturer
Is this subject your major? (Yes = 1, No = 2)               1.18             1.48
Student's estimate of size of class                       185.94           187.58
Grade student obtained in course (A+ = 1 to E = 11)         3.44             5.99
Estimated age of lecturer in years                         40.35            39.24
Gender of lecturer (Male = 1, Female = 2)                   1.09             1.14


Conclusions

The findings of this study generally lend support to the reliability but throw doubt on the factor structure of the SEEQ and Endeavor instruments for use with Nigerian students.

The internal consistency estimates obtained were fairly impressive for such short scales. The scale items differentiated "good" and "poor" lecturers in the ways expected. That workload and difficulty items did not was also quite predictable (and supports U.S. findings that students do not simply give high ratings to easy courses where they don't have to do much work). The items were considered appropriate by most of the students while every item was chosen by at least a few as being most important. However, the item factor analysis did not support a multidimensional model of teaching effectiveness. A much stronger than expected general factor covering all evaluative ratings of supposedly different aspects of teaching performance was obtained. Similarly while the overall convergent and discriminant validity of the instruments was demonstrated by the modified MTMM analysis in most cases, there was clear evidence of overlap between some of the scales. The same finding resulted from an investigation with Indian students (Watkins and Thomas, 1991). Further research will be required to determine whether this is due to an artefact of a "halo" effect resulting from asking students to rate a "good" and a "poor" lecturer or to Nigerian and Indian students perhaps having a more global perception of teaching effectiveness than Western students. Qualitative investigations which explore student conceptions of teaching may well be needed to settle this issue.

These findings indicate that teaching effectiveness can be measured in a Nigerian setting, that evaluation instruments developed at American universities may well be reliable in Nigeria, but that the distinct components that underlie evaluations of teaching effectiveness at American universities may not be separable in Nigeria. Taken together with earlier findings in Australia, New Zealand, Spain, and India, strong support is provided for the cross-cultural reliability of these instruments, but doubt is cast on the cross-cultural validity of the underlying model of teaching effectiveness. Moreover, some of the difficulties of attempting to compare lecturers' teaching effectiveness in practice are suggested by the findings that student ratings of lecturers may be related to factors such as whether the course is in the student's major area and the grade the student achieves in that course. To what extent such factors represent genuine indicators of teaching effectiveness rather than biased responding is still a matter of conjecture, although the American research seems to support the former conclusion (Marsh, 1987).

References

Adesola, A.O. (1991). 'The Nigerian university system: meeting the challenges of growth in a depressed economy.' Higher Education, 21, 121-133.

Campbell, D.T., and Fiske, D.W. (1959). 'Convergent and discriminant validation by the multitrait-multimethod matrix.' Psychological Bulletin, 56, 81-105.

Clarkson, P.C. (1984). 'Papua New Guinea students' perceptions of mathematics lecturers.' Journal of Educational Psychology, 76, 1386-1395.

Cohen, P.A. (1981). 'Student ratings of instruction and student achievement: A meta-analysis of multisection validity studies.' Review of Educational Research, 51, 281-309.

Dunkin, M., and Barnes, J. (1986). 'Research on teaching in higher education.' In Wittrock, M.C. (Ed.), Handbook of Research on Teaching, Third Edition. New York: Macmillan.

Frey, P.W. (1973). 'Student ratings of teaching: Validity of several rating factors.' Science, 182, 83-85.

Frey, P.W. (1978). 'A two-dimensional analysis of student ratings of instruction.' Research in Higher Education, 9, 69-91.

Frey, P.W. (1982). 'Components of teaching.' Instructional Evaluation, 7, 3-10.

Frey, P.W., Leonard, D.W., and Beatty, W.W. (1975). 'Student ratings of instruction: Validation research.' American Educational Research Journal, 12, 327-336.

Hull, C.H., and Nie, N.H. (1984). SPSS-X. New York: McGraw-Hill.

Marsh, H.W. (1977). 'The validity of students' evaluations: Classroom evaluations of instructors independently nominated as best and worst teachers by graduating seniors.' American Educational Research Journal, 14, 441-447.

Marsh, H.W. (1981a). 'Students' evaluations of tertiary instruction: Testing the applicability of American surveys in an Australian setting.' Australian Journal of Education, 25, 177-192.

Marsh, H.W. (1981b). 'The use of path analysis to estimate teacher and course effects in student ratings of instructional effectiveness.' Applied Psychological Measurement, 6, 47-60.

Marsh, H.W. (1982a). 'SEEQ: A reliable, valid, and useful instrument for collecting students' evaluations of university teaching.' British Journal of Educational Psychology, 52, 77-95.

Marsh, H.W. (1982b). 'Validity of students' evaluations of college teaching: A multitrait-multimethod analysis.' Journal of Educational Psychology, 74, 264-279.

Marsh, H.W. (1983). 'Multidimensional ratings of teaching effectiveness by students from different academic settings and their relation to student/course/instructor characteristics.' Journal of Educational Psychology, 75, 150-166.

Marsh, H.W. (1984). 'Students' evaluations of university teaching: Dimensionality, reliability, validity, potential biases, and utility.' Journal of Educational Psychology, 76, 707-754.

Marsh, H.W. (1987). 'Students' evaluations of university teaching: Research findings, methodological issues, and directions for future research.' International Journal of Educational Research, 11, 253-388.

Marsh, H.W., Fleiner, H., and Thomas, C.S. (1975). 'Validity and usefulness of student evaluations of instructional quality.' Journal of Educational Psychology, 67, 833-839.

Marsh, H.W., and Hocevar, D. (1983). 'Confirmatory factor analysis of multitrait-multimethod matrices.' Journal of Educational Measurement, 20, 231-248.

Marsh, H.W., and Overall, J.U. (1980). 'Validity of students' evaluations of teaching effectiveness: Cognitive and affective criteria.' Journal of Educational Psychology, 72, 468-475.

Marsh, H.W., Overall, J.U., and Kesler, S.P. (1979). 'Validity of student evaluations of instructional effectiveness: A comparison of faculty self-evaluations and evaluations by their students.' Journal of Educational Psychology, 71, 149-160.

Marsh, H.W., and Roche, L.A. (1991). 'The use of student evaluations of university teaching in different settings: The applicability paradigm.' As yet unpublished manuscript, University of Western Sydney.

Marsh, H.W., Touron, J., and Wheeler, B. (1985). 'Student evaluations of university instructors: The applicability of American instruments in a Spanish setting.' Teaching and Teacher Education, 1, 123-138.

Murray, H.G. (1980). A Comprehensive Plan for the Evaluation of Teaching at the University of Queensland. Brisbane: University of Queensland.

Overall, J.U., and Marsh, H.W. (1980). 'Students' evaluations of instruction: A longitudinal study of their stability.' Journal of Educational Psychology, 72, 321-325.

Watkins, D., Marsh, H., and Young, D. (1987). 'Evaluating tertiary teaching: A New Zealand perspective.' Teaching and Teacher Education, 2, 41-53.

Watkins, D., and Thomas, B. (1991). 'Assessing teaching effectiveness: An Indian perspective.' Assessment and Evaluation in Higher Education, 16, 185-198.