
Socio-Econ. Plann. Sci. Vol. 24, No. 2, pp. 143-153, 1990. Printed in Great Britain. All rights reserved.

0038-0121/90 $3.00 + 0.00. Copyright © 1990 Pergamon Press plc.

Technical Issues in Measuring Scholastic Improvement due to Compensatory Education Programs

ANAND DESAI

School of Public Policy and Management, Ohio State University, 1775 College Road, Columbus, OH 43210-1399, U.S.A.

and

ARIE P. SCHINNAR

Wharton School, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.

(Received September 1989)

Abstract-Evaluation studies which require measurement of the effect of an intervention have been plagued by problems due to the effect of regression to the mean. In this paper, we review some of the efforts made to overcome problems due to this phenomenon in the evaluation of compensatory programs for reading and mathematics. We report on a study of Chapter 1 projects in the School District of Philadelphia where a new index of performance was created to overcome the effects of regression to the mean.

INTRODUCTION

As a response to the belief that economic disadvantage is, at least in part, a source of poor performance at school, the federal government mandated that remedial education programs be developed and made available to appropriate public school students. Evaluations of these federally-funded programs were also mandated. This paper examines the various technical problems found in the literature on evaluation of the effects of compensatory education, and suggests alternative measurement techniques employed in an analysis of similar educational programs carried out by the School District of Philadelphia. In particular, the paper focuses on problems associated with the effect of regression to the mean whenever an effort is made to measure change.

In 1965, Title I of the Elementary and Secondary Education Act became law, providing funds for compensatory education programs. This law was later amended and renamed Chapter 1 under the Education Consolidation and Improvement Act of 1981. A key feature of this amendment was its requirement that the Chapter 1 instructional programs be evaluated. The initial concern of these evaluations was to determine whether the programs were reaching the academically poor students from low-income families for whom they were intended, with economic disadvantage determined by total family income.

For a child to be eligible for compensatory instruction, two criteria must be met. First, his or her performance must be below a predetermined level of academic achievement. Second, the child must be enrolled in a school eligible for Chapter 1 funding. Such eligibility is attained if the percentage of students from poor families exceeds a prespecified minimum. Thus, only those children in schools receiving Chapter 1 funds can receive this compensatory instruction.

A number of programs with similar goals but based on different theories of education delivery are supported by Chapter 1 funds. Program effectiveness, both with regard to design and cost, varies considerably across these differing implementations. Effectiveness studies have sought to explain and identify the sources of these variations and to link costs with levels of improvement attributable to compensatory education. However, there appears to be little agreement in the literature as to what contributes to the success of a compensatory education program [22,33,37]. Virtually no studies have been replicated, making direct comparisons between them difficult at best. Efforts to identify student, teacher, program and school characteristics which influence the effectiveness of compensatory programs have been equally inconclusive.

Methodological problems have also plagued these educational evaluations. The level of statistical rigor of the analyses varies tremendously. The simplest are descriptions of various projects, providing an excellent picture of how they operate but not detailing their effects. While a number of studies have employed statistical techniques, bringing technical rigor to the analyses and “enhancing” their credibility, they have also reflected problems with data and inadequacies in the methods used.

Further, the technical problems, and the order in which they have been addressed, highlight the traditions of researchers. For example, the influences of sociology and psychometrics are evident from the emphasis placed on reliability and validity of data, and on the development of experimental and quasi-experimental designs. Questions about the applicability of achievement tests, as well as doubts about the validity of their results, have raised concern over their use as appropriate measures of a student’s ability. However, the introduction of new testing and scoring procedures to produce more reliable and valid scores of achievement, together with more advanced evaluation designs, does not overcome the basic problems inherent in measuring change.

The usual purpose of a review is to provide a synthesis of previous studies. Given the results of these studies and the lack of general consensus regarding a correct approach to measuring change, we focus rather on the theoretical and empirical issues which have arisen in the literature, with emphasis on how measures of improvement are obtained and on the technical problems observed in these measurements. Studies of program and cost-effectiveness are also reviewed, again from the point of view of the methods used.

PROBLEMS DUE TO REGRESSION TO THE MEAN

In identifying factors that determine the success of a compensatory education program, one must be able to correctly measure the effect of intervention. A major technical difficulty arises in the measurement of such change: students in a compensatory education program are tested twice and the effect, expressed in terms of change from the first or pre-test results to the second or post-test, is assumed to represent the improvement attributable to participation in the program. There is, however, a bias in this measurement resulting from the effect of regression to the mean [46].

This effect may be explained as follows. The score a child receives on a test is a measure of what may be termed the child’s true ability. These test scores are not perfectly reliable; they are affected by such factors as guessing, misunderstanding questions, and mis-marking answer sheets. Test theorists define a true score as the mean of the scores an individual would obtain in repeated administrations of a test [31]. The error in the score could be positive or negative depending, for example, on whether the child made some lucky guesses or mis-marked the answer sheets. When a child scores below the group mean, the observed score is more likely to be under the child’s true score. It follows, therefore, that in a sample of children with below average scores, there is a greater likelihood of there being more negative errors than positive ones. The mean score of such a sub-sample would be spuriously low because of the predominance of negative errors.

When the children in this group are re-tested, their mean test score will be higher, and closer to the overall mean, due to the tendency of the positive and negative errors to even out. Hence, even without any additional instruction, the mean score of the sub-sample would appear to have increased. This statistical artifact is referred to as the “regression to the mean effect” [5,28,29,35].
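To make the artifact concrete, the short simulation below (our illustration; the score distributions, error variance, and cut-off are assumed values, not figures from any of the cited studies) selects low scorers on a noisy pre-test and re-tests them with no instruction at all. The selected group's mean still rises toward the overall mean.

```python
# A minimal sketch of regression to the mean under truncated selection.
# All parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_score = rng.normal(50, 10, n)       # latent "true ability"
pre = true_score + rng.normal(0, 5, n)   # pre-test = true score + measurement error
post = true_score + rng.normal(0, 5, n)  # post-test with NO instruction effect

selected = pre < 40                      # children chosen because they scored low
print(round(pre[selected].mean(), 1))    # spuriously low: negative errors dominate
print(round(post[selected].mean(), 1))   # closer to 50, an apparent "gain"
```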

Program eligibility of children in Chapter 1-funded schools is, as mentioned before, based on poverty and test scores. Students are tested and those scoring below a particular level are considered to be in need of compensatory education. If the selection is done on the basis of these scores then the distribution of the scores of the selected students will be truncated at the cut-off level. Also as previously noted, the preponderance of errors in this sub-sample will most likely be negative, i.e. an individual student’s true score is probably higher than the observed score. When this sub-sample of students is tested again, the new distribution of scores will once again be bell-shaped and, on average, the scores will be higher due to regression to the mean. Thus, if the test for eligibility and the pre-test are one and the same, the mean score will show an improvement on the post-test even without any additional instruction. In other words, there is a tendency in the post-test score mean to regress towards the overall mean.

Two approaches have been used to address this regression problem. One is based on the selection of the sample of children to be included in the evaluation, while the other is based on selecting the measure of improvement.

Sample selection

Proper selection of student samples for the evaluations is important in order to (a) ensure the randomness of data required for statistical inference; and (b) minimize the effect of regression to the mean.

To answer the question, “What would have occurred if the student did not receive the compensatory instruction?”, the statistical approach would be to compare those who received the instruction with those who did not. This approach, however, assumes that the persons in the two groups were comparable; in other words, that the students were essentially similar and were randomly assigned to the two groups. The design requirements of a statistical analysis are, unfortunately, in conflict with the objectives of compensatory education programs, i.e. to provide instruction to all students needing it.

The statistics and education literatures discuss at great length the problems of evaluations involving non-random comparison groups [5,6,38,42]. The evaluation studies attempt to draw upon this literature on quasi-experimental design, but data limitations and the inability to systematically apply the “treatment” to some students and not to others have made the construction of statistically random data sets difficult. The general tendency, however, appears to be to ignore this problem, with the result that little can be ascertained about the marginal effects of compensatory programs.

One approach for reducing the regression effect, through sample selection, is to distinguish the two tests by using a separate criterion for selecting the remedial instruction students and then giving them a pre-test to determine the base level [7, 16,21,23,47]. This process does reduce the effect of regression to the mean but does not eliminate it entirely. A residual effect will remain if there exists a correlation between the selection test score, the pre-test score and the post-test score. As a result of this potential correlation, both the pre- and post-test scores will exhibit regression to the mean when compared to the selection test score. The difference between the two regression effects will be the residual that remains even when the pre-test and the selection tests have been separated.

A curious anomaly can arise if the correlation between the selection test and the pre-test is smaller than that between the selection test and the post-test. In such a situation, the regression towards the mean for the post-test will be smaller than that for the pre-test, so that the difference will be negative. Thus, the residual effect will be a regression away from the mean which, if the gain between the pre- and post-test is small, can manifest itself as a negative effect of the program.
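As an illustration of this residual effect, the sketch below (our own construction, with assumed parameters) gives the selection test and the pre-test a shared transient component, so that the selection test correlates more strongly with the pre-test than with the later post-test. A spurious gain then survives even though eligibility is no longer decided by the pre-test itself; reversing the relative strengths of the correlations would produce the negative anomaly just described.

```python
# Sketch of the residual regression effect that survives separating the
# selection test from the pre-test; the shared "transient" component and all
# distributional parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
ability = rng.normal(50, 10, n)
transient = rng.normal(0, 4, n)                        # short-lived state shared by
selection = ability + transient + rng.normal(0, 5, n)  # tests given close together
pre = ability + transient + rng.normal(0, 5, n)
post = ability + rng.normal(0, 5, n)                   # a year later; no program effect

eligible = selection < 40
print(round(pre[eligible].mean(), 2))    # pulled further below 50 than ...
print(round(post[eligible].mean(), 2))   # ... the post-test mean: a spurious gain
```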

There is also a secondary, presumably smaller, bias due to differential regression to the mean. Differential regression effects occur, for instance, when there exists a wide range of scores in the sub-sample. The effect on scores from one end of the sub-sample will be different than that from the other end. The bias due to the regression effect is compounded in cases where the dispersion is large. Thus, separating the selection test can reduce but not remove the effect. Other efforts to remove the effect of regression to the mean involve incorporating corrections for the bias in the construction of the index used to measure change. We discuss some of these efforts next.

Measurement indices

The regression to the mean effect is acknowledged by most statistically sophisticated evaluations. The use of a vast number of testing instruments for measuring the performance of school children has made it difficult to conduct even the simplest of test score comparisons. This plethora of achievement tests, coupled with the debate in the statistical literature on how best to measure change, has given rise to an extensive literature on the validity, reliability, and proper analyses of these test scores. In addition, some of the statistical techniques that may be used to overcome the regression effect have also been examined [11,12,24,27]. Before discussing remedies for the regression to the mean effect, it would be valuable to note some of the major evaluation studies and the wide variety of test scores they use, in order to indicate the subsequent difficulty encountered in attempts to compare study results.

Chapter 1 funding is mainly used to support remedial instruction in reading and mathematics; thus, many of the evaluation studies are specific to these programs, most using data readily available at the schools under review. The result is that each study relies on test instruments used at local schools; hence, one has to compare scores on the Metropolitan Achievement Tests [7,23], the California Achievement Tests [10,14,16], the Stanford Diagnostic Test [15], and so forth. The use of multiple tests in the same study is not uncommon, particularly when both mathematics and reading are tested [23], or when locally developed tests are used to measure specific effects [21].

In a field rife with discussion over what various tests should or in fact do measure, the lack of standardization is not surprising. However, the lack of standardization goes beyond the type of test and extends also to the score used. The choices are many: raw test scores; corrected (for incorrect answers) scores; skill mastery (dichotomous 0, 1 variables); percentiles (national, regional, local, etc.); grade equivalents; normal curve equivalents; and growth scale scores. (See [34] for a clear and concise discussion of their merits.) Whether or not these scores can be compared across grade levels or be statistically manipulated depends upon how the raw scores were transformed to obtain the achievement indices.

There is, however, a certain degree of consistency in the selection of the variable used in determining the efficacy of compensatory education programs. The most commonly used variable is the difference in the score between two tests.

Associated with each test instrument is a standard error of measurement and a reliability coefficient. Measures of reliability between test instruments purporting to measure the same achievement have also been developed. The simplest correction for regression to the mean in difference scores is provided by a formula based on these reliability coefficients. The correction consists of obtaining a student’s “estimated true score.” Thus, the difference between a child’s estimated true scores on the pre- and post-tests should serve as an unbiased estimate of the improvement attributable to the compensatory education program. This statistic, however, does not fully remove the regression to the mean effect. Problems with the estimate are due, in part, to confusion about which mean to use in its construction: should it be the population mean, the mean of the children who are in the program, or some other mean which may, itself, be subject to the regression effect [24]?
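A rough sketch of this correction is given below. It applies the classical true-score shrinkage formula, estimated true score = mean + reliability × (observed − mean), to hypothetical scores; the reliabilities, reference means, and score values are all assumed, and, as just noted, the choice of reference mean is itself part of the problem.

```python
# A sketch of the "estimated true score" correction; reliability coefficients,
# reference means, and test scores are assumed values for illustration only.
import numpy as np

def estimated_true_score(observed, reference_mean, reliability):
    """Shrink observed scores toward the chosen reference mean."""
    return reference_mean + reliability * (np.asarray(observed, float) - reference_mean)

pre = np.array([31.0, 36.0, 28.0])      # hypothetical pre-test scores
post = np.array([38.0, 40.0, 35.0])     # hypothetical post-test scores

corrected_gain = (estimated_true_score(post, reference_mean=52.0, reliability=0.88)
                  - estimated_true_score(pre, reference_mean=50.0, reliability=0.85))
print(corrected_gain)                   # shrunken gains; the regression effect is
                                        # reduced but, as noted above, not removed
```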

A variation on the above approach is to take the difference between the post-test and a regression-adjusted score. The latter is obtained by identifying a subset of students (a control group) who have not received remedial instruction and giving them the post-test. A straight line is fitted to the control students’ pre-test scores and then used to obtain the predicted score in the absence of instruction [10,26]. This approach, too, is not without problems. Uneven attrition from the samples, and floor and ceiling effects due to the limited range of test scores, can bias the slope of the fitted line, resulting in over- or underestimates of the gain.
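The sketch below illustrates this variation under the assumption that the fitted line relates the control group's pre-test scores to their post-test scores; all data are synthetic and the implied program effect is invented for the example.

```python
# Sketch of the regression-adjusted comparison: fit a line on control students,
# predict what treated students would have scored without the program, and
# compare. All scores and effect sizes are synthetic assumptions.
import numpy as np

rng = np.random.default_rng(2)
ctrl_pre = rng.normal(55, 8, 200)
ctrl_post = 10 + 0.9 * ctrl_pre + rng.normal(0, 5, 200)     # control growth pattern
slope, intercept = np.polyfit(ctrl_pre, ctrl_post, 1)       # fitted straight line

treat_pre = rng.normal(40, 6, 100)
treat_post = 14 + 0.9 * treat_pre + rng.normal(0, 5, 100)   # pretend the program adds ~4

predicted = intercept + slope * treat_pre                   # expected score, no program
print(round((treat_post - predicted).mean(), 2))            # regression-adjusted gain
```

Note that the treated students' pre-test range only partly overlaps the control range here, which is exactly the kind of extrapolation the attrition and floor-effect caveats above concern.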

To summarize, most attempts at correcting for the regression to the mean effect are statistical in nature and themselves suffer either from the regression effect or from other problems in the data. Hence, the longstanding debate on how to measure change continues in the literature with little agreement on remedies for the effect of regression to the mean.

PROGRAM EFFECTIVENESS

The primary objective of Chapter 1 evaluations has been to determine the success of compensatory programs in attaining their educational objectives. In the previous section we reviewed some of the program evaluation studies which have also attempted to address the issue of bias due to the regression to the mean effect. Other studies attempt to measure program effects and then link these effects to various program characteristics in order to identify factors which influence effectiveness [18,19].

A wide variety of techniques have been used in these evaluations, ranging from simple tests of differences to detailed models of the compensatory education process. The simplest analysis consists of comparing the median score on the pre- and post-tests [7]. Another fairly common approach is to use a t-test to compare the means of pre- and post-tests [14,25,26,36,43]. These studies do not, however, employ statistical techniques to remove the effect of extraneous factors such as socio-demographic characteristics. Of the more extensive efforts at removing the influence of background differences, matching pairs of students is probably the most effective [39,40].
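For concreteness, a minimal version of the common pre-/post-test mean comparison is sketched below on synthetic scores (an assumed average gain of three points); as noted above, such a test controls neither for regression to the mean nor for background differences.

```python
# A minimal pre-/post-test mean comparison via a paired t-test on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
pre = rng.normal(42, 6, 80)
post = pre + rng.normal(3, 4, 80)        # assumed average gain of about 3 points

result = stats.ttest_rel(post, pre)
print(round(result.statistic, 2), round(result.pvalue, 4))
```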

In theory, the matched-pairs approach involves taking into account a large number of combinations of factors; but, in practice, it may be possible to find similar students and match them without having access to vast amounts of data. Matched pairs, if then used with analysis of covariance (ANCOVA) or regression models, can be very effective in identifying crucial factors [30]. Analysis of covariance was used in the Triesman et al. [45] study of remedial reading, but its results were not entirely satisfactory, due mainly to the lack of homogeneity within the comparison groups. Theoretical work on circumventing this problem has been discussed by Rubin [41]; however, his suggestions depend upon a teacher’s impression of a student, which may not always be consistent.

Analysis of variance (ANOVA) and covariance models attempt to determine if any of the variation in achievement across projects can be explained by program or other characteristics [20,48]. In contrast to ANCOVA models, ANOVA models make fewer assumptions about the interrelationships between the variables and are, therefore, less susceptible to assumption violations. The results one obtains from both models, however, help only in terms of determining whether there are any group differences, not in terms of the level of difference. In other words, marginal effects cannot be determined using analysis of variance or covariance models; only the presence or absence of an effect can be ascertained.

On the other hand, regression models, if properly developed and replicated, have the potential for providing insight as to the determinants of a good compensatory program. Regression analysis attempts two tasks simultaneously: first, to measure the marginal effect of an intervention; and second, to identify, in the same analysis, those characteristics of the intervention which were most influential in achieving the effect. This approach assumes that the influential characteristics are known, so that the analysis is an attempt to determine their marginal effects. The method does not identify which factors are important, but rather only the ones whose systematic effect cannot be dismissed as nil. Nor does the method provide support for any conclusions drawn about variables that are not statistically significant. In fact, the least important factors may appear, at times, statistically significant while the most important factors remain statistically insignificant.

If the characteristics of the intervention are not known, they should first be identified using one set of data; a second set of data should then be employed to determine the marginal impact of these factors. With the exception of a study by Summers and Wolfe [44], we found no other studies which report having specified the regression equation on one set of data and then replicating it on another. Research results are inconsistent in their identification of influential factors, and the literature on program effectiveness is generally inconclusive [22,33].

Several studies attempt to link cost factors to program effectiveness. However, inferences about the relationships between program and cost effectiveness are confounded by the difficulties that arise in aggregating data so that both are measured on the same unit of analysis. The consensus in the education literature is that the individual student is the proper unit when examining achievements due to the education process [37]. Unfortunately, though, expenditure data are generally not available at the student level of disaggregation. Also, information on the structure or design of the educational process is not particularly meaningful at the individual student level. In order to link expenditures to the actual contributions of Chapter 1, it is therefore necessary to consider program effectiveness at the project level. Perhaps for this reason, attempts to measure the cost-effectiveness of remedial programs have been relatively few. One such evaluation [17] relates expenditures to three alternative measures of achievement devised to correct for regression to the mean. The study concluded that results for all three measures varied across grades. Consequently, no definitive statements could be made about the educational benefit or cost-effectiveness of the compensatory education program. In reviewing ten years of Chapter 1 evaluations, McLaughlin et al. [34] report: “On the topic of general cost-effectiveness, results have been equivocal due primarily to limitations on the validity of effectiveness measures. More recent studies do not have any more encouraging results.”

Data envelopment analysis (DEA) has been proposed as an alternative approach to measuring educational cost-effectiveness by Charnes et al. [8,9] in their study of Program Follow Through. Bessent and Bessent [2] and their associates [3,4] have employed DEA to compare the “efficiency” of schools and to evaluate educational program proposals. A major criticism of these studies, however, is that they assume a priori a systematic positive relationship between resources expended at the school and improvement in student performance. Based on this premise, DEA was used to develop indices of relative performance. A related, but different, approach was developed by the authors in a recent study of Chapter 1 programs carried out for the Philadelphia School District. This study is now discussed.

THE PHILADELPHIA STUDY: ALTERNATIVE APPROACHES

We begin the discussion of our analysis of compensatory education programs in Philadelphia by briefly relating, for comparative purposes, the methods employed. These methods appear to circumvent some of the problems outlined earlier in the paper. More complete details of this study and its results appear in [13].

The objective of our three-year (1982-1985) evaluation, carried out jointly by the School District of the City of Philadelphia and the University of Pennsylvania, was as follows: to determine whether variations in the design and implementation of Chapter 1 programs, as well as variations in the programs’ patterns of resource allocation, had differing effects on student performance. Through this study we introduced a new index of improvement, conducted our data analysis at a project level, and separated the measurement of effectiveness from the explanation of factors influencing effectiveness-all features which appear to distinguish it from other such studies.

The four Chapter 1 programs examined were: Comprehensive Reading, Comprehensive Mathematics, Benchmark, and Project Success. The latter two programs provided remedial instruction in both mathematics and reading. Since the intent of this research was to compare different implementations of these programs, the unit of analysis selected was an individual project within a school. Performance was measured by California Achievement Test (CAT) scores, with a pre-test being given to eligible students prior to program participation and a post-test approximately one year later.

The analysis consisted of four stages: (i) obtaining an index of improvement attributable to these remedial programs; (ii) demonstrating how this index may be used with data on student, project, and school characteristics to identify factors which determine program effectiveness; (iii) obtaining an index of resource efficiency; and (iv) comparing project implementations on the basis of resource efficiency and program effectiveness. One of the criticisms of the earlier studies has been that they used the same analysis both to assess the effect of the remedial programs and to identify the factors which contributed to their success. In separating the assessment of remedial programs from the analysis of the factors which contributed to their success, we were able to distinguish between the ability to identify these factors and the determination of program success itself.

Similar to the Texas studies [2-41, our work utilized DEA in developing multidimensional indices of improvement in reading and mathematics. As noted above, improvement was based on changes in the California Achievement Test scores. In this regard, the scores were combined by project into two groups: Early Elementary (grades 2-4) and Later Elementary (grades 4-6). In addition to being able to compare different implementations of the programs, two other considerations underlie our choice of individual project as the unit of analysis. Combining students from different grades was appropriate since, in many schools, the remedial instruction classes put together children from several grade levels. Secondly, the individual project was the smallest unit for which disaggregated resource utilization information was available. Thus, it seemed preferable to aggregate raw data to this level rather than to combine indices of achievement computed at a lower level of aggregation. In other words, since the resource utilization analysis was to be conducted at the project level, the individual student achievement data were aggregated to this level; the improvement index was then constructed.

To study the impact of differences in expenditures on the effects (i.e. outcomes) of these projects, an index of resource efficiency was also constructed. Three expenditure categories-salaries, fringe benefits for instructional staff, and expenditures on books, materials, and supplies-together with the number of students enrolled in each project, were used to create this index.


Before continuing with discussion of the study, we describe, in some detail, construction of the DEA indices and our efforts toward correction for the regression effect.

Index construction

The indices used in our analysis are based on the ratio of pre- and post-test scores. The following graphical example provides a simplified description of the index employed. The actual analysis is based on an index obtained from simultaneously using three pre- and three post-test scores for each project. To demonstrate construction of the index on paper, the example is restricted to two dimensions.

Consider an analysis of improvement in reading due to compensatory education projects implemented across a number of schools. Let the test scores be measures of comprehension and vocabulary; then the data consist of average pre- and post-test comprehension and vocabulary scores for all the children in a specific project in a given school. The comparisons are thus made between projects across schools.

As a simple, one-dimensional illustration, assume that the index was constructed from only one pre- and post-test score. The data can then be expressed as points along a line, with the index calculated simply as the ratio of the two scores. A project with an average pre-test score of 250 and post-test score of 300 would generate an index of 0.83, while another project with a pre-test score of 270 and the same post-test score of 300 would score 0.90. The first project, having improved from an average score of 250 to 300, shows a greater improvement than does the second project which, with a higher pre-test score (270), did not enjoy the same increase. Being closer to the origin on the index is thus indicative of a higher level of performance than being farther away. A similar logic is extended to two and higher dimensional indices.
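In code, this one-dimensional version of the index is simply the pre-test score per unit of post-test score (a brief restatement of the worked example above):

```python
def ratio_index(pre_score, post_score):
    """Pre-test score per unit of post-test score; smaller means more improvement."""
    return pre_score / post_score

print(round(ratio_index(250, 300), 2))   # 0.83 -> larger improvement
print(round(ratio_index(270, 300), 2))   # 0.9  -> smaller improvement
```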

For a two-dimensional representation, we consider two pre- and post-test scores. In Fig. 1, let the horizontal and vertical axes denote the ratios of the pre-test vocabulary and comprehension scores to the post-test scores, respectively. The axes represent a pre-test score per unit post-test score. Hence, a project is represented by a co-ordinate pair of ratios of pre- to post-test scores. The projects, due to variations in these scores, would thus be scattered in the plane, producing a scatter plot similar to that shown in the figure.

In two and higher dimensions, the notion of superior performance is not obvious. For example, in Fig. 1 it is clear that project B has a smaller ratio along both axes than does project D. A similar comparison between D and E is not possible. The notion of the best practice frontier is introduced to allow such comparisons. This frontier is obtained by connecting all the extreme projects which do not have another project with a smaller ratio along one or the other dimension. A, B, and C are such projects and are said to define the best practice frontier. The frontier is constructed by connecting the points A, B, and C in a piecewise linear fashion and then extending the lines from the extreme points A and C to infinity, parallel to the axes. Thus, QABCQ’ denotes the best practice frontier supported by the best performing projects A, B, and C.

Fig. 1. Hypothetical distribution of projects (axes: ratios of pre- to post-test vocabulary and comprehension scores).

The index of improvement is a relative measure based on the distance of a project from the best practice frontier. The measurement is made along the ray from the origin to the point denoting the project. The score for project D on the improvement index is thus OD’/OD, where D’ is the point at which the ray from the origin to D intersects the best practice frontier. Clearly, for all points on the frontier, the value of this ratio is 1.0. Points off the frontier have positive scores less than 1.0. The portion of the frontier between A and B is called the facet with which the performance of D is compared. Associated with each facet is a reference set which consists of all points on the frontier which define that facet; for example, A and B form the reference set for the facet. The rays from the origin, through the projects on the frontier, partition the data into comparison groups such that the members of a comparison group share a common facet and reference set. The scale thus offers a flexible measure of improvement which is specific to each project. Note that, unlike statistical measures, this scale is not based on average performance but rather is obtained by comparison with the best observed practice.
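The radial score OD'/OD can be computed as a small linear program. The sketch below is our reading of this construction rather than the authors' own code: each project is a pair of pre-to-post ratios, and its score is the smallest factor by which those ratios can be shrunk toward the origin while still being (weakly) dominated by a convex combination of observed projects. The four ratio pairs are hypothetical.

```python
# A sketch of the radial improvement index as a linear program (hypothetical data).
import numpy as np
from scipy.optimize import linprog

ratios = np.array([          # rows: projects; columns: vocabulary, comprehension ratios
    [0.70, 0.95],            # A
    [0.80, 0.82],            # B
    [0.95, 0.72],            # C
    [0.90, 0.93],            # D (interior point)
])

def improvement_index(k, X):
    """min theta s.t. some convex combination of projects lies below theta * X[k]."""
    n, m = X.shape
    c = np.zeros(n + 1)
    c[0] = 1.0                                       # minimise theta
    A_ub = np.hstack([-X[[k]].T, X.T])               # sum_j lam_j x_ij <= theta * x_ik
    b_ub = np.zeros(m)
    A_eq = np.hstack([[0.0], np.ones(n)]).reshape(1, -1)
    b_eq = [1.0]                                     # convexity: sum_j lam_j = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, method="highs")
    return res.fun

for name, k in zip("ABCD", range(4)):
    print(name, round(improvement_index(k, ratios), 3))   # A, B, C score 1.0; D less
```

Projects on the frontier come out at 1.0, while D, lying inside, is scored against the facet formed by A and B.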

In the Philadelphia study, the scale for measuring student improvement was constructed in two stages. First, the ratios of the post-test and pre-test CAT scores were developed for each of the sets of mathematics and reading scores. For mathematics, the index was composed of raw test scores on concepts, computation, and total mathematics skills over the testing interval. The reading components included comprehension, vocabulary, and total reading. Incorporating all dimensions in construction of the index assumes that improvement does not manifest itself evenly across dimensions; for example, some students may show greater improvement in mathematical concepts while others could realize greater gains in computation. The index provides additional flexibility for making comparisons among those sites that show similar patterns of improvement along particular dimensions. A comparative index of project performance thus developed permits controlling for differences among project sites which may be due, for example, to instructional methods or student aptitude.

A frontier of best performing sites (the best practice frontier) was identified, with all project sites on that frontier assigned a score of 1.0. This frontier was then used as a multidimensional reference surface to which other project sites were compared and indexed. The resulting 0 to 1 scale provided a measure of the distance from the best practice frontier (which consisted of sites having shown the most improvement in student achievement scores).

Correction for the regression effect

As in previous studies, the improvement index obtained through frontier analysis is not truly indicative of the relative performance of Chapter 1 projects because of the effect of regression to the mean. Since this effect artificially boosts the improvement index of projects with low pre-test scores, a constraint was imposed on computation of the index. This restriction was accomplished by defining the reference frontier, or best practice, for each Chapter 1 project in terms of only those projects whose average pre-test score was the same or better. The rationale for this constraint is that, as a result of the regression to the mean effect, projects with lower pre-test scores will show greater improvement; thus, a project with a low pre-test score should show at least as much improvement as a project with a higher pre-test score if its performance is to be considered equal to that of the project with the higher score. Hence, if we compare projects with similar or higher average pre-test scores, the scores then act as conservative measures of improvement in which the regression to the mean effect is nullified. We call the scale thus obtained a fair scale.
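Building on the hypothetical linear-programming sketch above, the fair scale can be approximated by filtering each project's peer set before solving; the pre-test means below are invented to show the mechanism, and improvement_index() is the illustrative helper defined earlier, not a routine from the study.

```python
# Sketch of the "fair scale": a project is compared only with projects whose
# average pre-test score is the same or higher. Pre-test means are invented.
import numpy as np

def fair_index(k, ratios, pretest_means):
    peers = np.where(pretest_means >= pretest_means[k])[0]   # same or better pre-test
    k_local = int(np.where(peers == k)[0][0])                # position of project k
    return improvement_index(k_local, ratios[peers])         # helper from the sketch above

pretest_means = np.array([265.0, 245.0, 240.0, 250.0])       # A, B, C, D (hypothetical)
for name, k in zip("ABCD", range(4)):
    print(name, round(fair_index(k, ratios, pretest_means), 3))
# D is no longer compared with B (whose pre-test mean is lower), so its score rises.
```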

In constructing a fair scale in the Philadelphia study, we took advantage of a feature of DEA which could be construed as a drawback of the technique. As noted above, the DEA (best practice) frontier is not a smooth surface but, instead, consists of a number of piecewise linear facets. These facets generate a corresponding partitioning of the data into comparison groups. This ability to form comparison groups serves two purposes. First, it allows one to specify the peer group with which the comparisons are to be made, thereby ensuring that the comparisons are fair. (The fair scale was obtained by defining the peer group such that it nullified the regression effect.) Second, it associates with each member of the comparison group a set of best performing projects (the reference set) which, having been thus identified, may be used as role models to be emulated by other projects in the group.

Overview of the study results

The Philadelphia study represented the first instance of a formal analysis which identified projects “known” by teachers and administrators to be the best. This effort allowed for confirmation of the face validity of the basic analysis. In addition, although the identities of the best and the worst projects were well known, the analysis provided a measure of the gap and, in some instances, surprising information about the size of the gap. For instance, it had been commonly believed that the schools in one area of Philadelphia were uniformly good. However, our analysis showed that even among these schools there was a fair degree of variation and that the worst schools here were not that much superior to some of the so-called “bad” schools in other parts of the district. Identification of the “best” and “worst” schools also set the stage for future case studies which could be used to obtain additional, idiosyncratic information not readily available from the type of analysis described here.

Having obtained this fair scale, we used the index as the dependent variable, with student, project, and school characteristics as the independent variables in a set of regression analyses. The explanatory power of the resulting regression models was poor. The variables of particular interest were those pertaining to the structure of project implementation. Implied was the question of whether differences in project implementation mattered. It was known that project implementations varied from one site to another. Data on these efforts, however, showed little or no variability across projects. This seemed to imply either that the data did not capture project differences or that there were simply no differences in the structure of these projects.

The Philadelphia School District consists of seven administrative sub-districts. The differences among the sub-districts did appear to explain some of the variation in the improvement index. The precise source of these differences was not clear, but they did appear to reflect variations across the sub-districts’ socio-economic characteristics.

Results of the analysis of the improvement index and the resource efficiency index were, fortunately, more conclusive. As described earlier, some projects focused on either reading or mathematics while others considered both components. Our analysis showed that the projects which focused either on reading or on mathematics appeared to perform better than those which included both components. In general, extra expenditure did not appear to buy additional improvement in achievement, implying that if test scores represent the only performance criterion then it is most appropriate to implement only the least expensive projects.

Care should be taken in the use of DEA scores in a regression analysis. Although the distribution of the scores can often be bell-shaped within the 0-1 range, there is always a large mass at the high end of the distribution due to the many data points on the frontier being assigned a value of unity. Clearly, the assumptions underlying ordinary least squares analysis are violated when DEA scores are used as the dependent variable. Acknowledgement of this problem has yielded a variety of efforts aimed at reducing the resulting bias in the estimates of the regression coefficients. Such efforts include conducting the analysis after excluding the frontier data points [13]; assuming that the scores have a lognormal, half-normal or exponential distribution and then obtaining the maximum likelihood estimates [1]; and transforming the data, making alternative assumptions about the distributional form of the transformed variable [1,32]. In these studies, however, it is not evident from comparisons of the resulting coefficients that the computationally more arduous techniques provide more robust results [32]. The use of DEA scores in regression analysis thus needs to be explored further before clear guidelines about appropriate usage can be established.
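As a small illustration of the first of these workarounds, the sketch below (synthetic scores and covariates; the mass of scores at 1.0 is created by clipping) re-estimates an ordinary least squares regression after dropping the projects that sit on the frontier.

```python
# Sketch of one workaround: ordinary least squares with and without the
# frontier observations (those with a DEA score of exactly 1.0). Synthetic data.
import numpy as np

rng = np.random.default_rng(4)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 covariates
scores = np.clip(0.8 + 0.05 * X[:, 1] + rng.normal(0, 0.1, n), None, 1.0)

def ols(design, y):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

interior = scores < 1.0                       # drop projects on the frontier
print(ols(X, scores))                         # all projects
print(ols(X[interior], scores[interior]))     # frontier points excluded
```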

SUMMATION

This analysis of the compensatory education programs in the Philadelphia School District served a most useful purpose in identifying the variations that exist in the performance of different implementations of essentially similar programs. We were not very successful, however, in eliciting the factors (from those on which data were available) that contributed to these differences. Identification of the best and worst performing projects was important in that it pointed to specific implementations which could be used for case studies. This, in turn, can help generate explicit guidelines for improving project performance.

The current study brought together a number of disparate data sets maintained by different divisions within the Philadelphia School District. Development of this database has laid the groundwork for creation of an ongoing monitoring system for various activities within a school district.

Based on our review of the literature and the methods applied in the evaluation of compensatory education programs, as well as our own analysis of such programs within the Philadelphia School District, the following observations can be made:

The requirements of program evaluation vis-à-vis the daily functioning of compensatory education programs are not compatible. Statistically valid evaluation designs require random allocations or matching with control groups. These conditions are in direct conflict with the (non-random) nature of the programs, which attempt to reach as many needy students as possible. The creation of absolute or a priori standards has been suggested in instances where it is not possible to compare among statistically equivalent groups. Relative comparisons remain the best source of information on the effects of these programs, but special attention needs to be paid to correct specification of the units of analysis.

The effects of regression to the mean are difficult to remove using standard statistical techniques. One approach which has had some success suggests that the pre-test not be used as the selection criterion for program eligibility. An alternative approach to reducing regression effects would be to make comparisons only among those students with similar pre-test scores. The partitioning of data required to compare only those students with similar pre-test scores is feasible using the index obtained from procedures such as those employed in the Philadelphia study.

Cost-effectiveness studies provide valuable information on how funds are employed in various projects. The most common problem with these analyses is that the data used were not originally collected for that purpose. The data have to be manipulated in order to bring the achievement and financial data to a common level of aggregation. Careful collection of data for the appropriate unit of analysis is crucial.

Some basic problems continue to mar compensatory education evaluations. These involve phenomena such as the variety of testing instruments used, the validity of the tests themselves, sample selection, identification of appropriate units of analysis, the measurement of program effects, and attempts to make causal statements linking student, teacher, school, and program characteristics to program effectiveness.

Acknowledgements-This paper is based on a joint research project of the University of Pennsylvania and the School District of Philadelphia supported by a grant from the Pew Memorial Trust. The authors would like to acknowledge the contributions of S. H. Davidoff, R. J. Fishman, J. Lytle, A. Summers and B. Whitehill. The authors are also grateful to anonymous referees and the Editor-in-Chief whose comments and suggestions helped strengthen the paper. Remaining errors and omissions are, of course, the responsibility of the authors.

REFERENCES

1. R. D. Banker and H. H. Johnston. Evaluating the impacts of operating strategies on efficiency: an application to the U.S. airline industry. Presented at the Conference on New Uses of DEA in Management, University of Texas at Austin (1989).
2. A. Bessent and E. W. Bessent. Determining the comparative efficiency of schools through data envelopment analysis. Educ. Admin. Q. 16, 57-75 (1980).
3. A. Bessent, E. W. Bessent, J. Kennington and B. Regan. An application of mathematical programming to assess productivity in the Houston independent school district. Mgmt Sci. 28, 1355-1367 (1982).
4. A. Bessent, E. W. Bessent, A. Charnes, W. W. Cooper and N. C. Thorogood. Evaluation of educational program proposals by means of DEA. Educ. Admin. Q. 19, 82-107 (1983).
5. D. T. Campbell and R. F. Boruch. Making the case for randomized assignment to treatments by considering the alternatives: six ways in which quasi-experimental evaluations in compensatory education tend to underestimate effects. In Evaluation and Experiment (Edited by C. A. Bennett and A. A. Lumsdaine), pp. 195-296. Academic Press, New York (1975).
6. D. T. Campbell and J. C. Stanley. Experimental and Quasi-Experimental Designs for Research. Rand McNally, Chicago (1966).
7. E. Chamberlain, D. Beck and I. Johnson. Language development component, compensatory language experiences and reading program. Final Evaluation Report, Columbus Public Schools, Ohio Department of Evaluation Services, Columbus (1983).
8. A. Charnes, W. W. Cooper and E. Rhodes. Measuring the efficiency of decision making units. Eur. J. Opl Res. 3, 429-444 (1979).
9. A. Charnes, W. W. Cooper and E. Rhodes. Data envelopment analysis as an approach for evaluating program and managerial efficiency-with an illustrative application to the program follow through experiment in U.S. public school education. Mgmt Sci. 27, 668-697 (1981).
10. J. Crawford. A study of instructional processes in Title I classes: 1981-82, Oklahoma City public schools. Oklahoma Department of Planning, Research and Evaluation, Oklahoma City, Okla. (1983).
11. L. J. Cronbach and L. Furby. How we should measure “change”-or should we? Psychol. Bull. 74, 68-80 (1970).
12. L. J. Cronbach and G. C. Gleser. Assessing similarity between profiles. Psychol. Bull. 50, 456-473 (1953).
13. A. Desai. Extensions to measures of relative efficiency with an application to educational productivity. Ph.D. dissertation, University of Pennsylvania, Philadelphia, Pa (1986).
14. P. J. Devito and S. S. Rubinstein. A follow-up study of Rhode Island Title I participants. Annual Meeting of the New England Educational Research Organization, Manchester, N.H. (1977).
15. P. F. Dienemann, D. L. Flynn and N. Al-Salam. An evaluation of the cost effectiveness of alternative compensatory reading programs. Vol. I: cost analysis. Final report, Educational Testing Service, Princeton, N.J. (1974).
16. District of Columbia Public Schools. Evaluation of ESEA Title I program: final evaluation report, 1980-81, 1981-82. Division of Quality Assurance, Washington, D.C. (1983).
17. M. T. Errecart. Is RENP cost effective relative to the regular DCPS program? RMC Report, Kensington, Md (1978).
18. D. L. Flynn. An evaluation of the cost effectiveness of alternative compensatory reading programs. Vol. II: model sensitivity. Final report, Educational Testing Service, Princeton, N.J. (1976).
19. D. L. Flynn. An evaluation of the cost effectiveness of alternative compensatory reading programs. Vol. III: cost effectiveness. Final report, Educational Testing Service, Princeton, N.J. (1976).
20. G. V. Glass. Data analysis of the 1968-69 survey of compensatory education (Title I). Final report, University of Colorado, Laboratory of Educational Research, Boulder, Colo. (1970).
21. J. Halliwell. Reading program for optional assignment, Title I. Final evaluation report 1978-79. Community School District, Brooklyn, N.Y. (1979).
22. E. Hanushek. Throwing money at schools. J. Policy Anal. Mgmt 1, 19-41 (1981).
23. Heuristics Inc. Elementary enrichment program. Final evaluation report for Title I, 1978-1979, Dedham, Mass. (1979).
24. A. R. Jensen. Bias in Mental Testing. The Free Press, New York (1980).
25. M. Kaufman, A. Kovner and L. Burg. An evaluation of project LEAP, ESEA Title I program of Medford, Massachusetts, 1977-1978. Northeastern University, Boston, Mass. (1978).
26. M. E. Knight. Evaluation of Title I Program, Community School District 31, New York City, 1978-79. New York, N.Y. (1979).
27. R. L. Linn and J. A. Slinde. The determination of the significance of change between pre- and post-testing periods. Rev. Educ. Res. 47, 121-150 (1977).
28. F. M. Lord. Measurement of growth. Educ. psychol. Measmt 16, 421-437 (1956).
29. F. M. Lord. Further problems in the measurement of growth. Educ. psychol. Measmt 16, 437-451 (1956).
30. F. M. Lord. Large-scale covariance analysis when the control variable is fallible. J. Am. statist. Ass. 55, 307-321 (1960).
31. F. M. Lord and M. R. Novick. Statistical Theories of Mental Test Scores. Addison-Wesley, Reading, Mass. (1968).
32. C. A. K. Lovell, L. C. Walters and L. Wood. Exploring the distribution of DEA scores. Presented at the Conference on New Uses of DEA in Management, University of Texas at Austin (1989).
33. D. F. Luecke and N. F. McGinn. Regression analyses and education production functions: can they be trusted? Harvard Educ. Rev. 45, 325-350 (1975).
34. D. H. McLaughlin, K. J. Gilmartin and R. J. Rossi. Controversies in the evaluation of compensatory education. American Institutes for Research, Report No. AIR 61700-7/77 FR II, Palo Alto, Calif. (1977).
35. Q. McNemar. On growth measurement. Educ. psychol. Measmt 18, 47-55 (1958).
36. V. R. Morgan. Cost study analysis of measured gains in a reading program utilizing individualization of instruction. Florida State University, Tallahassee, Fla (1974).
37. S. P. Mullin and A. A. Summers. Is more better? The effectiveness of spending on compensatory education. Phi Delta Kappan 64, 339-347 (1983).
38. J. C. Nunnally. The study of change in evaluation research: principles concerning measurement, experimental design and analysis. In Handbook of Evaluation Research (Edited by E. L. Struening and M. Guttentag), Vol. I, pp. 101-131. Sage Publications, Beverly Hills, Calif. (1975).
39. D. B. Rubin. Multivariate matching methods that are equal percent bias reducing. I: some examples. Biometrics 32, 109-120 (correction note, p. 955) (1976a).
40. D. B. Rubin. Multivariate matching methods that are equal percent bias reducing. II: maximums on bias reduction for fixed sample sizes. Biometrics 32, 121-132 (1976b).
41. D. B. Rubin. Assignment to treatment group on the basis of a covariate. J. educ. Statist. 2, 1-26 (1977).
42. C. D. Sherwood, J. N. Morris and S. Sherwood. A multivariate nonrandomized matching technique for studying the impact of social interventions. In Handbook of Evaluation Research (Edited by E. L. Struening and M. Guttentag), Vol. I, pp. 183-224. Sage Publications, Beverly Hills, Calif. (1975).
43. M. S. Stearns. Evaluation of the field test of Project Information Packages: Volume I, summary report. Stanford Research Institute, Menlo Park, Calif. (1977).
44. A. A. Summers and B. L. Wolfe. Improving the use of empirical research as a policy tool: replication of educational production functions. Adv. appl. Micro-Econ. 3, 199-227 (1984).
45. D. A. Triesman, M. I. Waller and G. Wilder. A descriptive and analytic study of compensatory reading programs. Final report, Vol. I (PR 75-26). Educational Testing Service, Princeton, N.J. (1975).
46. W. M. K. Trochim. Methodologically based discrepancies in compensatory education evaluation. Evaluation Rev. 6, 443-480 (1982).
47. Vasquez-Nuttall Associates Inc. Chapter 1 elementary reading program. Interim evaluation report 1981-1982. Newton, Mass. (1983).
48. W. M. Wang. Evaluating the effectiveness of compensatory education. System Development Corporation, Santa Monica, Calif. (1980).