Testing, Stakes, and Feedback in Student Achievement: A Meta Regression Analysis


Richard P. PHELPS & Mónica SILVA

© 2016, Richard P. PHELPS, Mónica SILVA. 18th Congress, World Assn. of Education Research, Eskisehir, Turkey, June 2016

An enormous research literature on the effects of testing

• But, assertions that all or most of it does not exist are common:

– e.g., OECD, World Bank, US National Research Council

– Some claims come from opponents of standardized testing and may be wishful thinking

– Others are “firstness” claims


Dismissive research reviews

• With a dismissive research literature review, a researcher assures all that no other researcher has studied the same topic



Firstness claims

• With a firstness claim, a researcher insists that he or she is the first to ever study a topic



Social costs are enormous

• The public and policy-makers do not understand the relative benefits of testing, and so:

– Do not test when beneficial

– Test when detrimental

– In general, test sub-optimally

• We cycle through pro-testing, anti-testing fads, instead of making adjustments toward optimal use



Meta-analysis

A method for summarizing a research literature with a single, comparable measure.



Background of the study:

• “The effect of testing on student achievement 1910-2010”. Phelps, R. (2012), International Journal of Testing, 12(1), 21-43.

• Phelps analyzed about 700 separate source documents comprising 1,600 studies (quantitative, qualitative and surveys)

• 2,000 other source documents reviewed and found incomplete or inappropriate



Criteria that guided search of studies to include in the Phelps (2012) meta-analyses

1. Studies in English language that found an effect from testing on student achievement or on teacher instruction…



Criteria that guided search of studies to include in the meta-analyses

2. Studies that addressed effects when:

• a test is newly introduced, or newly removed

• quantity of testing is increased or reduced

• test stakes are introduced or increased, or removed or reduced



Looking for studies to include in the meta-analyses

3. Data base keyword search:

• ERIC

• OCLC

• Dissertation Abstracts

• EBSCO

• Google Scholar

• …etc.

4. Citation chain, or “ancestry” method



Number of studies of effects, by methodology type…

Methodology type                                  Number of studies   Number of effects
Quantitative                                      177                 640
Surveys and public opinion polls (US & Canada)    247                 813
Qualitative                                       245                 245
TOTAL                                             669                 1698



Measure of effect size for quantitative studies: Cohen’s d

$d = (Y_E - Y_C) / S_{pooled}$

$Y_E$ = mean, experimental group

$Y_C$ = mean, control group

$S_{pooled}$ = pooled standard deviation
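The definition of Cohen's d above can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual pooled standard deviation (sample variances weighted by their degrees of freedom); the function name `cohens_d` and the score lists are hypothetical and not taken from the study.

```python
import math
import statistics

def cohens_d(experimental, control):
    """Standardized mean difference using the pooled standard deviation."""
    n_e, n_c = len(experimental), len(control)
    mean_e = statistics.mean(experimental)
    mean_c = statistics.mean(control)
    var_e = statistics.variance(experimental)  # sample variance (n - 1 denominator)
    var_c = statistics.variance(control)
    # Assumption: pool the two variances, weighting each by its degrees of freedom.
    pooled_var = ((n_e - 1) * var_e + (n_c - 1) * var_c) / (n_e + n_c - 2)
    return (mean_e - mean_c) / math.sqrt(pooled_var)

# Hypothetical scores, invented for illustration only.
tested = [72, 75, 78, 81, 84]
untested = [70, 73, 76, 79, 82]
d = cohens_d(tested, untested)
print(round(d, 2))  # 0.42
```

Here the two groups differ by 2 points on a scale whose pooled standard deviation is about 4.74, which gives d of about 0.42.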



Effect size: Cohen's (1988) guidelines:

• d between 0.20 and 0.50: weak effect

• d between 0.50 and 0.75: medium effect

• d more than 0.75: strong effect

John Hattie: in educational achievement:

d of 0.5 ≈ one grade level
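The cut-offs on this slide can be applied mechanically, as in the sketch below. Two choices are assumptions rather than anything stated on the slide: a value of exactly 0.50 is treated as medium, and values under 0.20 get a placeholder label because the slide gives none. The function name `label_effect` is hypothetical.

```python
def label_effect(d):
    """Label the magnitude of d using the cut-offs listed on the slide."""
    size = abs(d)
    if size > 0.75:
        return "strong"
    if size >= 0.50:  # assumption: the shared boundary 0.50 counts as medium
        return "medium"
    if size >= 0.20:
        return "weak"
    return "below weak range"  # assumption: the slide gives no label under 0.20

for value in (0.30, 0.55, 0.88):
    print(value, label_effect(value))
```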



Findings from Phelps (2012):

• Survey study effect sizes average >1.0

• Over 90% of qualitative studies positive

• For quantitative studies, univariate effect sizes positive and stronger when:

– Testing more frequently

– Testing with feedback

– Testing with stakes



Overall effect size for quantitative studies*:

• “Bare bones” calculation:

d ≈ +0.55 …a medium effect

• Bare bones effect size adjusted for measurement error

d ≈ +0.71 …a stronger effect

• Using same-study-author aggregation

d ≈ +0.88 …a strong effect

Source: Phelps, 2012.
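The slide does not spell out how the "bare bones" and adjusted figures were computed. As a rough sketch only, the code below assumes a sample-size-weighted mean of the effect sizes and the classical correction for attenuation (dividing by the square root of the outcome measure's reliability). The function names, effect sizes, sample sizes, and the reliability of 0.80 are all hypothetical.

```python
import math

def bare_bones_mean(effects, sample_sizes):
    """Sample-size-weighted mean effect size (assumed 'bare bones' estimate)."""
    total_n = sum(sample_sizes)
    return sum(d * n for d, n in zip(effects, sample_sizes)) / total_n

def correct_for_unreliability(d, reliability):
    """Classical attenuation correction: divide d by sqrt(reliability)."""
    return d / math.sqrt(reliability)

# Hypothetical inputs, invented for illustration only.
effects = [0.30, 0.60, 0.90]
sample_sizes = [100, 200, 100]

mean_d = bare_bones_mean(effects, sample_sizes)
corrected = correct_for_unreliability(mean_d, reliability=0.80)
print(round(mean_d, 2), round(corrected, 2))  # 0.6 0.67
```

Because reliability is at most 1, the corrected estimate is never smaller in magnitude than the uncorrected one, which matches the direction of the adjustment reported on the slide.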



Quantitative studies data base:

- 171 source documents

- 640 studies (i.e., effect sizes)

- population coverage: 7 million

- 100 moderators, 27 coded for this study



Source documents included, by type

Geographic Origin

154 USA
8 Canada
3 Multiple countries
1 Barbados, Belgium, China, Israel, Korea, Mexico, UK

Methodologies

115 Controlled trials, with random assignment
33 Multivariate (e.g., regression, SEM)
20 Pre-post (7 with shadow tests)
5 Post-test only


This study:

• Re-analyzes the Phelps (2012) data set of quantitative studies to test the joint effects of selected moderators through meta-analytic regression

• Meta-analytic regression is the analogue of multiple regression analysis.



Meta-regression provides:

• An estimate of the contribution of testing frequency, stakes and feedback to the prediction of achievement, after controlling for background moderators.

• A test of whether or not the model adequately explains the observed variability in effect sizes



Methodology for model fitting:

• Weighted least-squares regression as outlined by Hedges & Olkin (1985)

• Moderators in the equation selected via univariate significance tests reported in Phelps (2012).
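A weighted least-squares meta-regression of this kind can be sketched in plain Python. This is an illustrative sketch, not the authors' code: it assumes each effect size is weighted by the inverse of its sampling variance, and it reports the weighted residual sum of squares Q_E, which is compared with a chi-square distribution on k - p degrees of freedom to judge model fit. The function names, the single binary moderator, and the data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def wls_meta_regression(effects, variances, moderators):
    """Fit d_i = b0 + b1*x_i1 + ... with weights 1/v_i.

    Returns (coefficients, Q_E, degrees of freedom).
    """
    weights = [1.0 / v for v in variances]       # assumption: inverse-variance weights
    rows = [[1.0] + list(x) for x in moderators]  # prepend the intercept column
    p = len(rows[0])
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             for j in range(p)] for i in range(p)]
    xtwd = [sum(w * r[i] * d for w, r, d in zip(weights, rows, effects))
            for i in range(p)]
    beta = solve(xtwx, xtwd)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    q_e = sum(w * (d - f) ** 2 for w, d, f in zip(weights, effects, fitted))
    return beta, q_e, len(effects) - p

# Hypothetical data: four effect sizes and one binary moderator (e.g., stakes present).
effects = [0.1, 0.5, 0.3, 0.7]
variances = [0.04, 0.04, 0.04, 0.04]
moderators = [[0], [1], [0], [1]]

beta, q_e, df = wls_meta_regression(effects, variances, moderators)
print([round(b, 2) for b in beta], round(q_e, 2), df)  # [0.2, 0.4] 1.0 2
```

In this toy data set the intercept (0.2) is the mean effect when the moderator is absent, and the slope (0.4) is the additional effect when it is present. A Q_E that is large relative to its degrees of freedom would suggest the moderators leave substantial variability unexplained.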



Hierarchical Meta-Regression

Step I: Background Moderators

• Alignment

• Timing of predictor

• Commercial test used

• Shadow test used

• Large scale program

• Longitudinal study type

Step II: Test Moderators

• Stakes

• Feedback

• Frequency


Avoiding “Data Dependence”

Most source documents include multiple studies, which may be conducted on the same population or data set, so they are not independent of each other.

Solution: Run 6 meta-regressions

• 1 with the largest effect sizes from each source document

• 1 with the smallest effect sizes from each source document

• 1 with a randomly chosen study from each document

• each with and without remediation studies
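The selection step behind the six runs can be sketched as follows. The record layout (document id, effect size, remediation flag), the function name `one_effect_per_document`, and the data are hypothetical. Treating remediation as a property of each individual study is also an assumption.

```python
import random
from collections import defaultdict

def one_effect_per_document(records, rule, include_remediation=True, seed=0):
    """Keep one effect size per source document.

    records: iterable of (document_id, effect_size, is_remediation)
    rule: 'largest', 'smallest', or 'random'
    """
    by_doc = defaultdict(list)
    for doc_id, d, is_remediation in records:
        if include_remediation or not is_remediation:
            by_doc[doc_id].append(d)
    rng = random.Random(seed)  # fixed seed so the random draw is reproducible
    pick = {"largest": max, "smallest": min, "random": rng.choice}[rule]
    return {doc_id: pick(ds) for doc_id, ds in by_doc.items()}

# Hypothetical records, invented for illustration only.
records = [
    ("A", 0.2, False), ("A", 0.9, True),
    ("B", 0.5, False),
    ("C", -0.1, False), ("C", 0.4, False),
]

# Three selection rules, each with and without remediation studies: six data sets.
runs = {
    (rule, keep): one_effect_per_document(records, rule, include_remediation=keep)
    for rule in ("largest", "smallest", "random")
    for keep in (True, False)
}
print(runs[("largest", True)])   # {'A': 0.9, 'B': 0.5, 'C': 0.4}
print(runs[("largest", False)])  # {'A': 0.2, 'B': 0.5, 'C': 0.4}
```

Each of the six resulting data sets contains at most one effect size per source document, so the observations entering any single regression do not share a document.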


Results Hierarchical WLSR: % variance explained

Largest effect size studies

Step                        With remediation   Without remediation
1 Background factors        41.4               39.0
2 Test factors              8.6                11.4
Total variance explained    50.0               50.4

Random effect size studies: [table values not recoverable from the source]

Smallest effect size studies: [table values not recoverable from the source]
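In a hierarchical regression, the variance attributed to a step is the increase in explained variance when that step's predictors are added. For the largest-effect-size runs, the reported percentages are consistent with this reading (the notation below is illustrative, not taken from the slides):

```latex
\Delta R^2_{\text{II}} = R^2_{\text{I+II}} - R^2_{\text{I}}
\qquad
\text{with remediation: } 50.0 - 41.4 = 8.6,
\qquad
\text{without remediation: } 50.4 - 39.0 = 11.4
```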





Results Hierarchical WLSR: Direction of effects

Background Moderator                   Largest Effects   Random Effects   Smallest Effects
Longitudinal study                     +                 +                +
Timing of treatment, before outcome    +                 –                ns
Commercial test used                   +                 +                ns
Shadow test used                       –                 –                –
Alignment                              –                 –                +
Large-scale program                    –                 –                –

Test Moderator                         Largest Effects   Random Effects   Smallest Effects
Frequency (existence of testing)       +                 +                +
Stakes (consequences)                  +                 +                ns
Feedback                               +                 +                +

ns = not significant



Relative independent contribution of test factors (post hoc)

When entered last in the equation:

Added variance explained

Testing frequency > 6%

Stakes 4%

Feedback 1%
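One common reading of "added variance explained when entered last" is the drop in explained variance when a single factor is removed from the full model. The formula below is a sketch under that assumption, with illustrative notation:

```latex
\Delta R^2_{j} = R^2_{\text{full}} - R^2_{\text{full without } j},
\qquad j \in \{\text{frequency}, \text{stakes}, \text{feedback}\}
```

Because the three test factors can be correlated with one another, these last-entered contributions need not sum to the variance explained by the test-factor step as a whole.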



Conclusions

With background moderators controlled:


• Frequency, stakes, and feedback significantly contribute to the prediction of achievement gains

• Frequency is the strongest and most consistent predictor

• Stakes and feedback are less consistent predictors

• Substantial variability remains unexplained


Where do we go from here?

• Optimal combinations of frequency, stakes, feedback?

• Tests should be more frequent? How much more?

• Which stakes work best when and on which targets: students, teachers, schools?

• Which feedback works best when and on which targets?

• Others?



Where do we go from here?

• It would be useful to add more moderators

• Unfortunately, the moderators available are mostly determined by the original studies

