Testing, Stakes, and Feedback in Student Achievement: A Meta Regression Analysis
Richard P. PHELPS & Mónica SILVA
© 2016, Richard P PHELPS, Monica SILVA 18th Congress, World Assn. of Education Research, Eskisehir, Turkey, June, 2016 2
An enormous research literature on the effects of testing
• But, assertions that all or most of it does not exist are common:
– e.g., OECD, World Bank, US National Research Council
– Some claims come from opponents of standardized testing and may be wishful thinking
– Others are “firstness” claims
Dismissive research reviews
• With a dismissive research literature review, a researcher assures all that no other researcher has studied the same topic
Firstness claims
• With a firstness claim, a researcher insists that he or she is the first to ever study a topic
Social costs are enormous
• The public and policy-makers do not understand the relative benefits of testing, and so:
– Do not test when beneficial
– Test when detrimental
– In general, test sub-optimally
• We cycle through pro-testing, anti-testing fads, instead of making adjustments toward optimal use
Meta-analysis
A method for summarizing research literature, with a single, comparable measure.
Background of the study:
• “The effect of testing on student achievement 1910-2010”. Phelps, R. (2012), International Journal of Testing, 12(1), 21-43.
• Phelps analyzed about 700 separate source documents comprising 1,600 studies (quantitative, qualitative and surveys)
• 2,000 other source documents reviewed and found incomplete or inappropriate
Criteria that guided search of studies to include in the Phelps (2012) meta-analyses
1. Studies in English language that found an effect from testing on student achievement or on teacher instruction…
Criteria that guided search of studies to include in the meta-analyses
2. Studies that addressed effects when:
• a test is newly introduced, or newly removed
• quantity of testing is increased or reduced
• test stakes are introduced or increased, or removed or reduced
Looking for studies to include in the meta-analyses
3. Database keyword search:
• ERIC
• OCLC
• Dissertation Abstracts
• EBSCO
• Google Scholar …etc.
4. Citation chain, or “ancestry” method
Number of studies of effects, by methodology type…
Methodology type                                  Number of studies   Number of effects
Quantitative                                      177                 640
Surveys and public opinion polls (US & Canada)    247                 813
Qualitative                                       245                 245
TOTAL                                             669                 1698
Measure of effect size for quantitative studies: Cohen’s d
$d = \frac{Y_E - Y_C}{S_{pooled}}$
$Y_E$ = mean, experimental group
$Y_C$ = mean, control group
$S_{pooled}$ = pooled standard deviation
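A minimal sketch of the calculation in Python. The slide does not say how the pooled standard deviation is computed; the version below assumes the usual degrees-of-freedom-weighted pooling of the two sample variances. The function name `cohens_d` and the scores are hypothetical, invented only for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(experimental, control):
    """Standardized mean difference using a pooled standard deviation."""
    n_e, n_c = len(experimental), len(control)
    # Assumption: pool the sample variances, weighted by degrees of freedom.
    pooled_var = ((n_e - 1) * variance(experimental)
                  + (n_c - 1) * variance(control)) / (n_e + n_c - 2)
    return (mean(experimental) - mean(control)) / sqrt(pooled_var)


# Hypothetical scores, not taken from any study in the data base.
tested = [12, 14, 16, 18, 20]
untested = [10, 12, 14, 16, 18]
d = cohens_d(tested, untested)
print(f"d = {d:.2f}")
```

Because d is expressed in standard deviation units, effects measured on different test scales become comparable, which is what allows them to be combined in a meta-analysis.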
Effect size: Cohen’s (1988) guidelines:
• d between 0.20 and 0.50: weak effect
• d between 0.50 and 0.75: medium effect
• d more than 0.75: strong effect
John Hattie: in educational achievement:
d of 0.5 ≈ one grade level
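The cut-offs listed above can be applied mechanically, as in the sketch below. Two details are assumptions, since the slide leaves them open: values below 0.20 get a placeholder label, and a value that falls exactly on a boundary is assigned to the higher band (except 0.75, which the slide places outside "more than 0.75"). The function name `classify_effect` is hypothetical. The three values are the overall estimates from Phelps (2012) reported in this deck.

```python
def classify_effect(d):
    """Label an effect size using the cut-offs listed on the slide."""
    size = abs(d)  # Assumption: judge magnitude regardless of sign.
    if size > 0.75:
        return "strong"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "weak"
    return "below 0.20"  # Assumption: the slide gives no label for this range.


# Overall quantitative effect sizes reported in Phelps (2012).
for label, d in [("bare bones", 0.55),
                 ("adjusted for measurement error", 0.71),
                 ("same-study-author aggregation", 0.88)]:
    print(f"{label}: d = {d:+.2f} ({classify_effect(d)})")
```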
Findings from Phelps (2012):
• Survey study effect sizes average >1.0
• Over 90% of qualitative studies positive
• For quantitative studies, univariate effect sizes positive and stronger when:
– Testing more frequently
– Testing with feedback
– Testing with stakes
Overall effect size for quantitative studies*:
• “Bare bones” calculation:
d ≈ +0.55 …a medium effect
• Bare bones effect size adjusted for measurement error
d ≈ +0.71 …a stronger effect
• Using same-study-author aggregation
d ≈ +0.88 …a strong effect
Source: Phelps, 2012.
Quantitative studies data base:
- 171 source documents
- 640 studies (i.e., effect sizes)
- population coverage: 7 million
- 100 moderators, 27 coded for this study
Source documents included, by type
Geographic Origin
154  USA
8    Canada
3    Multiple countries
1    Barbados, Belgium, China, Israel, Korea, Mexico, UK
Methodologies
115  Controlled trials, with random assignment
33   Multivariate (e.g., regression, SEM)
20   Pre-post (7 with shadow tests)
5    Post-test only
This study:
• Re-analyzes Phelps (2012) data set of quantitative studies to test the joint effects of selected moderators through meta-analytic regression
• Meta-analytic regression is the meta-analysis analogue of multiple regression analysis.
Meta-regression provides:
• An estimate of the contribution of testing frequency, stakes and feedback to the prediction of achievement, after controlling for background moderators.
• A test of whether or not the model adequately explains the observed variability in effect sizes
Methodology for model fitting:
• Weighted least-squares regression as outlined by Hedges & Olkin (1985)
• Moderators in the equation selected via univariate significance tests reported in Phelps (2012).
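As a rough illustration of weighted least squares in a meta-analytic setting, the sketch below regresses effect sizes on a single binary moderator. It assumes inverse-variance weights and the common large-sample approximation for the sampling variance of d; the deck does not state which weights were used. The function names and the four studies are hypothetical.

```python
def d_variance(d, n_e, n_c):
    """Large-sample approximation to the sampling variance of Cohen's d."""
    return (n_e + n_c) / (n_e * n_c) + d ** 2 / (2 * (n_e + n_c))


def wls_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x with weights w."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical studies: (d, n experimental, n control, feedback given? 1/0).
studies = [(0.30, 40, 40, 0), (0.45, 100, 100, 0),
           (0.80, 60, 60, 1), (0.70, 150, 150, 1)]

# Assumption: each study is weighted by the inverse of its sampling variance.
weights = [1 / d_variance(d, n_e, n_c) for d, n_e, n_c, _ in studies]
intercept, slope = wls_fit([s[3] for s in studies],
                           [s[0] for s in studies], weights)
print(f"effect without feedback: {intercept:.2f}, added by feedback: {slope:.2f}")
```

With a single 0/1 moderator, the intercept equals the weighted mean effect size of the studies coded 0 and the slope equals the difference between the two weighted group means, so larger and more precise studies pull the estimates toward their own values.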
Hierarchical Meta-Regression
Step I: Background Moderators
• Alignment
• Timing of predictor
• Commercial test used
• Shadow test used
• Large scale program
• Longitudinal study type
Step II: Test Moderators
• Stakes
• Feedback
• Frequency
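The two-step logic can be sketched as follows: fit the background moderators first, then add the test moderators and record how much the share of explained variance rises. This is only an illustration under stated assumptions. It uses one background moderator and one test moderator, a weighted R-squared as the measure of variance explained, and a small hand-written linear solver. The data, weights, and function names are all hypothetical and do not come from the study.

```python
def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def weighted_r2(rows, y, w):
    """Weighted R^2 of a WLS fit of y on the moderators in rows plus an intercept."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtwx = [[sum(wi * d[i] * d[j] for wi, d in zip(w, design))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(wi * d[i] * yi for wi, d, yi in zip(w, design, y))
            for i in range(k)]
    beta = solve(xtwx, xtwy)
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    ss_total = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    ss_resid = sum(wi * (yi - sum(b * v for b, v in zip(beta, d))) ** 2
                   for wi, d, yi in zip(w, design, y))
    return 1 - ss_resid / ss_total


# Hypothetical studies: (d, weight, large-scale program? 1/0, feedback? 1/0).
data = [(0.20, 30.0, 1, 0), (0.35, 50.0, 1, 1), (0.30, 20.0, 1, 0),
        (0.60, 40.0, 0, 0), (0.90, 25.0, 0, 1), (0.75, 35.0, 0, 1),
        (0.55, 45.0, 0, 0), (0.40, 15.0, 1, 1)]
y = [row[0] for row in data]
w = [row[1] for row in data]

r2_step1 = weighted_r2([(row[2],) for row in data], y, w)        # background only
r2_step2 = weighted_r2([(row[2], row[3]) for row in data], y, w)  # plus test moderator
added = r2_step2 - r2_step1
print(f"step 1: {r2_step1:.1%}, step 2 adds: {added:.1%}, total: {r2_step2:.1%}")
```

Entering the background moderators first means the test moderators are credited only with variance that the background moderators could not already explain.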
Avoiding “Data Dependence”
Most source documents include multiple studies, which may be conducted on the same population or data set, so they are not independent of each other.
Solution: Run 6 meta-regressions:
• 1 with the largest effect size from each source document
• 1 with the smallest effect size from each source document
• 1 with a randomly chosen study from each source document
• each run with and without remediation studies
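The selection step above can be sketched as follows: keep one effect size per source document under each rule, once with and once without remediation studies, giving six data sets. The record layout, the document labels, the fixed random seed, and the helper name `one_per_document` are hypothetical choices made for this illustration.

```python
import random
from collections import defaultdict

# Hypothetical records: (source document, effect size, remediation study?).
effects = [("doc_A", 0.20, False), ("doc_A", 0.65, True), ("doc_A", 0.40, False),
           ("doc_B", 0.90, False),
           ("doc_C", 0.10, False), ("doc_C", 0.55, True)]


def one_per_document(records, pick):
    """Group effect sizes by source document and keep one per document."""
    groups = defaultdict(list)
    for doc, d, _ in records:
        groups[doc].append(d)
    return {doc: pick(ds) for doc, ds in groups.items()}


rng = random.Random(0)  # Fixed seed so the random draw is reproducible.
subsets = {
    "with remediation": effects,
    "without remediation": [r for r in effects if not r[2]],
}

datasets = {}
for label, records in subsets.items():
    datasets[("largest", label)] = one_per_document(records, max)
    datasets[("smallest", label)] = one_per_document(records, min)
    datasets[("random", label)] = one_per_document(records, rng.choice)

for key, chosen in datasets.items():
    print(key, chosen)
```

Each resulting data set contains at most one effect size per source document, so no two observations in a regression come from the same document.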
Results Hierarchical WLSR: % variance explained

Largest effect size studies
Step                        With remediation   Without remediation
1 Background factors        41.4               39.0
2 Test factors              8.6                11.4
Total variance explained    50.0               50.4

Random effect size studies
Step                        With remediation   Without remediation
1 Background factors        45.0               45.3
2 Test factors              8.8                9.5
Total variance explained    53.8               54.8

Smallest effect size studies
Step                        With remediation   Without remediation
1 Background factors        29.6               24.8
2 Test factors              7.8                11.8
Total variance explained    37.4               36.6
Results Hierarchical WLSR: Direction of effects

Background Moderator                   Largest Effects   Random Effects   Smallest Effects
Longitudinal study                     +                 +                +
Timing of treatment, before outcome    +                 –                ns
Commercial test used                   +                 +                ns
Shadow test used                       –                 –                –
Alignment                              –                 –                +
Large-scale program                    –                 –                –

Test Moderator                         Largest Effects   Random Effects   Smallest Effects
Frequency (existence of testing)       +                 +                +
Stakes (consequences)                  +                 +                ns
Feedback                               +                 +                +
Relative independent contribution of test factors (post hoc)
When entered last in the equation:
Added variance explained
Testing frequency > 6%
Stakes 4%
Feedback 1%
Conclusions
With background moderators controlled:
• Frequency, stakes, and feedback significantly contribute to the prediction of achievement gains
• Frequency is the strongest and most consistent predictor
• Stakes and feedback are less consistent predictors
• Substantial variability remains unexplained
Where do we go from here?
• Optimal combinations of frequency, stakes, feedback?
• Tests should be more frequent? How much more?
• Which stakes work best when and on which targets: students, teachers, schools?
• Which feedback works best when and on which targets?
• Others?
Where do we go from here?
• It would be useful to add more moderators
• Unfortunately, the moderators available are mostly determined by the original studies