Effect Sizes in Education Research: What They Are, What They Mean, and Why They’re Important

Howard Bloom (MDRC; [email protected])
Carolyn Hill (Georgetown; [email protected])
Alison Rebeck Black (MDRC; [email protected])
Mark Lipsey (Vanderbilt; [email protected])

Institute of Education Sciences 2006 Research Conference, Washington DC
Today’s Session

Goal: introduce key concepts and issues
Approach: focus on the nexus between analytics and interpretation
Agenda:
• Core concepts
• Empirical benchmarks
• Important applications
Part 1: The Nature (and Pitfalls) of the Effect Size

Howard Bloom, MDRC
Starting Point

• Statistical significance vs. substantive importance
• Effect size measures for continuous outcomes (our focus)
• Effect size measures for discrete outcomes
The standardized mean difference
ES = (Ȳ_E − Ȳ_C) / σ

where Ȳ_E and Ȳ_C are the experimental- and control-group mean outcomes and σ is the standard deviation of the outcome.

Example: ES = (410 − 400) / 50 = 0.20
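In code, the computation above is a one-liner (a minimal sketch; the function name is mine, not from the slides):

```python
def effect_size(mean_e, mean_c, sd):
    """Standardized mean difference: ES = (mean_E - mean_C) / SD."""
    return (mean_e - mean_c) / sd

# The slide's example: treatment mean 410, control mean 400, SD 50.
print(effect_size(410, 400, 50))  # -> 0.2
```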
Relativity of statistical effect sizes
Variance components framework
Decomposing the total national variance:

σ²_U.S. = σ²_state + σ²_district + σ²_school + σ²_subgroup + σ²_student + σ²_error
Ratio of Student-level to school-level standard deviations
Students in a grade     Intra-class correlation (ρ)
per school (n)        0.05      0.10      0.20
 50                   3.81      2.91      2.15
100                   4.10      3.03      2.19
200                   4.27      3.09      2.21
400                   4.37      3.13      2.22
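One computation consistent with the table treats the school-level SD as the SD of observed school means of n students, so on a unit-variance outcome the ratio is 1/√(ρ + (1 − ρ)/n). This is my reconstruction of how the table was produced, not something the slides state:

```python
import math

def sd_ratio(icc, n):
    """Student-level SD relative to the SD of school means of n students.
    Assumes Var(school mean) = icc + (1 - icc)/n on a unit-variance outcome."""
    return 1.0 / math.sqrt(icc + (1.0 - icc) / n)

for n in (50, 100, 200, 400):
    print(n, [round(sd_ratio(icc, n), 2) for icc in (0.05, 0.10, 0.20)])
```

The printed values reproduce the table above to two decimals.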
Unadjusted vs. regression-adjusted standard deviations
R²      Ratio of unadjusted to adjusted standard deviations
0.1     1.05
0.3     1.20
0.5     1.41
0.7     1.83
0.9     3.16
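These ratios are 1/√(1 − R²): regression adjustment with explanatory power R² shrinks the residual SD by a factor of √(1 − R²). A quick check:

```python
import math

def sd_shrinkage(r2):
    """Ratio of unadjusted to regression-adjusted SD: 1 / sqrt(1 - R^2)."""
    return 1.0 / math.sqrt(1.0 - r2)

for r2 in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(r2, round(sd_shrinkage(r2), 2))
```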
Career Academies and Future Earnings for Young Men

Impact on Earnings
Dollars per month increase:  $212
Percentage increase:         18%
Effect size:                 0.30
Aspirin and heart attacks
Rate of Heart Attacks
With placebo:  1.71%
With aspirin:  0.94%
Difference:    0.77%
Effect size:   0.06

Source: “…Measures of Effect Size,” in Harris Cooper and Larry V. Hedges (eds.), The Handbook of Research Synthesis (New York: Russell Sage Foundation).
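For discrete outcomes like these, the choice of standardizer matters. One computation that recovers the slide’s 0.06 divides the rate difference by the SD of the binary outcome in the placebo group; this is my assumption about how the figure was produced, not something the slide states:

```python
import math

def binary_effect_size(p_treat, p_control):
    """Difference in proportions standardized by the control-group SD of the
    binary outcome (one plausible convention; an assumption, not the slide's
    stated formula)."""
    sd_control = math.sqrt(p_control * (1.0 - p_control))
    return (p_control - p_treat) / sd_control

print(round(binary_effect_size(0.0094, 0.0171), 2))  # -> 0.06
```

A tiny effect size here coexists with a clinically meaningful 45% reduction in heart-attack risk, which is the slide’s point about relativity.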
Five-year impacts of the Tennessee class-size experiment
Treatment: 13-17 versus 22-26 students per class
Effect sizes: 0.11 to 0.22 for reading and math

Findings summarized from Nye, Barbara, Larry V. Hedges and Spyros Konstantopoulos (1999) “The Long-Term Effects of Small Classes: A Five-Year Follow-up of the Tennessee Class Size Experiment,” Educational Evaluation and Policy Analysis, Vol. 21, No. 2: 127-142.
Part 2: What’s a Big Effect Size, and How to Tell?

Carolyn Hill, Georgetown University
Alison Rebeck Black, MDRC
How Big is the Effect?
Need to interpret an effect size when:
• Designing an intervention study
• Interpreting an intervention study
• Synthesizing intervention studies

To assess practical significance of an effect size:
• Compare to an external criterion/standard
• Related to the outcome construct
• Related to the context
Prevailing Practice for Interpreting Effect Size:
“Rules of Thumb”

Cohen (speculative):
Small = 0.20   Medium = 0.50   Large = 0.80
Cohen, Jacob (1988) Statistical Power Analysis for the Behavioral Sciences, 2nd edition (Hillsdale, NJ: Lawrence Erlbaum).

Lipsey (empirical):
Small = 0.15   Medium = 0.45   Large = 0.90
Lipsey, Mark W. (1990) Design Sensitivity: Statistical Power for Experimental Research (Newbury Park, CA: Sage Publications).
Preferred Approaches for Assessing Effect Size (K-12)
Compare the ES from the study with:
• ES distributions from similar studies
• Student attainment of a performance criterion without the intervention
• Normative expectations for change
• Subgroup performance gaps
• School performance gaps
ES Distribution from Similar Studies
Percentile distribution of 145 achievement effect sizes from a meta-analysis of comprehensive school reform studies (Borman et al. 2003):

Percentile        5th    25th    50th    75th    95th
Effect size (σ)  -0.06   0.07    0.16    0.25    0.39
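Given a vector of effect sizes from comparable studies, a percentile summary like the one above is a single numpy call. Illustrative only: the data below are simulated, not Borman et al.’s actual 145 estimates:

```python
import numpy as np

# Hypothetical stand-in for a collection of 145 study effect sizes.
rng = np.random.default_rng(0)
effect_sizes = rng.normal(loc=0.16, scale=0.14, size=145)

# Summarize the distribution at the percentiles shown on the slide.
for p in (5, 25, 50, 75, 95):
    print(f"{p}th percentile: {np.percentile(effect_sizes, p):.2f}")
```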
Attainment of Performance Criterion Based on Effect Size
Attainment of Performance Criterion (continued)
Normative Expectations for Change: Estimating Annual Reading and Math Gains
in Effect Size from National Norming Samples for Standardized Tests
• Seven tests were used for reading and six tests were used for math
• The mean and standard deviation of scale scores for each grade were obtained from test manuals
• The standardized mean difference across succeeding grades was computed
• These results were averaged across tests and weighted according to Hedges (1982)
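The grade-to-grade computation might look like this (a sketch; the function and the pooled-SD convention are my assumptions, since the slides do not spell out the pooling):

```python
import math

def annual_gain_es(mean_g, sd_g, mean_g1, sd_g1):
    """Grade-to-grade gain as a standardized mean difference, using the
    pooled SD of the two adjacent grades (one reasonable convention)."""
    pooled_sd = math.sqrt((sd_g ** 2 + sd_g1 ** 2) / 2.0)
    return (mean_g1 - mean_g) / pooled_sd

# Hypothetical scale scores for two adjacent grades from a test manual:
print(round(annual_gain_es(600.0, 40.0, 618.0, 42.0), 2))  # -> 0.44
```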
Annual Reading and Math Growth

Grade         Reading Growth   Math Growth
Transition    Effect Size      Effect Size
-------------------------------------------
K - 1         1.59             1.13
1 - 2         0.94             1.02
2 - 3         0.57             0.83
3 - 4         0.37             0.50
4 - 5         0.40             0.59
5 - 6         0.35             0.41
6 - 7         0.21             0.30
7 - 8         0.25             0.32
8 - 9         0.26             0.19
9 - 10        0.20             0.22
10 - 11       0.21             0.15
11 - 12       0.03             0.00
-------------------------------------------
Based on work in progress using documentation on the national norming samples for the CAT5, SAT9, Terra Nova CTBS, Gates MacGinitie, MAT8, Terra Nova CAT, and SAT10.
Annual Reading Gain in Effect Size from Seven Nationally Normed Tests

[Figure: annual reading-gain effect size (σ) by beginning grade (0-12), with fitted trend y = 0.0173x² − 0.2863x + 1.2938, R² = 0.856]
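A quadratic trend like the one on the slide can be fit by least squares to the averaged reading gains from the table above (my reconstruction; the slide’s coefficients were presumably fit to the underlying test-level data, so they need not match exactly):

```python
import numpy as np

# Averaged annual reading gains by beginning grade (K = 0), from the table above.
grades = np.arange(12)
reading_gain = np.array([1.59, 0.94, 0.57, 0.37, 0.40, 0.35,
                         0.21, 0.25, 0.26, 0.20, 0.21, 0.03])

# Quadratic least-squares fit, analogous to the trend line on the slide.
coeffs = np.polyfit(grades, reading_gain, 2)
print("y = {:.4f}x^2 + {:.4f}x + {:.4f}".format(*coeffs))
```

The fitted curve is convex and decreasing: annual growth is largest in the earliest grades and flattens thereafter.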
Annual Math Gain in Effect Size from Six Nationally Normed Tests

[Figure: annual math-gain effect size (σ) by beginning grade (0-12), with fitted trend y = 0.0082x² − 0.1836x + 1.1544, R² = 0.9576]
Demographic Performance Gaps from Selected Tests
• Interventions may aim to close demographic performance gaps
• The effectiveness of interventions can be judged relative to the size of the gaps they are designed to close
• Effect size gaps vary across grades, years, tests, and districts
Demographic Performance Gap in Reading: Long-Term Trend NAEP Scores

[Figure: White-Black, White-Hispanic, and Female-Male gaps in effect size (σ) at ages 9, 13, and 17]
Demographic Performance Gap in Math: Long-Term Trend NAEP Scores

[Figure: White-Black, White-Hispanic, and Female-Male gaps in effect size (σ) at ages 9, 13, and 17]
Demographic Performance Gaps in Reading from Two School Districts

[Figure: female/male, white/black, and white/Hispanic gaps in effect size (σ) by grade (1-11)]
Demographic Performance Gaps in Math from Two School Districts

[Figure: female/male, white/black, and white/Hispanic gaps in effect size (σ) by grade (1-11)]
Performance Gaps between “Average” and “Weak” Schools
Main idea: What is the performance gap (effect size) for the same types of students in different schools?

Approach:
• Estimate a regression model that controls for student characteristics: race/ethnicity, prior achievement, gender, overage for grade, and free lunch status.
• Infer the performance gap (effect size) between schools at different percentiles of the performance distribution.
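The two-step approach above can be sketched on simulated data (all numbers are hypothetical, and the single prior-achievement covariate stands in for the full covariate list the slide describes):

```python
import numpy as np

rng = np.random.default_rng(42)
n_schools, n_students = 60, 50
school = np.repeat(np.arange(n_schools), n_students)
prior = rng.normal(size=n_schools * n_students)          # stand-in covariate
school_effect = rng.normal(scale=0.25, size=n_schools)   # in student SD units
score = 0.6 * prior + school_effect[school] + rng.normal(size=school.size)

# Step 1: regression-adjust student scores for student characteristics.
X = np.column_stack([np.ones_like(prior), prior])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
resid = score - X @ beta

# Step 2: school-level performance = mean adjusted score per school;
# express the 50th-vs-10th percentile gap in student SD units.
school_means = np.array([resid[school == s].mean() for s in range(n_schools)])
gap = (np.percentile(school_means, 50) - np.percentile(school_means, 10)) / score.std()
print(f"50th-vs-10th percentile school gap: {gap:.2f} SD")
```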
Performance Gap between "Average" (50th percentile) and "Weak" (10th percentile) Schools
[Figure: gap in effect size (σ) by grade (2-11) for District I and District II]
Performance Gap between "Average" (50th percentile) and "Below Average" (25th percentile) Schools

[Figure: gap in effect size (σ) by grade (2-11) for District I and District II]
Interpreting the Magnitude of Effect Sizes
“One size” does not fit all.

Instead, interpret magnitudes of effects in context:
• Of the interventions being studied
• Of the outcomes being measured
• Of the samples/subsamples being examined

Consider different frames of reference in context, instead of a universal standard: ES distributions, external performance criteria, normative change, subgroup/school gaps, etc.
Part 3: Using Effect Sizes in Power Analysis and Research Synthesis

Mark W. Lipsey
Vanderbilt University
Statistical Power
The probability that a true intervention effect will be found statistically significant.
Estimating Statistical Power Prospectively: Finding the MDE
Specify:
1. alpha level (conventionally .05)
2. sample size (at all levels if a multilevel design)
3. correlation between any covariates to be used and the dependent variable
4. intracluster correlation coefficients (ICCs) if a multilevel design
5. target power level (conventionally set at .80)

Estimate: the minimum detectable effect size (MDE)
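The estimation step can be sketched for a two-level cluster-randomized design. This is my simplification in the spirit of the minimum-detectable-effect framework (normal approximation for the multiplier, balanced assignment), not the slides’ own computation:

```python
# Hedged sketch: MDES for a 2-level design, assuming the multiplier
# M = z_{alpha/2} + z_{power} (normal approximation).
from statistics import NormalDist

def mdes(j_clusters, n_per_cluster, icc, r2_cluster=0.0,
         alpha=0.05, target_power=0.80, p_treated=0.5):
    nd = NormalDist()
    m = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(target_power)  # ~2.80 for .05/.80
    var = (icc * (1 - r2_cluster) / j_clusters
           + (1 - icc) / (j_clusters * n_per_cluster)) / (p_treated * (1 - p_treated))
    return m * var ** 0.5

# e.g., 40 classrooms of 20 students, ICC = .15, no covariate:
print(round(mdes(40, 20, 0.15), 2))  # -> 0.39
```

Note how the cluster-level term dominates: quadrupling students per classroom helps far less than doubling the number of classrooms.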
Assessing the MDE
• Compare with a target effect size: the smallest ES judged to have practical significance in the intervention context
• The design is underpowered if MDE > target (back to the drawing board)
• The design is adequately powered if MDE ≤ target
Where Do You Get the Target Value for Practical Significance?

• NOT some broad rule of thumb, e.g., Cohen’s “small,” “medium,” and “large”
• Use a frame of reference appropriate to the outcome, population, and intervention:
  - a meaningful success criterion
  - research findings for similar interventions
  - change expected without intervention
  - gaps between relevant comparison groups
  - et cetera
Selecting the Target MDE
• Identify one or more reference frames that may be applicable to the intervention circumstances
• Use that frame to guide selection of an MDE; involve other stakeholders
• Use different reference frames to consider:
  - which is most applicable to the context
  - how sensitive the choice is to the frames
  - what the most conservative selection might be
Power for Different Target MDEs (2-level design: students in classrooms)

[Figure: power versus number of classrooms of N = 20, for ES = .20, .50, and .80, with ICC = .15; the conventional .80 power level is marked]
Power for Different Target MDEs (same design with classroom covariate R² = .50)

[Figure: power versus number of classrooms of N = 20, for ES = .20, .50, and .80, with ICC = .15; the conventional .80 power level is marked]
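Under a normal approximation, power curves like these can be sketched directly. This is my simplified reconstruction (balanced assignment, so the design-effect factor is 4·(ρ(1−R²)/J + (1−ρ)/(Jn))), not the slides’ exact computation:

```python
# Hedged sketch: normal-approximation power for a 2-level design
# (students nested in classrooms), balanced treatment/control assignment.
from statistics import NormalDist

def power(es, j_classrooms, n=20, icc=0.15, r2_cluster=0.0, alpha=0.05):
    """Approximate power to detect effect size `es` with j_classrooms of n students."""
    var_es = 4 * (icc * (1 - r2_cluster) / j_classrooms
                  + (1 - icc) / (j_classrooms * n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(es / var_es ** 0.5 - z_crit)

# Classrooms needed to reach power .80 without a covariate:
for es in (0.20, 0.50, 0.80):
    j = next(j for j in range(4, 2000, 2) if power(es, j) >= 0.80)
    print(f"ES = {es:.2f}: ~{j} classrooms")
```

Passing r2_cluster=0.5 mirrors the second chart: a classroom-level covariate with R² = .50 halves the cluster-level variance term, which is why the covariate-adjusted design reaches .80 power with fewer classrooms.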
Interpreting Effect Sizes Found in Individual Studies & Meta-Analysis

• The practical significance of empirically observed effect sizes should be interpreted using approaches like those described here
• This is especially important when disseminating research results to practitioners and policymakers
• For standardized achievement measures, the practical significance of ES values will vary by student population and grade
Example: Computer-Assisted Instruction for Beginning Reading (Grades 1-4)
• Consider an MDE = .25
• Mean ES = .25 found in the Blok et al. (2002) meta-analysis
• A 27-65% increase over “normal” year-to-year growth, depending on age
• About 30% of the Grade 4 majority-minority achievement gap
References

Bloom, Howard S. 2005. “Randomizing Groups to Evaluate Place-Based Programs.” In Howard S. Bloom, editor, Learning More from Social Experiments: Evolving Analytic Approaches. New York: Russell Sage Foundation, pp. 115-172.

Bloom, Howard S. 1995. “Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs.” Evaluation Review 19(5): 547-56.

Borman, Geoffrey D., Gina M. Hewes, Laura T. Overman, and Shelly Brown. 2003. “Comprehensive School Reform and Achievement: A Meta-Analysis.” Review of Educational Research 73(2): 125-230.

Hedges, Larry V. 1982. “Estimation of Effect Size from a Series of Independent Experiments.” Psychological Bulletin 92(2): 490-499.

Kane, Thomas J. 2004. “The Impact of After-School Programs: Interpreting the Results of Four Recent Evaluations.” William T. Grant Foundation Working Paper, January 16. http://www.wtgrantfoundation.org/usr_doc/After-school_paper.pdf

Konstantopoulos, Spyros, and Larry V. Hedges. 2005. “How Large an Effect Can We Expect from School Reforms?” Working paper #05-04, Institute for Policy Research, Northwestern University. http://www.northwestern.edu/ipr/publications/papers/2005/WP-05-04.pdf

Lipsey, Mark W. 1990. Design Sensitivity: Statistical Power for Experimental Research. Thousand Oaks, CA: Sage Publications.

Schochet, Peter Z. 2005. “Statistical Power for Random Assignment Evaluations of Education Programs.” Project report submitted by Mathematica Policy Research, Inc. to the Institute of Education Sciences, U.S. Department of Education. http://www.mathematica-mpr.com/publications/PDFs/statisticalpower.pdf
Contact Information
Howard Bloom ([email protected])
Carolyn Hill ([email protected])
Alison Rebeck Black ([email protected])
Mark Lipsey ([email protected])