Effect Sizes in Education Research: What They Are, What They Mean, and Why They’re Important

Howard Bloom (MDRC; [email protected])
Carolyn Hill (Georgetown; [email protected])
Alison Rebeck Black (MDRC; [email protected])
Mark Lipsey (Vanderbilt; [email protected])

Institute of Education Sciences 2006 Research Conference, Washington DC
Today’s Session

Goal: introduce key concepts and issues
Approach: focus on the nexus between analytics and interpretation
Agenda:
• Core concepts
• Empirical benchmarks
• Important applications
Part 1: The Nature (and Pitfalls) of the Effect Size

Howard Bloom, MDRC
Starting Point

• Statistical significance vs. substantive importance
• Effect size measures for continuous outcomes (our focus)
• Effect size measures for discrete outcomes
The standardized mean difference
ES = (Ȳ_E − Ȳ_C) / σ

where Ȳ_E and Ȳ_C are the experimental- and control-group mean outcomes and σ is the standard deviation of the outcome.

Example: ES = (410 − 400) / 50 = 0.20
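In code, the computation above is a one-liner (a minimal sketch; the function name is mine, not from the slides):

```python
def effect_size(mean_e, mean_c, sd):
    """Standardized mean difference: ES = (mean_E - mean_C) / SD."""
    return (mean_e - mean_c) / sd

# The slide's example: treatment mean 410, control mean 400, SD 50.
print(effect_size(410, 400, 50))  # -> 0.2
```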
Relativity of statistical effect sizes
Variance components framework
Decomposing the total national variance:

σ²_U.S. = σ²_state + σ²_district + σ²_school + σ²_subgroup + σ²_student + σ²_error
Ratio of Student-level to school-level standard deviations
Students in a grade     Intra-class correlation (ρ)
per school (n)        0.05      0.10      0.20
 50                   3.81      2.91      2.15
100                   4.10      3.03      2.19
200                   4.27      3.09      2.21
400                   4.37      3.13      2.22
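One computation consistent with the table treats the school-level SD as the SD of observed school means of n students, so on a unit-variance outcome the ratio is 1/√(ρ + (1 − ρ)/n). This is my reconstruction of how the table was produced, not something the slides state:

```python
import math

def sd_ratio(icc, n):
    """Student-level SD relative to the SD of school means of n students.
    Assumes Var(school mean) = icc + (1 - icc)/n on a unit-variance outcome."""
    return 1.0 / math.sqrt(icc + (1.0 - icc) / n)

for n in (50, 100, 200, 400):
    print(n, [round(sd_ratio(icc, n), 2) for icc in (0.05, 0.10, 0.20)])
```

The printed values reproduce the table above to two decimals.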
Unadjusted vs. regression-adjusted standard deviations
R²      Ratio of unadjusted to adjusted standard deviations
0.1     1.05
0.3     1.20
0.5     1.41
0.7     1.83
0.9     3.16
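These ratios are 1/√(1 − R²): regression adjustment with explanatory power R² shrinks the residual SD by a factor of √(1 − R²). A quick check:

```python
import math

def sd_shrinkage(r2):
    """Ratio of unadjusted to regression-adjusted SD: 1 / sqrt(1 - R^2)."""
    return 1.0 / math.sqrt(1.0 - r2)

for r2 in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(r2, round(sd_shrinkage(r2), 2))
```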
Career Academies and Future Earnings for Young Men

Impact on Earnings
Dollars per month increase:  $212
Percentage increase:         18%
Effect size:                 0.30
Aspirin and heart attacks
Rate of Heart Attacks
With placebo:  1.71%
With aspirin:  0.94%
Difference:    0.77%
Effect size:   0.06

Source: “…Measures of Effect Size,” in Harris Cooper and Larry V. Hedges (eds.), The Handbook of Research Synthesis (New York: Russell Sage Foundation).
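For discrete outcomes like these, the choice of standardizer matters. One computation that recovers the slide’s 0.06 divides the rate difference by the SD of the binary outcome in the placebo group; this is my assumption about how the figure was produced, not something the slide states:

```python
import math

def binary_effect_size(p_treat, p_control):
    """Difference in proportions standardized by the control-group SD of the
    binary outcome (one plausible convention; an assumption, not the slide's
    stated formula)."""
    sd_control = math.sqrt(p_control * (1.0 - p_control))
    return (p_control - p_treat) / sd_control

print(round(binary_effect_size(0.0094, 0.0171), 2))  # -> 0.06
```

A tiny effect size here coexists with a clinically meaningful 45% reduction in heart-attack risk, which is the slide’s point about relativity.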
Five-year impacts of the Tennessee class-size experiment
Treatment: 13-17 versus 22-26 students per class
Effect sizes: 0.11 to 0.22 for reading and math

Findings summarized from Nye, Barbara, Larry V. Hedges and Spyros Konstantopoulos (1999) “The Long-Term Effects of Small Classes: A Five-Year Follow-up of the Tennessee Class Size Experiment,” Educational Evaluation and Policy Analysis, Vol. 21, No. 2: 127-142.
Part 2: What’s a Big Effect Size, and How to Tell?

Carolyn Hill, Georgetown University
Alison Rebeck Black, MDRC
How Big is the Effect?
Need to interpret an effect size when:
• Designing an intervention study
• Interpreting an intervention study
• Synthesizing intervention studies

To assess practical significance of an effect size:
• Compare to an external criterion/standard
• Related to the outcome construct
• Related to the context
Prevailing Practice for Interpreting Effect Size:
“Rules of Thumb”

Cohen (speculative):
Small = 0.20   Medium = 0.50   Large = 0.80
Cohen, Jacob (1988) Statistical Power Analysis for the Behavioral Sciences, 2nd edition (Hillsdale, NJ: Lawrence Erlbaum).

Lipsey (empirical):
Small = 0.15   Medium = 0.45   Large = 0.90
Lipsey, Mark W. (1990) Design Sensitivity: Statistical Power for Experimental Research (Newbury Park, CA: Sage Publications).
Preferred Approaches for Assessing Effect Size (K-12)
Compare the ES from the study with:
• ES distributions from similar studies
• Student attainment of a performance criterion without the intervention
• Normative expectations for change
• Subgroup performance gaps
• School performance gaps
ES Distribution from Similar Studies
Percentile distribution of 145 achievement effect sizes from a meta-analysis of comprehensive school reform studies (Borman et al. 2003):

Percentile        5th    25th    50th    75th    95th
Effect size (σ)  -0.06   0.07    0.16    0.25    0.39
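Given a vector of effect sizes from comparable studies, a percentile summary like the one above is a single numpy call. Illustrative only: the data below are simulated, not Borman et al.’s actual 145 estimates:

```python
import numpy as np

# Hypothetical stand-in for a collection of 145 study effect sizes.
rng = np.random.default_rng(0)
effect_sizes = rng.normal(loc=0.16, scale=0.14, size=145)

# Summarize the distribution at the percentiles shown on the slide.
for p in (5, 25, 50, 75, 95):
    print(f"{p}th percentile: {np.percentile(effect_sizes, p):.2f}")
```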
Attainment of Performance Criterion Based on Effect Size
Attainment of Performance Criterion (continued)
Normative Expectations for Change: Estimating Annual Reading and Math Gains
in Effect Size from National Norming Samples for Standardized Tests
• Seven tests were used for reading and six tests were used for math
• The mean and standard deviation of scale scores for each grade were obtained from test manuals
• The standardized mean difference across succeeding grades was computed
• These results were averaged across tests and weighted according to Hedges (1982)
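The grade-to-grade computation might look like this (a sketch; the function and the pooled-SD convention are my assumptions, since the slides do not spell out the pooling):

```python
import math

def annual_gain_es(mean_g, sd_g, mean_g1, sd_g1):
    """Grade-to-grade gain as a standardized mean difference, using the
    pooled SD of the two adjacent grades (one reasonable convention)."""
    pooled_sd = math.sqrt((sd_g ** 2 + sd_g1 ** 2) / 2.0)
    return (mean_g1 - mean_g) / pooled_sd

# Hypothetical scale scores for two adjacent grades from a test manual:
print(round(annual_gain_es(600.0, 40.0, 618.0, 42.0), 2))  # -> 0.44
```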
Annual Reading and Math Growth

Grade         Reading Growth   Math Growth
Transition    Effect Size      Effect Size
-------------------------------------------
K - 1         1.59             1.13
1 - 2         0.94             1.02
2 - 3         0.57             0.83
3 - 4         0.37             0.50
4 - 5         0.40             0.59
5 - 6         0.35             0.41
6 - 7         0.21             0.30
7 - 8         0.25             0.32
8 - 9         0.26             0.19
9 - 10        0.20             0.22
10 - 11       0.21             0.15
11 - 12       0.03             0.00
-------------------------------------------
Based on work in progress using documentation on the national norming samples for the CAT5, SAT9, Terra Nova CTBS, Gates MacGinitie, MAT8, Terra Nova CAT, and SAT10.
Annual Reading Gain in Effect Size from Seven Nationally Normed Tests

[Figure: annual reading-gain effect size (σ) by beginning grade (0-12), with fitted trend y = 0.0173x² − 0.2863x + 1.2938, R² = 0.856]
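A quadratic trend like the one on the slide can be fit by least squares to the averaged reading gains from the table above (my reconstruction; the slide’s coefficients were presumably fit to the underlying test-level data, so they need not match exactly):

```python
import numpy as np

# Averaged annual reading gains by beginning grade (K = 0), from the table above.
grades = np.arange(12)
reading_gain = np.array([1.59, 0.94, 0.57, 0.37, 0.40, 0.35,
                         0.21, 0.25, 0.26, 0.20, 0.21, 0.03])

# Quadratic least-squares fit, analogous to the trend line on the slide.
coeffs = np.polyfit(grades, reading_gain, 2)
print("y = {:.4f}x^2 + {:.4f}x + {:.4f}".format(*coeffs))
```

The fitted curve is convex and decreasing: annual growth is largest in the earliest grades and flattens thereafter.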
Annual Math Gain in Effect Size from Six Nationally Normed Tests

[Figure: annual math-gain effect size (σ) by beginning grade (0-12), with fitted trend y = 0.0082x² − 0.1836x + 1.1544, R² = 0.9576]
Demographic Performance Gaps from Selected Tests
• Interventions may aim to close demographic performance gaps
• The effectiveness of interventions can be judged relative to the size of the gaps they are designed to close
• Effect size gaps vary across grades, years, tests, and districts
Demographic Performance Gap in Reading: Long-Term Trend NAEP Scores

[Figure: White-Black, White-Hispanic, and Female-Male gaps in effect size (σ) at ages 9, 13, and 17]
Demographic Performance Gap in Math: Long-Term Trend NAEP Scores

[Figure: White-Black, White-Hispanic, and Female-Male gaps in effect size (σ) at ages 9, 13, and 17]
Demographic Performance Gaps in Reading from Two School Districts

[Figure: female/male, white/black, and white/Hispanic gaps in effect size (σ) by grade (1-11)]
Demographic Performance Gaps in Math from Two School Districts

[Figure: female/male, white/black, and white/Hispanic gaps in effect size (σ) by grade (1-11)]
Performance Gaps between “Average” and “Weak” Schools
Main idea: What is the performance gap (effect size) for the same types of students in different schools?

Approach:
• Estimate a regression model that controls for student characteristics: race/ethnicity, prior achievement, gender, overage for grade, and free lunch status.
• Infer the performance gap (effect size) between schools at different percentiles of the performance distribution.
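The two-step approach above can be sketched on simulated data (all numbers are hypothetical, and the single prior-achievement covariate stands in for the full covariate list the slide describes):

```python
import numpy as np

rng = np.random.default_rng(42)
n_schools, n_students = 60, 50
school = np.repeat(np.arange(n_schools), n_students)
prior = rng.normal(size=n_schools * n_students)          # stand-in covariate
school_effect = rng.normal(scale=0.25, size=n_schools)   # in student SD units
score = 0.6 * prior + school_effect[school] + rng.normal(size=school.size)

# Step 1: regression-adjust student scores for student characteristics.
X = np.column_stack([np.ones_like(prior), prior])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
resid = score - X @ beta

# Step 2: school-level performance = mean adjusted score per school;
# express the 50th-vs-10th percentile gap in student SD units.
school_means = np.array([resid[school == s].mean() for s in range(n_schools)])
gap = (np.percentile(school_means, 50) - np.percentile(school_means, 10)) / score.std()
print(f"50th-vs-10th percentile school gap: {gap:.2f} SD")
```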
Performance Gap between "Average" (50th percentile) and "Weak" (10th percentile) Schools
[Figure: gap in effect size (σ) by grade (2-11) for District I and District II]
Performance Gap between "Average" (50th percentile) and "Below Average" (25th percentile) Schools

[Figure: gap in effect size (σ) by grade (2-11) for District I and District II]
Interpreting the Magnitude of Effect Sizes
“One size” does not fit all.

Instead, interpret magnitudes of effects in context:
• Of the interventions being studied
• Of the outcomes being measured
• Of the samples/subsamples being examined

Consider different frames of reference in context, instead of a universal standard: ES distributions, external performance criteria, normative change, subgroup/school gaps, etc.
Part 3: Using Effect Sizes in Power Analysis and Research Synthesis

Mark W. Lipsey
Vanderbilt University
Statistical Power
The probability that a true intervention effect will be found statistically significant.
Estimating Statistical Power Prospectively: Finding the MDE
Specify:
1. alpha level (conventionally .05)
2. sample size (at all levels if a multilevel design)
3. correlation between any covariates to be used and the dependent variable
4. intracluster correlation coefficients (ICCs) if a multilevel design
5. target power level (conventionally set at .80)

Estimate: the minimum detectable effect size (MDE)
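The estimation step can be sketched for a two-level cluster-randomized design. This is my simplification in the spirit of the minimum-detectable-effect framework (normal approximation for the multiplier, balanced assignment), not the slides’ own computation:

```python
# Hedged sketch: MDES for a 2-level design, assuming the multiplier
# M = z_{alpha/2} + z_{power} (normal approximation).
from statistics import NormalDist

def mdes(j_clusters, n_per_cluster, icc, r2_cluster=0.0,
         alpha=0.05, target_power=0.80, p_treated=0.5):
    nd = NormalDist()
    m = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(target_power)  # ~2.80 for .05/.80
    var = (icc * (1 - r2_cluster) / j_clusters
           + (1 - icc) / (j_clusters * n_per_cluster)) / (p_treated * (1 - p_treated))
    return m * var ** 0.5

# e.g., 40 classrooms of 20 students, ICC = .15, no covariate:
print(round(mdes(40, 20, 0.15), 2))  # -> 0.39
```

Note how the cluster-level term dominates: quadrupling students per classroom helps far less than doubling the number of classrooms.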
Assessing the MDE
• Compare with a target effect size: the smallest ES judged to have practical significance in the intervention context
• The design is underpowered if MDE > target (back to the drawing board)
• The design is adequately powered if MDE ≤ target
Where Do You Get the Target Value for Practical Significance?

• NOT some broad rule of thumb, e.g., Cohen’s “small,” “medium,” and “large”
• Use a frame of reference appropriate to the outcome, population, and intervention:
  - a meaningful success criterion
  - research findings for similar interventions
  - change expected without intervention
  - gaps between relevant comparison groups
  - et cetera
Selecting the Target MDE
• Identify one or more reference frames that may be applicable to the intervention circumstances
• Use that frame to guide selection of an MDE; involve other stakeholders
• Use different reference frames to consider:
  - which is most applicable to the context
  - how sensitive the choice is to the frames
  - what the most conservative selection might be
Power for Different Target MDEs (2-level design: students in classrooms)

[Figure: power versus number of classrooms of N = 20, for ES = .20, .50, and .80, with ICC = .15; the conventional .80 power level is marked]
Power for Different Target MDEs (same design with classroom covariate R² = .50)

[Figure: power versus number of classrooms of N = 20, for ES = .20, .50, and .80, with ICC = .15; the conventional .80 power level is marked]
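Under a normal approximation, power curves like these can be sketched directly. This is my simplified reconstruction (balanced assignment, so the design-effect factor is 4·(ρ(1−R²)/J + (1−ρ)/(Jn))), not the slides’ exact computation:

```python
# Hedged sketch: normal-approximation power for a 2-level design
# (students nested in classrooms), balanced treatment/control assignment.
from statistics import NormalDist

def power(es, j_classrooms, n=20, icc=0.15, r2_cluster=0.0, alpha=0.05):
    """Approximate power to detect effect size `es` with j_classrooms of n students."""
    var_es = 4 * (icc * (1 - r2_cluster) / j_classrooms
                  + (1 - icc) / (j_classrooms * n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(es / var_es ** 0.5 - z_crit)

# Classrooms needed to reach power .80 without a covariate:
for es in (0.20, 0.50, 0.80):
    j = next(j for j in range(4, 2000, 2) if power(es, j) >= 0.80)
    print(f"ES = {es:.2f}: ~{j} classrooms")
```

Passing r2_cluster=0.5 mirrors the second chart: a classroom-level covariate with R² = .50 halves the cluster-level variance term, which is why the covariate-adjusted design reaches .80 power with fewer classrooms.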
Interpreting Effect Sizes Found in Individual Studies & Meta-Analysis

• The practical significance of empirically observed effect sizes should be interpreted using approaches like those described here
• This is especially important when disseminating research results to practitioners and policymakers
• For standardized achievement measures, the practical significance of ES values will vary by student population and grade
Example: Computer-Assisted Instruction for Beginning Reading (Grades 1-4)
• Consider an MDE = .25
• Mean ES = .25 found in the Blok et al. (2002) meta-analysis
• A 27-65% increase over “normal” year-to-year growth, depending on age
• About 30% of the Grade 4 majority-minority achievement gap
References

Bloom, Howard S. 2005. “Randomizing Groups to Evaluate Place-Based Programs.” In Howard S. Bloom, editor, Learning More from Social Experiments: Evolving Analytic Approaches. New York: Russell Sage Foundation, pp. 115-172.

Bloom, Howard S. 1995. “Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs.” Evaluation Review 19(5): 547-56.

Borman, Geoffrey D., Gina M. Hewes, Laura T. Overman, and Shelly Brown. 2003. “Comprehensive School Reform and Achievement: A Meta-Analysis.” Review of Educational Research 73(2): 125-230.

Hedges, Larry V. 1982. “Estimation of Effect Size from a Series of Independent Experiments.” Psychological Bulletin 92(2): 490-499.

Kane, Thomas J. 2004. “The Impact of After-School Programs: Interpreting the Results of Four Recent Evaluations.” William T. Grant Foundation Working Paper, January 16. http://www.wtgrantfoundation.org/usr_doc/After-school_paper.pdf

Konstantopoulos, Spyros, and Larry V. Hedges. 2005. “How Large an Effect Can We Expect from School Reforms?” Working paper #05-04, Institute for Policy Research, Northwestern University. http://www.northwestern.edu/ipr/publications/papers/2005/WP-05-04.pdf

Lipsey, Mark W. 1990. Design Sensitivity: Statistical Power for Experimental Research. Thousand Oaks, CA: Sage Publications.

Schochet, Peter Z. 2005. “Statistical Power for Random Assignment Evaluations of Education Programs.” Project report submitted by Mathematica Policy Research, Inc. to the Institute of Education Sciences, U.S. Department of Education. http://www.mathematica-mpr.com/publications/PDFs/statisticalpower.pdf
Contact Information
Howard Bloom ([email protected])
Carolyn Hill ([email protected])
Alison Rebeck Black ([email protected])
Mark Lipsey ([email protected])