The Multiple Comparisons Problem in IES Impact Evaluations:
Guidelines and Applications
Peter Z. Schochet and John Deke
June 2009, IES Research Conference
What Is the Problem?
Multiple hypothesis tests are often conducted in impact studies:
– Outcomes
– Subgroups
– Treatment groups
Standard testing methods could yield:
– Spurious significant impacts
– Incorrect policy conclusions
Overview of Presentation
Background
Testing guidelines adopted by IES
Examples of their use by the RELs
New guidance on statistical methods for “between-domain” analyses
Background
Assume a Classical Hypothesis Testing Framework
Test H_0j: Impact_j = 0
Reject H_0j if the p-value of the t-test < α = .05
Chance of finding a spurious impact is 5 percent for each test alone
But If Tests Are Considered Together and No True Impacts…
Number of Tests (a)   Probability That ≥1 t-test Is Statistically Significant
1                     .05
5                     .23
10                    .40
20                    .64
50                    .92
(a) Assumes independent tests
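The probabilities in this table follow from 1 – (1 – .05)^n for n independent tests. A minimal Python sketch, assuming independence as the slide's footnote does (the function name is illustrative and not from the presentation):

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests is significant
    when no true impacts exist."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 10, 20, 50):
    print(n, round(prob_any_false_positive(n), 2))
```

Rounded to two decimals, the printed values match the table's second column.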
Impact Findings Can Be Misrepresented
Publishing bias
A focus on “stars”
Adjustment Procedures Lower α Levels for Individual Tests
Methods control the “combined” error rate
Many available methods:
– Bonferroni: Compare p-values to (.05 / # of tests)
– Fisher’s LSD, Holm (1979), Sidak (1967), Scheffe (1959), Hochberg (1988), Rom (1990), Tukey (1953)
– Resampling methods (Westfall and Young 1993)
– Benjamini-Hochberg (1995)
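As an illustration of how two of the listed procedures lower the level for individual tests, the sketch below applies the Bonferroni rule and the Holm (1979) step-down rule to a set of p-values. The function names and the example p-values are made up for illustration; this is a sketch, not code from the presentation.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ...
    and stop at the first test that fails."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


p = [0.004, 0.015, 0.030, 0.200]  # made-up p-values
print(bonferroni_reject(p))
print(holm_reject(p))
```

With these inputs Bonferroni rejects only the first hypothesis, while Holm also rejects the second, which shows why step-down procedures are somewhat less conservative.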
These Methods Reduce Statistical Power: The Chances of Finding Real Effects
Simulated Statistical Power (a)
Number of Tests   Unadjusted   Bonferroni
5                 .80          .59
10                .80          .50
20                .80          .41
50                .80          .31
(a) Assumes 1,000 treatments and 1,000 controls, 20 percent of all null hypotheses are true, and independent tests
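The slide's power values come from a simulation. As a rough cross-check, the sketch below uses a normal approximation instead: it assumes a two-sided test and an effect size calibrated so that unadjusted power is .80, then recomputes power at the Bonferroni level .05 / n. These assumptions and the function name are mine, not the authors'.

```python
from statistics import NormalDist

Z = NormalDist()


def bonferroni_power(n_tests, alpha=0.05, unadjusted_power=0.80):
    # Effect (in standard-error units) that yields the target power
    # for a single two-sided test at level alpha.
    effect = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(unadjusted_power)
    # Two-sided critical value at the Bonferroni level alpha / n_tests.
    critical = Z.inv_cdf(1 - alpha / (2 * n_tests))
    return Z.cdf(effect - critical) + Z.cdf(-effect - critical)


for n in (5, 10, 20, 50):
    print(n, round(bonferroni_power(n), 2))
```

Under these assumptions the approximation lands within about .01 of the simulated Bonferroni column.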
Basic Testing Guidelines
Balance Type I and II Errors
Problem Should Be Addressed by First Structuring the Data
Structure will depend on the research questions, previous evidence, and theory
Adjustments should not be conducted blindly across all contrasts
The Plan Must Be Specified Up Front
To avoid “fishing” for findings
Study protocols should specify:
– Data structure
– Confirmatory analyses
– Exploratory analyses
– Testing strategy
Delineate Separate Outcome Domains
Based on a conceptual framework
Represent key clusters of constructs
Domain “items” are likely to measure the same underlying trait (have high correlations)
– Test scores
– Teacher practices
– Student behavior
Testing Strategy: Both Confirmatory and Exploratory Components
Confirmatory component
– Addresses central study hypotheses
– Used to make overall decisions about program
– Must adjust for multiple comparisons
Exploratory component
– Identify impacts or relationships for future study
– Findings should be regarded as preliminary
Focus of Confirmatory Analysis Is on Experimental Impacts
Focus is on key child outcomes, such as test scores
Targeted subgroups: e.g., ELL students
Some experimental impacts could be exploratory
– Subgroups
– Secondary child and teacher outcomes
Confirmatory Analysis Has Two Potential Parts
1. Domain-specific analysis
2. Between-domain analysis
Domain-Specific Analysis: Test Impacts for Outcomes as a Group
Create a composite domain outcome
– Weighted average of standardized outcomes
  – Equal weights
  – Expert judgment
  – Predictive validity weights
  – Factor analysis weights
– MANOVA not recommended
Conduct a t-test on the composite
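The domain-specific steps above can be sketched as follows. This is a minimal illustration under several assumptions that the slide does not specify: outcomes are standardized over the full sample, weights are equal by default, and the t-statistic is a simple difference in means that ignores clustering. The function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance


def composite_scores(rows, weights=None):
    """rows holds one list of outcome values per student;
    returns one composite score per student."""
    n_outcomes = len(rows[0])
    weights = weights or [1 / n_outcomes] * n_outcomes  # equal weights by default
    columns = list(zip(*rows))
    # Standardize each outcome to mean 0 and SD 1 over the full sample (assumption).
    z_columns = [[(y - mean(col)) / stdev(col) for y in col] for col in columns]
    return [sum(w * z for w, z in zip(weights, zs)) for zs in zip(*z_columns)]


def t_statistic(treatment, control):
    """Difference-in-means t-statistic with unequal variances; ignores clustering."""
    se = sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))
    return (mean(treatment) - mean(control)) / se


# Made-up data: two test scores per student.
rows = [[52, 48], [60, 55], [47, 50], [65, 63], [58, 61], [43, 45]]
treated = [True, True, False, True, False, False]
scores = composite_scores(rows)
t = t_statistic([s for s, d in zip(scores, treated) if d],
                [s for s, d in zip(scores, treated) if not d])
print(round(t, 2))
```

Other weighting schemes from the list (expert judgment, predictive validity, factor analysis) would enter through the `weights` argument; a clustered design would need a standard error that accounts for school-level variation.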
Between-Domain Analysis: Test Impacts for Composites Across Domains
Are impacts significant in all domains?
– No adjustments are needed
Are impacts significant in any domain?
– Adjustments are needed
– Discussed later
Application of Guidelines by the Regional Educational Labs
Basic Features of the REL Studies
25 Randomized Controlled Trials
– Single treatment and control groups
– Testing diverse interventions
– Typically grades K-8
– Fall-spring data collection, some longer
– Collecting data on teachers and students
Each RCT Provided a Detailed Analysis Plan to IES
Each Plan Included Information on:
Confirmatory research questions
Confirmatory domains and outcomes
Within- and between-domain testing strategy
Study samples
Statistical power levels
Key Features of Confirmatory Domains
Student academic achievement domains are specified in all RCTs
Some domains pertain to:
– Behavioral outcomes
– A specific time period for longitudinal studies
– Subgroups: ELL students
Most RCTs Have Specified Structured Research Questions
Most have fewer than 3 domains
– Some have only 1
– Most domains have a small number of outcomes
Main between-domain question:
“Are there positive impacts in any domain?”
Adjustment Methods for Between-Domain
Confirmatory Analyses
Focus on Methods to Control the Familywise Error Rate (FWER)
FWER = Prob (find ≥1 significant impact given that no impacts truly exist)
Preferred over the false discovery rate developed by Benjamini-Hochberg (BH)
– BH is a preponderance-of-evidence method
– BH does not control the FDR for all forms of dependencies across test statistics
Consider Four FWER Adjustment Methods
Sidak: Exact adjustment when tests are independent
Bonferroni: Approximate adjustment when tests are independent
Generalized Tukey: Adjusts for correlated tests that follow a multivariate t-distribution
Resampling: Robust adjustment for correlated tests for general distributions
Main Research Questions
How do these four methods work?
Are the more complex methods likely to provide more powerful tests for between-domain analyses?
– There are no single-routine statistical packages for the complex methods under clustered designs
Basic Setup for the Between-Domain Analysis
Assume N domain composites
Test whether any domain composite is statistically significant
Aim to control the FWER at α = .05
All methods reduce the α level for individual tests: α* = .05/fact
Sidak
Uses the relation that the FWER = 1 – Pr(correctly not rejecting any of the N null hypotheses)
For independent tests, FWER = 1 – (1 – α*)^N
Sidak picks α* so that FWER = 0.05
For example, if N = 3:
– α* = 1 – 0.95^(1/3) ≈ 0.017
– fact = 0.05 / α* = 2.949 (using the unrounded α*)
The Bonferroni Method Tends to Be More Conservative
α* = .05 / N; fact = N
The Value of fact for the Sidak and Bonferroni Methods
N   Sidak   Bonferroni
1   1       1
2   1.975   2
3   2.949   3
4   3.924   4
5   4.899   5
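The values of fact for the two methods can be recomputed directly from their definitions: Sidak solves 1 – (1 – α*)^N = .05 for α*, and Bonferroni sets α* = .05 / N. A short Python sketch (the function names are illustrative):

```python
def sidak_alpha(n_tests, fwer=0.05):
    """Per-test level that gives the target FWER for independent tests."""
    return 1 - (1 - fwer) ** (1 / n_tests)


def bonferroni_alpha(n_tests, fwer=0.05):
    """Per-test level under the Bonferroni rule."""
    return fwer / n_tests


for n in range(1, 6):
    sidak_fact = 0.05 / sidak_alpha(n)
    bonferroni_fact = 0.05 / bonferroni_alpha(n)
    print(n, round(sidak_fact, 3), round(bonferroni_fact, 3))
```

Because the Sidak fact is always slightly below N, Bonferroni is the marginally more conservative of the two.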
Sidak and Bonferroni Are Likely To Be Conservative with Correlated Tests
Correlated tests can occur if:
– Domain composites are correlated
– Treatment effects are heterogeneous
Yields tests with lower power
Generalized Tukey and Resampling Methods Adjust for Correlated Tests
Let p_i be the p-value from test i
Both methods use the relation: FWER = Pr(min(p_1, p_2, p_3, …, p_N) ≤ .05 | H_0 is true)
Both methods calculate the FWER using the distribution of min(p_1, p_2, p_3, …, p_N) or max(t_1, t_2, t_3, …, t_N)
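To make the role of the maximum test statistic concrete, the Monte Carlo sketch below draws N = 3 equally correlated standard-normal test statistics under the null and uses the distribution of max |z| to estimate the unadjusted FWER and the implied fact. It is a simplified stand-in, not the generalized Tukey or resampling procedure itself: normal statistics replace multivariate t, the equicorrelation structure is an assumption, and all names are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_max_abs_z(rho, n_tests=3, reps=20_000, seed=0):
    """Draw max |z| across equicorrelated standard-normal statistics under the null."""
    rng = random.Random(seed)
    a, b = rho ** 0.5, (1 - rho) ** 0.5
    draws = []
    for _ in range(reps):
        common = rng.gauss(0, 1)  # shared component that induces correlation rho
        draws.append(max(abs(a * common + b * rng.gauss(0, 1))
                         for _ in range(n_tests)))
    return draws


def unadjusted_fwer(draws, alpha=0.05):
    """Share of replications in which at least one test is significant at alpha."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return sum(d > critical for d in draws) / len(draws)


def fact_from_max(draws, fwer=0.05):
    """alpha* is the two-sided tail area at the (1 - fwer) quantile of max |z|;
    fact = fwer / alpha*."""
    cutoff = sorted(draws)[int((1 - fwer) * len(draws))]
    alpha_star = 2 * (1 - NormalDist().cdf(cutoff))
    return fwer / alpha_star


for rho in (0.2, 0.5, 0.8):
    d = simulate_max_abs_z(rho)
    print(rho, round(unadjusted_fwer(d), 3), round(fact_from_max(d), 2))
```

As the correlation rises, the simulated fact falls below the Sidak and Bonferroni values, which is the source of the power gain from methods that account for correlation.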
Generalized Tukey
Assumes test statistics have multivariate t distributions with known correlations
The MULTCOMP package in R can implement this adjustment (Hothorn, Bretz, Westfall 2008)
– Multi-stage procedure that requires user inputs
Using the MULTCOMP Package
Inputs are a vector of impact estimates and the corresponding variance-covariance matrix
Challenge is to get cross-equation covariances of the impact estimates
One option: use the suest command in STATA, then copy resulting covariance matrix to R
– Uses GEE rather than HLM to adjust for clustering
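The two inputs named above, a vector of impact estimates and its variance-covariance matrix, determine the t-statistics and the correlations among them, which is the information a multivariate-t adjustment works from. The Python sketch below shows only that conversion; it is not the MULTCOMP interface, and the numbers are made up.

```python
from math import sqrt


def t_stats_and_correlations(estimates, cov):
    """Turn impact estimates and their variance-covariance matrix
    into t-statistics and a correlation matrix."""
    k = len(estimates)
    se = [sqrt(cov[i][i]) for i in range(k)]
    t_stats = [b / s for b, s in zip(estimates, se)]
    corr = [[cov[i][j] / (se[i] * se[j]) for j in range(k)] for i in range(k)]
    return t_stats, corr


# Made-up impact estimates and cross-equation covariance matrix.
estimates = [0.10, 0.25]
cov = [[0.0100, 0.0060],
       [0.0060, 0.0144]]
t_stats, corr = t_stats_and_correlations(estimates, cov)
print([round(t, 2) for t in t_stats], round(corr[0][1], 2))
```

The off-diagonal covariance is the piece that separate single-equation regressions do not provide, which is why a cross-equation estimator is needed.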
Resampling/Bootstrapping
The distribution of the maximum t-statistic can be estimated through resampling (Westfall and Young 1993)
– Allows for general forms of correlations and outcome distributions
Resampling must be performed “under the null hypothesis”
Homoskedastic Bootstrap Algorithm
1. Calculate impacts and tstats using the original data
2. Define Y* as the residuals from these regressions
3. Repeat the following at least 10,000 times:
– Randomly sample schools, with replacement, from Y*
– Randomly assign sampled schools to treatment and control groups in the same proportion as in the original data
– Calculate impacts and save the maximum absolute tstat
4. Adjusted p-values = proportion of maximum tstats that lie above the absolute value of the original tstats
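The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' program: the data are already aggregated to one row of outcomes per school, the "regression" is a simple difference in means with no covariates, and all function names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance


def t_stats(rows, treat):
    """Difference-in-means t-statistic for each outcome column."""
    out = []
    for col in zip(*rows):
        t = [y for y, d in zip(col, treat) if d]
        c = [y for y, d in zip(col, treat) if not d]
        se = sqrt(variance(t) / len(t) + variance(c) / len(c))
        out.append((mean(t) - mean(c)) / se if se > 0 else 0.0)
    return out


def residuals(rows, treat):
    """Step 2: subtract each arm's mean so the resampled data satisfy the null."""
    means = {}
    for arm in (True, False):
        arm_rows = [r for r, d in zip(rows, treat) if bool(d) == arm]
        means[arm] = [mean(col) for col in zip(*arm_rows)]
    return [[y - m for y, m in zip(r, means[bool(d)])]
            for r, d in zip(rows, treat)]


def maxt_adjusted_p(rows, treat, reps=10_000, seed=0):
    rng = random.Random(seed)
    observed = [abs(t) for t in t_stats(rows, treat)]           # step 1
    resid = residuals(rows, treat)                               # step 2
    n, n_treat = len(rows), sum(1 for d in treat if d)
    exceed = [0] * len(observed)
    for _ in range(reps):                                        # step 3
        # Sample whole schools so correlations across outcomes are preserved.
        sample = [rng.choice(resid) for _ in range(n)]
        labels = [True] * n_treat + [False] * (n - n_treat)
        rng.shuffle(labels)
        max_t = max(abs(t) for t in t_stats(sample, labels))
        for j, obs in enumerate(observed):
            exceed[j] += max_t > obs
    return [e / reps for e in exceed]                            # step 4
```

A call such as `maxt_adjusted_p(rows, treat)` returns one adjusted p-value per outcome. Extending the sketch to student-level data would require resampling schools together with all of their students and using a clustered standard error.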
Example of Resampling Method
Original tstats are 0.793 and 3.247; Adjusted p-values are 0.89 and 0.00
tstat 1   tstat 2   Maximum abs(tstat) (a)
 0.909     2.635    2.635 (1)
 0.892     1.227    1.227 (1)
-2.768     1.342    2.768 (1)
 0.570    -0.237    0.570
-0.574    -1.472    1.472 (1)
-1.245    -0.545    1.245 (1)
 0.798     0.083    0.798 (1)
-0.138     0.027    0.138 (1)
-1.810     0.494    1.810 (1)
(a) 1 = Max tstat > 0.793; 2 = Max tstat > 3.247
Implementation of Resampling
The MULTTEST procedure in SAS implements resampling, but only for non-clustered data
Simple approach: Aggregate data to the school level, and use MULTTEST
More complex approach: Write a program to implement the algorithm with clustering
Comparing Methods
Assume 3 composite domain outcomes with correlations of 0.20, 0.50, and 0.80
Outcomes are normally distributed or heavily skewed normals (focus on skewed)
Four types of comparisons:
– FWER
– Values of fact
– Minimum Detectable Effect Size (MDES)
– “Goal Line” scenario
FWER Values Are Similar by Method Except With Large Correlations
FWER Values, by Method and Test Correlations
                    ρ=0.2   ρ=0.5   ρ=0.8
No Adjustment       0.146   0.125   0.097
Bonferroni          0.048   0.045   0.034
Sidak               0.050   0.048   0.036
Generalized Tukey   0.049   0.051   0.049
Bootstrap           0.054   0.052   0.051
Values of fact Are Similar by Method Except With Large Correlations
Values of fact, by Method and Test Correlations
                    ρ=0.2   ρ=0.5   ρ=0.8
Bonferroni          3.00    3.00    3.00
Sidak               2.85    2.85    2.85
Generalized Tukey   2.84    2.58    2.02
Bootstrap           2.83    2.57    2.01
All Methods Yield Similar MDES
MDE Values, by Method and Test Correlations (a)
                    ρ=0.2   ρ=0.5   ρ=0.8
No Adjustment       0.21    0.21    0.21
Bonferroni          0.25    0.25    0.25
Sidak               0.24    0.24    0.24
Generalized Tukey   0.24    0.24    0.23
Bootstrap           0.24    0.24    0.23
(a) Assumes 60 schools, 60 students per school, R2=0.50, ICC=0.15
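For orientation, the sketch below computes an MDES with the standard two-level formula for a school-randomized design, using the design parameters in the footnote. Several choices here are assumptions rather than details given on the slide: equal allocation to treatment and control, R2 applied at both levels, 80 percent power, and a normal approximation in place of t-distribution critical values. Small differences from the slide's adjusted values are therefore expected.

```python
from math import sqrt
from statistics import NormalDist


def mdes(alpha, schools=60, students=60, r2=0.50, icc=0.15,
         p_treat=0.5, power=0.80):
    """Minimum detectable effect size (in SD units) for a two-sided test."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Variance of the impact estimate: between-school plus within-school parts.
    var = ((1 - r2) * (icc + (1 - icc) / students)
           / (p_treat * (1 - p_treat) * schools))
    return multiplier * sqrt(var)


print(round(mdes(0.05), 2))       # no adjustment
print(round(mdes(0.05 / 3), 2))   # Bonferroni with 3 domains
```

The point of the comparison is the size of the gap between the two values: lowering the per-test level from .05 to .05 / 3 raises the MDES by only a few hundredths of a standard deviation.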
“Goal Line” Scenario: The Method Could Matter for Marginally Significant Impacts
Adjusted p-values, by Method and Test Correlations (a)
                    ρ=0.2   ρ=0.5   ρ=0.8
No Adjustment       0.019   0.019   0.019
Bonferroni          0.057   0.057   0.057
Sidak               0.054   0.054   0.054
Generalized Tukey   0.054   0.049   0.038
Bootstrap           0.054   0.049   0.038
(a) Assumes 60 schools, 60 students per school, R2=0.50, ICC=0.15
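The slide does not state how the adjusted p-values were computed. One simple reading, offered here as an assumption, is that each equals the unadjusted p-value (0.019) multiplied by the method's value of fact from the preceding slide; that calculation reproduces the table:

```python
# Values of fact as reported on the preceding slide, for rho = 0.2, 0.5, 0.8.
FACT = {
    "Bonferroni": (3.00, 3.00, 3.00),
    "Sidak": (2.85, 2.85, 2.85),
    "Generalized Tukey": (2.84, 2.58, 2.02),
    "Bootstrap": (2.83, 2.57, 2.01),
}


def adjusted_p(p_unadjusted, fact):
    """Scale the unadjusted p-value by fact, capped at 1."""
    return min(1.0, p_unadjusted * fact)


for method, facts in FACT.items():
    print(method, [round(adjusted_p(0.019, f), 3) for f in facts])
```

With ρ = 0.5 or 0.8, only the methods that account for correlation push the adjusted p-value below .05, which is the "goal line" situation the slide describes.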
Summary and Conclusions
Multiple comparisons guidelines:
– Specify confirmatory analyses in study protocols
– Delineate outcome domains
– Conduct hypothesis tests on domain composites
RELs have implemented guidelines
Summary and Conclusions
Adjustments are needed for between-domain analyses
– For calculating MDEs in the design stage, using the Bonferroni is sufficient
– For estimating impacts, the more complex methods may be preferred in “goal-line situations” when test correlations are large
References and Contact Information
Guidelines for Multiple Testing in Impact Evaluations (Schochet 2008): ies.ed.gov/ncee/pubs/20084018.asp
Resampling-Based Multiple Testing (Westfall and Young 1993; John Wiley and Sons)
pschochet@mathematica-mpr.com
jdeke@mathematica-mpr.com