Building Evidence in Education: Conference for EEF Evaluators
11th July: Theory; 12th July: Practice


www.educationendowmentfoundation.org.uk

The EEF by numbers

• 56 projects funded to date
• 1,800 schools participating in projects
• 33 topics in the Toolkit
• 16 independent evaluation teams
• 300,000 pupils involved in EEF projects
• 11 members of EEF team
• £200m estimated spend over lifetime of the EEF
• 3,000 heads presented to since launch

Research Design

Stephen Gorard

[email protected]

http://www.evaluationdesign.co.uk/

Stephen Gorard, Research Design: Robust Approaches for the Social Sciences




Outline of a full cycle of research

Phase 1 Evidence synthesis
Phase 2 Development of idea or artifact
Phase 3 Feasibility study
Phase 4 Prototyping and trialling
Phase 5 Field studies and design stage
Phase 6 Definitive testing
Phase 7 Dissemination, impact and monitoring

A model of causation in social science

Association - For X (a possible cause) and Y (a possible effect) to be in a causal relationship they must be repeatedly associated. This association must be strong and clearly observable. It must be replicable, and it must be specific to X and Y.

Sequence – X and Y must proceed in sequence. X must always precede Y (where both appear), and the appearance of Y must be safely predictable from the appearance of X.

Intervention - It must have been demonstrated repeatedly that an intervention to change the strength or appearance of X strongly and clearly changes the strength or appearance of Y.

Explanatory mechanism - There must be a coherent mechanism to explain the causal link. This mechanism must be the simplest available without which the evidence cannot be explained. Put another way, if the proposed mechanism were not true then there must be no simpler or equally simple way of explaining the evidence for it.

Red herrings and real problems. Some reflections on the evaluation of Aimhigher

http://www.heacademy.ac.uk/assets/documents/aim_higher/Aspire-Reflections_on_evaluation_of_Aimhigher.doc

In an influential review of Widening Participation (WP) research written for the HEFCE and published in July 2006, Gorard et al (2006) have harshly criticised the evaluation of WP initiatives. In their view, to date no convincing evidence of impact has been produced on pre-entry interventions for school pupils and partnership-based interventions, such as Aimhigher.

Gorard et al’s criticisms were addressed by the HEFCE in another review of WP research published later in the same year, in November 2006, and based on a survey of the evidence collected by the HEIs. It reasserted the value [of Aimhigher and other WP initiatives] as a monitoring and evaluating device and emphasised that, to date, attitudes of learners and teachers have been consistently and overwhelmingly positive. HEFCE feels satisfied that convincing and precise evidence has been produced on attainment by the national evaluation carried out by the National Foundation for Educational Research (NFER), and, to a lesser extent, on HE participation by the NFER and the HEIs. For example, it has been found that participating in Aimhigher activities was associated with ‘[a]n average improvement of 2.5 points in GCSE total point scores’ and a ‘3.9 percentage point increase in Year 11 pupils intending to progress to HE’ (HEFCE 2006: 23). Moreover, ‘[i]f the ‘evidence bar’ is set too high’, the HEFCE (2006: 6-7) pointed out, ‘we run the risk of discouraging any attempt to estimate the effectiveness of the interventions’. There seems no scope for setting up a social science experiment in which the experiences of a WP group are compared with a control group.



Session 1: Part 2: Trial design (45 mins.)

Professor David Torgerson
Director, York Trials Unit, University of York
[email protected]

Professor Carole Torgerson
School of Education, Durham
[email protected]



2008 Palgrave Macmillan

Key design issues

• Independent concealed randomisation
• Type of randomisation
• Types of trials
• Sample size
• Regression discontinuity design

Independent concealed randomisation

• One of the most important issues is the need to undertake independent allocation.

• Many methodological studies have shown that unless someone who is disinterested in the trial results undertakes the randomisation there is a serious risk of bias.

• In health trials, this is the source of bias with the most supporting evidence.

Subversion of a health RCT

Clinician   p-value      Experimental   Control
All         p < 0.01     59             63
1           p = 0.84     62             61
2           p = 0.60     43             52
3           p < 0.01     57             72
4           p < 0.001    33             69
5           p = 0.03     47             72
Others      p = 0.99     64             59

[Figure: density against logit (p-value); groups: Adequate, Inadequate, Unclear. Hewitt et al. BMJ 2005.]

Type of randomisation

• Simple or restricted?
• Simple, similar to tossing a coin
» Advantages: difficult to go wrong; with large samples (n > 100) and combined with ANCOVA is efficient
» Disadvantages: for small samples can produce imbalance and inefficiency in analysis.
• Restricted, ensures better balance
» Advantages: gets better balance and more efficient for small samples
» Disadvantages: more complicated; can go wrong
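The small-sample disadvantage of simple randomisation can be put into numbers. The sketch below is illustrative only: it assumes a fair coin toss per participant, and the definition of a marked imbalance (one arm holding at least 60% of the sample) is an arbitrary choice made for this example. The function name is hypothetical.

```python
from math import comb


def prob_imbalance(n: int, threshold_pct: int = 60) -> float:
    """Exact probability that coin-toss randomisation of n units leaves
    one arm holding at least threshold_pct percent of the sample."""
    favourable = 0
    for k in range(n + 1):  # k = number allocated to the intervention arm
        larger_arm = max(k, n - k)
        # Integer comparison avoids floating-point edge cases.
        if larger_arm * 100 >= threshold_pct * n:
            favourable += comb(n, k)
    return favourable / 2 ** n


if __name__ == "__main__":
    for n in (20, 100, 400):
        print(n, prob_imbalance(n))
```

Under these assumptions the chance of a 60:40 split or worse falls sharply as the sample grows, which is the reason restriction matters most in small trials.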

Restricted allocation

• Minimisation
» Not strictly randomisation; uses algorithm to ensure balance in covariates
• Stratified
» Using blocks of repeating allocations produces balance on 1 or 2 variables
• Matched pairs
» Matches units (e.g., schools) and allocates one to each group; can reduce power in some cases and has other disadvantages
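As an illustration of the stratified approach, the following is a minimal sketch of blocked randomisation within strata. It is not the procedure used in any EEF trial: the function name, the block size of 4 and the 1:1 allocation ratio are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_block_allocation(units, stratum_of, block_size=4, seed=None):
    """Allocate units 1:1 to 'intervention' or 'control' within each stratum,
    using randomly permuted blocks that contain equal numbers of each arm."""
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for 1:1 allocation")
    rng = random.Random(seed)

    # Group the units by stratum (e.g., school size band).
    by_stratum = defaultdict(list)
    for unit in units:
        by_stratum[stratum_of(unit)].append(unit)

    allocation = {}
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        for start in range(0, len(members), block_size):
            # Each block holds half intervention and half control, shuffled.
            block = ["intervention", "control"] * (block_size // 2)
            rng.shuffle(block)
            for unit, arm in zip(members[start:start + block_size], block):
                allocation[unit] = arm
    return allocation


if __name__ == "__main__":
    names = [f"school_{i:02d}" for i in range(16)]
    size_band = {n: ("large" if i % 2 else "small") for i, n in enumerate(names)}
    print(stratified_block_allocation(names, size_band.get, seed=1))
```

Balance is exact only when each stratum fills whole blocks; a final incomplete block can leave a stratum unbalanced by up to half a block.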

Discussion (5 mins.)

• Discuss how randomisation was undertaken in your EEF trial(s) and note whether this was independent and concealed, and whether it was restricted. If so, what method was used?

Types of trial

• Individual randomisation
» Most powerful design for given sample size
• Cluster design
» Randomises groups of individuals (classes; schools; periods of time; geographical areas)
• Stepped wedge
» Type of cluster design; randomises order of implementation so all schools eventually receive intervention

Individual allocation

• Appropriate when it is possible to separate intervention and control conditions

• DISCOVER summer school evaluation using individual randomisation as control children cannot gain access to intervention

• Many educational interventions are delivered at class or school level – so can’t use individual allocation

Variations on a theme

• Factorial designs
» Two trials for the price of one
• Unequal allocation
» When the sample size is fixed equal allocation best; when costs are fixed unequal best – DISCOVER using unequal allocation for intervention to ensure efficient use of summer school resources.

Individual RCT: key points

• Trial registration
• Pre-test BEFORE randomisation
• Independent allocation
• Spill over/contamination must not exceed 30% or cluster allocation more efficient
• Post-testing done blindly or in exam conditions, marking done blindly
• Primary outcome specified before analysis
• Statistical analysis plan written and approved before data are examined

Cluster allocation

• More complex to design than individual RCT

• Many educational interventions need to use cluster allocation

• Cluster allocation usually avoids contamination and can make intervention delivery logistically easier

Cluster allocation: additional key points

• Small number of clusters – so usually need to use restricted randomisation

• Need to recruit participants and pre-test BEFORE cluster allocation

• Teachers must be linked to class BEFORE randomisation

• Analysis and sample size need to take clustering into account

• Better to have large numbers of clusters with small numbers per cluster than few clusters with large numbers
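The last two points can be illustrated with the standard design effect for equal-sized clusters, 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intra-cluster correlation. This formula is not stated on the slides; it is brought in here as an assumption, and the ICC of 0.15 and the cluster counts below are made-up values for illustration.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_clusters: int, cluster_size: int, icc: float) -> float:
    """Number of independently randomised pupils that would carry the same
    information as the clustered sample."""
    total = n_clusters * cluster_size
    return total / design_effect(cluster_size, icc)


if __name__ == "__main__":
    icc = 0.15  # hypothetical value, chosen only for illustration
    # Same total of 1,200 pupils, arranged in two different ways.
    print("60 clusters of 20: ", effective_sample_size(60, 20, icc))
    print("12 clusters of 100:", effective_sample_size(12, 100, icc))
```

With the same total number of pupils, the arrangement with more and smaller clusters gives the larger effective sample size under these assumptions.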

Variations on a theme

• What level of randomisation?
» Pupil > class > year > school
• Balanced design
» An efficient design is a balanced approach – Year 7 gets intervention in half schools and Year 8 gets intervention in other schools with each school’s adjacent year acting as control
» Or Year 7 in intervention schools get literacy intervention and Year 7s in control get maths
• Split plot
» Cluster level allocation followed by individual randomisation. A form of factorial. Exeter evaluation using partial split plot

Stepped wedge

• A form of cluster design, which may be more efficient than standard cluster design

• If we have 12 schools, all are pre-tested; 4 are randomised to receive the intervention for the first 6 months and all are tested; another 4 are then given the intervention and all are tested; the final 4 are given the intervention and all are tested

• Requires testing at every point
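The 12-school example can be written out as a schedule. The sketch below is one possible reading of the slide, assuming three equal waves of four schools and a test point after each wave; the function names and school labels are hypothetical.

```python
import random


def stepped_wedge_schedule(schools, n_steps=3, seed=None):
    """Randomise the order in which schools start the intervention.
    Returns {school: step at which it starts}, with steps numbered from 1."""
    if len(schools) % n_steps != 0:
        raise ValueError("number of schools must divide evenly into steps")
    rng = random.Random(seed)
    order = list(schools)
    rng.shuffle(order)
    per_step = len(order) // n_steps
    return {school: i // per_step + 1 for i, school in enumerate(order)}


def exposure_matrix(schedule, n_steps=3):
    """For each school, a list over test points 0..n_steps (0 = pre-test);
    1 means the school has received the intervention by that test point."""
    return {
        school: [1 if t >= start else 0 for t in range(n_steps + 1)]
        for school, start in schedule.items()
    }


if __name__ == "__main__":
    schools = [f"school_{i:02d}" for i in range(1, 13)]
    schedule = stepped_wedge_schedule(schools, seed=2013)
    for school, row in sorted(exposure_matrix(schedule).items()):
        print(school, row)
```

Every school is tested at every point, and the schools still waiting for the intervention act as controls for those that have already started.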

Discussion (5 mins.)

• Discuss the trial designs that have been used and the challenges associated with them.

Sample size calculation

• Most interventions will not work very well.
» Effect sizes of 0.20 to 0.30 – likely
» Effect sizes 0.30 to 0.50 – unusual
» Effect sizes >0.50 – very unlikely
• Need large sample sizes to detect modest differences. Example: 512 for 0.25; 800 for 0.20 (not clustered design)
• Powerful covariate can reduce this
» 0.70 correlation reduces sample size by 50%

How to do it?

• Free programs online
» PSPower; Optimal Design Software
• In your head (back of envelope) using approximation formula (i.e., 32/Effect Size squared)
• Fixed sample size
» Still good practice to estimate likelihood of difference.
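The back-of-envelope formula reproduces the figures quoted in the sample size slide. The sketch below assumes the formula gives the total sample across two equal arms of an individually randomised trial, a rule of thumb usually associated with roughly 80% power at a two-sided 5% significance level. The covariate adjustment uses the factor 1 − r², which is a standard approximation rather than something stated on the slides; it is consistent with the quoted 50% reduction for a correlation of 0.70. Function names are hypothetical.

```python
def approx_total_sample_size(effect_size: float) -> float:
    """Back-of-envelope total sample size: 32 / effect_size squared."""
    return 32.0 / effect_size ** 2


def adjusted_for_covariate(n: float, correlation: float) -> float:
    """Approximate sample size after adjusting for a pre-test covariate that
    has the given correlation with the outcome (assumed factor: 1 - r**2)."""
    return n * (1.0 - correlation ** 2)


if __name__ == "__main__":
    for es in (0.20, 0.25, 0.30, 0.50):
        n = approx_total_sample_size(es)
        print(es, round(n), round(adjusted_for_covariate(n, 0.70)))
```

These are approximations for a non-clustered design; a cluster trial needs the result inflated to allow for clustering.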

Pilot trials – sample size

• Modelling study suggests that a study with 10% of the main study’s sample will produce a 1 sided 80% confidence interval that will include the ‘true’ estimate if it exists

Cocks K, Torgerson DJ. Sample size calculations for pilot randomised trials: a confidence interval approach. Journal of Clinical Epidemiology 2013;66:197-201

Discussion

• Discuss how sample size calculations were undertaken and whether sample sizes are large enough to detect modest differences between groups.

Regression discontinuity

• Theoretically, the most robust non-randomised approach is the RD design

• Rediscovered several times since Thistlethwaite and Campbell first described it in the 1960s

What is it?

• Regression discontinuity, sometimes known as risk-based cut-off design, selects people into a group on the basis of a measurable continuous variable

• For example, age, test scores, waiting list, income

How does it work?

• Selecting on a pre-test variable, we then correlate post-test outcomes with the pre-test variable and test to see if there is an interruption, break or discontinuity in the regression line

[Figure panels: Effective treatment; Ineffective treatment]
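The idea of testing for a break in the regression line can be sketched with simulated data. This is a simplified illustration, not a full RD analysis: it assumes a sharp cut-off with pupils below it receiving the intervention, a linear relationship between pre-test and post-test, and made-up parameter values. A separate line is fitted on each side of the cut-off and the gap between the two lines at the cut-off is taken as the estimated effect. All function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rd_estimate(pre, post, cutoff):
    """Estimated jump in the post-test at the cut-off: the treated-side line
    minus the untreated-side line, both evaluated at the cut-off.
    Pupils scoring below the cut-off are assumed to be treated."""
    below = [(x, y) for x, y in zip(pre, post) if x < cutoff]
    above = [(x, y) for x, y in zip(pre, post) if x >= cutoff]
    a_lo, b_lo = fit_line(*zip(*below))
    a_hi, b_hi = fit_line(*zip(*above))
    return (a_lo + b_lo * cutoff) - (a_hi + b_hi * cutoff)


def simulate(n, cutoff, effect, noise_sd, seed=0):
    """Simulated pre-test and post-test scores with a known treatment effect."""
    rng = random.Random(seed)
    pre = [rng.uniform(0, 100) for _ in range(n)]
    post = [
        10 + 0.8 * x + (effect if x < cutoff else 0.0) + rng.gauss(0, noise_sd)
        for x in pre
    ]
    return pre, post


if __name__ == "__main__":
    for true_effect in (5.0, 0.0):  # effective vs. ineffective treatment
        pre, post = simulate(2000, cutoff=50, effect=true_effect, noise_sd=1.0)
        print(true_effect, rd_estimate(pre, post, cutoff=50))
```

An effective treatment shows up as a gap between the two fitted lines at the cut-off; an ineffective one leaves the line continuous, so the estimate stays near zero.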

Do summer schools work?

• Some states in the USA mandate summer schools for children who fall below a certain score in a high stakes test

• But will sending children off to have extra tuition during their summer break be effective?

• Because the children are chosen on the basis of a cut point on a quantitative scale, this is ideal RD territory

Jacob and Lefgren, Rev of Economics and Statistics, 2004,86:226-44.

Proportion treated by test scores

Treatment against outcomes

Evaluation of SHINE on secondaries

• Randomised controlled trial design not possible

• Regression discontinuity design with ‘tie-breaker randomisation’

• Advantages of this design
• Challenges of this design
