The Impact of Selection of Student Achievement Measurement Instrument on Teacher Value-added Measures James L. Woodworth, CREDO Hoover Institute, Stanford Wen-Juo Lo, University of Arkansas Joshua B. McGee, Laura and John Arnold Foundation Nathan C. Jensen, Northwest Evaluation Association


Page 1: James L. Woodworth, CREDO Hoover Institute, Stanford Wen-Juo Lo, University of Arkansas

The Impact of Selection of Student Achievement Measurement Instrument

on Teacher Value-added Measures

James L. Woodworth, CREDO Hoover Institute, Stanford

Wen-Juo Lo, University of Arkansas

Joshua B. McGee, Laura and John Arnold Foundation

Nathan C. Jensen, Northwest Evaluation Association

Page 2

Presentation Outline

1. Purpose

2. Statistical Noise

a. Why it matters

b. Sources

3. Data

4. Methods

5. Results

Page 3

Purpose

The purpose of this paper is to show a lay audience, without statistical training, the extent to which the psychometric properties of student test instruments affect teacher value-added measures.

1.Purpose 2.Statistical Noise 3.Data 4.Methods 5.Results

Page 4

Question

What is the impact of statistical noise introduced by different test characteristics on the stability and accuracy of value-added models?

Page 5

Why it matters

[Figure: 5th- and 6th-grade score scales marked with the performance bands Below Basic, Basic, Proficient, and Advanced.]

Page 6

Primary Sources of Statistical Noise

1. Test Design

2. Vertical Alignment

3. Student Sample Size

Page 7

Test Design

Proficiency Tests

• Focused around proficiency point

• Designed to differentiate between proficient and not proficient

• Larger variance in Conditional Standard Errors (CSE)

Growth Tests

• Questions measure across entire ability spectrum

• Designed to differentiate between all points on the distribution

• Smaller variance in CSE

Page 8

Test Design

Paper and Pencil Tests

• Limit item pool to control length

• Focused around proficiency point

• Large variance in CSE

Computer Adaptive Test

• Larger item pool for question selection

• Focused around student ability point

• Smaller variance in CSE

Page 9

Test Design

CSE Heteroskedasticity Due to Item Focusing: TAKS Reading Grade 5, 2009

CSE Range: 24 - 74

Weighted average CSE = 38.96

Page 10

Vertical Alignment

• Year-to-year alignment can impact the results of VAM

– Units must be equal across test sessions

• Spring-Spring VAMs are most affected

• Fall-Spring VAMs using the same test avoid much of the problem

• Item alignment on computer adaptive tests can impact the results of VAM

Page 11

Student Sample Size

• Central Limit Theorem

– Larger student n provides a more stable estimate of teacher VAM.

– Typical single-year student n’s are 25, 50, and 100 for elementary and middle school teachers.
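The stabilizing effect of a larger n can be sketched numerically. The snippet below is a minimal illustration, not the authors' code. It assumes independent student-level errors with a common standard deviation (the value 1.0 is arbitrary) and prints the standard error of a class mean, sd / sqrt(n), for the three class sizes named above. The function name `se_of_mean` is hypothetical.

```python
import math


def se_of_mean(sd: float, n: int) -> float:
    """Standard error of the mean of n independent scores sharing one SD."""
    return sd / math.sqrt(n)


# Assumption: per-student noise SD of 1.0 in arbitrary units, for illustration only.
for n in (25, 50, 100):
    print(f"n={n:<4d} standard error of class mean = {se_of_mean(1.0, n):.3f}")
```

Because the standard error falls with the square root of n, halving the noise in a class average takes four times as many students, so a class of 25 is twice as noisy as a class of 100 under these assumptions.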

Page 12

Question

What is the impact of statistical noise introduced by different test characteristics on the stability and accuracy of value-added models?

Page 13

Data Sets

TAKS – Texas Assessment of Knowledge and Skills: Grade 5 Reading, 2009 Population Statistics

– Proficiency test

– Vertically aligned scale scores

– Average yearly gain

• 24 vertical scale points at “Met Expectations”

• 34 vertical scale points at “Commended”

– Standard Errors – Conditional Standard Errors reported by TEA for each vertical scale score

• CSE Range: 24 - 74

• Weighted average CSE = 38.96

– Highly skewed distribution

– High variance

Page 14

Data Sets

TAKS – Texas Assessment of Knowledge and Skills: Grade 5 Reading

N: 323,507

μ: 701.49

σ²: 10048.30

σ: 100.24

Page 16

Data Sets

MAP – Measures of Academic Progress

– Growth measure

– Computer Adaptive Test

– Single scale

– Average yearly gain

• 5.06 RIT points

– Standard Errors – average standard errors range 2.5 - 3.5 RIT

– Slightly skewed distribution

– Small variance

Page 17

Data Sets

MAP – Measures of Academic Progress

N: 2,663,382

μ: 208.35

σ²: 161.82

σ: 12.72

Page 18

Simulated Data

As it is impossible to isolate true scores and error with real data, we created simulated data points.

– True scores are known for all data points

– Every data point was given the same growth

• All iterations have the same value-added

• Any deviation from expected is a function of measurement error only

Page 19

Simulated Data

We simulated 10,000 z-scores ~ N (0,1)

From this we selected nested, random samples of n=100, n=50, n=25.

Statistical Summary, z-Score Samples by n

Statistic         n=100    n=50     n=25
Mean              -0.13    -0.09    0.01
Std. Deviation    0.97     0.97     1.00
Skewness          -0.12    0.18     0.10
Minimum           -2.34    -1.85    -1.77
Maximum           2.09     2.09     2.09
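The sampling step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: "nested" is read as each smaller sample being a random subset of the next larger one (consistent with the shared maximum of 2.09 in the summary), and the seed is arbitrary, so the printed statistics will not reproduce the table values.

```python
import random
import statistics

rng = random.Random(2009)  # arbitrary seed, chosen only for repeatability

# 10,000 simulated z-scores drawn from N(0, 1).
z_scores = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Nested samples (assumed reading): n=50 is drawn from the n=100 sample,
# and n=25 is drawn from the n=50 sample.
sample_100 = rng.sample(z_scores, 100)
sample_50 = rng.sample(sample_100, 50)
sample_25 = rng.sample(sample_50, 25)

for sample in (sample_100, sample_50, sample_25):
    print(
        f"n={len(sample):<4d}"
        f" mean={statistics.mean(sample):6.2f}"
        f" sd={statistics.stdev(sample):5.2f}"
        f" min={min(sample):6.2f}"
        f" max={max(sample):5.2f}"
    )
```

Nesting keeps the three class sizes comparable: every student in the n=25 class also appears in the n=50 and n=100 classes, so differences between them come from sample size rather than from a different draw of students.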

Page 20

Data Generation

Pre-scores = P1 = (z-score • σ) + x̄

Post-scores = P2 = P1 + controlled growth

Controlled Growth Values:

TAKS = 24 vertical scale points (TAKS at “Commended” = 34)

MAP = 5.06 RIT points

Simulated Growth = (P2 + (Random2 • CSE)) - (P1 + (Random1 • CSE))

Random1 and Random2 ~ N (0,1)

CSE = Conditional Standard Errors as reported by TEA and NWEA
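The data-generation formulas can be written out as a short sketch. It is not the authors' code and rests on two assumptions: the pre-score formula is read as the z-score times σ plus the mean of the test's score distribution, and a single constant CSE (the TAKS weighted average of 38.96) stands in for the score-specific conditional standard errors reported by TEA. The function and constant names are hypothetical.

```python
import random

# TAKS Grade 5 Reading population values quoted in the data slides.
TAKS_MEAN, TAKS_SD = 701.49, 100.24
TAKS_GROWTH = 24.0    # controlled growth in vertical scale points
TAKS_AVG_CSE = 38.96  # assumption: one constant CSE for every score


def simulated_growth(z, rng, mean=TAKS_MEAN, sd=TAKS_SD,
                     growth=TAKS_GROWTH, cse=TAKS_AVG_CSE):
    """Observed gain for one student whose true gain equals `growth`."""
    p1 = z * sd + mean                              # true pre-score
    p2 = p1 + growth                                # true post-score
    observed_pre = p1 + rng.gauss(0.0, 1.0) * cse   # Random1 * CSE
    observed_post = p2 + rng.gauss(0.0, 1.0) * cse  # Random2 * CSE
    return observed_post - observed_pre


rng = random.Random(42)  # arbitrary seed
print(simulated_growth(0.5, rng))
```

With a constant CSE the true scores cancel, and the observed gain is the controlled growth plus the difference of two independent error draws. The student's position in the distribution matters only when the CSE varies with the score, as it does in the reported TAKS data.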

Page 21

Question

What is the impact of statistical noise introduced by different test characteristics on the stability and accuracy of value-added models?

Page 22

Monte Carlo Simulation

We ran 1,000 iterations of each simulation, which is equivalent to the same students taking the test 1,000 times with the same true scores but different measurement error each time.

Simulated Growth = (P2 + (Random2 • CSE)) - (P1 + (Random1 • CSE))

Random1 and Random2 ~ N (0,1)

CSE = Conditional Standard Errors as reported by TEA and NWEA

We aggregated values by subgroup to determine average performance for each iteration.

False Negative: Simulated Growth < 0.5 • Controlled Growth

False Positive: Simulated Growth > 1.5 • Controlled Growth
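The Monte Carlo procedure described above can be sketched as below. This is a minimal illustration, not the authors' implementation. The function name `run_monte_carlo`, the seeds, and the `cse` callback are hypothetical, and looking up the CSE at each true score is an assumption, since the slides do not say how the conditional errors were matched to scores. The usage lines substitute a constant CSE for the reported tables.

```python
import random
from statistics import mean


def run_monte_carlo(true_pre, growth, cse, iterations=1000, seed=0):
    """Return (% false negative, % false positive, % correct).

    true_pre : true pre-scores (P1), one per student in the class
    growth   : controlled growth added to every student
    cse      : function mapping a true score to its standard error (assumed)
    """
    rng = random.Random(seed)
    false_neg = false_pos = 0
    for _ in range(iterations):
        gains = []
        for p1 in true_pre:
            p2 = p1 + growth
            observed_pre = p1 + rng.gauss(0.0, 1.0) * cse(p1)
            observed_post = p2 + rng.gauss(0.0, 1.0) * cse(p2)
            gains.append(observed_post - observed_pre)
        class_gain = mean(gains)  # average performance for this iteration
        if class_gain < 0.5 * growth:
            false_neg += 1
        elif class_gain > 1.5 * growth:
            false_pos += 1
    correct = iterations - false_neg - false_pos
    return (100.0 * false_neg / iterations,
            100.0 * false_pos / iterations,
            100.0 * correct / iterations)


# Usage: a class of 25 on a TAKS-like scale with a constant CSE of 38.96.
rng = random.Random(1)
pre = [rng.gauss(0.0, 1.0) * 100.24 + 701.49 for _ in range(25)]
print(run_monte_carlo(pre, 24.0, lambda score: 38.96))
```

Every iteration reuses the same true scores and the same true growth, so any class average that falls outside the 0.5 to 1.5 band is misclassified by measurement error alone.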

Page 23

Monte Carlo Results, n=100

ID                                              % False Negative   % False Positive   % Total Correct
TAKS Actual Distribution                        1.7                2.5                95.8
TAKS Normal Distribution at “Meets” Level       0.9                1.8                97.3
TAKS Normal Distribution Avg SE                 1.2                1.8                97.0
TAKS Normal Distribution at “Commended” Level   0.8                0.2                99.0
TAKS Normal Grade Transition                    1.4                2.1                96.5
MAP Normal                                      0.0                0.0                100.0
MAP Max CSE                                     0.0                0.0                100.0

Page 24

Monte Carlo Results, n=50

ID                                              % False Negative   % False Positive   % Total Correct
TAKS Actual Distribution                        7.4                9.6                83.0
TAKS Normal Distribution at “Meets” Level       6.6                8.4                85.0
TAKS Normal Distribution Avg SE                 5.7                7.4                86.9
TAKS Normal Distribution at “Commended” Level   4.4                1.7                93.9
TAKS Normal Grade Transition                    6.5                8.1                85.4
MAP Normal                                      0.0                0.0                100.0
MAP Max CSE                                     0.7                0.6                98.7

Page 25

Monte Carlo Results, n=25

ID                                              % False Negative   % False Positive   % Total Correct
TAKS Actual Distribution                        16.1               18.4               65.5
TAKS Normal Distribution at “Meets” Level       16.8               18.0               65.2
TAKS Normal Distribution Avg SE                 14.5               16.0               69.5
TAKS Normal Distribution at “Commended” Level   10.2               7.7                82.1
TAKS Normal Grade Transition                    18.6               18.2               63.2
MAP Normal                                      0.5                0.5                99.0
MAP Max CSE                                     3.0                4.2                92.8

Page 26

Results

Descriptive Statistics, VAM, by Student Sample Size (Avg = Average Simulated Growth)

                                                Controlled  n=100             n=50              n=25
Test                                            Growth      Avg      SD       Avg      SD       Avg      SD
TAKS Actual Distribution                        24          24.29    6.02     24.26    8.78     24.18    12.28
TAKS Normal Distribution at “Meets”             24          24.08    5.45     24.45    8.37     24.14    12.39
TAKS Normal Distribution Avg SE                 24          24.19    5.45     24.61    8.03     24.59    11.47
TAKS Normal Distribution at “Commended”         34          33.85    5.60     34.15    8.12     34.92    11.87
TAKS Normal Grade Transition                    24          24.08    5.59     24.24    8.59     24.15    12.85
MAP Normal                                      5.06        5.07     0.49     5.12     0.72     5.12     1.03
MAP Max CSE                                     5.06        5.05     0.71     5.05     0.99     5.08     1.37

Percent Misidentified, by Student Sample Size

Test                                            n=100     n=50      n=25
TAKS Normal Distribution at “Meets”             2.7       15.0      34.8
MAP Normal                                      0.0       0.0       1.0

Page 27

Conclusions

The Growth/Error ratio is the critical variable in VAM stability.

Necessary student n to achieve a stable VAM is sensitive to the Growth/Error ratio.

Stable VAMs are possible even with typical classroom n’s; however, careful attention must be paid to the suitability of the student assessment instrument.
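The role of the Growth/Error ratio can be checked with a back-of-the-envelope calculation. The sketch below is an approximation added for illustration, not part of the authors' analysis. It assumes a constant CSE, independent and normally distributed errors on the pre- and post-test, and the 0.5 and 1.5 cut-offs used in the Monte Carlo slides. Under those assumptions the standard deviation of a class's average simulated growth is CSE • sqrt(2 / n), and the chance of misidentification depends only on the Growth/Error ratio and n. The function names are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_misidentification(growth: float, cse: float, n: int):
    """Return (SD of class-average growth, Growth/Error ratio, % misidentified)."""
    sd_of_mean = cse * math.sqrt(2.0 / n)  # two independent errors per student
    ratio = growth / cse
    one_tail = normal_cdf(-0.5 * growth / sd_of_mean)  # false negative share
    return sd_of_mean, ratio, 200.0 * one_tail         # both tails, in percent


cases = (
    ("TAKS, weighted average CSE", 24.0, 38.96),
    ("MAP, CSE 3.5 (top of reported range)", 5.06, 3.5),
)
for label, growth, cse in cases:
    for n in (100, 50, 25):
        sd, ratio, pct = expected_misidentification(growth, cse, n)
        print(f"{label:38s} n={n:<4d} ratio={ratio:.2f} "
              f"SD={sd:.2f} misidentified={pct:.1f}%")
```

For TAKS at the average CSE this approximation gives standard deviations of about 5.5, 7.8, and 11.0 for n=100, 50, and 25, close to the 5.45, 8.03, and 11.47 reported for "TAKS Normal Distribution Avg SE". The MAP values of about 0.49, 0.70, and 0.99 are close to the 0.49, 0.72, and 1.03 reported for "MAP Normal". The Growth/Error ratio is roughly 0.6 for TAKS and 1.4 for MAP, which is consistent with MAP remaining stable even at n=25.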

Page 28

Limitations

No Differentiation between Student Effects, Teacher Effects, or School Effects

No Environmental Effects

No Interaction Terms

These are all areas for additional research.