One Way ANOVA
ANOVA stands for Analysis of Variance
ANOVA allows us to compare the means from more than two sets of scores.
A significant ANOVA indicates that changes in the independent variable affect the dependent variable.
ANOVA does not indicate which pairs of conditions are significantly different.
Use planned contrasts or unplanned (“post hoc”) contrasts to assess whether pairs of conditions are significantly different.
ANOVA Assumptions
1. Normally distributed populations
2. Equal population variances
3. Random sampling used
4. Dependent variable uses an interval or ratio scale
Digression on Scales: Levels of Measurement
Interval: 0 doesn't mean "none" (e.g., IQ score); distances between points on the scale are equal, but ratios aren't meaningful (e.g., temperature)
Ratio: Same as an interval scale, but 0 means "none" and ratios are meaningful (e.g., weight or age: a person who is 50 is twice as old as one who is 25).
Nominal: Numbers are just labels for attributes (e.g., color)
Ordinal: categories have a logical order (e.g., ranks)
Digression on Scales: Types of Data
Continuous: Numerical data that can be fractional (e.g., temperature)
Discrete: Numerical data that cannot be fractional (e.g., number of World Cup trophies)
One-Way ANOVA: National Airlines
National Airlines recently introduced a daily early-morning nonstop flight between Houston and Chicago. The vice president of marketing for National Airlines decided to perform a statistical test to see whether National's average passenger load on this new flight is different from that of each of its two major competitors (which we will call competitor 1 and competitor 2). Ten early-morning flights were selected at random from each of the three airlines, and the percentage of unfilled seats on each flight was recorded. These data are stored in an Excel file on the website at "National Airlines (Excel)."
Is there evidence that National’s average passenger load on the new flight is different from that of its two competitors? Report a p value and interpret the results of the statistical test.
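Before the SPSS walkthrough below, here is a minimal sketch of the same one-way ANOVA in Python (scipy) for anyone who wants to verify results outside SPSS. The three lists are hypothetical placeholders; substitute the ten recorded unfilled-seat percentages for each airline.

```python
# Minimal one-way ANOVA sketch with scipy; the data below are hypothetical
# placeholders for the ten unfilled-seat percentages recorded per airline.
from scipy import stats

national = [9, 10, 8, 12, 7, 11, 13, 9, 10, 9]          # hypothetical
competitor1 = [11, 12, 10, 13, 9, 11, 12, 13, 11, 11]   # hypothetical
competitor2 = [13, 12, 14, 11, 12, 13, 15, 12, 13, 11]  # hypothetical

f_stat, p_value = stats.f_oneway(national, competitor1, competitor2)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```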
Raw Data (in Excel)
Raw Data (in SPSS)
Transform Data into Analysis-Ready Form
Analyze → Compare Means → One-Way ANOVA
Post Hoc Contrasts
Results

Descriptives: Unfilled

               N     Mean    Std. Deviation   Std. Error   95% CI for Mean (Lower, Upper)   Minimum   Maximum
National       10     9.80   2.044            .646         8.34, 11.26                      7         13
Competitor 1   10    11.30   2.003            .633         9.87, 12.73                      7         13
Competitor 2   10    12.60   2.011            .636         11.16, 14.04                     9         15
Total          30    11.23   2.269            .414         10.39, 12.08                     7         15

ANOVA: Unfilled

                 Sum of Squares   df   Mean Square   F       Sig.
Between Groups   39.267            2   19.633        4.815   .016
Within Groups    110.100          27   4.078
Total            149.367          29
Post Hoc Tests

Multiple Comparisons (Bonferroni)
Dependent Variable: Unfilled

(I) Airline    (J) Airline    Mean Difference (I-J)   Std. Error   Sig.   95% CI (Lower, Upper)
National       Competitor 1   -1.500                  .903         .325   -3.81, .81
National       Competitor 2   -2.800*                 .903         .013   -5.11, -.49
Competitor 1   National        1.500                  .903         .325   -.81, 3.81
Competitor 1   Competitor 2   -1.300                  .903         .484   -3.61, 1.01
Competitor 2   National        2.800*                 .903         .013   .49, 5.11
Competitor 2   Competitor 1    1.300                  .903         .484   -1.01, 3.61

* The mean difference is significant at the .05 level.
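For readers working outside SPSS, here is a rough sketch of Bonferroni-corrected pairwise comparisons in Python (scipy). Note that SPSS's Bonferroni procedure uses the pooled within-groups error term from the ANOVA, so simple two-sample t tests like these will give slightly different numbers; the data are hypothetical placeholders.

```python
# Bonferroni-corrected pairwise comparisons: run each two-sample t test and
# multiply its p value by the number of comparisons (capped at 1).
from itertools import combinations
from scipy import stats

groups = {                                                    # hypothetical data
    "National": [9, 10, 8, 12, 7, 11, 13, 9, 10, 9],
    "Competitor 1": [11, 12, 10, 13, 9, 11, 12, 13, 11, 11],
    "Competitor 2": [13, 12, 14, 11, 12, 13, 15, 12, 13, 11],
}
n_comparisons = 3
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    t, p = stats.ttest_ind(a, b)
    p_bonf = min(p * n_comparisons, 1.0)
    print(f"{name_a} vs. {name_b}: Bonferroni p = {p_bonf:.3f}")
```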
The χ² Test
For data in which each outcome is assigned to one (and only one) of a set of mutually exclusive and exhaustive categories
Derived from square of Z statistic
Assumes: Independence of observations (can't take several observations from 1 person and analyze them with χ²)
Compares the observed and expected values
Needs at least 5 expected observations per cell
Values range from 0 on up (no negative values)
One-Way χ²
Are There Differences Among the Levels of 1 Variable?
H0: P1 = P2 = P3 = P4 = 1/4
(where P1 + P2 + P3 + P4 = 1)
Ha: At least one population proportion ≠ 1/4
χ² = Σ (O - E)² / E = (O1 - E1)² / E1 + . . . + (Om - Em)² / Em

where Oi and Ei are the observed and expected # of occurrences for the m (exhaustive & mutually exclusive) outcomes.
Squared deviations mean that large disparities "count" for more than small disparities.
df = (# levels in the independent variable – 1)
100 Analysts Rate an IPO
             Strong Buy   Buy   Hold   Sell   Strong Sell
Observed     24           33    22     16     5

H0: PSB = PB = PH = PS = PSS (= 1/5)
Ha: At least one population proportion ≠ 1/5

Observed (Expected):
Strong Buy   Buy       Hold      Sell      Strong Sell
24 (20)      33 (20)   22 (20)   16 (20)   5 (20)

Critical value (df = 4, α = .05): χ²(4) = 9.49

χ² = Σ (O - E)² / E = 4²/20 + 13²/20 + 2²/20 + 4²/20 + 15²/20 = 21.50
Reject H0 because the test statistic (21.50) is greater than the critical value (9.49).
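The same goodness-of-fit test can be checked with a short scipy sketch; when the expected counts are omitted, scipy.stats.chisquare assumes equally likely categories, which is exactly H0 here.

```python
# Goodness-of-fit test with equal expected counts (20 per category).
from scipy import stats

observed = [24, 33, 22, 16, 5]
chi2, p = stats.chisquare(observed)   # expected defaults to 100/5 = 20 each
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")   # chi-square = 21.50
```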
100 Analysts Rate an IPO: Testing Unequal Categories
Test the null hypothesis that twice as many analysts will offer some form of buy recommendation (either Strong Buy or Buy) as will offer either a hold or some form of sell recommendation (Sell or Strong Sell).
Collapse analysts’ recommendations into 3 categories:
Buy   Hold   Sell
57    22     21

H0: PB = 2PH = 2PS
Ha: At least one of these equalities does not hold

Critical value (df = 2, α = .05): χ²(2) = 5.99

Observed (Expected):
Buy       Hold      Sell
57 (50)   22 (25)   21 (25)

χ² = Σ (O - E)² / E = 49/50 + 9/25 + 16/25 = 0.98 + 0.36 + 0.64 = 1.98
Do not reject H0 because the test statistic (1.98) is less than the critical value (5.99).
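A minimal scipy sketch of the same test, this time passing the unequal expected counts explicitly:

```python
# Goodness-of-fit test with unequal expected counts.
from scipy import stats

observed = [57, 22, 21]
expected = [50, 25, 25]   # H0: buy recommendations twice as likely as hold or sell
chi2, p = stats.chisquare(observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")   # chi-square = 1.98, p = .372
```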
100 Analysts Rate an IPO: Testing Unequal Categories in SPSS
Analyze → Nonparametric → Chi-Square
Set up the expected values for each category
Results

Analysts' Recommendations

        Observed N   Expected N   Residual
Buy     57           50.0          7.0
Hold    22           25.0         -3.0
Sell    21           25.0         -4.0
Total   100

Test Statistics: Analysts' Recommendations

Chi-Square(a)   1.980
df              2
Asymp. Sig.     .372

a. 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 25.0.
Two-Way χ²
Are There Differences Between 2 Variables?
H0: Variables A and B are independent
Ha: Variables A and B are dependent
df = (# rows - 1)(# columns - 1)
Tests nondirectional hypotheses only, using a single tail
SPSS: Two-Way χ² Tests
Vioxx data file:
Industry [Industry ties? 1=no, 2=yes]
Vioxx [Bring Vioxx back? 1=no, 2=yes]
Analyze → Descriptive Statistics → Crosstabs
Click on “Statistics”; select “chi-square”
Click OK. This is the χ² output:

Crosstabs

Case Processing Summary (Cases)

                                     Valid          Missing       Total
                                     N    Percent   N   Percent   N    Percent
Industry Ties? * Bring Vioxx Back?   32   100.0%    0   .0%       32   100.0%

Industry Ties? * Bring Vioxx Back? Crosstabulation (Count)

                        Bring Vioxx Back?
                        no    yes    Total
Industry Ties?   no     14     8      22
                 yes     1     9      10
Total                   15    17      32
19
Notice that the 2χ test below is significant (p = .005), but not entirely reliable because the expected cell count in one of the cells is less than 5. (You should be able to verify that the lower left cell in the Crosstabulation above is the one with the undercount.)
Chi-Square Tests

                               Value      df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square             7.942(b)    1   .005
Continuity Correction(a)       5.935       1   .015
Likelihood Ratio               8.893       1   .003
Fisher's Exact Test                                                    .007                   .006
Linear-by-Linear Association   7.694       1   .006
N of Valid Cases               32

a. Computed only for a 2x2 table
b. 1 cell (25.0%) has expected count less than 5. The minimum expected count is 4.69.
When one or more of your cells has an expected count less than 5, report Fisher's Exact Test (in the SPSS output). Fisher’s Exact Test has no test statistic, no critical value, and no confidence interval. Report it as follows: “p = .007, Fisher’s Exact Test, 2-tailed.”
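Here is a sketch of the same 2x2 analysis in Python (scipy), which reproduces both the uncorrected Pearson chi-square and Fisher's Exact Test:

```python
# Pearson chi-square and Fisher's Exact Test on the Industry Ties x Vioxx table.
from scipy import stats

table = [[14, 8],   # no industry ties:  no / yes on bringing Vioxx back
         [1, 9]]    # industry ties:     no / yes

chi2, p, df, expected = stats.chi2_contingency(table, correction=False)
print(f"Pearson chi-square = {chi2:.3f}, p = {p:.3f}")        # 7.942, p = .005

odds_ratio, p_exact = stats.fisher_exact(table)   # preferred when an expected count < 5
print(f"Fisher's Exact Test (2-tailed): p = {p_exact:.3f}")   # p = .007
```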
Correlation
How do the scores on one variable change with the scores on another variable?
Correlations are concerned with measuring the direction and magnitude of a linear relationship between two variables.
The stronger the correlation, the more accurately we can predict Y from knowing X.
Scatterplot: A graph containing clusters of dots that represent all X-Y pairs of observations.
Involves an examination of pairs of X-Y scores (one-sample procedure).
Correlation Coefficients
Measures extent to which individual Xi-Yi scores that make up a pair occupy the same or opposite positions within their distributions.
- Pos relation: Pairs tend to occupy similar relative positions in their distributions
- Neg relation: Pairs tend to occupy opposite relative positions in their distributions
Two types (there are others as well):
- Pearson r (continuous data): rxy
- Phi coefficient (binary variables: 2 x 2 tables): φ
Range from -1 to 1:
 1 = perfect pos relation
-1 = perfect neg relation
 0 = no relation
Failure to find a strong r may mean: (a) chance, (b) the variables are unrelated, or (c) the variables are related nonlinearly.
R Computation (by hand)
1. Transform each Y score into a Z score (Zy)
2. Transform each X score into a Z score (Zx)
3. Determine the correspondence between each of the paired Zs
- r indicates the average correspondence between the paired Zs.
r = Mean of the crossproduct of Z scores.
Population: r = Σ ZxZy / N          Sample: r = Σ ZxZy / (N - 1)
(note: Zs will differ for population & samples because the denominator for computing population Zs is σ and the denominator for computing sample Zs is s.)
When there is large positive correspondence: the Z crossproducts are positive & large
When there is small or negative correspondence: the Z crossproducts are negative or small (lots of + and - canceling each other out)
Strength of Relationship
r2 = Proportion of variability of Y accounted for by X
Strong Correlation (population computation)

Student     # High School A's (X)   # College A's (Y)   Zx      Zy
Alejandro   13                      14                   1.50    0.50
Bernardo     9                      18                   0.50    1.50
Carlos       7                      12                   0.00    0.00
Dominique    5                      10                  -0.50   -0.50
Enrique      1                       6                  -1.50   -1.50

r = Σ ZxZy / N = [(1.5)(.5) + (.5)(1.5) + (0)(0) + (-.5)(-.5) + (-1.5)(-1.5)] / 5 = 0.80
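The same correlation can be verified with a short scipy sketch; pearsonr returns r along with the two-tailed p value that SPSS reports.

```python
# Pearson correlation between high school and college A's (data from the slide).
from scipy import stats

high_school = [13, 9, 7, 5, 1]   # number of high school A's (X)
college = [14, 18, 12, 10, 6]    # number of college A's (Y)

r, p = stats.pearsonr(high_school, college)
print(f"r = {r:.3f}, p = {p:.3f}")   # r = 0.800, p = 0.104
```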
Strong Correlation: Using SPSS
Analyze → Correlate → Bivariate
Correlation Output

Correlations

                                     High School   College
High School   Pearson Correlation    1             .800
              Sig. (2-tailed)        .             .104
              N                      5             5
College       Pearson Correlation    .800          1
              Sig. (2-tailed)        .104          .
              N                      5             5
Two Points of Caution with Correlations

1. Restriction of range (i.e., truncated range) problem
When the relevant range of X or Y scores is a truncated part of the whole, the truncated X-Y correlation will be smaller than the whole-range X-Y correlation.
2. Correlation does not mean causation
- There may be a correlated 3rd variable
- Even if no 3rd variable is involved, it’s not always clear which variable is the cause and which is the effect.
Phi Coefficient Φ
Correlation for Categorical Data (2 X 2 Tables):
       a   b
       c   d

φ = (ad - bc) / √((a + b)(c + d)(a + c)(b + d))

Example:
       Yes   No
       50    20
       10     4

φ = 0   (since ad - bc = (50)(4) - (20)(10) = 0)
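A minimal sketch of the phi formula in Python, applied to the table above and to the 2x2 table used in the SPSS example that follows:

```python
# Phi coefficient computed directly from the four cell counts of a 2x2 table.
import math

def phi(a, b, c, d):
    """phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)) for cells [[a, b], [c, d]]."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(phi(50, 20, 10, 4))   # 0.0   (ad = bc, so no association)
print(phi(10, 5, 5, 8))     # 0.282 (matches the SPSS output below)
```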
Phi Coefficient Φ (using SPSS)
10    5
 5    8
Analyze → Descriptive Statistics → Crosstabs → Statistics
Click “Statistics” and check “Phi and Cramer’s V”
Symmetric Measures

                                  Value   Approx. Sig.
Nominal by Nominal   Phi          .282    .136
                     Cramer's V   .282    .136
N of Valid Cases                  28

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

(ignore "Cramer's V")
Regression
Regression: The primary purpose of regression is prediction
Predictions about the linear relationship between independent and dependent variables.
Independent = predictor = explanatory
Dependent = response = criterion
Types of Regression
1. Linear (least squares regression line)
- Simple regression: one predictor variable
- Multiple regression: multiple predictor variables
2. Nonlinear (can linearize many of these via transformation)
- Positive curvilinear (e.g., diminishing marginal utility)
- Polynomial (quadratic, parabola-shaped; cubic)
- Exponential or negative curvilinear (L-shaped)
3. Logistic (when the dependent variable is categorical)
- Example: graduate or not; sales are weak/moderate/strong
Lines: Y = b0 + b1X

b0 and b1 are regression coefficients
- can be positive or negative
- b1 is more important than b0

b0 = Y intercept (value of Y when X = 0)
b1 = Slope (how much Y changes when X changes by 1 unit)
Example #1: Suppose Aeromexico wants to examine the relation between number of flight delays and number of passenger complaints.
X = Number of flight delays
Y = Number of passenger complaints
Suppose that the data are as follows (X, Y): (0, 1), (1, 3), (2, 5)
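A quick sketch of fitting this line in Python with scipy.stats.linregress; with these three points the fit is exact, so it recovers intercept 1 and slope 2.

```python
# Fit the delays-complaints line; with three collinear points the fit is exact.
from scipy import stats

delays = [0, 1, 2]
complaints = [1, 3, 5]

fit = stats.linregress(delays, complaints)
print(f"complaints = {fit.intercept:.0f} + {fit.slope:.0f} * delays")
# complaints = 1 + 2 * delays
```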
Scatterplot: Delays vs. Complaints
[Scatterplot: delays (X) vs. complaints (Y)]
The line that fits these data perfectly is: Y = 1 + 2X
# Complaints = 1 + 2 (# Flight Delays)

X    Y
0    1 + 2(0) = 1    (0, 1)
1    1 + 2(1) = 3    (1, 3)
2    1 + 2(2) = 5    (2, 5)
But what if the scatterplot looked like this?
[Scatterplot: delays (X) vs. complaints (Y), with the points scattered rather than falling on a single line]
We’ll need to estimate a line of best fit using linear regression
The regression line (also called the "least squares regression line") minimizes the squared differences between the observed and predicted values of the response variable (as given by the regression line).
- The difference between the actual and predicted values is called the “residual.”
- Minimizing these squared residuals gives a slope that is as close as possible to the true slope.
[Scatterplot: delays (X) vs. complaints (Y) with a fitted regression line; R Sq Linear = 0.475]
(We’ll talk about what “R Sq Linear” means later on…)
Example #2: Suppose UT wants to examine the relation between alumni donations to the school and the number of football victories.
X = Number of football victories
Y = Amount of alumni donations the following year
Alumni Donations = $10,000,000 + $200,000 (# Football Victories)
Y = 10,000,000 + 200,000X
Caution #1: X-Y relation may not be causal
Caution #2: Regression line estimates most trustworthy near bulk of data (usually center).
Linear Regression Assumptions
1. Linearity
- Linear relationship between X and Y
- Same expected change in Y moving from X1 to X2 vs. moving from X2 to X3
Test: X-Y Scatterplot (look for nonlinearities)
Correction: Insert a curvilinear term (usually quadratic: X²)
Y = b0 + b1X + b2X²
or: Log transformation of the X variable (if data are positive)
- brings large values down, pushes small values further apart
2. Independence of Observations
- Residuals across Xs are not correlated
Test: Durbin-Watson statistic (ranges from 0 to 4; tests residual correlation among Xs)
0.0 - 1.5 = positive correlation
1.5 - 2.5 = no correlation
2.5 - 4.0 = negative correlation
Correction: Transformation of Y variable (percentages or logs)
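If you want to compute the Durbin-Watson statistic yourself rather than read it off the SPSS output, here is a minimal sketch (the residuals shown are hypothetical):

```python
# Durbin-Watson statistic: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# Values near 2 suggest no autocorrelation in the residuals.
import numpy as np

def durbin_watson(residuals):
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

residuals = [1.2, -0.5, 0.8, -1.1, 0.3, -0.7, 0.9]   # hypothetical residuals
print(round(durbin_watson(residuals), 3))
```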
3. Normality
- The distribution at each Xi is normal
- The errors have a normal distribution
Test #1: Plot histogram of residuals (should be normal)
Test #2: Normal Prob. Plot
- Plot of cumulative probabilities
- Should follow the diagonal (if the residuals follow a normal distribution)
Correction: Log transformation of Y
4. Constant Variance (homoskedasticity)
- Each Yi distribution has the same variance
- Means that the effect of other factors does not depend on the level of X
- Common problem: Variance increases as X increases (funnel shape)
Test: Scatterplot of X vs. Residuals - Should not show funnel shape pattern
Correction: Log transformation of Y
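Here is a minimal plotting sketch of the two graphical checks just described (a residual histogram for normality and a residuals-vs-X scatterplot for constant variance), using hypothetical values:

```python
# Residual diagnostics: histogram (normality) and residuals vs. X (funnel check).
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])                               # hypothetical X values
residuals = np.array([0.4, -0.6, 1.1, -0.9, 0.2, -0.3, 0.8, -0.7])   # hypothetical residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(residuals, bins=5)
ax1.set_title("Histogram of residuals")
ax2.scatter(x, residuals)
ax2.axhline(0, linestyle="--")
ax2.set_title("Residuals vs. X (look for a funnel)")
plt.show()
```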
Example: Simple Linear Regression
Houston Astros Payroll
Identify a regression equation that predicts the median salary for a Houston Astros baseball player based on knowledge of the total team payroll
Independent variable: Total Payroll
Dependent variable: Median Salary
Here are your data (figures are in thousands)
You can access this data file on the website as well (“Houston Astros salary data”)
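If you prefer to run the regression outside SPSS, here is a hedged sketch using pandas and statsmodels. The file name and column names are assumptions (adjust them to match the actual workbook); the summary reports R Square, the F test, the coefficients, and Durbin-Watson in one place.

```python
# Simple linear regression sketch; "astros_salaries.xlsx" and the column names
# "Total Payroll" / "Median Salary" are assumed, not taken from the course files.
import pandas as pd
import statsmodels.api as sm

data = pd.read_excel("astros_salaries.xlsx")
X = sm.add_constant(data["Total Payroll"])          # adds the intercept term
model = sm.OLS(data["Median Salary"], X).fit()
print(model.summary())   # R Square, F, coefficients, Durbin-Watson
```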
1. Create X–Y Scatterplot
Graphs → Scatter → Simple → Define → OK
Median Salary – Total Payroll Scatterplot
[Scatterplot: Total Payroll (X) vs. Median Salary (Y)]
This scatterplot shows that the linearity assumption is OK; we'll check the other 3 assumptions shortly.
2. Visual check for outliers (remove if necessary)
3. Add regression line:
Double click on the graph
Single click on a data point (it will enlarge and change color)
Elements → Fit Line at Total
Fit Line at Total → Linear
4. Conduct Regression Analysis
Put Independent and Dependent variables in the right boxes
Click Statistics
Click Plots
Click Save
By checking these boxes, you will create extra columns on your data file. You will get a Predicted Values (“PRE_1”) column and a Residual Values (“RES_1”) column.
5. Examine Regression Output
Model Summary(b)

Model 1: R = .754(a), R Square = .569, Adjusted R Square = .540, Std. Error of the Estimate = 220.53978

Change Statistics: R Square Change = .569, F Change = 19.790, df1 = 1, df2 = 15, Sig. F Change = .000

Durbin-Watson = 2.346

a. Predictors: (Constant), Total Payroll
b. Dependent Variable: Median Salary
ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   962530.2          1   962530.159    19.790   .000(a)
Residual     729566.9         15   48637.793
Total        1692097          16

a. Predictors: (Constant), Total Payroll
b. Dependent Variable: Median Salary
Coefficients(a)

Model 1         Unstandardized Coefficients   Standardized Coefficients
                B         Std. Error          Beta                        t       Sig.
(Constant)      110.736   111.951                                         .989    .338
Total Payroll   .012      .003                .754                        4.449   .000

a. Dependent Variable: Median Salary
6. Is model statistically significant?
Yes, because F = 19.79, p = .000 (i.e., p < .001).
7. Identify equation for the simple linear model (i.e., the regression line)
Coefficients(a)

Model 1         Unstandardized Coefficients   Standardized Coefficients
                B         Std. Error          Beta                        t       Sig.
(Constant)      110.736   111.951                                         .989    .338
Total Payroll   .012      .003                .754                        4.449   .000

a. Dependent Variable: Median Salary
Y = Y Intercept + B * (X)
Median Salary = 110.736 + .012 (Total Payroll)
Or, in actual dollars:Median Salary = $110,736 + .012 (Total Payroll)
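As a quick sanity check, the fitted equation can be wrapped in a small prediction function (figures in thousands, as in the data file):

```python
# Predicted median salary (in thousands) from total payroll (in thousands),
# using the coefficients from the SPSS output above.
def predicted_median_salary(total_payroll_thousands):
    return 110.736 + 0.012 * total_payroll_thousands

print(predicted_median_salary(76779))   # about 1032, i.e., roughly $1,032,000
```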
8. Check the other 3 linear regression assumptions
8a. Independence: D-W = 2.346 (OK, because it’s between 1.5 and 2.5)
8b. Normality: Histogram of residuals (is it normal?); Normal Prob. Plot (are the points near the diagonal?)
[Histogram of the regression standardized residuals (Dependent Variable: Median Salary); Mean ≈ 0, Std. Dev. = 0.968, N = 17]
OK, because residuals have a roughly normal shape
[Normal P-P Plot of Regression Standardized Residual (Dependent Variable: Median Salary): expected vs. observed cumulative probabilities]
OK, because points are near the diagonal
8c. Constant variance?: Is there an absence of a funnel shape in scatterplot of X vs. Residuals?
Go to your Modified Data File:
Here’s a look at your data file ordered from lowest to highest payroll, where some of the columns are rearranged to make it more readable:
Test the Constant Variance assumption by looking at the X vs. Residuals scatterplot. Check for funnel pattern.
[Scatterplot: Total Payroll (X) vs. Unstandardized Residual (Y)]
There's a hint of a funnel pattern here. (Consider a log transformation of the Y variable, Median Salary.)
9. Search the output for "Casewise Diagnostics" that describe outliers
None were found here, so nothing shows up in the SPSS output
But if you changed the Casewise Diagnostics (in "Statistics") to show outliers beyond 1 SD …
Here’s what you’d get:
Casewise Diagnostics(a)

Case Number   Std. Residual   Median Salary   Predicted Value   Residual
3              1.060            500.00         266.1707          233.82928
8             -1.320            185.00         476.0631         -291.06310
14             2.229           1300.00         808.3513          491.64868
15            -1.558            500.00         843.7011         -343.70113
16             1.218           1200.00         931.4056          268.59437
17            -1.051            750.00         981.7387         -231.73868

a. Dependent Variable: Median Salary
Don’t Trust Your Model TOO Much…
Question: The Houston Astros payroll in 2005 = $76,779,000. What does the regression line predict the median salary will be?
Answer: Predicted Median Salary = $110,736 + (.012)(76,779,000) = $1,032,084
Actual: $500,000
Question: Why was the model so far off?
1988 Houston Astros (Total payroll = $13,455,000; Median = $500,000)
2005 Houston Astros (Total payroll = $76,779,000; Median = $500,000)