Chapter 16 Data Analysis: Testing for Associations


Relationships

A relationship between two variables has both a direction and a strength:
• Direction: positive or negative
• Strength: weak, moderate, or strong

Relationships: Types

• Curvilinear relationship between two variables: the strength and/or direction of the relationship changes over the range of both variables.
• Linear relationship between two variables: the strength and nature of the relationship remain the same over the range of both variables.

Relationships between Variables: Three Questions

1. Is there a relationship between the two variables we are interested in?
2. How strong is the relationship?
3. How can that relationship best be described?

Figures (scatterplots):
• No relationship between X and Y
• Positive relationship between X and Y
• Negative relationship between X and Y
• Curvilinear relationship between X and Y
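
The scatterplots themselves did not survive extraction. As a stand-in, here is a minimal sketch (not from the original slides) that generates the four patterns with numpy and matplotlib; all data are synthetic.

```python
# Minimal sketch: the four relationship patterns as scatterplots.
# All data are synthetic, invented purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
noise = rng.normal(0, 1, 100)

panels = {
    "No relationship": rng.normal(5, 1, 100),
    "Positive": x + noise,
    "Negative": 10 - x + noise,
    "Curvilinear": (x - 5) ** 2 / 2 + noise,
}

fig, axes = plt.subplots(1, 4, figsize=(14, 3))
for ax, (title, y) in zip(axes, panels.items()):
    ax.scatter(x, y, s=10)
    ax.set(title=title, xlabel="X", ylabel="Y")
plt.tight_layout()
plt.show()
```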


Pearson Correlation Coefficient

. . . a statistical measure of the strength of a linear relationship between two metric (interval- or ratio-level) variables.

Page 10: Chapter 16 Data Analysis: Testing for Associations

It varies between –1.00 and +1.00, with 0 representing absolutely no association between two variables, and –1.00 and +1.00 representing perfect association between two variables.

It varies between –1.00 and +1.00, with 0 representing absolutely no association between two variables, and –1.00 and +1.00 representing perfect association between two variables.

The higher the absolute value of the correlation coefficient the stronger the level of association.

The higher the absolute value of the correlation coefficient the stronger the level of association.

The size of the correlation coefficient can be used to quantitatively describe the strength of the association between two variables.

The size of the correlation coefficient can be used to quantitatively describe the strength of the association between two variables.

Pearson Correlation Coefficient

16-10

• The null hypothesis states that there is no association between the two variables in the population, i.e., that the correlation coefficient is zero.
• If the correlation coefficient is statistically significant, the null hypothesis is rejected and the conclusion is that the two variables do share some association in the population.
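
In practice, both the coefficient and the test of this null hypothesis come from one call. A minimal sketch (not from the text), assuming scipy is available; the data are invented:

```python
# Minimal sketch: Pearson correlation with its significance test.
# The data are invented purely for illustration.
from scipy.stats import pearsonr

ad_spend = [10, 12, 15, 17, 20, 22, 25, 28]  # hypothetical predictor
sales = [40, 44, 50, 49, 60, 58, 66, 70]     # hypothetical outcome

r, p_value = pearsonr(ad_spend, sales)
print(f"r = {r:.3f}, p = {p_value:.4f}")

# If p < .05, reject the null hypothesis of zero correlation and
# conclude the two variables share some association in the population.
```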


Spearman Rank Order Correlation

. . . a statistical measure of the association between two variables where both have been measured using ordinal (rank-order) scales.

Page 13: Chapter 16 Data Analysis: Testing for Associations

If either one of the variables is represented by rank order (ordinal) data – use the

Spearman rank order correlation coefficient.

If either one of the variables is represented by rank order (ordinal) data – use the

Spearman rank order correlation coefficient.

Spearman Rank Order CorrelationSpearman Rank Order Correlation

Spearman rank order correlation coefficient tends to produce a lower coefficient and is

considered a more conservative measure. We should choose a Pearson Correlation when

we can.

Spearman rank order correlation coefficient tends to produce a lower coefficient and is

considered a more conservative measure. We should choose a Pearson Correlation when

we can.
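
scipy offers the same one-call pattern for Spearman. A minimal sketch (not from the text) with invented rankings:

```python
# Minimal sketch: Spearman rank order correlation on ordinal data.
# The rankings are invented purely for illustration.
from scipy.stats import spearmanr

judge_a = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical brand rankings
judge_b = [2, 1, 4, 3, 6, 5, 8, 7]  # a second judge's rankings

rho, p_value = spearmanr(judge_a, judge_b)
print(f"rho = {rho:.3f}, p = {p_value:.4f}")
```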


Bivariate Regression Analysis

. . . a statistical technique that analyzes the linear relationship between two variables by estimating coefficients for an equation for a straight line. One variable is designated as the dependent variable and the other is called the independent or predictor variable.

Regression Assumptions

• The relationship is linear.
• The variables of interest are measured on interval or ratio scales (except in the case of dummy variables).
• The variables come from a bivariate normal population (distribution).
• The error terms associated with making predictions are normally and independently distributed.
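
The error-term assumption can be probed after fitting. A minimal sketch (not from the text), assuming scipy is available: fit a line, then apply a Shapiro-Wilk normality test to the residuals. The data are invented.

```python
# Minimal sketch: checking residual normality after a simple fit.
# The data are invented purely for illustration.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

# Shapiro-Wilk: a small p-value suggests the residuals are not normal,
# which would violate the last assumption above.
stat, p_value = stats.shapiro(residuals)
print(f"W = {stat:.3f}, p = {p_value:.3f}")
```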


What is Regression Analysis?

Regression estimates the formula for a straight line:

y = a + bX + ei

where
• y = the dependent variable
• a = the intercept (the point where the straight line intersects the y-axis when X = 0)
• b = the slope (the change in y for every 1-unit change in X)
• X = the independent variable used to predict y
• ei = the error for the prediction
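
A minimal sketch of estimating a and b from data (not from the text; the numbers are invented), using numpy's polyfit:

```python
# Minimal sketch: estimating the intercept (a) and slope (b)
# of y = a + bX + e. The data are invented purely for illustration.
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9])

b, a = np.polyfit(X, y, deg=1)  # degree-1 fit returns [slope, intercept]
y_hat = a + b * X               # predicted values
e = y - y_hat                   # the prediction errors

print(f"a (intercept) = {a:.2f}, b (slope) = {b:.2f}")
```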


Regression Analysis Terms

• Adjusted r-square: the adjustment reduces r² by taking into account the sample size and the number of independent variables in the regression equation. It tells you when a multiple regression equation has too many independent variables.
• Explained variance: the amount of variation in the dependent variable that can be accounted for by the combination of independent variables (represented by r² in a bivariate regression or adjusted r² in a multiple regression).
• Unexplained variance: the amount of variation in the dependent variable that cannot be accounted for by the combination of independent variables.
• Regression coefficient: an indicator of the importance of an independent variable in predicting a dependent variable. Large coefficients are good predictors; small coefficients are weak predictors.
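
The adjustment in the first term is a fixed formula. A minimal sketch (the numbers are invented) showing how the same r² shrinks as predictors are added:

```python
# Minimal sketch: adjusted r-square from r-square.
# n = sample size, k = number of independent variables.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Invented example: same r2, more predictors -> bigger penalty.
print(adjusted_r2(r2=0.60, n=50, k=2))   # ~0.583
print(adjusted_r2(r2=0.60, n=50, k=10))  # ~0.497
```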


Statistical Significance in Regression

• Is the model significant? This answers the first question about the relationship: "Is there a relationship between the dependent and independent variable?"
• How strong is the relationship? The size of the coefficient of determination (r²) tells what percentage of the total variation in the dependent variable is explained.
• The r² measure varies between .00 and 1.00; the closer r² is to 1.00, the stronger the relationship.

Multiple Regression Analysis

. . . a statistical technique that analyzes the linear relationship between a dependent variable and MULTIPLE independent variables by estimating coefficients for the equation for a straight line.

Multiple Regression Concern

If the independent variables are measured using different scales (1–5 for one, 1–10 for another), the differing scales do not permit relative comparisons between regression coefficients to see which independent variable has the most influence on the dependent variable.

Page 21: Chapter 16 Data Analysis: Testing for Associations

Standardized regression coefficients (beta coefficients) correct this problem.

Standardized regression coefficients (beta coefficients) correct this problem.

Beta coefficient is an estimated regression coefficient that has been recalculated

(standardized) to have a mean of 0 and a standard deviation of 1.

Beta coefficient is an estimated regression coefficient that has been recalculated

(standardized) to have a mean of 0 and a standard deviation of 1.

Standardization removes the effects of different scales and enables independent variables with

different units of measurement to be directly compared for their predictive ability.

Standardization removes the effects of different scales and enables independent variables with

different units of measurement to be directly compared for their predictive ability.
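
A minimal sketch of the idea (not from the text): z-score every variable, refit, and the resulting slopes are the beta coefficients. The data are invented.

```python
# Minimal sketch: beta coefficients via standardized variables.
# The data are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1, 5, 100)   # hypothetical 1-5 scale
x2 = rng.uniform(1, 10, 100)  # hypothetical 1-10 scale
y = 2.0 * x1 + 0.5 * x2 + rng.normal(0, 1, 100)

def zscore(v):
    return (v - v.mean()) / v.std()

# Fit on standardized variables; the slopes are beta coefficients.
Z = np.column_stack([zscore(x1), zscore(x2)])
betas, *_ = np.linalg.lstsq(Z, zscore(y), rcond=None)
print(betas)  # directly comparable despite the different raw scales
```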


When Evaluating Regression Analysis Results

• Assess the statistical significance of the overall regression model using the F statistic and its associated probability.
• Examine the r² to see how large it is.
• For multiple regression: evaluate the individual regression coefficients and their t-test statistics to see which are statistically significant.
• For multiple regression: look at the variables' beta coefficients to assess relative influence (standardized if the variable scales are different).
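
These checks map directly onto a fitted-model summary. A minimal sketch (not from the text), assuming statsmodels is available; the data are invented:

```python
# Minimal sketch: reading F, r-square, and t-tests from an OLS fit.
# The data are invented purely for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = 1.0 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=80)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue, model.f_pvalue)        # overall model significance
print(model.rsquared, model.rsquared_adj)  # strength of the relationship
print(model.tvalues, model.pvalues)        # per-coefficient t-tests
```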


Dummy Variables

• Used when the independent variables you want to use to predict a dependent variable are not measured using interval or ratio scales.
• Dummy variables are artificial variables introduced into a regression equation to represent the categories of a nominally scaled variable.
• There will be one dummy variable for each of the nominal categories of the independent variable, and the values will typically be 0 or 1.
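
A minimal sketch using pandas (not from the text); the nominal variable is invented:

```python
# Minimal sketch: 0/1 dummy variables from a nominal variable.
# The data are invented purely for illustration.
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "west", "south", "north"]})

# One 0/1 column per category of the nominal variable.
dummies = pd.get_dummies(df["region"], dtype=int)
print(dummies)

# Note: when the regression includes an intercept, one category is
# usually dropped (drop_first=True) to serve as the baseline.
```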


Multicollinearity

• Multicollinearity: the independent variables are highly correlated with each other.
• It can result in difficulty estimating independent regression coefficients for the correlated variables:
  • It inflates the standard error of the coefficient and lowers the t statistic associated with it (making the variables in question less likely to be significant).
  • It impacts the individual regression coefficients (the independent variables).
  • It does not impact the size of the r² or the ability to predict values of the dependent variable.
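
A common diagnostic is the variance inflation factor (VIF). A minimal sketch (not from the text), assuming statsmodels is available; the predictors are invented and deliberately correlated:

```python
# Minimal sketch: detecting multicollinearity with VIF.
# The data are invented and deliberately correlated for illustration.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1
X = sm.add_constant(np.column_stack([x1, x2]))

# Common rule of thumb: VIF above ~10 signals serious multicollinearity.
for i in (1, 2):  # column 0 is the constant
    print(f"VIF x{i} = {variance_inflation_factor(X, i):.1f}")
```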
