Chapter 16 Data Analysis: Testing for Associations


Relationships

A relationship between two variables has both a direction and a strength:
• Direction: positive or negative
• Strength: weak, moderate, or strong

Relationships: Types

• Curvilinear relationship between two variables: the strength and/or direction of the relationship changes over the range of both variables.
• Linear relationship between two variables: the strength and nature of the relationship remain the same over the range of both variables.

Relationships between Variables: Three Questions

1. Is there a relationship between the two variables we are interested in?
2. How strong is the relationship?
3. How can that relationship best be described?

Figures (scatterplots):
• No relationship between X and Y
• Positive relationship between X and Y
• Negative relationship between X and Y
• Curvilinear relationship between X and Y
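
The scatterplots themselves did not survive extraction. As a stand-in, here is a minimal sketch (not from the original slides) that generates the four patterns with numpy and matplotlib; all data are synthetic.

```python
# Minimal sketch: the four relationship patterns as scatterplots.
# All data are synthetic, invented purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
noise = rng.normal(0, 1, 100)

panels = {
    "No relationship": rng.normal(5, 1, 100),
    "Positive": x + noise,
    "Negative": 10 - x + noise,
    "Curvilinear": (x - 5) ** 2 / 2 + noise,
}

fig, axes = plt.subplots(1, 4, figsize=(14, 3))
for ax, (title, y) in zip(axes, panels.items()):
    ax.scatter(x, y, s=10)
    ax.set(title=title, xlabel="X", ylabel="Y")
plt.tight_layout()
plt.show()
```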


Pearson Correlation Coefficient

. . . a statistical measure of the strength of a linear relationship between two metric (interval- or ratio-level) variables.

Page 10: Chapter 16 Data Analysis: Testing for Associations

It varies between –1.00 and +1.00, with 0 representing absolutely no association between two variables, and –1.00 and +1.00 representing perfect association between two variables.

It varies between –1.00 and +1.00, with 0 representing absolutely no association between two variables, and –1.00 and +1.00 representing perfect association between two variables.

The higher the absolute value of the correlation coefficient the stronger the level of association.

The higher the absolute value of the correlation coefficient the stronger the level of association.

The size of the correlation coefficient can be used to quantitatively describe the strength of the association between two variables.

The size of the correlation coefficient can be used to quantitatively describe the strength of the association between two variables.

Pearson Correlation Coefficient

16-10

• The null hypothesis states that there is no association between the two variables in the population, i.e., that the correlation coefficient is zero.
• If the correlation coefficient is statistically significant, the null hypothesis is rejected and the conclusion is that the two variables do share some association in the population.
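
In practice, both the coefficient and the test of this null hypothesis come from one call. A minimal sketch (not from the text), assuming scipy is available; the data are invented:

```python
# Minimal sketch: Pearson correlation with its significance test.
# The data are invented purely for illustration.
from scipy.stats import pearsonr

ad_spend = [10, 12, 15, 17, 20, 22, 25, 28]  # hypothetical predictor
sales = [40, 44, 50, 49, 60, 58, 66, 70]     # hypothetical outcome

r, p_value = pearsonr(ad_spend, sales)
print(f"r = {r:.3f}, p = {p_value:.4f}")

# If p < .05, reject the null hypothesis of zero correlation and
# conclude the two variables share some association in the population.
```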


Spearman Rank Order Correlation

. . . a statistical measure of the association between two variables where both have been measured using ordinal (rank-order) scales.

Page 13: Chapter 16 Data Analysis: Testing for Associations

If either one of the variables is represented by rank order (ordinal) data – use the

Spearman rank order correlation coefficient.

If either one of the variables is represented by rank order (ordinal) data – use the

Spearman rank order correlation coefficient.

Spearman Rank Order CorrelationSpearman Rank Order Correlation

Spearman rank order correlation coefficient tends to produce a lower coefficient and is

considered a more conservative measure. We should choose a Pearson Correlation when

we can.

Spearman rank order correlation coefficient tends to produce a lower coefficient and is

considered a more conservative measure. We should choose a Pearson Correlation when

we can.
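
scipy offers the same one-call pattern for Spearman. A minimal sketch (not from the text) with invented rankings:

```python
# Minimal sketch: Spearman rank order correlation on ordinal data.
# The rankings are invented purely for illustration.
from scipy.stats import spearmanr

judge_a = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical brand rankings
judge_b = [2, 1, 4, 3, 6, 5, 8, 7]  # a second judge's rankings

rho, p_value = spearmanr(judge_a, judge_b)
print(f"rho = {rho:.3f}, p = {p_value:.4f}")
```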


Bivariate Regression Analysis

. . . a statistical technique that analyzes the linear relationship between two variables by estimating coefficients for an equation for a straight line. One variable is designated as the dependent variable and the other is called the independent or predictor variable.

Regression Assumptions

• The relationship is linear.
• The variables of interest are measured on interval or ratio scales (except in the case of dummy variables).
• The variables come from a bivariate normal population (distribution).
• The error terms associated with making predictions are normally and independently distributed.
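
The error-term assumption can be probed after fitting. A minimal sketch (not from the text), assuming scipy is available: fit a line, then apply a Shapiro-Wilk normality test to the residuals. The data are invented.

```python
# Minimal sketch: checking residual normality after a simple fit.
# The data are invented purely for illustration.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

# Shapiro-Wilk: a small p-value suggests the residuals are not normal,
# which would violate the last assumption above.
stat, p_value = stats.shapiro(residuals)
print(f"W = {stat:.3f}, p = {p_value:.3f}")
```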


What is Regression Analysis?

Regression estimates the formula for a straight line:

y = a + bX + ei

where
• y = the dependent variable
• a = the intercept (the point where the straight line intersects the y-axis when X = 0)
• b = the slope (the change in y for every 1-unit change in X)
• X = the independent variable used to predict y
• ei = the error for the prediction
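
A minimal sketch of estimating a and b from data (not from the text; the numbers are invented), using numpy's polyfit:

```python
# Minimal sketch: estimating the intercept (a) and slope (b)
# of y = a + bX + e. The data are invented purely for illustration.
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9])

b, a = np.polyfit(X, y, deg=1)  # degree-1 fit returns [slope, intercept]
y_hat = a + b * X               # predicted values
e = y - y_hat                   # the prediction errors

print(f"a (intercept) = {a:.2f}, b (slope) = {b:.2f}")
```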


Regression Analysis Terms

• Adjusted r-square: the adjustment reduces r² by taking into account the sample size and the number of independent variables in the regression equation. It tells you when a multiple regression equation has too many independent variables.
• Explained variance: the amount of variation in the dependent variable that can be accounted for by the combination of independent variables (represented by r² in a bivariate regression or adjusted r² in a multiple regression).
• Unexplained variance: the amount of variation in the dependent variable that cannot be accounted for by the combination of independent variables.
• Regression coefficient: an indicator of the importance of an independent variable in predicting a dependent variable. Large coefficients are good predictors; small coefficients are weak predictors.
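
The adjustment in the first term is a fixed formula. A minimal sketch (the numbers are invented) showing how the same r² shrinks as predictors are added:

```python
# Minimal sketch: adjusted r-square from r-square.
# n = sample size, k = number of independent variables.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Invented example: same r2, more predictors -> bigger penalty.
print(adjusted_r2(r2=0.60, n=50, k=2))   # ~0.583
print(adjusted_r2(r2=0.60, n=50, k=10))  # ~0.497
```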


Statistical Significance in Regression

• Is the model significant? This answers the first question about the relationship: "Is there a relationship between the dependent and independent variable?"
• How strong is the relationship? The size of the coefficient of determination (r²) tells what percentage of the total variation in the dependent variable is explained.
• The r² measure varies between .00 and 1.00; the closer r² is to 1.00, the stronger the relationship.

Multiple Regression Analysis

. . . a statistical technique that analyzes the linear relationship between a dependent variable and MULTIPLE independent variables by estimating coefficients for the equation for a straight line.

Multiple Regression Concern

If the independent variables are measured using different scales (1–5 for one, 1–10 for another), the differing scales do not permit relative comparisons between regression coefficients to see which independent variable has the most influence on the dependent variable.

Page 21: Chapter 16 Data Analysis: Testing for Associations

Standardized regression coefficients (beta coefficients) correct this problem.

Standardized regression coefficients (beta coefficients) correct this problem.

Beta coefficient is an estimated regression coefficient that has been recalculated

(standardized) to have a mean of 0 and a standard deviation of 1.

Beta coefficient is an estimated regression coefficient that has been recalculated

(standardized) to have a mean of 0 and a standard deviation of 1.

Standardization removes the effects of different scales and enables independent variables with

different units of measurement to be directly compared for their predictive ability.

Standardization removes the effects of different scales and enables independent variables with

different units of measurement to be directly compared for their predictive ability.
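
A minimal sketch of the idea (not from the text): z-score every variable, refit, and the resulting slopes are the beta coefficients. The data are invented.

```python
# Minimal sketch: beta coefficients via standardized variables.
# The data are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1, 5, 100)   # hypothetical 1-5 scale
x2 = rng.uniform(1, 10, 100)  # hypothetical 1-10 scale
y = 2.0 * x1 + 0.5 * x2 + rng.normal(0, 1, 100)

def zscore(v):
    return (v - v.mean()) / v.std()

# Fit on standardized variables; the slopes are beta coefficients.
Z = np.column_stack([zscore(x1), zscore(x2)])
betas, *_ = np.linalg.lstsq(Z, zscore(y), rcond=None)
print(betas)  # directly comparable despite the different raw scales
```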


When Evaluating Regression Analysis Results

• Assess the statistical significance of the overall regression model using the F statistic and its associated probability.
• Examine the r² to see how large it is.
• For multiple regression: evaluate the individual regression coefficients and their t-test statistics to see which are statistically significant.
• For multiple regression: look at the variables' beta coefficients to assess relative influence (standardized if the variable scales are different).
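
These checks map directly onto a fitted-model summary. A minimal sketch (not from the text), assuming statsmodels is available; the data are invented:

```python
# Minimal sketch: reading F, r-square, and t-tests from an OLS fit.
# The data are invented purely for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = 1.0 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=80)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue, model.f_pvalue)        # overall model significance
print(model.rsquared, model.rsquared_adj)  # strength of the relationship
print(model.tvalues, model.pvalues)        # per-coefficient t-tests
```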


Dummy Variables

• Used when the independent variables you want to use to predict a dependent variable are not measured using interval or ratio scales.
• Dummy variables are artificial variables introduced into a regression equation to represent the categories of a nominally scaled variable.
• There will be one dummy variable for each of the nominal categories of the independent variable, and the values will typically be 0 or 1.
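
A minimal sketch using pandas (not from the text); the nominal variable is invented:

```python
# Minimal sketch: 0/1 dummy variables from a nominal variable.
# The data are invented purely for illustration.
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "west", "south", "north"]})

# One 0/1 column per category of the nominal variable.
dummies = pd.get_dummies(df["region"], dtype=int)
print(dummies)

# Note: when the regression includes an intercept, one category is
# usually dropped (drop_first=True) to serve as the baseline.
```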


Multicollinearity

• Multicollinearity: the independent variables are highly correlated with each other.
• It can result in difficulty estimating independent regression coefficients for the correlated variables:
  • It inflates the standard error of the coefficient and lowers the t statistic associated with it (making the variables in question less likely to be significant).
  • It impacts the individual regression coefficients (the independent variables).
  • It does not impact the size of the r² or the ability to predict values of the dependent variable.
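
A common diagnostic is the variance inflation factor (VIF). A minimal sketch (not from the text), assuming statsmodels is available; the predictors are invented and deliberately correlated:

```python
# Minimal sketch: detecting multicollinearity with VIF.
# The data are invented and deliberately correlated for illustration.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1
X = sm.add_constant(np.column_stack([x1, x2]))

# Common rule of thumb: VIF above ~10 signals serious multicollinearity.
for i in (1, 2):  # column 0 is the constant
    print(f"VIF x{i} = {variance_inflation_factor(X, i):.1f}")
```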
