CORRELATION:
Correlation analysis
Correlation analysis is used to measure the strength of association (linear relationship) between two quantitative variables.
The analysis is concerned only with the strength of the relationship; no causal effect is implied.
A scatter plot (or scatter diagram) is used to show the relationship between two variables.
Scatter Plot Examples
[Figure: scatter plots of y against x illustrating linear relationships, curvilinear relationships, strong relationships, weak relationships and no relationship]
Correlation coefficient:
• The population correlation coefficient ρ (rho) measures the strength of the association between the variables
• The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations
• The value of r varies from sample to sample; under H0: ρ = 0, the test statistic computed from r follows a Student's t distribution with n - 2 degrees of freedom
Both ρ and r:
• Are unit free
• Range between -1 and 1
• The closer to -1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
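The properties listed above can be checked numerically. The following is a minimal sketch in plain Python; the function name `pearson_r` and the height/weight data are invented for illustration and do not come from the slides.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
height_cm = [150, 160, 165, 172, 180, 185]
weight_kg = [52, 58, 63, 70, 77, 85]

r_cm = pearson_r(height_cm, weight_kg)

# Changing the unit of x (cm to m) should leave r unchanged: r is unit free.
height_m = [h / 100 for h in height_cm]
r_m = pearson_r(height_m, weight_kg)
print(round(r_cm, 4), round(r_m, 4))

# A perfectly decreasing straight line should give r at the lower bound of -1.
r_perfect_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_perfect_negative)
```

Rescaling either variable changes the covariance and the standard deviations by the same factor, which is why the ratio r is unaffected.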
A general guideline on interpretation of correlation
Significance test for correlation:
Hypotheses tested are
H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)
Test statistic is
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} \sim t(n-2)$$
If the p-value is less than the level of significance (α), then there is evidence of a linear relationship between the two variables.
Example:
Pincherle and Robinson (1974) note a marked inter-observer variation in blood pressure readings. They found that doctors who read high on systolic tended to read high on diastolic. The table below shows the mean systolic and diastolic blood pressure readings by 14 doctors.
Research question: Is the association between the two variables significant?
[Figure: scatter plot of blood pressure data]
Correlations
                                 SYSTOLIC   DIASTOLI
SYSTOLIC   Pearson Correlation   1          .418
           Sig. (2-tailed)       .          .136
           N                     14         14
DIASTOLI   Pearson Correlation   .418       1
           Sig. (2-tailed)       .136       .
           N                     14         14
r = 0.418; low positive correlation between systolic and diastolic blood pressure
p-value = 0.136; there isn't sufficient evidence to indicate an association between systolic and diastolic blood pressure
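As a check on the reported p-value, the test statistic t = r√(n-2)/√(1-r²) can be evaluated for r = 0.418 and n = 14. The sketch below uses only the Python standard library; the helper names `t_pdf` and `two_sided_p` are illustrative, and the p-value is approximated by Simpson's rule rather than taken from a statistics package, so it should agree with the software output only approximately (r is also rounded to three decimals here).

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the density over [0, |t|]."""
    a = abs(t_stat)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3   # P(-|t| < T < |t|), by symmetry
    return 1 - central_area

r, n = 0.418, 14                       # values from the correlation output
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
p_value = two_sided_p(t_stat, n - 2)
print(round(t_stat, 3), round(p_value, 3))
```

The statistic comes out near 1.59 on 12 degrees of freedom, which is consistent with a two-sided p-value in the neighbourhood of the reported 0.136.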
REGRESSION ANALYSIS:
Regression analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
SIMPLE LINEAR REGRESSION:
• Only one independent variable is used to explain the dependent variable
• Relationship between dependent and independent variables is described by a linear function
• Changes in the dependent variable are assumed to be caused by changes in the independent variable
The model is of the form
$$y = \beta_0 + \beta_1 X + \varepsilon$$
The parameters β0 and β1 are called the regression coefficients; β0 is the intercept and β1 is the slope of the regression fit.
y is the dependent variable and X is the independent variable.
ε is the error term; it introduces randomness into the model.
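Using sample information, the regression coefficients are estimated by
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$$
And the estimated regression model fit is
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 X$$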
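The least squares estimates can be computed directly from the sums in the formulas. The sketch below is a plain-Python illustration; the function names and the small data set are hypothetical. It also evaluates the slope with the shortcut form (Σxᵢyᵢ - n x̄ ȳ)/(Σxᵢ² - n x̄²) to show that both expressions give the same value.

```python
def fit_simple_regression(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def slope_shortcut(x, y):
    """Slope from the computational form of the same formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    numerator = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    denominator = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    return numerator / denominator

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_regression(x, y)
print(round(b0, 4), round(b1, 4), round(slope_shortcut(x, y), 4))

# Points lying exactly on y = 3 + 2x should return that line.
exact_b0, exact_b1 = fit_simple_regression([0, 1, 2, 3], [3, 5, 7, 9])
```

The centred form is usually preferred in code because the shortcut form subtracts two large, nearly equal numbers when the x values are far from zero.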
Interpretation of regression coefficient:
β̂1 is the estimated change in the average value of y as a result of a one-unit change in x.
Example:
Research question: is there a linear relationship between BP and age? Answer this question using information on 30 individuals.
BP is the dependent variable; age is the independent variable.
The estimated regression fit is
$$\hat{y} = 98.715 + 0.971\,\text{age}$$
For every additional year in age, BP increases on average by 0.97 units.
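The fitted line can be used for prediction as well as interpretation. In this short sketch the function name `predict_bp` and the age of 50 years are illustrative choices, not values from the slides; only the two coefficients come from the fit.

```python
INTERCEPT = 98.715   # estimated intercept from the fitted model
SLOPE = 0.971        # estimated change in average BP per additional year of age

def predict_bp(age):
    """Predicted average BP at a given age under the fitted line."""
    return INTERCEPT + SLOPE * age

bp_at_50 = predict_bp(50)              # hypothetical age, chosen for illustration
one_year_effect = predict_bp(51) - predict_bp(50)
print(round(bp_at_50, 3), round(one_year_effect, 3))
```

The difference between predictions one year apart equals the slope, which is exactly the interpretation given for the regression coefficient. Predictions are only meaningful for ages within the range covered by the 30 individuals in the sample.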
Test for overall significance of fit:
Hypotheses for the test of whether the fitted model is statistically significant are:
H0 : regression fit is not significant
H1 : regression fit is significant
Use the ANOVA table to make a decision about the test.
ANOVA table
Source of variation d.f. S.S. M.S. F-ratio p-value
Regression 1 SSR MSR=SSR/1 Fc=MSR/MSE Pr(F > Fc)
Residual n-2 SSE MSE=SSE/(n-2)
Total n-1 SST
SSR is the explained variation, attributable to the relationship between the dependent and independent variables.
SSE is the unexplained variation; it occurs due to chance.
SST is the total variation (SST = SSR + SSE).
If the p-value is less than the level of significance, the fitted model is a significant fit.
ANOVA
Source of variation S.S. df M.S. F Sig.
Regression 6394.023 1 6394.023 21.330 .000
Residual 8393.444 28 299.766
Total 14787.467 29
The p-value is < 0.001, so the fitted model is significant.
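The mean squares and F-ratio in the ANOVA output follow from the sums of squares and degrees of freedom. A minimal sketch, with `anova_f` as an illustrative helper name and p denoting the number of independent variables (p = 1 for the simple model):

```python
def anova_f(ssr, sse, n, p=1):
    """Return (MSR, MSE, F) for a regression with p independent variables."""
    msr = ssr / p
    mse = sse / (n - p - 1)
    return msr, mse, msr / mse

# Sums of squares from the BP-on-age ANOVA output, n = 30 individuals.
ssr, sse = 6394.023, 8393.444
msr, mse, f_ratio = anova_f(ssr, sse, n=30)
sst = ssr + sse                        # total variation
print(round(msr, 3), round(mse, 3), round(f_ratio, 3), round(sst, 3))
```

The computed values reproduce the table entries (MSE about 299.766, F about 21.33, SST 14787.467) up to rounding.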
Test for significance of predictor:
Hypotheses of the test are:
$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0$$
Test statistic is
$$t = \frac{\hat{\beta}_1}{s.e.(\hat{\beta}_1)} \sim t(n-2)$$
If the p-value is less than the level of significance, the predictor is linearly associated with the response.
Coefficients:
            B        Std. Error   t       Sig.
Intercept   98.715   10.000       9.871   .000
age         .971     .210         4.618   .000
a. Dependent Variable: bp
The value of the test statistic is 4.618 and the p-value is < 0.001; age is linearly associated with BP.
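The t values in the coefficients output are each estimate divided by its standard error. The sketch below recomputes them from the rounded table entries, so small discrepancies from the printed t values are expected. It also checks a standard fact about simple linear regression: the square of the slope's t statistic equals the ANOVA F-ratio.

```python
# Rounded estimates and standard errors from the coefficients output.
b0, se_b0 = 98.715, 10.000
b1, se_b1 = 0.971, 0.210

t_intercept = b0 / se_b0
t_age = b1 / se_b1
print(round(t_intercept, 3), round(t_age, 3))

# With one predictor, t squared for the slope matches the ANOVA F-ratio.
reported_t, reported_f = 4.618, 21.330
print(round(reported_t ** 2, 3), reported_f)
```

Because the two tests are equivalent in the one-predictor case, the F test for overall fit and the t test for the slope always lead to the same conclusion there.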
Coefficient of determination:
The coefficient of determination is the proportion of the total variation in the dependent variable that is explained by variation in the independent variable.
The coefficient of determination is also called R-squared and is denoted R².
$$R^{2} = \frac{SSR}{SST}, \qquad 0 \le R^{2} \le 1$$
Graded interpretation: R² = 0.1-0.3 weak relationship; 0.4-0.7 moderate relationship; 0.8-1 strong relationship
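As a worked illustration, R² can be computed for the BP-on-age model from the sums of squares in its ANOVA output. The variable names below are illustrative. The square root is included because, with a single predictor, R² equals the square of the sample correlation between x and y.

```python
import math

ssr = 6394.023     # regression (explained) sum of squares
sst = 14787.467    # total sum of squares

r_squared = ssr / sst
abs_r = math.sqrt(r_squared)   # magnitude of the correlation between BP and age
print(round(r_squared, 4), round(abs_r, 4))
```

About 43% of the variation in BP is explained by age, which falls in the moderate band of the graded interpretation.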
MULTIPLE LINEAR REGRESSION:
Use two or more independent variables to explain the dependent variable.
Multiple linear regression allows us to investigate the joint effect of several independent variables on the dependent variable.
We relate a single outcome (dependent) variable to two or more independent variables simultaneously.
The aims of fitting a regression model are to:
• Identify independent variables that are associated with the dependent variable, in order to promote understanding of the underlying process.
• Determine the extent to which each independent variable is linearly related to the dependent variable after adjusting for other variables that may be related to it.
• Predict the value of the dependent variable as accurately as possible from the predictor values.
The regression model is of the form:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p + \varepsilon$$
where $X_1, X_2, \dots, X_p$ are the independent variables and $\beta_0, \beta_1, \dots, \beta_p$ are the regression coefficients.
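The slides do not say how the coefficients of the multiple regression model are estimated. One common approach, assumed here, is least squares via the normal equations (XᵀX)b = Xᵀy. The sketch below solves them with a small Gaussian elimination routine in plain Python; all function names and the noise-free data are hypothetical and chosen so that the known coefficients can be recovered.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_multiple_regression(rows, y):
    """Least squares coefficients [b0, b1, ..., bp] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]     # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical noise-free data generated from Y = 1 + 2*X1 + 3*X2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coef = fit_multiple_regression(rows, y)
print([round(c, 6) for c in coef])
```

In practice a statistics package would be used instead, since it also reports standard errors and handles ill-conditioned data more robustly than the normal equations.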
Interpretation of regression coefficients:
The regression coefficient $\beta_k$, $k = 1, 2, \dots, p$, is the estimated change in the average value of the dependent variable for every unit increase in the corresponding predictor $X_k$, holding other factors in the model constant.
Each of the estimates is adjusted for the effects of all other predictors.
Inference on regression coefficients:
We can make inference on each regression coefficient $\beta_k$ by carrying out the statistical hypothesis test
$$H_0: \beta_k = 0 \quad \text{vs.} \quad H_1: \beta_k \neq 0$$
The test statistic is
$$T = \frac{\hat{\beta}_k}{s.e.(\hat{\beta}_k)} \sim t(n-p-1)$$
If the p-value is less than the level of significance, the independent variable is linearly associated with the dependent variable after adjusting for all other independent variables.
Test for significance of model fit:
The analysis is the same as for the simple model; our focus is on the p-value in the ANOVA table. The inference is the same, i.e. if p-value < α, the fitted model is statistically significant.
Adjusted R² statistic: related to the coefficient of determination.
It also measures the proportion of variation of the dependent variable that is accounted for by the independent variables.
$$R^{2}_{adj} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}$$
Example:
A regression model is fitted to determine whether a linear relationship exists between patient satisfaction level and the patient's age (in years), severity of illness (an index) and anxiety level (an index). The data used were for 30 patients selected at random.
For the data collected, larger values of patient satisfaction, severity of illness and anxiety level are, respectively, associated with more satisfaction, increased severity of illness and more anxiety.
The estimated regression fit is
$$\hat{y} = 168.6078 - 1.2742\,\text{age} - 0.8473\,\text{sev.} - 6.0072\,\text{anx.}$$
Interpretation of regression coefficients:
Adjusting for severity of illness and anxiety level of patients, for every additional year in age, the satisfaction level on average decreases by 1.27 units.
Adjusting for age and anxiety level, the satisfaction level on average decreases by 0.85 units for every unit increase in severity of illness.
Adjusting for age and severity of illness, the satisfaction level on average decreases by about 6 units for every unit increase in anxiety level.
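The "holding other factors constant" interpretation can be seen by evaluating the fitted equation for two patients who differ only in age. In this sketch the function name `predict_satisfaction` and the patient values (age 40, severity 50, anxiety 2.0) are hypothetical; only the coefficients come from the parameter estimates.

```python
# Coefficients from the fitted patient satisfaction model.
INTERCEPT = 168.6078
B_AGE = -1.2742
B_SEVERITY = -0.8473
B_ANXIETY = -6.0072

def predict_satisfaction(age, severity, anxiety):
    """Predicted average satisfaction level under the fitted model."""
    return INTERCEPT + B_AGE * age + B_SEVERITY * severity + B_ANXIETY * anxiety

# Hypothetical patient, chosen for illustration only.
base = predict_satisfaction(age=40, severity=50, anxiety=2.0)
one_year_older = predict_satisfaction(age=41, severity=50, anxiety=2.0)
print(round(base, 4), round(one_year_older - base, 4))
```

With severity and anxiety fixed, the two predictions differ by exactly the age coefficient, a decrease of about 1.27 units.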
ANOVA table
Source of variation d.f. S.S. M.S. F-ratio p-value
Regression 3 7256.3 2418.767 30.46 < 0.001
Residual 26 2063.2 79.4
Total 29 9319.5
Overall the fit is significant, since the p-value is < 0.001.
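The ANOVA entries for the patient satisfaction model also give R² and adjusted R², using R² = SSR/SST and adjusted R² = 1 - [SSE/(n-p-1)]/[SST/(n-1)] with n = 30 and p = 3. The slides do not report these two values, so the results below are derived here rather than quoted; variable names are illustrative.

```python
n, p = 30, 3                           # patients and independent variables
ssr, sse, sst = 7256.3, 2063.2, 9319.5  # sums of squares from the ANOVA table

msr = ssr / p
mse = sse / (n - p - 1)
f_ratio = msr / mse

r_squared = ssr / sst
adj_r_squared = 1 - (sse / (n - p - 1)) / (sst / (n - 1))
print(round(f_ratio, 2), round(r_squared, 4), round(adj_r_squared, 4))
```

The F-ratio differs slightly from the tabulated 30.46 because the table rounds MSE to 79.4 before dividing. Adjusted R² is always below R² when p ≥ 1 and SSE > 0, since it penalises the number of predictors in the model.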
Parameter estimates:
Variable    β̂          s.e.(β̂)   t        p-value
age         -1.2742    0.2406    -5.295   < 0.001
severity    -0.8473    0.4599    -1.842   0.077
anxiety     -6.0072    6.2042    -0.968   0.3418
intercept   168.6078
From the results above, age is a significant variable; i.e. controlling for anxiety level and severity of illness of patients, age is significantly associated with satisfaction level.
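Each test statistic in the parameter estimates is the coefficient divided by its standard error, referred to a t distribution with n - p - 1 = 26 degrees of freedom. A short sketch that recomputes the statistics from the tabulated values (the dictionary layout is an illustrative choice):

```python
# (estimate, standard error) pairs from the parameter estimates.
estimates = {
    "age": (-1.2742, 0.2406),
    "severity": (-0.8473, 0.4599),
    "anxiety": (-6.0072, 6.2042),
}

n, p = 30, 3
degrees_of_freedom = n - p - 1

t_stats = {name: est / se for name, (est, se) in estimates.items()}
for name, t in t_stats.items():
    print(name, round(t, 3), "df =", degrees_of_freedom)
```

Only age has a statistic that is large in magnitude relative to its standard error, which matches the p-values in the table.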