Jennifer Siegel
Statistical background
Z-Test
T-Test
ANOVAs
Science tries to predict the future. Is an observed result a genuine effect?
Attempt to strengthen predictions with stats
Use the p-value to indicate our level of certainty that the result reflects a genuine effect in the whole population (more on this later…)
Develop an experimental hypothesis
H0 = null hypothesis
H1 = alternative hypothesis
Statistically significant result: p-value ≤ .05
The p-value is the probability of obtaining a result at least this extreme if the null hypothesis is true
At the .05 (5%) level we are 95% certain our experimental effect is genuine
Type 1 error = false positive
Type 2 error = false negative
Confidence level = 1 – probability of a Type 1 error
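The Type 1 error rate can be seen directly by simulation: when the null hypothesis is actually true, a test at the .05 level should come out "significant" about 5% of the time. A minimal sketch (hypothetical setup; pure standard library):

```python
import math
import random

random.seed(42)  # fixed seed so the simulation is reproducible

def z_test_p(sample):
    """Two-sided p-value for H0: population mean = 0, known sigma = 1."""
    n = len(sample)
    z = (sum(sample) / n) / (1 / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

trials = 2000
false_positives = 0
for _ in range(trials):
    # H0 is true here: every sample really does come from N(0, 1)
    sample = [random.gauss(0, 1) for _ in range(30)]
    if z_test_p(sample) < 0.05:  # "significant" purely by chance
        false_positives += 1

rate = false_positives / trials
print(f"Type 1 error rate: {rate:.3f}")  # close to 0.05
```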
Let’s pretend you came up with the following theory…
Having a baby increases brain volume (associated with possible structural changes)
Z-test
T-test
A z-test compares a sample mean $\bar{x}$ to a known population mean $\mu$ with known population SD $\sigma$:
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
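A one-sample z-test ($z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$) can be sketched in a few lines of Python. The numbers are a hypothetical example (IQ-style scores), not from the slides:

```python
import math

# Hypothetical: population mean 100, known SD 15; a sample of 25 has mean 106
mu, sigma, n, sample_mean = 100, 15, 25, 106

z = (sample_mean - mu) / (sigma / math.sqrt(n))  # = 6 / 3 = 2.0
p = math.erfc(abs(z) / math.sqrt(2))             # two-sided p-value

print(f"z = {z:.2f}, p = {p:.4f}")  # z = 2.00, p ≈ 0.0455 -> significant at .05
```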
Cost
Not able to include everyone
Too time consuming
Ethical right to privacy
Realistically researchers can only do sample based studies
t = difference between sample means / estimated standard error of the difference between sample means
Degrees of freedom = sample size - 1
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\text{estimated standard error of the difference between means}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
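The two-sample t formula above can be sketched in Python. The two groups are hypothetical illustration data (pure standard library):

```python
import math

def sample_var(xs):
    """Unbiased sample variance (denominator n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical data for two independent groups
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [3.0, 4.0, 5.0, 6.0]

mean1, mean2 = sum(g1) / len(g1), sum(g2) / len(g2)
# Estimated standard error of the difference: sqrt(s1^2/n1 + s2^2/n2)
se = math.sqrt(sample_var(g1) / len(g1) + sample_var(g2) / len(g2))
t = (mean1 - mean2) / se

print(f"t = {t:.3f}")  # t ≈ -2.191
```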
H0 = There is no difference in brain size before or after giving birth
H1 = The brain is significantly smaller or significantly larger after giving birth (difference detected)
       Before Delivery   6 Weeks After Delivery   Difference
       1437.4            1494.5                   57.1
       1089.2            1109.7                   20.5
       1201.7            1245.4                   43.7
       1371.8            1383.6                   11.8
       1207.9            1237.7                   29.8
       1150.7            1180.1                   29.4
       1221.9            1268.8                   46.9
       1208.7            1248.3                   39.6
Sum    9889.3            10168.1                  278.8
Mean   1236.1625         1271.0125                34.85
SD     113.8544928       119.0413426              5.18685
For paired data, t = mean difference / (SD of the differences / √n). The SD of the eight differences is ≈ 14.80 (note: this is not the difference of the two column SDs), so
t = 34.85 / (14.80 / √8) ≈ 6.66, df = 7
http://www.danielsoper.com/statcalc/calc08.aspx
Women have a significantly larger brain after giving birth
One-sample (sample vs. hypothesized mean)
Independent groups (2 separate groups)
Repeated measures (same group, different measure)
ANalysis Of VAriance
Factor = what is being compared (type of pregnancy)
Levels = different elements of a factor (age of mother)
F-statistic
Post hoc testing
1-Way ANOVA: 1 factor with more than 2 levels
Factorial ANOVA: more than 1 factor
Mixed-design ANOVA: some factors are independent, others are related
A significant result tells you there is a difference somewhere between groups,
NOT where the difference lies
Finding exactly where the difference lies requires further statistical analysis = post hoc analysis
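As a sketch of what the F-statistic computes, here is a one-way ANOVA on three hypothetical groups, built from the between-groups and within-groups sums of squares (pure Python; the data are illustrative, not from the slides):

```python
# Hypothetical one-way ANOVA: 1 factor, 3 levels (3 groups)
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]

all_data = [x for g in groups for x in g]
grand_mean = sum(all_data) / len(all_data)
k, n_total = len(groups), len(all_data)

# Between-groups sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-groups sum of squares: spread of scores around their own group mean
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
print(f"F({k - 1}, {n_total - k}) = {f_stat:.2f}")  # F(2, 6) = 3.00
```

A significant F would then be followed by post hoc comparisons to locate which pair of groups differs.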
Z-tests for populations
T-tests for samples
ANOVAs compare more than 2 groups in more complicated scenarios
Varun V.Sethi
Objective
Correlation
Linear Regression
Take Home Points.
Correlation – how linear is the relationship between two variables? (descriptive)
Regression – how well does a linear model explain my data? (inferential)
Correlation
Correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom).
Strength and direction of the relationship between variables
Scattergrams
[Three scattergrams of y against x: positive correlation, negative correlation, no correlation]
Measures of Correlation
1)Covariance
2) Pearson Correlation Coefficient (r)
1) Covariance
- The covariance is a statistic representing the degree to which 2 variables vary together
{Note that $s_x^2 = \mathrm{cov}(x,x)$}

Covariance formula:

$$\mathrm{cov}(x,y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n}$$

cf. variance formula:

$$s_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$
2) Pearson correlation coefficient (r)
- r is a kind of ‘normalised’ (dimensionless) covariance
- r takes values from -1 (perfect negative correlation) to 1 (perfect positive correlation); r = 0 means no correlation

$$r = \frac{\mathrm{cov}(x,y)}{s_x s_y}$$ (s = standard deviation of the sample)
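Both definitions, $\mathrm{cov}(x,y)$ and $r = \mathrm{cov}(x,y)/(s_x s_y)$, can be written out directly. A sketch with hypothetical data where y = 2x, so r should be exactly 1:

```python
import math

def covariance(xs, ys):
    """cov(x, y) = sum((x_i - mean_x)(y_i - mean_y)) / n"""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson_r(xs, ys):
    """r = cov(x, y) / (s_x * s_y): a 'normalised' covariance."""
    sx = math.sqrt(covariance(xs, xs))  # note: s_x^2 = cov(x, x)
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]  # y = 2x: perfect positive relationship

print(covariance(x, y))  # 4.0
print(pearson_r(x, y))   # ≈ 1.0
```

Note that doubling y doubles the covariance but leaves r unchanged, which is why r (dimensionless) is the preferred descriptive measure.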
Limitations:
Sensitive to extreme values
A relationship, not a prediction
Not causality
Regression: Prediction of one variable from knowledge of one or more other variables
How well does a linear model (y = ax + b) explain the relationship between two variables?
- If there is such a relationship, we can ‘predict’ the value y for a given x.
Linear dependence between 2 variables
Two variables are linearly dependent when the increase of one variable is proportional to the increase of the other one
Examples:
- Energy needed to boil water
- Money needed to buy coffeepots
Fitting data to a straight line (or vice versa): here, ŷ = ax + b
– ŷ: predicted value of y
– a: slope of regression line
– b: intercept

Residual error (εᵢ): difference between obtained and predicted values of y, i.e. εᵢ = yᵢ − ŷᵢ (yᵢ observed, ŷᵢ predicted)

The best-fit line (values of a and b) is the one that minimises the sum of squared errors: $SS_{error} = \sum_i (y_i - \hat{y}_i)^2$
Adjusting the straight line to the data:
• Minimise $\sum_i (y_i - \hat{y}_i)^2$, which is $\sum_i (y_i - ax_i - b)^2$
• The minimum SS_error is at the bottom of the curve, where the gradient is zero – and this can be found with calculus
• Take partial derivatives of $\sum_i (y_i - ax_i - b)^2$ with respect to the parameters a and b and solve for 0 as simultaneous equations, giving:

$$a = r\frac{s_y}{s_x}, \qquad b = \bar{y} - a\bar{x}$$

• This can always be done
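The least-squares slope and intercept can be computed directly from these closed forms; $a = r\,s_y/s_x$ is algebraically the same as $\sum(x_i-\bar{x})(y_i-\bar{y}) / \sum(x_i-\bar{x})^2$. A sketch with hypothetical data:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b.
    a = sum((x - mx)(y - my)) / sum((x - mx)^2)  (equivalent to r * sy / sx)
    b = my - a * mx
    """
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Hypothetical data points
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_line(x, y)
print(f"y-hat = {a:.1f}x + {b:.1f}")  # y-hat = 0.6x + 2.2
```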
We can calculate the regression line for any data, but how well does it fit the data?
Total variance = predicted variance + error variance

$$s_y^2 = s_{\hat{y}}^2 + s_{er}^2$$
Also, it can be shown that r² is the proportion of the variance in y that is explained by our regression model:

$$r^2 = s_{\hat{y}}^2 / s_y^2$$

Insert $r^2 s_y^2$ into $s_y^2 = s_{\hat{y}}^2 + s_{er}^2$ and rearrange to get:

$$s_{er}^2 = s_y^2(1 - r^2)$$
From this we can see that the greater the correlation, the smaller the error variance, and so the better our prediction.
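The decomposition $s_y^2 = s_{\hat{y}}^2 + s_{er}^2$ and $r^2 = s_{\hat{y}}^2/s_y^2$ can be checked numerically. A sketch with hypothetical data whose least-squares line is ŷ = 0.6x + 2.2:

```python
def pop_var(xs):
    """Population variance (denominator n), matching the slides' formulas."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Hypothetical data; 0.6x + 2.2 is the least-squares line for these points
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [0.6 * xi + 2.2 for xi in x]
errors = [yi - yhi for yi, yhi in zip(y, y_hat)]

sy2, syhat2, serr2 = pop_var(y), pop_var(y_hat), pop_var(errors)

print(sy2, syhat2, serr2)  # ≈ 1.2, 0.72, 0.48 -> 1.2 = 0.72 + 0.48
print(syhat2 / sy2)        # r^2 ≈ 0.6
```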
Do we get a significantly better prediction of y from our regression equation than by just predicting the mean?
F-statistic
Prediction / forecasting
Quantify the strength of the relationship between y and the Xj (X1, X2, X3)
A General Linear Model is just any model that describes the data in terms of a straight line
Linear regression is actually a form of the General Linear Model where the parameters are b, the slope of the line, and a, the intercept.
y = bx + a +ε
Multiple regression is used to determine the effect of a number of independent variables, x1, x2, x3 etc., on a single dependent variable, y
The different x variables are combined in a linear way and each has its own regression coefficient:
y = b0 + b1x1+ b2x2 +…..+ bnxn + ε
The b parameters reflect the independent contribution of each independent variable, x, to the value of the dependent variable, y, i.e. the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for
Take Home Points
- Correlated doesn't mean related. E.g., any two variables increasing or decreasing over time will show a nice correlation: CO2 concentration in the air over Antarctica and lodging rental cost in London. Beware in longitudinal studies!
- A relationship between two variables doesn't mean causality (e.g. leaves on the forest floor and hours of sun)
Linear regression is a GLM that models the effect of one independent variable, x, on one dependent variable, y
Multiple Regression models the effect of several independent variables, x1, x2 etc, on one dependent variable, y
Both are types of General Linear Model
Thank You