Quantitative Methods Lecture 3

Populations and Samples

Statistics books often assume we already know the true mean or the true variance of the whole population being studied

In the real world, we hardly ever know the true values for the whole population

(If we did, there would be no need to carry out a statistical survey…)

We usually have to estimate the characteristics of the population from sample surveys

Estimating the Population Mean

Each time we take a sample and calculate the mean, we are only obtaining information about PART of the TOTAL POPULATION.

We have to use the sample mean (‘x-bar’) as an ESTIMATE of the population mean (‘mu’) which is usually unknown

As an estimate, x-bar is subject to a margin of error

Two more statistics

The Standard Error and the Confidence Interval measure the margins of error on our estimate of the true population mean.

The usual Confidence Interval is x-bar plus or minus approximately two standard errors (1.96 standard errors, to be precise)

In other words, we reckon that our estimated mean is probably within about ± 2 Standard Errors of the true mean

But there are complications…

Sampling Distributions enable us to make these estimates

Let’s draw a number of different samples from a normally distributed population

We can calculate the mean of each sample

These sample means give us several different estimates of the true population mean

When plotted, the sample means group fairly tightly round the population mean in a bell-shape which is much narrower than the distribution of the population itself

The larger the samples from which the means are drawn, the tighter this bell-shape will be
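
The tightening of the bell-shape can be checked with a small simulation. This is only a sketch and not part of the original lecture: the population mean of 100, the standard deviation of 15, the sample sizes and the random seed are all assumed values chosen for illustration.

```python
import random
import statistics

def sample_means(pop_mean, pop_sd, sample_size, n_samples, rng):
    """Draw n_samples samples from a normal population; return each sample's mean."""
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return means

rng = random.Random(42)          # fixed seed so the run is repeatable (assumed)
POP_MEAN, POP_SD = 100.0, 15.0   # hypothetical population, for illustration only

spread_by_size = {}
for size in (10, 40, 160):
    means = sample_means(POP_MEAN, POP_SD, size, 2000, rng)
    # The spread of the sample means is what the lecture calls the Standard Error
    spread_by_size[size] = statistics.stdev(means)
    print(f"sample size {size:>3}: spread of sample means = {spread_by_size[size]:.2f}")
```

Each fourfold increase in sample size should roughly halve the spread of the sample means, which is the square-root relationship behind the Standard Error.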

More on Sampling Distributions

The black curve is a Normal Distribution

The blue curve is a Sampling Distribution of various sample means

If we used larger samples, the means would group more tightly

If we used smaller samples, less tightly

[Figure: the two curves, with x on the horizontal axis and frequency f on the vertical axis]

The Standard Error of the Mean

It has been found that the margin of error on a sample mean shrinks in proportion to the square root of the number in the sample

So the Standard Error = the Standard Deviation divided by √N (“the square root of N”)

Thus for any given Standard Deviation, the larger the N (the number in the sample), the smaller the Standard Error will be.

We use the standard error to estimate the population mean from the sample mean - subject to a margin of error.
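
As a minimal sketch of the formula above, the function below divides a Standard Deviation by the square root of N. The function name `standard_error` and the Standard Deviation of 8 are assumptions made for illustration only.

```python
import math

def standard_error(std_dev, n):
    """Standard Error of the mean = Standard Deviation / sqrt(N)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return std_dev / math.sqrt(n)

STD_DEV = 8.0  # assumed value, for illustration only
for n in (25, 100, 400):
    print(f"N = {n:>3}: standard error = {standard_error(STD_DEV, n):.2f}")
```

Because of the square root, the sample has to be four times larger to halve the Standard Error.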

The 95% Confidence Interval

95% of the Normal Distribution is within ± (plus or minus) 1.96 Standard Deviations of the Mean.

In the same way, probability theory shows that, 95% of the time, the true population mean will lie within ±1.96 Standard Errors of any mean calculated from a large sample.

(Small samples are more complicated!)
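
The 1.96 figure can be checked against the normal distribution directly. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Share of a normal distribution lying within ±k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1.0, 1.96, 2.57):
    print(f"within ±{k} standard deviations: {share_within(k):.3f}")
```

The share within ±1.96 comes out at very nearly 95%, and the share within ±2.57 at very nearly 99%.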

95% probability is not certainty

Because we are estimating, we cannot be 100% certain

If something is 95% probable, it is only correct 19 times out of 20

So Confidence Intervals are not infallible, unlike Standard Deviations and Variances, which are exact calculations on the data we actually have

But as long as our samples are large (more than 60) margins of error are fairly small

Example

A sample of 100 ball-bearings is weighed. They have a mean weight of 150 grams with a standard deviation of 8 grams.

Find the mean weight of the population as a whole, within the 95% Confidence Interval.

Calculate the Standard Error = Std Deviation / √N = 8 / √100 = 8/10 = 0.8

We are 95% certain that the population mean will be within ±1.96*0.8 of the sample mean.

So the population mean will lie between 150 - 1.96*0.8 and 150 + 1.96*0.8, i.e. between 148.432 and 151.568 grams
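
The ball-bearing calculation can be reproduced in a few lines. This is a sketch only; the function name `confidence_interval` and its parameter names are hypothetical, while the figures (N = 100, mean 150, standard deviation 8, multiplier 1.96) come from the example.

```python
import math

def confidence_interval(sample_mean, std_dev, n, multiplier=1.96):
    """Return (lower, upper): sample mean ± multiplier × Standard Error."""
    standard_error = std_dev / math.sqrt(n)
    margin = multiplier * standard_error
    return sample_mean - margin, sample_mean + margin

# Figures from the ball-bearing example
lower, upper = confidence_interval(sample_mean=150.0, std_dev=8.0, n=100)
print(f"95% Confidence Interval: {lower:.3f} to {upper:.3f} grams")
```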

Meaning of the Confidence Interval

We call it the 95% Confidence Interval because we are fairly (95%) sure the true mean lies between 148.432 and 151.568

We can choose other Confidence Intervals

If we want to be 99% sure of the true mean, we use a WIDER Confidence Interval of ±2.57 Standard Errors

Then we say we are 99% sure that the true mean lies within 150 ± 2.056 (that is, 2.57 * 0.8), i.e. between 147.944 and 152.056
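
The multipliers for other confidence levels can be looked up from the normal distribution. The sketch below assumes a large sample, reuses the Standard Error of 0.8 and mean of 150 from the ball-bearing example, and uses a hypothetical helper name `multiplier_for`. The computed 99% multiplier is about 2.576, which the slide gives as 2.57, so the margin printed here is very slightly wider than 2.056.

```python
from statistics import NormalDist

def multiplier_for(confidence):
    """Number of Standard Errors either side of the mean for a two-sided interval."""
    tail = (1.0 - confidence) / 2.0          # probability left in each tail
    return NormalDist().inv_cdf(1.0 - tail)

SAMPLE_MEAN = 150.0     # from the ball-bearing example
STANDARD_ERROR = 0.8    # from the ball-bearing example

for confidence in (0.90, 0.95, 0.99):
    m = multiplier_for(confidence)
    margin = m * STANDARD_ERROR
    print(f"{confidence:.0%}: ±{m:.3f} Standard Errors -> "
          f"{SAMPLE_MEAN - margin:.3f} to {SAMPLE_MEAN + margin:.3f}")
```

The more confident we want to be, the wider the interval has to become.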

Small Samples - A Complication

The smaller the sample, the less accurate the estimate

Instead of using 1.96 times the Standard Error, we have to widen the margin

‘T-tables’ show how much we should widen it

In our example today, N-1 gives the appropriate ‘degrees of freedom’ to be used.

So, if we have a sample of 16 cases, the degrees of freedom = 16-1 = 15

This gives us the row of the table to use

See how T-distributions ‘flatten out’

[Sketches, not to scale: Normal; T-distribution 1 (N=30); T-distribution 2 (N=12)]

T-distributions change shape by sample size.

The normal distribution is shaped like a bell

The T-distributions are shaped more like a cymbal.

The larger the sample, the more bell-like the T-distribution becomes.

T-tables show that for N=16, there are N-1=15 degrees of freedom; so we use 2.13 Standard Errors instead of 1.96 Standard Errors for the 95% CI

T-DISTRIBUTION CRITICAL VALUES

Degrees of   P=0.05          P=0.01
freedom      (for use with   (for use with
             95% C.I.)       99% C.I.)
1            12.71           63.66
2            4.30            9.93
3            3.18            5.84
4            2.78            4.60
5            2.57            4.03
6            2.45            3.71
7            2.37            3.50
8            2.31            3.36
9            2.26            3.25
10           2.23            3.17
11           2.20            3.11
12           2.18            3.06
13           2.16            3.01
14           2.15            2.98
15           2.13            2.95
16           2.12            2.92
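
To show how the table is used, the sketch below stores the lecture's critical values in a dictionary and widens the Confidence Interval for a small sample. The sample figures (N = 16, mean 150, standard deviation 8) are hypothetical, as are the names `T_CRITICAL` and `small_sample_ci`.

```python
import math

# Critical values from the lecture's T-table:
# degrees of freedom -> (value for 95% C.I., value for 99% C.I.)
T_CRITICAL = {
    1: (12.71, 63.66), 2: (4.30, 9.93), 3: (3.18, 5.84), 4: (2.78, 4.60),
    5: (2.57, 4.03), 6: (2.45, 3.71), 7: (2.37, 3.50), 8: (2.31, 3.36),
    9: (2.26, 3.25), 10: (2.23, 3.17), 11: (2.20, 3.11), 12: (2.18, 3.06),
    13: (2.16, 3.01), 14: (2.15, 2.98), 15: (2.13, 2.95), 16: (2.12, 2.92),
}

def small_sample_ci(sample_mean, std_dev, n, level=95):
    """Confidence Interval using the T-table row for N-1 degrees of freedom."""
    degrees_of_freedom = n - 1
    if degrees_of_freedom not in T_CRITICAL:
        raise ValueError("this short table only covers 1 to 16 degrees of freedom")
    column = {95: 0, 99: 1}[level]
    t_value = T_CRITICAL[degrees_of_freedom][column]
    margin = t_value * std_dev / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical small sample: N = 16, so 15 degrees of freedom and 2.13 Standard Errors
lower, upper = small_sample_ci(sample_mean=150.0, std_dev=8.0, n=16)
print(f"95% Confidence Interval: {lower:.2f} to {upper:.2f}")
```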

Large samples reduce margins of error

The smaller the sample, the wider the Confidence Interval becomes in terms of Standard Errors.

But if N is large (at least 60 and preferably more than 120), Standard Errors are reduced (because we divide by a sizeable √N)

In addition, we do not have to increase the number of Standard Errors in the Confidence Interval from the basic ±1.96

Taken together, these factors push statisticians towards seeking large samples wherever possible, in order to reduce the margins of error.

Inferential Statistics

Putting our Descriptive Statistics to Work

Why Inferential Statistics differ from Descriptive Statistics

Means, variances, standard deviations and standard errors are Descriptive Statistics

Give anyone a set of figures and the formula and they should come up with the same answers

Inferential statistics can never tell you if something is true or not

They give you the balance of probabilities about whether something is true.

How we make inferences

Provided that the sets of data we are examining are distributed normally (more or less), we can make a number of inferences about how likely (or unlikely) specific events will be

Confidence Intervals are a part of Inferential Statistics - they do not tell us what the population mean IS, only that the population mean is likely to fall between certain limits

Inferential Statistics help us to distinguish likely events from unlikely events

Thus it is possible to run statistical tests on measurable samples of data

We select a probability ‘cut-off’ value (e.g. 95% probable versus 5% probable) and make judgements about how likely our outcome is

The ‘test statistic’ that we compute tells us whether we have observed a likely event (one that happens 95% of the time) or an unlikely one (one that happens less than 5% of the time)

Null Hypothesis And Alternative Hypothesis

We start with the assumption that nothing is proved - that there is no connection between sets of data, and everything has occurred by chance. This is called the NULL HYPOTHESIS

The ALTERNATIVE HYPOTHESIS is that something unlikely or “significant” links the data

If our test statistic tells us that we have observed an unlikely event, we REJECT the Null Hypothesis and ACCEPT the Alternative Hypothesis

Example: the ‘Paired’ T-test

Suppose that we give people a ‘treatment’ (training, or medication, or lessons)

We want to measure whether the ‘treatment’ has improved their results

Provided we can measure the outcome, we can test the same sample of people Before and After Treatment and we use the ‘Paired’ T-test

There are many other tests

The paired T-test is a simple test to explain

Other tests we will consider include tests for whether different samples have achieved different mean scores

And tests for whether a score on Variable 1 is linked (‘correlated’) to a score on Variable 2

Example: We give people some training and measure how scores differ after it

PERSON   SCORE BEFORE TRAINING   SCORE AFTER TRAINING   AFTER minus BEFORE
A        9                       12                     3
B        8                       14                     6
C        8                       15                     7
D        6                       10                     4
E        11                      13                     2
F        13                      11                     -2
G        16                      15                     -1
H        10                      12                     2
I        9                       9                      0
J        8                       10                     2

To calculate our ‘Paired’ T-test

Set up the Null Hypothesis: Any difference in scores after training has occurred by chance

Set up the Alternative Hypothesis: The difference in mean scores is statistically significant

Choose a decision level (‘alpha’): Normally 95% vs 5% (or 0.95 vs 0.05)

When to reject the Null Hypothesis

If we can show that a result at least as extreme as ours would occur LESS than 5% of the time when the Null Hypothesis is true, we can reject the Null Hypothesis

In which case, we accept the Alternative Hypothesis that the training has made a ‘significant’ difference

Otherwise, we retain the Null Hypothesis that the training did not change the mean score

Calculating the ‘test statistic’

For each test, we calculate a ‘test statistic’

Then we look in our tables to find out whether that number indicates a likely or an unlikely event

In the case of the Paired T-test, the formula for the test statistic is

T = (X̄ − μ) / Standard Error

The T-statistic (or ‘T-ratio’)

In T = (X̄ − μ) / Standard Error:

X̄ is the mean difference between before and after scores

μ is the expected mean difference between before and after scores, assuming the Null Hypothesis is true

Standard Error is the Standard Deviation / √N

What will μ be?

Calculations for our example

SAMPLE MEAN of ‘AFTER minus BEFORE’ column = 23/10 = 2.30

STANDARD DEVIATION (calculated in the same manner as last week) = 2.87

STANDARD ERROR = STDEV/SQRT(N) = 2.87/SQRT(10) = 0.91

T-statistic = (SAMPLE MEAN (2.30) - EXPECTED MEAN (0)) divided by the STD ERROR (0.91)

T = (2.30-0)/0.91 = 2.53

Again, why is μ 0?
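
These calculations can be reproduced from the before and after scores in the training example. The sketch below uses Python's standard `statistics` module; `statistics.stdev` divides by N-1, which is the version that matches the 2.87 quoted above. The variable names are hypothetical.

```python
import math
import statistics

# Scores for persons A to J from the training example
before = [9, 8, 8, 6, 11, 13, 16, 10, 9, 8]
after = [12, 14, 15, 10, 13, 11, 15, 12, 9, 10]

differences = [a - b for a, b in zip(after, before)]   # AFTER minus BEFORE
n = len(differences)

mean_diff = statistics.mean(differences)
std_dev = statistics.stdev(differences)   # sample Standard Deviation (N-1 divisor)
std_error = std_dev / math.sqrt(n)
expected_mean = 0.0                       # the Null Hypothesis: no change on average
t_statistic = (mean_diff - expected_mean) / std_error

print(f"mean difference    = {mean_diff:.2f}")
print(f"standard deviation = {std_dev:.2f}")
print(f"standard error     = {std_error:.2f}")
print(f"T-statistic        = {t_statistic:.2f}")
```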

What does all this mean?

Now that we have calculated that the T-statistic = 2.53, what happens?

We check this number against the appropriate row of the T-tables

The appropriate row will be N-1, or 9 degrees of freedom (because N=10)

If our T-statistic is less than the ‘critical value’ in the table, the Null Hypothesis stands

If our T-statistic is greater than the ‘critical value’ in the T-table, the Null Hypothesis falls

Bother, there are two columns in the T-tables …

T-DISTRIBUTION CRITICAL VALUES

Degrees of   P=0.05          P=0.01
freedom      (for use with   (for use with
             95% C.I.)       99% C.I.)
8            2.31            3.36
9            2.26            3.25
10           2.23            3.17

Remember we chose the .95 / .05 cut-off level in advance?

This means we use the left column

Our 2.53 ‘beats’ the Critical Value of 2.26 for 9 degrees of freedom
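
The comparison against the table can be written as a simple decision rule. This is a sketch under stated assumptions: the names `T_CRITICAL_05` and `reject_null` are hypothetical, and taking the absolute value of the T-statistic is an addition here so that a negative T (scores falling after training) is judged in the same way.

```python
# P=0.05 column of the T-table for the rows shown on this slide
T_CRITICAL_05 = {8: 2.31, 9: 2.26, 10: 2.23}

def reject_null(t_statistic, n, critical_values):
    """True if the T-statistic beats the critical value for N-1 degrees of freedom."""
    degrees_of_freedom = n - 1
    critical_value = critical_values[degrees_of_freedom]
    # abs() is an assumption added here so negative T-statistics are treated alike
    return abs(t_statistic) > critical_value

decision = reject_null(t_statistic=2.53, n=10, critical_values=T_CRITICAL_05)
print("Reject the Null Hypothesis" if decision else "The Null Hypothesis stands")
```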

Concluding the test

At our selected probability level, the T-statistic we have calculated is greater than the number in the table

Remind me what this means …

It means that we REJECT the Null Hypothesis

Our result is UNLIKELY to have occurred by chance

We conclude that the training HAS significantly changed the mean score

How much have we achieved?

Using probability theory and our test statistic, we have made an assessment of the effectiveness of our training

But note again that 95% significance is not certainty

We are going to be wrong 1 time in 20

In ‘life or death’ situations we may want to be 99% or even 99.9% sure

To be 99% sure, we use the right-hand column in the T-table for our ‘critical value’

Plenty to think about …

We have covered a lot of ground this week

The Null Hypothesis / Alternative Hypothesis approach is the same for all statistical tests

So is the idea of selecting the acceptable decision level (or ‘alpha’) in advance

But in other tests, we use different statistical calculations and different degrees of freedom to obtain our test statistic

And finally:

Suppose we had chosen the 99% / 1% cut-off level for our example, what would the result have been?

(pause for thought)…

Happy number-crunching!
