
Page 1: Chapter 6

Sampling Distributions

CHAPTER 6

Page 2: Chapter 6

Sample Mean and Variance

A statistic is a function of the sample observations that contains no unknown parameters.

The sampling distribution describes probabilities associated with a statistic when a random sample is drawn from a population.

The sampling distribution is the probability distribution or probability density function of the statistic.

Derivation of the sampling distribution is the first step in calculating a confidence interval or carrying out a hypothesis test for a parameter.

Page 3: Chapter 6

Example

Suppose that X1, ..., Xn are a simple random sample from a normally distributed population with expected value µ and known variance σ². Then the sample mean x̄ is a statistic used to give information about the population parameter µ; x̄ is normally distributed with expected value µ and variance σ²/n.

Principle of centrality for sampling distributions of means: The sample means tend to center around the population mean.

Principle of variability: The variability among the sample means decreases as the sample size increases.
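
Both principles are easy to check by simulation. The sketch below is illustrative only (the population values µ = 14 and σ = 2 are borrowed from the fracture-strength example later in the chapter); it draws repeated samples of two sizes and compares the spread of the resulting sample means.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 14.0, 2.0                      # population mean and standard deviation

for n in (10, 100):
    # 5000 samples of size n; one sample mean per row
    means = rng.normal(mu, sigma, size=(5000, n)).mean(axis=1)
    print(f"n={n:3d}  mean of sample means={means.mean():.3f}  "
          f"sd of sample means={means.std(ddof=1):.3f}  theory sd={sigma/np.sqrt(n):.3f}")
```

The sample means center near µ = 14 for both sample sizes, while their standard deviation shrinks from roughly σ/√10 to σ/√100, illustrating both principles.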

Page 4: Chapter 6

For a random variable X, the expected value is E(X) = µ and the variance is Var(X) = σ².

If the random variable X can take on any one of N values, each with probability 1/N, then the mean and variance become

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2,$$

where xi denotes the ith possible value of X.

These random variables can be considered as elements of a random sample from an infinite population having a probability distribution with mean µ and variance σ².

Page 5: Chapter 6

For a sample of size n, the sample mean is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

where xi is the ith sample observation.

To measure the variability of the sample, we might try a sample copy of the variance, namely

$$s_n^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2.$$

We might say that x̄ is a good estimator of µ.

The sample variance with denominator n is not a very good estimator of the population variance. Whereas random samples tend to center at their population mean, they tend to have less variability than the population they came from. To compensate for this fact we change the denominator from n to n − 1 and use

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2.$$
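
A quick simulation makes the bias concrete. This sketch is illustrative only (the normal population with σ² = 4 and the sample size are arbitrary choices); it compares the average of the n-denominator and (n − 1)-denominator variance estimates over many samples.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 0.0, 2.0, 10                 # population variance sigma**2 = 4

samples = rng.normal(mu, sigma, size=(20000, n))
var_n  = samples.var(axis=1, ddof=0)        # divide by n     (biased)
var_n1 = samples.var(axis=1, ddof=1)        # divide by n - 1 (unbiased)

print("average 1/n estimate    :", var_n.mean())    # tends to (n-1)/n * 4 = 3.6
print("average 1/(n-1) estimate:", var_n1.mean())   # tends to 4.0
```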

Page 6: Chapter 6

The sampling distribution of the mean is the probability distribution of the mean of a random sample. Its mean and variance can be easily calculated as follows: for independent observations X1, ..., Xn,

$$E(\bar{X}) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \mu, \qquad \operatorname{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{\sigma^2}{n}.$$

Page 7: Chapter 6

The central limit theorem states that the sampling distribution of the mean, for any set of independent and identically distributed random variables, will tend towards the normal distribution as the sample size gets larger. This may be restated as follows:

Page 8: Chapter 6

Central Limit Theorem

Central Limit Theorem: When n is sufficiently large, the sampling distribution of X̄ is well approximated by a normal curve, even when the population distribution is not itself normal.

We sometimes abbreviate the CLT to the phrase “X̄ is asymptotically normally distributed with mean µ and variance σ²/n.”

Therefore,

$$P(\bar{X} \le b) = P\!\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le \frac{b - \mu}{\sigma/\sqrt{n}}\right) \approx P\!\left(Z \le \frac{b - \mu}{\sigma/\sqrt{n}}\right),$$

where Z is a standard normal random variable.
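
As a quick check of the theorem, the sketch below (illustrative only; the exponential population is my own choice, deliberately non-normal) standardizes simulated sample means and compares one tail probability with the standard normal value.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu = sigma = 1.0                 # an Exponential(1) population has mean 1 and sd 1
n = 50

means = rng.exponential(scale=1.0, size=(20000, n)).mean(axis=1)
z = (means - mu) / (sigma / np.sqrt(n))     # standardized sample means

print("simulated P(Z <= 1):", np.mean(z <= 1.0))
print("normal    P(Z <= 1):", norm.cdf(1.0))   # about 0.841
```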

Page 9: Chapter 6

Example

The fracture strengths of a certain type of glass average 14 (in thousands of pounds per square inch) and have a standard deviation of 2.

A) What is the probability that the average fracture strength for 100 pieces of this glass exceeds 14.5?

B) Find an interval that includes the average fracture strength for 100 pieces of this glass with probability 0.95.

Page 10: Chapter 6

Example

The average fracture strength for 100 pieces has approximately a normal distribution with mean µ = 14 and standard deviation

$$\frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{100}} = 0.2.$$

Thus,

$$P(\bar{X} > 14.5) \approx P\!\left(Z > \frac{14.5 - 14}{0.2}\right) = P(Z > 2.5) = 0.5 - 0.4938 = 0.0062.$$
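
For reference, the same calculation done numerically (a small sketch using scipy.stats, not part of the original slides):

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 14.0, 2.0, 100
se = sigma / sqrt(n)                    # standard deviation of the sample mean, 0.2

p = norm.sf(14.5, loc=mu, scale=se)     # P(Xbar > 14.5)
print(round(p, 4))                      # about 0.0062
```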

Page 11: Chapter 6

The probability of seeing an average value (n=100) more than 0.5 unit above the population mean is, in this case, very small.

B) We have seen that

$$P\!\left(\mu - 1.96\,\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

for a normally distributed X̄. In this problem,

$$14 - 1.96\,\frac{2}{\sqrt{100}} \approx 13.6 \qquad\text{and}\qquad 14 + 1.96\,\frac{2}{\sqrt{100}} \approx 14.4.$$

Approximately 95% of the sample mean fracture strengths, for samples of size 100, should lie between 13.6 and 14.4.
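
The same interval can be read directly from the normal quantiles (an illustrative sketch, not from the slides):

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 14.0, 2.0, 100
se = sigma / sqrt(n)

lo, hi = norm.ppf([0.025, 0.975], loc=mu, scale=se)   # central 95% interval for Xbar
print(round(lo, 2), round(hi, 2))                     # about 13.61 and 14.39
```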

Page 12: Chapter 6

Sampling Distribution of Sums

Page 13: Chapter 6

The Normal Approximation to the Binomial Distribution

Requirements for a Binomial Distribution

- Fixed number of trials
- Trials are independent
- Each trial has 2 possible outcomes
- Probabilities remain constant (p and q, where q = 1 − p)

Formula:

$$P(X = x) = \binom{n}{x}\, p^x q^{n-x}$$

Mean = np, Variance = npq
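
A minimal check of the formula and the two moments (illustrative Python, assuming scipy is available; the values n = 20, p = 0.25 anticipate the example a few slides later):

```python
from math import comb
from scipy.stats import binom

n, p = 20, 0.25
q = 1 - p
x = 8

print(comb(n, x) * p**x * q**(n - x))       # P(X = 8) from the formula
print(binom.pmf(x, n, p))                   # same value from scipy
print(binom.mean(n, p), binom.var(n, p))    # np = 5.0 and npq = 3.75
```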

Page 14: Chapter 6

Example:

54% of people have answering machines. Sample 1000 households. What is the probability that more than 556 have answering machines?

P(X > 556) = P(X = 557) + P(X = 558) + P(X = 559) + ... + P(X = 999) + P(X = 1000)

We are going to use the normal distribution to approximate a binomial distribution.
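
Before using the approximation, note that the exact binomial tail can also be computed directly; the sketch below (illustrative, assuming scipy is available) gives the value the approximation should come close to, roughly 0.15.

```python
from scipy.stats import binom

n, p = 1000, 0.54

# exact P(X > 556): for a discrete distribution, sf(k) returns P(X > k)
print(binom.sf(556, n, p))
```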

Page 15: Chapter 6

Requirements for Using a Normal Distribution as an Approximation to a Binomial Distribution:

If np ≥ 5 and nq ≥ 5, then the binomial random variable is approximately normally distributed with mean μ = np and standard deviation σ = √(npq).

Continuity correction factor: we are using a continuous model to approximate a discrete model, so we need to make an adjustment for continuity. This is the continuity correction factor.

Because the normal distribution can take all real numbers (is continuous) but the binomial distribution can only take integer values (is discrete), a normal approximation to the binomial should identify the binomial event "8" with the normal interval "(7.5, 8.5)" (and similarly for other integer values).

Page 16: Chapter 6

Continuity Correction Factor

$$P(X \le x) \approx \Phi\!\left(\frac{x + 0.5 - np}{\sqrt{np(1-p)}}\right), \qquad P(X \ge x) \approx 1 - \Phi\!\left(\frac{x - 0.5 - np}{\sqrt{np(1-p)}}\right),$$

where Φ is the standard normal cumulative distribution function.
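
The two formulas translate directly into code. This is an illustrative sketch; the helper names are my own, not from the slides.

```python
from math import sqrt
from scipy.stats import norm

def p_at_most(x, n, p):
    """Normal approximation to P(X <= x) with continuity correction."""
    mu, sd = n * p, sqrt(n * p * (1 - p))
    return norm.cdf((x + 0.5 - mu) / sd)

def p_at_least(x, n, p):
    """Normal approximation to P(X >= x) with continuity correction."""
    mu, sd = n * p, sqrt(n * p * (1 - p))
    return 1 - norm.cdf((x - 0.5 - mu) / sd)
```

The worked examples on the next pages are simply these two functions evaluated at particular values of x, n, and p.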

Page 17: Chapter 6

Continuity correction factor

Example: If n=20 and p=.25, what is the probability that X is greater than or equal to 8?

The normal approximation without the continuity correction factor yields z = (8 − 20 × .25)/(20 × .25 × .75)^0.5 = 1.55, hence P(X ≥ 8) ≈ .0606.

The continuity correction factor requires us to use 7.5 in order to include 8, since the inequality is weak and we want the region to the right: z = (7.5 − 5)/(20 × .25 × .75)^0.5 = 1.29, hence the area under the normal curve to the right is .0985.

The exact binomial probability is .1019. Hence, for small n, the continuity correction factor gives a much better answer.
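
The three numbers above can be reproduced in a few lines (an illustrative sketch; values may differ slightly in the last digit from the table-based figures):

```python
from math import sqrt
from scipy.stats import binom, norm

n, p, x = 20, 0.25, 8
mu, sd = n * p, sqrt(n * p * (1 - p))

print("no correction  :", norm.sf((x - mu) / sd))        # about 0.061
print("with correction:", norm.sf((x - 0.5 - mu) / sd))  # about 0.098
print("exact binomial :", binom.sf(x - 1, n, p))         # about 0.102
```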

Page 18: Chapter 6

Example: 54% of people have answering machines. Sample 1000 households. Estimate the probability that more than 556 have answering machines.

1) Test whether the normal approximation is appropriate: np = 0.54 × 1000 = 540 and nq = 0.46 × 1000 = 460, and both are greater than 5.

2) Find the mean and the standard deviation: µ = np = 540 and σ = √(npq) = √(1000 × 0.54 × 0.46) ≈ 15.76.

3) Draw the normal curve and identify the region representing the probability to be found.

Page 19: Chapter 6

Example

4) Apply the continuity correction factor.

5) Estimate the probability:

P(N > 556) = P(N ≥ 557) ≈ P(X ≥ 556.5) = P((X − µ)/σ ≥ (556.5 − 540)/15.76) = P(Z ≥ 1.05) = 0.5 − 0.3531 = 0.1469

What about the probability that fewer than 519 have answering machines?

P(N < 519) = P(N ≤ 518) ≈ P(X ≤ 518.5) = P((X − µ)/σ ≤ (518.5 − 540)/15.76) = P(Z ≤ −1.36) = 0.5 − 0.4131 = 0.0869
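
The same steps in code, with the exact binomial probabilities printed alongside for comparison (an illustrative sketch, not part of the slides):

```python
from math import sqrt
from scipy.stats import binom, norm

n, p = 1000, 0.54
mu, sd = n * p, sqrt(n * p * (1 - p))    # 540 and about 15.76

# P(N > 556) = P(N >= 557): continuity correction uses 556.5
print(norm.sf((556.5 - mu) / sd), binom.sf(556, n, p))

# P(N < 519) = P(N <= 518): continuity correction uses 518.5
print(norm.cdf((518.5 - mu) / sd), binom.cdf(518, n, p))
```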

Page 20: Chapter 6

• A sample statistic used to estimate an unknown population parameter is called an estimate.

• The discrepancy between the estimate and the true parameter value is known as sampling error.

• A statistic is a random variable with a probability distribution, called the sampling distribution, which is generated by repeated sampling.

• We use the sampling distribution of a statistic to assess the sampling error in an estimate.

Sampling Distributions

Page 21: Chapter 6

The Sampling Distribution of the Sample Variance

There is no analog of the CLT for S² that gives a large-sample approximation for an arbitrary population distribution.

The exact distribution of S² can be derived when the observations are i.i.d. normal.

The sampling distribution of S² is needed to infer the variability of a population from the variability of its sample. The simplest case is when the population has a normal distribution.

Page 22: Chapter 6

Chi-Square Distribution

Theorem. If S² is the variance of a random sample of size n taken from a normal population having variance σ², then the statistic

$$\chi^2 = \frac{(n-1)S^2}{\sigma^2}$$

has a chi-squared distribution with v = n − 1 degrees of freedom.
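
A simulation check of the theorem (illustrative only; the population parameters are arbitrary): it compares the empirical distribution of (n − 1)S²/σ² with one χ²(n − 1) quantile.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
mu, sigma, n = 0.0, 3.0, 25

s2 = rng.normal(mu, sigma, size=(20000, n)).var(axis=1, ddof=1)
u = (n - 1) * s2 / sigma**2              # should follow chi-square with n-1 df

q = chi2.ppf(0.95, df=n - 1)             # theoretical 95th percentile
print("fraction of simulated values below q:", np.mean(u < q))   # close to 0.95
```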

Page 23: Chapter 6
Page 24: Chapter 6


Mean of S²

The χ² distribution has mean equal to its degrees of freedom, n − 1, and variance equal to twice its degrees of freedom, 2(n − 1).

From this information we find the mean and variance of S². For the mean,

$$E\!\left(\frac{(n-1)S^2}{\sigma^2}\right) = n - 1, \qquad\text{therefore}\qquad \frac{n-1}{\sigma^2}\,E(S^2) = n - 1 \quad\text{and}\quad E(S^2) = \sigma^2.$$

Page 25: Chapter 6

Variance of S²

$$V\!\left(\frac{(n-1)S^2}{\sigma^2}\right) = 2(n-1), \qquad\text{therefore}\qquad \frac{(n-1)^2}{\sigma^4}\,V(S^2) = 2(n-1) \quad\text{and}\quad V(S^2) = \frac{2\sigma^4}{n-1}.$$
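
Both results are easy to verify by simulation (an illustrative sketch; the population standard deviation σ = 10 is my own choice):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n = 0.0, 10.0, 25            # population variance sigma**2 = 100

s2 = rng.normal(mu, sigma, size=(50000, n)).var(axis=1, ddof=1)

print("E(S^2): simulated", s2.mean(), "  theory", sigma**2)                     # 100
print("V(S^2): simulated", s2.var(ddof=1), "  theory", 2 * sigma**4 / (n - 1))  # about 833
```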

Page 26: Chapter 6

Example

For a certain launching mechanism, the distances by which the projectile misses the target center have a normal distribution with variance σ² = 100 square meters. An experiment involving n = 25 launches is to be conducted. Let S² denote the sample variance of the distances between the impact of the projectile and the target center.

Approximate P(S² > 50). Find E(S²) and V(S²).

Let U = (n − 1)S²/σ², which has a χ²(24) distribution for n = 25. Then P(S² > 50) = P((n − 1)S²/σ² > 24 × 50/100) = P(U > 12), which from Table 6 is a little larger than 0.975.

Page 27: Chapter 6

Example

We know that E(S²) = σ² = 100 and V(S²) = 2σ⁴/(n − 1) = 2(100)²/24 = 10000/12 ≈ 833.3.
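
The chi-square tail probability can also be obtained numerically instead of from Table 6 (an illustrative sketch using scipy):

```python
from scipy.stats import chi2

sigma2, n = 100.0, 25
df = n - 1

# P(S^2 > 50) = P(U > (n-1)*50/sigma^2) = P(U > 12) for U ~ chi-square(24)
print(chi2.sf(12.0, df))                    # a bit above 0.975, as the table lookup suggests
print("E(S^2) =", sigma2)                   # 100
print("V(S^2) =", 2 * sigma2**2 / (n - 1))  # 10000/12, about 833.3
```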