
Lecture 3: The Normal Distribution and Statistical Inference

Sandy Eckel
[email protected]

24 April 2008


A Review and Some Connections

The Normal Distribution

The Central Limit Theorem

Estimates of means and proportions: uses and properties

Confidence intervals and Hypothesis tests


The Normal Distribution

A probability distribution for continuous data

Characterized by a symmetric bell-shaped curve (Gaussian curve)

Symmetric about its mean µ

Under certain conditions, can be used to approximate the Binomial(n, p) distribution

The usual rule of thumb: np > 5 and n(1 − p) > 5
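As a quick numerical check of this approximation (a sketch using Python's scipy rather than the course's R; the values n = 50, p = 0.3 are made up for illustration):

```python
from scipy.stats import binom, norm

# Hypothetical example: Binomial(50, 0.3) satisfies the rule of thumb,
# since np = 15 > 5 and n(1-p) = 35 > 5
n, p = 50, 0.3
mu = n * p                          # mean of the Binomial
sigma = (n * p * (1 - p)) ** 0.5    # sd of the Binomial

exact = binom.cdf(20, n, p)             # exact P(S <= 20)
approx = norm.cdf(20.5, mu, sigma)      # normal approximation (with continuity correction)
print(round(exact, 3), round(approx, 3))
```

The 20.5 (rather than 20) is the standard continuity correction for approximating a discrete distribution by a continuous one.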


Normal Distribution

[Figure: normal density f(x) over x, symmetric about µ, extending from −∞ to +∞]

Takes on values between −∞ and +∞

Mean = Median = Mode

Area under curve equals 1

Notation for a Normal random variable: X ∼ N(µ, σ²)

Parameters:
µ = mean
σ = standard deviation

Formula: Normal Probability Density Function (pdf)

[Figure: normal density f(x) over x, from −∞ to +∞, centered at µ]

The normal probability density function for X ∼ N(µ, σ²) is:

f(x) = 1/(σ√(2π)) · e^(−(x−µ)²/(2σ²)),   −∞ < x < +∞

Note: π ≈ 3.14 and e ≈ 2.72 are mathematical constants

Standard Normal

Definition: a Normal distribution N(µ, σ²) with parameters µ = 0 and σ = 1

Its density function is written as:

f(x) = 1/√(2π) · e^(−x²/2),   −∞ < x < +∞

We typically use the letter Z to denote a standard normal random variable: Z ∼ N(0, 1)

Important! We use the standard normal all the time because if X ∼ N(µ, σ²), then (X − µ)/σ ∼ N(0, 1)

This process is called "standardizing" a normal random variable
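A quick numerical illustration of standardizing (a sketch in Python/scipy rather than the course's R; the values µ = 3000, σ = 1000 anticipate the birthweight example used later):

```python
from scipy.stats import norm

mu, sigma = 3000, 1000
x = 3500
z = (x - mu) / sigma               # standardize: z = 0.5

direct = norm.cdf(x, loc=mu, scale=sigma)   # P(X <= 3500) using N(3000, 1000^2)
via_z = norm.cdf(z)                         # P(Z <= 0.5) using N(0, 1)
print(round(direct, 4), round(via_z, 4))    # both give 0.6915
```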


68-95-99.7 Rule I

68% of the density is within one standard deviation of the mean

[Figure: normal density with the region from µ − 1σ to µ + 1σ shaded; central area 0.68, with 0.16 in each tail]


68-95-99.7 Rule II

95% of the density is within two standard deviations of the mean

[Figure: normal density with the region from µ − 2σ to µ + 2σ shaded; central area 0.95, with 0.025 in each tail]


68-95-99.7 Rule III

99.7% of the density is within three standard deviations of the mean

[Figure: normal density with the region from µ − 3σ to µ + 3σ shaded; central area 0.997, with 0.0015 in each tail]
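The three rules can be verified directly from the standard normal CDF (a Python/scipy sketch; in the course's R, pnorm plays the same role as norm.cdf):

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    mass = norm.cdf(k) - norm.cdf(-k)
    print(k, round(mass, 4))
# 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973
```

(The exact "two standard deviations" figure is 95.45%; the familiar 95% corresponds to 1.96 standard deviations.)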


Different Means

[Figure: three normal density curves with means µ1, µ2, µ3]

Three normal distributions with different means: µ1 < µ2 < µ3


Different Standard Deviations

[Figure: three normal density curves with standard deviations σ1, σ2, σ3]

Three normal distributions with different standard deviations: σ1 < σ2 < σ3


Standard Normal N(0,1)

[Figure: standard normal density over the range −4 to 4, with µ = 0 and σ = 1]


Example: Birthweights (in grams) of infants in a population

[Figure: density of birthweights over the range 0 to 6000 grams]

Continuous data

Mean = Median = Mode = 3000 = µ

Standard deviation = 1000 = σ

The area under the curve represents the probability (proportion) of infants with birthweights between certain values


Normal Probabilities

We are often interested in the probability that Z takes on values between z0 and z1:

P(z0 ≤ Z ≤ z1) = ∫ from z0 to z1 of 1/√(2π) · e^(−z²/2) dz

How do we calculate this probability?

This is equivalent to finding the area under the curve
The distribution is continuous, so we cannot use sums to find probabilities
Performing the integration is not necessary, since tables and computers are available


Z Tables


But...we’ll use R

For standard normal random variables Z ∼ N(0, 1) we'll use
1. pnorm(?) to find P(Z ≤ ?)
2. pnorm(?, lower.tail=F) to find P(Z ≥ ?)

For any normal random variable X ∼ N(µ, σ²)
(but taking X ∼ N(2, 3²) as an example) we'll use
1. pnorm(?, mean=2, sd=3) to find P(X ≤ ?)
2. pnorm(?, mean=2, sd=3, lower.tail=F) to find P(X ≥ ?)
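For readers not using R, the same probabilities are available from Python's scipy (an aside, not part of the course; norm.sf is the upper tail, the counterpart of lower.tail=F):

```python
from scipy.stats import norm

# Standard normal, Z ~ N(0, 1)
print(round(norm.cdf(1.5), 4))   # P(Z <= 1.5), like pnorm(1.5)               -> 0.9332
print(round(norm.sf(1.5), 4))    # P(Z >= 1.5), like pnorm(1.5, lower.tail=F) -> 0.0668

# General normal, X ~ N(2, 3^2)
print(round(norm.cdf(5, loc=2, scale=3), 4))   # like pnorm(5, mean=2, sd=3)  -> 0.8413
```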


Example: Birthweights (in grams)

[Figure: density of birthweights over the range 0 to 6000 grams]

µ = 3000
σ = 1000
X = birthweight

Z = (X − µ)/σ


Question I

What is the probability of an infant weighing more than 5000g?

P(X > 5000) = P((X − µ)/σ > (5000 − 3000)/1000)
            = P(Z > 2)
            = 0.0228

Get this using pnorm(2, lower.tail=F) (since we standardized)
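The same calculation in Python/scipy (a sketch mirroring the R call above), both with and without standardizing:

```python
from scipy.stats import norm

mu, sigma = 3000, 1000
p_std = norm.sf((5000 - mu) / sigma)            # P(Z > 2), after standardizing
p_direct = norm.sf(5000, loc=mu, scale=sigma)   # P(X > 5000) without standardizing
print(round(p_std, 4), round(p_direct, 4))      # 0.0228 0.0228
```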


Question II

What is the probability of an infant weighing less than 3500g?

P(X < 3500) = P((X − µ)/σ < (3500 − 3000)/1000)
            = P(Z < 0.5)
            = 0.6915


Question III

What is the probability of an infant weighing between 2500 and 4000g?

P(2500 < X < 4000) = P((2500 − 3000)/1000 < (X − µ)/σ < (4000 − 3000)/1000)
                   = P(−0.5 < Z < 1)
                   = 1 − P(Z > 1) − P(Z < −0.5)
                   = 1 − 0.1587 − 0.3085
                   = 0.5328
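The same answer falls out of a difference of two CDF values (a Python/scipy sketch; either decomposition works):

```python
from scipy.stats import norm

mu, sigma = 3000, 1000
# P(2500 < X < 4000) as a difference of two CDF values
p = norm.cdf(4000, mu, sigma) - norm.cdf(2500, mu, sigma)
# Same answer as the slide's tail decomposition
p_tails = 1 - norm.sf(1) - norm.cdf(-0.5)
print(round(p, 4), round(p_tails, 4))   # 0.5328 0.5328
```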


Statistical Inference

Populations and samples

Sampling distributions


Definitions

Statistical inference is "the attempt to reach a conclusion concerning all members of a class from observations of only some of them." (Runes 1959)

A population is a collection of observations

A parameter is a numerical descriptor of a population

A sample is a part or subset of a population

A statistic is a numerical descriptor of the sample


Population vs. Sample

Population

population size = N

µ = mean, a measure of center

σ2 = variance, a measure of dispersion

σ = standard deviation

A sample from the population is used to calculate sample estimates (statistics) that approximate population parameters

sample size = n

X̄ = sample mean

s2 = sample variance

s = sample standard deviation

Population: parameters

Sample: statistics

Estimating the population mean, µ

Usually µ is unknown and we would like to estimate it

We use X̄ to estimate µ

We know the sampling distribution of X̄

Definition: Sampling distribution
The distribution of all possible values of some statistic, computed from samples of the same size randomly drawn from the same population, is called the sampling distribution of that statistic


Sampling Distribution of X̄

[Figure, left: population distribution of X, with X ∼ N(µ, σ²). Right: sampling distribution of the sample mean, X̄ ∼ N(µ, σ²/n), shown for n = 10, 30, 100; the curves narrow as n grows]

When sampling from a normally distributed population:

X̄ will be normally distributed
The mean of the distribution of X̄ is equal to the true mean µ of the population from which the samples were drawn
The variance of the distribution is σ²/n, where σ² is the variance of the population and n is the sample size
We can write: X̄ ∼ N(µ, σ²/n)

When sampling from a population whose distribution is not normal and the sample size is large, use the Central Limit Theorem

The Central Limit Theorem (CLT)

Given a population of any distribution with mean µ and variance σ², the sampling distribution of X̄, computed from samples of size n from this population, will be approximately N(µ, σ²/n) when the sample size is large

In general, this applies when n ≥ 25

The approximation of normality becomes better as n increases
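A small simulation illustrates the theorem (a sketch with made-up settings: an Exponential(1) population, which is clearly not normal, sampled with n = 50):

```python
import random
import statistics

random.seed(1)                      # for reproducibility
n, reps = 50, 2000                  # sample size, number of repeated samples
# Exponential(1) population: mean mu = 1, variance sigma^2 = 1

means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

# The CLT says these sample means are approximately N(1, 1/50)
print(round(statistics.fmean(means), 2))    # close to mu = 1
print(round(statistics.stdev(means), 2))    # close to sigma/sqrt(n) = 0.14
```

A histogram of the 2000 means would look bell-shaped even though the population itself is strongly right-skewed.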


What if a random variable has a Binomial distribution?

First, recall that a Binomial variable is just the sum of n Bernoulli variables: Sn = X1 + · · · + Xn

Notation:

Sn ∼ Binomial(n, p)
Xi ∼ Bernoulli(p) = Binomial(1, p) for i = 1, . . . , n

In this case, we want to estimate p by p̂, where

p̂ = Sn/n = (X1 + · · · + Xn)/n = X̄

p̂ is just a sample mean!

So we can use the Central Limit Theorem when n is large


Binomial CLT

For a Bernoulli variable:

µ = mean = p
σ² = variance = p(1 − p)

X̄ ≈ N(µ, σ²/n) as before

Equivalently, p̂ ≈ N(p, p(1 − p)/n)
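Numerically (a sketch; the values p = 0.3 and n = 100 are made up):

```python
from scipy.stats import norm

p, n = 0.3, 100
se = (p * (1 - p) / n) ** 0.5      # sqrt(p(1-p)/n), about 0.0458

# CLT approximation: p_hat ~ N(0.3, 0.0458^2), so for example
print(round(norm.cdf(0.35, loc=p, scale=se), 3))   # approx P(p_hat <= 0.35)
```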


Distribution of Differences

Often we are interested in detecting a difference between twopopulations

Differences in average income by neighborhood

Differences in disease cure rates by age


Distribution of Differences: Notation

Population 1:

Size = N1

Mean = µ1

Standard deviation = σ1

Population 2:

Size = N2

Mean = µ2

Standard deviation = σ2

Samples of size n1 from Population 1:

Mean = µ_X̄1 = µ1

Standard deviation = σ_X̄1 = σ1/√n1

Samples of size n2 from Population 2:

Mean = µ_X̄2 = µ2

Standard deviation = σ_X̄2 = σ2/√n2


Distribution of Differences: CLT result

Now by CLT, for large n:

X̄1 ∼ N(µ1, σ1²/n1)

X̄2 ∼ N(µ2, σ2²/n2)

and X̄1 − X̄2 ≈ N(µ1 − µ2, σ1²/n1 + σ2²/n2)


Difference in proportions?

We're done if the underlying variable is continuous. What if the underlying variable is Binomial?

Then X̄1 − X̄2 ≈ N(µ1 − µ2, σ1²/n1 + σ2²/n2) is replaced by:

p̂1 − p̂2 ≈ N(p1 − p2, p1(1 − p1)/n1 + p2(1 − p2)/n2)
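As a worked sketch of this result (the cure rates p1 = 0.6, p2 = 0.5 and the sample sizes are hypothetical, chosen only to illustrate the formula):

```python
from scipy.stats import norm

p1, n1 = 0.60, 200     # hypothetical cure rate and sample size, group 1
p2, n2 = 0.50, 250     # hypothetical cure rate and sample size, group 2

# CLT: p1_hat - p2_hat ~ N(p1 - p2, p1(1-p1)/n1 + p2(1-p2)/n2)
var_diff = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2
se_diff = var_diff ** 0.5

# Approximate probability that the observed difference comes out positive
print(round(norm.sf(0, loc=p1 - p2, scale=se_diff), 3))
```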


Summary of Sampling Distributions

Statistic        Mean       Variance
X̄                µ          σ²/n
X̄1 − X̄2         µ1 − µ2    σ1²/n1 + σ2²/n2
p̂                p          pq/n
np̂               np         npq
p̂1 − p̂2         p1 − p2    p1q1/n1 + p2q2/n2

(where q = 1 − p)


Statistical inference

Two methods

Estimation (confidence intervals)
Hypothesis testing

Both make use of sampling distributions

Remember to use CLT


Rest of material moved to lecture 4

We didn't get a chance to cover the rest of the material, so it has been moved to lecture 4.


Lecture 3 Summary

The Normal Distribution

The Central Limit Theorem

Sampling distributions

Next time, we’ll discuss

Confidence intervals for population parameters

The t-distribution

Hypothesis testing (p-values)
