Copyright (c) Bani Mallick

Lecture 4

Stat 651


Topics in Lecture #4

Probability

The bell-shaped (normal) curve

Normal probability plots (the q-q plot) to check for normality of continuous data

Use of Table 1 in the back of the book


Topics in Lecture #4

Normal probability calculations

Data Transformations

Sampling distributions: sample means are random variables!

Standard error of the sample mean

Central Limit Theorem

A simple confidence interval


Book Sections Covered in Lecture #4

Chapter 4.10, in detail

Chapter 4.11 (read on your own)

Chapter 4.12, in detail

Chapter 5.1

Chapter 5.2


Lecture 3 Review

Box plots are probably the best way to compare populations graphically

You can detect shifts and changes in variation

Also identifies outliers


Lecture 3 Review

q-q plots are a simple way to understand whether the data are approximately bell-shaped

Population Relative Frequency Histogram

Bell-shaped curve!!

[Figure: normal density curve. Horizontal axis: X, from -4 to 4; vertical axis: Normal Density.]


Lecture 3 Review

q-q plots are a simple way to understand whether the data are approximately bell-shaped

If they are sort of straight, then normality of the population relative frequency histogram is not too badly off


q-q plot for the healthy women

[Figure: Normal Q-Q Plot of Log(Saturated Fat). Horizontal axis: Observed Value, from 1.0 to 4.5; vertical axis: Expected Normal Value, from 1.5 to 4.5.]


Lecture 3 Review

For bell-shaped populations, we have empirical rules

Approximately 68% (90%) (95%) of the population lies within 1 (1.645) (1.96) population standard deviations of the population mean
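
The percentages in the empirical rule can be checked numerically. The following is a minimal sketch in Python rather than a Table 1 lookup; the helper name `within_k_sd` is made up for illustration.

```python
from statistics import NormalDist


def within_k_sd(k: float) -> float:
    """Fraction of a normal population within k standard deviations of its mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


# The three multipliers quoted in the empirical rule
for k in (1.0, 1.645, 1.96):
    print(f"within {k} standard deviations: {within_k_sd(k):.4f}")
```

The fractions do not depend on the particular mean or standard deviation, which is why one standard normal table is enough.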


Lecture 3 Review

In many of our examples, we have seen that there look to be differences among populations. How can we tell if the differences are real?

We will say that populations are different if the differences we observe are more than can be expected by sample-to-sample variability.


Lecture 3 Review

A random variable is any outcome (qualitative or numerical) of an experiment involving random sampling from a population

The idea of a model is to write down a formula for the population histogram as a function of 1-2 parameters which are estimated from the data.

If you know the parameters of the model, then you know everything about probabilities in that population


Using the Normal Model

The entire point of the normal model is to make probability statements

In practice, we estimate the population mean by the sample mean

We estimate the population standard deviation by the sample standard deviation

Then we estimate probabilities, by pretending the sample quantities = the population ones


Various Cases

Suppose we want to know what % of a population lies below a specified value, c

We write this by asking: what is

Pr(X < c)

The value c is any arbitrary value, e.g., 6

X is any random variable with a population mean and a population standard deviation


Pr(X < c) for Normal Populations

Compute the z-score: z = (c - μ)/σ

Look up value in Table 1, page 1091

(white board explanation)


Mechanics

NHANES: suppose healthy women’s ages are normally distributed with mean = 40 and standard deviation = 6

What is the chance that a randomly selected person from this population is aged c = 43.3 or less?

We write this in symbols as pr(X < 43.3)


Mechanics

μ = 40, σ = 6

pr(X < 43.3) is what we want

z = (43.3 - μ)/σ = (43.3 - 40)/6 = 0.55 = z-score

Look up in Table 1:

The value 0.55 is on page 1092: first column is 0.5, first row is 0.05: add them to get 0.55, and look up the value

Pr(X < 43.3) = 0.7088
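
The same calculation can be sketched in Python as a check on the table lookup. This assumes the normal model from the slide (mean 40, standard deviation 6); the variable names are illustrative only.

```python
from statistics import NormalDist

mu, sigma = 40, 6   # population mean and standard deviation from the slide
c = 43.3            # cutoff age

z = (c - mu) / sigma            # z-score
p_below = NormalDist().cdf(z)   # standard normal area to the left of z, as tabulated in Table 1

print(f"z = {z:.2f}, Pr(X < {c}) = {p_below:.4f}")
```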


Various Cases

Suppose we want to know what % of a population lies above a specified value, c

We write this by asking: what is

Pr(X > c)

The value c is any arbitrary value, e.g., 6

X is any random variable with a population mean and a population standard deviation


Pr(X > c) for Normal Populations

This is simply 1 – Pr(X <= c).

Compute the z-score (c - μ)/σ

Look up the value for z in Table 1

Subtract this value from 1.0


Mechanics

μ = 40, σ = 6

Chance that a randomly selected person from this population is aged 46 or more

pr(X > 46)

z = (46 - μ)/σ = (46 - 40)/6 = 1

Look up in Table 1 for 1.00: get 0.8413

Because you are asking for > 46, subtract from 1 to get pr(X > 46) = 1 – 0.8413 = .1587


Mechanics

μ = 40, σ = 6

Chance that a randomly selected person from this population is aged 46 or less

pr(X <= 46)

z = (46 - μ)/σ = (46 - 40)/6 = 1

Look up in Table 1: chance is 84.13%


Mechanics

μ = 40, σ = 6

Chance that a randomly selected person from this population is aged 34 or less

pr(X <= 34)

z = (34 - μ)/σ = (34 - 40)/6 = -1

Look up in Table 1: chance is 0.1587 = 15.87%
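
The three age calculations (46 or more, 46 or less, 34 or less) can be reproduced in a few lines. This is a sketch under the slide's normal model with mean 40 and standard deviation 6; the variable names are made up for illustration.

```python
from statistics import NormalDist

ages = NormalDist(mu=40, sigma=6)   # normal model for healthy women's ages

p_46_or_less = ages.cdf(46)         # area to the left of z = 1
p_46_or_more = 1 - ages.cdf(46)     # upper tail: subtract the lower area from 1
p_34_or_less = ages.cdf(34)         # area to the left of z = -1

print(f"Pr(X <= 46) = {p_46_or_less:.4f}")
print(f"Pr(X > 46)  = {p_46_or_more:.4f}")
print(f"Pr(X <= 34) = {p_34_or_less:.4f}")
```

Because the bell-shaped curve is symmetric about the mean, the area above z = 1 equals the area below z = -1, so the last two probabilities agree.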


Aortic Stenosis Data

Two populations: healthy kids and kids with aortic stenosis

Two outcomes: body surface area and aortic valve area

Size-adjusted aortic valve area is the ratio of aortic valve area to body surface area


Stenosis Data, AVA to BSA Ratio

Note the huge outlier in the stenotic kids.

He/she has a huge aortic valve area relative to his/her body size

[Figure: box plots of AVA to BSA Ratio (vertical axis, from -2 to 8) by Health Status (Healthy, Stenotic); group sizes printed as N = 70 and 56.]



Healthy kids and AVA/BSA Ratio

Sample mean = 1.38, s = 0.51

Let’s pretend the population has μ = 1.4, σ = 0.5

As it turns out, the sample mean of stenotic kids is 0.7

So, let’s ask: for healthy kids, what is

pr(X < 0.7)?



Healthy kids and AVA/BSA Ratio

μ = 1.4, σ = 0.5

For healthy kids, what is pr(X <= 0.7)?

z = (0.7 - μ)/σ = (0.7 - 1.4)/0.5 = -1.4

look up in Table 1

You should get 0.0808



For healthy kids, pr(X <= 0.7) = 0.0808

Stenotic kids have a mean ava/bsa ratio of 0.7

Thus, the average stenotic kid has a lower ava/bsa ratio than 91.92% of healthy kids

91.92% = 100% - 8.08%
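
The comparison can be sketched in Python, plugging in the rounded values (mean 1.4, standard deviation 0.5) as if they were the population quantities, exactly as the slides do. The variable names are illustrative only.

```python
from statistics import NormalDist

healthy = NormalDist(mu=1.4, sigma=0.5)   # pretend these are the population values
stenotic_mean = 0.7                        # sample mean ratio for stenotic kids

p_below = healthy.cdf(stenotic_mean)       # Pr(X <= 0.7) for healthy kids
pct_above = 100 * (1 - p_below)            # % of healthy kids with a higher ratio

print(f"Pr(X <= 0.7) = {p_below:.4f}")
print(f"{pct_above:.2f}% of healthy kids have a higher AVA/BSA ratio")
```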


Not all Data are Normally Distributed

“Time to an event”, e.g., time to a heart attack

Number of things that happen, e.g., number of heart attacks

These typically have a skewed shape

[Figure: a skewed density curve. Horizontal axis: X, from -1 to 6; vertical axis: DENSITY.]



These typically have a skewed shape

Statisticians have special models to handle this (Gamma, Poisson)

You will usually try to eliminate some of the skewness by data transformation




The standard data transformations are

Square root

Logarithm: but if you have zeros in the data set, you have to add a small constant, since log(0) = -∞
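
To see what these transformations do, here is a small sketch in Python. The data are made up for illustration, the added constant of 1 is an assumption (the slides only say "a small constant"), and `skewness` is a simple moment-based helper written for this example.

```python
import math


def skewness(values):
    """Moment-based skewness: near 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


counts = [0, 1, 1, 2, 2, 3, 4, 6, 10, 25]   # made-up right-skewed data containing a zero
shift = 1                                    # assumed constant, added so log(0) never occurs

sqrt_counts = [math.sqrt(v) for v in counts]
log_counts = [math.log(v + shift) for v in counts]

print(f"raw skewness:         {skewness(counts):.2f}")
print(f"square-root skewness: {skewness(sqrt_counts):.2f}")
print(f"log skewness:         {skewness(log_counts):.2f}")
```

On these numbers the logarithm pulls in the long right tail more strongly than the square root; which transformation works best depends on the data at hand.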


Inference

The basic building blocks for inference are statistics

Let’s start with the population mean μ, the sample mean X̄, and the sample standard deviation s

Standard error (of the mean) is s/√n


Inference

The sample mean is a random variable

This means that it varies from sample to sample

Of course, if we were able to “sample” the entire population, the sample mean would equal the population mean


Inference


The sample mean's own “population” mean is μ

Its standard deviation is σ/√n

Note how the standard deviation of the sample mean becomes smaller as the sample size becomes larger

Why does this make sense?
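
One way to build intuition for σ/√n is to simulate it: draw many samples, compute each sample mean, and look at how much those means vary. The sketch below assumes a normal population with mean 40 and standard deviation 6 (borrowed from the age example); the seed, the number of repetitions, and the function name are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(651)        # fixed seed so the run is repeatable
mu, sigma = 40, 6       # assumed population mean and standard deviation
reps = 4000             # number of repeated samples per sample size


def sd_of_sample_means(n):
    """Draw `reps` samples of size n and return the standard deviation of their means."""
    means = [
        statistics.fmean([random.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.pstdev(means)


results = {n: sd_of_sample_means(n) for n in (4, 16, 64)}
for n, sd in results.items():
    print(f"n = {n:2d}: simulated SD = {sd:.3f}, sigma/sqrt(n) = {sigma / math.sqrt(n):.3f}")
```

Each time the sample size is multiplied by 4, the spread of the sample means is roughly halved, because averaging more observations cancels out more of the individual variation.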


Central Limit Theorem


The sample mean's own “population” mean is μ

Its standard deviation is σ/√n

In “large enough” samples, the sample mean is very nearly normally distributed, i.e., has a bell-shaped histogram

What does this mean?
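
A simulation can illustrate what the Central Limit Theorem says. The sketch below starts from a strongly skewed population (an exponential distribution, chosen here only as an example) and tracks the skewness of the sample mean as the sample size grows; a bell-shaped histogram has skewness near 0. The seed, sample sizes, and helper names are assumptions made for illustration.

```python
import random
import statistics

random.seed(651)   # fixed seed so the run is repeatable
reps = 4000        # number of repeated samples per sample size


def skewness(values):
    """Moment-based skewness: near 0 for a symmetric, bell-shaped histogram."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def sample_mean(n):
    """Mean of n draws from a skewed (exponential) population."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])


skew_by_n = {n: skewness([sample_mean(n) for _ in range(reps)]) for n in (1, 5, 50)}
for n, sk in skew_by_n.items():
    print(f"n = {n:2d}: skewness of the sample mean = {sk:.2f}")
```

Even though individual observations are far from normal, the histogram of sample means becomes steadily more symmetric as n increases.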


Warning

It is incredibly easy to have difficulty understanding that the sample mean is itself a random variable

But it is the crucial concept

If I take repeated samples and compute the sample mean each time, I will not get the same number.

Thus, the sample mean is a random variable


Women’s Interview Survey of Health

Funny case-control study

Seemed to indicate that those women who ate a lot of non-chocolate sweets were at higher risk of breast cancer

271 control women were interviewed about their diets

They completed 6 24-hour recalls



271 control women were interviewed about their diets and completed 6 24-hour recalls

Hawthorne effect: the more you ask people about their lives, the more they will change

Does this happen here?

If so, we’d expect that their caloric intake decreased the more they were asked about their diet



To test the Hawthorne effect, we took the average caloric intake from the first two interviews, and subtracted it from the average caloric intake from the last 2 interviews

X = (average of 5 & 6) – (average of 1 & 2)

Do you think the population mean of X is positive or negative?


WOMEN’S INTERVIEW SURVEY OF HEALTH (WISH)

My guess was that because of various factors (societal pressure, awareness of diet, Hawthorne effect), they will report fewer calories at the second time period

My hypothesis is that the population mean of X is < 0.


WISH: Change in Caloric Intake

[Figure: box plot of Change in mean Energy, N = 271. Vertical axis from -3000 to 2000; several outlying cases are flagged.]

Does it look like a big change?


WISH: Change in Calories

[Figure: Normal Q-Q Plot of Change in mean Energy. Horizontal axis: Observed Value, from -3000 to 2000; vertical axis: Expected Normal Value, from -2000 to 2000.]

Does this look straight enough to be happy thinking that X is approximately normally distributed?


WISH

Descriptives: Change in mean Energy (last 2 recalls minus first 2 recalls)

Statistic                               Value        Std. Error
Mean                                    -180.1262    37.2202
95% Confidence Interval for Mean
  Lower Bound                           -253.4050
  Upper Bound                           -106.8474
5% Trimmed Mean                         -171.6543
Median                                  -128.2150
Variance                                375428.7
Std. Deviation                          612.7223
Minimum                                 -2235.00
Maximum                                 1567.96
Range                                   3802.96
Interquartile Range                     838.1900
Skewness                                -.253        .148
Kurtosis                                .608         .295

What does an IQR of 838 mean?


WISH

The sample size is n = 271

The sample mean change = -180 calories!

The sample standard deviation = 612

The sample standard error = 37

Empirical rule, the chance is 95% that the population mean is within 1.96 * 37 ≈ 73 of -180, i.e., between about -253 and -107


WISH

Empirical rule, the chance is 95% that the population mean is between about -253 and -107

What does this mean?

Is there a Hawthorne effect going on?

Can you attach a probability to this?
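
The standard error and the simple 95% interval for the WISH data can be recomputed from the sample size, mean, and standard deviation given in the lecture. This is a minimal sketch using the 1.96 multiplier from the empirical rule; the variable names are illustrative only.

```python
import math

n = 271                    # sample size
mean_change = -180.1262    # sample mean change in calories
sd = 612.7223              # sample standard deviation

se = sd / math.sqrt(n)     # standard error of the mean, s / sqrt(n)
half_width = 1.96 * se     # empirical-rule multiplier for 95%
lower, upper = mean_change - half_width, mean_change + half_width

print(f"standard error = {se:.4f}")
print(f"95% interval: {lower:.1f} to {upper:.1f}")
```

The interval lies entirely below zero. The bounds in the Descriptives output are slightly wider: there the half-width divided by the standard error is about 1.969 rather than 1.96.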
