Copyright (c) Bani Mallick 1
Lecture 4
Stat 651
Topics in Lecture #4
Probability
The bell-shaped (normal) curve
Normal probability plots (the q-q plot) to check for normality of continuous data
Use of Table 1 in the back of the book
Topics in Lecture #4
Normal probability calculations
Data Transformations
Sampling distributions: sample means are random variables!
Standard error of the sample mean
Central Limit Theorem
A simple confidence interval
Book Sections Covered in Lecture #4
Chapter 4.10, in detail
Chapter 4.11 (read on your own)
Chapter 4.12, in detail
Chapter 5.1
Chapter 5.2
Lecture 3 Review
Box plots are probably the best way to compare populations graphically
You can detect shifts and changes in variation
Also identifies outliers
Lecture 3 Review
q-q plots are a simple way to understand whether the data are approximately bell-shaped
[Figure: Population Relative Frequency Histogram, a bell-shaped curve. x-axis: X, from -4 to 4; y-axis: Normal Density]
Lecture 3 Review
q-q plots are a simple way to understand whether the data are approximately bell-shaped
If they are sort of straight, then normality of the population relative frequency histogram is not too badly off
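The idea behind a q-q plot can be sketched in a few lines of Python. This is only an illustration, not the procedure used to draw the plots in these slides: the data values are made up, the function name qq_points is hypothetical, and the plotting position (i - 0.5)/n is one common convention, assumed here.

```python
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Pair each sorted observation with the value a normal population would put there."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting position (i - 0.5)/n is an assumed convention; others exist.
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(xs, expected))

# Hypothetical values standing in for log(saturated fat); not the NHANES data.
sample = [2.1, 2.6, 2.9, 3.0, 3.1, 3.4, 3.8]
for observed, expected_value in qq_points(sample):
    print(f"observed {observed:.2f}   expected normal {expected_value:.2f}")
```

If the printed pairs rise together at a roughly constant rate, the points would fall close to a straight line when plotted.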
q-q plot for the healthy women
[Figure: Normal Q-Q Plot of Log(Saturated Fat). x-axis: Observed Value; y-axis: Expected Normal Value]
Lecture 3 Review
For bell-shaped populations, we have empirical rules
Approximately 68% (90%) (95%) of the population lies within 1 (1.645) (1.96) population standard deviations of the population mean
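The three percentages in the empirical rule can be checked numerically. The sketch below uses Python's standard-library NormalDist in place of Table 1; the helper name share_within is made up for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal population within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1.0, 1.645, 1.96):
    print(f"within {k} standard deviations: {share_within(k):.1%}")
```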
Lecture 3 Review
In many of our examples, we have seen that there look to be differences among populations. How can we tell if the differences are real?
We will say that populations are different if the differences we observe are more than can be expected by sample-to-sample variability.
Lecture 3 Review
A random variable is any outcome (qualitative or numerical) from an experiment involving random sampling from a population
The idea of a model is to write down a formula for the population histogram as a function of 1-2 parameters which are estimated from the data.
If you know the parameters of the model, then you know everything about probabilities in that population
Using the Normal Model
The entire point of the normal model is to make probability statements
In practice, we estimate the population mean by the sample mean
We estimate the population standard deviation by the sample standard deviation
Then we estimate probabilities, by pretending the sample quantities = the population ones
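The plug-in recipe above can be sketched as follows. The ages are invented for illustration, and the cutoff of 45 is an arbitrary choice; only the steps (sample mean, sample standard deviation, then a normal probability) come from the slide.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample of ages; invented for illustration only.
ages = [31, 35, 38, 40, 41, 44, 47, 52]

x_bar = mean(ages)   # estimates the population mean
s = stdev(ages)      # estimates the population standard deviation

# Pretend the sample quantities are the population ones.
model = NormalDist(mu=x_bar, sigma=s)
print(f"sample mean = {x_bar:.2f}, sample sd = {s:.2f}")
print(f"estimated Pr(X < 45) = {model.cdf(45):.3f}")
```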
Various Cases
Suppose we want to know what % of a population lies below a specified value, c
We write this by asking: what is
Pr(X < c)
The value c is any arbitrary value, e.g., 6
X is any random variable with a population mean and a population standard deviation
Pr(X < c) for Normal Populations
Compute the z-score: z = (c - μ)/σ
Look up value in Table 1, page 1091
(white board explanation)
Mechanics
NHANES: suppose healthy women’s ages are normally distributed with mean = 40 and standard deviation = 6
What is the chance that a randomly selected person from this population is aged c = 43.3 or less?
We write this in symbols as pr(X < 43.3)
Mechanics
μ = 40, σ = 6
pr(X < 43.3) is what we want
z = (43.3 - μ)/σ = 0.55 = z-score
Look up in Table 1:
The value 0.55 is on page 1092: first column is 0.5, first row is 0.05: add them to get 0.55, and look up the value
Pr(X < 43.3) = 0.7088
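The same lookup can be done in software instead of Table 1. This is a small sketch using Python's standard-library NormalDist; the variable names are chosen for this example only.

```python
from statistics import NormalDist

mu, sigma = 40.0, 6.0        # population mean and standard deviation from the slide
c = 43.3

z = (c - mu) / sigma         # z-score
prob = NormalDist().cdf(z)   # area to the left of z under the standard normal curve

print(f"z = {z:.2f}, Pr(X < {c}) = {prob:.4f}")
```

The result should agree with the Table 1 value of 0.7088 to four decimals.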
Various Cases
Suppose we want to know what % of a population lies above a specified value, c
We write this by asking: what is
Pr(X > c)
The value c is any arbitrary value, e.g., 6
X is any random variable with a population mean and a population standard deviation
Pr(X > c) for Normal Populations
This is simply 1 – Pr(X <= c).
Compute the z-score (c - μ)/σ
Look up the value for z in Table 1
Subtract this value from 1.0
Mechanics
μ = 40, σ = 6
Chance that a randomly selected person from this population is aged 46 or more
pr(X > 46)
z = (46 - μ)/σ = 1
Look up in Table 1 for 1.00: get 0.8413
Because you are asking for > 46, subtract from 1 to get pr(X > 46) = 1 – 0.8413 = .1587
Mechanics
μ = 40, σ = 6
Chance that a randomly selected person from this population is aged 46 or less
pr(X <= 46)
z = (46 - μ)/σ = 1
Look up in Table 1: chance is 84.13%
Mechanics
μ = 40, σ = 6
Chance that a randomly selected person from this population is aged 34 or less
pr(X <= 34)
z = (34 - μ)/σ = -1
Look up in Table 1: chance is 0.1587 = 15.87%
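The "or less" and "or more" cases for the age example can be sketched together. The helper names pr_at_most and pr_more_than are made up for this illustration; the mean of 40 and standard deviation of 6 are the values used on the slides.

```python
from statistics import NormalDist

ages = NormalDist(mu=40.0, sigma=6.0)   # population values from the slides

def pr_at_most(c):
    """Pr(X <= c): the Table 1 lookup."""
    return ages.cdf(c)

def pr_more_than(c):
    """Pr(X > c) = 1 - Pr(X <= c)."""
    return 1.0 - ages.cdf(c)

print(f"Pr(X > 46)  = {pr_more_than(46):.4f}")
print(f"Pr(X <= 46) = {pr_at_most(46):.4f}")
print(f"Pr(X <= 34) = {pr_at_most(34):.4f}")
```

Because the bell-shaped curve is symmetric about the mean, the chance of being 34 or less (z = -1) equals the chance of being more than 46 (z = 1).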
Aortic Stenosis Data
Two populations: healthy kids and kids with aortic stenosis
Two outcomes: body surface area (BSA) and aortic valve area (AVA)
Size-adjusted aortic valve area is the ratio of aortic valve area to body surface area
Stenosis Data, AVA to BSA Ratio
Note the huge outlier in the stenotic kids.
He/she has a huge aortic valve area relative to his/her body size
[Figure: box plots of AVA to BSA Ratio by Health Status (Healthy, Stenotic)]
Aortic Stenosis Data
Healthy kids and AVA/BSA Ratio
Sample mean = 1.38, s = 0.51
Let’s pretend the population has μ = 1.4, σ = 0.5
As it turns out, the sample mean of stenotic kids is 0.7
So, let’s ask: for healthy kids, what is
pr(X < 0.7)?
Aortic Stenosis Data
Healthy kids and AVA/BSA Ratio
μ = 1.4, σ = 0.5
For healthy kids, what is pr(X <= 0.7)?
z = (0.7 - μ)/σ = -1.4
Look up in Table 1
You should get 0.0808
Aortic Stenosis Data
For healthy kids, pr(X <= 0.7) = 0.0808
Stenotic kids have a mean ava/bsa ratio of 0.7
Thus, the average stenotic kid has a lower ava/bsa ratio than 91.92% of healthy kids
91.92% = 100% - 8.08%
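The aortic stenosis calculation can be reproduced the same way. This sketch uses the "pretend" population values from the slide (mean 1.4, standard deviation 0.5) and Python's standard-library NormalDist; the variable names are illustrative.

```python
from statistics import NormalDist

# "Pretend" population values for healthy kids, as on the slide.
healthy = NormalDist(mu=1.4, sigma=0.5)
stenotic_mean = 0.7

z = (stenotic_mean - healthy.mean) / healthy.stdev
below = healthy.cdf(stenotic_mean)   # share of healthy kids at or below 0.7
above = 1.0 - below                  # share of healthy kids above 0.7

print(f"z = {z:.1f}")
print(f"Pr(X <= 0.7) = {below:.4f}")
print(f"share of healthy kids above 0.7 = {above:.2%}")
```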
Not all Data are Normally Distributed
“Time to an event”, e.g., time to a heart attack
Number of things that happen, e.g., number of heart attacks
These typically have a skew shape
[Figure: a skewed density curve. x-axis: X, from -1 to 6; y-axis: DENSITY]
Not all Data are Normally Distributed
These typically have a skew shape
Statisticians have special models to handle this (Gamma, Poisson)
You will usually try to eliminate some of the skewness by data transformation
[Figure: a skewed density curve. x-axis: X, from -1 to 6; y-axis: DENSITY]
Not all Data are Normally Distributed
The standard data transformations are
Square root
Logarithm: but if you have zeros in the data set, you have to add a small constant, since log(0) = -∞
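A rough sketch of how the two transformations pull in a long right tail is given below. The counts are hypothetical, the constant added before taking logs (1.0) is an assumption made for this example, and the skewness function is a simple moment-based measure chosen for illustration.

```python
import math

def skewness(values):
    """Moment-based skewness: 0 for a symmetric sample, positive for a long right tail."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed counts, including a zero; not data from the lecture.
counts = [0, 1, 1, 2, 2, 3, 5, 8, 13, 40]
shift = 1.0   # assumed constant added before the log so that log(0) never occurs

square_roots = [math.sqrt(v) for v in counts]
logs = [math.log(v + shift) for v in counts]

for label, values in (("raw", counts), ("square root", square_roots), ("log", logs)):
    print(f"{label:12s} skewness = {skewness(values):.2f}")
```

For these made-up counts the square root reduces the skewness and the logarithm reduces it further, which is the usual ordering for strongly right-skewed data.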
Inference
The basic building blocks for inference are statistics
Let’s start with the population mean μ, the sample mean X̄ and the sample standard deviation s
Standard error (of the mean) is s/√n
Inference
The sample mean is a random variable
This means that it varies from sample to sample
Of course, if we were able to “sample” the entire population, the sample mean would equal the population mean
Inference
The sample mean is a random variable
Its own “population” mean is μ
Its standard deviation is σ/√n
Note how the standard deviation of the sample mean becomes smaller as the sample size becomes larger
Why does this make sense?
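To see how quickly the standard deviation of the sample mean, σ/√n, shrinks, here is a short sketch. It borrows σ = 6 from the age example earlier in the lecture; the sample sizes are arbitrary choices and the function name is made up for this illustration.

```python
import math

sigma = 6.0   # population standard deviation from the NHANES age example

def sd_of_sample_mean(sigma, n):
    """Standard deviation of the mean of n independent observations."""
    return sigma / math.sqrt(n)

for n in (1, 4, 16, 64, 256):
    print(f"n = {n:3d}: sd of sample mean = {sd_of_sample_mean(sigma, n):.3f}")
```

Each fourfold increase in the sample size halves the standard deviation of the sample mean.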
Central Limit Theorem
The sample mean is a random variable
Its own “population” mean is μ
Its standard deviation is σ/√n
In “large enough” samples, the sample mean is very nearly normally distributed, i.e., has a bell-shaped histogram
What does this mean?
Warning
It is incredibly easy to have difficulty understanding that the sample mean is itself a random variable
But it is the crucial concept
If I take repeated samples and compute the sample mean each time, I will not get the same number.
Thus, the sample mean is a random variable
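A small simulation can make this concrete. The sketch below is an illustration under stated assumptions, not part of the lecture: the population is taken to be exponential with mean 1 and standard deviation 1 (a skewed shape), and the sample size, number of repeats, and seed are arbitrary.

```python
import math
import random
from statistics import mean, stdev

random.seed(651)   # arbitrary seed so the run is repeatable

n = 50             # size of each sample
repeats = 2000     # number of repeated samples
# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1 (skewed, not normal).
mu, sigma = 1.0, 1.0

# One sample mean per repeated sample.
sample_means = [
    mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(repeats)
]

theory_sd = sigma / math.sqrt(n)
inside = sum(abs(m - mu) <= 1.96 * theory_sd for m in sample_means) / repeats

print(f"first five sample means: {[round(m, 3) for m in sample_means[:5]]}")
print(f"mean of the sample means: {mean(sample_means):.3f} (population mean {mu})")
print(f"sd of the sample means:   {stdev(sample_means):.3f} (sigma/sqrt(n) = {theory_sd:.3f})")
print(f"share within 1.96 sd of the population mean: {inside:.3f}")
```

The sample means differ from one sample to the next, their spread should be close to σ/√n, and roughly 95% of them should land within 1.96 of those standard deviations of the population mean, even though the population itself is skewed.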
Women’s Interview Survey of Health
Funny case-control study
Seemed to indicate that those women who ate a lot of non-chocolate sweets were at higher risk of breast cancer
271 women controls were interviewed about their diets
They completed six 24-hour recalls
Women’s Interview Survey of Health
271 women controls were interviewed about their diets and completed six 24-hour recalls
Hawthorne effect: the more you ask people about their lives, the more they will change
Does this happen here?
If so, we’d expect that their caloric intake decreased the more they were asked about their diet
Women’s Interview Survey of Health
To test the Hawthorne effect, we took the average caloric intake from the first two interviews, and subtracted it from the average caloric intake from the last 2 interviews
X = (average of 5 & 6) – (average of 1 & 2)
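For a single participant, the change score X could be computed as in the sketch below. The six calorie values are invented for illustration and are not WISH data.

```python
from statistics import mean

# Hypothetical calories reported by one woman on recalls 1 through 6.
recalls = [2100, 1950, 1800, 1900, 1750, 1850]

first_two = mean(recalls[:2])    # average of recalls 1 & 2
last_two = mean(recalls[-2:])    # average of recalls 5 & 6
x = last_two - first_two         # negative when reported intake went down

print(f"X = {last_two} - {first_two} = {x}")
```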
Do you think the population mean of X is positive or negative?
WOMEN’S INTERVIEW SURVEY OF HEALTH (WISH)
My guess was that because of various factors (societal pressure, awareness of diet, Hawthorne effect), they will report fewer calories at the second time period
My hypothesis is that the population mean of X is < 0.
WISH: Change in Caloric Intake
[Figure: box plot of Change in mean Energy, N = 271, with several flagged outliers]
Does it look like a big change?
WISH: Change in Calories
[Figure: Normal Q-Q Plot of Change in mean Energy. x-axis: Observed Value; y-axis: Expected Normal Value]
Does this look straight enough to be happy thinking that X is approximately normally distributed?
WISH
Descriptives
Change in mean Energy: last 2 recalls minus first 2 recalls

Measure                             Statistic    Std. Error
Mean                                -180.1262    37.2202
95% Confidence Interval for Mean
  Lower Bound                       -253.4050
  Upper Bound                       -106.8474
5% Trimmed Mean                     -171.6543
Median                              -128.2150
Variance                            375428.7
Std. Deviation                      612.7223
Minimum                             -2235.00
Maximum                             1567.96
Range                               3802.96
Interquartile Range                 838.1900
Skewness                            -.253        .148
Kurtosis                            .608         .295

What does an IQR of 838 mean?
WISH
The sample size is n = 271
The sample mean change = -180 calories!
The sample standard deviation = 613
The sample standard error = 37
Empirical rule: the chance is 95% that the population mean is within 1.96 * 37 ≈ 73 of -180, i.e., between about -253 and -107
WISH
Empirical rule: the chance is 95% that the population mean is between about -253 and -107
What does this mean?
Is there a Hawthorne effect going on?
Can you attach a probability to this?
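The interval quoted on the WISH slides can be reproduced from three numbers in the Descriptives output: the sample size, the sample mean, and the sample standard deviation. The sketch below assumes the 1.96 multiplier from the empirical rule; the variable names are illustrative.

```python
import math

n = 271
sample_mean = -180.1262   # from the Descriptives output
sample_sd = 612.7223      # from the Descriptives output

standard_error = sample_sd / math.sqrt(n)   # s / sqrt(n)
margin = 1.96 * standard_error
lower, upper = sample_mean - margin, sample_mean + margin

print(f"standard error = {standard_error:.4f}")
print(f"95% interval: {lower:.1f} to {upper:.1f}")
```

The whole interval lies below zero. The bounds in the Descriptives output (-253.4 and -106.8) are slightly wider, presumably because the software uses a t-based multiplier a little larger than 1.96.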