duongdien
2
Here in lesson 9 of HIMT 350, we will learn about normal distributions, determining normal probabilities, finding values that correspond to normal probabilities, and assessing departures from normality.
3
Normal random variables are the most common type of continuous random variable. They describe some (but not all) natural phenomena. Recall that they are common among the continuous variables found in healthcare and in electronic health records. For example, height, weight, systolic blood pressure, and fasting plasma glucose are all generally normally distributed. It's important to recall, however, that because ill people tend to be the ones using healthcare services, many of these variables are not strictly bell-shaped but are instead skewed to the right: they include more people with higher-than-normal values. It's important that you plot these variables and, if they are skewed, transform them with a log or square root to make them bell-shaped. You then run your analysis and back-transform the results to get meaningful parameters. For example, if you calculate the mean of the log data, you take the antilog of that log mean and report it (this back-transformed value is the geometric mean). A very important feature of normal probability distributions is that they describe the behavior of means. Recall from lesson 7 that continuous random variables are described with smooth probability density functions (pdfs). Normal pdfs are recognized by their familiar bell shape.
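The transform, analyze, back-transform sequence described above can be sketched in a few lines. This is a minimal illustration only: the `values` list is hypothetical, made up so that the arithmetic is easy to follow, and is not real patient data.

```python
import math
from statistics import mean

# Hypothetical right-skewed lab values (illustrative only, not real data).
values = [1.0, 2.0, 4.0, 8.0, 16.0]

log_values = [math.log(v) for v in values]  # natural-log transform
log_mean = mean(log_values)                 # mean on the log scale
back_transformed = math.exp(log_mean)       # antilog of the log mean (geometric mean)

print(f"arithmetic mean:       {mean(values):.2f}")
print(f"back-transformed mean: {back_transformed:.2f}")
```

For right-skewed data the back-transformed mean is smaller than the ordinary arithmetic mean, because the log scale pulls in the long right tail.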
4
In the top figure, the darker bars of the histogram correspond to ages less than or equal to 9 (~40% of observations). In the lower figure, this darker area under the curve also corresponds to ages less than 9 (~40% of the total area). Remember that the total area under the curve = 1.0.
5
Normal pdfs are a family of distributions. Family members are identified by parameters: μ (mean) and σ (standard deviation). μ controls location and σ controls spread. A family of distributions in this situation means they are all symmetrical but they can have different heights and widths.
6
Points of inflection (where the slope of the curve begins to level off) occur one σ below and one σ above μ.
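For readers who want to see why, here is a short derivation. It assumes the standard formula for the Normal pdf, which is not written out in this lesson.

```latex
\begin{aligned}
f(x)   &= \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \\
f'(x)  &= -\frac{x-\mu}{\sigma^2}\,f(x) \\
f''(x) &= \left(\frac{(x-\mu)^2}{\sigma^4}-\frac{1}{\sigma^2}\right) f(x)
\end{aligned}
```

Since f(x) is never zero, f''(x) = 0 only when (x − μ)² = σ², that is, at x = μ − σ and x = μ + σ.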
7
Remember the 68-95-99.7 rule? 68% of the AUC falls within ±1σ of μ, 95% of the AUC falls within ±2σ of μ, and 99.7% of the AUC falls within ±3σ of μ. For example, adult intelligence scores are Normally distributed with μ = 100 and σ = 15; X ~ N(100, 15). Using the 68-95-99.7 rule:
68% of scores fall in μ ± σ = 100 ± 15 = 85 to 115
95% of scores fall in μ ± 2σ = 100 ± (2)(15) = 70 to 130
99.7% of scores fall in μ ± 3σ = 100 ± (3)(15) = 55 to 145
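The rule can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist` in place of a printed table; the helper name `area_within` is made up for this illustration.

```python
from statistics import NormalDist

# Adult intelligence scores from the example: mu = 100, sigma = 15.
iq = NormalDist(mu=100, sigma=15)

def area_within(dist, k):
    """Return the AUC within k standard deviations of the mean."""
    lo = dist.mean - k * dist.stdev
    hi = dist.mean + k * dist.stdev
    return dist.cdf(hi) - dist.cdf(lo)

for k in (1, 2, 3):
    lo, hi = iq.mean - k * iq.stdev, iq.mean + k * iq.stdev
    print(f"within {k} sigma ({lo:.0f} to {hi:.0f}): {area_within(iq, k):.4f}")
```

The computed areas are slightly different from 68%, 95%, and 99.7% because the rule uses rounded figures.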
8
Because the Normal curve is symmetrical and the total AUC adds to 1, we can determine the AUC in the tails. Because 95% of the AUC is within μ ± 2σ, the remaining 5% is split evenly, so 2.5% is in each tail beyond μ ± 2σ.
9
For example, male height is approximately Normal with μ = 70.0˝ and σ = 2.8˝. Because of the 68-95-99.7 rule, 68% of the population is in the range 70.0˝ ± 2.8˝ = 67.2˝ to 72.8˝. Because the total AUC adds to 100%, 32% are in the tails below 67.2˝ and above 72.8˝. Because of symmetry, half of this 32% (i.e., 16%) is below 67.2˝ and 16% is above 72.8˝.
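As a rough check on the tail reasoning above, the two tail areas can be computed directly. This is a sketch using the standard-library `statistics.NormalDist` with the height parameters from the example; the variable names are chosen only for this illustration.

```python
from statistics import NormalDist

# Male height from the example: mu = 70.0 inches, sigma = 2.8 inches.
height = NormalDist(mu=70.0, sigma=2.8)

below = height.cdf(67.2)      # left tail: shorter than mu - sigma
above = 1 - height.cdf(72.8)  # right tail: taller than mu + sigma

print(f"below 67.2 in: {below:.4f}")
print(f"above 72.8 in: {above:.4f}")
```

Both tails come out a little under 16%, which matches the rule of thumb once the rounding in "68%" is taken into account.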
10
Remember when I mentioned that many continuous variables in electronic health records are not normally distributed? We can often re-express non-Normal variables with a mathematical transformation to make them more Normal. Examples of mathematical transforms include logarithms, exponents, square roots, and so on. Here's an example using prostate-specific antigen (PSA). PSA is not normally distributed in people aged 60 or older, but the log PSA, shown as ln(PSA) in the figure, is approximately Normal with μ = −0.3 and σ = 0.8. Approximately 95% of log PSA values fall in μ ± 2σ = −0.3 ± (2)(0.8) = −1.9 to 1.3. Therefore, 2.5% of log PSA values are above 1.3. Take the antilog of 1.3 and you get approximately 3.67. Since only 2.5% of the population has PSA values greater than 3.67, this is used as the cut-point for suspiciously high results. The same approach is used with a number of lab tests, and it is often how critical points are set for deciding whether someone likely has a condition.
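The PSA cut-point arithmetic can be reproduced as follows. This is a small sketch of the calculation in the example, using the μ and σ given for ln(PSA); the variable names are illustrative.

```python
import math

# ln(PSA) parameters from the example.
mu, sigma = -0.3, 0.8

lower_log = mu - 2 * sigma  # about -1.9 on the log scale
upper_log = mu + 2 * sigma  # about  1.3 on the log scale

# Back-transform (antilog) to the original PSA scale.
cutoff = math.exp(upper_log)

print(f"95% of ln(PSA) falls between {lower_log:.1f} and {upper_log:.1f}")
print(f"upper cut-point on the PSA scale: {cutoff:.2f}")
```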
11
The best way to determine if a variable is normally distributed is to plot the data using a histogram or a Q-Q plot. You can check the shape in the histogram, and statistical software packages often allow a normal curve to be superimposed over the histogram (see the Marshfield BMI data from lesson 2). If the histogram is skewed, you can try transforming the data using a log, square root, etc., and then plot it again and check its shape. Another useful plot is the Q-Q plot. If the data are normally distributed, the points should adhere to a diagonal line on the Q-Q plot.
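To make the idea behind a Q-Q plot concrete, here is a minimal sketch that computes the plotted points without drawing them. The `sample` values are hypothetical, the function name `qq_points` is made up for this illustration, and the plotting position (i − 0.5)/n is one common convention among several; statistical packages may use a slightly different one.

```python
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Pair each theoretical Normal quantile with the matching sorted observation."""
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))  # Normal with the sample mean and SD
    ordered = sorted(data)
    # (i - 0.5) / n is an assumed plotting position for the i-th smallest value.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical sample (illustrative only, not real data).
sample = [61, 64, 66, 67, 68, 69, 70, 71, 72, 73, 75, 78]

for expected, observed in qq_points(sample):
    print(f"{expected:6.2f}  {observed}")
```

If the two columns track each other closely, the points would lie near the diagonal line when plotted.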
12
In most cases the value of interest does not fall directly on a ±1σ, ±2σ, or ±3σ landmark, so this is how you determine the normal probability: first, state the problem; then standardize the value (compute the z score); then sketch and shade the curve; and finally, use Table B in your textbook to determine the probability.
13
OK, here's an example using gestation length. The first thing to do is to state the problem. Here we want to determine the percentage of human gestations that are 40 weeks or less in length. The length of uncomplicated human pregnancy from conception to birth is approximately Normally distributed with μ = 39 weeks and σ = 2 weeks. So we want to know the probability that X ≤ 40 weeks.
14
To solve the problem we will use a z-table, which is Table B in your textbook. The z-table gives cumulative probabilities for the standard normal variable, which has a mean of 0 and a standard deviation of 1. Standard normal variables are also called Z variables. Looking at this table, a standard normal (Z) variable with a value of 1.96 has a cumulative probability of .9750. What this is saying is that approximately 97.5% of all the data points fall below a z-score of 1.96.
15
Step 2 is to standardize, that is, to calculate a z value. To do this, subtract μ and divide by σ: z = (x − μ)/σ. The z-score tells you how many σ-units the value falls above or below μ. For our gestation example, we take (40 − 39)/2, which equals 0.5. That's the z-score.
16
The next step is to look up the cumulative area under the curve for the z value 0.5 in Table B. It corresponds to .6915, which means approximately 69% of all gestations are 40 weeks or less in length.
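The standardize-then-look-up steps can be mirrored in code. In this sketch the standard-library `statistics.NormalDist` stands in for Table B; it is not the textbook's table, but its cumulative probabilities agree with the table to four decimal places for this example.

```python
from statistics import NormalDist

# Gestation length from the example: mu = 39 weeks, sigma = 2 weeks.
mu, sigma = 39, 2
x = 40

# Standardize: number of sigma-units x lies above (or below) mu.
z = (x - mu) / sigma

# "Look up" the cumulative probability for z on the standard Normal curve.
p = NormalDist().cdf(z)

print(f"z = {z}")
print(f"Pr(X <= {x}) = {p:.4f}")
```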
17
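You can also determine the probability between two points using z scores. If you want to determine the area between a and b, find the z scores for a and b, look up the cumulative probability for each in Table B, and subtract the cumulative probability for a from the cumulative probability for b.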
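Here is a brief sketch of a between-two-points calculation. The area between a and b is the cumulative probability at b minus the cumulative probability at a. The endpoints 37 and 41 weeks are hypothetical values chosen for this illustration and do not come from the lesson; `statistics.NormalDist` again stands in for Table B.

```python
from statistics import NormalDist

mu, sigma = 39, 2  # gestation length parameters from the earlier example
a, b = 37, 41      # hypothetical endpoints, chosen for illustration

z_a = (a - mu) / sigma
z_b = (b - mu) / sigma

std_normal = NormalDist()  # standard Normal: mean 0, SD 1
# Subtract cumulative probabilities, not the z scores themselves.
p_between = std_normal.cdf(z_b) - std_normal.cdf(z_a)

print(f"z_a = {z_a}, z_b = {z_b}")
print(f"Pr({a} <= X <= {b}) = {p_between:.4f}")
```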
18
You can also find values corresponding to normal probabilities. First, state the problem; then use Table B to look up the z percentile value; then sketch what you are trying to do; and finally unstandardize with the formula shown, x = μ + zσ.
19
Then you use Table B to look up the z percentile value, i.e., the z score for the probability in question. Look inside the table for the entry closest to the associated cumulative probability, then trace that entry back to the row and column labels to read off the z score. Suppose you wanted the 97.5th percentile z score. Look inside the table for .9750, then trace it to the margins, which gives a z score of 1.96.
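Reading the table "inside out" corresponds to evaluating the inverse of the cumulative probability. As a sketch, the standard library's `NormalDist.inv_cdf` can play the role of that reverse lookup; this is a software substitute for Table B, not the table itself.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard Normal: mean 0, SD 1

# Reverse lookup: which z score has cumulative probability .9750?
z_975 = std_normal.inv_cdf(0.975)

# Forward lookup as a check: cumulative probability at z = 1.96.
p_at_196 = std_normal.cdf(1.96)

print(f"97.5th percentile z score: {z_975:.2f}")
print(f"cumulative probability at z = 1.96: {p_at_196:.4f}")
```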
20
Here's an example of finding a normal value. Suppose we want to know which gestation length is shorter than 97.5% of all gestations. Step 1 is to state the problem. Let X represent gestation length. The prior problem established X ~ N(39, 2), so the mean is 39 weeks and the standard deviation is 2 weeks. We want the gestation length that is shorter than .975 of all gestations. This is equivalent to the gestation length that is longer than .025 of gestations.
21
We can then use Table B to look up the z value. Table B lists only “left tails”, so we look up the cumulative probability .0250. The z value corresponding to .0250 (z.025) is −1.96.
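Putting the lookup together with the unstandardizing step gives the gestation length itself. The sketch below assumes the usual unstandardizing formula x = μ + zσ and uses `NormalDist.inv_cdf` in place of Table B; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 39, 2  # gestation length: X ~ N(39, 2)

# z score with a left-tail (cumulative) probability of .0250.
z = NormalDist().inv_cdf(0.025)

# Unstandardize, assuming x = mu + z * sigma.
x = mu + z * sigma

print(f"z = {z:.2f}")
print(f"gestation length at the 2.5th percentile: {x:.2f} weeks")
```

Under these assumptions the result is about 35 weeks: roughly 2.5% of gestations are shorter than this, and 97.5% are longer.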