Module Four: Normal distribution and its applications to inter-laboratory testing

When we conduct inter-laboratory testing, we often observe continuous variables, e.g., the amount of chloride in a water sample, the beta-carotene in a blood sample, or blood pressure.

When we construct a relative frequency histogram, it is very likely that the distribution is bell-shaped: a few values are small, a few are large, and most are around the average.

This type of distribution is what we call a NORMAL distribution.

For example, blood pressure, the beta-carotene in a blood sample, and the amount of chloride in a water sample mostly follow normal curves.

[Figure: Histogram of Systolic Blood Pressure, with Normal Curve, for individuals aged 15-20. X-axis: Systolic Blood Pressure (60 to 210); Y-axis: Frequency (0 to 400).]

A histogram with a superimposed normal curve for the systolic blood pressure of 1900 individuals.

The superimposed smooth curve is bell-shaped. If the blood pressure follows a normal curve with mean 115 and s.d. 14, we use the notation X ~ N(μ, σ), where μ is the mean and σ is the standard deviation.

For this case, X ~ N(115, 14).

An immediate question is: how can we detect whether the distribution indeed follows a normal curve?

Our interest may be to check whether the blood pressure follows a normal distribution, to find the proportion of individuals whose blood pressure is at risk (150 mmHg or higher), or to identify extreme cases.
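As a rough sketch of the "proportion at risk" question, assuming for the moment that blood pressure really is N(115, 14) (an assumption the module goes on to test), the tail area can be computed with Python's standard-library `statistics.NormalDist`. The variable names are illustrative.

```python
from statistics import NormalDist

# Assumed model from the slide: X ~ N(mean 115, s.d. 14).
bp = NormalDist(mu=115, sigma=14)

# Standardized value of the at-risk threshold.
z = (150 - 115) / 14

# P(X >= 150) = 1 - P(X < 150); the endpoint does not matter for a continuous variable.
p_at_risk = 1 - bp.cdf(150)

print(f"z = {z:.2f}, P(X >= 150) = {p_at_risk:.4f}")
```

Under this model, fewer than 1% of individuals would be at or above the threshold; a noticeably larger observed proportion would be one sign that the normal model fits the upper tail poorly.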


When and how do you use the normal distribution in real-world situations?

The normal curve describes the probability of occurrence in many real situations.

• Most statistical techniques, including the techniques used for analyzing inter-laboratory testing data, assume that the response variable approximately follows a normal curve.

• These methods may not be valid if the response does not follow a normal distribution. It is therefore important to learn how to check whether a response variable follows a normal distribution. For this reason, we need to learn some basic properties of a normal distribution and how to compute probabilities and percentiles for it.

• In this module, we will discuss:

– The use of the Z-table and Minitab to compute probabilities and percentiles.

– Techniques for checking whether a response variable follows a normal distribution.


The normal probability distribution provides a good model for describing data that have mound-shaped frequency distributions.

The Normal Probability Distribution:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where e ≈ 2.718 and π ≈ 3.142; μ and σ (σ > 0) are the parameters that represent the population mean and standard deviation.

We will use the notation: X ~ N(μ, σ). This means X is distributed as Normal with mean μ and standard deviation σ.

Some examples of normal random variables are:

X = Adult height, X = Scores on a national test, X = Gas price, X = Blood pressure

NOTE: X = salary of individuals who are 40 years or older and not yet retired does not follow a normal curve. Its distribution is skewed to the right.
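Returning to the density formula on this slide, the sketch below evaluates it directly and compares the result with the standard library's `NormalDist.pdf`. The function name `normal_pdf` and the blood-pressure parameters (115, 14) are illustrative choices, not part of the original slides.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Evaluate f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

# Compare the hand-written formula with the library density at a few points.
library = NormalDist(mu=115, sigma=14)
for x in (87, 115, 143):
    print(x, normal_pdf(x, 115, 14), library.pdf(x))
```

The two columns should agree to floating-point precision, and the values at 87 and 143 (both 28 units from the mean) should be equal, reflecting the symmetry of the curve.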


Properties of Normal Distribution

This figure shows three such distributions with differing values of μ and σ.

[Figure: three normal curves, labeled 1, 2, and 3.]

The mean μ determines the center. The standard deviation σ measures the variability. Large values of σ reduce the height of the curve and increase the spread. Small values of σ increase the height of the curve and reduce the spread.


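Some properties for X ~ N(μ, σ)

[Figure: normal density f(x) plotted against X, with the points μ - a, μ, and μ + a marked.]

P(μ - a < X < μ) = P(μ < X < μ + a)

P(X < μ - a) = P(X > μ + a)

Also: P(X > μ) = P(X < μ) = .5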
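These symmetry properties can be checked numerically. The sketch below uses illustrative values (μ = 115, σ = 14, a = 20) that are assumptions made only for the demonstration.

```python
from statistics import NormalDist

mu, sigma, a = 115, 14, 20   # illustrative values only
X = NormalDist(mu, sigma)

left_of_mean = X.cdf(mu) - X.cdf(mu - a)    # P(mu - a < X < mu)
right_of_mean = X.cdf(mu + a) - X.cdf(mu)   # P(mu < X < mu + a)
lower_tail = X.cdf(mu - a)                  # P(X < mu - a)
upper_tail = 1 - X.cdf(mu + a)              # P(X > mu + a)
below_mean = X.cdf(mu)                      # P(X < mu)

print(left_of_mean, right_of_mean)
print(lower_tail, upper_tail)
print(below_mean)
```

Each printed pair should match, and the last value should be one half, whatever values of μ, σ, and a are chosen.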


Example:

Every year, universities recruit students using their SAT scores. Based on previous information, we know that SAT scores follow a normal curve with mean 1000 and standard deviation 180. In the past, CMU admitted students with an SAT score of 1090 or higher.

Q1: What percent of high school students can receive CMU admission?

Q2: Suppose CMU decides to raise the SAT admission limit so that only the top 20% of high school graduates are admitted. What should the new SAT admission limit be?

Q3: A student scored 1200 and claims he is in the top 10%. Is this claim correct?


Tabulated Areas of the Normal Probability Distributions

• How do you solve the SAT admission problem? First, we need to rewrite the problem using notation we are familiar with. Let X = SAT score. Then, from the given information, we know X ~ N(1000, 180).

Q1 asks for P(X > 1090).

Q2 asks for a value of X, call it x0, the admission limit, so that P(X > x0) = .2.

Q3 asks us to compare P(X > 1200) with .1.

How do we solve these problems?

• The probability that a continuous random variable X assumes a value in the interval from a to b is the area under the probability density function between the points a and b.


One can use computer software such as Minitab, or use a standardized Z-table.

The Standard Normal Random Variable:

The standardized normal random variable z is defined as

z = (x - μ)/σ, or equivalently, x = μ + zσ.

The standard normal probability distribution has a mean of zero and a standard deviation of 1, that is, Z ~ N(0, 1).

The area under the standard normal curve between the mean z = 0 and a specified positive value of z, say z0, is the probability P(0 ≤ z ≤ z0).

Some books use this table. Some use other types of tables.

[Figure: standard normal curve, with 0 and z0 marked on the Z axis.]


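The idea is to transform X ~ N(μ, σ) to Z ~ N(0, 1) using z = (x - μ)/σ.

Back to the SAT score problem: X ~ N(1000, 180).

P(X > 1090) = P(Z > (1090 - 1000)/180) = P(Z > 0.5)

Now the Z-table can be applied: P(0 < Z < 0.5) = .1915, so P(X > 1090) = .5 - .1915 = .3085.

[Figure: normal curve with the area P(X > 1090) shaded. The X axis (SAT score) marks 1000 and 1090; the corresponding Z axis, Z = (x - 1000)/180, marks 0 = (1000 - 1000)/180 and 0.5 = (1090 - 1000)/180.]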
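The standardization step for Q1, and the comparison Q3 asks for, can be sketched with the standard library as follows. The variable names are illustrative; the model X ~ N(1000, 180) is the one stated in the example.

```python
from statistics import NormalDist

sat = NormalDist(mu=1000, sigma=180)
Z = NormalDist()  # standard normal, N(0, 1)

# Q1: P(X > 1090), computed directly and via the standardized value.
z1 = (1090 - 1000) / 180
p_q1_direct = 1 - sat.cdf(1090)
p_q1_via_z = 1 - Z.cdf(z1)

# Q3: compare P(X > 1200) with .1
p_q3 = 1 - sat.cdf(1200)
in_top_10_percent = p_q3 <= 0.10

print(f"Q1: z = {z1:.2f}, P(X > 1090) = {p_q1_direct:.4f}")
print(f"Q3: P(X > 1200) = {p_q3:.4f}; in the top 10%? {in_top_10_percent}")
```

Both routes to Q1 give the same probability, which is the point of the transformation: any normal probability can be read from a single standard normal table.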


Example: Find P(0 < Z < 1.63)

Solution

1. Draw a normal curve and shade the area of interest.

2. Rewrite the question in a form to which the Z-table can be applied, that is, in the form P(0 < Z < z0).

For this example, it is already in this form, so using the Z-table, we obtain P(0 < Z < 1.63) = .4484.

Some additional exercises:

Find P(Z < 1.96), P(-1.24 < Z < .68), and P(Z > -1.64).
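One way to check table look-ups like these is to reproduce the table entry, the area between 0 and z0, in code. The helper name `table_area` is hypothetical; it simply mirrors the form P(0 < Z < z0) that this style of Z-table lists.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def table_area(z0):
    """Area between 0 and z0 (z0 >= 0), the quantity this style of Z-table lists."""
    return Z.cdf(z0) - 0.5

# Worked example: already in table form.
p_example = table_area(1.63)

# Exercises, each rewritten as sums of table areas and the half-area .5.
p_below_196 = 0.5 + table_area(1.96)                  # P(Z < 1.96)
p_between = table_area(1.24) + table_area(0.68)       # P(-1.24 < Z < .68), by symmetry
p_above_neg164 = 0.5 + table_area(1.64)               # P(Z > -1.64), by symmetry

for label, p in [("P(0 < Z < 1.63)", p_example),
                 ("P(Z < 1.96)", p_below_196),
                 ("P(-1.24 < Z < .68)", p_between),
                 ("P(Z > -1.64)", p_above_neg164)]:
    print(f"{label:20s} = {p:.4f}")
```

Small differences in the fourth decimal place from a printed table are expected, since table entries are rounded before they are added.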


Calculating Probabilities for a General Normal Random Variable, X:

1. Draw a normal curve for X and shade the area of interest.

2. Transform X to Z.

- Standardize the interval of interest: write it as the equivalent interval in terms of z.

- The probability of interest is the area that you find using the standard normal probability distribution.


Now, back to the SAT example. Do the following exercises:

The SAT score X follows a normal distribution with mean 1000 and s.d. 180. That is, X ~ N(1000, 180).

Find P(X < 800)

Find P(750 < X < 900)

Find P(1180 < X < 1360)
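A minimal sketch of the two-step procedure (standardize, then use the standard normal) applied to these exercises is shown below. The helper `standardize` and the variable names are illustrative.

```python
from statistics import NormalDist

MU, SIGMA = 1000, 180
Z = NormalDist()  # standard normal

def standardize(x):
    """Convert an SAT score to its z value, z = (x - mu) / sigma."""
    return (x - MU) / SIGMA

# Each probability is an area under the standard normal curve
# between the standardized endpoints.
p_below_800 = Z.cdf(standardize(800))
p_750_to_900 = Z.cdf(standardize(900)) - Z.cdf(standardize(750))
p_1180_to_1360 = Z.cdf(standardize(1360)) - Z.cdf(standardize(1180))

print(f"P(X < 800)         = {p_below_800:.4f}")
print(f"P(750 < X < 900)   = {p_750_to_900:.4f}")
print(f"P(1180 < X < 1360) = {p_1180_to_1360:.4f}")
```

The third exercise standardizes to exactly 1 < Z < 2, which makes it a convenient check against a printed table.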


How about the question of determining the SAT admission score for CMU so that the top 20% will receive admission?

Answer: X ~ N(1000, 180). The problem is to find the admission score x0 so that

P(X > x0) = .2

Here we are looking for a score, not a probability, so we reverse the problem-solving procedure.

A similar technique is applied:

1. Draw a normal curve and shade the area of interest.

2. Transform from X to Z.

3. Rewrite the problem in terms of Z.

4. Solve for the standardized value z0 by using the Z-table in reverse.

5. Transform z0 back to x0 by x0 = μ + σ z0.


To solve for the admission score x0 so that P(X > x0) = .2:

Draw the normal curve, shade the area of interest, and transform to Z.

.2 = P(X > x0) = P(Z > z0) implies P(0 < Z < z0) = .5 - .2 = .3

This is a form to which we can apply the Z-table.

Looking inside the table, find the probability closest to .3, which is .2995.

By the Z-table, .2995 = P(0 < Z < .84).

Therefore, z0 = .84, which is the standardized admission limit.

So, solving for x0, we have x0 = μ + σ z0 = 1000 + (180)(.84) = 1151.2

The CMU SAT admission limit will be about 1151.2.

(In actual application for setting the policy, we can use 1150 as the new admission standard.)
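The reverse look-up can be sketched with the inverse cumulative function `NormalDist.inv_cdf`, which returns the value below which a given proportion lies. The variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 1000, 180
Z = NormalDist()  # standard normal

# P(X > x0) = .2 means P(Z < z0) = .8, so z0 is the 80th percentile of Z.
z0 = Z.inv_cdf(0.80)
x0 = mu + sigma * z0

# The same back-transformation with the rounded table value z0 = .84.
x0_table = mu + sigma * 0.84

print(f"z0 = {z0:.4f}, x0 = {x0:.1f}; with table value .84: x0 = {x0_table:.1f}")
```

The software value of z0 carries more decimal places than the table value, so the two limits differ by a fraction of a point; either rounds to the practical cut-off of 1150 suggested in the text.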


Hands-on activities:

Q-a: For the SAT example, X ~ N(1000, 180), suppose a university admits only the top 5%. Find its admission limit.

Q-b: Find the 5th percentile of the SAT score.

Q-c: Find the Q3 SAT score (75th percentile).


Use Minitab to compute cumulative probabilities and percentiles for a normal distribution

1. Go to Calc, choose Probability Distributions, then select Normal.

2. In the dialog box, choose the quantity you are computing: Probability density gives f(x); Cumulative probability gives P(X < a) for any given a; Inverse cumulative probability gives the 100p-th percentile x0, so that P(X < x0) = p.

3. Enter the mean and s.d. By default, the distribution is N(0, 1).

4. To compute a cumulative probability, you need to provide the 'a' values, which may be entered in a column, e.g., C3, or given as a single constant 'a'.

5. To compute an inverse cumulative probability, you need to provide the cumulative probabilities, which must be in (0, 1).


Methods for detecting departures of the distribution of a response variable from the normal distribution

Consider the example of the blood pressure data. From the histogram and the normal curve superimposed on it using Minitab, we can see that the blood pressure, generally speaking, follows a normal curve. However, there seem to be a few unusually high blood pressures. The question is: how well does the blood pressure follow a normal curve?

The superimposed normal curve helps us quickly identify serious departures from normal. However, if the departure is not very serious, it is difficult to detect simply by observing the shape of a histogram.

We will discuss three ways of checking the normality of a response:

1. Superimposing a normal curve on the histogram,

2. Probability plot,

3. Numerical methods for testing the degree of departure from normal.


1. Superimposing a normal curve on a histogram of the blood pressure data for 1900 young adults aged 15-20:

[Figure: Histogram of Systolic Blood Pressure, with Normal Curve, for individuals aged 15-20. X-axis: Systolic Blood Pressure (60 to 210); Y-axis: Frequency (0 to 400).]

How to construct this plot using Minitab:

• Go to Stat, choose Basic Statistics, then choose Display Descriptive Statistics.

• Enter the variable and click on the 'Graphs' option.

• The Graphs dialog offers a variety of choices. One of them is Histogram with Normal Curve.

The normal curve indicates there are a few large blood pressure measurements. In fact, the descriptive statistics show that the highest value is 210, which is far more than 2 s.d. above the average. This suggests that a value of 210 is very rare. One should check immediately whether it is a typo.
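To put a number on "very rare", the sketch below standardizes the value 210 and computes the tail probability under the fitted normal curve. It assumes the rounded values mean 115 and s.d. 14 quoted earlier in the module and a sample of about 1900; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd = 115, 14      # rounded values quoted earlier in the module (assumed here)
highest = 210
n = 1900                # approximate sample size

# How many standard deviations above the mean is the largest reading?
z = (highest - mean) / sd

# Probability of a reading this large or larger if the normal model were exact.
upper_tail = 1 - NormalDist(mean, sd).cdf(highest)

# Expected number of such readings in a sample of this size.
expected_count = n * upper_tail

print(f"z = {z:.2f}, P(X >= 210) = {upper_tail:.2e}, expected count = {expected_count:.2e}")
```

A reading nearly seven standard deviations above the mean is essentially impossible under the normal model, so it points either to a recording error or to a distribution with a heavier right tail than the normal curve.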


2. Normal Probability Plot: It is a two-dimensional plot. The Y-axis is the estimated cumulative probability, computed by:

$$\frac{\text{rank} - 3/8}{n + 1/4}$$

The X-axis is the original data in ascending order.

Diagnosis:

When the data follow a normal curve, the plotted points should follow a straight line.

When data are skewed-to-right, the plot would look like:

When data are skewed-to-left, the plot would look like:
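The plotting positions (rank - 3/8)/(n + 1/4) can be sketched as follows. The readings are hypothetical values made up for illustration, ties are not given any special treatment, and the function name `plotting_positions` is an assumption of this sketch.

```python
from statistics import NormalDist

def plotting_positions(data):
    """Pair each sorted value with its estimated cumulative probability
    (rank - 3/8) / (n + 1/4)."""
    n = len(data)
    ordered = sorted(data)
    return [(x, (rank - 3 / 8) / (n + 1 / 4))
            for rank, x in enumerate(ordered, start=1)]

# Hypothetical readings, for illustration only (not from the module's data set).
readings = [118, 104, 131, 112, 109, 125, 115, 98, 121, 140]

Z = NormalDist()
for x, p in plotting_positions(readings):
    # The probability axis of the plot is spaced by normal scores, so plotting
    # x against Z.inv_cdf(p) should look roughly linear for normal data.
    print(f"{x:5d}  p = {p:.3f}  normal score = {Z.inv_cdf(p):+.3f}")
```

Plotting the data values against the normal scores gives the same picture as the probability plot: an approximately straight line for normal data and a curved pattern for skewed data.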

[Figure: Normal Probability Plot for the Blood Pressure Data. X-axis: Systolic Blood Pressure (100 to 200); Y-axis: Probability (.001 to .999). Average: 114.590, StDev: 14.0595, N: 1909. Anderson-Darling Normality Test: A-Squared: 11.502, P-Value: 0.000. Reference probabilities 0.10, 0.25, 0.50, 0.75, and 0.90 are marked at 96.241, 104.929, 114.582, 124.235, and 132.922.]


Based on the normal probability plot, the systolic blood pressure does not follow a normal curve. The pattern also shows that the distribution is somewhat skewed to the right.

3. Test statistic for testing whether the blood pressure follows a normal curve. Graphical methods are good for showing the pattern, and they give us a fairly clear picture that the data do not follow a normal distribution. Numerically, there are methods that test such a hypothesis. The test statistic is given in the same graph as the Normal Probability Plot. The Anderson-Darling Normality Test is presented here. The AD value is 11.5, and the corresponding p-value is .000.

Note: The p-value measures how consistent the data are with a normal distribution. The smaller the p-value, the stronger the evidence that the response variable does not follow a normal curve. A common cut-off point is 5%. In this case, the p-value is reported as .000, so it is clear that the distribution of systolic blood pressure is not normal.
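For readers who want to see what the A-Squared statistic measures, here is a minimal sketch of the usual Anderson-Darling formula for normality with the mean and s.d. estimated from the sample. It is written from the textbook definition, not from Minitab's source, so small numerical differences from Minitab output are possible; the function name and the sample data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling(data):
    """A-squared for normality, with mean and s.d. estimated from the sample:
    A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) * [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]
    """
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    x = sorted(data)
    total = 0.0
    for i in range(1, n + 1):
        lower = fitted.cdf(x[i - 1])        # F at the i-th smallest value
        upper = 1 - fitted.cdf(x[n - i])    # 1 - F at the i-th largest value
        total += (2 * i - 1) * (math.log(lower) + math.log(upper))
    return -n - total / n

# Hypothetical samples, for illustration only.
roughly_symmetric = [98, 104, 109, 112, 115, 118, 121, 125, 131, 140]
right_skewed = [1, 1, 1, 1, 2, 2, 3, 5, 9, 30]

print(f"A-squared (roughly symmetric) = {anderson_darling(roughly_symmetric):.3f}")
print(f"A-squared (right-skewed)      = {anderson_darling(right_skewed):.3f}")
```

Larger values of A-squared indicate a larger departure from the fitted normal curve, with extra weight on the tails, which is why a few extreme readings can inflate it noticeably.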


How to construct a Normal Probability Plot and carry out the Anderson-Darling Normality Test:

1. Go to Stat, choose Basic Statistics, then select Normality Test.

2. In the dialog, enter the variable name.

3. Reference Probabilities allows us to provide a column of cumulative probabilities so that the normal probability plot will show the percentile for each given cumulative probability.


• Note: As we have observed, all three methods give similar results. Based on the 1909 cases, the systolic blood pressure of young adults aged 15 to 20 does not follow a normal distribution.

• Note: Once we find that the distribution is not normal, it is critical to carry out some further analysis:

– Carefully check the data to see whether there are any typos.

– Examine the data using descriptive measures or other plots to identify extreme cases (details will be discussed in another module).

Hands-on Activity:

Use the above three methods to check the distribution of the Diastolic Blood Pressure data.


Actions to deal with extreme cases

For observational studies (such as surveys):

• The sample sizes are usually large, and it is often impossible to find the causes of extreme data after the data are collected. Therefore, it is critical to collect background and environmental variables that may have a potential impact on the results.

For experimental studies, such as inter-laboratory testing:

• It is important to look for possible causes of the extremes. The study is usually conducted in a controlled experimental environment, so it is more likely that we can find or explain the causes of the extremes.

Deletion of extremes vs. transformation to normal

One must be careful about deleting extremes, especially when we are not able to find any cause and the values are reasonable within the context of the study.

Such values may be an indication that the distribution of the response is skewed. In situations such as this, an appropriate approach is to transform the data to be closer to normal.


Method for transforming a variable to normal

When the data show a skewed distribution, statistical methods such as Analysis of Variance may not be valid. One approach is to apply a mathematical transformation to the variable so that the transformed variable will be closer to normal.

Some tips for variable transformation:

If a variable Y is skewed-to-right: then ln(Y), log10(Y), √Y, or 1/√Y will be closer to normal. (If there are zeros, add .5 to each data value first.)

If a variable Y is skewed-to-left: ln(1/Y), log10(1/Y), or Y^a with a > 1 will be closer to normal.
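As a rough illustration of the right-skew tips, the sketch below applies the square-root and natural-log transformations to a small set of hypothetical measurements and compares a simple skewness coefficient before and after. The data, the `skewness` helper, and the use of the moment coefficient are assumptions of this sketch; which transformation works best has to be judged on the actual data.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Moment coefficient of skewness: positive means a longer right tail,
    zero means symmetric."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical right-skewed measurements, for illustration only.
y = [0.8, 1.1, 1.3, 1.9, 2.4, 3.0, 3.6, 4.5, 6.2, 9.8, 14.5, 22.0]

transforms = [("Y", lambda v: v), ("sqrt(Y)", math.sqrt), ("ln(Y)", math.log)]
for name, f in transforms:
    transformed = [f(v) for v in y]
    print(f"{name:8s} skewness = {skewness(transformed):+.2f}")
```

A skewness closer to zero after transformation suggests a more symmetric shape, but it should be confirmed with a probability plot or a normality test, as the light bulb example that follows shows.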


An example of transformation: The lifetimes of 50 light bulbs were tested by leaving them on continuously until they burned out. The data were recorded in months. Here are the histograms and the normal probability plots of the raw data, the ln-transformed data, and the square-root-transformed data:

[Figure: Histogram of Sqrt(Y). X-axis: Sqrt(Y) (1.0 to 5.0); Y-axis: Frequency.]

[Figure: Histogram of Ln(Y). X-axis: Ln(Y) (-0.4 to 3.2); Y-axis: Frequency.]

[Figure: Histogram of Life Time. X-axis: Life Time (0 to 20); Y-axis: Frequency.]

The raw data is skewed-to-right.

The Ln transformation does not work well.

The Square-root transformation works well.


The normal probability plots and Anderson-Darling’s tests for the life-time data:

[Figure: Normal Probability Plot for Sqrt(Y). Average: 2.86016, StDev: 1.09027, N: 50. Anderson-Darling Normality Test: A-Squared: 0.430, P-Value: 0.297.]

[Figure: Normal Probability Plot for Ln(Y). Average: 1.93005, StDev: 0.886131, N: 50. Anderson-Darling Normality Test: A-Squared: 1.071, P-Value: 0.007.]

[Figure: Normal Probability Plot for the Life Time Data. Average: 9.34544, StDev: 6.29248, N: 50. Anderson-Darling Normality Test: A-Squared: 0.906, P-Value: 0.019.]

As the normal probability plots and the Normality test results indicate, the Sqrt(Y) is approximately normal. The other two are not.
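The p-values on these plots can be approximately reproduced from the A-Squared values. The sketch below uses a commonly cited small-sample adjustment and piecewise approximation attributed to D'Agostino and Stephens; that Minitab uses exactly these formulas is an assumption, and the function name is hypothetical.

```python
import math

def ad_p_value(a_squared, n):
    """Approximate p-value for the Anderson-Darling normality test
    (mean and s.d. estimated from the sample)."""
    # Small-sample adjustment of the statistic.
    a = a_squared * (1 + 0.75 / n + 2.25 / n ** 2)
    # Piecewise approximation to the p-value.
    if a >= 0.6:
        return math.exp(1.2937 - 5.709 * a + 0.0186 * a ** 2)
    if a > 0.34:
        return math.exp(0.9177 - 4.279 * a - 1.38 * a ** 2)
    if a > 0.2:
        return 1 - math.exp(-8.318 + 42.796 * a - 59.938 * a ** 2)
    return 1 - math.exp(-13.436 + 101.14 * a - 223.73 * a ** 2)

# A-Squared values reported on the three probability plots (N = 50 each).
reported = [("Sqrt(Y)", 0.430), ("Ln(Y)", 1.071), ("Life Time", 0.906)]
for label, a2 in reported:
    p = ad_p_value(a2, 50)
    verdict = "consistent with normal" if p > 0.05 else "not normal at the 5% level"
    print(f"{label:10s} A-Squared = {a2:.3f}  p ~ {p:.3f}  ({verdict})")
```

With the 5% cut-off, only the square-root-transformed data pass, which agrees with the conclusion stated in the text.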


Hands-on Activity

Analyze the distribution of variable GR36-Lab-Mean-1 in the TAPPI inter-laboratory testing study, and determine an appropriate transformation to make the data closer to a normal distribution.