Probability Distributions and Test of Hypothesis Ka-Lok Ng Dept. of Bioinformatics Asia University


Probability Distributions and Test of Hypothesis

Ka-Lok Ng

Dept. of Bioinformatics

Asia University


Normal Distribution

• Distribution of a random variable
• Statistical parameters – μ and σ


Central Limit Theorem

• Consider the following set of measurements for a given population: 55.20, 18.06, 28.16, 44.14, 61.61, 4.88, 180.29, 399.11, 97.47, 56.89, 271.95, 365.29, 807.80, 9.98, 82.73. The population mean is 165.570.
• Now consider two samples from this population.
• These two samples could have means very different from each other and also very different from the true population mean.
• What happens if we consider not just two samples, but all possible samples of the same size?
• The answer to this question is one of the most fascinating facts in statistics: the central limit theorem.
• It turns out that if we calculate the mean of each sample, those mean values tend to be distributed as a normal distribution, independently of the original distribution. The mean of this new distribution of the means is exactly the mean of the original population, and the variance of the new distribution is reduced by a factor equal to the sample size n.
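The statement about the mean and variance of the sample means can be checked directly on the 15 measurements above. The sketch below is only an illustration under two assumptions that the slides do not state: samples are drawn with replacement, and every ordered sample of size n = 2 is enumerated. The variable names are illustrative.

```python
from itertools import product
from statistics import mean, pvariance

# The 15 measurements listed on the slide, treated as the whole population.
population = [55.20, 18.06, 28.16, 44.14, 61.61, 4.88, 180.29, 399.11,
              97.47, 56.89, 271.95, 365.29, 807.80, 9.98, 82.73]

n = 2  # sample size (assumption: small enough to enumerate every sample)

# Assumption: sampling with replacement, so there are 15**n ordered samples.
sample_means = [mean(sample) for sample in product(population, repeat=n)]

pop_mean = mean(population)
pop_var = pvariance(population)          # population variance (divide by N)
mean_of_means = mean(sample_means)
var_of_means = pvariance(sample_means)

print(f"population mean        = {pop_mean:.3f}")
print(f"mean of sample means   = {mean_of_means:.3f}")
print(f"population variance/n  = {pop_var / n:.3f}")
print(f"variance of the means  = {var_of_means:.3f}")
```

With only n = 2 the histogram of the sample means is still far from normal; the normal shape emerges as n grows, while the two identities printed above hold for any n.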

Central Limit Theorem

• When sampling from a population with mean μ and variance σ², the distribution of the sample mean (the sampling distribution of X̄) will have the following properties:

• The distribution of X̄ will be approximately normal. The larger the sample, the more closely the sampling distribution resembles the normal distribution.

• The mean $\mu_{\bar{X}}$ of the distribution of X̄ will be equal to μ, the mean of the population from which the samples were drawn.

• The variance $\sigma_{\bar{X}}^2$ of the distribution of X̄ will be equal to σ²/n, the variance of the original population divided by the sample size. The quantity $\sigma_{\bar{X}}$ is called the standard error of the mean.

http://cnx.org/content/m11131/latest/
http://www.riskglossary.com/link/central_limit_theorem.htm
http://www.indiana.edu/~jkkteach/P553/goals.html


Statistical hypothesis testing

• The expression level of a gene in a given condition is measured several times, and the mean x̄ of these measurements is calculated. From many previous experiments, it is known that the mean expression level of the given gene in normal conditions is μ. How can you decide which genes are significantly regulated in a microarray experiment? For instance, one can apply an arbitrary cutoff such as a threshold of at least twofold up- or down-regulation.

One can formulate the following hypotheses:

1. The gene is up-regulated in the condition under study: x̄ > μ
2. The gene is down-regulated in the condition under study: x̄ < μ
3. The gene is unchanged in the condition under study: x̄ = μ
4. Something has gone awry during the lab experiments and the gene's measurements are completely off; the mean of the measurements may be higher or lower than normal: x̄ ≠ μ


Statistical hypothesis testing

When a hypothesis test is viewed as a decision procedure, two types of error are possible, depending on which hypothesis, H0 or H1, is actually true. If a test rejects H0 (and accepts H1) when H0 is true, it is called a type I error. If a test fails to reject H0 when H1 is true, it is called a type II error. The following table shows the results of the different decisions.

                     Decision
H0 is      Do not reject H0      Reject H0
True       Correct decision      Type I error
False      Type II error         Correct decision

Statistical hypothesis testing

• The next step is to generate two hypotheses. The two hypotheses must be mutually exclusive and all-inclusive.
• Mutually exclusive means that the two hypotheses cannot both be true at the same time.
• All-inclusive means that their union has to cover all possibilities.
• Expression ratios are converted into probability values to test the hypothesis that particular genes are significantly regulated.
• The null hypothesis H0 states that there is no difference in signal intensity across the conditions being tested.
• The other hypothesis (called the alternate or research hypothesis) is named H1. If we believe that the gene is up-regulated, the research hypothesis will be H1: x̄ > μ. The null hypothesis has to be mutually exclusive and also has to include all other possibilities; therefore, the null hypothesis will be H0: x̄ ≤ μ.
• One assigns a p-value for testing the hypothesis. The p-value is the probability of obtaining a measurement at least as extreme as the observed one just by chance, i.e. when the null hypothesis is true.
• The probability of rejecting the null hypothesis when it is true is the significance level α, which is typically set at 0.05; in other words, we accept that in 1 in 20 cases our conclusion can be wrong.

Statistical hypothesis testing

One-tail testing
• The alternative hypothesis specifies that the parameter is greater than the value specified under H0, e.g. H1: μ > 15. Such a hypothesis is called upper one-tail testing.

Example
• The expression level of a gene is measured 4 times in a given condition. The 4 measurements are used to calculate a mean expression level of x̄ = 90. It is known from the literature that the mean expression level of the given gene, measured with the same technology in normal conditions, is μ = 100 and the standard deviation is σ = 10. We expect the gene to be down-regulated in the condition under study and we would like to test whether the data support this assumption.
• The alternative hypothesis H1 is "the gene is down-regulated", i.e. H1: x̄ < μ; therefore, the null hypothesis is H0: x̄ ≥ μ.
• This is an example of a one-tail hypothesis (left-tail), in which we expect the values to be in one particular tail of the distribution.

[Figure: normal curve for the left-tail test, with the region labelled "Accept H0"]


Statistical hypothesis testing

• From the central limit theorem, the means of samples are distributed approximately as a normal distribution.
• Sample size n = 4, sample mean x̄ = 90, population mean μ = 100
• Standard deviation σ = 10
• Assume a significance level of 5%
• The null hypothesis is rejected if the computed p-value is lower than the significance level (0.05)
• We can calculate the value of Z as

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{90 - 100}{10/\sqrt{4}} = -2$$

The probability of having such a value just by chance, i.e. the p-value, is P(Z < -2) = 0.02275.
The computed p-value is lower than our significance threshold (0.02275 < 0.05); therefore we reject the null hypothesis. In other words, we accept the alternate hypothesis. We state that "the gene is down-regulated at the 5% significance level". This will be understood by the knowledgeable reader as a conclusion that is wrong in 5% of the cases or fewer.
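The left-tail calculation above can be reproduced in a few lines. This is a minimal sketch using Python's standard-library `statistics.NormalDist` in place of the printed normal table; the function name `z_score` and the variable names are illustrative, not taken from the slides.

```python
from math import sqrt
from statistics import NormalDist

def z_score(sample_mean, pop_mean, pop_sd, n):
    """Z statistic for a sample mean when the population s.d. is known."""
    return (sample_mean - pop_mean) / (pop_sd / sqrt(n))

# Values from the worked example: n = 4, sample mean 90, mu = 100, sigma = 10.
z = z_score(sample_mean=90, pop_mean=100, pop_sd=10, n=4)

# Left-tail test: p-value = P(Z < z) under the standard normal distribution.
p_value = NormalDist().cdf(z)

alpha = 0.05
reject_h0 = p_value < alpha

print(f"Z = {z:.2f}, p-value = {p_value:.5f}, reject H0: {reject_h0}")
```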


Normal distribution table

NORMDIST (Excel) – area under the curve, accumulated from the left-hand side

[Figure: standard normal curve with Z = 0 and Z = 2 marked]

Statistical hypothesis testing

Two-tail testing
• A novel gene has just been discovered. A large number of expression experiments measured the mean expression level of this gene as 100 with a standard deviation of 10. Subsequently, the same gene is measured 4 times in 4 cancer patients. The mean of these 4 measurements is 109. Can we conclude that this gene is differentially expressed in cancer?
• We do not know whether the gene will be up-regulated or down-regulated.
• Null hypothesis H0: μ = 100
• Alternative hypothesis H1: μ ≠ 100
• At a significance level of 5%, 2.5% is assigned to the left tail and 2.5% to the right tail
• Z = (109 – 100)/(10/√4) = 9/5 = 1.8
• P(Z ≥ 1.8) = 1 – P(Z ≤ 1.8) = 1 – 0.9641 = 0.0359 > 0.025; that is, the tail probability is higher than the 2.5% allowed in each tail (equivalently, the two-tail p-value 2 × 0.0359 = 0.0718 exceeds 0.05), so we cannot reject the null hypothesis

[Figure: two-tail test with a 2.5% rejection region in each tail]
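The two-tail decision can be sketched in the same way. The helper below is a hypothetical illustration (its name and return layout are not from the slides); it reports both the single-tail probability compared against α/2 and the doubled two-tail p-value compared against α, which lead to the same decision.

```python
from math import sqrt
from statistics import NormalDist

def two_tail_z_test(sample_mean, pop_mean, pop_sd, n, alpha=0.05):
    """Return (z, upper-tail probability, two-tail p-value, reject H0?)."""
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    upper_tail = 1.0 - NormalDist().cdf(abs(z))   # P(Z >= |z|)
    p_two_tail = 2.0 * upper_tail                 # both tails together
    return z, upper_tail, p_two_tail, p_two_tail < alpha

# Values from the example: mean 109 from 4 patients, mu = 100, sigma = 10.
z, upper_tail, p_two_tail, reject_h0 = two_tail_z_test(109, 100, 10, 4)

print(f"Z = {z:.2f}")
print(f"P(Z >= 1.8) = {upper_tail:.4f} (compare with 0.025)")
print(f"two-tail p  = {p_two_tail:.4f} (compare with 0.05)")
print(f"reject H0: {reject_h0}")
```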


Tests involving the mean – the t distribution

• Hypothesis testing
• Parametric testing – where the data are known or assumed to follow a certain probability distribution (e.g. normal distribution)
• Non-parametric testing – where no a priori knowledge is available and no such assumptions are made.
• The t distribution test, or Student's t distribution test, is a parametric test. It was developed by William S. Gosset, a 32-year-old research chemist employed by the famous Irish brewery Guinness.


Tests involving the mean – the t distribution

• Tests involving a single sample may focus on the mean of the sample (t-test, where the variance of the population is not known) or on the variance (χ²-test). The following hypotheses may be formulated if the testing regards the mean of the sample:

1. H0: μ = c, H1: μ ≠ c

2. H0: μ ≥ c, H1: μ < c

3. H0: μ ≤ c, H1: μ > c

• The first pair of hypotheses corresponds to two-tail testing, in which no a priori knowledge is available, while the second and the third correspond to one-tail testing, in which the value c is expected to be higher and lower than the population mean μ, respectively.

Tests involving the mean – the t distribution

• The expression level of a gene is known to have a mean of 18 in the normal human population. The following expression values have been obtained in five measurements: 21, 18, 23, 20, 18. Are these data consistent with the published mean of 18 at a 5% significance level?

• The population s.d. is not known, so we use a t-test and calculate the sample s.d. s to estimate σ
• H0: μ = 18, H1: μ ≠ 18 (two-tail test)

• Calculate the t-test statistic

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{20 - 18}{2.12/\sqrt{5}} = 2.11$$

Remember to use n − 1 when calculating the sample standard deviation s.
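As a quick check of the arithmetic, the sketch below recomputes the sample mean, the sample s.d. and the t statistic from the five measurements. It relies on `statistics.stdev`, which divides by n − 1 as the slide requires; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

values = [21, 18, 23, 20, 18]   # the five measurements from the example
mu0 = 18                        # published mean under H0

n = len(values)
x_bar = mean(values)
s = stdev(values)               # sample s.d., computed with n - 1 in the denominator
t_stat = (x_bar - mu0) / (s / sqrt(n))

print(f"n = {n}, mean = {x_bar}, s = {s:.2f}, t = {t_stat:.2f}")
```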


Tests involving the mean – the t distribution

Degrees of freedom: ν = 5 − 1 = 4. Using a table of the t-distribution with four degrees of freedom, the two-tail p-value associated with this test statistic is found to be slightly above 0.1 (the one-tail probability lies between 0.05 and 0.1). The 5% two-tail test corresponds to a critical value of 2.776. Since the p-value is greater than 0.05 (t-value = 2.11 < critical value = 2.776), the evidence is not strong enough to reject the null hypothesis of mean 18, so we accept H0.

t-distribution is symmetric
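When no table is at hand, the tail probability of the t-distribution can be approximated numerically. The sketch below integrates the Student t density with Simpson's rule and uses the symmetry noted above; it is an illustrative stand-in for a table lookup, not the method used in the slides, and the function names are hypothetical. A statistics library (for example `scipy.stats.t`) would normally be preferred where available.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|] plus the symmetry about 0."""
    a = abs(t)
    h = a / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                # integral of the density from 0 to |t|
    return 0.5 + area if t >= 0 else 0.5 - area

t_value, df = 2.108, 4                  # t statistic and d.o.f. from the example
p_one_tail = 1 - t_cdf(abs(t_value), df)
p_two_tail = 2 * p_one_tail

print(f"one-tail p = {p_one_tail:.4f}, two-tail p = {p_two_tail:.4f}")
print(f"check: P(T <= 2.776) with 4 d.o.f. = {t_cdf(2.776, 4):.4f}")
```

The last line is a sanity check against the table: the 5% two-tail critical value 2.776 should leave 97.5% of the probability to its left.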

The t-distribution table – cumulative probability starting from the left-hand side

Two-tail = 0.10, 0.05

The t-distribution table – Excel – TINV gives the two-tail critical value


Tests involving the mean – the t distribution

The expression level of a gene is known to have a mean of 225 in the normal human population. Expression values have been obtained in sixteen measurements, for which the sample mean and s.d. are found to be 241.5 and 98.7259 respectively. Is the mean of these data higher than the published mean at a 5% significance level?

• This is a right-hand one-tail test

• Null hypothesis H0: x̄ ≤ μ = 225

• Alternative hypothesis H1: x̄ > μ = 225

• t-score = (241.5 − 225)/[98.7259/sqrt(16)] = 0.6685
• Degrees of freedom = 15

• The 5% level corresponds to a critical value t0.05(15) of 1.753

• The t-score is less than the critical value, i.e. 0.6685 < 1.753.
• Based on the critical value, we cannot reject the null hypothesis (we accept H0).
• The gene expression data set is not significantly higher than the published mean of 225 at a 5% significance level
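The critical-value decision above amounts to one comparison. In this minimal sketch the critical value 1.753 is simply copied from the slide rather than computed, and the function and constant names are illustrative.

```python
from math import sqrt

def t_score(sample_mean, mu0, sample_sd, n):
    """One-sample t statistic when the population s.d. is unknown."""
    return (sample_mean - mu0) / (sample_sd / sqrt(n))

# Values from the example: 16 measurements, mean 241.5, s.d. 98.7259, mu0 = 225.
t_stat = t_score(sample_mean=241.5, mu0=225, sample_sd=98.7259, n=16)
dof = 16 - 1

T_CRIT_05_15 = 1.753   # t0.05(15), one-tail critical value quoted on the slide

# Right-hand one-tail test: reject H0 only if t exceeds the critical value.
reject_h0 = t_stat > T_CRIT_05_15

print(f"t = {t_stat:.4f}, d.o.f. = {dof}, reject H0: {reject_h0}")
```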

Tests involving the variance – the chi-square distribution

The expression level of a gene is known to have a variance σ² = 5000 in the normal human population. The same gene is measured 26 times and found to have s² = 9200. Is there evidence that the new measurement is different from the population at a 2% significance level?

• Unknown population mean, so we use the χ² test
• Null hypothesis H0: s² = σ² = 5000, that is, the newly measured variance is not different from that of the population
• Alternative hypothesis H1: s² ≠ σ² = 5000 (two-tail test)
• The new variable (score) is

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$

• This variable has the interesting property that if all possible samples of size n are drawn from a normal population with variance σ², and for each such sample the quantity (n − 1)s²/σ² is computed, these values will always form the same distribution. This distribution is a sampling distribution called the χ² (chi-square) distribution.

[Figure: two-tail χ² test, with "reject H0" regions in both tails (p = 0.99 and p = 0.01) and the "accept H0" region in between]

Tests involving the variance – the chi-square distribution

• If the sample s.d. s is close to the population s.d. σ, the value of χ² will be close to n − 1 (the degrees of freedom)
• If the sample s.d. s is very different from the population s.d. σ, the value of χ² will be very different from n − 1
• Let us use the χ² distribution to solve the above problem.

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{(26-1) \times 9200}{5000} = 46$$

• http://commons.bcit.ca/math/faculty/david_sabo/apples/math2441/section8/onevariance/chisqtable/chisqtable.htm

• The critical values are χ²0.01(25) = 44.314 and χ²0.99(25) = 11.524 (right-hand tail probabilities)
• The rejection regions are χ² ≤ 11.524 or χ² ≥ 44.314
• Since 46 > 44.314, we reject the null hypothesis
• The measurement is different from the population at a 2% significance level
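The two-tail variance test can be sketched as below. The two critical values are copied from the slide (table values for 25 degrees of freedom), not computed, and the names are illustrative.

```python
def chi2_statistic(n, sample_var, pop_var):
    """Chi-square score (n - 1) * s^2 / sigma^2 for a single-sample variance test."""
    return (n - 1) * sample_var / pop_var

# Values from the example: n = 26, s^2 = 9200, sigma^2 = 5000.
chi2 = chi2_statistic(n=26, sample_var=9200, pop_var=5000)

# Critical values quoted on the slide for 25 d.o.f. at the 2% level (1% per tail).
LOWER_CRIT = 11.524   # chi2_0.99(25)
UPPER_CRIT = 44.314   # chi2_0.01(25)

# Two-tail test: reject H0 if the score falls in either tail.
reject_h0 = chi2 <= LOWER_CRIT or chi2 >= UPPER_CRIT

print(f"chi-square = {chi2:.1f}, reject H0: {reject_h0}")
```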


The chi-square distribution

Excel – CHIINV uses the right-hand tail


Tests involving the variance – the chi-square distribution

The expression level of a gene is known to follow a normal distribution and to have a standard deviation (s.d.) of no more than 5 in the normal human population. The same gene is measured 9 times and found to have an s.d. of 7. Does this data set have a sample variance higher than the published variance at a 5% significance level?

• This is a right-hand one-tail test

• Null hypothesis H0: s² ≤ 25
• Alternative hypothesis H1: s² > 25

• χ² = (9 − 1) × 49/25 = 15.68

• Degrees of freedom = 8

• The 5% level corresponds to a critical value of 15.507

• The χ² value is larger than the critical value 15.507

• Based on the critical value, we can reject the null hypothesis.

• The gene does have an s.d. higher than the published value of 5 at a 5% significance level.
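For this example the right-tail probability can also be computed directly, because the χ² distribution with an even number of degrees of freedom has a closed-form tail: P(χ² ≥ x) = e^(−x/2) Σ (x/2)^k / k! for k = 0 … df/2 − 1. The sketch below uses that identity; it is an illustration rather than the slide's table lookup, it does not apply to odd degrees of freedom, and the function name is hypothetical.

```python
from math import exp, factorial

def chi2_sf_even_df(x, df):
    """Right-tail probability P(chi2 >= x); valid only for even, positive df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires even, positive df")
    half = x / 2.0
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))

# Values from the example: n = 9, sample s.d. 7, published s.d. 5.
n, s, sigma0 = 9, 7.0, 5.0
chi2 = (n - 1) * s ** 2 / sigma0 ** 2
p_value = chi2_sf_even_df(chi2, n - 1)

alpha = 0.05
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}, reject H0: {p_value < alpha}")
print(f"check: tail beyond 15.507 = {chi2_sf_even_df(15.507, 8):.4f}")
```

The check line confirms that the slide's critical value 15.507 leaves about 5% in the right-hand tail.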


Tests involving two samples – comparing means

The expression levels of the gene AC002378 measured for patients (P1–P6) and controls (C1–C6) are given below:

geneID      P1     P2     P3     P4     P5     P6
AC002378    0.66   0.51   1.12   0.83   0.91   0.50

geneID      C1     C2     C3     C4     C5     C6
AC002378    0.41   0.57   -0.17  0.50   0.22   0.71

• H0: μP = μC, H1: μP ≠ μC

• Mean gene expression level of patients, X̄P = 0.755
• Mean gene expression level of controls, X̄C = 0.373
• sP² = 0.059, sC² = 0.097

• To test whether the two samples have the same variance or not, we perform the F-test at a 5% level

• F = 0.059/0.097 = 0.61; each sample variance has n − 1 = 5 degrees of freedom (d.o.f. = 10 in total)
• F0.025(5,5) = 7.1464, F0.975(5,5) = 0.1399
• F lies between 0.1399 and 7.1464, so we accept the null hypothesis that the patients and controls have the same variance

Tests involving two samples – comparing means

• t-statistic of two independent samples with equal variances
• The t-score is

$$t = \frac{(\bar{X}_P - \bar{X}_C) - (\mu_P - \mu_C)}{\sqrt{s_{pool}^2\left(\frac{1}{n_P} + \frac{1}{n_C}\right)}} = \frac{(0.755 - 0.373) - 0}{\sqrt{0.078\left(\frac{1}{6} + \frac{1}{6}\right)}} = 2.359$$

• where

$$s_{pool}^2 = \frac{(n_P - 1)s_P^2 + (n_C - 1)s_C^2}{n_P + n_C - 2} = \frac{(6-1) \times 0.059 + (6-1) \times 0.097}{6 + 6 - 2} = 0.078$$

• The p-value, or the probability of having such a value by chance, is 0.0400. This value is smaller than the significance level 0.05, and therefore we reject the null hypothesis: the gene AC002378 is expressed differently between cancer patients and healthy subjects.
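The whole two-sample calculation can be reproduced from the raw expression values. The sketch below computes the variance ratio, the pooled variance and the t-score with the standard library. The critical value 2.228 is the usual table value for a 5% two-tail test with 10 degrees of freedom; it is an assumption brought in for the comparison and does not appear in the slides, which quote the p-value instead. The function and variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

# Expression values of AC002378 from the table (patients P1-P6, controls C1-C6).
patients = [0.66, 0.51, 1.12, 0.83, 0.91, 0.50]
controls = [0.41, 0.57, -0.17, 0.50, 0.22, 0.71]

def pooled_t(a, b):
    """Equal-variance two-sample t-score; returns (t, pooled variance, d.o.f.)."""
    na, nb = len(a), len(b)
    var_a, var_b = variance(a), variance(b)      # sample variances (n - 1)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, pooled, na + nb - 2

# Variance ratio used for the preliminary F-test.
f_ratio = variance(patients) / variance(controls)

t_stat, pooled_var, dof = pooled_t(patients, controls)

# Assumption: standard t-table value for a 5% two-tail test with 10 d.o.f.
T_CRIT_TWO_TAIL_10 = 2.228
reject_h0 = abs(t_stat) > T_CRIT_TWO_TAIL_10

print(f"F = {f_ratio:.3f}")
print(f"pooled variance = {pooled_var:.3f}, t = {t_stat:.3f}, d.o.f. = {dof}")
print(f"reject H0 at the 5% level: {reject_h0}")
```

Because the code works from unrounded variances, the t-score matches the slide's 2.359 more closely than a hand calculation from the rounded values 0.059 and 0.097 would.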