Statistics for bioinformatics Filtering microarray data

Statistics for bioinformatics

Filtering microarray data

Aims of filtering

• Suppose

– We have a set of 10000 genes whose expression is measured on our microarray chips.
– We are looking at an experiment where gene expression is measured in 11 cancer patients and 7 normal individuals.

• We want to know which genes have altered expression in cancerous cells (maybe they can be used as drug targets).

• Genes whose expression is similar between cancer and normal individuals are not interesting and we want to filter them out.

What will be discussed

• General background on statistics

– Distributions
– P-values, significance
– Hypothesis testing
– T-test
– Analysis of variance
– Nonparametric statistics

• Application of statistics to filtering microarray data

Distributions

• Distributions help to assign probabilities to subsets of the possible set of outcomes of an experiment.

• The distribution function $F:\mathbb{R}\to[0,1]$ of a random variable X is given by

$$F(x) = P(X \le x).$$

• Random variables can be discrete or continuous. X is discrete if it takes values in a countable subset of $\mathbb{R}$ (eg. number of heads in two coin tosses is 0, 1 or 2) and continuous if its distribution function can be written as the integral of an integrable function f:

$$F(x) = \int_{-\infty}^{x} f(u)\,du.$$

• f is the probability density function (pdf) of X (f=F’).

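Normal distribution

• Also known as Gaussian

• Symmetrical about the mean, “bell-shaped”

• Completely specified by mean $\mu$ and variance $\sigma^2$ – denoted X is $N(\mu,\sigma^2)$

• Can transform to standard form, $Z = (X-\mu)/\sigma$ - Z is N(0,1)

• Pdf is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty.$$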

Central limit theorem

• A lot of the statistical tests that we will discuss apply specifically for normal distributions…

• …however, the central limit theorem says:

• If $X_1, X_2, \ldots, X_n$ are (independent) items from a random sample drawn from any distribution with mean $\mu$ and positive variance $\sigma^2$ then

$$\sqrt{n}\,(\bar{X}_n-\mu)/\sigma$$

has a limiting distribution (as $n\to\infty$) which is normal with mean 0 and variance 1, where

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$

Central limit theorem

• For a sample $X_1, X_2, \ldots, X_n$ drawn from a normal distribution, $\sqrt{n}\,(\bar{X}_n-\mu)/\sigma$ is exactly normally distributed with mean 0 and variance 1.

• For other distributions, $\sqrt{n}\,(\bar{X}_n-\mu)/\sigma$ is approximately normally distributed with mean 0 and variance 1, for large enough n.

• This approximate normal distribution can be used to compute approximate probabilities concerning the sample mean, $\bar{X}_n$.

• In practice, convergence is often very rapid, eg. means of samples of 10 observations from the uniform distribution on [0,1] are very close to normal.
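The uniform-distribution claim can be checked with a small simulation. The sketch below is only illustrative: the seed and the number of simulated samples are arbitrary choices, and it uses the standard facts that the uniform distribution on [0,1] has mean 1/2 and variance 1/12.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)

n = 10                    # observations per sample, as on the slide
trials = 20000            # number of simulated samples (arbitrary choice)
mu = 0.5                  # mean of Uniform[0, 1]
sigma = (1 / 12) ** 0.5   # standard deviation of Uniform[0, 1]

# Standardised sample means: sqrt(n) * (sample mean - mu) / sigma
z_values = []
for _ in range(trials):
    sample = [random.random() for _ in range(n)]
    sample_mean = sum(sample) / n
    z_values.append(n ** 0.5 * (sample_mean - mu) / sigma)

z_mean = statistics.fmean(z_values)
z_sd = statistics.pstdev(z_values)
frac_within_1 = sum(abs(z) <= 1 for z in z_values) / trials

print(f"mean of Z: {z_mean:.3f}")
print(f"sd of Z:   {z_sd:.3f}")
print(f"fraction with |Z| <= 1: {frac_within_1:.3f} (N(0,1) gives about 0.683)")
```

If the approximation is good, the mean should be near 0, the standard deviation near 1, and roughly 68% of the standardised means should fall within one unit of zero.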

Chi-squared ($\chi^2$) distribution and F distribution

• If $X_1, X_2, \ldots, X_r$ are independent and N(0,1) then $\sum_{i=1}^{r} X_i^2$ has a Chi-squared distribution with r degrees of freedom.

• If you add independent chi-squared random variables $\chi^2_i$, with $r_i$ degrees of freedom, i=1,…,k, you get a chi-squared random variable with $\sum_{i=1}^{k} r_i$ degrees of freedom.

• Let $\chi^2_m$ and $\chi^2_n$ be independent variates distributed as chi-squared with m and n degrees of freedom. The ratio

$$F_{m,n} = \frac{\chi^2_m/m}{\chi^2_n/n}$$

has an F distribution with parameters m and n.

• NB F distribution completely determined by m & n.

• Useful for statistical tests - see later.
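Both constructions can be simulated directly from standard normal draws. This is a rough sketch under stated assumptions: the seed, the number of trials and the degrees of freedom are arbitrary, and the check relies on the standard results that a chi-squared variable with r degrees of freedom has mean r and that an F variable has mean n/(n-2) when n > 2.

```python
import random

random.seed(1)  # arbitrary seed for repeatability

def chi_squared_draw(r):
    """Sum of squares of r independent N(0,1) draws."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(r))

def f_draw(m, n):
    """Ratio of two independent chi-squared draws, each divided by its d.o.f."""
    return (chi_squared_draw(m) / m) / (chi_squared_draw(n) / n)

trials = 20000   # arbitrary number of simulated values
r = 3            # example degrees of freedom for the chi-squared check
m, n = 2, 15     # example parameters for the F check

chi_mean = sum(chi_squared_draw(r) for _ in range(trials)) / trials
f_mean = sum(f_draw(m, n) for _ in range(trials)) / trials

print(f"simulated chi-squared mean: {chi_mean:.3f} (theory: {r})")
print(f"simulated F mean:           {f_mean:.3f} (theory: {n / (n - 2):.3f})")
```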

Statistics

• What is a statistic? A function of one or more random variables that does not depend on any unknown parameter

– Eg. sample mean
– $Z=(X-\mu)/\sigma$ is not a statistic unless $\mu$ & $\sigma$ are known

• If interested in a random variable X, may only have partial knowledge of its distribution. Can sample & use statistics to infer more info, eg. estimate unknown parameters.

• Primary purpose of theory of statistics: provide mathematical models for experiments involving randomness; make inferences from noisy data.

Hypothesis testing

• A statistical hypothesis is an assertion about the distribution of one or more random variables (r.v.s).

• In hypothesis testing, we have a null hypothesis H0 (eg. suppose we have a r.v. which we know is $N(\mu,1)$ & our null is that $\mu=0$) which we want to test against an alternative hypothesis H1 (eg. $\mu=1$).

• The test is a rule for deciding based on an experimental sample – usually ask if a particular statistic is in some acceptance region or in the rejection (also called critical) region; if in acceptance region keep the null, else reject.

• Test has power function which maps a potential underlying distribution for the r.v. to the probability of rejecting the null hypothesis given that distribution.
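For the $N(\mu,1)$ example the power function can be written down explicitly. The sketch below assumes a test based on a single observation X with a one-sided rejection region X > c, where c is chosen to give a 5% significance level; the slide does not fix these details, so they are illustrative assumptions.

```python
from statistics import NormalDist

standard_normal = NormalDist(0.0, 1.0)

alpha = 0.05
# Assumed rejection region: reject H0 (mu = 0) when X > critical_value.
critical_value = standard_normal.inv_cdf(1 - alpha)

def power(mu):
    """Probability of rejecting H0 when the single observation X is N(mu, 1)."""
    return 1.0 - standard_normal.cdf(critical_value - mu)

print(f"critical value: {critical_value:.3f}")
print(f"power at mu = 0 (equals the significance level): {power(0.0):.3f}")
print(f"power at mu = 1 (the alternative):               {power(1.0):.3f}")
```

With only one observation the power at the alternative is low, which is one reason real tests use statistics computed from larger samples.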

Significance and P-values

• Significance level of a hypothesis test is the maximum value (actually supremum) of the power function of the test if H0 is true - ie. the worst case probability of rejecting the null if it is true. Typical values are 0.01 or 0.05 (often expressed as 1% or 5%)

– NB some texts refer to 95% significance, which by my definition would be 5%.

• P-value = The probability, if the null hypothesis is true, that a statistic would assume a value greater than or equal to the observed value strictly by chance.

– Eg. suppose we sample 1 value from our normal distn. with variance 1 and use this as our statistic. If the sample value is 0.9, this has P-value 0.184, since $P(X \le 0.9)=0.816$ for the null hypothesis N(0,1). If we were testing at 5% significance, we would keep the null, since our P-value is > 0.05.
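The P-value in this example can be reproduced from the N(0,1) distribution function. A minimal sketch using Python's standard library:

```python
from statistics import NormalDist

observed = 0.9                         # the single sampled value
null = NormalDist(mu=0.0, sigma=1.0)   # null hypothesis: N(0, 1)

# One-sided P-value: probability of a value at least as large as the observed one.
p_value = 1.0 - null.cdf(observed)
keep_null = p_value > 0.05             # testing at 5% significance

print(f"P(X <= 0.9) = {null.cdf(observed):.3f}")
print(f"P-value     = {p_value:.3f}")
print("keep the null" if keep_null else "reject the null")
```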

Student t-test

• Suppose you have a sample $X_1,\ldots,X_n$ of independent random variables each with distribution $N(\mu,\sigma^2)$, then

– Sample mean, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, has distribution $N(\mu,\sigma^2/n)$
– Sample variance, $S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2$: $nS^2/\sigma^2$ has distribution $\chi^2(n-1)$
– $\bar{X}$ and $S^2$ are stochastically independent

• Suppose you don’t know the actual mean and variance. If you want to test (at some significance level) whether the actual mean takes a certain value then you can’t look up P-values directly from the sample mean because you don’t know $\sigma^2/n$.

Student’s t-test

• Consider instead the t-ratio (t-statistic), given by

$$T = \frac{\bar{X}-\mu}{S/\sqrt{n-1}} = \frac{U}{\sqrt{V/(n-1)}},$$

where $U = \sqrt{n}\,(\bar{X}-\mu)/\sigma$ is N(0,1) and $V = nS^2/\sigma^2$ is $\chi^2(n-1)$. [S is the sample standard deviation]

• So by dividing $\bar{X}-\mu$ by an estimate of the standard deviation of $\bar{X}$, we have eliminated the unknown $\sigma$.

• This statistic has a “t distribution with n-1 degrees of freedom”.

Student’s t-test

• A one-sample t-test compares the mean of a single column of numbers against a hypothetical mean $\mu_0$ you define:

– H0: $\mu=\mu_0$
– H1: $\mu\ne\mu_0$

• Assume H0 is true and calculate the t-statistic:

$$T = \frac{\bar{X}-\mu_0}{S/\sqrt{n-1}}.$$

• A P-value is calculated from the t-statistic, using the t distribution with n-1 degrees of freedom. This value is a measure of the significance of the deviation of the sample mean (of the column of numbers) from the hypothetical mean. Normal way of assessing significance is to use a look-up table [cf example in next section].
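The one-sample statistic can be computed in a few lines. This is a sketch only: the sample values and the hypothetical mean of 4 are made up for illustration, and the sample variance uses the divisor n (as in these slides), which is why the denominator contains n-1 rather than n.

```python
import math

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample variance with divisor n, following the convention of the slides.
    s_squared = sum((x - mean) ** 2 for x in sample) / n
    t = (mean - mu0) / math.sqrt(s_squared / (n - 1))
    return t, n - 1

# Illustrative data (not from a real experiment).
t_stat, df = one_sample_t([11, 14, 7, 8], mu0=4)
print(f"T = {t_stat:.3f} with {df} degrees of freedom")
# A t table gives 3.182 as the two-sided 5% critical value for 3 d.o.f.
```

Here |T| is larger than the tabulated critical value, so at 5% significance the null hypothesis that the mean equals 4 would be rejected for this made-up sample.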

Two-sample T-test

• A two-sample t-test compares the means of two columns of numbers (independent samples) against one another on the assumption that they are normally distributed with the same (although unknown) variance $\sigma^2$.

• Suppose we have a sample $X_1,\ldots,X_n$ and another $Y_1,\ldots,Y_m$ drawn from $N(\mu_1,\sigma^2)$ and $N(\mu_2,\sigma^2)$ respectively, then the difference in sample means is distributed as $N(\mu_1-\mu_2,\ \sigma^2(1/n+1/m))$ and the t-ratio is given by

$$T = \frac{(\bar{X}-\bar{Y})-(\mu_1-\mu_2)}{\sqrt{\dfrac{nS_X^2+mS_Y^2}{n+m-2}\left(\dfrac{1}{n}+\dfrac{1}{m}\right)}}$$

Two-sample T-test

• We lay out our null and alternative hypotheses:

– H0: $\mu_1=\mu_2$
– H1: $\mu_1\ne\mu_2$

• Assume H0 is true and calculate the T statistic:

$$T = \frac{\bar{X}-\bar{Y}}{\sqrt{\dfrac{nS_X^2+mS_Y^2}{n+m-2}\left(\dfrac{1}{n}+\dfrac{1}{m}\right)}}$$

• The T statistic follows a t-distribution with n+m-2 degrees of freedom.

Two-sample T test

• From the T statistic we can calculate a P-value, using the t-distribution with n+m-2 degrees of freedom. If the P-value is smaller than the desired significance level (|T| greater than a critical value), then reject the null hypothesis (there is a significant difference in means between the two samples).

• Usually we just see if |T| exceeds a critical value, corresponding to some significance level, by looking up in a table. (Often the significance level is 5%, sometimes written 95%.)

• [Example in next section of lecture].
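As a rough sketch of the two-sample calculation, the function below computes the pooled-variance T statistic under H0. The two small samples are illustrative values, and the function name is a made-up helper, not part of any library.

```python
import math

def two_sample_t(x, y):
    """Pooled-variance t statistic and degrees of freedom for H0: equal means."""
    n, m = len(x), len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / m
    # n * S_X^2 and m * S_Y^2 are just the sums of squared deviations.
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    pooled_variance = (ss_x + ss_y) / (n + m - 2)
    t = (mean_x - mean_y) / math.sqrt(pooled_variance * (1 / n + 1 / m))
    return t, n + m - 2

# Illustrative samples with means 10 and 4.
t_stat, df = two_sample_t([11, 14, 7, 8], [2, 9, 0, 5])
print(f"T = {t_stat:.3f} with {df} degrees of freedom")
# A t table gives 2.447 as the two-sided 5% critical value for 6 d.o.f.
significant = abs(t_stat) > 2.447
print("reject H0" if significant else "keep H0")
```

For these values |T| falls just short of the tabulated critical value, so the difference in means would not be declared significant at the 5% level despite looking large; the within-sample variances are what decide it.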

Two-sample T-test

http://trochim.human.cornell.edu/kb/stat_t.htm

Are the sample means different?

The significance of the difference in means depends on the variances.

Analysis of variance (ANOVA)

• Another test to work out if the means of a set of samples are the same is called analysis of variance (ANOVA).

– Eg used for working out whether the expression of gene A in a microarray experiment is significantly different in cells from patients of cancer type A, cancer type B and in normal patients.

• For two groups (eg. cancer and normal), ANOVA turns out to be equivalent to a T test, but can use ANOVA for more than two samples.

One-way ANOVA

• The assumptions of analysis of variance are that the samples of interest are normally distributed, independent & have the same variance; however, research shows that the results of hypothesis tests using ANOVA are fairly robust to these assumptions being violated. If this happens, ANOVA tends to be conservative, ie. it may fail to reject the null hypothesis of equal means when it actually should; it will thus tend to underestimate significant effects of eg. drug response.

• Suppose we have m samples, with the jth sample given by $X_{1j}, \ldots, X_{n_j j}$, from distributions $N(\mu_j,\sigma^2)$, where $\sigma^2$ is the same for each but unknown.

• The null hypothesis is H0: $\mu_1=\mu_2=\ldots=\mu_m=\mu$, with $\mu$ unspecified.

• H1: at least one mean is different.

One-way ANOVA

• We will test the hypothesis using two different estimates of the variance.

• One estimate (called the Mean Square Error or "MSE" for short) is based on the variances within the samples. The MSE is an estimate of $\sigma^2$ whether or not the null hypothesis is true.

• 2nd estimate (Mean Square Between or "MSB" for short) is based on the variance of the sample means. The MSB is only an estimate of $\sigma^2$ if the null hypothesis is true.

• If the null hypothesis is true, then MSE and MSB should be about the same since they are both estimates of the same quantity ($\sigma^2$); however, if H0 is false then MSB can be expected to be > MSE since MSB is estimating a quantity larger than $\sigma^2$.

Variance between groups

• Let $\bar{X}_j$ represent the sample mean of the jth group (sample) and $\bar{X}$ the “grand mean” of all elements from all the groups. The variance between groups measures the deviations of the group means around the grand mean.

• Sum of squares between groups (SSB):

$$SS_B = \sum_{j=1}^{m} n_j(\bar{X}_j-\bar{X})^2$$

• [where $\bar{X}_j = \frac{1}{n_j}\sum_{i=1}^{n_j}X_{ij}$, $\bar{X} = \frac{1}{N}\sum_{j=1}^{m} n_j\bar{X}_j$, $N=\sum_{j=1}^{m}n_j$.]

• The variance between groups, also known as Mean square between (MSB), is given by the sum of squares divided by the degrees of freedom between (dfB):

$$MSB = \frac{SS_B}{df_B}$$

where dfB=m-1.

Variance within groups

• Here we want to know the total variance due to deviations within groups.

• Sum of squares within groups (SSW):

$$SS_W = \sum_{j=1}^{m}\sum_{i=1}^{n_j}(X_{ij}-\bar{X}_j)^2$$

• To get the variance within, also known as mean squared error (MSE), we must divide by the degrees of freedom within dfW = N-m. Roughly speaking this is because we have used up m degrees of freedom in estimating the group means (by their sample values) and so only have N-m independent ones left to estimate this variance:

$$MSE = \frac{SS_W}{df_W}$$

F-statistics

• The F-statistic is the ratio of the variance between groups to the variance within groups:

$$F = \frac{MSB}{MSE} = \frac{\text{estimate of variance between}}{\text{estimate of variance within}}.$$

• If the F-statistic is sufficiently large then we will reject the null hypothesis that the means are equal.

• Under the null hypothesis, the F-statistic is distributed according to an F distribution with degrees of freedom for the numerator = dfB and degrees of freedom for the denominator = dfW, ie. $F_{m-1,N-m}$. We can look up in an F table, or calculate from the F distribution with these parameters, the P-value corresponding to a given value of the statistic. We reject the null if this P-value is less than our significance level.
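The whole one-way calculation (SSB, SSW, the two mean squares and F) fits in one short function. This is a minimal sketch: the function name is a made-up helper and the three groups are illustrative values, not experimental data.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    m = len(groups)
    sizes = [len(g) for g in groups]
    total_n = sum(sizes)
    group_means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / total_n

    # Sum of squares between groups: weighted deviations of group means.
    ss_between = sum(n_j * (mean_j - grand_mean) ** 2
                     for n_j, mean_j in zip(sizes, group_means))
    # Sum of squares within groups: deviations from each group's own mean.
    ss_within = sum((x - mean_j) ** 2
                    for g, mean_j in zip(groups, group_means) for x in g)

    df_between, df_within = m - 1, total_n - m
    msb = ss_between / df_between
    mse = ss_within / df_within
    return msb / mse, df_between, df_within

# Illustrative data: two groups, then the same two plus a third.
group_a, group_b, group_c = [11, 14, 7, 8], [2, 9, 0, 5], [6, 4, 8, 6]

f2, dfb2, dfw2 = one_way_anova([group_a, group_b])
f3, dfb3, dfw3 = one_way_anova([group_a, group_b, group_c])
print(f"two groups:   F = {f2:.3f} on ({dfb2}, {dfw2}) d.o.f.")
print(f"three groups: F = {f3:.3f} on ({dfb3}, {dfw3}) d.o.f.")
```

For the two-group case the F value is the square of the pooled two-sample T statistic for the same data, which is the sense in which ANOVA and the T test agree for two groups.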

Two-way analysis of variance

• What analysis of variance actually does is to split the squared deviation from the grand mean into 2 parts:

$$\sum_{i,j}(X_{ij}-\bar{X})^2 = \sum_{i,j}(X_{ij}-\bar{X}_j)^2 + \sum_{i,j}(\bar{X}_j-\bar{X})^2$$

• In order to estimate the mean from a sample we actually find a value which minimizes the sum of squared residuals. Eg. to find the group means $\bar{X}_j$ we use values which minimize the second term in the equation above (the within-group sum) and to find the grand mean we minimize the LHS term.

• The values of these sums of squared residuals when the means take their maximum likelihood values (the variance terms above) give a measure of the likelihood of the means taking those values. So, as we have seen, the variances can be used to see how likely certain hypotheses about the mean are.
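The split of the total sum of squares follows from adding and subtracting the group mean inside the square. A short derivation, using the same notation as the one-way ANOVA slides ($\bar{X}_j$ for group means, $\bar{X}$ for the grand mean, $n_j$ for group sizes):

```latex
\begin{align*}
\sum_{i,j}(X_{ij}-\bar{X})^2
  &= \sum_{i,j}\bigl[(X_{ij}-\bar{X}_j)+(\bar{X}_j-\bar{X})\bigr]^2 \\
  &= \sum_{i,j}(X_{ij}-\bar{X}_j)^2
     + \sum_{i,j}(\bar{X}_j-\bar{X})^2
     + 2\sum_{j}(\bar{X}_j-\bar{X})\sum_{i}(X_{ij}-\bar{X}_j) \\
  &= SS_W + SS_B .
\end{align*}
```

The cross term vanishes because $\sum_i (X_{ij}-\bar{X}_j)=0$ within every group, and the middle term equals $SS_B$ because $(\bar{X}_j-\bar{X})^2$ is repeated $n_j$ times in the sum over i.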

Two-way analysis of variance

• Measures of the relative sizes of the LHS term to the 2nd term tell us how good a fit the single parameter model with all means equal is compared to the multiple means model.

• We use some degrees of freedom (independent sample data) to estimate the means and other d.o.f.s to see how good our hypotheses about the means are (via estimation of the variances).

• Suppose now that we have 2 different factors affecting our microarray samples: eg. yeast cells grown in different concentrations of glucose at different temperatures.

• Our model for the expression of gene A might involve both factors influencing the mean…

Two-way analysis of variance

• We suppose that the sample at temperature j with glucose concentration k is $N(\mu_{jk},\sigma^2)$ with

$$\mu_{jk} = a + b_j + c_k, \qquad \sum_j b_j = \sum_k c_k = 0.$$

• According to our model, the mean expression level can vary both with temperature and with glucose concentration.

• If we want to test whether temperature affects gene expression level at 5% significance, then we take:

H0: $b_j=0$ for all j

and proceed in a similar manner (although with different components of variance in the F-statistic) to before.

• Clearly this can be extended to more than 2 factors - see Kerr et al (handout for homework).
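One way the temperature test might look in code is sketched below. It rests on assumptions the slide does not spell out: exactly one measurement per (temperature, glucose) combination, no interaction term, and the usual additive-model decomposition in which the residual sum of squares has (J-1)(K-1) degrees of freedom. The table values and the function name are made up.

```python
def two_way_f_for_rows(table):
    """F statistic for the row factor in an additive two-way layout.

    table[j][k] is the single observation at row level j (temperature)
    and column level k (glucose concentration).
    """
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(table[j][k] for j in range(n_rows)) / n_rows
                 for k in range(n_cols)]

    # Variation explained by the row factor (the b_j terms).
    ss_rows = n_cols * sum((rm - grand) ** 2 for rm in row_means)
    # Residual variation after removing both row and column effects.
    ss_error = sum((table[j][k] - row_means[j] - col_means[k] + grand) ** 2
                   for j in range(n_rows) for k in range(n_cols))

    df_rows = n_rows - 1
    df_error = (n_rows - 1) * (n_cols - 1)
    f_stat = (ss_rows / df_rows) / (ss_error / df_error)
    return f_stat, df_rows, df_error

# Made-up expression values: 2 temperatures (rows) x 3 glucose levels (columns).
expression = [[1, 2, 6],
              [3, 6, 6]]
f_temp, df1, df2 = two_way_f_for_rows(expression)
print(f"F for temperature = {f_temp:.2f} on ({df1}, {df2}) d.o.f.")
```

The resulting F value would then be compared with an F table for the stated degrees of freedom, exactly as in the one-way case.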

Nonparametric statistics

• So far we have looked at statistical tests which are valid for normally-distributed data.

• If we know that our data is (approximately) Gaussian (eg. in large sample-size limits by Central Limit Theorem) these are useful and easy to use.

• If our data deviates a lot from normal then we need other techniques.

• Nonparametric techniques make no assumptions about the underlying distributions.

• We will briefly discuss such an example: a rank randomization test equivalent to the Mann-Whitney U-test.

Randomization test

• Best described by example:

        Group 1   Group 2
           11         2
           14         9
            7         0
            8         5
Mean       10         4

• Want to know if the two groups have significantly different means

1. Work out difference in means
2. Work out how many ways there are of dividing the total sample into two groups of four.
3. Count how many of these lead to a difference at least as large as that of the original two groups.

Randomization tests

• Difference in means is 6

• There are 70=8!/(4!4!) ways of dividing the data

• There are only two other combinations that give a difference in means which is as large or larger: Group 1 = (8, 9, 11, 14) and Group 1 = (7, 9, 11, 14)

• Probability of getting a difference in means in favour of group 1 (one-tailed test) as high as the original is therefore 3/70≈0.0429. There are also 3 combinations that give differences of 6 or more in favour of group 2. So the 2-tailed p-value is 6/70≈0.0857.
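The counting in this example is small enough to enumerate exhaustively. The sketch below lists every way of choosing group 1 from the pooled data; the function name is a made-up helper, and the one-tailed count assumes the observed difference is in favour of group 1.

```python
from itertools import combinations

def randomization_test(group1, group2):
    """Exact randomization test on the difference in means.

    Returns (observed difference, one-tailed count, two-tailed count, splits).
    """
    pooled = list(group1) + list(group2)
    n1, n2 = len(group1), len(group2)
    total = sum(pooled)
    observed = sum(group1) / n1 - sum(group2) / n2

    one_tailed = two_tailed = splits = 0
    # Enumerate by index so that repeated values count as distinct items.
    for chosen in combinations(range(len(pooled)), n1):
        s1 = sum(pooled[i] for i in chosen)
        diff = s1 / n1 - (total - s1) / n2
        splits += 1
        if diff >= observed - 1e-12:            # as large or larger, same direction
            one_tailed += 1
        if abs(diff) >= abs(observed) - 1e-12:  # as large or larger, either direction
            two_tailed += 1
    return observed, one_tailed, two_tailed, splits

observed, one_tailed, two_tailed, splits = randomization_test(
    [11, 14, 7, 8], [2, 9, 0, 5])
print(f"observed difference in means: {observed}")
print(f"one-tailed: {one_tailed}/{splits} = {one_tailed / splits:.4f}")
print(f"two-tailed: {two_tailed}/{splits} = {two_tailed / splits:.4f}")
```

The counts include the original arrangement itself, which is why the one-tailed probability is 3/70 rather than 2/70.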

Mann-Whitney U test

• The problem with randomization tests is that as the number of samples and groups increases, the number of possible ways of dividing becomes extremely large- thus randomization tests are hard to compute.

• A simplification involves replacing the data by ranks (ie. the smallest value is replaced by 1, the next by 2, …). A randomization test is then performed on the ranks:

Data:
        Group 1   Group 2
           11         2
           14         9
            7         0
            8         5

Ranks:
        Group 1   Group 2
            7         2
            8         6
            4         1
            5         3

Rank randomization test

• Calculate the difference in the summed ranks of the two groups: 12 here.

• The problem is then to work out how many of the 70 ways of rearranging the numbers 1,…,8 into two groups give a difference in group sum which is at least 12 (one-tailed; has modulus at least 12 for two-tailed).

• This problem doesn’t depend on the exact data, so standard values can be tabulated. For a given data set just use a lookup table.

• The rank randomization test for the differences between two groups is called the Wilcoxon Rank Sum test. It is equivalent to the Mann-Whitney U-test, although the latter uses a different test statistic.

• Clearly information is lost in converting from real data to ranks so the test is not as powerful as randomization tests, but is easier to compute.
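For a data set this small the rank version can also be enumerated rather than looked up. The sketch below converts the example data to ranks and counts the splits of the ranks 1,…,8 whose rank-sum difference is at least as large as the observed one. It assumes there are no tied values (ties would need averaged ranks), and the function names are made-up helpers.

```python
from itertools import combinations

def ranks_of(values):
    """Rank 1 for the smallest value, 2 for the next, ... (assumes no ties)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def rank_randomization_test(group1, group2):
    """Return (ranks1, ranks2, observed difference, one-tailed, two-tailed, splits)."""
    pooled_ranks = ranks_of(list(group1) + list(group2))
    n1 = len(group1)
    ranks1, ranks2 = pooled_ranks[:n1], pooled_ranks[n1:]
    total = sum(pooled_ranks)
    observed = sum(ranks1) - sum(ranks2)

    one_tailed = two_tailed = splits = 0
    for chosen in combinations(pooled_ranks, n1):
        diff = 2 * sum(chosen) - total   # rank sum of chosen minus rank sum of the rest
        splits += 1
        one_tailed += diff >= observed
        two_tailed += abs(diff) >= abs(observed)
    return ranks1, ranks2, observed, one_tailed, two_tailed, splits

ranks1, ranks2, observed, one_tailed, two_tailed, splits = rank_randomization_test(
    [11, 14, 7, 8], [2, 9, 0, 5])
print(f"ranks: group 1 {ranks1}, group 2 {ranks2}")
print(f"observed difference in rank sums: {observed}")
print(f"one-tailed: {one_tailed}/{splits}, two-tailed: {two_tailed}/{splits}")
```

For this example 4 of the 70 splits reach a difference of 12 or more in favour of group 1, compared with 3 of 70 for the test on the raw values, which illustrates the loss of information from ranking.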

Statistics summary

• We have discussed several ways of assessing significant differences between the means of two or more samples.

• For normally distributed samples with equal variances, we have two methods:

– T test (for comparing two samples)
– Analysis of variance (for comparing two or more groups)

• The central limit theorem shows that the mean of a large sample follows an approximately normal distribution; however, for small & non-normally distributed samples non-parametric methods may be necessary.

Statistics summary

• These techniques are useful in analysing microarray data because we want to infer from noisy data which genes vary significantly in their expression over a variety of conditions - NB since the conditions correspond to the groups, we will generally need several repeats of the microarray experiments under the “same” conditions in order to apply these techniques.
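To connect this back to filtering, here is a rough sketch of how a two-sample T statistic could be used to keep only genes whose expression differs between two conditions. Everything specific in it is an assumption for illustration: the gene names, the expression values, the use of four repeats per condition, and the choice of the two-sided 5% critical value (2.447 for 6 degrees of freedom, from a t table).

```python
import math

def pooled_t(x, y):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    n, m = len(x), len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / m
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    pooled_variance = (ss_x + ss_y) / (n + m - 2)
    return (mean_x - mean_y) / math.sqrt(pooled_variance * (1 / n + 1 / m))

def filter_genes(expression, critical_value):
    """Keep the genes whose |T| exceeds the critical value."""
    return [gene for gene, (cancer, normal) in expression.items()
            if abs(pooled_t(cancer, normal)) > critical_value]

# Hypothetical expression levels: (cancer repeats, normal repeats) per gene.
expression = {
    "gene_A": ([11, 14, 7, 8], [2, 9, 0, 5]),
    "gene_B": ([20, 22, 21, 23], [2, 9, 0, 5]),
    "gene_C": ([5, 6, 4, 5], [5, 4, 6, 6]),
}

kept = filter_genes(expression, critical_value=2.447)
print("genes kept after filtering:", kept)
```

Only the gene with a large difference relative to its within-group spread survives the filter; the same loop would apply unchanged to thousands of genes.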

[References for more info on statistics, esp. statistical tests: Introduction to mathematical statistics by Hogg & Craig (Maxwell Macmillan); http://davidmlane.com/hyperstat/index.html ]
