
R course
Tuesday, March 12, 2013

SOME STATISTICAL TESTS

Overview

● Theory of statistical tests
● Test for a difference in mean
● Test for dependence
  – Nominal variables
  – Continuous variables
  – Ordinal variables
● Power of a test
● Degrees of freedom


Theory of statistical tests

● Read the corresponding section in the lecture notes


Test for a difference in mean: t-test

● Outline of the test
  – What is given? Independent observations (x1, ..., xn) and (y1, ..., ym)
  – Null hypothesis: x and y are samples from distributions having the same mean
  – Test: t-test
  – R command: t.test( x, y )
  – Idea of the test: if the sample means are too far apart, reject the null hypothesis
  – Approximate test, but rather robust

Test for a difference in mean: t-test

● Ex 1: Martians
  – Dataset containing the height of Martians of different colors
  – Reject the null hypothesis
  – It was an unpaired t-test (no dependence between the two samples)

> mars <- read.table("mars.txt", header=TRUE)
> head(mars)
      size color
1 65.67974   red
2 65.90436   red
3 67.34730   red
4 60.42924   red
5 55.34526   red
6 62.85024   red
> attach(mars)
> t.test(size[color=="green"], size[color=="blue"])

	Two Sample t-test

data:  size[color == "green"] and size[color == "blue"]
t = -3.4244, df = 19.419, p-value = 0.002775
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -16.875514  -4.083647
sample estimates:
mean of x mean of y
 60.86840  71.34798


Test for a difference in mean: t-test

● Ex 2: shoe wear
  – Dataset containing the wear of shoes made of two materials, A and B
  – Paired test, because some boys will cause more damage to their shoes than others
  – Reject the null hypothesis

> data(shoes, package="MASS")
> attach(shoes)
> head(shoes)
$A
 [1] 13.2  8.2 10.9 14.3 10.7  6.6  9.5 10.8  8.8 13.3
$B
 [1] 14.0  8.8 11.2 14.2 11.8  6.4  9.8 11.3  9.3 13.6
> t.test(A, B, paired=TRUE)

	Paired t-test

data:  A and B
t = -3.3489, df = 9, p-value = 0.008539
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.6869539 -0.1330461
sample estimates:
mean of the differences
                  -0.41
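A paired t-test is equivalent to a one-sample t-test on the differences; a minimal sketch of that check, assuming shoes is still attached from the example above:

> t.test(A - B)   # same t, df and p-value as t.test(A, B, paired=TRUE)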


Test for a difference in mean: t-test

● Related tests that might be of interest
  – var.test() to test for equality of variances
    → depending on the result, you can set the option var.equal in t.test()
  – shapiro.test() to test for normality, for example before computing a Pearson correlation
    → the null hypothesis of the Shapiro-Wilk test is that the data are normally distributed
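A minimal sketch of how these checks can precede the t-test, reusing the mars data loaded above (commands only, output omitted):

> var.test(size[color=="green"], size[color=="blue"])  # H0: equal variances
> t.test(size[color=="green"], size[color=="blue"], var.equal=TRUE)
> shapiro.test(size[color=="green"])                   # H0: normally distributed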


Test for dependence

● The choice of test depends on the data type
  – Nominal variables (not ordered, like eye color or gender)
  – Ordinal variables (ordered but not continuous, like the result of a die roll)
  – Continuous variables (like body height)


Test for dependence: Nominal (count) variables

● Outline of the test
  – What is given? Pairwise observations (x1, y1), (x2, y2), ..., (xn, yn)
  – Null hypothesis: x and y are independent
  – Test: χ2-test for independence
  – R command: chisq.test( x, y ) or chisq.test( contingency.table )
  – Idea of the test: calculate the expected counts under the assumption of independence. If the observed counts deviate too much from the expected counts, reject the null hypothesis.
  – Approximate test; see the conditions in the lecture notes

Test for dependence: Nominal (count) variables

● Ex 1: χ2-test

> contingency <- matrix( c(47,3,8,42,60,15,8,33,3), nrow=3 )
> chisq.test(contingency)$expected
          [,1]     [,2]      [,3]
[1,] 25.689498 51.82192 19.488584
[2,] 25.424658 51.28767 19.287671
[3,]  6.885845 13.89041  5.223744
# expected counts are all above 5, so we may apply the test
> chisq.test(contingency)

	Pearson's Chi-squared test

data:  contingency
X-squared = 58.5349, df = 4, p-value = 5.892e-12

● Reject the null hypothesis that the two variables are independent
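If some expected counts had been below 5, the asymptotic approximation would be questionable. One alternative (a sketch using the simulate.p.value argument that chisq.test provides) is a Monte Carlo p-value:

> chisq.test(contingency, simulate.p.value=TRUE, B=10000)  # B random tables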

Test for dependence: Nominal (count) variables

● Fisher's exact test
  – For 2×2 contingency tables
  – Example:

> table <- matrix( c(14,10,21,3), nrow=2 )
> fisher.test(table)

	Fisher's Exact Test for Count Data

data:  table
p-value = 0.04899
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.03105031 0.99446037
sample estimates:
odds ratio
 0.2069884

● We reject the null hypothesis


Test for dependence: Continuous variables

● Outline of the test
  – What is given? Pairwise observations (x1, y1), (x2, y2), ..., (xn, yn); all values in some interval are possible
  – Null hypothesis: x and y are independent
  – Test: Pearson's correlation test for independence
  – Assumption: x and y are samples from a normal distribution
  – R command: cor.test( x, y )

Test for dependence: Continuous variables

● Ex: cars
  – Distance needed to stop from a given speed, for cars
  – Reject the null hypothesis

> data(cars)
> attach(cars)
> str(cars)
> ?cars
> plot(speed, dist)
> cor.test(speed, dist)

	Pearson's product-moment correlation

data:  speed and dist
t = 9.464, df = 48, p-value = 1.49e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6816422 0.8862036
sample estimates:
      cor
0.8068949
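Pearson's test assumes normality; a minimal sketch of checking this with shapiro.test() (introduced above) for the cars data:

> shapiro.test(speed)  # H0: speed is normally distributed
> shapiro.test(dist)   # H0: dist is normally distributed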



Test for dependence: Ordinal variables

● Outline of the test
  – What is given? Pairwise observations (x1, y1), (x2, y2), ..., (xn, yn); values can be ordered
  – Null hypothesis: x and y are uncorrelated
  – Test: Spearman's rank correlation rho
  – R command: cor.test( x, y, method="spearman" )

> data(cars)
> attach(cars)
> cor.test(speed, dist, method="spearman")

	Spearman's rank correlation rho

data:  speed and dist
S = 3532.819, p-value = 8.825e-14
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.8303568

Warning message:
In cor.test.default(speed, dist, method = "spearman") :
  Cannot compute exact p-values with ties
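The warning is harmless here: with ties, no exact p-value exists, and R falls back to the asymptotic one. Requesting the asymptotic p-value explicitly (via cor.test's exact argument) avoids the warning:

> cor.test(speed, dist, method="spearman", exact=FALSE)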


The power of a test

● Alternative hypothesis H1
  – Ex: H0: µ=0 and H1: µ≠0
● Two types of error
  – Type I error (or "first kind" or "α error" or "false positive"): rejecting H0 when it is true
  – Type II error (or "second kind" or "β error" or "false negative"): failing to reject H0 when it is not true
● Power is 1-β
  – If power=0, you will never reject H0
  – Ex: if the true value is close to 0, the test has almost no chance to reject H0; rather test against an alternative such as |µ|>=0.5
● In general, the power increases with sample size
  – Use power.t.test() or power.fisher.test() to calculate the minimum sample size needed (see the sketch below)
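A minimal sketch of such a sample-size calculation with power.t.test(); the effect size, standard deviation and power below are illustrative values, not taken from the slides:

> # smallest n per group to detect a mean difference of 0.5 (sd = 1)
> # at significance level 0.05 with power 0.8
> power.t.test(delta=0.5, sd=1, sig.level=0.05, power=0.8)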

Overview

● Theory of statistical tests
● Test for a difference in mean
● Test for dependence
  – Nominal variables
  – Continuous variables
  – Ordinal variables
● Power of a test
● Degrees of freedom → see your lecture notes