
Statistical foundations of machine learning

INFO-F-422

Gianluca Bontempi

Département d’Informatique

Boulevard de Triomphe - CP 212

http://www.ulb.ac.be/di

Machine learning – p. 1/45

Testing hypothesis

• Hypothesis testing is the second major area of statistical inference.

• A statistical hypothesis is an assertion or conjecture about the distribution of one or more random variables.

• A test of a statistical hypothesis is a rule or procedure for deciding whether to reject the assertion on the basis of the observed data.

• The basic idea is to formulate some statistical hypothesis and check whether the data provide any evidence to reject it.


A hypothesis testing problem

• Consider the model of the traffic in the boulevard.

• Suppose that the measured inter-arrival times are $D_N = \{10, 11, 1, 21, 2, \dots\}$ seconds.

• Can we say that the mean inter-arrival time $\theta$ is different from 10?

• Consider the grades of two different school sections.

• Section A had {15, 10, 12, 19, 5, 7}.

• Section B had {14, 11, 11, 12, 6, 7}.

• Can we say that Section A had better grades than Section B?

• Consider two protein-coding genes and their expression levels in a cell. Are the two genes differentially expressed?

A statistical test is a procedure that aims to answer such questions.


Types of hypothesis

We start by declaring the working (basic, null) hypothesis $H$ to be tested, in the form $\theta = \theta_0$ or $\theta \in \omega \subset \Theta$, where $\theta_0$ or $\omega$ are given. The hypothesis can be

Simple. It fully specifies the distribution of $z$.

Composite. It partially specifies the distribution of $z$.

Example: if $D_N$ constitutes a random sample of size $N$ from $\mathcal{N}(\mu, \sigma^2)$, the hypothesis $H : \mu = \mu_0, \sigma = \sigma_0$ (with $\mu_0$ and $\sigma_0$ known values) is simple, while the hypothesis $H : \mu = \mu_0$ is composite since it leaves open the value of $\sigma$ in $(0, \infty)$.


Types of statistical test

Suppose we have collected $N$ samples $D_N = \{z_1, \dots, z_N\}$ from a distribution $F_z$ and we have declared a null hypothesis $H$ about $F$. The three most common types of statistical test are:

Pure significance test: the data $D_N$ are used to assess the inferential evidence against $H$.

Significance test: the inferential evidence against $H$ is used to judge whether $H$ is inappropriate. In other words, it is a rule for rejecting $H$.

Hypothesis test: the data $D_N$ are used to assess the hypothesis $H$ against a specific alternative hypothesis $\bar{H}$. In other words, it is a rule for rejecting $H$ in favour of $\bar{H}$.


Pure significance test

• Suppose that the null hypothesis $H$ is simple.

• Let $t(D_N)$ be a statistic such that the larger its value, the more it casts doubt on $H$.

• The quantity $t(D_N)$ is called test statistic or discrepancy measure.

• Let $t_N = t(D_N)$ be the value of $t$ calculated on the basis of the sample data $D_N$.

• Let us consider the p-value

$$p = \mathrm{Prob}\{t(D_N) > t_N \mid H\}$$

• If $p$ is small, the sample data $D_N$ are highly inconsistent with $H$, and $p$ (significance probability or significance level) is the measure of such inconsistency.


Some considerations

• $p$ is the proportion of situations under the hypothesis $H$ in which we would observe a degree of inconsistency at least as large as that represented by $t_N$.

• $t_N$ is the observed value of the statistic for a given $D_N$. Different $D_N$ yield different values of $p \in (0, 1)$.

• It is essential that the distribution of $t(D_N)$ under $H$ is known.

• We cannot say that $p$ is the probability that $H$ is true. Rather, $p$ is the probability, given that $H$ is true, of observing a dataset at least as inconsistent with $H$ as $D_N$.

• Open issues

1. What if $H$ is composite?

2. How to choose $t(D_N)$?


Tests of significance

• Suppose that the value $p$ is known. If $p$ is small, either a rare event has occurred or perhaps $H$ is not true.

• Idea: if $p$ is less than some stated value $\alpha$, we reject $H$.

• We choose a critical level $\alpha$, we observe $D_N$ and we reject $H$ at level $\alpha$ if

$$\mathrm{Prob}\{t(D_N) > t_N \mid H\} \le \alpha$$

• This is equivalent to choosing some critical value $t_\alpha$ and rejecting $H$ if $t_N > t_\alpha$.

• We obtain two regions in the space of sample data:

critical region $S_0$, where if $D_N \in S_0$ we reject $H$.

non-critical region $S_1$, where the sample data $D_N$ give us no reason to reject $H$ on the basis of the level-$\alpha$ test.


Some considerations

• The principle is that we will accept $H$ unless we witness some event that has sufficiently small probability of arising when $H$ is true.

• If $H$ were true we could still obtain data in $S_0$ and consequently wrongly reject $H$ with probability

$$\mathrm{Prob}\{D_N \in S_0 \mid H\} = \mathrm{Prob}\{t(D_N) > t_\alpha \mid H\} = \alpha$$

• The significance level $\alpha$ provides an upper bound on the probability of incorrectly rejecting $H$.

• The p-value is the probability, under $H$, that the test statistic is more extreme than its observed value. The p-value changes with the observed data (i.e. it is a random variable), while $\alpha$ is a level fixed by the user.


Standard normal distribution

[Figure: two panels for $\mu = 0$, $\sigma = 1$, plotted over the range $-5$ to $5$: the normal distribution function and the normal density function.]

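Remember that $z_{0.05} \approx 1.645$. This means that, if $z \sim \mathcal{N}(0, 1)$, then $\mathrm{Prob}\{z \ge z_{0.05}\} = 0.05$ and also that

$$\mathrm{Prob}\{|z| \ge z_{0.05}\} = 2 \cdot 0.05 = 0.1$$

For a generic $z \sim \mathcal{N}(\mu, \sigma^2)$

$$\mathrm{Prob}\{|z - \mu| \ge \sigma z_{0.05}\} = 2 \cdot 0.05 = 0.1$$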
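
The tail probabilities above can be checked numerically. The following is a minimal sketch using base R's qnorm and pnorm; the values of mu and sigma in the generic case are arbitrary illustrative choices, not taken from the slides.

```r
# Upper 5% critical value of the standard normal distribution
z05 <- qnorm(0.05, lower.tail = FALSE)          # about 1.645

# One-tailed and two-tailed probabilities for z ~ N(0, 1)
p_one <- pnorm(z05, lower.tail = FALSE)         # Prob{z >= z05}
p_two <- 2 * pnorm(z05, lower.tail = FALSE)     # Prob{|z| >= z05}

# Same two-tailed probability for a generic N(mu, sigma^2);
# mu and sigma are arbitrary illustrative values (assumption)
mu <- 3
sigma <- 2
p_gen <- pnorm(mu - sigma * z05, mean = mu, sd = sigma) +
  pnorm(mu + sigma * z05, mean = mu, sd = sigma, lower.tail = FALSE)

print(c(z05 = z05, p_one = p_one, p_two = p_two, p_gen = p_gen))
```

The two-tailed probability does not depend on the chosen mu and sigma, which is why the standardized critical value can be reused for any normal variable.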


TP: example

• Let $D_N$ consist of $N$ independent observations of $x \sim \mathcal{N}(\mu, \sigma^2)$, with known variance $\sigma^2$.

• We want to test the hypothesis $H : \mu = \mu_0$ with $\mu_0$ known.

• Consider as test statistic $t(D_N)$ the quantity $|\hat{\mu} - \mu_0|$, where $\hat{\mu}$ is the sample average estimator. If $H$ is true we know that $\hat{\mu} \sim \mathcal{N}(\mu_0, \sigma^2/N)$.

• Let us calculate the value $t(D_N) = |\hat{\mu} - \mu_0|$ and assume that the rejection region is $S_0 = \{D_N : |\hat{\mu} - \mu_0| > t_\alpha\}$.

• Let us set a significance level $\alpha = 10\% = 0.1$. This means that $t_\alpha$ should satisfy

$$\mathrm{Prob}\{t(D_N) > t_\alpha \mid H\} = \mathrm{Prob}\{|\hat{\mu} - \mu_0| > t_\alpha \mid H\} = \mathrm{Prob}\{(\hat{\mu} - \mu_0 > t_\alpha) \text{ OR } (\hat{\mu} - \mu_0 < -t_\alpha) \mid H\} = 0.1$$


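TP: example (II)

• For a normal variable $x \sim \mathcal{N}(\mu, \sigma^2)$

$$\mathrm{Prob}\{x - \mu > 1.645\sigma\} = 1 - F_x(\mu + 1.645\sigma) = 0.05$$

and consequently

$$\mathrm{Prob}\{x - \mu > 1.645\sigma \text{ OR } x - \mu < -1.645\sigma\} = 0.05 + 0.05 = 0.1$$

• Since $\hat{\mu} \sim \mathcal{N}(\mu_0, \sigma^2/N)$ under $H$ (i.e. $\frac{\hat{\mu} - \mu_0}{\sigma/\sqrt{N}} \sim \mathcal{N}(0, 1)$), it follows that once we set

$$t_\alpha = 1.645\,\sigma/\sqrt{N}$$

we have

$$\mathrm{Prob}\{|\hat{\mu} - \mu_0| > t_\alpha \mid H\} = 0.1$$

and that the critical region is

$$S_0 = \left\{D_N : |\hat{\mu} - \mu_0| > 1.645\,\sigma/\sqrt{N}\right\}$$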


TP: example (III)

• Suppose that $\sigma = 0.1$ and that we want to test whether $\mu = \mu_0 = 10$ with a significance level of 10%.

• After $N = 6$ observations we have $D_N = \{10, 11, 12, 13, 14, 15\}$.

• On the basis of the dataset we compute

$$\hat{\mu} = \frac{10 + 11 + 12 + 13 + 14 + 15}{6} = 12.5$$

and

$$t(D_N) = |\hat{\mu} - \mu_0| = 2.5$$

• Since $t_\alpha = 1.645 \cdot 0.1/\sqrt{6} = 0.0672$ and $t(D_N) > t_\alpha$, the observations $D_N$ are in the critical region.

• The hypothesis is rejected.
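
The computation of this example can be reproduced with a few lines of R. This is only a sketch of the two-sided test with known variance described above; the variable names are illustrative, and the p-value line is an addition that follows the earlier definition of p.

```r
# Data and settings from the example
D <- c(10, 11, 12, 13, 14, 15)
sigma <- 0.1      # known standard deviation
mu0 <- 10         # value of the mean under H
alpha <- 0.1      # significance level

N <- length(D)
mu_hat <- mean(D)                      # sample average
t_obs <- abs(mu_hat - mu0)             # observed test statistic

# Critical value: upper alpha/2 point of N(0, 1), rescaled by sigma / sqrt(N)
t_alpha <- qnorm(alpha / 2, lower.tail = FALSE) * sigma / sqrt(N)

reject <- t_obs > t_alpha

# Two-sided p-value of the observed statistic under H
p_value <- 2 * pnorm(t_obs / (sigma / sqrt(N)), lower.tail = FALSE)

print(c(mu_hat = mu_hat, t_obs = t_obs, t_alpha = t_alpha, p_value = p_value))
print(reject)
```

Comparing the p-value with alpha gives the same decision as comparing t_obs with t_alpha.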


Hypothesis testing: types of error

So far we considered a single hypothesis. Let us now consider two alternative hypotheses: $H$ and $\bar{H}$.

Type I error. It is the error we make when we reject $H$ although it is true. The significance level represents the probability of making the type I error.

Type II error. It is the error we make when we accept $H$ although it is false. In order to define this error, we are forced to declare an alternative hypothesis $\bar{H}$ as a formal definition of what is meant by $H$ being “false”. The probability of type II error is the probability that the test leads to acceptance of $H$ when in fact $\bar{H}$ prevails. When the alternative hypothesis is composite, there is no unique Type II error.


An analogy

• Consider the analogy with a murder trial, where the suspect is Mr. Bean.

• The null hypothesis $H$ is “Mr. Bean is innocent”.

• The dataset is the amount of evidence collected by the police against Mr. Bean.

• The Type I error is the error that we make if, Mr. Bean being innocent, we sentence him to death.

• The Type II error is the error that we make if, Mr. Bean being guilty, we acquit him.


Hypothesis testing

• Suppose we have some data $\{z_1, \dots, z_N\} \sim F$ from a distribution $F$.

• $H$ and $\bar{H}$ represent two hypotheses about $F$.

• On the basis of the data, one is accepted and one is rejected.

• Note that the two hypotheses have different philosophical status (asymmetry).

• $H$ is a conservative hypothesis, not to be rejected unless the evidence is clear. This means that a type I error is more serious than a type II error (benefit of the doubt).

• It is often assumed that $F$ belongs to a parametric family $F(z, \theta)$. The test on $F$ becomes a test on $\theta$.

• A particular example of hypothesis test is the goodness-of-fit test, where we test $H : F = F_0$ against $\bar{H} : F \ne F_0$.


The five steps of hypothesis testing

1. Declare the null hypothesis (e.g. $H$: honest student) and the alternative hypothesis ($\bar{H}$: cheating student).

2. Choose the numeric value of the type I error (e.g. the risk I am willing to run).

3. Choose a procedure to obtain the test statistic (e.g. number of similar lines).

4. Determine the critical value of the test statistic (e.g. 4 identical lines) that leads to a rejection of $H$. This is done in order to ensure the Type I error defined in Step 2.

5. Obtain the data and determine whether the observed value of the test statistic leads to acceptance or rejection of $H$.


Quality of the test

Suppose that

• $N$ students took part in the exam,

• $N_N$ did not copy,

• $N_P$ copied,

• $\hat{N}_N$ were considered not guilty and passed the exam,

• $\hat{N}_P$ were considered guilty and were refused,

• $F_P$ honest students were refused,

• $F_N$ cheating students passed.


Confusion matrix

Then we have

                                 Not refused    Refused        Total
$H$: Not guilty student (-)      $T_N$          $F_P$          $N_N$
$\bar{H}$: Guilty student (+)    $F_N$          $T_P$          $N_P$
Total                            $\hat{N}_N$    $\hat{N}_P$    $N$

• $F_P$ is the number of False Positives and the ratio $F_P/N_N$ represents the type I error.

• $F_N$ is the number of False Negatives and the ratio $F_N/N_P$ represents the type II error.


Specificity and sensitivity

Specificity: the ratio (to be maximized)

$$SP = \frac{T_N}{F_P + T_N} = \frac{T_N}{N_N} = \frac{N_N - F_P}{N_N} = 1 - \frac{F_P}{N_N}, \qquad 0 \le SP \le 1$$

It increases by reducing the number of false positives.

Sensitivity: the ratio (to be maximized)

$$SE = \frac{T_P}{T_P + F_N} = \frac{T_P}{N_P} = \frac{N_P - F_N}{N_P} = 1 - \frac{F_N}{N_P}, \qquad 0 \le SE \le 1$$

It increases by reducing the number of false negatives and corresponds to the power of the test (i.e. it estimates the quantity 1 − Type II error).


Specificity and sensitivity (II)

There exists a trade-off between these two quantities.

• In the case of a test that always returns $H$ (e.g. a very kind professor) we have $\hat{N}_P = 0$, $\hat{N}_N = N$, $F_P = 0$, $T_N = N_N$ and $SP = 1$ but $SE = 0$.

• In the case of a test that always returns $\bar{H}$ (e.g. a very suspicious professor) we have $\hat{N}_P = N$, $\hat{N}_N = 0$, $F_N = 0$, $T_P = N_P$ and $SE = 1$ but $SP = 0$.


False Positive and False Negative Rate

False Positive Rate:

$$FPR = 1 - SP = 1 - \frac{T_N}{F_P + T_N} = \frac{F_P}{F_P + T_N} = \frac{F_P}{N_N}, \qquad 0 \le FPR \le 1$$

It decreases by reducing the number of false positives and estimates the Type I error.

False Negative Rate:

$$FNR = 1 - SE = 1 - \frac{T_P}{T_P + F_N} = \frac{F_N}{T_P + F_N} = \frac{F_N}{N_P}, \qquad 0 \le FNR \le 1$$

It decreases by reducing the number of false negatives.
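
As an illustration, the four rates can be computed from the cells of a confusion matrix. The counts below are hypothetical (they do not come from the slides) and the variable names are illustrative.

```r
# Hypothetical confusion-matrix counts (assumption, not from the lecture)
TN <- 80   # honest students who passed
FP <- 10   # honest students who were refused
FN <- 4    # cheating students who passed
TP <- 6    # cheating students who were refused

N_N <- TN + FP   # number of honest students
N_P <- FN + TP   # number of cheating students

SP  <- TN / N_N  # specificity
SE  <- TP / N_P  # sensitivity (power)
FPR <- FP / N_N  # false positive rate, estimates the type I error
FNR <- FN / N_P  # false negative rate, estimates the type II error

print(round(c(SP = SP, SE = SE, FPR = FPR, FNR = FNR), 4))
```

By construction SP + FPR = 1 and SE + FNR = 1, so each pair carries the same information.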


Predictive value

Positive Predictive Value: the ratio (to be maximized)

$$PPV = \frac{T_P}{T_P + F_P} = \frac{T_P}{\hat{N}_P}, \qquad 0 \le PPV \le 1$$

Negative Predictive Value: the ratio (to be maximized)

$$PNV = \frac{T_N}{T_N + F_N} = \frac{T_N}{\hat{N}_N}, \qquad 0 \le PNV \le 1$$

False Discovery Rate: the ratio (to be minimized)

$$FDR = \frac{F_P}{T_P + F_P} = \frac{F_P}{\hat{N}_P} = 1 - PPV, \qquad 0 \le FDR \le 1$$
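
A short sketch of the predictive values, again with hypothetical counts that are not taken from the slides. Unlike specificity and sensitivity, these ratios are normalized by the number of predicted positives or negatives.

```r
# Hypothetical confusion-matrix counts (assumption, not from the lecture)
TN <- 80
FP <- 10
FN <- 4
TP <- 6

Nhat_P <- TP + FP   # students considered guilty (refused)
Nhat_N <- TN + FN   # students considered not guilty (passed)

PPV <- TP / Nhat_P  # positive predictive value
PNV <- TN / Nhat_N  # negative predictive value
FDR <- FP / Nhat_P  # false discovery rate

print(round(c(PPV = PPV, PNV = PNV, FDR = FDR), 4))
```

With these counts most of the refused students are actually honest (FDR above 0.5), even though the false positive rate is low, because cheating students are rare in this hypothetical population.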


Receiver Operating Characteristic curve

The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate (i.e. sensitivity or power) against the false positive rate (Type I error) for the different possible decision thresholds of a test.

Consider an example where $t^+ \sim \mathcal{N}(1, 1)$ and $t^- \sim \mathcal{N}(-1, 1)$. Suppose that the examples are classed as positive if $t > THR$ and negative if $t < THR$, where $THR$ is a threshold.

• If $THR = -\infty$, all the examples are classed as positive: $T_N = F_N = 0$, which implies $SE = \frac{T_P}{N_P} = 1$ and $FPR = \frac{F_P}{F_P + T_N} = 1$.

• If $THR = \infty$, all the examples are classed as negative: $T_P = F_P = 0$, which implies $SE = 0$ and $FPR = 0$.
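
For this Gaussian example the ROC curve can be traced analytically, since both rates are normal tail probabilities of the threshold. The sketch below is not the roc.R script of the course; the threshold grid and the trapezoidal area estimate are illustrative choices.

```r
# Grid of decision thresholds (illustrative choice)
thr <- seq(-5, 5, by = 0.1)

# Sensitivity: probability that a positive example t+ ~ N(1, 1) exceeds THR
SE <- pnorm(thr, mean = 1, sd = 1, lower.tail = FALSE)

# False positive rate: probability that a negative example t- ~ N(-1, 1) exceeds THR
FPR <- pnorm(thr, mean = -1, sd = 1, lower.tail = FALSE)

# Area under the ROC curve by the trapezoidal rule, after sorting by FPR
o <- order(FPR)
auc <- sum(diff(FPR[o]) * (head(SE[o], -1) + tail(SE[o], -1)) / 2)
print(auc)

# Draw the curve only in an interactive session
if (interactive()) {
  plot(FPR, SE, type = "l", xlab = "FPR", ylab = "SE")
  abline(0, 1, lty = 2)   # diagonal of a random classifier
}
```

Low thresholds give points near (1, 1) and high thresholds give points near (0, 0), in agreement with the two limit cases above.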


ROC curve

[Figure: ROC curve, with FPR on the horizontal axis and SE on the vertical axis, both ranging from 0 to 1.]

R script roc.R


Choice of test

The choice of test, and consequently the choice of the partition $\{S_0, S_1\}$, is based on two steps:

1. Define a significance level $\alpha$, that is, the probability of type I error

$$\mathrm{Prob}\{\text{reject } H \mid H\} = \mathrm{Prob}\{D_N \in S_0 \mid H\} \le \alpha$$

that is, the probability of incorrectly rejecting $H$.

2. Among the set of tests $\{S_0, S_1\}$ of level $\alpha$, choose the test that minimizes the probability of type II error

$$\mathrm{Prob}\{\text{accept } H \mid \bar{H}\} = \mathrm{Prob}\{D_N \in S_1 \mid \bar{H}\}$$

that is, the probability of incorrectly accepting $H$. This is equivalent to maximizing the power of the test

$$\mathrm{Prob}\{\text{reject } H \mid \bar{H}\} = \mathrm{Prob}\{D_N \in S_0 \mid \bar{H}\} = 1 - \mathrm{Prob}\{D_N \in S_1 \mid \bar{H}\}$$

which is the probability of correctly rejecting $H$. The higher the power, the better!


TP example

• Consider a r.v. $z \sim \mathcal{N}(\mu, \sigma^2)$, where $\sigma$ is known, and a set of $N$ i.i.d. observations.

• We want to test the null hypothesis $\mu = \mu_0 = 0$, with $\alpha = 0.1$.

• Consider the 3 critical regions $S_0$

1. $|\hat{\mu} - \mu_0| > 1.645\,\sigma/\sqrt{N}$

2. $\hat{\mu} - \mu_0 > 1.282\,\sigma/\sqrt{N}$

3. $|\hat{\mu} - \mu_0| < 0.126\,\sigma/\sqrt{N}$

• For all these tests $\mathrm{Prob}\{D_N \in S_0 \mid H\} \le \alpha$, hence the significance level is the same.

• However, if $\bar{H} : \mu = 10$, the type II error of the three tests is significantly different.

• Which one is the best?
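
One way to compare the three critical regions is to compute their level and their power numerically. The sketch below assumes sigma = 10 and N = 4, values that are not given in the slides and were chosen only so that the differences are visible; the helper reject_prob is likewise illustrative.

```r
alpha <- 0.1
mu0 <- 0
mu1 <- 10              # mean under the alternative hypothesis
sigma <- 10            # assumed value, not from the lecture
N <- 4                 # assumed value, not from the lecture

se <- sigma / sqrt(N)  # standard deviation of the sample average
d <- (mu1 - mu0) / se  # mean of the standardized statistic under the alternative

# Probability that the standardized statistic (mu_hat - mu0) / se falls in each
# critical region when its mean is `shift` (0 under H, d under the alternative)
reject_prob <- function(shift) {
  c(region1 = pnorm(-1.645 - shift) + pnorm(1.645 - shift, lower.tail = FALSE),
    region2 = pnorm(1.282 - shift, lower.tail = FALSE),
    region3 = pnorm(0.126 - shift) - pnorm(-0.126 - shift))
}

level <- reject_prob(0)   # type I error of each region
power <- reject_prob(d)   # probability of correctly rejecting H
type2 <- 1 - power        # type II error of each region

print(round(rbind(level, power, type2), 4))
```

Under these assumptions the three regions have approximately the same level, while the one-sided region 2 has the largest power against mu = 10 and region 3 has almost none.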


TP example (II)

[Figure: distributions of the test statistic $\hat{\mu}$ under $H$ (centred at 0) and under $\bar{H}$ (centred at 10), with the regions $S_1$ and $S_0$ marked on the $\hat{\mu}$ axis.]

On the left: distribution of the test statistic $\hat{\mu}$ if $H : \mu_0 = 0$ is true. On the right: distribution of the test statistic $\hat{\mu}$ if $\bar{H} : \mu_1 = 10$ is true. The interval marked by $S_1$ denotes the set of observed $\hat{\mu}$ values for which $H$ is accepted (non-critical region). The interval marked by $S_0$ denotes the set of observed $\hat{\mu}$ values for which $H$ is rejected (critical region). The area of the black pattern region on the right equals $\mathrm{Prob}\{D_N \in S_0 \mid H\}$, i.e. the probability of rejecting $H$ when $H$ is true (Type I error). The area of the grey shaded region on the left equals the probability of accepting $H$ when $H$ is false (Type II error).


TP example (III)

[Figure: distributions of the test statistic $\hat{\mu}$ under $H$ (centred at 0) and under $\bar{H}$ (centred at 10), with two intervals $S_1$ and one interval $S_0$ marked on the $\hat{\mu}$ axis.]

On the left: distribution of the test statistic $\hat{\mu}$ if $H : \mu_0 = 0$ is true. On the right: distribution of the test statistic $\hat{\mu}$ if $\bar{H} : \mu_1 = 10$ is true. The two intervals marked by $S_1$ denote the set of observed $\hat{\mu}$ values for which $H$ is accepted (non-critical region). The interval marked by $S_0$ denotes the set of observed $\hat{\mu}$ values for which $H$ is rejected (critical region). The area of the pattern region equals $\mathrm{Prob}\{D_N \in S_0 \mid H\}$, i.e. the probability of rejecting $H$ when $H$ is true (Type I error).

Which area corresponds to the probability of the Type II error?


Type of parametric tests

Consider random variables with a parametric distribution $F(\cdot, \theta)$.

One-sample vs. two-sample: in the one-sample test we consider a single r.v. and we formulate hypotheses about its distribution. In the two-sample test we consider two r.v.s $z_1$ and $z_2$ and we formulate hypotheses about their differences/similarities.

Simple vs. composite: the test is simple if $H$ completely describes the distributions of the involved r.v.s; otherwise it is composite.

Single-sided (or one-tailed) vs. two-sided (or two-tailed): in the single-sided test the region of rejection concerns only one tail of the null distribution. This means that $\bar{H}$ indicates the predicted direction of the difference (e.g. $\bar{H} : \theta > \theta_0$). In the two-sided test, the region of rejection concerns both tails of the null distribution. This means that $\bar{H}$ does not indicate the predicted direction of the difference (e.g. $\bar{H} : \theta \ne \theta_0$).


Example of parametric test

• Consider a parametric test on the distribution of a Gaussian r.v., and suppose that the null hypothesis is $H : \theta = \theta_0$, where $\theta_0$ is given and represents the mean.

• The test is one-sample and composite.

• In order to know whether it is one-sided or two-sided we have to define the alternative configuration: if $\bar{H} : \theta < \theta_0$ the test is one-sided down, if $\bar{H} : \theta > \theta_0$ the test is one-sided up, if $\bar{H} : \theta \ne \theta_0$ the test is two-sided.


z-test (one-sample and one-sided)

Consider a random sample $D_N \leftarrow x \sim \mathcal{N}(\mu, \sigma^2)$ with $\mu$ unknown and $\sigma^2$ known.

STEP 1: consider the null hypothesis and the alternative (composite and one-sided)

$$H : \mu = \mu_0; \qquad \bar{H} : \mu > \mu_0$$

STEP 2: fix the value $\alpha$ of the type I error.

STEP 3: choose a test statistic. If $H$ is true then the distribution of $\hat{\mu}$ is $\mathcal{N}(\mu_0, \sigma^2/N)$. This means that the variable $z$ satisfies

$$z = \frac{(\hat{\mu} - \mu_0)\sqrt{N}}{\sigma} \sim \mathcal{N}(0, 1)$$

It is convenient to rephrase the test in terms of the test statistic $z$.


z-test (one-sample and one-sided) (II)

STEP 4: determine the critical value for $z$.

We reject the hypothesis $H$ if $z_N > z_\alpha$, where $z_\alpha$ is such that $\mathrm{Prob}\{\mathcal{N}(0, 1) > z_\alpha\} = \alpha$.

Ex: for $\alpha = 0.05$ we would take $z_\alpha = 1.645$, since 5% of the standard normal distribution lies to the right of 1.645.

R command: z_alpha = qnorm(alpha, lower.tail=FALSE)

STEP 5: once the dataset $D_N$ is measured, the value of the test statistic is

$$z_N = \frac{(\hat{\mu} - \mu_0)\sqrt{N}}{\sigma}$$


TP: example z-test

• Consider a r.v. $z \sim \mathcal{N}(\mu, 1)$.

• We want to test $H : \mu = 5$ against $\bar{H} : \mu > 5$ with significance level 0.05.

• Suppose that the data are $D_N = \{5.1, 5.5, 4.9, 5.3\}$.

• Then $\hat{\mu} = 5.2$ and $z_N = (5.2 - 5) \cdot 2/1 = 0.4$.

• Since this is less than $z_\alpha = 1.645$, we do not reject the null hypothesis.
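
The five steps of the one-sided z-test can be followed in R on the data of this example. This is a minimal sketch with illustrative variable names; the p-value is an addition that is not computed in the slides.

```r
# Data and settings from the example
D <- c(5.1, 5.5, 4.9, 5.3)
mu0 <- 5        # mean under H
sigma <- 1      # known standard deviation
alpha <- 0.05   # type I error

N <- length(D)
mu_hat <- mean(D)

# Test statistic, distributed as N(0, 1) under H
z_N <- (mu_hat - mu0) * sqrt(N) / sigma

# Critical value of the one-sided test (upper alpha point)
z_alpha <- qnorm(alpha, lower.tail = FALSE)

reject <- z_N > z_alpha

# One-sided p-value: probability of a statistic larger than z_N under H
p_value <- pnorm(z_N, lower.tail = FALSE)

print(c(mu_hat = mu_hat, z_N = z_N, z_alpha = z_alpha, p_value = p_value))
print(reject)
```

The p-value is well above alpha, which matches the decision not to reject the null hypothesis.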


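Two-sided parametric tests

Assumption: all the variables are normal!

Name             One/two sample   Known                       $H$                         $\bar{H}$
z-test           one              $\sigma^2$                  $\mu = \mu_0$               $\mu \ne \mu_0$
z-test           two              $\sigma_1^2 = \sigma_2^2$   $\mu_1 = \mu_2$             $\mu_1 \ne \mu_2$
t-test           one              -                           $\mu = \mu_0$               $\mu \ne \mu_0$
t-test           two              -                           $\mu_1 = \mu_2$             $\mu_1 \ne \mu_2$
$\chi^2$-test    one              $\mu$                       $\sigma^2 = \sigma_0^2$     $\sigma^2 \ne \sigma_0^2$
$\chi^2$-test    one              -                           $\sigma^2 = \sigma_0^2$     $\sigma^2 \ne \sigma_0^2$
F-test           two              -                           $\sigma_1^2 = \sigma_2^2$   $\sigma_1^2 \ne \sigma_2^2$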


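Student’s t-distribution

If $x \sim \mathcal{N}(0, 1)$ and $y \sim \chi^2_N$ are independent, then the Student’s t-distribution with $N$ degrees of freedom is the distribution of the r.v.

$$z = \frac{x}{\sqrt{y/N}}$$

We denote this with $z \sim T_N$.

If $z_1, \dots, z_N$ are i.i.d. $\mathcal{N}(\mu, \sigma^2)$ then

$$\frac{\sqrt{N}(\hat{\mu} - \mu)}{\sqrt{\widehat{SS}/(N - 1)}} = \frac{\sqrt{N}(\hat{\mu} - \mu)}{\hat{\sigma}} \sim T_{N-1}$$

where $\widehat{SS} = \sum_{i=1}^{N} (z_i - \hat{\mu})^2$.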


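t-test: one-sample and two-sided

Consider a random sample from $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ unknown. Let

$$H : \mu = \mu_0; \qquad \bar{H} : \mu \ne \mu_0$$

Let

$$t(D_N) = T = \frac{\sqrt{N}(\hat{\mu} - \mu_0)}{\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(z_i - \hat{\mu})^2}} = \frac{\hat{\mu} - \mu_0}{\sqrt{\hat{\sigma}^2/N}}$$

be a statistic computed using the data set $D_N$.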


t-test: one-sample and two-sided (II)

• It can be shown that if the hypothesis $H$ holds, $T \sim T_{N-1}$ is a r.v. with a Student distribution with $N - 1$ degrees of freedom.

• The size-$\alpha$ t-test consists in rejecting $H$ if

$$|T| > k = t_{\alpha/2, N-1}$$

where $t_{\alpha/2, N-1}$ is the upper $\alpha/2$ point of a $T$-distribution with $N - 1$ degrees of freedom, i.e.

$$\mathrm{Prob}\{t_{N-1} > t_{\alpha/2, N-1}\} = \alpha/2$$

where $t_{N-1} \sim T_{N-1}$.

• In other terms, $H$ is rejected when $|T|$ is large.

• R command: t_crit = qt(alpha/2, N-1, lower.tail=FALSE)


TP example

Does jogging lead to a reduction in pulse rate? Eight non-jogging volunteers engaged in a one-month jogging programme. Their pulses were taken before and after the programme.

pulse rate before    74    86    98   102    78    84    79    70
pulse rate after     70    85    90   110    71    80    69    74
decrease              4     1     8    -8     7     4    10    -4

Suppose that the decreases are samples from $\mathcal{N}(\mu, \sigma^2)$ for some unknown $\sigma^2$.

We want to test $H : \mu = \mu_0 = 0$ against $\bar{H} : \mu \ne 0$ with a significance level $\alpha = 0.05$.

We have $N = 8$, $\hat{\mu} = 2.75$, $T = 1.263$, $t_{\alpha/2, N-1} = 2.365$.

Since $|T| \le t_{\alpha/2, N-1}$, the data are not sufficient to reject the hypothesis $H$. In other terms, we do not have enough evidence to show that there is a reduction in pulse rate.
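
The figures of this example can be reproduced in R. The sketch below computes the statistic by hand and then cross-checks it with the built-in t.test function of the stats package; variable names are illustrative.

```r
# Pulse rates from the example
before <- c(74, 86, 98, 102, 78, 84, 79, 70)
after  <- c(70, 85, 90, 110, 71, 80, 69, 74)
d <- before - after          # decrease for each volunteer

alpha <- 0.05
mu0 <- 0
N <- length(d)

mu_hat <- mean(d)
# sd() uses the N - 1 denominator, as in the definition of T
T_stat <- sqrt(N) * (mu_hat - mu0) / sd(d)

# Upper alpha/2 point of the Student distribution with N - 1 degrees of freedom
t_crit <- qt(alpha / 2, df = N - 1, lower.tail = FALSE)

reject <- abs(T_stat) > t_crit

# Cross-check with the built-in one-sample, two-sided t-test
res <- t.test(d, mu = mu0)

print(c(mu_hat = mu_hat, T = T_stat, t_crit = t_crit, p_value = res$p.value))
print(reject)
```

Note that qt is called with lower.tail = FALSE so that the returned critical value is the positive upper quantile.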


The chi-squared distribution

For $N$ a positive integer, a r.v. $z$ has a $\chi^2_N$ distribution if

$$z = x_1^2 + \dots + x_N^2$$

where $x_1, x_2, \dots, x_N$ are i.i.d. random variables $\mathcal{N}(0, 1)$.

• The probability distribution is a gamma distribution with parameters $(\frac{1}{2}N, \frac{1}{2})$.

• $E[z] = N$ and $\mathrm{Var}[z] = 2N$.

• The distribution is called “a chi-squared distribution with $N$ degrees of freedom”.


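$\chi^2$-test: one-sample and two-sided

• Consider a random sample from $\mathcal{N}(\mu, \sigma^2)$ with $\mu$ known.

• Let

$$H : \sigma^2 = \sigma_0^2; \qquad \bar{H} : \sigma^2 \ne \sigma_0^2$$

• Let $SS = \sum_i (z_i - \mu)^2$.

• It can be shown that if $H$ is true then $SS/\sigma_0^2 \sim \chi^2_N$.

• The size-$\alpha$ $\chi^2$-test rejects $H$ if $SS/\sigma_0^2 < a_1$ or $SS/\sigma_0^2 > a_2$, where

$$\mathrm{Prob}\left\{\frac{SS}{\sigma_0^2} < a_1\right\} + \mathrm{Prob}\left\{\frac{SS}{\sigma_0^2} > a_2\right\} = \alpha$$

• If $\mu$ is unknown, you must

1. replace $\mu$ with $\hat{\mu}$ in the quantity $SS$

2. use a $\chi^2_{N-1}$ distribution.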
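
A possible R sketch of this test is given below. The sample is hypothetical (it is not from the slides), and the critical values a1 and a2 are chosen by splitting alpha equally between the two tails, which is one common convention and an assumption here, since the condition above only fixes the sum of the two tail probabilities.

```r
# Hypothetical sample and settings (assumptions, not from the lecture)
z <- c(4.1, 5.3, 6.2, 4.8, 5.9, 3.7, 5.0, 6.4)
mu <- 5           # known mean
sigma0_sq <- 1    # variance under H
alpha <- 0.05
N <- length(z)

# Case 1: mu known, SS / sigma0^2 follows a chi-squared with N degrees of freedom
SS <- sum((z - mu)^2)
a1 <- qchisq(alpha / 2, df = N)                      # lower critical value
a2 <- qchisq(alpha / 2, df = N, lower.tail = FALSE)  # upper critical value
reject <- (SS / sigma0_sq < a1) || (SS / sigma0_sq > a2)

# Case 2: mu unknown, replace mu by the sample mean and use N - 1 degrees of freedom
SS_hat <- sum((z - mean(z))^2)
b1 <- qchisq(alpha / 2, df = N - 1)
b2 <- qchisq(alpha / 2, df = N - 1, lower.tail = FALSE)
reject_unknown <- (SS_hat / sigma0_sq < b1) || (SS_hat / sigma0_sq > b2)

print(c(SS = SS, a1 = a1, a2 = a2, SS_hat = SS_hat, b1 = b1, b2 = b2))
print(c(reject = reject, reject_unknown = reject_unknown))
```

For this hypothetical sample neither version of the test rejects the hypothesis that the variance equals 1.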


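t-test: two-samples, two-sided

Consider two r.v.s $x \sim \mathcal{N}(\mu_1, \sigma^2)$ and $y \sim \mathcal{N}(\mu_2, \sigma^2)$ with the same variance. Let $D_N^x$ and $D_M^y$ be two independent sets of samples.

We want to test $H : \mu_1 = \mu_2$ against $\bar{H} : \mu_1 \ne \mu_2$.

Let

$$\hat{\mu}_x = \frac{\sum_{i=1}^{N} x_i}{N}, \quad SS_x = \sum_{i=1}^{N} (x_i - \hat{\mu}_x)^2, \quad \hat{\mu}_y = \frac{\sum_{i=1}^{M} y_i}{M}, \quad SS_y = \sum_{i=1}^{M} (y_i - \hat{\mu}_y)^2$$

Once we have defined the statistic

$$T = \frac{\hat{\mu}_x - \hat{\mu}_y}{\sqrt{\left(\frac{1}{M} + \frac{1}{N}\right)\left(\frac{SS_x + SS_y}{M + N - 2}\right)}} \sim T_{M+N-2}$$

it can be shown that a test of size $\alpha$ rejects $H$ if

$$|T| > t_{\alpha/2, M+N-2}$$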
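
As an illustration, the statistic can be computed for the grades of Section A and Section B listed earlier in these slides. Treating the grades as normal samples with equal variance and choosing alpha = 0.05 are assumptions made here only for the sake of the example; the result is cross-checked with t.test from the stats package.

```r
# Grades of the two school sections (from the earlier slide)
x <- c(15, 10, 12, 19, 5, 7)    # Section A
y <- c(14, 11, 11, 12, 6, 7)    # Section B
alpha <- 0.05                   # assumed significance level

N <- length(x)
M <- length(y)

SS_x <- sum((x - mean(x))^2)
SS_y <- sum((y - mean(y))^2)

# Pooled-variance t statistic with M + N - 2 degrees of freedom
T_stat <- (mean(x) - mean(y)) /
  sqrt((1 / M + 1 / N) * (SS_x + SS_y) / (M + N - 2))

t_crit <- qt(alpha / 2, df = M + N - 2, lower.tail = FALSE)
reject <- abs(T_stat) > t_crit

# Cross-check with the built-in equal-variance two-sample t-test
res <- t.test(x, y, var.equal = TRUE)

print(c(T = T_stat, t_crit = t_crit, p_value = res$p.value))
print(reject)
```

Under these assumptions the difference between the two sections is not significant at the 5% level.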


F-distribution

Let $x \sim \chi^2_M$ and $y \sim \chi^2_N$ be two independent r.v.s. A r.v. $z$ has an F-distribution $F_{M,N}$ with $M$ and $N$ degrees of freedom if

$$z = \frac{x/M}{y/N}$$

• If $z \sim F_{M,N}$ then $1/z \sim F_{N,M}$.

• If $z \sim T_N$ then $z^2 \sim F_{1,N}$.
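
The two properties can be verified numerically with the distribution functions pf and pt of base R. The degrees of freedom and the evaluation points a and b below are arbitrary illustrative values.

```r
M <- 20
N <- 10

# Property 1: if z ~ F(M, N) then 1/z ~ F(N, M),
# so Prob{z <= a} = Prob{1/z >= 1/a}
a <- 1.7
lhs1 <- pf(a, M, N)
rhs1 <- pf(1 / a, N, M, lower.tail = FALSE)

# Property 2: if z ~ T(N) then z^2 ~ F(1, N),
# so Prob{|z| <= b} = Prob{z^2 <= b^2}
b <- 1.2
lhs2 <- 1 - 2 * pt(b, N, lower.tail = FALSE)
rhs2 <- pf(b^2, 1, N)

print(c(lhs1 = lhs1, rhs1 = rhs1, lhs2 = lhs2, rhs2 = rhs2))
```

The second property explains why a two-sided t-test and an F-test on the squared statistic lead to the same decision.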


F-distribution

[Figure: two panels for $F_{M,N}$ with $M = 20$, $N = 10$, plotted over the range 0 to 5: the density and the cumulative distribution.]

R script s_f.R.


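F-test: two-samples, two-sided

Consider a random sample $x_1, \dots, x_M$ from $\mathcal{N}(\mu_1, \sigma_1^2)$ and a random sample $y_1, \dots, y_N$ from $\mathcal{N}(\mu_2, \sigma_2^2)$ with $\mu_1$ and $\mu_2$ unknown. Suppose we want to test

$$H : \sigma_1^2 = \sigma_2^2; \qquad \bar{H} : \sigma_1^2 \ne \sigma_2^2$$

Let us consider the statistic

$$f = \frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} = \frac{SS_1/(M - 1)}{SS_2/(N - 1)} \sim \frac{\sigma_1^2\,\chi^2_{M-1}/(M - 1)}{\sigma_2^2\,\chi^2_{N-1}/(N - 1)} = \frac{\sigma_1^2}{\sigma_2^2} F_{M-1,N-1}$$

It can be shown that if $H$ is true, the ratio $f$ has an F-distribution $F_{M-1,N-1}$.

We reject $H$ if the ratio $f$ is large, i.e. $f > F_{\alpha, M-1, N-1}$, where

$$\mathrm{Prob}\{z > F_{\alpha, M-1, N-1}\} = \alpha$$

if $z \sim F_{M-1,N-1}$.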
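
A sketch of the rejection rule stated above, applied to the grades of Section A and Section B listed earlier in these slides. Treating the grades as normal samples and choosing alpha = 0.05 are assumptions made for illustration; the ratio is cross-checked against the statistic returned by var.test from the stats package.

```r
# Grades of the two school sections (from the earlier slide)
x <- c(15, 10, 12, 19, 5, 7)    # Section A
y <- c(14, 11, 11, 12, 6, 7)    # Section B
alpha <- 0.05                   # assumed significance level

M <- length(x)
N <- length(y)

SS_1 <- sum((x - mean(x))^2)
SS_2 <- sum((y - mean(y))^2)

# Ratio of the two sample variances
f <- (SS_1 / (M - 1)) / (SS_2 / (N - 1))

# Upper alpha point of the F distribution with (M - 1, N - 1) degrees of freedom
F_crit <- qf(alpha, M - 1, N - 1, lower.tail = FALSE)
reject <- f > F_crit

# The built-in var.test computes the same ratio of sample variances
res <- var.test(x, y)

print(c(f = f, F_crit = F_crit))
print(reject)
```

Under these assumptions the observed ratio stays below the critical value, so equality of the two variances is not rejected.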

