
Page 1

1

Introduction to probability and statistics (2)

Andreas Hoecker (CERN), CERN Summer Student Lecture, 17–21 July 2017

If you have questions, please do not hesitate to contact me: [email protected]

Page 2

2

Outline (4 lectures)

1st lecture: Introduction • Probability (…some catch-up to do)

2nd lecture: Probability axioms and hypothesis testing • Parameter estimation • Confidence levels

3rd lecture: Maximum likelihood fits • Monte Carlo methods • Data unfolding

4th lecture: Multivariate techniques and machine learning

Page 3

Catch-up from yesterday

3

Page 4

[Excerpt from Glen Cowan, Statistical Data Analysis, Ch. 1 'Fundamental concepts' (p. 10): a useful concept related to the cumulative distribution F(x) is the quantile of order α, defined by F(x_α) = α with 0 ≤ α ≤ 1, i.e. x_α = F⁻¹(α). The special case x_{1/2} is the median, often used as a measure of the typical 'location' of the random variable; another location measure is the mode, the value at which the p.d.f. is maximal, although the most commonly used location parameter is the expectation value. For a measurement characterised by two continuous random variables x and y, with event A = 'x observed in [x, x + dx], y anywhere' and B = 'y observed in [y, y + dy], x anywhere', the joint p.d.f. f(x, y) is defined by P(A ∩ B) = f(x, y) dx dy, the probability for x in [x, x + dx] and y in [y, y + dy]. Fig. 1.4 shows a scatter plot of 1000 observations of x and y; the probability for a point to lie in the intersection of the two bands (the event A ∩ B) is the joint p.d.f. times the area element.]

Multidimensional random variables

What if a measurement consists of two variables?

Let:  A = measurement of x in [x, x + dx];  B = measurement of y in [y, y + dy]

Joint probability:  P(A ∩ B) = p_xy(x, y) dx dy   (where p_xy(x, y) is the joint PDF)

If the two variables are independent:  P(A ∩ B) = P(A) · P(B),  i.e.  p_xy(x, y) = p_x(x) · p_y(y)

Marginal PDF: if one is not interested in the dependence on y (or cannot measure it),

→ integrate out ("marginalise") y, i.e. project onto x
→ the resulting one-dimensional PDF is:  p_x(x) = ∫ p_xy(x, y) dy

4

From: Glen Cowan, Statistical data analysis
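As a quick illustration of marginalisation (not from the lecture; the sample and binning are invented), a small numpy sketch that builds a binned joint PDF and integrates out y:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy joint sample (x, y): correlated 2D Gaussian, standing in for p_xy(x, y)
cov = [[1.0, 0.6], [0.6, 2.0]]
x, y = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=100_000).T

# 2D histogram as a binned estimate of the joint PDF p_xy(x, y)
p_xy, x_edges, y_edges = np.histogram2d(x, y, bins=50, density=True)
dy = np.diff(y_edges)

# Marginalise: p_x(x) = integral of p_xy(x, y) dy  ->  sum over y bins times bin width
p_x_marginal = (p_xy * dy).sum(axis=1)

# Cross-check against directly histogramming x alone
p_x_direct, _ = np.histogram(x, bins=x_edges, density=True)
print(np.allclose(p_x_marginal, p_x_direct))  # True: projecting out y recovers p_x(x)
```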

Page 5

[Excerpt from Glen Cowan, Statistical Data Analysis, Fig. 1.6: (a) a scatter plot of random variables x and y indicating two infinitesimal bands in x of width dx, at x₁ (solid band) and x₂ (dashed band); (b) the conditional p.d.f.s h(y|x₁) and h(y|x₂) corresponding to the projections of the bands onto the y axis.]

f_x(x) = ∫ g(x|y) f_y(y) dy,    f_y(y) = ∫ h(y|x) f_x(x) dx.    (1.27, 1.28)

These correspond to the law of total probability, generalised to continuous random variables. If 'x in [x, x + dx] with any y' (event A) and 'y in [y, y + dy] with any x' (event B) are independent, i.e. P(A ∩ B) = P(A) P(B), then the corresponding joint p.d.f. for x and y factorises: f(x, y) = f_x(x) f_y(y) (1.29). For independent x and y the conditional p.d.f. g(x|y) is the same for all y, and similarly h(y|x) does not depend on x: knowledge of one of the variables does not change the probabilities for the other. The variables x and y shown in Fig. 1.6 are not independent, as can be seen from the fact that h(y|x) depends on x.

Conditioning versus marginalisation

Conditional probability P(A|B)   [read: P(A|B) = "probability of A given B"]

Rather than integrating over the whole y region (marginalisation), look at one-dimensional (1D) slices of the two-dimensional (2D) PDF p_xy(x, y):

P(A|B) = P(A ∩ B) / P(B) = [p_xy(x, y) dx dy] / [p_y(y) dy]    ⇔    P(A ∩ B) = P(A|B) · P(B)

A slice at fixed x = x₁ gives (up to normalisation) the conditional PDF:  p_y(y|x₁) ∝ p_xy(x = const = x₁, y)

[Figure: conditional PDFs p_y(y|x₁) and p_y(y|x₂) as slices of the 2D PDF]

From: Glen Cowan, Statistical data analysis

Page 6

Covariance and correlation

Recall, for a 1D PDF p_x(x) we had:  E[x] = μ_x,  V[x] = σ_x²

For a 2D PDF p_xy(x, y), one correspondingly has:  μ_x, μ_y, σ_x, σ_y

How do x and y co-vary? → define the covariance:

  C_xy = cov(x, y) = E[(x − μ_x)(y − μ_y)] = E[xy] − μ_x μ_y

From this define the scale / dimension invariant correlation coefficient:

  ρ_xy = C_xy / (σ_x σ_y),   where ρ_xy ∈ [−1, +1]

• If x, y are independent: ρ_xy = 0, i.e. they are uncorrelated (they factorise).
  Proof:  E[xy] = ∬ xy · p_xy(x, y) dx dy = ∬ xy · p_x(x) p_y(y) dx dy = ∫ x p_x(x) dx · ∫ y p_y(y) dy = μ_x μ_y

• Note that the converse is not always true: non-linear correlations can lead to ρ_xy = 0 → see next page
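A minimal numpy sketch (toy numbers of my own) of the covariance and correlation coefficient, including a non-linear case where the variables are fully dependent and yet ρ_xy ≈ 0:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Linearly correlated pair: y = 0.5*x + noise
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.8, size=n)

C_xy = np.mean(x * y) - x.mean() * y.mean()       # covariance  E[xy] - mu_x mu_y
rho_xy = C_xy / (x.std() * y.std())               # correlation coefficient
print(f"linear case:     rho = {rho_xy:+.3f}")    # close to +0.53

# Non-linear dependence with vanishing linear correlation: v = u^2
u = rng.normal(size=n)
v = u**2                                          # fully dependent on u
rho_uv = np.corrcoef(u, v)[0, 1]
print(f"non-linear case: rho = {rho_uv:+.3f}")    # close to 0 although u and v are dependent
```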

Page 7

Correlations

Figure from: https://en.wikipedia.org/wiki/Correlation_and_dependence

The correlation coefficient measures the noisiness and direction of a linear relationship:

…it does not measure the slope of the relationship (see the figures)

…and non-linear correlation patterns are not, or only approximately, captured by ρ_xy (see the figures)

[Figure: panels of scatter plots of (x, y) with the corresponding value of ρ_xy indicated above each panel]

Page 8

Correlations

Measure of mutual dependence between two variables: "How much information is shared among them?"

Non-linear correlation can be captured by the "mutual information" quantity I_xy:

  I_xy = ∬ p_xy(x, y) · ln[ p_xy(x, y) / (p_x(x) p_y(y)) ] dx dy

where I_xy = 0 only if x, y are fully statistically independent.
Proof: if independent, then p_xy(x, y) = p_x(x) p_y(y) ⇒ ln[…] = 0

NB: I_xy = H_x − H_{x|y} = H_y − H_{y|x}, where H_x = −∫ p_x(x) · ln p_x(x) dx is the entropy and H_{x|y} the conditional entropy
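A rough, histogram-based estimate of I_xy as a sketch (binning and sample are my own choices; the integral above is approximated by a sum over bins):

```python
import numpy as np

def mutual_information(x, y, bins=40):
    """Crude histogram estimate of I_xy = sum p_xy * ln[p_xy / (p_x p_y)]."""
    p_xy, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy /= p_xy.sum()                      # joint probabilities per bin
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal in x
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal in y
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

rng = np.random.default_rng(2)
u = rng.normal(size=200_000)
print(mutual_information(u, rng.normal(size=u.size)))  # ~0: independent variables
print(mutual_information(u, u**2))                     # clearly > 0: rho is ~0 but I_xy is not
```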

Page 9

2D Gaussian (uncorrelated)

Two variables x, y are independent:  p_xy(x, y) = p_x(x) · p_y(y)

  p_xy(x, y) = 1/(√(2π) σ_x) · exp(−(x − μ_x)²/(2σ_x²)) · 1/(√(2π) σ_y) · exp(−(y − μ_y)²/(2σ_y²))

[Figure: contours of the uncorrelated 2D Gaussian in the (x − μ_x, y − μ_y) plane]

Page 10

2D Gaussian (correlated)

Two variables x, y are not independent:  p_xy(x, y) ≠ p_x(x) · p_y(y)

  p_x⃗(x⃗) = 1/(2π √det(C)) · exp( −½ (x⃗ − μ⃗)ᵀ C⁻¹ (x⃗ − μ⃗) )

where

  C = ( ⟨x²⟩ − ⟨x⟩²     ⟨xy⟩ − ⟨x⟩⟨y⟩
        ⟨xy⟩ − ⟨x⟩⟨y⟩   ⟨y²⟩ − ⟨y⟩²  )

is the (symmetric) covariance matrix. Corresponding correlation matrix elements:

  ρ_ij = ρ_ji = C_ij / √(C_ii · C_jj)

[Figure: contours of the correlated 2D Gaussian in the (x − ⟨x⟩, y − ⟨y⟩) plane]

Page 11

SQRT decorrelation

Find a variable transformation that diagonalises a covariance matrix C.

Determine the "square root" C′ of C (such that C = C′ · C′) by first diagonalising C:

  D = Sᵀ C S   ⟺   C′ = S √D Sᵀ

where D is diagonal, √D = diag(√d₁₁, …, √d_nn), and S is an orthogonal matrix.

Linear decorrelation of a correlated vector x is then obtained by

  x′ = (C′)⁻¹ · x

Principal component analysis (PCA) is another convenient method to achieve linear decorrelation (PCA is a linear transformation that rotates a vector basis such that the maximum variability becomes visible; it identifies the most important gradients).

Example: original correlations

[Figure: scatter plot of the original, correlated variables x and y]
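A numerical sketch of the square-root decorrelation described above, using an eigendecomposition of the sample covariance matrix (numpy; the toy covariance is invented):

```python
import numpy as np

rng = np.random.default_rng(3)

# Correlated toy data (rows = events, columns = variables x, y)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.5]], size=50_000)

C = np.cov(X, rowvar=False)             # sample covariance matrix
d, S = np.linalg.eigh(C)                # D = S^T C S with S orthogonal, d = diag(D)
C_sqrt = S @ np.diag(np.sqrt(d)) @ S.T  # C' = S sqrt(D) S^T, so that C' C' = C

X_prime = X @ np.linalg.inv(C_sqrt).T   # x' = (C')^-1 x, applied event by event

print(np.round(np.cov(X_prime, rowvar=False), 3))  # ~identity: variables decorrelated
```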

Page 12

SQRT decorrelation

Example: after SQRT decorrelation

SQRT decorrelation works only for linear correlations!

(The construction is the same as on the previous page: D = Sᵀ C S ⟺ C′ = S √D Sᵀ, and x′ = (C′)⁻¹ · x.)

[Figure: scatter plot of the transformed variables x′ and y′ after SQRT decorrelation]

Page 13

Functions of random variables

Any function of a random variable is itself a random variable

E.g., x with PDF p_x(x) becomes y = f(x); y could be a parameter extracted from a measurement.

What is the PDF p_y(y)?

• Probability conservation:  p_y(y) |dy| = p_x(x) |dx|

• For a 1D function f(x) with an existing inverse:  dy = (df(x)/dx) dx  ⟺  dx = (df⁻¹(y)/dy) dy

• Hence:  p_y(y) = p_x(f⁻¹(y)) · |dx/dy|

Note: this is not the standard error propagation but the full PDF !
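A quick Monte Carlo cross-check of p_y(y) = p_x(f⁻¹(y)) |dx/dy| for an invertible example of my choosing, y = exp(x) with Gaussian x (a sketch, not from the slides):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

mu, sigma = 0.2, 0.5
x = rng.normal(mu, sigma, size=1_000_000)
y = np.exp(x)                                  # y = f(x), invertible: x = ln(y), |dx/dy| = 1/y

# Analytic transformed PDF: p_y(y) = p_x(ln y) * 1/y
y_grid = np.linspace(0.05, 6.0, 400)
p_y = norm.pdf(np.log(y_grid), mu, sigma) / y_grid

# Compare with a histogram of the transformed sample
hist, edges = np.histogram(y, bins=100, range=(0.05, 6.0), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(np.interp(centres, y_grid, p_y) - hist)))  # small residual: formula reproduced
```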

[Excerpt from Glen Cowan, Statistical Data Analysis, 'Fundamental concepts' (p. 14), Fig. 1.7: transformation of variables for (a) a function a(x) with a single-valued inverse x(a) and (b) a function for which the interval da corresponds to two intervals dx₁ and dx₂. One requires that the probability for x to lie between x and x + dx equal the probability for a to lie between a and a + da: g(a′) da′ = ∫_dS f(x) dx (1.30), where the integral runs over the infinitesimal element dS defined by the region in x-space between a(x) = a′ and a(x) = a′ + da′. If a(x) can be inverted to obtain x(a), this gives g(a) = f(x(a)) |dx/da| (1.32); the absolute value ensures that the result is positive. If a(x) does not have a unique inverse, dS must include the contributions from all regions in x-space between a(x) = a′ and a(x) = a′ + da′. For a function a(x₁, …, x_n) of n random variables with joint p.d.f. f(x₁, …, x_n): g(a′) da′ = ∫…∫_dS f(x₁, …, x_n) dx₁ … dx_n (1.33), with dS the region between the two (hyper)surfaces a(x₁, …, x_n) = a′ and a′ + da′.]

[Figure: sketch of y = f(x) with the corresponding intervals dx and dy indicated; from Glen Cowan, Statistical data analysis]

Page 14

Error propagation

Let’s assume a measurement 𝒙 with unknown PDF 𝒑𝒙(𝒙), and a transformation 𝒚 = 𝒇(𝒙)

• �̅� and 𝑉| are estimates of 𝜇 and variance 𝜎Aof 𝑝/(𝑥)

What are 𝐸 𝑦 and, in particular, 𝝈𝒚𝟐 ? ® Taylor-expand 𝑓 𝑥 around �̅�:

• 𝑓 𝑥 = 𝑓 �̅� + ���/�/�/̅

𝑥 − �̅� + ⋯ ⇒ 𝐸 𝑓 𝑥 ≃ 𝑓 �̅� (because: 𝐸 𝑥 − 𝑥� = 0 !)

Now define 𝑦� = 𝑓 �̅� , and from the above follows:

⬄ 𝑦 − 𝑦� ≃ ���/�/�/̅

𝑥 − �̅�

⬄ 𝐸 (𝑦 − 𝑦�)A = ���/�/�/̅

A𝐸 (𝑥 − �̅�)A

⬄ 𝑉|0 =���/�/�/̅

A𝑉|/

⬄ 𝜎0 =���/�/�/̅

5 𝜎/

14

→ (approximate) error propagation

Page 15

Error propagation (continued)

In case of several variables, compute covariance matrix and partial derivatives

• Let f = f(x₁, …, x_n) be a function of n randomly distributed variables

• (df/dx)²|_{x̄} · V̂_x then becomes (where x̄ = (x̄₁, …, x̄_n)):

  Σ_{i,j=1}^{n} (∂f/∂x_i)(∂f/∂x_j)|_{x̄} · V̂_{i,j}

• with the covariance matrix:

  V̂_{i,j} = ( σ_{x₁}²      ⋯   σ_{x₁ x_n}
              ⋮            ⋱   ⋮
              σ_{x₁ x_n}   ⋯   σ_{x_n}²  )

→ The resulting "error" (uncertainty) depends on the correlations of the input variables:

  o Typically (not always!) positive correlations lead to an increase of the total error,

  o and negative correlations decrease the total error

For very complicated functional dependence 𝒇 = 𝒇(𝒙𝟏,… , 𝒙𝒏), use Monte Carlo techniques (“pseudo MC generation”) to propagate uncertainties
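A sketch comparing the linearised propagation above with Monte Carlo ("pseudo-experiment") propagation, for an invented function f(x₁, x₂) = x₁/x₂ and an invented covariance matrix:

```python
import numpy as np

# Invented example: f(x1, x2) = x1 / x2 with correlated inputs
xbar = np.array([10.0, 5.0])
V = np.array([[0.40, 0.15],     # covariance matrix of (x1, x2)
              [0.15, 0.25]])

f = lambda x1, x2: x1 / x2
grad = np.array([1.0 / xbar[1],              # df/dx1 at xbar
                 -xbar[0] / xbar[1] ** 2])   # df/dx2 at xbar

# Linear error propagation: sigma_f^2 = grad^T V grad
sigma_lin = np.sqrt(grad @ V @ grad)

# Monte Carlo propagation: sample the inputs from their covariance and look at the spread of f
rng = np.random.default_rng(5)
x1, x2 = rng.multivariate_normal(xbar, V, size=1_000_000).T
sigma_mc = f(x1, x2).std()

print(f"linear: {sigma_lin:.4f}   MC: {sigma_mc:.4f}")  # agree to a few per cent here
```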

Page 16

Probability (axioms) & Statistics

16

Page 17

What is a Probability ?

Axioms of probability (Kolmogorov, 1933)

• P(A) ≥ 0, where A is any subset of the sample space (universe) U

• Unitarity: ∫_U P(A) dA = 1

• If A ∩ B = ∅ (read A ∩ B: "A and B") → P(A ∪ B) = P(A) + P(B)   (where A ∪ B = "A or B")

Recall: the conditional probability P(A|B) was defined by P(A|B) = P(A ∩ B)/P(B). It is the probability of A in a universe restricted to B.

17

[Venn diagrams: left, disjoint/exclusive sets A and B inside the universe, (A ∩ B) = ∅; right, overlapping sets A and B with their intersection A ∩ B. Portrait: Andrey Nikolaevich Kolmogorov]

Page 18

What is a Probability ? (continued)

Axioms of probability → set theory (a "set" is a collection of things/elements)

1. A measure of how likely an "event" will occur, expressed as the ratio of favourable to all possible cases in repeatable trials

• Frequentist (classical) probability:

  P("event") = lim_{n→∞} [# outcomes that are "event"] / n "trials"

2. The "degree of belief" that an event is going to happen

• Bayesian probability:

  – P("event") is the degree of belief that "event" will happen → no need for "repeatable trials"

  – Degree of belief (in view of the data and previous knowledge (belief) about the "event") that a parameter has a certain "true" value

• Bayes' theorem:

  P(A|B) = P(B|A) · P(A) / P(B)

  Proof from conditional probability: P(A|B) P(B) = P(A ∩ B) = P(B ∩ A) = P(B|A) P(A)

  The prior probability P(A) has been modified by B to become the posterior probability P(A|B)
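A tiny numerical illustration of Bayes' theorem (all numbers invented): the prior probability P(A) that an event is signal is updated by the observation B that it passes a selection:

```python
# Toy example: A = "event is signal", B = "event passes the selection"
p_A = 0.01            # prior: 1% of events are signal
p_B_given_A = 0.90    # selection efficiency for signal
p_B_given_notA = 0.05 # background mis-identification rate

# Normalisation P(B) from the law of total probability
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Posterior P(A|B) = P(B|A) P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(A|B) = {p_A_given_B:.3f}")   # ~0.154: very different from P(B|A) = 0.9
```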

Page 19

Frequentist versus Bayesian statistics

19

Frequentist statement:

• Probability of the “observed data”* to occur given a model (hypothesis): 𝑃(data|model)

Bayesian statement:

• Probability of the model given the data: 𝑃(model|data)

• Let’s look again at Bayes’ theorem written slightly differently (𝜃 = set of parameters fixing the model)

𝑃 𝜃 data =𝑃(data|𝜃) 5 𝑃 𝜃

𝑃(data)𝑃 𝜃 data : posterior probability of 𝜃 given the data𝑃 data 𝜃 : probability of data given 𝜃𝑷 𝜽 : the “prior” probability for 𝜽𝑃 data : a normalisation

, here:

Frequentist statistics is unaware of the "truth", and only allows one to exclude unlikely hypotheses (an objective statement).

Bayesian statistics speculates about the "truth" by injecting arbitrary prior probabilities (a subjective statement). By virtue of the Central Limit Theorem, the prior dependence may be weak in concrete cases.

*The term "observed data" is sloppy: what is meant is an observed estimator value, and the probability refers to cumulative estimator values

Page 20

𝑃(data | model) ≠ 𝑃(model | data)

20

Consider a new physics search where a local excess of events has been observed with a (global) significance of 2.5 standard deviations, corresponding to a ~0.6% one-sided probability

• Assuming P(data|model) = P(model|data), and concluding that for a given well-fitting new-physics model P(new physics model|data) ≃ 99.4%, is wrong (though frequently done by the press)

[Figure: ATLAS dijet-mass (m_jj) spectrum in the WZ selection, √s = 8 TeV, 20.3 fb⁻¹, showing data, the background model, EGM W′ expectations at 1.5, 2.0 and 2.5 TeV (c = 1), and the local significance (stat and stat + syst); from ATLAS, arXiv:1506.00962]

Frequentist statistics gives the probability to observe certain data under a given hypothesis, but it says nothing about the probability of the hypothesis to be true. Important subtlety!

• We will later define "confidence levels": if P(data|model) < 5% → discard the model

Page 21

Frequentist versus Bayesian statistics

Both statistical concepts have important applications

• Most LHC results use frequentist statistics as it is objective, and empirical sciences progress by successive exclusion and improvement of the understanding (theory)

• Nevertheless, there are many Bayesian elements (e.g., the "decision" on exclusion or discovery given an observed P(data|model), the interpretation of results)

• It is also possible to define, problem-dependent, "objective priors" in Bayesian statistics

• A frequentist analysis can become technically very challenging → Bayesian often simpler

• The predictivity of Bayesian statistics is useful when it comes to decision taking, eg:

• Should I sell, buy or hold certain stocks?

• Should I build the LHC? (Bayesian "no-lose" theorem)

• Almost any decision in life…

21

Page 22

Frequentist versus Bayesian statistics


Bayesians address the question everyone is interested in, by using assumptions no-one believes

Frequentists use impeccable logic to deal with an issue of no interest to anyone

Slightly provocative summary by Louis Lyons (Academic Lecture at Fermilab, August 17, 2004)

…to be taken with a grain of salt!

Page 23

Hypothesis testing

A hypothesis 𝑯 specifies some model which might lie at the origin of the data 𝒙

a) Point hypothesis: 𝐻 could be a particular event type (eg, Higgs boson versus background)

b) Composite hypothesis: 𝐻 could be a parameter (eg, Higgs boson mass or coupling strength)

23

In case a), the PDF is simply PDF(x) = PDF(x; H)

In case b), H contains unspecified parameters (θ: mass, coupling, systematic uncertainties)

• One obtains a whole band of PDFs: PDF(x; H(θ))

• For given x, PDF(x; H(θ)) can be interpreted as a function of θ → the likelihood function L(θ)

• L(θ) = L(x|H(θ)) for fixed θ is the probability density to observe x given the model H(θ); note, however, that L(θ) is not a PDF in θ

Statistical tests are often formulated using a

• Null hypothesis (eg, Standard Model (SM) background only)

• Alternative hypothesis (eg, SM background + new physics)

Page 24

Hypothesis testing (continued)

Example: a multivariate (→ see last lecture) classification analysis to search for new physics

Take n input variables and combine them into a single output discriminant or test statistic y

Choose a cut value, i.e. a region where one can "reject" the null (background) hypothesis (the optimal cut value depends on the signal and background cross sections and purities):

  y > cut: signal region;   y = cut: decision boundary;   y < cut: background region

[Figure: distributions PDF(y|bkg) and PDF(y|signal) of the test statistic y, with the cut value indicated]

[Figure: normalised signal and background distributions of the input variables var1+var2, var1−var2, var3 and var4 (TMVA training sample), which are combined into the discriminant y]

Page 25

Hypothesis testing (continued)

25

It happens that one makes mistakes:

Type-2 error (false negative) → accept the null hypothesis although it is not true (there is new physics in the data)

Type-1 error (false positive) → reject the null hypothesis although it is true (there is no new physics)

                          decide Signal (H₁)   decide Background (H₀)
  true Signal (H₁):       correct              Type-2 error
  true Background (H₀):   Type-1 error         correct

Example: the goal of a new physics search is to exclude the null hypothesis (as being an unlikely model for the observation)

Page 26

Hypothesis testing (continued)

26

(Type-1 and Type-2 errors as defined on the previous page; they are quantified for a cut on the discriminant y(x).)

Significance α: the Type-1 error rate — the rate ("risk") of "false discovery", i.e. background in the signal sample; should be small:

  α = ∫_{y(x) > cut} P(x|H₀) dx

Size β: the Type-2 error rate; should also be small. The power 1 − β is the sensitivity to the "alternative" (theory, signal efficiency):

  β = ∫_{y(x) < cut} P(x|H₁) dx

Page 27

Hypothesis testing (continued)

27

Define a critical region C → if the data (observation) fall into it, reject the hypothesis

• Want to discriminate between the hypotheses H₀ and H₁

• Define a test statistic y(x) for the data x

• Compute the expected y distributions for the two hypotheses: PDF(y(x)|H₀) and PDF(y(x)|H₁)

• Compute the observed test statistic y_obs(x) → decide on the outcome according to whether or not y_obs ∈ C

[Figure: PDFs of y under H₀ and H₁, with the critical region and the observed value y_obs indicated]

Page 28

Neyman-Pearson Lemma

Which test statistic (discriminant) y(x) should one actually choose? What is optimal?

Neyman-Pearson (1933): the likelihood ratio used as test statistic y(x) gives, for each significance α, the test (critical region) with the largest power 1 − β.

Likelihood ratio:   y(x) = P(x|H₁) / P(x|H₀)   — or any monotonic function thereof, e.g. ln y(x)

The likelihood ratio maximises the area under the "Receiver Operating Characteristic" (ROC) curve.

[Figure: ROC curve of 1 − β versus 1 − α, each between 0 and 1; the best ROC curve is given by the likelihood ratio, any other test statistic gives a worse (lower) curve. A small Type-1 error comes with a large Type-2 error and vice versa.]

The proof of the Neyman-Pearson Lemma is straightforward and almost obvious given the definitions of 𝛼 and 𝛽

See, eg: https://en.wikipedia.org/wiki/Neyman–Pearson_lemma
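A toy sketch of the Neyman-Pearson statement (construction and numbers are mine): for two 2D Gaussian hypotheses, the ROC area obtained with the likelihood ratio exceeds that of a single input variable:

```python
import numpy as np
from scipy.stats import multivariate_normal

n = 100_000

# Two simple hypotheses for a 2D observable x = (x1, x2)
H0 = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 1.0]])
H1 = multivariate_normal(mean=[1.0, -1.0], cov=[[1.0, 0.5], [0.5, 1.0]])
x0, x1 = H0.rvs(n, random_state=1), H1.rvs(n, random_state=2)

def roc_area(s0, s1):
    """Area under the ROC curve = P(s1 > s0), i.e. the Mann-Whitney statistic."""
    n0, n1 = len(s0), len(s1)
    ranks = np.concatenate([s0, s1]).argsort().argsort()[n0:]   # 0-based ranks of the H1 values
    return (ranks.sum() - n1 * (n1 - 1) / 2) / (n0 * n1)

lr0 = H1.logpdf(x0) - H0.logpdf(x0)    # log likelihood ratio evaluated under H0 ...
lr1 = H1.logpdf(x1) - H0.logpdf(x1)    # ... and under H1

print("AUC, likelihood ratio :", round(roc_area(lr0, lr1), 3))                 # ~0.92
print("AUC, x1 alone         :", round(roc_area(x0[:, 0], x1[:, 0]), 3))       # ~0.76, worse
```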

Page 29

Neyman-Pearson Lemma

Unfortunately, the Neyman-Pearson Lemma holds strictly only for simple hypotheses without free parameters

If H₀/₁ are "composite hypotheses" H₀/₁(θ), it is not even certain that there exists a so-called uniformly most powerful test statistic, i.e. one that for each given α is the most powerful (largest 1 − β)

Note: already in the presence of systematic uncertainties (entering as varying but constrained "nuisance parameters") it is not certain that the likelihood ratio is the optimal test statistic

However: the likelihood ratio is probably close to optimal, it is a very convenient test statistic, and therefore commonly used in experimental particle physics

29

Page 30

Frequentist confidence intervals

In frequentist statistics one cannot make a probabilistic statement about the true value of a parameter given the data.

Instead:

• One defines acceptance / rejection regions of a test statistic (𝜶)

• The measurement (data) is one specific outcome of an ensemble of possible data

• One accepts or rejects 𝑯𝟎 with confidence level given by 𝜶

• It is also possible to state how probable a particular or worse outcome (test-statistic measurement) is for a given hypothesis (e.g., H₀) → p-value

One then shows the data and quotes the H₀ outcome given the required confidence level and the hypothesis p-value

30

Page 31

A typical (but highly simplified) frequentist analysis

1. Specify a hypothesis H₀ and a test statistic or estimator (→ likelihood ratio y)

2. Specify the significance of the test, i.e. how much of a Type-1 error rate to accept: e.g. a confidence level of 95% → α = 5%

3. Take the measurement: y_obs

4. Check whether y_obs lies inside or outside of the critical region → decide on H₀

5. If excluded, compute the p-value of H₀ to see how deep y_obs lies in the critical region:

   p-value = ∫_{y_obs}^∞ PDF(y|H₀) dy

No composite hypothesis yet (see later); this is a simple hypothesis test using data.

[Figure: PDF(y|H₀) versus y, with the critical region of size α = 5% and the observed value y_obs indicated; the p-value is the tail area beyond y_obs]

Page 32

Significance and p-values

Note:

32

𝜶 (significance) must be specified before the hypothesis test is made

The p-value is a property of actual measurement (observation)

Again: the p-value is not a measure how probable the hypothesis is

The confidence level of a hypothesis test (accept / reject) is given by 𝛼not the p-value

It is convenient to express observed p-values in terms of Gaussian standard deviations σ ("sigma"):

• Z = number of standard deviations giving the same p-value for a one-sided Gaussian

• In ROOT: TMath::Prob(Z*Z, 1)/2 = p (the p-value); e.g. the p-value corresponding to Z = 5σ is 2.87·10⁻⁷

• Inverse in ROOT: sqrt(TMath::ChisquareQuantile(1 − 2*p, 1)) = Z

[Figure: standard Gaussian φ(x) with the one-sided tail area beyond Z equal to the p-value; from arXiv:1007.1727]
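The same p-value ↔ Z conversions can be sketched with scipy instead of ROOT (one-sided Gaussian convention as above):

```python
from scipy.stats import norm

def z_to_pvalue(z):
    return norm.sf(z)            # one-sided upper-tail probability of a standard Gaussian

def pvalue_to_z(p):
    return norm.isf(p)           # inverse of the above

print(z_to_pvalue(5.0))          # ~2.87e-07, the "5 sigma" discovery convention
print(pvalue_to_z(2.87e-7))      # ~5.0
```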

Page 33

Distribution of p-values

Assume:

• Test statistic: 𝒚 (function of measured quantities)

• PDF of y for a given hypothesis H: p_y(y; H)

• p-value(y; H) = ∫_y^∞ p_y(y′; H) dy′ for each measurement y

p-values are random variables → they follow a distribution if the measurement is repeated

Derived from a cumulative distribution → the distribution must be uniform in [0, 1] for the matching (true) hypothesis H

[Figure: distribution of p-value(y; H) between 0 and 1 over many measurements — flat for the correct hypothesis]

• Hence, in a fraction of cases the p-value of a given measurement may become very small, although H is the correct hypothesis

• If the true and tested hypotheses are different, the p-value distribution will deviate from uniform (but usually one cannot just repeat a measurement or an experiment to test this)

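A small toy experiment (my own construction) showing that p-values are uniform when the tested hypothesis is true and pile up at small values when it is not:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n_measurements, n_events = 20_000, 25

def pvalues(true_mean):
    """p-value of the observed sample mean under the hypothesis H: mean = 0, sigma = 1."""
    means = rng.normal(true_mean, 1.0, size=(n_measurements, n_events)).mean(axis=1)
    return norm.sf(means, loc=0.0, scale=1.0 / np.sqrt(n_events))   # upper-tail p-value

p_true = pvalues(0.0)    # the tested hypothesis matches the truth
p_wrong = pvalues(0.3)   # the tested hypothesis is wrong

hist_true, _ = np.histogram(p_true, bins=10, range=(0, 1))
hist_wrong, _ = np.histogram(p_wrong, bins=10, range=(0, 1))
print("H true :", hist_true)    # roughly flat: ~2000 entries per bin
print("H wrong:", hist_wrong)   # strongly peaked in the first bins (small p-values)
```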

Page 34

Statistical tests in new particle/physics searches

Discovery test

• Disprove background-only hypothesis 𝑯𝟎

• Estimate probability of “upward” (or “signal-like”) fluctuation of background

34

Exclusion limit

• Upper limit on new physics cross section

• Disprove signal + background hypothesis 𝑯𝟎

• Estimate probability of downward fluctuation of signal + background: find minimal signal, for which 𝑯𝟎 (here: S+B) can be excluded at specified confidence Level

[Left figure: example PDF of the background-only (H₀) test statistic, Poisson(N_B; μ_B = 4), with the value of N_obs required for a 5σ discovery (p = 2.87·10⁻⁷) indicated. Right figure: example PDFs of N for B (H₁ = B) and S+B (H₀ = S + B), with the Type-1 error α = 5% → 95% CL region indicated.]
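A sketch of both tests for the Poisson counting example on the slide (μ_B = 4), using scipy; the observed count in the exclusion part is chosen only for illustration:

```python
from scipy.stats import poisson, norm

mu_b = 4.0                                   # expected background, as in the slide

# Discovery test: p-value of the background-only hypothesis for N_obs or more events
for n_obs in range(5, 25):
    p0 = poisson.sf(n_obs - 1, mu_b)         # P(N >= n_obs | mu_b)
    if p0 < norm.sf(5.0):                    # 5 sigma: p < 2.87e-7
        print(f"5-sigma discovery needs N_obs >= {n_obs} (p0 = {p0:.2e})")
        break

# Exclusion: find the smallest signal mu_s that can be excluded at 95% CL
# when N_obs = 4 is observed (downward-fluctuation test of the S+B hypothesis)
n_obs = 4
mu_s = 0.0
while poisson.cdf(n_obs, mu_b + mu_s) > 0.05:   # P(N <= n_obs | S+B) > 5% -> not yet excluded
    mu_s += 0.01
print(f"signals with mu_s >= {mu_s:.2f} are excluded at 95% CL for N_obs = {n_obs}")
```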

Page 35

Statistical tests in new particle/physics searches


Realistic discovery and exclusion likelihood tests involve complex fits of several signal and background-normalisation (so-called control) regions, signal and background yields, as well as nuisance parameters describing systematic uncertainties.

We will come to this, but first need to learn about parameter estimation.

Page 36

[Figure: ATLAS 2011–2012 combined local p₀ versus m_H in 110–150 GeV, observed and expected (±1σ), √s = 7 TeV (4.6–4.8 fb⁻¹) and 8 TeV (5.8–5.9 fb⁻¹), with significance levels from 0σ to 6σ indicated]

Statistical tests in new particle/physics searches — teaser

Discovery test — Higgs discovery in 2012

• 5.9σ rejection of the background-only hypothesis from the statistical combination of dominantly H → γγ, ZZ*, WW* decays at m_H = 126 GeV

• No trials factor (look-elsewhere-effect, LEE) taken into account in above number, but would not qualitatively change picture

36

Exclusion limit

• 13 TeV search for new physics (here: a new heavy Higgs boson) in events with at least two tau leptons

• Figure shows expected and observed 95% confidence level upper limits on cross section times branching fraction

[Figure: CMS Preliminary, 2.3 fb⁻¹ (13 TeV), observed and expected (±1σ, ±2σ) 95% CL upper limits on the cross section times branching fraction for φ → ττ versus m_φ, in the eτ_h, μτ_h and τ_hτ_h channels. References: ATLAS, arXiv:1207.7214; CMS, CMS-PAS-HIG-16-006]

Page 37

Parameter estimation

An estimator is a function of a data sample, θ̂ = θ̂(x₁, …, x_N), that estimates the characteristic parameter θ of a parent distribution.

Examples:

• Mean value estimator:  μ̂ = (1/N) Σ_{i=1}^{N} x_i   (one way to define the mean value, there could be others)

• Variance estimator:  V̂ = 1/(N − 1) Σ_{i=1}^{N} (x_i − x̄)²

• Median estimator, …

• …but also: the CP-asymmetry parameter in a B-meson sample (a very complex parameter estimation)

The estimator θ̂ is a random variable (a function of the measured data, which are random).

The estimator θ̂ has itself an expectation value and an expected variance, for given θ:

  E[θ̂(x)|θ] = ∫ θ̂(x) f(x|θ) dx,  with f(x|θ) the distribution (PDF) of the expected data

Page 38

Parameter estimation

θ̂ is a random variable that follows a PDF. Consider many measurements / experiments:

→ There will be a spread of θ̂ estimates. Different estimators can have different properties:

[Figure (Glen Cowan): PDFs of θ̂ around the true θ for three estimators — biased, large variance, and "best"]

• Biased or unbiased: if E[θ̂(x)|θ] = θ → unbiased

• Small bias and small variance can be "in conflict"

  – asymptotic bias → limit for infinite observations / data samples

Page 39

Maximum likelihood estimator

Want to estimate (measure!) a parameter θ

Observe x⃗_i = (x₁, …, x_K)_i, i = 1, …, N  (i.e. K observables per event, and N events)

The hypothesis is the PDF p_x(x⃗; θ), i.e. the distribution of x⃗ given θ

There are N independent events → combine their PDFs:

  P(x⃗₁, …, x⃗_N; θ) = ∏_{i=1}^{N} p_x(x⃗_i; θ)

For fixed x⃗, consider p_x(x⃗; θ) as a function of θ → likelihood L(θ)

• L(θ) is at maximum (if the estimator is unbiased) for θ̂ = θ_true

[Excerpt from Glen Cowan, Statistical Data Analysis, 'ML estimators' (p. 71), Fig. 6.1: a sample of 50 observations of a Gaussian random variable with mean μ = 0.2 and standard deviation σ = 0.1, shown as tick marks on the horizontal axis. (a) The p.d.f. evaluated with the parameters that maximise the likelihood function, μ̂ = 0.204 and σ̂ = 0.106 (log L = 41.2), and with the true parameters (log L = 41.0); because of random fluctuations the estimates are not exactly equal to the true values — a good estimate of θ. (b) The p.d.f. evaluated with parameters far from the true values (log L = 13.9 and 18.9), giving a lower likelihood — a poor estimate of θ. The ML estimators are defined as the parameter values that maximise the likelihood function; as long as L is differentiable in the parameters and the maximum is not at the boundary of the parameter range, they are the solutions of ∂L/∂θ_i = 0, i = 1, …, m. If more than one local maximum exists, the highest one is taken. Among the advantages of the ML method are its ease of use and the fact that no binning is necessary.]

Task: maximise L(θ) to derive the best estimate θ̂

In practice one often minimises −2·ln L(θ) instead (see later why) → maximum likelihood fit

Page 40

Maximum likelihood estimator (continued)

40

Let's take the Gaussian example from before:  L(μ, σ|x) = 1/(√(2π) σ) · exp(−(x − μ)²/(2σ²))

• Measure N events: x₁, …, x_N

• The full likelihood is given by:  L(μ, σ|x) = ∏_{i=1}^{N} 1/(√(2π) σ) · exp(−(x_i − μ)²/(2σ²))

• In logarithmic form:  −2·ln L(μ, σ|x) = Σ_{i=1}^{N} (x_i − μ)²/σ² − 2N·ln[1/(√(2π) σ)]

→ In a full maximum likelihood fit one could now determine μ̂ and σ̂

→ If one is not interested in fitting σ but just μ, one can omit the (then constant) 2nd term:

  −2·Δln L(μ|x) = Σ_{i=1}^{N} (x_i − μ)²/σ²,   where Δln L(μ|x) = ln L(μ|x) − constant term

→ which is the "least squares" (χ²) expression
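A minimal unbinned maximum-likelihood fit of the Gaussian example, numerically minimising −2 ln L with scipy (data generated to mimic Cowan's 50-event example):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
data = rng.normal(0.2, 0.1, size=50)       # 50 observations, as in Cowan's Fig. 6.1

def nll(params):
    """-2 ln L for an unbinned Gaussian model."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return np.sum((data - mu) ** 2 / sigma**2) + 2 * len(data) * np.log(np.sqrt(2 * np.pi) * sigma)

fit = minimize(nll, x0=[0.0, 0.2], method="Nelder-Mead")
mu_hat, sigma_hat = fit.x
print(f"mu_hat = {mu_hat:.4f}, sigma_hat = {sigma_hat:.4f}")
print(np.mean(data), np.std(data))   # the Gaussian ML estimators coincide with sample mean / std
```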

Page 41

Maximum likelihood estimator (continued)

41

So far considered unbinned datasets (i.e., likelihood is given by product of PDFs for each event)

One can replace the events by bins of a histogram

• Useful if very large number of events, or PDF has very complex form, or if only broad regions are considered rather than the full shape of a PDF

• Most LHC analyses use binned maximum likelihood fits

Each bin 𝒊 has 𝑵𝒊 events that are Poisson distributed around 𝝁𝒊

• The predictions for the μ_i can be obtained from Monte Carlo simulation

Likelihood function:

  L(θ) = P(N₁, …, N_nbins; θ) = ∏_{i=1}^{nbins} [μ_i(θ)^{N_i} / N_i!] · e^{−μ_i(θ)}

…and in log form:

  −2·ln L(θ) = 2 Σ_{i=1}^{nbins} [μ_i(θ) − N_i·ln μ_i(θ)] + const.
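A sketch of a binned Poisson maximum-likelihood fit; here the fitted parameter is a single scale factor of an invented template:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(9)

template = np.array([20.0, 50.0, 80.0, 50.0, 20.0])   # prediction mu_i(theta) = theta * template_i
N_obs = rng.poisson(1.3 * template)                   # pseudo-data generated with theta = 1.3

def nll(theta):
    """-2 ln L(theta) = 2 sum_i [mu_i(theta) - N_i ln mu_i(theta)]  (constants dropped)."""
    mu = theta * template
    return 2.0 * np.sum(mu - N_obs * np.log(mu))

fit = minimize_scalar(nll, bounds=(0.1, 5.0), method="bounded")
print(f"theta_hat = {fit.x:.3f}")          # close to the analytic result sum(N_i)/sum(template_i)
print(N_obs.sum() / template.sum())
```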

Page 42

Maximum likelihood estimator (continued)

The maximum likelihood estimator is typically unbiased only in the limit N → ∞

42

[Inset from Luca Lista, European School of HEP 2016, 'Asymmetric errors': an alternative approximation to the parabolic one is to evaluate the excursion range of −2 ln L; the nσ error is determined by the range around the maximum for which −2 ln L increases by +1 (+n² for nσ intervals), giving θ̂ − δ⁻ and θ̂ + δ⁺. Errors can be asymmetric. For a Gaussian PDF the result is identical to that from the 2nd-order derivative matrix. Implemented in Minuit as the MINOS function.]

If the likelihood function is Gaussian (often the case for large N by virtue of the central limit theorem):

→ Estimate the 1σ confidence interval for θ ("parameter uncertainty") by finding the intersections −2·Δln L = 1 around the minimum

→ The resulting uncertainty on θ may be asymmetric

If (very) non-Gaussian:

→ typically revert to (classical) Neyman confidence intervals (→ see later)

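A self-contained sketch of reading off (asymmetric) uncertainties from the −2 Δln L = 1 crossings, for a single Poisson count where the asymmetry is visible (count value invented):

```python
import numpy as np
from scipy.optimize import brentq

n_obs = 9.0                                      # a single observed Poisson count

def m2lnL(mu):
    """-2 ln L for a single Poisson count, constants dropped."""
    return 2.0 * (mu - n_obs * np.log(mu))

mu_hat = n_obs                                   # ML estimate for a single count
m2lnL_min = m2lnL(mu_hat)

# Find where -2 ln L rises by 1 on either side of the minimum -> 1 sigma interval
delta = lambda mu: m2lnL(mu) - m2lnL_min - 1.0
lo = brentq(delta, 1e-3, mu_hat)                 # lower crossing
hi = brentq(delta, mu_hat, 10 * mu_hat)          # upper crossing
print(f"mu = {mu_hat:.1f} -{mu_hat - lo:.2f} / +{hi - mu_hat:.2f}")   # asymmetric errors
print(f"Gaussian approximation: +/- {np.sqrt(mu_hat):.2f}")
```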

Page 43

[Figure (Gfitter group): distribution of χ²_min from toy experiments (excluding theory errors), compared to a χ² distribution for n_dof = 14; the resulting p-value for (data|SM) is 0.202 ± 0.004]

Goodness-of-Fit (GoF)

The maximum likelihood estimator determines the best parameter value θ̂

But: does the model with the best θ̂ fit the data well?

The value of −2·ln L(θ̂) at the minimum does not mean much by itself → it needs calibration

→ Determine the expected distribution of −2·ln L(θ̂) using pseudo Monte Carlo experiments, and compare the measured value to the expected ones (figure above, from the Gfitter group)

Page 44

Goodness-of-Fit (continued)

A goodness-of-fit test is more straightforward with a χ² estimator.

Let's use the binned example again. The task is to minimise versus θ:

  χ²_min(θ̂) = min_θ χ²(θ) = Σ_{i=1}^{nbins} (N_i − μ_i(θ))² / σ_i²

χ² has known properties: E[χ²] = n_d.o.f. = k (= the number of degrees of freedom)

Cumulative PDF: the probability to find χ² > χ²_min is TMath::Prob(χ²_min, k)
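The same goodness-of-fit probability with scipy (the upper-tail χ² probability, equivalent to ROOT's TMath::Prob); numbers are examples:

```python
from scipy.stats import chi2

chi2_min, k = 14.2, 14                     # example values: fit chi2 and degrees of freedom
p_value = chi2.sf(chi2_min, k)             # P(chi2 > chi2_min | k), same as TMath::Prob
print(f"p-value = {p_value:.3f}")          # ~0.4: an acceptable fit
```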

[Figures from https://en.wikipedia.org/wiki/Chi-squared_distribution: the χ² PDF P(χ²; n) and its cumulative distribution for various numbers of degrees of freedom; the values corresponding to Z_{k=1} = 1σ and 2σ are indicated]

Page 45

Classical confidence level

Neyman confidence belt for confidence level (CL) 𝜶 (e.g. 95%)

Statement about the probability to cover the true value μ̂_true of a parameter μ̂ fitted to data

• Each hypothesis μ̂_true has a PDF of how the measured values μ̂_obs will be distributed

[Excerpt from the PDG review 'Statistics' (Feb. 2012), Sec. 33.3.2, Frequentist confidence intervals: such intervals are obtained with a construction due to Neyman and are built so as to include the true value of the parameter with a probability greater than or equal to a specified coverage probability. Consider a p.d.f. f(x; θ), where x represents the outcome of the experiment (often an estimator for θ) and θ is the unknown parameter. For a pre-specified probability 1 − α and for every value of θ one finds x₁(θ, α) and x₂(θ, α) such that P(x₁ < x < x₂; θ) = 1 − α = ∫_{x₁}^{x₂} f(x; θ) dx (33.49). A horizontal line segment [x₁(θ, α), x₂(θ, α)] is drawn for each representative value of θ; the union of such intervals for all θ, designated D(α), is known as the confidence belt (PDG Fig. 33.3). Typically the curves x₁ and x₂ are monotonic functions of θ.]

[Figure: PDF of the measured value μ̂ for one hypothesis, in the plane of μ̂ (horizontal) versus μ̂_true (vertical)]

Page 46

Classical confidence level

Neyman confidence belt for confidence level (CL) α (e.g. 95%) — continued from the previous page:

• Determine the (central) intervals ("acceptance regions") in these PDFs such that they contain probability α

[Figure: the acceptance interval containing probability α for one hypothesis μ̂_true, in the (μ̂, μ̂_true) plane]

Page 47

Classical confidence level

Neyman confidence belt for confidence level (CL) α (e.g. 95%) — continued:

• Do this for all hypotheses μ̂_true

• Connect all the red dots: this gives the confidence belt

[Figure: acceptance intervals for all hypotheses μ̂_true, forming the confidence belt in the (μ̂, μ̂_true) plane]

Page 48

Classical confidence level

Neyman confidence belt for confidence level (CL) α (e.g. 95%) — continued:

• Measure μ̂_obs

→ The confidence interval [μ̂₁, μ̂₂] is given by the vertical line at μ̂_obs intersecting the belt

→ α = 95% of the intervals [μ̂₁, μ̂₂] constructed in this way contain μ̂_true
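A toy Neyman construction (my own sketch): central acceptance intervals from pseudo-experiments for a Gaussian estimator, then inverted into a confidence interval for an observed value:

```python
import numpy as np

rng = np.random.default_rng(10)
alpha = 0.95                     # confidence level (called alpha on these slides)
sigma, n_toys = 1.0, 20_000

mu_true_grid = np.linspace(-2, 8, 401)
belt = []
for mu_true in mu_true_grid:
    toys = rng.normal(mu_true, sigma, size=n_toys)            # PDF of mu_hat for this hypothesis
    lo, hi = np.quantile(toys, [(1 - alpha) / 2, (1 + alpha) / 2])   # central acceptance interval
    belt.append((lo, hi))
belt = np.array(belt)

mu_obs = 3.2                                                  # the measurement
# The confidence interval is the set of hypotheses whose acceptance interval contains mu_obs
accepted = (belt[:, 0] <= mu_obs) & (mu_obs <= belt[:, 1])
print(f"{alpha:.0%} CL interval: [{mu_true_grid[accepted].min():.2f}, {mu_true_grid[accepted].max():.2f}]")
# For this Gaussian case the result reproduces mu_obs +/- 1.96 sigma
```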

Page 49

Combining confidence intervals

The construction of Neyman intervals may require large resources if done with pseudo Monte Carlo experiments. In many cases, experiments take a "Gaussian" shortcut, assuming that the PDF of the estimator is Gaussian and does not depend on μ̂_true (see previous slides).

In the Gaussian case, measurements can be combined by multiplying their likelihood functions.

Otherwise: it is important to combine the individual measurements, not the confidence intervals: construct the confidence belt of the combined measurement.

The following "Gaussian shortcut" (weighted-average formula shown on the slide) will be wrong in that case.

In a perfectly Gaussian and uncorrelated case, this simple formula is correct.

arXiv:1201.2631v2


Page 51

51

Summary for today

We have introduced the axioms of probability theory and discussed the difference between frequentist and Bayesian statistics

We have discussed hypothesis testing, introduced Type-1 and Type-2 errors, and the Neyman-Pearson likelihood-ratio lemma

Test statistics, confidence intervals, significance and p-values were introduced

Parameter estimation with the maximum likelihood technique, goodness-of-fit, and the derivation of a classical Neyman confidence belt were discussed

Next: realistic maximum likelihood fits, Monte Carlo techniques and data unfolding