Introduction to probability and statistics (2)
Andreas Hoecker (CERN)
CERN Summer Student Lecture, 17–21 July 2017
If you have questions, please do not hesitate to contact me: [email protected]
2
Outline (4 lectures)
1st lecture: • Introduction • Probability (…some catch-up to do)
2nd lecture: • Probability axioms and hypothesis testing • Parameter estimation • Confidence levels
3rd lecture: • Maximum likelihood fits • Monte Carlo methods • Data unfolding
4th lecture: • Multivariate techniques and machine learning
Catch-up from yesterday
3
10 Fundamental concepts
F(x) = Σ_{x_i ≤ x} P(x_i). (1.16)
A useful concept related to the cumulative distribution is the so-called quantile of order α or α-point. The quantile x_α is defined as the value of the random variable x such that F(x_α) = α, with 0 ≤ α ≤ 1. That is, the quantile is simply the inverse function of the cumulative distribution,

x_α = F⁻¹(α). (1.17)

A commonly used special case is x_{1/2}, called the median of x. This is often used as a measure of the typical 'location' of the random variable, in the sense that there are equal probabilities for x to be observed greater or less than x_{1/2}.
Another commonly used measure of location is the mode, which is defined as the value of the random variable at which the p.d.f. is a maximum. A p.d.f. may, of course, have local maxima. But the most commonly used location parameter is the expectation value, which will be introduced in Section 1.5.
Consider now the case where the result of a measurement is characterized not by one but by several quantities, which may be regarded as a multidimensional random vector. If one is studying people, for example, one might measure for each person their height, weight, age, etc. Suppose a measurement is characterized by two continuous random variables x and y. Let the event A be 'x observed in [x, x + dx] and y observed anywhere', and let B be 'y observed in [y, y + dy] and x observed anywhere', as indicated in Fig. 1.4.
[Fig. 1.4 A scatter plot of two random variables x and y based on 1000 observations. The probability for a point to be observed in the square given by the intersection of the two bands (the event A ∩ B) is given by the joint p.d.f. times the area element, f(x, y)dxdy.]

The joint p.d.f. f(x, y) is defined by
P(A ∩ B) = probability of x in [x, x + dx] and y in [y, y + dy] = f(x, y)dxdy. (1.18)
Multidimensional random variables
What if a measurement consists of two variables?
Let: A = measurement x in [x, x + dx]; B = measurement y in [y, y + dy]

Joint probability: P(A ∩ B) = p_xy(x, y) dxdy (where p_xy(x, y) is the joint PDF)

If the two variables are independent: P(A ∩ B) = P(A) · P(B), i.e., p_xy(x, y) = p_x(x) · p_y(y)

Marginal PDF: if one is not interested in the dependence on y (or cannot measure it),
→ integrate out ("marginalise") y, i.e., project onto x
→ resulting one-dimensional PDF: p_x(x) = ∫ p_xy(x, y) dy
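The marginalisation step can be checked numerically. A minimal Python sketch (not from the lecture; the independent 2D Gaussian joint PDF and the integration grid are illustrative assumptions) integrates out y and compares the result with the analytic 1D marginal:

```python
import math

# Hypothetical independent 2D Gaussian joint PDF p_xy(x, y) = p_x(x) * p_y(y),
# used here only to illustrate marginalisation (integrating out y).
def p_xy(x, y, sx=1.0, sy=2.0):
    return (math.exp(-0.5 * (x / sx) ** 2) / (sx * math.sqrt(2 * math.pi)) *
            math.exp(-0.5 * (y / sy) ** 2) / (sy * math.sqrt(2 * math.pi)))

def marginal_px(x, y_lo=-10.0, y_hi=10.0, n=2000):
    """Numerically integrate out y: p_x(x) = ∫ p_xy(x, y) dy (midpoint rule)."""
    dy = (y_hi - y_lo) / n
    return sum(p_xy(x, y_lo + (i + 0.5) * dy) * dy for i in range(n))

# The marginal should coincide with the analytic 1D Gaussian p_x(x) at x = 1:
analytic = math.exp(-0.5) / math.sqrt(2 * math.pi)
assert abs(marginal_px(1.0) - analytic) < 1e-5
```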
4
From: Glen Cowan, Statistical data analysis
Fig. 1.6 (a) A scatter plot of random variables x and y indicating two infinitesimal bands in x of width dx at Xl (solid band) and X2 (dashed band). (b) The conditional p.d.f.s h(ylxt) and h(ylx2) corresponding to the projections of the bands onto the y axis.
f_x(x) = ∫_{−∞}^{∞} g(x|y) f_y(y) dy, (1.27)

f_y(y) = ∫_{−∞}^{∞} h(y|x) f_x(x) dx. (1.28)
These correspond to the law of total probability given by equation (1.7), gener-alized to the case of continuous random variables.
If 'x in [x, x + dx] with any y' (event A) and 'y in [y, y + dy] with any x' (event B) are independent, i.e. P(A ∩ B) = P(A) P(B), then the corresponding joint p.d.f. for x and y factorizes:
f(x, y) = f_x(x) f_y(y). (1.29)
From equations (1.24) and (1.25), one sees that for independent random variables x and y the conditional p.d.f. g(x|y) is the same for all y, and similarly h(y|x) does not depend on x. In other words, having knowledge of one of the variables does not change the probabilities for the other. The variables x and y shown in Fig. 1.6, for example, are not independent, as can be seen from the fact that h(y|x) depends on x.
1.4 Functions of random variables Functions of random variables are themselves random variables. Suppose a(x) is a continuous function of a continuous random variable x, where x is distributed according to the p.d.f. f(x). What is the p.d.f. g(a) that describes the distribution of a? This is determined by requiring that the probability for x to occur between
Conditioning versus marginalisation
Conditional probability 𝑷 𝑨 𝑩 : [ read: 𝑃(𝐴|𝐵) = “probability of 𝐴 given 𝐵” ]
Rather than integrating over the whole 𝑦 region (marginalisation), look at one-dimensional (1D) slices of the two-dimensional (2D) PDF 𝑝/0 𝑥, 𝑦 :
P(A|B) = P(A ∩ B) / P(B) = p_xy(x, y) dxdy / (p_y(y) dy)

A slice of the 2D PDF at fixed x = x₁ gives (up to normalisation) the conditional PDF: p_y(y|x₁) ∝ p_xy(x = const = x₁, y)
From: Glen Cowan, Statistical data analysis
5
Equivalently: P(A ∩ B) = P(A|B) · P(B)
Covariance and correlation
Recall, for a 1D PDF p_x(x) we had: E[x] = μ_x; V[x] = σ_x²

For a 2D PDF p_xy(x, y), one correspondingly has: μ_x, μ_y, σ_x, σ_y

How do x and y co-vary? →

C_xy = covariance(x, y) = E[(x − μ_x)(y − μ_y)] = E[xy] − μ_x μ_y

From this define the scale/dimension invariant correlation coefficient:

ρ_xy = C_xy / (σ_x σ_y), where ρ_xy ∈ [−1, +1]

• If x, y are independent: ρ_xy = 0, i.e., they are uncorrelated (they factorise)
Proof: E[xy] = ∬ xy · p_xy(x, y) dxdy = ∬ xy · p_x(x) p_y(y) dxdy = ∫ x · p_x(x) dx · ∫ y · p_y(y) dy = μ_x μ_y

• Note that the contrary is not always true: non-linear correlations can lead to ρ_xy = 0 → see next page
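These definitions can be exercised on a toy sample. A Python sketch (not part of the slides; the linear model y = 0.5x + noise and sample size are illustrative assumptions) estimates C_xy and ρ_xy from simulated data:

```python
import random, math

random.seed(42)

# Toy check of the definitions: generate y = 0.5*x + noise and estimate
# C_xy = E[xy] - mu_x*mu_y and rho_xy = C_xy / (sigma_x * sigma_y).
N = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
ys = [0.5 * x + random.gauss(0.0, 1.0) for x in xs]

def mean(v):
    return sum(v) / len(v)

mx, my = mean(xs), mean(ys)
cov = mean([x * y for x, y in zip(xs, ys)]) - mx * my
sx = math.sqrt(mean([x * x for x in xs]) - mx * mx)
sy = math.sqrt(mean([y * y for y in ys]) - my * my)
rho = cov / (sx * sy)

# Analytic expectations: cov = 0.5, sigma_y = sqrt(1.25), rho = 0.5/sqrt(1.25)
assert abs(cov - 0.5) < 0.02
assert abs(rho - 0.5 / math.sqrt(1.25)) < 0.02
```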
6
Correlations
Figure from: https://en.wikipedia.org/wiki/Correlation_and_dependence
The correlation coefficient measures the noisiness and direction of a linear relationship:

• …it does not measure the slope (see above figures)
• …and non-linear correlation patterns are not or only approximately captured by ρ_xy (see above figures)
Correlations
Non-linear correlation can be captured by the "mutual information" quantity I_xy:

I_xy = ∬ p_xy(x, y) · ln[ p_xy(x, y) / (p_x(x) p_y(y)) ] dxdy

where I_xy = 0 only if x, y are fully statistically independent
Proof: if independent, then p_xy(x, y) = p_x(x) p_y(y) ⇒ ln(…) = 0

NB: I_xy = H_x − H_{x|y} = H_y − H_{y|x}, where H_x = −∫ p_x(x) · ln p_x(x) dx is the entropy and H_{x|y} the conditional entropy
8
Measure of mutual dependence between two variables: “How much information is shared among them”
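The definition can be made concrete with a discrete toy joint distribution (an assumption for illustration; the slide uses the continuous integral form). A short Python sketch:

```python
import math

# Mutual information I_xy = sum_ij p_xy * ln[p_xy / (p_x * p_y)] for a
# discrete 2x2 toy joint distribution p[i][j].
def mutual_information(p):
    px = [sum(row) for row in p]                 # marginal of x (rows)
    py = [sum(col) for col in zip(*p)]           # marginal of y (columns)
    return sum(p[i][j] * math.log(p[i][j] / (px[i] * py[j]))
               for i in range(len(p)) for j in range(len(p[0]))
               if p[i][j] > 0)

independent = [[0.25, 0.25], [0.25, 0.25]]       # p_xy = p_x * p_y exactly
dependent   = [[0.45, 0.05], [0.05, 0.45]]       # strongly dependent

assert abs(mutual_information(independent)) < 1e-12
assert mutual_information(dependent) > 0.3
```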
2D Gaussian (uncorrelated)
Two variables x, y are independent: [ p_xy(x, y) = p_x(x) · p_y(y) ]
9
p_xy(x, y) = 1/(√(2π) σ_x) · e^{−(x−μ_x)²/(2σ_x²)} · 1/(√(2π) σ_y) · e^{−(y−μ_y)²/(2σ_y²)}
2D Gaussian (correlated)
Two variables x, y are not independent: [ p_xy(x, y) ≠ p_x(x) · p_y(y) ]
10
p_x⃗(x⃗) = 1/(2π √det(C)) · exp( −½ (x⃗ − μ⃗)ᵀ C⁻¹ (x⃗ − μ⃗) )

where:

C = ( ⟨x²⟩ − ⟨x⟩²       ⟨xy⟩ − ⟨x⟩⟨y⟩
      ⟨xy⟩ − ⟨x⟩⟨y⟩    ⟨y²⟩ − ⟨y⟩²  )

is the (symmetric) covariance matrix. Corresponding correlation matrix elements:

ρ_ij = ρ_ji = C_ij / √(C_ii · C_jj)
SQRT decorrelation
Find variable transformation that diagonalises a covariance matrix 𝐶
Determine the "square root" C′ of C (such that C = C′ · C′) by first diagonalising C:

D = Sᵀ · C · S ⟺ C′ = S · √D · Sᵀ

where D is diagonal, √D = diag(√d₁₁, …, √d_nn), and S an orthogonal matrix

Linear decorrelation of the correlated vector x is then obtained by

x′ = (C′)⁻¹ · x
Principal component analysis (PCA) is another convenient method to achieve linear decorrelation (PCA is a linear transformation that rotates a vector such that the maximum variability is visible; it identifies the most important gradients)
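For a 2×2 covariance matrix the square root C′ can be written down analytically, which makes the construction C′ = S·√D·Sᵀ easy to verify. A Python sketch (illustrative toy matrix, no linear-algebra library assumed):

```python
import math

# Square root of a symmetric 2x2 covariance matrix via its analytic
# eigendecomposition: C' = S * sqrt(D) * S^T, so that C' * C' = C.
def sqrt_2x2(C):
    a, b, c = C[0][0], C[0][1], C[1][1]
    m, r = 0.5 * (a + c), math.hypot(0.5 * (a - c), b)
    l1, l2 = m + r, m - r                       # eigenvalues of C
    theta = 0.5 * math.atan2(2 * b, a - c)      # rotation angle of S
    ct, st = math.cos(theta), math.sin(theta)
    s1, s2 = math.sqrt(l1), math.sqrt(l2)
    # S * diag(sqrt(l1), sqrt(l2)) * S^T, written out component by component
    return [[s1 * ct * ct + s2 * st * st, (s1 - s2) * ct * st],
            [(s1 - s2) * ct * st, s1 * st * st + s2 * ct * ct]]

C = [[2.0, 0.8], [0.8, 1.0]]                    # toy covariance matrix
Cs = sqrt_2x2(C)
prod = [[sum(Cs[i][k] * Cs[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)]
assert all(abs(prod[i][j] - C[i][j]) < 1e-12 for i in range(2) for j in range(2))
```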
Example: original correlations
SQRT decorrelation
Example: after SQRT decorrelation
SQRT decorrelation works only for linear correlations!
Functions of random variables
Any function of a random variable is itself a random variable
E.g., x with PDF p_x(x) becomes: y = f(x); y could be a parameter extracted from a measurement
13
What is the PDF p_y(y)?

• Probability conservation: p_y(y)|dy| = p_x(x)|dx|

• For a 1D function f(x) with an existing inverse:

dy = (df(x)/dx) dx ⟺ dx = (df⁻¹(y)/dy) dy

• Hence:

p_y(y) = p_x(f⁻¹(y)) · |dx/dy|
Note: this is not the standard error propagation but the full PDF !
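A concrete toy case (my own illustration, not from the lecture): for y = f(x) = x² with x uniform on (0, 1), the rule gives p_y(y) = 1/(2√y), which is very different from "p_x at the transformed point". A Monte Carlo check in Python:

```python
import random, math

random.seed(7)

# For y = x^2 with x ~ Uniform(0, 1): f^{-1}(y) = sqrt(y), |dx/dy| = 1/(2 sqrt(y)),
# hence p_y(y) = 1 * 1/(2 sqrt(y)) on (0, 1].
def p_y(y):
    return 1.0 / (2.0 * math.sqrt(y))

# Monte Carlo estimate of P(y < 0.25) versus the analytic integral of p_y:
# ∫_0^{0.25} dy / (2 sqrt(y)) = sqrt(0.25) = 0.5
N = 100_000
frac = sum(1 for _ in range(N) if random.random() ** 2 < 0.25) / N
assert abs(frac - 0.5) < 0.01
```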
Fig. 1.7 Transformation of variables for (a) a function a(x) with a single-valued inverse x(a) and (b) a function for which the interval da corresponds to two intervals dx₁ and dx₂.
x and x + dx be equal to the probability for a to be between a and a + da. That is,

g(a′)da′ = ∫_dS f(x)dx, (1.30)
where the integral is carried out over the infinitesimal element dS defined by the region in x-space between a(x) = a′ and a(x) = a′ + da′, as shown in Fig. 1.7(a). If the function a(x) can be inverted to obtain x(a), equation (1.30) gives
g(a)da = ∫_{x(a)}^{x(a+da)} f(x′)dx′ = ∫_{x(a)}^{x(a)+|dx/da|da} f(x′)dx′, (1.31)
or
g(a) = f(x(a)) |dx/da|. (1.32)
The absolute value of dx/da ensures that the integral is positive. If the function a(x) does not have a unique inverse, one must include in dS contributions from all regions in x-space between a(x) = a′ and a(x) = a′ + da′, as shown in Fig. 1.7(b).
The p.d.f. g(a) of a function a(x₁, …, x_n) of n random variables x₁, …, x_n with the joint p.d.f. f(x₁, …, x_n) is determined by
g(a′)da′ = ∫···∫_dS f(x₁, …, x_n) dx₁ ··· dx_n, (1.33)
where the infinitesimal volume element dS is the region in x₁, …, x_n-space between the two (hyper)surfaces defined by a(x₁, …, x_n) = a′ and a(x₁, …, x_n) = a′ + da′.
Glen Cowan: Statistical data analysis
Error propagation
Let’s assume a measurement 𝒙 with unknown PDF 𝒑𝒙(𝒙), and a transformation 𝒚 = 𝒇(𝒙)
• x̄ and V̂ are estimates of the mean μ and variance σ² of p_x(x)

What are E[y] and, in particular, σ_y²? → Taylor-expand f(x) around x̄:

• f(x) = f(x̄) + (df/dx)|_{x̄} (x − x̄) + ⋯ ⇒ E[f(x)] ≃ f(x̄) (because E[x − x̄] = 0!)
Now define ȳ = f(x̄); from the above follows:

⬄ y − ȳ ≃ (df/dx)|_{x̄} (x − x̄)

⬄ E[(y − ȳ)²] = (df/dx)²|_{x̄} E[(x − x̄)²]

⬄ V̂_y = (df/dx)²|_{x̄} V̂_x

⬄ σ_y = |df/dx|_{x̄} · σ_x
14
→ (approximate) error propagation
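The linear approximation can be checked against direct Monte Carlo propagation. A Python sketch (toy numbers of my own choosing, picked so the linear approximation holds well) for f(x) = x²:

```python
import random, math

random.seed(1)

# Compare linear error propagation sigma_y ≈ |df/dx| * sigma_x with a direct
# Monte Carlo propagation, for f(x) = x^2 around xbar = 10, sigma_x = 0.1.
xbar, sigma_x = 10.0, 0.1
f = lambda x: x * x
dfdx = 2.0 * xbar                       # derivative of x^2 at xbar

sigma_lin = abs(dfdx) * sigma_x         # linear propagation: 2.0

ys = [f(random.gauss(xbar, sigma_x)) for _ in range(200_000)]
m = sum(ys) / len(ys)
sigma_mc = math.sqrt(sum((y - m) ** 2 for y in ys) / (len(ys) - 1))

# The two estimates agree to a few percent for this small relative spread:
assert abs(sigma_mc - sigma_lin) / sigma_lin < 0.05
```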
Error propagation (continued)
In case of several variables, compute covariance matrix and partial derivatives
• Let f = f(x₁, …, x_n) be a function of n randomly distributed variables

• (df/dx)²|_{x̄} V̂_x becomes (where x̄ = (x̄₁, …, x̄_n)):

Σ_{i,j=1}^{n} (∂f/∂x_i)(∂f/∂x_j)|_{x̄} · V̂_{i,j}

• with the covariance matrix:

V̂_{i,j} = ( σ_{x₁}²     ⋯  σ_{x₁x_n}
            ⋮          ⋱  ⋮
            σ_{x₁x_n}  ⋯  σ_{x_n}²  )
→ The resulting "error" (uncertainty) depends on the correlation of the input variables:

o Typically (but not always!) positive correlations lead to an increase of the total error,
o and negative correlations decrease the total error
For very complicated functional dependence 𝒇 = 𝒇(𝒙𝟏,… , 𝒙𝒏), use Monte Carlo techniques (“pseudo MC generation”) to propagate uncertainties
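For the simple case f = x₁ + x₂ the double sum collapses to σ_f² = V₁₁ + V₂₂ + 2V₁₂, which makes the effect of correlations explicit. A Python sketch (toy numbers, my own illustration):

```python
import math

# Multivariate error propagation for f = x1 + x2:
# sigma_f^2 = sum_{i,j} (df/dx_i)(df/dx_j) V_ij = V_11 + V_22 + 2*V_12.
def sigma_f(s1, s2, rho):
    V = [[s1 * s1, rho * s1 * s2],
         [rho * s1 * s2, s2 * s2]]          # covariance matrix
    grad = [1.0, 1.0]                       # df/dx_i for f = x1 + x2
    var = sum(grad[i] * grad[j] * V[i][j] for i in range(2) for j in range(2))
    return math.sqrt(var)

# Uncorrelated case: quadrature sum; positive rho inflates, negative deflates:
assert abs(sigma_f(1.0, 1.0, 0.0) - math.sqrt(2.0)) < 1e-12
assert sigma_f(1.0, 1.0, +0.5) > sigma_f(1.0, 1.0, 0.0) > sigma_f(1.0, 1.0, -0.5)
```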
Probability (axioms) & Statistics
16
What is a Probability ?
Axioms of probability (Kolmogorov, 1933)
• 𝑷 𝑨 ≥ 𝟎, where 𝑨 is any subset of sample space (universe) 𝑼
• Unitarity: ∫_U P(A) dA = 1
• If A ∩ B = ∅ (read A ∩ B: "A and B") → P(A ∪ B) = P(A) + P(B) (where A ∪ B = "A or B")
Recall: conditional probability 𝑃(𝐴|𝐵) was defined by 𝑃(𝐴|𝐵) = 𝑃 𝐴 ∩ 𝐵 /𝑃(𝐵). It is the probability of 𝐴 in a universe restricted to 𝐵
17
[Venn diagrams: disjoint/exclusive sets (A ∩ B = ∅) and overlapping sets A, B in the universe; photo: Andrey Nikolaevich Kolmogorov]
What is a Probability ? (continued)
Axioms of probability → set theory (a "set" is a collection of things/elements)
1. A measure of how likely an “event” will occur, expressed as the ratio of favourable to all possible cases in repeatable trials
• Frequentist (classical) probability:
2. The “degree of belief” that an event is going to happen
• Bayesian probability:
– P("event") is the degree of belief that "event" will happen → no need for "repeatable trials"
– Degree of belief (in view of the data and previous knowledge (belief) about the “event”) that a parameter has a certain “true” value
• Bayes’ theorem:
18
P("event") = lim_{n→∞} [ #(outcome is "event") / n("trials") ]
P(A|B) = P(B|A) · P(A) / P(B)
Proof from conditional probability: P(A|B) P(B) = P(A ∩ B) = P(B ∩ A) = P(B|A) P(A)
The prior probability P(A) has been modified by B to become the posterior probability P(A|B)
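A numeric illustration of how the prior is modified (hypothetical numbers of my own choosing, not from the slides): A = "the event is signal", B = "a classifier tags it as signal".

```python
# Bayes' theorem P(A|B) = P(B|A) * P(A) / P(B) with hypothetical numbers.
p_A = 0.01                       # prior: 1% of events are signal
p_B_given_A = 0.95               # tagging efficiency for signal
p_B_given_notA = 0.05            # mis-tag rate for background

# P(B) from the law of total probability:
p_B = p_B_given_A * p_A + p_B_given_notA * (1.0 - p_A)
p_A_given_B = p_B_given_A * p_A / p_B

# Despite the 95% efficiency, the posterior stays modest (~16%) because the
# prior is small — the 1% prior has been updated, not replaced:
assert 0.15 < p_A_given_B < 0.17
```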
Frequentist versus Bayesian statistics
19
Frequentist statement:
• Probability of the “observed data”* to occur given a model (hypothesis): 𝑃(data|model)
Bayesian statement:
• Probability of the model given the data: 𝑃(model|data)
• Let’s look again at Bayes’ theorem written slightly differently (𝜃 = set of parameters fixing the model)
P(θ|data) = P(data|θ) · P(θ) / P(data), where:

P(θ|data): posterior probability of θ given the data
P(data|θ): probability of the data given θ
P(θ): the "prior" probability for θ
P(data): a normalisation
Frequentist statistics is unaware of the “truth”, and only allows to exclude unlikely hypotheses (objective statement).
Bayesian statistics speculates about “truth” by injecting arbitrary prior probabilities (subjective statement). By virtue of the Central Limit Theorem, the prior dependence may be weak in concrete cases.
*The term "observed data" is sloppy: what is meant is an observed estimator value, and the probability refers to cumulative estimator values
𝑃(data | model) ≠ 𝑃(model | data)
20
Consider a new physics search where a local excess of events has been observed with a (global) significance of 2.5 standard deviations, corresponding to a ~0.6% one-sided probability
• Assuming P(data|model) = P(model|data), and concluding that for a given well-fitting new physics model P(new physics model|data) ≃ 99.4%, is wrong (frequently done by the press)

[Figure: ATLAS diboson resonance search in the WZ selection (√s = 8 TeV, 20.3 fb⁻¹): data and background model versus m_jj, with 1.5, 2.0 and 2.5 TeV EGM W′ (c = 1) signals overlaid, and the local significance (stat, stat + syst) — ATLAS, arXiv:1506.00962]
Frequentist statistics gives the probability to observe certain data under a given hypothesis, but it says nothing about the probability of the hypothesis to be true. Important subtlety!
• Will later define "confidence levels": if P(data|model) < 5% → discard model
Frequentist versus Bayesian statistics
Both statistical concepts have important applications
• Most LHC results use frequentist statistics as it is objective, and empirical sciences progress by successive exclusion and improvement of the understanding (theory)
• Nevertheless, there are many Bayesian elements (eg, “decision” on exclusion or discovery given an observed 𝑃 data model , interpretation of results)
• It is also possible to define, problem-dependent, "objective priors" in Bayesian statistics
• A frequentist analysis can become technically very challenging → Bayesian is often simpler
• The predictivity of Bayesian statistics is useful when it comes to decision taking, eg:
• Should I sell, buy or hold certain stocks?
• Should I build the LHC? (Bayesian "no-lose" theorem)
• Almost any decision in life…
21
Bayesians address the question everyone is interested in, by using assumptions no-one believes
Frequentists use impeccable logic to deal with an issue of no interest to anyone
Slightly provocative summary by Louis Lyons (Academic Lecture at Fermilab, August 17, 2004)
…to be taken with a grain of salt!
Hypothesis testing
A hypothesis 𝑯 specifies some model which might lie at the origin of the data 𝒙
a) Point hypothesis: 𝐻 could be a particular event type (eg, Higgs boson versus background)
b) Composite hypothesis: 𝐻 could be a parameter (eg, Higgs boson mass or coupling strength)
23
In case a), the PDF is simply PDF(x) = PDF(x; H)
In case b), 𝐻 contains unspecified parameters (𝜃: mass, coupling, systematic uncertainties)
• A whole band of PDF(x; H(θ))

• For given x, PDF(x; H(θ)) can be interpreted as a function of θ → likelihood function L(θ)

• L(θ) = L(x|H(θ)) for fixed θ is the probability density to observe x given the model H(θ); but note that, viewed as a function of θ, L(θ) is not a PDF in θ
Statistical tests are often formulated using a
• Null hypothesis (eg, Standard Model (SM) background only)
• Alternative hypothesis (eg, SM background + new physics)
Hypothesis testing (continued)
Example: a multivariate (→ see last lecture) classification analysis to search for new physics
24
Take 𝒏 input variables and combine into single output discriminant or test statistic 𝒚
[Figure: PDF(y|signal) and PDF(y|bkg) versus the discriminant y, with the cut value marked]

Choose a cut value, i.e. a region where one can "reject" the null (background) hypothesis (the optimal cut value depends on the signal and background cross sections and purity):

• y > cut: signal region
• y = cut: decision boundary
• y < cut: background region
[Figure: TMVA example distributions of the input variables var1+var2, var1−var2, var3 and var4 for signal and background training samples]
Hypothesis testing (continued)
25
One can make two kinds of mistakes:
Type-2 error (false negative): → accept the null hypothesis although it is not true (there is new physics in the data)

Type-1 error (false positive): → reject the null hypothesis although it is true (no new physics)
truth \ decision | Signal (H₁)  | Background (H₀)
Signal (H₁)      | ✓            | Type-2 error
Background (H₀)  | Type-1 error | ✓
Example: the goal of a new physics search is to exclude the null hypothesis (as an unlikely model for the observation)
Hypothesis testing (continued)
26
Significance α: Type-1 error rate — the rate ("risk") of "false discovery", background in the signal sample; should be small:

α = ∫_{y(x) > cut} P(x|H₀) dx

Type-2 error rate β — the power 1 − β is the sensitivity to the "alternative" theory (signal efficiency); β should be small:

β = ∫_{y(x) < cut} P(x|H₁) dx
Hypothesis testing (continued)
27
Define a critical region C → if the data (observation) falls there, reject the hypothesis

• Want to discriminate between hypotheses H₀ and H₁

• Define a test statistic y(x) for data x

• Compute the expected y distributions for the two hypotheses: PDF(y(x)|H₀) and PDF(y(x)|H₁)

• Compute the observed test statistic y_obs(x) → decide on the outcome according to whether or not y_obs ∈ C
Neyman-Pearson Lemma
Neyman-Pearson (1933):
The likelihood ratio used as test statistic y(x) gives for each significance α the test (critical region) with the largest power 1 − β.
28
Likelihood ratio:

y(x) = P(x|H₁) / P(x|H₀)

or any monotonic function thereof, e.g. ln y(x)

[Figure: ROC curve of 1 − β versus 1 − α; the best ROC curve is given by the likelihood ratio; a small Type-1 error implies a large Type-2 error and vice versa]
Which test statistic (discriminant) 𝒚(𝒙) should one actually choose? What is optimal?
The likelihood ratio maximises the area under the "Receiver Operating Characteristic" (ROC) curve
The proof of the Neyman-Pearson Lemma is straightforward and almost obvious given the definitions of 𝛼 and 𝛽
See, eg: https://en.wikipedia.org/wiki/Neyman–Pearson_lemma
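The construction can be made concrete for two simple Gaussian hypotheses (a toy of my own choosing; for Gaussians of equal width the likelihood ratio is monotonic in x, so cutting on the ratio is equivalent to cutting on x):

```python
import math

# Neyman–Pearson sketch for H0: x ~ N(0,1) versus H1: x ~ N(2,1).
def gauss_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def lr(x):
    """Likelihood ratio y(x) = P(x|H1) / P(x|H0)."""
    return gauss_pdf(x, 2.0) / gauss_pdf(x, 0.0)

def tail(x, mu):
    """P(X > x) for X ~ N(mu, 1), via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / math.sqrt(2.0))

x_cut = 1.6449                      # ~ upper 5% point of N(0,1)
alpha = tail(x_cut, 0.0)            # Type-1 error rate, ≈ 0.05
power = tail(x_cut, 2.0)            # 1 - beta under H1, ≈ 0.64

assert abs(alpha - 0.05) < 1e-3
assert 0.6 < power < 0.7
assert lr(3.0) > lr(1.0)            # LR increases monotonically with x here
```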
Neyman-Pearson Lemma
Unfortunately, the Neyman-Pearson Lemma holds strictly only for simple hypotheses without free parameters
If H₀ and H₁ are "composite hypotheses" H₀/₁(θ), it is not even guaranteed that there exists a so-called uniformly most powerful test statistic that for each given α is the most powerful (largest 1 − β)
Note: already in presence of systematic uncertainties (as varying but constrained “nuisance parameters”) it is not certain that the likelihood ratio is the optimal test statistic
However: the likelihood ratio is probably close to optimal, it is a very convenient test statistic, and therefore commonly used in experimental particle physics
29
Frequentist confidence intervals
In frequentist statistics one cannot make a probabilistic statement about the true value of a parameter given the data.
Instead:
• One defines acceptance / rejection regions of a test statistic (𝜶)
• The measurement (data) is one specific outcome of an ensemble of possible data
• One accepts or rejects 𝑯𝟎 with confidence level given by 𝜶
• It is also possible to state how probable a particular or worse outcome (test statistic measurement) is for a given hypothesis (eg, H₀) → p-value
One then shows the data and quotes the 𝐻³ outcome given the required confidence level and the hypothesis p-value
30
A typical (but highly simplified) frequentist analysis
1. Specify a hypothesis H₀ and a test statistic or estimator (→ likelihood ratio y)

2. Specify the significance of the test, ie, how much of a Type-1 error rate to accept: eg, a confidence level of 95% → α = 5%

3. Take the measurement: y_obs

4. Check whether y_obs lies inside or outside of the critical region → decide on H₀

5. If excluded, compute the p-value of H₀ to see how deep it lies in the critical region
No composite hypothesis yet (see later). Simple hypothesis test using data
p-value = ∫_{y_obs}^{∞} PDF(y|H₀) dy

[Figure: PDF(y|H₀) versus y, with the α = 5% critical region and the p-value of y_obs indicated]
Significance and p-values
Note:
32
𝜶 (significance) must be specified before the hypothesis test is made
The p-value is a property of actual measurement (observation)
Again: the p-value is not a measure of how probable the hypothesis is

The confidence level of a hypothesis test (accept/reject) is given by α, not the p-value
It is convenient to express observed p-values in terms of Gaussian 𝜎 (“sigma”):
• How many standard deviations “𝑍” for same p-value on one-sided Gaussian
• In ROOT: TMath::Prob(Z*Z, 1)/2 = p (p-value) (eg, the p-value corresponding to Z = 5σ is 2.87·10⁻⁷)

• Inverse in ROOT: sqrt(TMath::ChisquareQuantile(1 − 2*p, 1)) = Z
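A Python equivalent of the two conversions (an alternative to the ROOT calls quoted above, using only the standard library):

```python
import math
from statistics import NormalDist

def p_value(Z):
    """One-sided Gaussian tail probability for Z standard deviations."""
    return 0.5 * math.erfc(Z / math.sqrt(2.0))

def z_value(p):
    """Inverse: significance Z for a one-sided p-value."""
    return NormalDist().inv_cdf(1.0 - p)

p5 = p_value(5.0)
# Z = 5 sigma corresponds to p ≈ 2.87e-7, and the inverse recovers Z = 5:
assert abs(p5 - 2.87e-7) / 2.87e-7 < 0.01
assert abs(z_value(p5) - 5.0) < 1e-4
```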
arXiv:1007.1727
Distribution of p-values
Assume:
• Test statistic: 𝒚 (function of measured quantities)
• PDF of y for a given hypothesis H: p_y(y; H)

• p-value(y; H) = ∫_y^∞ p_y(y′; H) dy′ for each measurement y
p-values are random variables → they follow a distribution if the measurement is repeated
Derived from a cumulative distribution → must be uniform for the matching hypothesis H
33
• Hence, in a fraction of cases the p-value of a given measurement may become very small, although H is the correct hypothesis
• If the true and tested hypotheses are different, the p-value distribution will deviate from uniform (but usually one cannot just repeat a measurement or an experiment to test this)
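The uniformity claim is easy to verify by simulation. A Python sketch (toy setup of my own choosing: y ~ N(0, 1) under the tested hypothesis, one-sided p-value):

```python
import random, math

random.seed(3)

# Draw y under the true hypothesis H, compute the one-sided p-value, and check
# that the fraction of p-values below any threshold t equals t (uniformity).
def p_value(y):
    return 0.5 * math.erfc(y / math.sqrt(2.0))   # P(Y > y) for Y ~ N(0, 1)

pvals = [p_value(random.gauss(0.0, 1.0)) for _ in range(100_000)]

for t in (0.1, 0.5, 0.9):
    frac = sum(1 for p in pvals if p < t) / len(pvals)
    assert abs(frac - t) < 0.01
```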
Statistical tests in new particle/physics searches
Discovery test
• Disprove background-only hypothesis 𝑯𝟎
• Estimate probability of “upward” (or “signal-like”) fluctuation of background
34
Exclusion limit
• Upper limit on new physics cross section
• Disprove signal + background hypothesis 𝑯𝟎
• Estimate the probability of a downward fluctuation of signal + background: find the minimal signal for which H₀ (here: S+B) can be excluded at a specified confidence level
[Figure: example PDF of the background-only test statistic versus N — background-only hypothesis H₀, PDF: Poisson(N_B; μ_B = 4); the N_obs needed for a 5σ discovery corresponds to p = 2.87·10⁻⁷]

[Figure: example PDFs for B and S+B versus N — H₁ = B, H₀ = S + B; Type-1 error α = 5% → 95% CL]
Realistic discovery and exclusion likelihood tests involve complex fits of several signal and background-normalisation (so-called control) regions, signal and background yields, as well as nuisance parameters describing systematic uncertainties.
We will come to this, but first need to learn about parameter estimation.
[Figure: ATLAS 2011–2012 local p₀ versus m_H (110–150 GeV), observed and expected ±1σ, with 0–6σ significance lines; √s = 7 TeV, 4.6–4.8 fb⁻¹ and √s = 8 TeV, 5.8–5.9 fb⁻¹]
Statistical tests in new particle/physics searches — teaser
Discovery test — Higgs discovery in 2012
• 5.9σ rejection of the background-only hypothesis from the statistical combination of dominantly H → γγ, ZZ*, WW* decays at m_H = 126 GeV

• No trials factor (look-elsewhere effect, LEE) taken into account in the above number, but it would not qualitatively change the picture
36
Exclusion limit
• 13 TeV search for new physics (here: a new heavy Higgs boson) in events with at least two tau leptons
• Figure shows expected and observed 95% confidence level upper limits on cross section times branching fraction
[Figure: CMS Preliminary, 2.3 fb⁻¹ (13 TeV), eτ_h + μτ_h + τ_hτ_h channels: observed and expected (±1σ, ±2σ) 95% CL upper limits on σ(gg→φ)·B(φ→ττ) versus m_φ — ATLAS, 1207.7214; CMS, CMS-PAS-HIG-16-006]
Parameter estimation
An estimator is a function of a data sample, θ̂ = θ̂(x₁, …, x_N), that estimates the characteristic parameter θ of a parent distribution.
Examples:
• Mean value estimator:
• Variance estimator:
• Median estimator …..
• ...but also: CP-asymmetry parameter in B meson sample (very complex parameter estimation)
The estimator θ̂ is a random variable (a function of the measured data, which are random)

The estimator θ̂ itself has an expectation value and an expected variance, for given θ:
37
μ̂ = (1/N) Σ_{i=1}^{N} x_i (one way to define the mean value; there could be others)

V̂ = 1/(N − 1) Σ_{i=1}^{N} (x_i − x̄)²
E[θ̂(x)|θ] = ∫ θ̂(x) f(x|θ) dx, with f(x|θ) the distribution (PDF) of the expected data
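The 1/(N − 1) factor in the variance estimator above is what makes it unbiased; this can be checked by averaging over many repeated small toy "experiments" (a sketch of my own, with σ² = 1 and N = 5 as illustrative assumptions):

```python
import random

random.seed(11)

# Average the 1/(N-1) and the naive 1/N variance estimators over many small
# samples; the former reproduces sigma^2 = 1, the latter (N-1)/N * sigma^2.
def variance(sample, ddof):
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - ddof)

N, n_exp = 5, 50_000
v_unbiased = v_biased = 0.0
for _ in range(n_exp):
    s = [random.gauss(0.0, 1.0) for _ in range(N)]
    v_unbiased += variance(s, 1) / n_exp
    v_biased += variance(s, 0) / n_exp

assert abs(v_unbiased - 1.0) < 0.02      # ≈ sigma^2
assert abs(v_biased - 0.8) < 0.02        # ≈ (N-1)/N * sigma^2 = 0.8
```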
Parameter estimation
𝜽Ö is a random variable that follows a PDF. Consider many measurements / experiments:
→ There will be a spread of θ̂ estimates. Different estimators can have different properties:
38
[Figure (Glen Cowan): PDFs of the estimator θ̂ around the true θ — biased, large variance, and best]
• Biased or unbiased: if E[θ̂(x)|θ] = θ → unbiased

• A small bias and a small variance can be "in conflict"

– asymptotic bias → the limit for infinite observations/data samples
Maximum likelihood estimator
Want to estimate (measure !) a parameter 𝜽
Observe x⃗_i = (x₁, …, x_K)_i, i = 1, …, N (ie: K observables per event, and N events)

The hypothesis is the PDF p_x(x⃗; θ), ie, the distribution of x⃗ given θ

There are N independent events → combine their PDFs:

For fixed x⃗ consider p_x(x⃗; θ) as a function of θ → likelihood L(θ)

• L(θ) is at maximum (if unbiased) for θ̂ = θ_true
39
P(x⃗₁, …, x⃗_N; θ) = ∏_{i=1}^{N} p_x(x⃗_i; θ)
[Fig. 6.1 panels: (a) log L = 41.2 (ML fit), log L = 41.0 (true parameters); (b) log L = 13.9 and log L = 18.9]
Fig. 6.1 A sample of 50 observations of a Gaussian random variable with mean μ = 0.2 and standard deviation σ = 0.1. (a) The p.d.f. evaluated with the parameters that maximize the likelihood function and with the true parameters. (b) The p.d.f. evaluated with parameters far from the true values, giving a lower likelihood.
With this motivation one defines the maximum likelihood (ML) estimators for the parameters to be those which maximize the likelihood function. As long as the likelihood function is a differentiable function of the parameters θ₁, …, θ_m, and the maximum is not at the boundary of the parameter range, the estimators are given by the solutions to the equations,

∂L/∂θ_i = 0, i = 1, …, m. (6.3)
If more than one local maximum exists, the highest one is taken. As with other types of estimators, they are usually written with hats, θ̂ = (θ̂₁, …, θ̂_m), to distinguish them from the true parameters θ_i whose exact values remain unknown.
The general idea of maximum likelihood is illustrated in Fig. 6.1. A sample of 50 measurements (shown as tick marks on the horizontal axis) was generated according to a Gaussian p.d.f. with parameters μ = 0.2, σ = 0.1. The solid curve in Fig. 6.1(a) was computed using the parameter values for which the likelihood function (and hence also its logarithm) are a maximum: μ̂ = 0.204 and σ̂ = 0.106. Also shown as a dashed curve is the p.d.f. using the true parameter values. Because of random fluctuations, the estimates μ̂ and σ̂ are not exactly equal to the true values μ and σ. The estimators μ̂ and σ̂ and their variances, which reflect the size of the statistical errors, are derived below in Section 6.3. Figure 6.1(b) shows the p.d.f. for parameters far away from the true values, leading to lower values of the likelihood function.
The motivation for the ML principle presented above does not necessarily guararitee any optimal properties for the resulting estimators. The ML method turns out to have many advantages, among them ease of use and the fact that no binning is necessary. In the following the desirability of ML estimators will
Glen Cowan

50 observations of a Gaussian random variable with mean 0.2 and σ = 0.1: (a) good estimate of θ, (b) poor estimate of θ
Task: maximise L(θ) to derive the best estimate θ̂

In practice, one often minimises −2 · ln L(θ) (see later why)

→ Maximum likelihood fit
Maximum likelihood estimator (continued)
Let's take the Gaussian example from before: L(μ, σ|x) = 1/(√(2π)·σ) · exp(−(x − μ)²/(2σ²))

• Measure N events: x₁, …, x_N
• Full likelihood given by: L(μ, σ|x) = ∏_{i=1}^{N} 1/(√(2π)·σ) · exp(−(xᵢ − μ)²/(2σ²))
• In logarithmic form: −2 · ln L(μ, σ|x) = Σ_{i=1}^{N} (xᵢ − μ)²/σ² − 2N · ln(1/(√(2π)·σ))

→ In a full maximum likelihood fit one could now determine μ̂ and σ̂
→ If one is not interested in fitting σ but just μ, one can omit the (then constant) 2nd term:

−2 · Δln L(μ|x) = Σ_{i=1}^{N} (xᵢ − μ)²/σ²,   where Δln L(μ|x) = ln L(μ|x) − constant term

→ which is the “least squares” (χ²) expression
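As a sketch of this unbinned fit (not from the lecture; the sample, random seed, and starting values are arbitrary choices), one can minimise the Gaussian −2 ln L numerically and compare with the closed-form estimators:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical sample mimicking the example: 50 events, mu = 0.2, sigma = 0.1
rng = np.random.default_rng(1)
x = rng.normal(0.2, 0.1, size=50)

def m2lnL(params):
    """-2 ln L for a Gaussian sample, keeping all terms."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return np.sum((x - mu) ** 2) / sigma ** 2 + 2 * len(x) * np.log(np.sqrt(2 * np.pi) * sigma)

res = minimize(m2lnL, x0=[0.0, 0.2], method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8})
mu_hat, sigma_hat = res.x
```

The numerical minimum agrees with the closed-form Gaussian ML estimators, μ̂ = x̄ and σ̂² = (1/N)·Σ(xᵢ − x̄)².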
Maximum likelihood estimator (continued)
So far we considered unbinned datasets (i.e., the likelihood is given by the product of the PDFs of each event)

One can replace the events by bins of a histogram
• Useful if there is a very large number of events, if the PDF has a very complex form, or if only broad regions are considered rather than the full shape of a PDF
• Most LHC analyses use binned maximum likelihood fits

Each bin 𝒊 has 𝑵𝒊 events that are Poisson distributed around 𝝁𝒊
• The predictions of the μᵢ can be obtained from Monte Carlo simulation

Likelihood function: L(θ) = P(N₁, …, N_bins; θ) = ∏_{i=1}^{n_bins} μᵢ(θ)^{Nᵢ}/Nᵢ! · e^{−μᵢ(θ)}

…and in log form: −2 · ln L(θ) = 2 · Σ_{i=1}^{n_bins} [μᵢ(θ) − Nᵢ · ln μᵢ(θ) + ln Nᵢ!], where the ln Nᵢ! term is constant in θ and can be dropped in the minimisation
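A minimal binned-fit sketch of this Poisson likelihood (the five-bin template and the signal-strength parameter θ are invented for illustration, not from the lecture), with the constant ln Nᵢ! term dropped:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
template = np.array([5.0, 20.0, 40.0, 20.0, 5.0])  # predicted bin contents for theta = 1
N = rng.poisson(1.3 * template)                    # observed counts, true theta = 1.3

def m2lnL(theta):
    """Poisson -2 ln L with mu_i(theta) = theta * template_i; ln N_i! dropped."""
    mu = theta * template
    return 2.0 * np.sum(mu - N * np.log(mu))

res = minimize_scalar(m2lnL, bounds=(0.1, 5.0), method="bounded")
theta_hat = res.x
```

For this one-parameter model the minimum also has a closed form, θ̂ = Σᵢ Nᵢ / Σᵢ templateᵢ, which the numerical fit reproduces.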
Maximum likelihood estimator (continued)
The maximum likelihood estimator is typically unbiased only in the limit N → ∞
Asymmetric errors
• Another approximation, alternative to the parabolic one, is to evaluate the excursion range of −2 ln L
• The nσ error is determined by the range around the minimum of −2 ln L for which −2 ln L increases by +1 (+n² for nσ intervals)
European School of HEP 2016 Luca Lista 58
[Figure: −2 ln L as a function of θ, showing the minimum value −2 ln L_min, the horizontal line −2 ln L_min + 1, and the asymmetric errors θ̂ + δ⁺, θ̂ − δ⁻]
• Errors can be asymmetric
• For a Gaussian PDF the result is identical to that obtained from the 2nd-order derivative (Hessian) matrix
• Implemented in Minuit as MINOS function
If the likelihood function is Gaussian (often the case for large N by virtue of the central limit theorem):

→ Estimate the 1σ confidence interval for θ (“parameter uncertainty”) by finding the intersections −2 · Δln L = 1 around the minimum
→ The resulting uncertainty on θ may be asymmetric

If (very) non-Gaussian:

→ typically revert to (classical) Neyman confidence intervals (→ see later)
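A sketch of the −2 · Δln L = 1 scan for the Gaussian example with σ assumed known (sample and seed are arbitrary):

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
sigma = 0.1
x = rng.normal(0.2, sigma, size=50)

mu_hat = x.mean()  # ML estimate of mu when sigma is known

def m2dlnL(mu):
    """-2 * Delta ln L relative to the minimum."""
    return (np.sum((x - mu) ** 2) - np.sum((x - mu_hat) ** 2)) / sigma ** 2

# 1-sigma interval: crossings of -2 Delta ln L = 1 on either side of the minimum
lo = brentq(lambda m: m2dlnL(m) - 1.0, mu_hat - 0.1, mu_hat)
hi = brentq(lambda m: m2dlnL(m) - 1.0, mu_hat, mu_hat + 0.1)
```

Here the likelihood is exactly Gaussian in μ, so the two crossings come out symmetric and equal to the familiar σ/√N; for a non-parabolic −2 ln L the same scan yields asymmetric errors.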
[Figure: distribution of χ²_min over toy experiments (excluding theory errors), compared to a χ² distribution for n_dof = 14; the resulting p-value for (data|SM) is 0.202 ± 0.004]
Goodness-of-Fit (GoF)
The maximum likelihood estimator determines the best parameter 𝜽̂

But: does the model with the best θ̂ fit the data well?

The value of −2 · ln L(θ̂) at the minimum does not mean much → it needs calibration

→ Determine the expected distribution of −2 · ln L(θ̂) using pseudo Monte Carlo experiments, and compare the measured value to the expected ones
Gfitter group
Goodness-of-Fit (continued)
A goodness-of-fit test is more straightforward with the 𝝌² estimator

Let's use the binned example again. The task is to minimise versus θ:

χ²_min(θ̂) = min_θ χ²(θ) = Σ_{i=1}^{n_bins} (Nᵢ − μᵢ(θ))²/σᵢ²

χ² has known properties: E[χ²] = n_dof = k (= number of degrees of freedom)

Cumulative PDF: the probability to find χ² > χ²_min is given by TMath::Prob(χ²_min, k)
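Outside ROOT, the same tail probability is available from SciPy (χ²_min = 14.2 is just an illustrative value, not from the lecture):

```python
from scipy.stats import chi2

chi2_min, k = 14.2, 14          # illustrative chi2_min with k degrees of freedom
p_value = chi2.sf(chi2_min, k)  # P(chi^2 > chi2_min), the same quantity as TMath::Prob
```

`chi2.sf` is the survival function, i.e. 1 − CDF, and `chi2.mean(k)` confirms the property E[χ²] = k.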
[Figures from https://en.wikipedia.org/wiki/Chi-squared_distribution: the χ² PDF P(χ²; n) and its cumulative distribution, with the values corresponding to Z = 1σ and Z = 2σ indicated]
Classical confidence level

Neyman confidence belt for confidence level (CL) 𝜶 (e.g. 95%)

Statement about the probability to cover the true value 𝝁_𝐭𝐫𝐮𝐞 of a parameter 𝝁 fit to data (note: the PDG text below writes the coverage probability as 1 − α)

• Each hypothesis μ_true has a PDF describing how the measured values μ̂_obs will be distributed
33. Statistics

33.3.2. Frequentist confidence intervals:

The unqualified phrase “confidence intervals” refers to frequentist intervals obtained with a procedure due to Neyman [29], described below. These are intervals (or in the multiparameter case, regions) constructed so as to include the true value of the parameter with a probability greater than or equal to a specified level, called the coverage probability. In this section, we discuss several techniques for producing intervals that have, at least approximately, this property.

33.3.2.1. The Neyman construction for confidence intervals:

Consider a p.d.f. f(x; θ) where x represents the outcome of the experiment and θ is the unknown parameter for which we want to construct a confidence interval. The variable x could (and often does) represent an estimator for θ. Using f(x; θ), we can find for a pre-specified probability 1 − α, and for every value of θ, a set of values x₁(θ, α) and x₂(θ, α) such that

P(x₁ < x < x₂; θ) = 1 − α = ∫_{x₁}^{x₂} f(x; θ) dx.   (33.49)

This is illustrated in Fig. 33.3: a horizontal line segment [x₁(θ, α), x₂(θ, α)] is drawn for representative values of θ. The union of such intervals for all values of θ, designated in the figure as D(α), is known as the confidence belt. Typically the curves x₁(θ, α) and x₂(θ, α) are monotonic functions of θ, which we assume for this discussion.

Figure 33.3: Construction of the confidence belt (see text).

February 18, 2012 20:15
[Axes as used on the slide: μ̂ (measured value) horizontal, μ_true (hypothesis) vertical]
• Determine the (central) intervals (“acceptance regions”) in these PDFs such that they contain α
• Do this for all μ_true hypotheses
• Connect all the red dots: the confidence belt
• Measure 𝝁̂_𝐨𝐛𝐬
→ The confidence interval [μ̂₁, μ̂₂] is given by the vertical line intersecting the belt at μ̂_obs
→ α = 95% of the intervals [μ̂₁, μ̂₂] contain μ_true
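A toy check of the belt's defining property for the simplest case, a Gaussian measurement with known resolution s (all numbers here are arbitrary): each hypothesis μ_true accepts μ̂ in μ_true ± 1.96 s, so inverting the belt gives the interval [μ̂ − 1.96 s, μ̂ + 1.96 s], which should cover μ_true in ≈95% of repeated experiments:

```python
import numpy as np

rng = np.random.default_rng(4)
s, mu_true, z95 = 1.0, 3.0, 1.959964  # z such that P(|z'| < z95) = 95% for a unit Gaussian

# Repeat the "experiment" many times and check how often the
# inverted-belt interval [muhat - z95*s, muhat + z95*s] covers mu_true
muhat = rng.normal(mu_true, s, size=200_000)
coverage = np.mean(np.abs(muhat - mu_true) < z95 * s)
```

For non-Gaussian PDFs the acceptance regions must instead be built numerically (e.g. with pseudo experiments), which is where the construction becomes expensive.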
Combining confidence intervals

The construction of Neyman intervals may require large computing resources if done with pseudo Monte Carlo experiments. In many cases, experiments take a “Gaussian” shortcut, assuming that the PDF of μ̂ is Gaussian and does not depend on μ_true (see previous slides)

In the Gaussian case, measurements can be combined by multiplying their likelihood functions

Otherwise: it is important to combine the individual measurements, not the confidence intervals: construct the confidence belt of the combined measurement

The following “Gaussian shortcut” (formula shown on the slide) will be wrong in that case. In a perfectly Gaussian and uncorrelated case, this simple formula is correct

arXiv:1201.2631v2
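A sketch of the Gaussian combination of two hypothetical measurements (the numbers are invented). Multiplying two Gaussian likelihoods is equivalent to the standard inverse-variance weighted average, which is presumably the kind of “simple formula” referred to above:

```python
import numpy as np

x = np.array([10.2, 9.5])   # two hypothetical measurements of the same quantity
s = np.array([0.4, 0.6])    # their (uncorrelated) Gaussian uncertainties

# Product of Gaussian likelihoods <-> inverse-variance weighted average
w = 1.0 / s ** 2
x_comb = np.sum(w * x) / np.sum(w)
s_comb = 1.0 / np.sqrt(np.sum(w))
```

This is valid only in the perfectly Gaussian, uncorrelated case; otherwise one must combine the measurements themselves and rebuild the confidence belt of the combination.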
Summary for today

We have introduced the axioms of probability theory and discussed the difference between frequentist and Bayesian statistics

We have discussed hypothesis testing, introduced Type-1 and Type-2 errors, and the Neyman-Pearson likelihood-ratio lemma

Test statistics, confidence intervals, significance and p-values were introduced

Parameter estimation with the maximum likelihood technique, goodness-of-fit, and the derivation of a classical Neyman confidence belt were discussed

Next: realistic maximum likelihood fits, Monte Carlo techniques and data unfolding