Bayesian Within The Gates A View From Particle Physics

Harrison B. Prosper
Florida State University

SAMSI, 24 January 2006


Outline

Measuring Zero as Precisely as Possible!

Signal/Background Discrimination: 1-D Example, 14-D Example

Some Open Issues

Summary


Measuring Zero!

Diamonds may not be forever

Neutron <-> anti-neutron transitions: CRISP Experiment (1982 – 1985), Institut Laue-Langevin, Grenoble, France

Method: Fire a gas of cold neutrons onto a graphite foil. Look for annihilation of the anti-neutron component.


Measuring Zero!

Count the number of signal + background events, N.

Suppress the putative signal and count background events, B, independently.

Results:

N = 3

B = 7


Measuring Zero!

Classic 2-Parameter Counting Experiment

N ~ Poisson(s+b)

B ~ Poisson(b)

Wanted:

A statement like

s < u(N,B) @ 90% CL


Measuring Zero!

In 1984, no exact solution existed in the particle physics literature!

But, surely it must have been solved by statisticians.

Alas, from Kendall and Stuart I learnt that calculating exact confidence intervals is “a matter of very considerable difficulty”.


Measuring Zero!

Exact in what way?

Over the ensemble of statements of the form

s ∈ [0, u)

at least 90% of them should be true, whatever the true value of the signal s AND whatever the true value of the background parameter b.

blame… Neyman (1937)


“Keep it simple, but no simpler”

Albert Einstein


Bayesian @ the Gate (1984)

Solution:

p(N,B|s,b) = Poisson(s+b) Poisson(b)    the likelihood

p(s,b) = uniform(s,b)    the prior

Compute the posterior density p(s,b|N,B):

p(s,b|N,B) = p(N,B|s,b) p(s,b) / p(N,B)

Marginalize over b:

p(s|N,B) = ∫ p(s,b|N,B) db

This reasoning was compelling to me then, and is much more so now!
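The marginalization over b can be sketched numerically. The following is a minimal illustration, not the original 1984 calculation: it assumes a flat prior on s ≥ 0 and b ≥ 0, evaluates the two-Poisson likelihood on a finite grid, and sums over b. The grid limits and the function names `marginal_posterior` and `upper_limit` are choices made here for illustration only.

```python
import numpy as np
from math import lgamma

def marginal_posterior(N, B, s_max=40.0, b_max=60.0, n_s=801, n_b=1201):
    """Return (s, p(s|N,B)) on a grid, assuming a flat prior on s >= 0, b >= 0."""
    s = np.linspace(0.0, s_max, n_s)
    b = np.linspace(1e-9, b_max, n_b)          # start just above 0 to avoid log(0)
    S, Bg = np.meshgrid(s, b, indexing="ij")
    # log of Poisson(N | s+b) * Poisson(B | b)
    log_like = (N * np.log(S + Bg) - (S + Bg) - lgamma(N + 1)
                + B * np.log(Bg) - Bg - lgamma(B + 1))
    joint = np.exp(log_like)                   # flat prior: posterior is proportional to likelihood
    post = joint.sum(axis=1)                   # marginalize over b (Riemann sum)
    post /= post.sum() * (s[1] - s[0])         # normalize to unit area on the s grid
    return s, post

def upper_limit(N, B, cl=0.90):
    """Smallest grid value u with P(s < u | N, B) >= cl."""
    s, post = marginal_posterior(N, B)
    cdf = np.cumsum(post) * (s[1] - s[0])
    return s[np.searchsorted(cdf, cl)]

# The counts quoted in the talk
u90 = upper_limit(N=3, B=7)
print(f"s < {u90:.2f} at 90% credibility (flat prior, grid approximation)")
```

The result is a Bayesian credible upper limit; its frequentist coverage over all (s, b) is a separate question and is not checked by this sketch.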


Particle Physics Data

proton + anti-proton -> positron (e+) + neutrino (ν) + Jet1 + Jet2 + Jet3 + Jet4

This event “lives” in 3 + 2 + 3 x 4 = 17 dimensions.


p pbar -> t tbar -> l + jets

CDF/Dzero: discovery of the top quark (1995)

[Figure: data in red, signal in green, background in blue and magenta]

Dzero: 17-D -> 2-D

Particle Physics Data


But that was then, and now is now!

Today we have 2 GHz laptops, with 2 GB of memory!

It is fun to deploy huge, sometimes unreliable, computational resources, that is, brains, to reduce the dimensionality of data.

But perhaps it is now feasible to work directly in the original high-dimensional space, using hardware!


Signal/Background Discrimination

The optimal solution is to compute

p(S|x) = p(x|S) p(S) / [p(x|S) p(S) + p(x|B) p(B)]

Every signal/background discrimination method is ultimately an algorithm to approximate this solution, or a mapping thereof.

Therefore, if a method is already at the Bayes limit, no other method, however sophisticated, can do better!
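When the class densities are known exactly, the Bayes discriminant can be written down directly. The sketch below assumes, purely for illustration, unit-width Gaussian densities for signal and background; the function names and the Gaussian choice are not from the talk.

```python
import numpy as np

def gauss(x, mu, sigma):
    """Normal density, used here as a stand-in for p(x|S) and p(x|B)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def bayes_discriminant(x, p_x_S, p_x_B, prior_S=0.5):
    """p(S|x) = p(x|S) p(S) / [p(x|S) p(S) + p(x|B) p(B)], with p(B) = 1 - p(S)."""
    num = p_x_S(x) * prior_S
    return num / (num + p_x_B(x) * (1.0 - prior_S))

# Hypothetical densities: signal centred at +1, background at -1
x = np.linspace(-4.0, 4.0, 9)
p = bayes_discriminant(x,
                       lambda t: gauss(t, +1.0, 1.0),
                       lambda t: gauss(t, -1.0, 1.0))
print(np.round(p, 3))
```

In a real analysis p(x|S) and p(x|B) are not available in closed form, which is why the discriminant has to be approximated from simulated events.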


Given: D = (x, y), with x = {x1,…,xN} and y = {y1,…,yN}, a set of N training examples

Infer: a discriminant function f(x, w), with parameters w

p(w|x, y) = p(x, y|w) p(w) / p(x, y)
          = p(y|x, w) p(x|w) p(w) / [p(y|x) p(x)]
          = p(y|x, w) p(w) / p(y|x)

assuming p(x|w) -> p(x)



A typical likelihood for classification:

$$p(y|x, w) = \prod_i f(x_i, w)^{y_i} \, [1 - f(x_i, w)]^{1 - y_i}$$

where y_i = 0 for background events and y_i = 1 for signal events.

If f(x, w) is flexible enough, then maximizing p(y|x, w) with respect to w yields f = p(S|x), asymptotically.
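The classification likelihood is usually handled in log form. The sketch below evaluates it and checks the simplest possible case, a constant discriminant f = c, for which the maximum sits at the observed signal fraction. The function name and the toy labels are illustrative assumptions.

```python
import numpy as np

def log_likelihood(f_values, y):
    """Sum over events of y*log(f) + (1 - y)*log(1 - f)."""
    f = np.clip(f_values, 1e-12, 1.0 - 1e-12)   # guard against log(0)
    return np.sum(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))

# Toy labels: 30 signal events (y = 1) and 70 background events (y = 0)
y = np.concatenate([np.ones(30), np.zeros(70)])

# Scan a constant discriminant f(x) = c and locate the maximum
c_grid = np.linspace(0.01, 0.99, 99)
ll = np.array([log_likelihood(np.full_like(y, c), y) for c in c_grid])
best_c = c_grid[np.argmax(ll)]
print(best_c)   # the signal fraction maximizes the likelihood for a constant f
```

With no dependence on x, the best the function can do is reproduce the overall signal fraction; a flexible f(x, w) does the same thing locally in x, which is the sense in which the maximum approaches p(S|x).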



However, in a full Bayesian calculation one usually averages with respect to the posterior density

y(x) = ∫ f(x, w) p(w|D) dw

Questions:

1. Do suitably flexible functions f(x, w) exist?

2. Is there a feasible way to do the integral?



Answer 1: Hilbert’s 13th Problem!

Prove that the following is impossible

y(x,y,z) = F( A(x), B(y), C(z) )

In 1957, Kolmogorov proved the contrary conjecture

y(x1,..,xn) = F( f1(x1),…,fn(xn) )

I’ll call such functions, F, Kolmogorov functions


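Kolmogorov Functions

$$f(x, w) = b + \sum_{j=1}^{H} v_j \tanh\left( a_j + \sum_{i=1}^{P} u_{ji} x_i \right)$$

$$n(x, w) = \frac{1}{1 + \exp[-f(x, w)]}$$

[Figure: network diagram with inputs x1 and x2, hidden-layer parameters (u, a), output parameters (v, b), and output n(x, w)]

A neural network is an example of a Kolmogorov function, that is, a function capable of approximating arbitrary mappings f: R^N -> U. The parameters w = (u, a, v, b) are called weights.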
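A single-hidden-layer network of this kind is short to write out. The sketch below assumes the usual form, f = b + Σ_j v_j tanh(a_j + Σ_i u_ji x_i) followed by a logistic output, with P inputs and H hidden nodes; the array shapes and function names are choices made here. Counting the weights gives P·H + H + H + 1, which reproduces the 641 parameters quoted later for the (14, 40, 1) model class.

```python
import numpy as np

def n_weights(P, H):
    """Number of parameters: u is H x P, a has H entries, v has H entries, b is 1."""
    return P * H + H + H + 1

def network(x, u, a, v, b):
    """n(x, w) for a batch x of shape (n_events, P); returns values in (0, 1)."""
    f = b + np.tanh(x @ u.T + a) @ v
    return 1.0 / (1.0 + np.exp(-f))

# Random weights for a (14, 40, 1) network, just to exercise the function
rng = np.random.default_rng(0)
P, H = 14, 40
u, a = rng.normal(size=(H, P)), rng.normal(size=H)
v, b = rng.normal(size=H), rng.normal()
out = network(rng.normal(size=(5, P)), u, a, v, b)

print(n_weights(14, 40), n_weights(1, 15))
print(out.shape)
```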


Answer 2: Use Hybrid MCMC

Computational Method

Generate a Markov chain (MC) of N points {w} drawn from the posterior density p(w|D) and average over the last M points.

Each point corresponds to a network.

Software: Flexible Bayesian Modeling by Radford Neal

http://www.cs.utoronto.ca/~radford/fbm.software.html
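The idea of sampling weights from p(w|D) and averaging the resulting discriminants can be illustrated with a small hybrid (Hamiltonian) Monte Carlo sampler. This is a sketch under strong simplifications and is not Neal's FBM software: the "network" is reduced to a two-parameter logistic function f = sigmoid(w0 + w1 x), the prior is an assumed Gaussian of width 3, and the step size, trajectory length, and toy Gaussian data are arbitrary choices.

```python
import numpy as np

def make_log_posterior(X, y, prior_sigma=3.0):
    """Log posterior (up to a constant) and its gradient for f = sigmoid(X @ w)."""
    def log_post(w):
        z = X @ w
        # log f = -log(1 + e^{-z}),  log(1 - f) = -log(1 + e^{z})
        log_like = -np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))
        return log_like - 0.5 * np.sum(w ** 2) / prior_sigma ** 2
    def grad(w):
        f = 1.0 / (1.0 + np.exp(-(X @ w)))
        return X.T @ (y - f) - w / prior_sigma ** 2
    return log_post, grad

def hmc(log_post, grad, w0, n_steps, eps=0.05, n_leap=20, seed=0):
    """Hybrid Monte Carlo: leapfrog trajectories plus a Metropolis accept step."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    chain = []
    for _ in range(n_steps):
        p = rng.normal(size=w.shape)              # fresh momentum
        w_new, p_new = w.copy(), p.copy()
        p_new += 0.5 * eps * grad(w_new)          # initial half step in momentum
        for i in range(n_leap):
            w_new += eps * p_new                  # full step in position
            if i < n_leap - 1:
                p_new += eps * grad(w_new)        # full step in momentum
        p_new += 0.5 * eps * grad(w_new)          # final half step in momentum
        h_old = -log_post(w) + 0.5 * p @ p
        h_new = -log_post(w_new) + 0.5 * p_new @ p_new
        if np.log(rng.uniform()) < h_old - h_new:
            w = w_new                             # accept the proposal
        chain.append(w.copy())
    return np.array(chain)

def y_avg(x_new, chain, M=100):
    """Average the discriminant over the last M points of the chain."""
    Xn = np.column_stack([np.ones_like(x_new), x_new])
    return np.mean(1.0 / (1.0 + np.exp(-(Xn @ chain[-M:].T))), axis=1)

# Toy data: 100 signal events around +1 and 100 background events around -1
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(1.0, 1.0, 100), rng.normal(-1.0, 1.0, 100)])
y = np.concatenate([np.ones(100), np.zeros(100)])
X = np.column_stack([np.ones_like(x), x])

log_post, grad = make_log_posterior(X, y)
chain = hmc(log_post, grad, w0=np.zeros(2), n_steps=300)
vals = y_avg(np.array([-3.0, 0.0, 3.0]), chain)
print(vals)
```

Each row of `chain` plays the role of one network; `y_avg` is the Monte Carlo estimate of the posterior average over the retained points.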


A 1-D Example

Signal: p+pbar -> t q b

Background: p+pbar -> W b b

NN Model Class: (1, 15, 1)

MCMC: 500 tqb + Wbb events. Use the last 20 networks in a Markov chain of 500.

[Figure: distributions of x for Wbb and tqb events]


A 1-D Example

[Figure: p(S|x) versus x]

Dots: p(S|x) = HS/(HS+HB), where HS and HB are 1-D histograms

Curves: individual NNs n(x, w_k)

Black curve: < n(x, w) >
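The histogram estimate shown as dots, p(S|x) = HS/(HS+HB), takes only a few lines. The sketch below uses toy Gaussian samples in place of the tqb and Wbb events, which are not available here; the binning and sample sizes are arbitrary, and equal sample sizes stand in for p(S) = p(B).

```python
import numpy as np

def histogram_discriminant(x_sig, x_bkg, edges):
    """Bin-by-bin estimate HS / (HS + HB); empty bins are returned as NaN."""
    HS, _ = np.histogram(x_sig, bins=edges)
    HB, _ = np.histogram(x_bkg, bins=edges)
    total = HS + HB
    with np.errstate(invalid="ignore", divide="ignore"):
        ratio = np.where(total > 0, HS / total, np.nan)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, ratio

# Toy stand-ins for the signal and background samples
rng = np.random.default_rng(1)
x_sig = rng.normal(+1.0, 1.0, 5000)
x_bkg = rng.normal(-1.0, 1.0, 5000)

centers, ratio = histogram_discriminant(x_sig, x_bkg, np.linspace(-4.0, 4.0, 17))
for c, r in zip(centers, ratio):
    print(f"{c:+.2f}  {r:.3f}")
```

This works in one dimension because the bins are well populated; the same construction becomes impractical in 14 dimensions, which is the motivation for a fitted discriminant.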


A 14-D Example (Finding Susy!)

[Figure: transverse momentum spectra; the signal is the black curve]

Signal/Noise: 1/100,000


[Figure: missing transverse momentum spectrum (caused by escape of neutrinos and Susy particles)]

Variable count: 4 x (ET, η, φ) + (ET, φ) = 14


Likelihood Prior


Signal: 250 p+pbar -> top + anti-top (MC) events

Background: 250 p+pbar -> gluino gluino (MC) events

NN Model Class: (14, 40, 1) (641-D parameter space!)

MCMC: Use the last 100 networks in a Markov chain of 10,000, skipping every 20.


But does it Work?

Signal to noise can reach 1/1 with an acceptable signal strength


But does it Work?

Let

d(x) = N p(x|S) + N p(x|B) be the density of the data, containing 2N events, assuming, for simplicity, p(S) = p(B).

A properly trained classifier y(x) approximates

p(S|x) = p(x|S)/[p(x|S) + p(x|B)]

Therefore, if the signal and background events are weighted with y(x), we should recover the signal density.
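The weighting argument follows from y(x) d(x) ≈ p(S|x) [N p(x|S) + N p(x|B)] = N p(x|S). A toy check is sketched below, assuming unit-width Gaussians centred at +1 (signal) and -1 (background), for which the exact discriminant is p(S|x) = 1/(1 + exp(-2x)). The densities, sample size, and function names are illustrative assumptions, and the exact discriminant stands in for a trained classifier y(x).

```python
import numpy as np

def p_S_given_x(x):
    """Exact Bayes discriminant for N(+1, 1) signal vs N(-1, 1) background, p(S) = p(B)."""
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def recover_signal(N=20000, seed=0):
    """Weight a mixed sample of 2N events by p(S|x) and histogram the result."""
    rng = np.random.default_rng(seed)
    x_sig = rng.normal(+1.0, 1.0, N)
    x_bkg = rng.normal(-1.0, 1.0, N)
    data = np.concatenate([x_sig, x_bkg])       # 2N events drawn from d(x)
    weights = p_S_given_x(data)
    edges = np.linspace(-5.0, 5.0, 21)
    h_weighted, _ = np.histogram(data, bins=edges, weights=weights)
    h_signal, _ = np.histogram(x_sig, bins=edges)
    return h_weighted, h_signal, weights

h_weighted, h_signal, weights = recover_signal()
print(np.round(h_weighted).astype(int))
print(h_signal)
```

The two printed histograms should agree within statistical fluctuations, and the weights should sum to roughly N, the number of signal events.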


But does it Work?

Amazingly well!


Some Open Issues

Why does this insane function p(w1,…,w641|x1,…,x500) behave so well? 641 parameters > 500 events!

How should one verify that an n-D (n ~ 14) swarm of simulated background events matches the n-D swarm of observed events (in the background region)?

How should one verify that y(x) is indeed a reasonable approximation to the Bayes discriminant, p(S|x)?


Summary

Bayesian methods have been, and are being, used with considerable success by particle physicists. Happily, the frequentist/Bayesian Cold War is abating!

The application of Bayesian methods to highly flexible functions, e.g., neural networks, is very promising and should be broadly applicable.

Needed: A powerful way to compare high-dimensional swarms of points.

Agree, or not agree, that is the question!
