Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 1

Multivariate Methods in HEP

Pushpa Bhat Fermilab

Outline

• Introduction/History
• Physics Analysis Examples
• Popular Methods
  • Likelihood Discriminants
  • Neural Networks
  • Bayesian Learning
  • Decision Trees
• Future
• Issues and Concerns
• Summary

Some History

• In 1990, most of the HEP community was skeptical of the use of multivariate methods, particularly in the case of neural networks (NN)
  • NN as a black box
    • Can’t understand weights
    • Nonlinear mapping; higher order correlations
    • Though a mathematical function, can’t explain in terms of physics
    • Can’t calculate systematic errors reliably
  • Univariate or “cut-based” analysis was the norm
• Some were pursuing application of neural network methods to HEP around 1990
  • Peterson, Lonnblad, Denby, Becks, Seixas, Lindsey, etc.
• First AIHENP (Artificial Intelligence in High Energy & Nuclear Physics) workshop was in 1990
  • Organizers included D. Perret-Gallix, K.H. Becks, R. Brun, J. Vermaseren. AIHENP metamorphosed into ACAT ten years later, in 2000
• Multivariate methods such as Fisher discriminants were in limited use
• In 1990, I began to pursue the use of multivariate methods, especially NN, in top quark searches at DØ

Mid-1990’s

• LEP experiments had been using NN and likelihood discriminants for particle-ID applications and eventually for signal searches (Steinberger; tau-ID)

• H1 at HERA successfully implemented and used NN for triggering (Kiesling).

• Hardware NN was attempted at Fermilab at CDF

• Fermilab Advanced Analysis Methods Group brought CDF and DØ together for discussion of these methods and applications in physics analyses.

The Top Quark: Post-Evidence, Pre-Discovery!

Fisher analysis of the tt̄ e channel
• One candidate event
• S/B (mt = 180 GeV) = 18 w.r.t. Z, = 10 w.r.t. WW

NN analysis of the tt̄ e+jets channel

[Figure: NN analysis distributions for tt̄ (tt160), W+jets, and data]

P. Bhat, DPF94

Cut Optimization for Top Discovery, Feb. ’95

[Figure: signal and background distributions in a 2-variable plane, showing the Jan. ’95 (Aspen) cut, the Mar. ’95 discovery cut, and contours of possible NN cuts (Feb. ’95)]

[Figure: S/B vs. signal efficiency, comparing the conventional cuts (Jan. ’95, Aspen meeting; Feb.-Mar. ’95, discovery) with the S/B reach of the 2-variable NN analysis for similar efficiency]

Neural network equi-probability contour cuts from a 2-variable analysis, compared with the conventional cuts used in Jan. ’95 and in the Observation paper

P. Bhat, H. Prosper, E. Amidi, D0 Top Marathon, Feb. ’95

Measurement of the Top Quark Mass

First significant physics result using multivariate methods

DØ Lepton+jets: Run I (1996) result with NN and likelihood
• Fit performed in 2-D: (DLB/NN, mfit)
• mt = 173.3 ± 5.6 (stat.) ± 6.2 (syst.) GeV/c²

[Figures: discriminant variables; the discriminants]

Recent (CDF+D0) mt measurement: mt = 171.4 ± 2.1 GeV/c²

Higgs, the Holy Grail of HEP: Discovery Reach at the Tevatron

• The challenges are daunting! But using NN provides same reach with a factor of 2 less luminosity w.r.t. conventional analysis

• Improved bb mass resolution & b-tag efficiency crucial

Run II Higgs study, hep-ph/0010338 (Oct. 2000)
P.C. Bhat, R. Gilmartin, H. Prosper, Phys. Rev. D 62 (2000) 074022

Then, it got easier

• One of the important steps in getting the NN accepted at the Tevatron experiments was to make the Bayesian connection.

• Another important message to drive home was “the maximal use of information in the event” for the job at hand

• Developed a random grid search technique that can be used as baseline for comparison

• Neural network methods now have become popular due to the ease of use, power and many successful applications

Maybe too easy??

Optimal Event Selection

r(x, y) = constant defines an optimal decision boundary in feature space:

$r(x, y) = \frac{p(x, y \mid s)\, p(s)}{p(x, y \mid b)\, p(b)} = \frac{p(s \mid x, y)}{p(b \mid x, y)}$

[Figure: signal (S) and background (B) in the (x, y) feature space, comparing conventional cuts at x0 and y0 with the optimal decision boundary]
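A minimal sketch of the comparison between conventional cuts and a cut on the discriminant r(x, y). Everything numerical here is an assumption made for illustration: signal and background are toy 2-D Gaussians with known densities, equal priors, and cut values x0 = y0 = 0.5. The threshold on r is chosen to give the same signal efficiency as the rectangular cuts, so that the background efficiencies can be compared directly.

```python
import numpy as np

def gauss2d_pdf(pts, mean, cov):
    """Density of a 2-D Gaussian evaluated at each row of pts."""
    diff = pts - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff))

# Toy model (assumed): same covariance, shifted means
mean_s, mean_b = np.array([1.0, 1.0]), np.array([0.0, 0.0])
cov = np.array([[1.0, -0.6], [-0.6, 1.0]])

rng = np.random.default_rng(1)
sig = rng.multivariate_normal(mean_s, cov, 20000)
bkg = rng.multivariate_normal(mean_b, cov, 20000)

def r(pts, p_s=0.5, p_b=0.5):
    """Discriminant r(x, y) = p(x, y | s) p(s) / (p(x, y | b) p(b))."""
    return gauss2d_pdf(pts, mean_s, cov) * p_s / (gauss2d_pdf(pts, mean_b, cov) * p_b)

# Conventional cuts: x > x0 and y > y0
x0, y0 = 0.5, 0.5
eff_s_cut = np.mean((sig[:, 0] > x0) & (sig[:, 1] > y0))
eff_b_cut = np.mean((bkg[:, 0] > x0) & (bkg[:, 1] > y0))

# Cut on r, with the threshold tuned to the same signal efficiency
r_threshold = np.quantile(r(sig), 1.0 - eff_s_cut)
eff_s_r = np.mean(r(sig) > r_threshold)
eff_b_r = np.mean(r(bkg) > r_threshold)

print(f"rectangular cuts: signal eff {eff_s_cut:.3f}, background eff {eff_b_cut:.4f}")
print(f"cut on r(x, y):   signal eff {eff_s_r:.3f}, background eff {eff_b_r:.4f}")
```

In a real analysis the densities are not known in closed form, which is where the methods on the following slides (likelihood discriminants, neural networks) come in as approximations to r.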

The NN-Bayesian Connection

Output of a feed forward neural network can approximate the posterior probability P(s|x1,x2).

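[Figure: network with inputs x1 and x2 and output y(x1, x2, ŵ)]

$y(x, \hat{w}) \approx p(s \mid x) = \frac{r}{1 + r}$, where $r = \frac{p(x \mid s)\, p(s)}{p(x \mid b)\, p(b)}$

More generally, for classes $C_i$:

$P(C_1 \mid x) = \frac{P(x \mid C_1)\, P(C_1)}{\sum_i P(x \mid C_i)\, P(C_i)}$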
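The connection between a trained network output and the posterior r/(1 + r) can be checked numerically. The sketch below is a toy under stated assumptions: a single sigmoid neuron stands in for the feed-forward network, signal and background are 1-D unit-width Gaussians centered at +1 and -1 with equal priors, and training is plain gradient descent on the cross-entropy with targets 1 (signal) and 0 (background). For this toy the exact posterior is itself a sigmoid, so the trained output should land close to it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
# Toy samples (assumed): signal ~ N(+1, 1), background ~ N(-1, 1), equal priors
x = np.concatenate([rng.normal(1.0, 1.0, n), rng.normal(-1.0, 1.0, n)])
t = np.concatenate([np.ones(n), np.zeros(n)])  # targets: 1 = signal, 0 = background

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single-neuron "network" y(x, w) = sigmoid(w * x + b), trained by gradient descent
w, b = 0.0, 0.0
for _ in range(2000):
    y = sigmoid(w * x + b)
    w -= 0.5 * np.mean((y - t) * x)  # gradient of the mean cross-entropy w.r.t. w
    b -= 0.5 * np.mean(y - t)        # gradient of the mean cross-entropy w.r.t. b

def exact_posterior(x):
    """p(s | x) = r / (1 + r), with r = p(x | s) p(s) / (p(x | b) p(b))."""
    r = np.exp(-0.5 * (x - 1.0) ** 2) / np.exp(-0.5 * (x + 1.0) ** 2)
    return r / (1.0 + r)

grid = np.linspace(-3.0, 3.0, 61)
max_diff = np.max(np.abs(sigmoid(w * grid + b) - exact_posterior(grid)))
print(f"w = {w:.3f}, b = {b:.3f}, max |y - p(s|x)| on grid = {max_diff:.4f}")
```

With unequal priors or unequal training sample sizes, the output approximates the posterior for the sample composition used in training, which is why the priors appear explicitly in r.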

Limitations of “Conventional NN”

• The training yields one set of weights or network parameters
  • Need to look for “best” network, but avoid overfitting

• Heuristic decisions on network architecture
  • Inputs, number of hidden nodes, etc.

• No direct way to compute uncertainties

Ensembles of Networks

[Figure: input X is fed to M networks NN1, NN2, NN3, ..., NNM, with outputs y1, y2, y3, ..., yM]

$y(x) = \sum_i a_i\, y_i(x)$

A decision made by averaging over many networks (a committee of networks) typically has lower error than the individual networks.
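A small numerical illustration of committee averaging, y(x) = Σ a_i y_i(x). The setup is an assumption made for the example: the "networks" are stand-ins that return a known target function plus independent Gaussian noise, and the weights are equal, a_i = 1/M. What is guaranteed in general is that the committee's mean squared error does not exceed the average error of its members; with independent errors, as in this toy, it is also below the best single member.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)
truth = np.sin(2.0 * np.pi * x)  # target function (assumed for the toy)

M = 10
# Stand-ins for M trained networks: truth plus independent errors
members = truth + rng.normal(0.0, 0.3, size=(M, x.size))

a = np.full(M, 1.0 / M)   # equal committee weights a_i
committee = a @ members   # y(x) = sum_i a_i y_i(x)

member_mse = np.mean((members - truth) ** 2, axis=1)
committee_mse = np.mean((committee - truth) ** 2)

print("individual MSE: min %.4f, mean %.4f" % (member_mse.min(), member_mse.mean()))
print("committee MSE:  %.4f" % committee_mse)
```

When the member errors are strongly correlated, the gain from averaging shrinks, which is one motivation for building committees from networks that differ in training sample or starting point.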

Bayesian Learning

• The result of Bayesian training is a posterior density of the network weights, P(w | training data)
• Generate a sequence of weights (network parameters) in the network parameter space, i.e., a sequence of networks. The optimal network is approximated by averaging over the last K points:

$y(x_{\mathrm{new}}) \approx \frac{1}{K} \sum_{k=1}^{K} y(x_{\mathrm{new}}, w_k)$
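The averaging over the last K sampled networks can be sketched with a very small model. Several choices below are assumptions for illustration only: the network is a single sigmoid neuron with two weights, the prior on the weights is a Gaussian of width 10, the training data are toy 1-D Gaussians, and the sampler is a simple random-walk Metropolis algorithm used as a stand-in for whatever MCMC scheme a real Bayesian neural network package would use.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
# Toy training data (assumed): signal ~ N(+1, 1), background ~ N(-1, 1)
x = np.concatenate([rng.normal(1.0, 1.0, n), rng.normal(-1.0, 1.0, n)])
t = np.concatenate([np.ones(n), np.zeros(n)])

def net(x, w):
    """Network output y(x, w) for weights w = (bias, slope)."""
    return 1.0 / (1.0 + np.exp(-(w[0] + w[1] * x)))

def log_posterior(w):
    """log P(w | training data) up to a constant: log-likelihood + log-prior."""
    y = np.clip(net(x, w), 1e-12, 1.0 - 1e-12)
    log_like = np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
    log_prior = -0.5 * np.sum(w ** 2) / 10.0 ** 2
    return log_like + log_prior

# Random-walk Metropolis: generate a sequence of weights, i.e. of networks
w = np.zeros(2)
lp = log_posterior(w)
chain = []
for _ in range(5000):
    proposal = w + rng.normal(0.0, 0.2, size=2)
    lp_proposal = log_posterior(proposal)
    if np.log(rng.uniform()) < lp_proposal - lp:  # accept or reject
        w, lp = proposal, lp_proposal
    chain.append(w.copy())
chain = np.array(chain)

K = 1000
last_k = chain[-K:]  # keep only the last K points

def bayes_predict(x_new):
    """y(x_new) ~ (1/K) * sum_k y(x_new, w_k)."""
    x_new = np.atleast_1d(x_new)
    return np.mean([net(x_new, wk) for wk in last_k], axis=0)

print(bayes_predict(np.array([-2.0, 0.0, 2.0])))
```

The spread of net(x_new, w_k) over the K points, not shown here, gives a handle on the uncertainty of the prediction, which a single trained network does not provide.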

Bayesian Learning – 2

• Advantages
  • Less prone to over-fitting
  • Less need to optimize the size of the network. Can use a large network! Indeed, number of weights can be greater than number of training events!
  • In principle, provides best estimate of p(t|x)

• Disadvantages
  • Computationally demanding!
    • The dimensionality of the parameter space is, typically, large
    • There could be multiple maxima in the likelihood function p(t|x,w), or, equivalently, multiple minima in the error function E(x,w).

Example: Single Top Search

• Training Data
  • 2000 events (1000 tqb̄ + 1000 Wbb̄)
  • Standard set of 11 variables

• Network
  • (11, 30, 1) Network (391 parameters!)

• Markov Chain Monte Carlo (MCMC)
  • 500 iterations, but use last 100 iterations
  • 20 MCMC steps per iteration
  • NN-parameters stored after each iteration
  • 10,000 steps
  • ~1000 steps/hour (on 1 GHz Pentium III laptop)
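The numbers quoted for this example can be cross-checked with a few lines of arithmetic. The only assumption is the usual convention that every hidden and output node carries one bias term in addition to its input weights; the helper name `mlp_parameter_count` is introduced here for illustration.

```python
def mlp_parameter_count(layers):
    """Weights plus one bias per node, for fully connected layers of the given sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

n_params = mlp_parameter_count((11, 30, 1))  # (11 + 1) * 30 + (30 + 1) * 1

iterations = 500
steps_per_iteration = 20
total_steps = iterations * steps_per_iteration  # MCMC steps in the full chain

steps_per_hour = 1000
hours = total_steps / steps_per_hour  # rough running time at the quoted speed

print(n_params, total_steps, hours)
```

The parameter count is small compared with the 2000 training events here, but the same arithmetic shows how quickly it grows with the number of hidden nodes.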

Example: Single Top Search

[Figure: distributions for signal (tqb) and background (Wbb)]

Decision Trees

• Recover events that fail criteria in cut-based analyses
• Start at first “node” with a fraction of the “training sample”
• Select best variable and cut with best separation to produce two “branches” of events, (F)ailed and (P)assed cut
• Repeat recursively on successive nodes
• Stop when improvement stops or when too few events are left
• Terminal node is called a “leaf” with purity = Ns/(Ns+Nb)
• Run remaining events and data through the tree to derive results
• Boosting DT:
  • Boosting is a recently developed technique that improves any weak classifier (decision tree, neural network, etc.)
  • Boosting averages the results of many trees, dilutes the discrete nature of the output, improves the performance

[Figure: DØ single top analysis]
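The tree-growing recipe above can be sketched in a few dozen lines. This is a toy under stated assumptions, not the DØ implementation: the separation measure is taken to be the decrease in the Gini index, the stopping rule is a maximum depth plus a minimum number of events per node, the data are toy 2-D Gaussians, and boosting is not included. The function names are introduced here for illustration.

```python
import numpy as np

def purity(labels):
    """Leaf purity Ns / (Ns + Nb); labels are 1 for signal, 0 for background."""
    return float(np.mean(labels)) if len(labels) else 0.0

def gini(labels):
    """Gini index of a node, weighted by its number of events."""
    p = purity(labels)
    return len(labels) * p * (1.0 - p)

def best_split(X, labels):
    """Return (variable index, cut value, Gini decrease) of the best single cut."""
    best = (None, None, 0.0)
    parent = gini(labels)
    for j in range(X.shape[1]):
        for cut in np.unique(X[:, j])[:-1]:
            passed = X[:, j] > cut
            gain = parent - gini(labels[passed]) - gini(labels[~passed])
            if gain > best[2]:
                best = (j, cut, gain)
    return best

def grow(X, labels, depth=0, max_depth=3, min_events=20):
    """Recursively split nodes into (P)assed and (F)ailed branches."""
    leaf = {"purity": purity(labels), "n": len(labels)}
    if depth == max_depth or len(labels) < min_events:
        return leaf
    j, cut, gain = best_split(X, labels)
    if j is None:  # no cut improves the separation
        return leaf
    passed = X[:, j] > cut
    return {"var": j, "cut": cut,
            "P": grow(X[passed], labels[passed], depth + 1, max_depth, min_events),
            "F": grow(X[~passed], labels[~passed], depth + 1, max_depth, min_events)}

def leaf_purity(tree, event):
    """Run one event through the tree and return the purity of its leaf."""
    while "var" in tree:
        tree = tree["P"] if event[tree["var"]] > tree["cut"] else tree["F"]
    return tree["purity"]

# Toy data (assumed): signal ~ N(+1, 1) and background ~ N(-1, 1) in two variables
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(1.0, 1.0, (200, 2)), rng.normal(-1.0, 1.0, (200, 2))])
labels = np.concatenate([np.ones(200), np.zeros(200)])

order = rng.permutation(len(labels))
train, rest = order[:200], order[200:]  # grow on a fraction, test on the remainder
tree = grow(X[train], labels[train])

correct = [(leaf_purity(tree, e) > 0.5) == (l == 1.0) for e, l in zip(X[rest], labels[rest])]
accuracy = float(np.mean(correct))
print("fraction of remaining events classified correctly:", accuracy)
```

The output of a single tree takes only as many distinct values as it has leaves, which is the discreteness that averaging over many boosted trees dilutes.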

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 20

Matrix Element Method
Example: Top mass measurement

• Maximal use of information in each event by calculating event-by-event signal and background probabilities based on the respective matrix element
  • x: reconstructed kinematic variables of final state objects
  • JES: jet energy scale from MW constraint

• Signal and background probabilities from differential cross sections

• Write combined likelihood for all events

• Maximize likelihood w.r.t. mtop, JES
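The steps above can be written compactly as follows. This is a sketch of one commonly used form, not a transcription of the slide: the two-component mixture, the signal fraction f, and the symbols P_evt, P_sig, and P_bkg are assumptions introduced here, and the normalization and detector-response details that enter the per-event probabilities are omitted.

```latex
% Per-event probability: mixture of signal and background terms (f is an assumed signal fraction)
P_{\mathrm{evt}}(x;\, m_{\mathrm{top}}, \mathrm{JES})
  = f \, P_{\mathrm{sig}}(x;\, m_{\mathrm{top}}, \mathrm{JES})
  + (1 - f) \, P_{\mathrm{bkg}}(x;\, \mathrm{JES})

% Combined likelihood over all N selected events
L(m_{\mathrm{top}}, \mathrm{JES}) = \prod_{i=1}^{N} P_{\mathrm{evt}}(x_i;\, m_{\mathrm{top}}, \mathrm{JES})

% Estimates from maximizing the likelihood
(\hat{m}_{\mathrm{top}}, \widehat{\mathrm{JES}})
  = \arg\max_{m_{\mathrm{top}},\, \mathrm{JES}} \; \ln L(m_{\mathrm{top}}, \mathrm{JES})
```

Here P_sig and P_bkg stand for the probabilities built from the signal and background differential cross sections, evaluated at the reconstructed variables x of each event.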

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 21

Summary

• Multivariate methods are now used extensively in HEP data analysis

• Neural networks, because of their ease of use and power, are favorites for particle-ID and signal/background discrimination

• Bayesian neural networks take us one step closer to optimization

• Likelihood discriminants and Decision trees are becoming popular because they are easier to “defend” (no “black-box” stigma)

• Many issues remain to be addressed as we get ready to deploy the multivariate methods for discoveries in HEP

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 22

Nothing tends so much to the advancement of knowledge as the application of a new instrument - Humphry Davy

No amount of experimentation can ever prove me right; a single experiment can prove me wrong. - Albert Einstein

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 23

World’s Highest Energy Laboratory (for now)

[Figure labels: CDF, DØ, Booster]

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 24

Our Fancy New Toys

The Large Hadron Collider
• Circumference = 27 km
• Beam Energy = 7.7 TeV
• Luminosity = 1.65 x 10^34 cm^-2 sec^-1
• Startup date: 2007

[Figure labels: LHC Ring, SPS Ring, PS, TI 2, TI 8, CMS, p p; photos of an LHC magnet and the LHC tunnel]

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 25

LHC Environment

14 TeV proton-proton colliding beams

Parameter                            Value
Bunch-crossing frequency             40 MHz
Average # of collisions / crossing   20
“Interaction rate”                   ~10^9
Average # of charged tracks          1000
Radiation field                      severe

CMS Parameter                        Value
Level-1 trigger rate                 100 kHz
Mean time between triggers           10 μs
Trigger latency                      3.2 μs
Solenoid field                       4 T
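The entries in these tables are related by simple arithmetic, sketched below using only the quoted values. The interaction rate is taken in interactions per second, and the number of bunch crossings that occur during the trigger latency is an extra derived quantity not shown on the slide.

```python
bunch_crossing_hz = 40e6       # bunch-crossing frequency, 40 MHz
collisions_per_crossing = 20   # average number of collisions per crossing

# Interactions per second: 8e8, i.e. of order 10^9
interaction_rate_hz = bunch_crossing_hz * collisions_per_crossing

l1_rate_hz = 100e3             # Level-1 trigger rate, 100 kHz
mean_time_between_triggers_us = 1e6 / l1_rate_hz  # microseconds

latency_us = 3.2               # Level-1 trigger latency in microseconds
crossings_during_latency = latency_us * 1e-6 * bunch_crossing_hz

print(interaction_rate_hz, mean_time_between_triggers_us, crossings_during_latency)
```

The last number indicates how many crossings' worth of data the front-end electronics must buffer while the Level-1 decision is being made.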

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 26

CMS Silicon Tracker

Challenges

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 27

CMS Si Tracker

[Figure: tracker layout with dimensions 5.4 m and 2.4 m, showing the Pixels, Inner Barrel & Disks (TIB & TID), and Outer Barrel (TOB)]

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 28

Lots of Silicon

214 m² of silicon sensors
11.4 million silicon strips
66 million pixels!

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 29

Si Tracker Challenges

• Large and complex system
  • 77.4 million total channels (out of a total of 78.2 M for experiment)
  • Detector monitoring, data organization, data quality monitoring, analysis, visualization, interpretation all daunting!

• Need to monitor every channel and make sure most of the detector is working at all times (live fraction of the detector and efficiencies bound to decrease with time)

• Need to verify data integrity and data quality for physics
  • Diagnose and fix problems ASAP
  • Keep calibration and alignment parameters current

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 30

Detector/Data Monitoring

• Monitor
  • Environmental variables
    • Temperatures, coolant flow rates, interlocks, radiation doses
  • Hardware status
    • Voltages, currents
  • Channel data
    • Readout states, errors, missing data/channels, bad ID for channel/module
    • Many kinds of problems, to be categorized, tracked, and displayed
    • Should be able to find rare problems/errors (with low occurrence rate) that may corrupt data (rare problems may indicate a developing failure mode or hidden bad behavior)

• Correlate problem/noisy channels with history, temperature, currents, etc.

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 31

Data Quality Monitoring

• Monitor
  • Raw data
    • Pedestals, noise, ADC counts, occupancies, efficiencies
  • Processed high level objects
    • Clusters, tracks, etc.
• Evaluate thousands of histograms
  • Can’t visually examine all
  • Automatically evaluate histograms by comparing to reference histograms
  • Adaptive, efficient, find evolving patterns over time
    • Quantiles? q-q plots/comparison instead of KS test?
• A variety of 2D “heat” maps
  • Occupancies, # of bad channels/module, # of errors/module, etc.
• Typical occupancy ~ 2% in strip tracker
  • 200,000 channels written out 100 times/sec
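A sketch of automated comparison against a reference, showing the two options raised above side by side: a two-sample KS statistic and a quantile (q-q style) comparison. All specifics are assumptions for illustration: the monitored quantity is a toy Gaussian "pedestal" distribution, the function names are invented here, and the alarm thresholds are arbitrary rather than tuned values.

```python
import numpy as np

def ks_statistic(sample, reference):
    """Two-sample KS statistic: largest distance between the empirical CDFs."""
    grid = np.sort(np.concatenate([sample, reference]))
    cdf_s = np.searchsorted(np.sort(sample), grid, side="right") / len(sample)
    cdf_r = np.searchsorted(np.sort(reference), grid, side="right") / len(reference)
    return float(np.max(np.abs(cdf_s - cdf_r)))

def quantile_shift(sample, reference, probs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Largest q-q difference, in units of the reference interquartile range."""
    q_s = np.quantile(sample, probs)
    q_r = np.quantile(reference, probs)
    iqr = np.quantile(reference, 0.75) - np.quantile(reference, 0.25)
    return float(np.max(np.abs(q_s - q_r)) / iqr)

def flag(sample, reference, ks_max=0.05, shift_max=0.2):
    """True if the sample deviates from the reference by either measure."""
    return (ks_statistic(sample, reference) > ks_max
            or quantile_shift(sample, reference) > shift_max)

# Toy distributions (assumed numbers): a reference run, a good run, a shifted run
rng = np.random.default_rng(4)
reference = rng.normal(100.0, 5.0, 5000)
good_run = rng.normal(100.0, 5.0, 5000)
shifted_run = rng.normal(103.0, 5.0, 5000)

print("good run flagged:   ", flag(good_run, reference))
print("shifted run flagged:", flag(shifted_run, reference))
```

A quantile comparison has the practical advantage that it reports which part of the distribution moved and by how much, whereas the KS statistic returns a single number.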

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 32

Module Assembly Precision

[Figure: example of a “heat” map]

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 33

Need smart approaches

• What are the best techniques for data-mining?
  • To organize data for analysis and data visualization
    • Complex geometry/addressing makes visualization difficult
  • For finding problematic channels quickly, efficiently
    • Clustering, exploratory data-mining
  • For finding anomalies, corrupt data, patterns of behavior
    • Feature-finding algorithms, superpose many events, time evolution, spatial and temporal correlations

• Noise correlations
  • Via correlation coefficients of defined groups
  • Correlate to history (time variations), environmental variables
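Noise correlations within defined groups can be computed directly from a channels-by-events noise matrix. The sketch below is a toy under stated assumptions: eight channels, of which the first four share a common-mode noise source (for instance, one readout chip), with all noise drawn from unit Gaussians. The helper `mean_offdiag` is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n_events, n_channels = 2000, 8

# Independent noise per channel, plus a common-mode term on channels 0-3
noise = rng.normal(0.0, 1.0, (n_events, n_channels))
common_mode = rng.normal(0.0, 1.0, n_events)
noise[:, :4] += common_mode[:, None]

# Channel-by-channel correlation coefficients (columns are channels)
corr = np.corrcoef(noise, rowvar=False)

def mean_offdiag(block):
    """Average correlation coefficient over distinct channel pairs in a group."""
    n = block.shape[0]
    return float((block.sum() - np.trace(block)) / (n * (n - 1)))

group_a = mean_offdiag(corr[:4, :4])  # channels sharing the common-mode source
group_b = mean_offdiag(corr[4:, 4:])  # independent channels

print(f"mean correlation, group A: {group_a:.3f}; group B: {group_b:.3f}")
```

The same function applied to time-ordered slices of the data, or with a temperature reading as an extra column, gives the correlation with history and environmental variables mentioned above.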

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 34

Data Visualization

• Based on hierarchical/geometrical structure of the tracker
• Display every channel, attach objects/info to each
  • Sub-structures
  • Layers/rings
  • Modules
  • Readout chips

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 35

Multivariate Analysis Issues

• Dimensionality Reduction
  • Choosing Variables optimally without losing information
• Choosing the right method for the problem
• Controlling Model Complexity
• Testing Convergence
• Validation
  • Given a limited sample, what is the best way?
• Computational Efficiency

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 36

Multivariate Analysis Issues

• Correctness of modeling
  • How do we make sure the multivariate modeling is correct?
    • The data used for training or building PDEs represent reality.
    • Is it sufficient to check the modeling in the mapped variable? Pair-wise correlations? Higher order correlations?
  • How do we show that the background is modeled well? How do we quantify the correctness of modeling?
    • In conventional analysis, we normally look for variables that are well modeled in order to apply cuts
  • How well is the background modeled in the signal region?

• Worries about hidden bias
• Worries about underestimating errors

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 37

Sociological Issues

• We have been conservative in the use of MV methods for discovery.

• We have been more aggressive in the use of MV methods for setting limits.

• But discovery is more important and needs all the power you can muster!

• This is expected to change at LHC.

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 38

Summary

• The next generation of experiments will need to adopt advanced data mining and data analysis techniques

• Conventional/routine tasks such as alignment, detector performance and data quality monitoring and data visualization will be challenging and require new approaches

• Many issues regarding use of multivariate methods of data analysis for discoveries and measurements need to be addressed to make optimal use of data

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 39

MV: Where can we use them?

• Almost everywhere, since HEP events are multivariate
• Improve several aspects of analysis
  • Event selection
    • Triggering, Real-time Filters, Data Streaming
  • Event reconstruction
    • Tracking/vertexing, particle ID
  • Signal/Background Discrimination
    • Higgs discovery, SUSY discovery, Single top, …
  • Functional Approximation
    • Jet energy corrections, tag rates, fake rates
  • Parameter estimation
    • Top quark mass, Higgs mass, SUSY model parameters
  • Data Exploration
    • Knowledge Discovery via data-mining
    • Data-driven extraction of information, latent structure analysis