
Multivariate Analysis with TMVA

Basic Concept and New Features

Helge Voss(*) (MPI–K, Heidelberg)

LHCb MVA Workshop, CERN, July 10, 2009

(*) On behalf of the author team: A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. v. Toerne, H. Voss

And the contributors: See acknowledgments on page xx

On the web: http://tmva.sf.net/ (home), https://twiki.cern.ch/twiki/bin/view/TMVA/WebHome (tutorial)

Event Classification

Suppose a data sample with two types of events, carrying the class labels Signal and Background. (We restrict ourselves here to the two-class case; many classifiers can in principle be extended to several classes, and otherwise analyses can be staged.)

We have discriminating variables x1, x2, ... How do we set the decision boundary to select events of type S?

[Figure: S and B events in the (x1, x2) plane with three candidate decision boundaries: rectangular cuts, a linear boundary, a nonlinear one]

Rectangular cuts? A linear boundary? A nonlinear one? How can we decide?

Once we have decided on a class of boundaries, how do we find the "optimal" one?


Regression In High Energy Physics

Most HEP event reconstruction requires some sort of "function estimate" for which we have no analytical expression:

energy deposit in the calorimeter: shower profile → calibration

entry location of the particle in the calorimeter

distance between two overlapping photons in the calorimeter

... you can certainly think of more.

Maybe you could even imagine some final physics parameter that needs to be fitted to your event data, where you would rather use a non-analytic fit function from Monte Carlo events than an analytic parametrisation of the theory:

the reweighting method used in the LEP W-mass analysis ...

Maybe your application can be successfully learned by a machine using Monte Carlo events?

(somewhat wild guessing; it might not always be reasonable to do)


Regression

[Figure: three sets of measurements f(x) vs. x, suggesting a constant, a linear, and a nonlinear functional form]

Constant? Linear? Nonlinear?

How do we estimate a "functional behaviour" from a given set of "known measurements"?

Assume for example "D" variables that somehow characterise the shower in your calorimeter.

From Monte Carlo or test-beam measurements, you have a data sample of events in which all D variables x are measured together with the corresponding particle energy.

The particle energy as a function of the D observables x is a surface in the (D+1)-dimensional space spanned by the D observables plus the energy.

Seems trivial? Maybe... the human eye (and the brain behind it) has very good pattern-recognition capabilities!

But what if you have more variables?


Event Classification: Why MVA?

What do you think: is this event more "likely" signal or background?

Assume 4 uncorrelated Gaussians:


Event Classification

Each event, whether Signal or Background, has D measured variables spanning the D-dimensional "feature space".

Find a mapping y(x) from the D-dimensional input/observable/"feature" space to a one-dimensional output, i.e. to the class labels.

Most general form: y = y(\mathbf{x}), with \mathbf{x} = \{x_1, \ldots, x_D\} the input variables and y: \mathbb{R}^D \to \mathbb{R}.

If one histograms the resulting y(x) values, one obtains the signal and background distributions shown on the next slide.

Who sees what this would look like for the regression problem?


Event Classification

Each event, whether Signal or Background, has D measured variables spanning the D-dimensional "feature space".

y(x): a "test statistic" in the D-dimensional space of input variables, y: \mathbb{R}^D \to \mathbb{R}, chosen such that y(B) → 0 and y(S) → 1.

y(x) = const defines a surface in feature space: the decision boundary.

y(x) is used to set the selection cut: y > c_cut: signal; y = c_cut: decision boundary; y < c_cut: background.

The distributions of y(x) for signal and background, PDF_S(y) and PDF_B(y): their overlap determines the separation power, and the achievable efficiency and purity.


MVA and Machine Learning

The previous slide was basically the idea of "Multivariate Analysis" (MVA). (Remark: what about "standard cuts"?)

Finding y(x): \mathbb{R}^D \to \mathbb{R} ("basically" the same for regression and classification), given a certain model class for y(x),

in an automatic way, using "known" or "previously solved" events, i.e. learning from known "patterns",

such that the resulting y(x) has good generalisation properties when applied to "unknown" events.

That is what the "machine" is supposed to be doing: supervised machine learning.

Of course there is no magic; we still need to:

choose the discriminating variables

choose the class of models (linear, non-linear, flexible or less flexible)

tune the "learning parameters" (bias vs. variance trade-off)

check the generalisation properties

consider the trade-off between statistical and systematic uncertainties


Overview/Outline

Idea: rather than just implementing new MVA techniques and making them available in ROOT (as, e.g., TMultiLayerPerceptron does), provide:

one common platform/interface for all MVA classification and regression algorithms

easy use of, and switching between, classifiers

(common) data pre-processing capabilities

training/testing of all classifiers on the same data sample and consistent evaluation

a common analysis, application and evaluation framework (ROOT macros, C++ executables, Python scripts, ...)

application of the trained classifier/regressor using libTMVA.so and the corresponding "weight files", or standalone C++ code (the latter is currently disabled in TMVA 4.0.1); a minimal usage sketch follows below.
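A hedged end-to-end sketch of this common interface, loosely following the TMVA 4 tutorial macros; the file, tree and variable names (tmva_example.root, TreeS, TreeB, var1, var2) are placeholders, and the option strings are abbreviated:

```cpp
#include "TFile.h"
#include "TTree.h"
#include "TMVA/Factory.h"
#include "TMVA/Reader.h"
#include "TMVA/Types.h"

void train() {
   TFile* input   = TFile::Open("tmva_example.root");   // hypothetical input file
   TTree* sigTree = (TTree*)input->Get("TreeS");
   TTree* bkgTree = (TTree*)input->Get("TreeB");
   TFile* outFile = TFile::Open("TMVA.root", "RECREATE");

   // One factory serves all booked methods: same data, consistent evaluation
   TMVA::Factory factory("TMVAClassification", outFile, "!V:AnalysisType=Classification");
   factory.AddVariable("var1", 'F');
   factory.AddVariable("var2", 'F');
   factory.AddSignalTree(sigTree, 1.0);                  // 1.0 = global event weight
   factory.AddBackgroundTree(bkgTree, 1.0);
   factory.PrepareTrainingAndTestTree("", "SplitMode=Random:!V");

   // Book several classifiers; switching methods is a one-line change
   factory.BookMethod(TMVA::Types::kBDT, "BDT", "NTrees=400");
   factory.BookMethod(TMVA::Types::kMLP, "MLP", "HiddenLayers=N+1");

   factory.TrainAllMethods();     // train all booked methods on the same sample
   factory.TestAllMethods();      // apply them to the test sample
   factory.EvaluateAllMethods();  // fill the comparison tables and plots
   outFile->Close();
}

void apply() {
   // Application via libTMVA.so and the weight file written during training
   TMVA::Reader reader("!Color:!Silent");
   Float_t var1, var2;
   reader.AddVariable("var1", &var1);  // same names and order as in training
   reader.AddVariable("var2", &var2);
   reader.BookMVA("BDT method", "weights/TMVAClassification_BDT.weights.xml");
   var1 = 1.2f; var2 = -0.4f;          // in practice: fill inside your event loop
   Double_t mvaOut = reader.EvaluateMVA("BDT method");
   (void)mvaOut;
}
```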

Outline

The TMVA project

Survey of the available classifiers/regressors

Toy examples

General considerations and systematics


TMVA Development and Distribution

TMVA is a SourceForge (SF) package for world-wide access:

Home page: http://tmva.sf.net/

SF project page: http://sf.net/projects/tmva

View SVN: http://tmva.svn.sourceforge.net/viewvc/tmva/trunc/TMVA/

Mailing list: http://sf.net/mail/?group_id=152074

Tutorial TWiki: https://twiki.cern.ch/twiki/bin/view/TMVA/WebHome

Active project

Currently 6 core developers, and 42 registered contributors at SF

>4864 downloads since March 2006 (not counting CVS/SVN checkouts and ROOT users)

Current version: 4.0.1 (29 Jun 2009)

We try to respond quickly to user requests, bug fixes, etc.

Integrated and distributed with ROOT since ROOT v5.11/03 (newest version 4.0.1 in ROOT production release 5.23)


New (and future) TMVA Developments

MV-Regression

Code re-design allowing for:

boosting of any classifier (not only decision trees)

combination of arbitrary pre-processing algorithms

classifier training in multiple regions

multi-class classifiers (will slowly creep into some classifiers...)

cross-validation

New MVA-Technique

PDE-Foam (classification + regression)

Data preprocessing

Gauss-transformation

arbitrary combination of preprocessing

Numerous small add-ons etc..


The TMVA Classifiers (Regressors)

Currently implemented classifiers:

Rectangular cut optimisation

Projective (1D) likelihood estimator

Multidimensional likelihood estimators (PDE range search, PDE-Foam, k-nearest neighbour; also as regressors)

Linear discriminant analysis (Fisher, H-Matrix, general linear; the latter also as regressor)

Function discriminant (also as regressor)

Artificial neural networks (3 multilayer-perceptron implementations; MLP also as regressor, with multi-dimensional target)

Boosted/bagged/randomised decision trees (also as regressor)

Rule ensemble fitting

Support vector machine

(Plugin mechanism for any other MVA classifier, e.g. NeuroBayes)

Currently implemented data preprocessing:

Linear decorrelation

Principal component analysis

Gauss transformation

Preprocessing is requested per method via a booking option, as sketched below.
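A hedged sketch, continuing the factory example above, of how a preprocessing transformation is attached to a method at booking time; the VarTransform option and its long names ("Decorrelate", "PCA", "Gauss") are quoted from memory of the TMVA manual, so treat them as assumptions:

```cpp
// Same classifier, different preprocessing: booked side by side for comparison
factory.BookMethod(TMVA::Types::kLikelihood, "Likelihood",    "!V");
factory.BookMethod(TMVA::Types::kLikelihood, "LikelihoodD",   "!V:VarTransform=Decorrelate");
factory.BookMethod(TMVA::Types::kLikelihood, "LikelihoodPCA", "!V:VarTransform=PCA");
factory.BookMethod(TMVA::Types::kFisher,     "FisherG",       "!V:VarTransform=Gauss");
```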


Example for Data Preprocessing: Decorrelation

Removal of linear correlations by rotating the input variables, commonly realised for all classifiers in TMVA (centrally in the DataSet class):

using the "square root" of the correlation matrix (spelled out below)

using Principal Component Analysis

Note that decorrelation is only complete if the correlations are linear and the input variables are Gaussian distributed.

[Figure: scatter plots of the input variables: original vs. SQRT-decorrelated vs. PCA-decorrelated]
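In formulas, the square-root variant amounts to the following standard construction (sketched here for reference):

```latex
% Diagonalise the symmetric covariance matrix C = S D S^T (S orthogonal,
% D diagonal), so that C^{1/2} = S \sqrt{D}\, S^T, and rotate/scale the inputs:
\mathbf{x} \;\longmapsto\; \mathbf{x}' = \bigl(C^{1/2}\bigr)^{-1}\mathbf{x},
\qquad \operatorname{cov}\bigl(x'_i,\, x'_j\bigr) = \delta_{ij}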


“Gaussian-isation”

Improve decorrelation by pre-Gaussianisation of the variables.

First: transformation to achieve a uniform (flat) distribution, for each event and variable k:

x_k^{\rm flat}(i_{\rm event}) = \int_{-\infty}^{x_k(i_{\rm event})} p_k(x)\,dx

(x_k^{\rm flat}: transformed variable k; x_k(i_{\rm event}): measured value; p_k: PDF of variable k)

Second: make it Gaussian via the inverse error function:

x_k^{\rm Gauss}(i_{\rm event}) = \sqrt{2}\;{\rm erf}^{-1}\!\left(2\,x_k^{\rm flat}(i_{\rm event}) - 1\right),
\qquad {\rm erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt

The integral can be solved in an unbinned way by event counting, or by creating non-parametric PDFs (see the likelihood section later). A toy code sketch follows.

Third: decorrelate (and "iterate" this procedure).
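A toy sketch of the two steps (not the TMVA implementation): the flat variable is obtained from the empirical CDF by event counting, then mapped to a Gaussian with ROOT's TMath::ErfInverse; ties and PDF smoothing are ignored for brevity.

```cpp
#include <algorithm>
#include <vector>
#include "TMath.h"

std::vector<double> gaussianise(const std::vector<double>& x) {
   std::vector<double> sorted(x);
   std::sort(sorted.begin(), sorted.end());
   std::vector<double> out(x.size());
   const double n = static_cast<double>(x.size());
   for (size_t i = 0; i < x.size(); ++i) {
      // empirical CDF: fraction of events below x_i  ->  flat in [0, 1]
      double flat = (std::lower_bound(sorted.begin(), sorted.end(), x[i])
                     - sorted.begin() + 0.5) / n;
      // second step: map the flat variable to a standard Gaussian
      out[i] = TMath::Sqrt2() * TMath::ErfInverse(2.0 * flat - 1.0);
   }
   return out;
}
```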


"Gaussian-isation"

[Figure: input-variable distributions: original vs. signal-Gaussianised vs. background-Gaussianised]

We cannot simultaneously Gaussianise both signal and background!


Pre-processing

The decorrelation/Gauss transform can be determined from either: signal + background together,

signal only, or

background only.

For all but the "likelihood" methods (1D or multi-dimensional) we need to decide on one of these (TMVA default: signal + background).

For "likelihood" methods, where we test the "signal" hypothesis against the "background" hypothesis, BOTH are used.


Rectangular Cut Optimisation

Simplest method: individual cuts on each variable

cuts out a single rectangular volume from the variable space

Technical challenge: how to find the optimal cuts? MINUIT fails due to the non-unique solution space.

TMVA uses: Monte Carlo sampling, Genetic Algorithm, Simulated Annealing

Huge speed improvement of the volume search by sorting events in a binary tree

Cuts usually benefit from prior decorrelation of the cut variables


Projective (1D) Likelihood Estimator (PDE Approach)

Probability density estimators for each input variable,

combined into a likelihood estimator (ignoring correlations); see the formula below.

Optimal approach if the correlations are zero (or after linear decorrelation)

Otherwise: significant performance loss

Technical challenge: how to estimate the PDF shapes. Three ways:

parametric fitting (function): difficult to automate for arbitrary PDFs

non-parametric fitting: easy to automate, but can create artefacts or suppress information

event counting: automatic and unbiased, but suboptimal

TMVA uses binned shape interpolation using spline functions,

or unbinned adaptive Gaussian kernel density estimation.

TMVA performs automatic validation of the goodness of fit.
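The combination referred to above is the projective likelihood ratio; a sketch of its usual definition, with p_k^{S,B} the per-variable PDFs:

```latex
y_{\mathcal{L}}(i) \;=\;
  \frac{\prod_{k=1}^{D} p_k^{S}\bigl(x_k(i)\bigr)}
       {\prod_{k=1}^{D} p_k^{S}\bigl(x_k(i)\bigr) \;+\;
        \prod_{k=1}^{D} p_k^{B}\bigl(x_k(i)\bigr)}
```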


Multidimensional PDE Approach

Use a single PDF per event class (signal, background) that spans all N_var dimensions.

Two different implementations of the same idea:

PDE Range-Search: count the number of signal and background events in the "vicinity" of the test event; a preset or adaptive volume V defines the "vicinity" (Carli-Koblitz, NIM A501, 576 (2003)).

[Figure: H0/H1 events in the (x1, x2) plane with a box volume around the test event; in this example y_PDERS(i_event, V) = 0.86]

Improve simple event counting by the use of kernel functions that penalise large distances from the test event.

Increase the counting speed with a binary search tree.

Genuine k-NN algorithm (author R. Ospanov):

intrinsically adaptive approach

very fast search with kd-tree event sorting; a brute-force sketch of the idea follows below.

Regression: the function value is taken as the distance-weighted average over the "volume".

PDE-Foam: self-adaptive phase-space binning, dividing the space into hypercubes used as "volumes"...
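A brute-force illustration of the k-NN response (the signal fraction among the k closest training events). TMVA's implementation uses a kd-tree instead of this linear scan, and the struct/function names here are hypothetical:

```cpp
#include <algorithm>
#include <vector>

struct Event { std::vector<double> x; bool isSignal; };

// assumes k <= train.size(); Euclidean metric in the raw variables
double knnResponse(const std::vector<Event>& train, const Event& test, size_t k) {
   std::vector<std::pair<double, bool> > dist;            // (distance^2, isSignal)
   for (size_t i = 0; i < train.size(); ++i) {
      double d2 = 0.0;
      for (size_t j = 0; j < test.x.size(); ++j) {
         double d = train[i].x[j] - test.x[j];
         d2 += d * d;
      }
      dist.push_back(std::make_pair(d2, train[i].isSignal));
   }
   std::partial_sort(dist.begin(), dist.begin() + k, dist.end()); // k closest events
   size_t nSig = 0;
   for (size_t i = 0; i < k; ++i) if (dist[i].second) ++nSig;
   return double(nSig) / double(k);                       // signal fraction in the vicinity
}
```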


Fisher’s Linear Discriminant Analysis (LDA)

LDA determines an axis in the input-variable hyperspace such that a projection of events onto this axis pushes signal and background as far away from each other as possible.

Classifier response couldn't be simpler:

y_{\rm Fi}(i_{\rm event}) = F_0 + \sum_{k=1}^{N_{\rm var}} F_k\, x_k(i_{\rm event})

where the F_k are the "Fisher coefficients", which define the projection axis (a closed form is given below).

Well known, simple and elegant classifier
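For reference, the Fisher coefficients have a closed form; sketched here up to an overall normalisation, with W the sum of the signal and background covariance matrices (the "within-class" matrix) and \bar{x}_{S,B} the class means:

```latex
F_k \;\propto\; \sum_{l=1}^{N_{\mathrm{var}}} W_{kl}^{-1}
  \bigl(\bar{x}_{S,l} - \bar{x}_{B,l}\bigr)
```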

Function discriminant analysis (FDA)

Fit any user-defined function of the input variables, requiring that signal events return 1 and background events 0.

Parameter fitting: Genetic Algorithm, MINUIT, MC, and combinations thereof.

Easy reproduction of the Fisher result, but nonlinearities can be added.

Very transparent discriminator.


Nonlinear Analysis: Artificial Neural Networks

Achieve a nonlinear classifier response by "activating" the output nodes in a nonlinear way, using the "activation" function

A(x) = \frac{1}{1 + e^{-x}}

Feed-forward Multilayer Perceptron: 1 input layer (the N_var discriminating input variables, x^{(0)}_{1..N_{\rm var}}), k hidden layers, 1 output layer (2 output classes, signal and background, x^{(k+1)}_{1,2}), with

x^{(k)}_j = A\!\left( w^{(k)}_{0j} + \sum_{i=1}^{M_{k-1}} w^{(k)}_{ij}\, x^{(k-1)}_i \right)

[Figure: network architecture with nodes 1..M_k per layer and weights w_ij]

Three different implementations in TMVA (all are multilayer perceptrons):

MLP: TMVA's own MLP implementation, for increased speed and flexibility

TMlpANN: interface to ROOT's MLP implementation

CFMlpANN: ALEPH's Higgs-search ANN, translated from FORTRAN

Regression: uses the real-valued output of ONE node; a forward-pass sketch follows below.
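A minimal sketch of the forward pass for a single hidden layer, matching the formula above (illustrative only; the weight layout is an assumption, with index 0 holding the bias):

```cpp
#include <cmath>
#include <vector>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); } // A(x) = 1/(1+e^-x)

// hidden[j][i]: weight from input i to hidden node j (i = 0 is the bias w0j)
// output[j]   : weight from hidden node j to the single output (j = 0 is the bias)
double mlpResponse(const std::vector<double>& input,
                   const std::vector<std::vector<double> >& hidden,
                   const std::vector<double>& output) {
   double y = output[0];                                  // output-node bias
   for (size_t j = 0; j < hidden.size(); ++j) {
      double a = hidden[j][0];                            // hidden-node bias w0j
      for (size_t i = 0; i < input.size(); ++i)
         a += hidden[j][i + 1] * input[i];
      y += output[j + 1] * sigmoid(a);                    // nonlinear activation
   }
   return y;                                              // real-valued node output
}
```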


Boosted Decision Trees (BDT)

DT: sequential application of cuts splits the data into nodes, where the final nodes (leaves) classify an event as signal or background.

BDT: combine a forest of DTs, with differently weighted events in each tree (the trees themselves can also be weighted):

e.g. "AdaBoost": incorrectly classified events receive a larger weight in the next DT

GradientBoost

"Bagging": resampling with replacement (equivalent to random Poisson event weights)

"Randomised Forest": like bagging, plus the use of random sub-sets of variables

Bottom-up "pruning" of a decision tree: remove statistically insignificant nodes to reduce tree overtraining.

Boosting/bagging/randomising create a set of "basis functions": the final classifier is a linear combination of these, which improves stability.


Boosted Decision Trees (BDT): Pruning

Bottom-up "pruning" of a decision tree: remove statistically insignificant nodes to reduce tree overtraining.

[Figure: decision tree before and after pruning]

Randomised forests don't need pruning → VERY ROBUST. Some people even claim that "boosted" decision trees in general don't need to be pruned, but according to our experience this is not always true.

Regression: the function value is taken as the average of the training events in the leaf node.


Learning with Rule Ensembles

[Figure: one of the elementary cellular automaton rules (Wolfram 1983, 2002). It specifies the next colour of a cell depending on its own colour and those of its immediate neighbours; the rule outcomes are encoded in the binary representation 30 = 00011110_2.]

Following the RuleFit approach by Friedman-Popescu (Tech. Rep., Stat. Dept., Stanford U., 2003).

The model is a linear combination of rules, where a rule is a sequence of cuts:

y_{\rm RF}(\mathbf{x}) = a_0 + \sum_{m=1}^{M_R} a_m\, r_m(\hat{\mathbf{x}}) + \sum_{k=1}^{n} b_k\, \hat{x}_k

(RuleFit classifier = sum of rules + linear Fisher term; the rules are cut sequences with r_m = 1 if all cuts are satisfied, 0 otherwise; the \hat{x}_k are the normalised discriminating event variables)

The problem to solve is:

create the rule ensemble: use a forest of decision trees

fit the coefficients a_m, b_k: "gradient direct regularisation", minimising the risk (Friedman et al.)

Pruning removes topologically equal rules (same variables in the cut sequence).


Boosting

[Diagram: the training sample is used to build classifier C^{(0)}(x); the sample is then re-weighted and classifier C^{(1)}(x) is trained on the weighted sample; re-weighting and training are repeated to give C^{(2)}(x), C^{(3)}(x), ..., C^{(m)}(x)]

The boosted classifier combines the N individual classifiers:

y(\mathbf{x}) = \sum_{i}^{N_{\rm Classifier}} w_i\, C^{(i)}(\mathbf{x})


Adaptive Boosting (AdaBoost)

[Diagram: as before, a chain of classifiers C^{(0)}(x) ... C^{(m)}(x), each trained on a re-weighted sample, now with the error rate f_err determining the weights]

AdaBoost re-weights the events misclassified by the previous classifier by the factor

\frac{1 - f_{\rm err}}{f_{\rm err}}, \qquad {\rm with}\quad
f_{\rm err} = \frac{\text{misclassified events}}{\text{all events}}

AdaBoost also weights the classifiers themselves, using the error rate of the individual classifier:

y(\mathbf{x}) = \sum_{i}^{N_{\rm Classifier}} \log\!\left(\frac{1 - f_{\rm err}^{(i)}}{f_{\rm err}^{(i)}}\right) C^{(i)}(\mathbf{x})

A compact code sketch of this procedure follows below.
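A toy sketch of these two formulas with decision stumps as the weak classifiers (illustrative, not the TMVA implementation; the stump-selection step that picks the best cut is omitted):

```cpp
#include <cmath>
#include <vector>

struct Stump { size_t var; double cut; };                // weak classifier: one cut

struct Sample {
   std::vector<std::vector<double> > x;                  // x[event][variable]
   std::vector<int>    label;                            // +1 signal, -1 background
   std::vector<double> weight;
};

int stumpClassify(const Stump& s, const std::vector<double>& ev) {
   return ev[s.var] > s.cut ? +1 : -1;                   // "signal" above the cut
}

// weighted error fraction f_err of one stump
double errorFraction(const Stump& s, const Sample& d) {
   double wrong = 0.0, total = 0.0;
   for (size_t i = 0; i < d.x.size(); ++i) {
      total += d.weight[i];
      if (stumpClassify(s, d.x[i]) != d.label[i]) wrong += d.weight[i];
   }
   return wrong / total;
}

// one AdaBoost iteration: boost misclassified events, return classifier weight
double adaBoostStep(const Stump& s, Sample& d) {
   double fErr  = errorFraction(s, d);
   double boost = (1.0 - fErr) / fErr;                   // event re-weight factor
   for (size_t i = 0; i < d.x.size(); ++i)
      if (stumpClassify(s, d.x[i]) != d.label[i]) d.weight[i] *= boost;
   return std::log(boost);                               // log((1-f_err)/f_err)
}

// final response: y(x) = sum_i log((1-f_err_i)/f_err_i) * C_i(x)
double boostedResponse(const std::vector<Stump>& stumps,
                       const std::vector<double>& alphas,
                       const std::vector<double>& ev) {
   double y = 0.0;
   for (size_t i = 0; i < stumps.size(); ++i)
      y += alphas[i] * stumpClassify(stumps[i], ev);
   return y;
}
```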


AdaBoost: A simple demonstration

The example (somewhat artificial, but nice for demonstration):

a data sample with three "bumps"

a weak classifier, i.e. one single simple cut (↔ a decision tree stump: split at var(i) > x vs. var(i) <= x)

Two reasonable cuts:

a) Var0 > 0.5: ε_signal = 66%, ε_bkg ≈ 0% → 16.5% of all events misclassified

b) Var0 < -0.5: ε_signal = 33%, ε_bkg ≈ 0% → 33% of all events misclassified

The training of a single decision tree stump will find "cut a)".


AdaBoost: A simple demonstration

The first "tree", choosing cut a), will give an error fraction f_err = 0.165.

Before building the next "tree", the wrongly classified training events are re-weighted by (1 - f_err)/f_err ≈ 5.

The next "tree" therefore sees essentially the following data sample: [figure: re-weighted sample]

... and will hence choose "cut b)": Var0 < -0.5.

The combined classifier, Tree1 + Tree2: the (weighted) average of the responses of both trees to a test event is able to separate signal from background as well as one would expect from the most powerful classifier.


AdaBoost: A simple demonstration

[Figure: classifier response with only 1 tree "stump" vs. with 2 tree "stumps" combined via AdaBoost]


Boosting Fisher Discriminant

Fisher discriminant: y(x) is a linear combination of the input variables.

A linear combination of Fisher discriminants is again a Fisher discriminant; hence "simple boosting doesn't help".

However: apply a CUT on the Fisher discriminant in each step (or any other non-linear transformation),

then it works!


Support Vector Machine

There are methods to create linear decision boundaries using only measures of distances;

this leads to a quadratic optimisation process.

Using variable transformations, i.e. moving into a higher-dimensional space (e.g. using var1*var1 in a Fisher discriminant), allows in principle to separate non-linear problems with linear decision boundaries.

The decision boundary in the end is defined only by the training events that are closest to the boundary:

the Support Vector Machine.


Support Vector Machines

Separable data:

Find the hyperplane that best separates signal from background.

Best separation: maximum distance (margin) between the closest events (the support vectors) and the hyperplane.

The linear decision boundary is the optimal hyperplane.

[Figure: signal and background events in the (x1, x2) plane, with the margin and the support vectors marked]

The solution of largest margin depends only on the inner products of the support vectors (distances):

a quadratic minimisation problem.

Non-separable data:

If the data are non-separable, add a misclassification cost term C \cdot \sum_i \xi_i (with slack variables \xi_i) to the minimisation function.


Support Vector Machines

Non-linear cases:

Transform the variables into a higher-dimensional feature space where again a linear boundary (hyperplane) can separate the data.

[Figure: circularly separated (x1, x2) data mapped via Φ(x1, x2) into a space where a hyperplane separates them]

The explicit transformation need not be specified: the solution depends only on the "scalar product" (inner product), x·y → Φ(x)·Φ(y).

Certain kernel functions can be interpreted as scalar products between transformed vectors in the higher-dimensional feature space, e.g. Gaussian, polynomial, sigmoid kernels (see below).

Choose a kernel and fit the hyperplane using the linear techniques developed above.

The kernel size parameter typically needs careful tuning! (Overtraining!)
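For reference, the three kernels named above in their usual form (a sketch of the standard definitions; σ, d, θ and κ are the tunable kernel parameters):

```latex
K_{\mathrm{Gauss}}(\mathbf{x},\mathbf{y}) =
    \exp\!\Bigl(-\tfrac{\lVert\mathbf{x}-\mathbf{y}\rVert^{2}}{2\sigma^{2}}\Bigr),
\qquad
K_{\mathrm{poly}}(\mathbf{x},\mathbf{y}) = \bigl(\mathbf{x}\cdot\mathbf{y} + \theta\bigr)^{d},
\qquad
K_{\mathrm{sigmoid}}(\mathbf{x},\mathbf{y}) = \tanh\bigl(\kappa\,\mathbf{x}\cdot\mathbf{y} + \theta\bigr)
```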


Data Preparation and Preprocessing

Data input format: ROOT TTree or ASCII

Supports selection of any subset, combination, or function of the available variables

Supports application of pre-selection cuts (possibly independent for signal and background)

Supports global event weights for the signal or background input files

Supports the use of any input variable as an individual event weight

Supports various methods for splitting into training and test samples:

block-wise

random

periodic (e.g. 3 test events, 2 training events, 3 test events, 2 training events, ...)

user-defined separate training and test trees

Preprocessing of input variables (e.g. decorrelation); a sketch of the corresponding calls follows below.
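A hedged sketch of the data-preparation calls listed above, continuing the factory example; the tree pointers, the "weight" branch and the cut are hypothetical, and the call names follow the TMVA tutorial macros of that era:

```cpp
factory.AddVariable("var1", 'F');
factory.AddVariable("myvar := var2 + var3", 'F');   // function of tree branches
factory.AddSignalTree(sigTree, 1.0);                // 1.0 = global event weight
factory.AddBackgroundTree(bkgTree, 1.0);
factory.SetBackgroundWeightExpression("weight");    // per-event weight from a branch
TCut presel = "var1 > 0";                           // pre-selection cut
factory.PrepareTrainingAndTestTree(presel,
    "nTrain_Signal=0:nTrain_Background=0:SplitMode=Random:!V");
```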


Code Flow for Training and Application Phases

[Figure: code-flow diagram; see the TMVA tutorial]

Scripts can be ROOT macros, C++ executables, or Python scripts (via PyROOT).


MVA Evaluation Framework

TMVA is not only a collection of classifiers, but an MVA framework

After training, TMVA provides ROOT evaluation scripts (through GUI)

Plot all signal (S) and background (B) input variables with and without pre-processing

Correlation scatters and linear coefficients for S & B

Classifier outputs (S & B) for test and training samples (spot overtraining)

Classifier Rarity distribution

Classifier significance with optimal cuts

B rejection versus S efficiency

Classifier-specific plots:

likelihood reference distributions

classifier PDFs (for probability output and Rarity)

network architecture, weights and convergence

rule-fitting analysis plots

visualisation of decision trees


Evaluating the Classifier Training

Projective likelihood PDFs, MLP training, BDTs, …

average no. of nodes before/after pruning: 4193 / 968


Evaluating the Classifier Training

Classifier output distributions for test and training samples …


Evaluating the Classifier Training

Optimal cut for each classifier ...

Using the TMVA graphical output one can determine the optimal cut on a classifier output (the working point in statistical significance).


Evaluating the Classifier Training

Background rejection versus signal efficiency: the best plot to compare classifier performance.

An elegant variable is the Rarity:

R(y) = \int_{-\infty}^{y} \hat{y}_B(y')\,dy'

It transforms the background to a uniform distribution; the height of the signal peak is then a direct measure of classifier performance.

If the background in data is non-uniform → problem in the training sample.


Evaluating the Classifiers (taken from TMVA output…)

--- Fisher : Ranking result (top variable is best ranked)
--- Fisher : ---------------------------------------------
--- Fisher : Rank : Variable : Discr. power
--- Fisher : ---------------------------------------------
--- Fisher : 1 : var4 : 2.175e-01
--- Fisher : 2 : var3 : 1.718e-01
--- Fisher : 3 : var1 : 9.549e-02
--- Fisher : 4 : var2 : 2.841e-02
--- Fisher : ---------------------------------------------

(the better the variable, the higher it ranks)

--- Factory : Inter-MVA overlap matrix (signal):
--- Factory : ------------------------------
--- Factory :             Likelihood  Fisher
--- Factory : Likelihood:     +1.000  +0.667
--- Factory : Fisher:         +0.667  +1.000
--- Factory : ------------------------------

Input Variable Ranking

Classifier correlation and overlap

How discriminating is a variable?

Do classifiers select the same events as signal and background? If not, there is something to gain!


Evaluating the Classifiers (taken from TMVA output…)

Evaluation results ranked by best signal efficiency and purity (area):

------------------------------------------------------------------------------
MVA                Signal efficiency at bkg eff. (error):   | Sepa-    Signifi-
Methods:           @B=0.01    @B=0.10    @B=0.30    Area    | ration:  cance:
------------------------------------------------------------------------------
Fisher           : 0.268(03)  0.653(03)  0.873(02)  0.882   | 0.444    1.189
MLP              : 0.266(03)  0.656(03)  0.873(02)  0.882   | 0.444    1.260
LikelihoodD      : 0.259(03)  0.649(03)  0.871(02)  0.880   | 0.441    1.251
PDERS            : 0.223(03)  0.628(03)  0.861(02)  0.870   | 0.417    1.192
RuleFit          : 0.196(03)  0.607(03)  0.845(02)  0.859   | 0.390    1.092
HMatrix          : 0.058(01)  0.622(03)  0.868(02)  0.855   | 0.410    1.093
BDT              : 0.154(02)  0.594(04)  0.838(03)  0.852   | 0.380    1.099
CutsGA           : 0.109(02)  1.000(00)  0.717(03)  0.784   | 0.000    0.000
Likelihood       : 0.086(02)  0.387(03)  0.677(03)  0.757   | 0.199    0.682
------------------------------------------------------------------------------
(the better the classifier, the higher it ranks)

Testing efficiency compared to training efficiency (overtraining check):

------------------------------------------------------------------------------
MVA                Signal efficiency: from test sample (from training sample)
Methods:           @B=0.01          @B=0.10          @B=0.30
------------------------------------------------------------------------------
Fisher           : 0.268 (0.275)    0.653 (0.658)    0.873 (0.873)
MLP              : 0.266 (0.278)    0.656 (0.658)    0.873 (0.873)
LikelihoodD      : 0.259 (0.273)    0.649 (0.657)    0.871 (0.872)
PDERS            : 0.223 (0.389)    0.628 (0.691)    0.861 (0.881)
RuleFit          : 0.196 (0.198)    0.607 (0.616)    0.845 (0.848)
HMatrix          : 0.058 (0.060)    0.622 (0.623)    0.868 (0.868)
BDT              : 0.154 (0.268)    0.594 (0.736)    0.838 (0.911)
CutsGA           : 0.109 (0.123)    1.000 (0.424)    0.717 (0.715)
Likelihood       : 0.086 (0.092)    0.387 (0.379)    0.677 (0.677)
------------------------------------------------------------------------------
(check for overtraining: compare test and training efficiencies)


Subjective Summary of TMVA Classifier Properties

[Table: subjective ratings of each classifier (Cuts, Likelihood, PDERS/k-NN, H-Matrix, Fisher, MLP, BDT, RuleFit, SVM) against the criteria: performance with no/linear correlations and with nonlinear correlations; speed of training and of response; robustness against overtraining and against weak input variables; curse of dimensionality; transparency]

The properties of the function discriminant (FDA) depend on the chosen function.


Regression Example

Example: the target is a quadratic function of 2 variables,

approximated by:

a "linear regressor"

a neural network

A training sketch for such a regression task follows below.
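A hedged sketch of booking such a regression task (TMVA 4 added multivariate regression); the tree and branch names (regTree, fvalue) are hypothetical and the option strings are abbreviated:

```cpp
TMVA::Factory factory("TMVARegression", outFile, "!V:AnalysisType=Regression");
factory.AddVariable("var1", 'F');
factory.AddVariable("var2", 'F');
factory.AddTarget("fvalue");                        // the quantity to estimate
factory.AddRegressionTree(regTree, 1.0);
factory.PrepareTrainingAndTestTree("", "SplitMode=Random:!V");
factory.BookMethod(TMVA::Types::kLD,  "LD",  "!V"); // linear regressor
factory.BookMethod(TMVA::Types::kMLP, "MLP", "!V:HiddenLayers=N+5");
factory.TrainAllMethods();
```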


NEW! TMVA Users Guide for 4.0, available on http://tmva.sf.net

TMVA Users Guide: 97 pp., incl. code examples

arXiv physics/0703039

Web page with a quick-reference summary of all training options!


Some Toy Examples


The “Schachbrett” Toy (chess board)

Performance achieved without parameter tuning: PDERS and BDT are the best "out of the box" classifiers.

After some parameter tuning, SVM and ANN (MLP) perform equally well.

[Figure: event distribution, events weighted by SVM response, and the theoretical maximum]


More Toys: Linear-, Cross-, Circular Correlations

Illustrate the behaviour of linear and nonlinear classifiers

Linear correlations (same for signal and background)

Linear correlations (opposite for signal and background)

Circular correlations (same for signal and background)


How does linear decorrelation affect strongly nonlinear cases?

Original correlations

SQRT decorrelation

PCA decorrelation


Weight Variables by Classifier Output

Linear correlations (same for signal and background)

Cross-linear correlations (opposite for signal and background)

Circular correlations (same for signal and background)

How well do the classifiers resolve the various correlation patterns?

(shown for Likelihood, Likelihood-D, PDERS, Fisher, MLP, BDT)


Final Classifier Performance

Background rejection versus signal efficiency curve:

[Figure: background rejection vs. signal efficiency for the linear, cross and circular examples]


Stability with Respect to Irrelevant Variables

Toy example with 2 discriminating and 4 non-discriminating variables.

[Figure: background rejection vs. signal efficiency when using only the two discriminating variables in the classifiers, compared to using all variables]


General Remarks and about Systematics


General Advice for (MVA) Analyses

There is no magic in MVA methods:

no need to be too afraid of "black boxes"

you typically still need to tune carefully and do some "hard work"

The most important thing at the start is finding good observables:

good separation power between S and B

little correlation amongst each other

no correlation with the parameters you try to measure in your signal sample!

Think also about possible combinations of variables: can this eliminate correlations?

Always apply straightforward preselection cuts and let the MVA only do the rest.

"Sharp features should be avoided": they cause numerical problems and loss of information when binning is applied;

simple variable transformations (e.g. log(var1)) can often smooth out these regions and allow signal and background differences to appear more clearly.

Treat regions of the detector that have different features "independently":

lumping them together can introduce correlations where otherwise the variables would be uncorrelated!


Some Words about Systematic Errors

Typical worries are:

What happens if the estimated "probability density" is wrong?

Can the classifier, i.e. the discrimination function y(x), introduce systematic uncertainties?

What happens if the training data do not match "reality"?

Any wrong PDF leads to an imperfect discrimination function y(x) (calling it "wrong" wouldn't be "right"; schematically, y(x) is built from P(x|S) and P(x|B)): a loss of discrimination power, and that's all!

Classical cuts face exactly the same problem. But: in addition to cutting on features that are not correct, you can now also "exploit" correlations that are in fact not correct. HOWEVER: see the next slides.

Systematic errors are only introduced once "Monte Carlo events" with imperfect modelling are used to estimate

efficiency and purity,

the number of expected events.

This is the same problem as in a classical "cut" analysis:

use control samples to test the MVA output distribution y(x)

the combined variable (the MVA output y(x)) might "hide" problems in ONE individual variable:

train the classifier with a few variables only and compare with data

check all input variables individually ANYWAY


Systematic "Error" in Correlations

Use as training sample events that have correlations:

optimise CUTs

train a proper MVA (e.g. Likelihood, BDT)

Assume that in "real data" there are NO correlations → SEE what happens!


Systematic "Error" in Correlations

Compare "data" (test sample) and Monte Carlo (both taken from the same underlying distribution):


Systematic "Error" in Correlations

Compare "data" (test sample) and Monte Carlo (both taken from underlying distributions that differ by the correlation!):

differences are ONLY visible in the MVA-output plots (and if you looked at cut sequences...).


Treatment of Systematic Uncertainties

"Calibration uncertainty" may shift the central value and hence worsen (or improve) the discrimination power of "var4".

Is there a strategy, however, to become "less sensitive" to possible systematic uncertainties?

Classically, for a variable that is prone to uncertainties, one does not cut in the region of the steepest gradient;

classically one would not place the most important cut on an uncertain variable.

Try to make the classifier less sensitive to "uncertain variables":

e.g. re-weight events in the training to decrease the separation in variables with large systematic uncertainty

(certainly not yet a recipe that can strictly be followed, more an idea of what could perhaps be done)


Treatment of Systematic Uncertainties

1st Way

[Figure: classifier output distributions for signal only]


Treatment of Systematic Uncertainties

2nd Way

[Figure: classifier output distributions for signal only]


Copyrights & Credits

TMVA is open source software

Use & redistribution of source permitted according to terms in BSD license

Acknowledgments: The fast development of TMVA would not have been possible without the contribution and feedback from many developers and users, to whom we are indebted: Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland).

Backup slides on: (i) some illustrative toy examples, (ii) treatment of systematic uncertainties, (iii) sensitivity to weak input variables.

http://tmva.sf.net/


Questions/Discussion Points

How should one treat systematic uncertainties in MVA methods?

check multidimensional training samples including their correlations

which differences do we care about?

Can one name a particular classifier that is:

always the best?

always the best for a specific class of models?

How could one teach the classifiers to optimise for certain regions of interest (e.g. very large background rejection only)?

Is that actually necessary, or does the cut on the MVA output provide just this?

Train according to the "smallest error on the measurement" rather than the best separation of signal and background?

Does anyone have a good idea for a better treatment of negative weights?

How is the data sample best divided into "test" and "training" samples?

Do we want an additional validation sample?

automatic tuning of parameters

cross validation


Bias-Variance Trade-off and Overtraining

[Figure: prediction error vs. model complexity. Low complexity: high bias, low variance; high complexity: low bias, high variance. With increasing complexity the training-sample error keeps falling while the test-sample error rises again: overtraining! Illustrated alongside by S/B decision boundaries in the (x1, x2) plane of moderate and of excessive flexibility.]


Cross Validation

Many (all) classifiers have tuning parameters "α" that need to be controlled against overtraining:

the number of training cycles, the number of nodes (neural net)

the smoothing parameter h (kernel density estimator)

...

The more free parameters a classifier has to adjust internally, the more prone it is to overtraining.

More training data → better training results, but the division of the data set into a "training" and a "test" sample reduces the "training" data.

Cross validation: divide the data sample into, say, 5 sub-sets.

[Diagram: the 5 sub-sets, with a different one held out as "Test" in each of the 5 rounds]

train 5 classifiers y_i(x, α), i = 1, ..., 5, where classifier y_i(x, α) is trained without the i-th sub-sample

calculate the test error:

{\rm CV}(\alpha) = \frac{1}{N_{\rm events}} \sum_{{\rm events}\;k} L\bigl(y_i(x_k, \alpha)\bigr),
\qquad L: {\rm loss\ function}

choose the tuning parameter α for which CV(α) is minimal and train the final classifier using all data; a generic skeleton follows below.
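A generic k-fold skeleton of this procedure (a sketch, not TMVA code; the train/loss callables are supplied by the user):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// 'train' builds a model from a sub-sample with tuning parameter alpha,
// 'loss' evaluates the trained model on one held-out event.
template <typename Event, typename Model>
double crossValidate(const std::vector<Event>& data, double alpha,
                     std::function<Model(const std::vector<Event>&, double)> train,
                     std::function<double(const Model&, const Event&)> loss,
                     std::size_t nFolds = 5) {
   double totalLoss = 0.0;
   for (std::size_t fold = 0; fold < nFolds; ++fold) {
      std::vector<Event> trainSet, testSet;
      for (std::size_t i = 0; i < data.size(); ++i)      // hold out the i-th sub-sample
         (i % nFolds == fold ? testSet : trainSet).push_back(data[i]);
      Model y = train(trainSet, alpha);                  // trained without this fold
      for (std::size_t i = 0; i < testSet.size(); ++i)
         totalLoss += loss(y, testSet[i]);               // accumulate the loss L
   }
   return totalLoss / static_cast<double>(data.size());  // CV(alpha)
}
```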


Minimisation

A robust global minimum finder is needed at various places in TMVA.

Default solution in HEP: (T)Minuit/Migrad [how much longer do we need to suffer ...?]

gradient-driven search, using a variable metric; can use a quadratic Newton-type solution

poor global minimum finder, gets quickly stuck in the presence of local minima

Specific global optimisers implemented in TMVA:

Genetic Algorithm: biology-inspired optimisation algorithm

Simulated Annealing: slow "cooling" of the system to avoid "freezing" in a local solution

Brute-force method: Monte Carlo sampling: sample the entire solution space and choose the solution providing the minimum estimator

good global minimum finder, but poor accuracy

TMVA allows minimisers to be chained:

for example, one can use MC sampling to detect the vicinity of a global minimum, and then use Minuit to accurately converge to it.


Minimisation Techniques

Monte Carlo / Grid Search (brute-force methods: random Monte Carlo or grid search):

sample the entire solution space, and choose the solution providing the minimum estimator

good global minimum finder, but poor accuracy

Minuit (default solution in HEP):

gradient-driven search, using a variable metric; can use a quadratic Newton-type solution

poor global minimum finder, gets quickly stuck in the presence of local minima

Simulated Annealing:

like heating up a metal and slowly cooling it down ("annealing")

atoms in a metal move towards the state of lowest energy, while for sudden cooling they tend to freeze in intermediate, higher-energy states → slow "cooling" of the system avoids "freezing" in a local solution

Genetic Algorithm:

biology-inspired

"genetic" representation of points in the parameter space

uses mutation and "crossover"

finds approximately global minima


Binary Trees

A binary tree is a data structure in which each node has at most two children, typically called "left" and "right".

Binary trees are used in TMVA to implement binary search trees and decision trees.

[Figure: tree with a root node and leaf nodes; "<" descends left, ">" descends right; the depth of the tree sets the search cost]

Amount of computing time to search for events:

box search: ∝ (N_events)² · N_var

BT search: ∝ N_events · N_var · ln₂(N_events)

A minimal sketch of the structure follows below.
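A minimal 1-D binary search tree to illustrate the structure (TMVA's version generalises this to N_var dimensions by cycling through the variables with tree depth; this sketch is illustrative only):

```cpp
struct Node {
   double value;
   Node  *left, *right;                     // "<" goes left, otherwise right
   explicit Node(double v) : value(v), left(0), right(0) {}
};

Node* insert(Node* root, double v) {
   if (!root) return new Node(v);           // empty subtree: create a leaf
   if (v < root->value) root->left  = insert(root->left, v);
   else                 root->right = insert(root->right, v);
   return root;
}

bool contains(const Node* root, double v) {
   while (root) {
      if (v == root->value) return true;    // found after O(tree depth) steps
      root = (v < root->value) ? root->left : root->right;
   }
   return false;
}
```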


"Gauss-Decorr-Gauss" etc.: the preprocessing transformations can be chained and iterated, e.g.

G (Gaussianise only)

GDG (Gaussianise, decorrelate, Gaussianise)

GDGDGDG (further iterations)