
Implementation of the Random Forest Method for the Imaging Atmospheric Cherenkov Telescope MAGIC

J. Albert a, E. Aliu b, H. Anderhub c, P. Antoranz d, A. Armada b, M. Asensio d, C. Baixeras e, J. A. Barrio d, H. Bartko f, D. Bastieri g, J. Becker h, W. Bednarek i, K. Berger a, C. Bigongiari g, A. Biland c, R. K. Bock f,g, P. Bordas j, V. Bosch-Ramon j, T. Bretz a, I. Britvitch c, M. Camara d, E. Carmona f, A. Chilingarian k, S. Ciprini ℓ, J. A. Coarasa f, S. Commichau c, J. L. Contreras d, J. Cortina b, M. T. Costado m,v, V. Curtef h, V. Danielyan k, F. Dazzi g, A. De Angelis n, C. Delgado m, R. de los Reyes d, B. De Lotto n, E. Domingo-Santamaría b, D. Dorner a, M. Doro g, M. Errando b, M. Fagiolini o, D. Ferenc p, E. Fernández b, R. Firpo b, J. Flix b, M. V. Fonseca d, L. Font e, M. Fuchs f, N. Galante f, R. J. García-López m,v, M. Garczarczyk f, M. Gaug m, M. Giller i, F. Goebel f, D. Hakobyan k, M. Hayashida f, T. Hengstebeck q,∗, A. Herrero m,v, D. Höhne a, J. Hose f, S. Huber a, C. C. Hsu f, P. Jacon i, T. Jogler f, R. Kosyra f, D. Kranich c, R. Kritzer a, A. Laille p, E. Lindfors ℓ, S. Lombardi g, F. Longo n, J. López b, M. López d, E. Lorenz c,f, P. Majumdar f, G. Maneva r, K. Mannheim a, M. Mariotti g, M. Martínez b, D. Mazin b, C. Merck f, M. Meucci o, M. Meyer a, J. M. Miranda d, R. Mirzoyan f, S. Mizobuchi f, A. Moralejo b, D. Nieto d, K. Nilsson ℓ, J. Ninkovic f, E. Oña-Wilhelmi b, N. Otte f,q, I. Oya d, M. Panniello m,w, R. Paoletti o, J. M. Paredes j, M. Pasanen ℓ, D. Pascoli g, F. Pauss c, R. Pegna o, M. Persic n,s, L. Peruzzo g, A. Piccioli o, N. Puchades b, E. Prandini g, A. Raymers k, W. Rhode h, M. Ribó j, J. Rico b, M. Rissi c, A. Robert e, S. Rügamer a, A. Saggion g, T. Y. Saito f, A. Sánchez e, P. Sartori g, V. Scalzotto g, V. Scapin n, R. Schmitt a, T. Schweizer f,

Preprint submitted to NIM 9 July 2018

arXiv:0709.3719v2 [astro-ph] 8 Nov 2007


M. Shayduk q,f, K. Shinozaki f, S. N. Shore t, N. Sidro b, A. Sillanpää ℓ, D. Sobczynska i, F. Spanier a, A. Stamerra o, L. S. Stark c, L. Takalo ℓ, P. Temnikov r, D. Tescaro b, M. Teshima f, D. F. Torres u, N. Turini o, H. Vankov r, A. Venturini n, V. Vitale n, R. M. Wagner f, T. Wibig i, W. Wittek f, F. Zandanel g, R. Zanin b, J. Zapatero e

a Universität Würzburg, D-97074 Würzburg, Germany
b Institut de Física d'Altes Energies, Edifici Cn., E-08193 Bellaterra (Barcelona), Spain
c ETH Zürich, CH-8093 Switzerland
d Universidad Complutense, E-28040 Madrid, Spain
e Universitat Autònoma de Barcelona, E-08193 Bellaterra, Spain
f Max-Planck-Institut für Physik, D-80805 München, Germany
g Università di Padova and INFN, I-35131 Padova, Italy
h Universität Dortmund, D-44227 Dortmund, Germany
i University of Lodz, PL-90236 Lodz, Poland
j Universitat de Barcelona, E-08028 Barcelona, Spain
k Yerevan Physics Institute, AM-375036 Yerevan, Armenia
ℓ Tuorla Observatory, FI-21500 Piikkiö, Finland
m Inst. de Astrofisica de Canarias, E-38200 La Laguna, Tenerife, Spain
n Università di Udine, and INFN Trieste, I-33100 Udine, Italy
o Università di Siena, and INFN Pisa, I-53100 Siena, Italy
p University of California, Davis, CA-95616-8677, USA
q Humboldt-Universität zu Berlin, D-12489 Berlin, Germany
r Institute for Nuclear Research and Nuclear Energy, BG-1784 Sofia, Bulgaria
s INAF/Osservatorio Astronomico and INFN Trieste, I-34131 Trieste, Italy
t Università di Pisa, and INFN Pisa, I-56126 Pisa, Italy
u ICREA & Institut de Ciències de l'Espai (CSIC-IEEC), E-08193 Bellaterra, Spain
v Depto. de Astrofisica, Universidad, E-38206 La Laguna, Tenerife, Spain
w Deceased

Abstract

The paper describes an application of the tree classification method Random Forest (RF), as used in the analysis of data from the ground-based gamma telescope MAGIC. In such telescopes, cosmic γ-rays are observed and have to be discriminated against a dominating background of hadronic cosmic-ray particles. We describe the application of RF for this gamma/hadron separation. The RF method often shows superior performance in comparison with traditional semi-empirical techniques. Critical issues of the method and its implementation are discussed. An application of the RF method for the estimation of a continuous parameter from related variables, rather than discrete classes, is also discussed.

Key words: discrimination, classification, decision tree

1 Introduction

Ground-based gamma-ray astronomy has in recent years proven to be a source of spectacular discoveries, constraining the evolution of the universe and contributing to the understanding of the origin of cosmic rays. Observations are based on Imaging Atmospheric Cherenkov Telescopes (IACTs), which take advantage of the Cherenkov radiation emanating from the electromagnetic showers that develop during the absorption of gamma-rays in the atmosphere. The faint Cherenkov light flashes are collected in a large-diameter mirror, and recorded in a pixelized camera.

Several IACT systems are in successful operation today, both in the Northern (MAGIC, VERITAS) and Southern (HESS, CANGAROO) hemispheres; all but MAGIC are implemented as multi-telescope arrays. Their scientific goals include galactic and extragalactic sources: supernova remnants, pulsars, X-ray binaries, microquasars, Active Galactic Nuclei (blazars or radio galaxies), starburst galaxies and potentially also Gamma Ray Bursts. Due to their small aperture, IACTs can only perform scans over small areas, and usually concentrate on sources that have been identified at other wavelengths; however, the number of known gamma-ray emitters is increasing fast, and they provide essential contributions to the understanding of the non-thermal universe.

Events seen by an IACT have a very short (≈ 2 ns) duration, and the shower image is recorded as a compact cluster of pixels in the camera of the IACT. A principal component analysis permits expressing the characteristics of this cluster in image parameters, which present statistically different properties for the (interesting) gamma-rays and the (dominating) hadronic background. IACTs provide raw data with a signal-to-noise ratio much smaller than 1%, even for bright gamma sources. Establishing powerful methods of hadronic background rejection thus is a prerequisite for the effective utilization of observations with the Cherenkov technique. This fact was recognized with the advent of the IACT technique, and has been given ample room in the literature, both for telescope arrays and single telescopes, e.g. [1–5]. Multivariate methods using global test statistics (e.g. likelihood ratios or artificial neural networks) are specifically mentioned in [3] and [5].

∗ Corresponding author. Email address: [email protected] (T. Hengstebeck).

A case study for and comparison of different advanced classification methods for a single-dish IACT can be found in [6]. In the same article the main features of Cherenkov images measured by gamma-ray telescopes are addressed and explained, and the image parameters used in the γ/h separation are defined.

In this paper, largely derived from chapter 5 of [7], we limit ourselves to the implementation, usage, and functionality of the RF method for the single-dish system MAGIC [8]. In [7], a more detailed discussion of the RF method and comprehensive MC studies are given. The implementation closely follows the method described by L. Breiman [9]. The application in γ/h separation is discussed in detail. Recent MAGIC publications (e.g. [10–12]) use the RF technique, and [13] discusses it in the context of the reference observations of the Crab nebula. A short comparative study with the established method of cuts in scaled image parameters is given. We also discuss an application of the RF method in estimating the gamma energy, a continuous variable, in terms of the observed image parameters. In the following chapter 2 the Random Forest method is described in detail, since existing mathematical treatments show only few practically useful aspects, if any. The reader not interested in these details may regard RF as a black-box tree classification method, and continue with the results in section 4.2.

2 Basics of the Random Forest (RF) method

The Random Forest method is based on a collection of decision trees, built up with some elements of random choices. Like many other classification and regression methods, a Random Forest is constructed on the basis of training samples suitable for the application. For the purpose of γ/h separation, the training samples contain the two classes of gammas (usually Monte Carlo (MC) data) and hadrons (usually OFF data; also ON¹ or MC data are possible). In the further discussion, the following definitions will be used: We call the elements of the training sample events. Each event is characterized by a vector whose components are image parameters obtained by analyzing the camera pixels. We use the familiar Hillas parameters [1] and some additional parameters, but also observation- and detector-related parameters, like cos(θ), θ being the zenith angle of the source. The space spanned by the event vectors is multi-dimensional. One can consider the training samples of gammas and hadrons as a single labeled training sample, viz. each event has an integer label (called hadronness) indicating whether the event belongs to the class of gammas (hadronness 0) or to the class of hadrons (hadronness 1).

¹ ON and OFF data are telescope data obtained by pointing at the source or at a nearby, sourceless region of the sky, respectively.

From this sample, a binary decision tree can be constructed, subdividing the parameter space first in two parts depending on one of the parameters, and subsequently repeating the process again and again for each part. The best choice of parameter and the criteria for subdividing are discussed below. Using a single tree for classification purposes, however, usually gives mediocre results. The tree is overoptimized on the training sample, and there is only poor generalization, viz. new events will be classified rather badly. This is shown in figure 1. Note, however, that even a set of trees (forest) results in some sparsely populated areas, where the hadronness necessarily is not well defined, and the probability of misclassification may be substantial.

Fig. 1. Left: Illustration of the RF method for a simple 2-dimensional model case. The black and white points are the observed points in the classes gamma and hadron, respectively. They are distributed according to two different, but overlapping 2-dimensional Gaussians. The result of separation in terms of hadronness is shown in colour. Right: The result of using a single tree on the same data gives no probability measure like hadronness, but only y/n answers. Its performance is inadequate.

There is no pruning (tree simplification by removing some branches considered irrelevant) of the trees in the Random Forest algorithm. Instead, the RF creates a set of largely uncorrelated trees, and combines their results to form a generalized predictor. Two random elements prior to and within the tree growing process serve to approximate ideally uncorrelated trees; they are described in the following sections.


2.1 Bootstrap aggregating (bagging)

There is usually a single data sample in each class used for training. A straightforward solution to obtain independent trees would be to split the training sample into as many non-overlapping subsamples as trees should be grown. However, there are usually not enough training data available for this approach. This is especially the case when dealing with air shower data, which are always costly to generate (w.r.t. computer time and storage space). A different way is to produce a bootstrap sample for each tree by sampling n times with replacement from the original training sample containing n events. This procedure guarantees that the events' image parameter distributions are statistically identical for all bootstrap samples (and equal to the image parameter distributions of the original training sample, since the probability of selecting an event is constantly 1/n in the sampling with replacement procedure), while the bootstrap samples do not contain the same events. It may (and will) happen that certain events are taken more than once: The probability of not selecting a certain event is equal to (1 − 1/n), which becomes (1 − 1/n)ⁿ when repeating the selection process n times. As lim_{n→∞}(1 + x/n)ⁿ = eˣ, the probability of not selecting an event in the bootstrap procedure becomes e⁻¹ ≈ 1/3. Thus, each bootstrap sample contains on average a fraction (1 − 1/e) of the original training events; the rest (also kept in the sample) are copies.
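The bootstrap step just described can be sketched in a few lines of plain Python (an illustration, not the actual MAGIC code; the function name and toy event list are invented for the example). Events not drawn form the "out-of-bag" set used later for error estimation:

```python
import random

def bootstrap_sample(events, rng):
    """Draw n events with replacement from a training sample of n events.

    Each draw picks any event with probability 1/n, so the parameter
    distributions of the bootstrap sample match those of the original
    sample, while roughly a fraction 1/e of the events is left out
    ("out-of-bag")."""
    n = len(events)
    picked = [rng.randrange(n) for _ in range(n)]
    sample = [events[i] for i in picked]
    out_of_bag = [events[i] for i in sorted(set(range(n)) - set(picked))]
    return sample, out_of_bag

rng = random.Random(42)
events = list(range(10000))          # toy stand-in for event vectors
sample, oob = bootstrap_sample(events, rng)
# the out-of-bag fraction should come out close to 1/e ≈ 0.37
print(len(sample), len(oob) / len(events))
```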

2.2 Tree growing and random split selection

The tree growing begins with the complete sample contained in a single node, the so-called root node, which is identical to the complete image parameter space. In the following, the γ/h separation is achieved by splitting (or cutting) each node into two successor nodes using one of the image parameters at a time, with a cut value optimized to separate the sample into its classes (in our case two: gammas and hadrons). This corresponds to a successive division of the image parameter space into hypercubes. In order to measure the classification power (separation ability) of an image parameter and to optimize the cut value, the Gini index is used. The Gini index is a frequently used measure in dealing with classifiers, originally in economics. Named after the Italian economist Corrado Gini, it measures the inequality of two distributions, e.g. gamma acceptance and hadron acceptance as a function of a cut in a variable. It is defined as the ratio between a) the area spanned by the observed cumulative distribution and the hypothetical cumulative distribution for a non-discriminating variable (uniform distribution, 45-degree line), and b) the area under this uniform distribution. It is a variable between zero and one; a low Gini coefficient indicates more equal distributions, a high Gini coefficient shows unequal distributions.


The choice of the parameter taken for splitting is randomized (see below for details). The splitting process stops if the node size (events per node) falls below a limit specified by the user, or if there are only events of one class (only gammas or only hadrons) left in the node, which therefore need not be split further. These terminal nodes can also be called elementary hypercubes; they cover the entire image parameter space without intersections or gaps. Each terminal node is assigned a class label l (0 for gammas, 1 for hadrons) by the training events remaining in it. For terminal nodes still containing a mixture of events of different classes, a mean value is calculated for l, taking into account the class populations N_h of hadrons and N_γ of gammas: l = N_h/(N_h + N_γ). The original program [14] uses a majority vote, and does not calculate mean values.

Before going into more details, the classification process is briefly described. One can take a completely grown tree as a starting point (see figure 2). The task is to classify an event characterized by a vector v in the image parameter space. v is fed into the decision tree; at the first (highest-level) node there is a split in a certain image parameter (e.g. 'length'). Depending on the component (image parameter) 'length' of v, the event proceeds to the left node (length < split value) or to the right node (length ≥ split value) at the next lower level. This node again splits in some other (or, by chance, the same) component, and the process continues. The result is that v follows a track through the tree, determined by the numerical values of its components and the tree nodes' cut values, until it ends up in a terminal node. This terminal node assigns a class label l to v, which can now be denoted as l_i(v), where i is the tree number.

Fig. 2. Sketch of a tree structure for the classification of an event v with components v_length, v_width, and v_size. One can follow the decision path through the tree, leading to classification of the event as hadron.

The vector v will be classified by all trees. Due to the randomization involved, different trees will often give different results, hence the name 'Random Forest'.
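The traversal and averaging just described can be sketched as follows. This is an illustrative toy, not the modified Breiman program: the `Node` class, the parameter names and the three-stump forest are invented for the example.

```python
class Node:
    """One node of a binary decision tree.

    Internal nodes hold a (param, cut) pair; terminal nodes hold a class
    label l (0 = gamma, 1 = hadron, or a mean value for mixed nodes)."""
    def __init__(self, param=None, cut=None, left=None, right=None, label=None):
        self.param, self.cut = param, cut
        self.left, self.right = left, right
        self.label = label                    # set only for terminal nodes

def classify(tree, v):
    """Follow the event v (a dict of image parameters) to a terminal node."""
    node = tree
    while node.label is None:
        node = node.left if v[node.param] < node.cut else node.right
    return node.label

def hadronness(forest, v):
    """Mean of the terminal-node labels l_i(v) over all trees, as in eq. (1)."""
    return sum(classify(t, v) for t in forest) / len(forest)

# toy forest of three stump-like trees cutting on 'length' and 'width'
forest = [
    Node('length', 0.20, Node(label=0), Node(label=1)),
    Node('length', 0.25, Node(label=0), Node(label=1)),
    Node('width', 0.10, Node(label=0), Node(label=1)),
]
event = {'length': 0.22, 'width': 0.05}       # hypothetical image parameters
print(hadronness(forest, event))              # 1/3 for this toy forest
```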


From these results, a mean classification is calculated:

h(v) = ( Σ_{i=1}^{n_trees} l_i(v) ) / n_trees    (1)

This mean classification is called Hadronness, and is used as the only test statistic (split parameter) in the γ/h separation (see figure 3).

Fig. 3. Mean hadronness for two test samples of gammas (left peak, black) and hadrons (right peak, red). Hadronness is the final and only test statistic in γ/h separation.

The splitting process is somewhat randomized by a feature called random split selection. The parameter candidates for a split are chosen randomly from the total number of available parameters. Among the candidates, the parameter and corresponding cut value to be used for splitting are chosen by the minimal Gini index. In the case of two classes, the Gini index Q_Gini can be regarded as the binomial variance of the sample, scaled to the interval [0, 1]. The Gini index (or Gini coefficient) can be expressed in terms of the node class populations N_γ, N_h and the total node population N:

Q_Gini = (4/N) σ²_binomial = 4 (N_γ/N)(N_h/N) = 4 N_γ(N − N_γ)/N² ∈ [0, 1]    (2)

Q_Gini of a node is zero in the ideal case that only one class is present in the node (N_γ = 0 or N_h = 0). The Gini index of the split is calculated by adding the Gini indices of the two successor nodes (denoted by left and right node) and scaling the result to [0, 1]:

Q_Gini = 2 [ (N_γ,left/N_left)(N_h,left/N_left) + (N_γ,right/N_right)(N_h,right/N_right) ] ∈ [0, 1]    (3)


Choosing the smallest Q_Gini corresponds to minimizing the variance of the populations of gammas and hadrons, and naturally purifies the sample. Minimization of the Gini index provides both the choice of the image parameter and the split value to be used.
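A minimal sketch of this split search, using the split Gini index of eq. (3) and a random subset of candidate parameters. The function names, the midpoint cut candidates and the toy data are illustrative choices, not the actual implementation.

```python
import random

def q_gini_split(left_labels, right_labels):
    """Eq. (3): sum of the two successor-node Gini terms, scaled to [0, 1]."""
    def term(labels):
        n = len(labels)
        if n == 0:
            return 0.0
        n_h = sum(labels)                 # labels: 0 = gamma, 1 = hadron
        n_g = n - n_h
        return (n_g / n) * (n_h / n)
    return 2.0 * (term(left_labels) + term(right_labels))

def best_split(events, labels, params, n_trials, rng):
    """Random split selection: try a random subset of n_trials parameters
    and return the (param, cut, Q) triple minimizing the split Gini index."""
    best = (None, None, float('inf'))
    for p in rng.sample(params, n_trials):
        values = sorted(set(e[p] for e in events))
        for lo, hi in zip(values, values[1:]):
            cut = 0.5 * (lo + hi)         # midpoint between adjacent values
            left = [l for e, l in zip(events, labels) if e[p] < cut]
            right = [l for e, l in zip(events, labels) if e[p] >= cut]
            q = q_gini_split(left, right)
            if q < best[2]:
                best = (p, cut, q)
    return best

events = [{'length': 0.10}, {'length': 0.15}, {'length': 0.50}, {'length': 0.60}]
labels = [0, 0, 1, 1]
p, cut, q = best_split(events, labels, ['length'], n_trials=1, rng=random.Random(1))
print(p, cut, q)                          # pure split at length = 0.325, Q = 0.0
```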

More details concerning the Random Forest method can be found in [14]. The original program was modified to calculate the mean hadronness instead of a 0 or 1 majority vote for a class. Calculating the arithmetic mean using weights (e.g. the Gini index of the terminal nodes) did not further improve the results [6], [7].

3 Control of the training process

In this chapter we address some specific aspects of RF related to the training process. Proper training depends on several parameters steering the growing of trees, which the user should be aware of. In the following these parameters are described.

• Number of trees: the number of trees must be chosen large enough to ensure the convergence of the error σ, given by

σ(n_tree) = √( Σ_{i=1}^{n_sample} (h_i^est(n_tree) − h_i^true)² / n_sample )    (4)

σ(n_tree) is the rms error of the estimated hadronness. h_i^est(n_tree) denotes the estimated hadronness (which depends on the number n_tree of combined trees) and h_i^true is the true hadronness of event i in the sample, which contains n_sample events in total. The convergence process is shown in figure 4 for the training of RF on an MC gamma and an MC hadron sample.

Care was taken that the test sample, for which the figure was produced, is disjunct from the training sample. When taking events already used in the training process, σ would be underestimated. From figure 4, the following practical method can be deduced: One generates a reasonably high number of trees (100 trees is usually sufficient), performs the training process, and then finds decisions for a test sample using a diminishing number of trees, to judge how many trees still give satisfactory results. Trees generated during the training process are stored successively in a file. For the classification task one can read in the actually needed number of trees. If no test sample is available, one can take σ(n_tree) as calculated from the so-called out-of-bag data during the training. The out-of-bag data are the 'residue' of the bagging procedure, as explained in the following. In the bagging procedure (generation of bootstrap samples, see chapter 2) there are, for each tree, data which have not been used for that tree's bootstrap sample. Being independent, they can be used as test data for the corresponding tree. In other words, each event of the original training sample can be used as test data for ≈ 1/3 of the trees. If one observes a sufficient convergence of σ calculated from out-of-bag data after, say, 150 trees, only about 50 trees are actually needed, since each event is evaluated by only about one third of the trees.

Fig. 4. Error (rms, √σ²) of the estimated hadronness as a function of the number of trees used. Also shown is the variance of each single tree.

• Overtraining: During tree growing, the cut values of the parameters are adjusted according to the training sample. This overtraining is not a major drawback; it affects merely the training sample, which provides these exact cut values. According to [14] the overtraining (or overoptimization) vanishes in the case of an infinite number of trees. The practical method described above favours a minimal forest, with a number of trees sufficiently large to ensure a classification error (of a test sample) which is not significantly decreased by adding more trees. Such a forest still shows overtraining: when applying γ/h separation to the training data, the classes of gammas and hadrons can usually be well separated by a cut at hadronness = 0.5. In other words, each tree 'learned by heart' the training events, and the same is true for the entire forest. The situation is the same with classical cuts: the cut values are optimized on a certain observed data set from a gamma source or on Monte Carlo data, and later on applied to the data to be analyzed, which must not contain the training data.

• Number of trials in random split selection: This concerns the parameters considered for splitting. A good empirical value for their number is √N, where N is the total number of parameters used in tree growing [14].

• Node size: this is the minimum size of a node at which further splitting stops. For correctly labeled training events nodesize = 1 can be used; for partly incorrectly labeled data (e.g. using ON data as hadrons) nodesize > 1 is preferable, since such data are not intended to be split completely. Experience tells that a small number < 10 is best.
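The convergence check of eq. (4) can be sketched as follows, assuming the per-tree labels l_j(v_i) for a test sample have been stored; the tolerance criterion for the "minimal forest" is an illustrative choice, not a prescription from the text.

```python
import math

def sigma(tree_labels, true_h, n_tree):
    """Eq. (4): rms error of the estimated hadronness when the first
    n_tree trees are combined.  tree_labels[i][j] is l_j(v_i)."""
    total = 0.0
    for labels_i, h_true in zip(tree_labels, true_h):
        h_est = sum(labels_i[:n_tree]) / n_tree
        total += (h_est - h_true) ** 2
    return math.sqrt(total / len(true_h))

def minimal_forest(tree_labels, true_h, tolerance=1e-3):
    """Smallest number of trees whose error is within tolerance of the
    error of the full forest (a toy version of the 'minimal forest')."""
    n_max = len(tree_labels[0])
    sigma_full = sigma(tree_labels, true_h, n_max)
    for n in range(1, n_max + 1):
        if sigma(tree_labels, true_h, n) <= sigma_full + tolerance:
            return n
    return n_max

# toy example: two events, four trees; event 0 gets one wrong tree vote
tree_labels = [[1, 0, 0, 0], [1, 1, 1, 1]]
true_h = [0, 1]
print([round(sigma(tree_labels, true_h, n), 3) for n in (1, 2, 4)])
```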


4 Application of RF in γ/h separation

4.1 Remarks concerning the training process

In this chapter some features related to the Random Forest method will be briefly addressed. Some of these remarks are valid also for many other advanced classification methods in need of a training process, like Neural Networks or linear discriminant analysis.

• Training data for Cherenkov telescopes: We have used OFF data and MC gammas (correctly labeled samples) or ON data and MC gammas (partly wrongly labeled hadron sample). It is usually advisable not to use MC hadrons, since hadronic showers are difficult to simulate (unlike gamma showers, which have a purely electromagnetic nature), so that MC hadrons are difficult to match in all details with real data. In fact, there is no need to use MC hadrons when OFF or ON data are available. Choosing ON data for training has the advantage of obviating OFF data taking, and of using data taken under identical observational conditions. The Random Forest algorithm is stable enough to deal with a hadron sample containing up to 1% of gammas, as shown in figure 5, where the training was performed using OFF data with variable artificial contamination for the hadrons, and MC data for the gamma sample. In order to simulate ON data, the OFF data were contaminated with MC gammas, i.e. the degree of contamination was known. For all simulated gamma admixtures the reduction of the separation efficiency becomes visible only in a region of low gamma acceptances, in which it is usually not advisable to operate (too low gamma efficiency). Depending on the set of image parameters used for training, a generalization of this result may not be possible.

Fig. 5. Neyman-Pearson or ROC diagrams of hadron training samples with a contamination of (mislabeled) gamma events. A hadron sample with 1% gammas introduces a negligible loss in selection efficiency.

• Types of parameters: All parameters are treated in the same way, which means that in particular detector-related or observational parameters like cos(θ) (θ being the zenith angle), σ (image noise, averaged over all pixels), or size (integrated signal of the image) must be used with care. The sense of using such parameters is that cuts in other image parameters will depend on them, but not that they should be used for cuts. Thus, in general, one can distinguish between parameters to be used for cuts, and parameters on which the cuts in other parameters may depend. To circumvent the problem, the training data must be chosen such that they do not permit a classification using these parameters alone (e.g. by using the same (flat) distribution of cos(θ) in both training samples). Splits in these parameters, in training samples prepared this way, cannot directly serve for separating gammas and hadrons. Additional attention must also be paid if e.g. the gamma data have discrete cos(θ) values for technical reasons in the Monte Carlo production. In this case the cos(θ) values appearing in the hadron sample must be rounded to the same values (binned), or the Monte Carlo data artificially spread to become continuous.
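The binning option for the hadron sample can be sketched as follows; the grid of discrete MC cos(θ) values and the function name are hypothetical, invented for the example.

```python
def snap_to_grid(cos_theta, grid):
    """Round a measured cos(theta) to the nearest of the discrete values
    used in the Monte Carlo gamma production, so that splits in cos(theta)
    alone cannot separate the two training samples."""
    return min(grid, key=lambda g: abs(g - cos_theta))

mc_grid = [1.0, 0.95, 0.9]            # hypothetical discrete MC zenith settings
hadron_cos = [0.97, 0.999, 0.91, 0.93]
print([snap_to_grid(c, mc_grid) for c in hadron_cos])   # [0.95, 1.0, 0.9, 0.95]
```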

4.2 Comparison with direct cuts in image parameters

An extensive comparison of methods applied to Monte Carlo data sets for training and test samples was given in [6]. One of the methods described there (called Direct Selection) was based on using simple AND/OR cuts in the multi-dimensional space of image parameters. The choice of parameters or functions thereof offers many possibilities for tuning. We repeat here a similar comparison, again using Monte Carlo data, using scaled image parameters. Like in [6], no claim can be made that this result, found in favor of the RF method, can be generalized to all parameter choices or to real data. Exhaustive comparisons with real data are lengthy, due to the high dimensionality of the problem, which includes data selection and image cleaning steps even before image parameters are obtained. Quality comparisons using real data are also influenced by the unavoidable changes in operation conditions, which are reflected in data corrections whose effect on separation methods is difficult to evaluate. A comparative study with comprehensive MAGIC data samples is, however, in preparation.

For this comparison we used independent training and test samples of 15000 events each. Hadrons were simulated with the following parameters: energy range 200 GeV < E < 30 TeV; spectral index α = −2.7; zenith angle range 0° < θ < 30°; impact parameter range 0 < R < 400 m; viewing cone 5°. The gamma simulation settings were: energy range 50 GeV < E < 30 TeV; Crab-like spectral index α = −2.6; zenith angle range 0° < θ < 30°; impact parameter range 0 < R < 200 m. Figure 6 shows the corresponding distributions of the image parameters width [deg] and length [deg] as functions of size [phe], for gammas and hadrons. All data were pre-cut to obtain high-quality training and test samples, requiring leakage 2 < 0.1, dist > 0.3°, and size > 200 phe.

Fig. 6. Distribution of the Hillas parameters width (top) and length (bottom) as a function of log(size), for gammas (left) and hadrons (right), as used in the training samples. The profiles are shown in red (gammas) and black (hadrons), showing that both parameters are good separators for size values above 200 photoelectrons (corresponding to about 100 GeV).

Clearly, width and length are good separation parameters, at least for values of size exceeding 200 phe (photoelectrons), which corresponds approximately to energies above 100 GeV. The size dependence of width and length can be dealt with by using scaled parameters: the size range (of the MC gamma data) is divided into bins, and for each bin i the mean and variance of the width distribution (w̄_i and σ²_{w,i}) are calculated. The scaled width of an event falling into bin i is then obtained as

w_scaled = (w − w̄_i) / σ_{w,i}.
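The size-binned scaling described above can be sketched as follows (an illustrative Python sketch, not the MAGIC analysis code; the function name and binning scheme are assumptions):

```python
import numpy as np

def scale_parameter(size, width, bin_edges, mc_size, mc_width):
    """Scale an image parameter (here: width) by the per-size-bin mean
    and standard deviation of the MC gamma distribution."""
    # Mean and sigma of the MC gamma width distribution in each size bin
    mc_bin = np.digitize(mc_size, bin_edges)
    means = np.array([mc_width[mc_bin == b].mean()
                      for b in range(1, len(bin_edges))])
    sigmas = np.array([mc_width[mc_bin == b].std(ddof=1)
                       for b in range(1, len(bin_edges))])
    # Assign each event to a size bin and apply that bin's statistics
    idx = np.clip(np.digitize(size, bin_edges) - 1, 0, len(bin_edges) - 2)
    return (width - means[idx]) / sigmas[idx]
```

Applied to the MC gamma sample itself, the scaled width is distributed with mean 0 and variance close to 1 in every bin, so static cuts can replace size-dependent ones.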

The same procedure is used for the length parameter. As a result one obtains normalized width and length distributions for gammas: they follow a pdf (probability density function) with mean 0 and variance 1. In these variables, static (size-independent) cuts are used for γ/h separation. In order to find optimal cuts, the Q-value, which relates the relative acceptances of gamma-rays and hadrons,

Q = ε_γ / √ε_h,

was maximized using the Metropolis minimization package 3, followed by a SIMPLEX minimization. Both packages are part of TMinuit in the ROOT analysis environment [15].

2 This parameter, not defined in [6], uses an estimate of the fractional energy escaping the camera.
3 This package includes random perturbations in the search, thus avoiding returning local minima.

Both the Random Forest and the scaled parameter methods used independent data for training and testing. Only the parameters size, dist, width, and length were used. The results are compared in the Neyman-Pearson or ROC (Receiver Operating Characteristic) diagrams of figure 7; these diagrams show gamma acceptance as a function of hadron acceptance.

Fig. 7. ROC curves for γ/h separation in the test sample, by the RF method (higher curve) and by cuts in scaled parameters, using the same parameters.

In order to obtain, for the scaled parameter method, more than a single point (that of overall maximum Q) in the ROC diagram, a regularizer a(ε_h − p)² was introduced (a generalization of the method used in [6]). Here p denotes a target acceptance for hadrons, and ε_h is the freely variable hadron acceptance, obtained from the maximization of Q and different for each p. We used a large scaling factor a = 1000 to ensure that the optimization yields a set of cuts with ε_h close to p.

These results are shown as the lower curve in figure 7. We should stress again that this comparison can in no way show a general superiority of the RF method; practical experience shows that for a given data sample other methods (including direct selection as in the above example) can, with some effort, be fine-tuned to give results comparable to the RF method. However, in no case has the RF result been shown to be inferior, and much less tuning is needed (and possible) with the RF method. More comparisons (also including MAGIC data) can be found in [16].
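For the RF method, the ROC curve can be traced directly from its hadronness output by varying the cut; a minimal sketch (array names are assumptions):

```python
import numpy as np

def roc_points(gamma_hadronness, hadron_hadronness, n_cuts=101):
    """Gamma acceptance vs hadron acceptance as the hadronness cut is
    varied; events below the cut are accepted as gamma-like."""
    cuts = np.linspace(0.0, 1.0, n_cuts)
    eps_h = np.array([np.mean(hadron_hadronness < c) for c in cuts])
    eps_g = np.array([np.mean(gamma_hadronness < c) for c in cuts])
    return eps_h, eps_g
```

A good separator yields a curve lying well above the diagonal ε_γ = ε_h, as the RF curve in figure 7 does.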


5 Using a Random Forest estimator for a continuous variable

The RF method also permits the construction of an algorithm estimating a continuous quantity, rather than the discrete class membership dealt with in the previous sections. We have used this method to estimate, non-analytically, the particle energy from the measured image parameters. Two main approaches are possible:

• Forced division into classes: Class labels are assigned to the training events according to an energy grid. As a result, multiple classes E_0, E_1, ..., E_{n−1} are created. In the RF training process the related class populations are taken into account together with a more general Gini index [9]:

p_i = N_i / N,    (5)

Q_Gini = 1 − Σ_{i=0}^{n−1} p_i².    (6)

Here i is the class index (0 ≤ i ≤ n − 1). As already shown above, the Gini index of a split is evaluated as the sum of the two Gini indices obtained after the split, and minimized. After the training procedure, the class populations inside a terminal node are used to calculate the estimated energy corresponding to that node:

E_est = Σ_{i=0}^{n−1} E_i N_i / Σ_{i=0}^{n−1} N_i.    (7)

In this application of RF each tree returns an estimated energy, and the overall mean is taken as the final estimated energy.

• A splitting rule based on the continuous quantity: It is possible to avoid the use of classes completely by introducing a splitting rule which does not rely on class populations. The idea behind the Gini index (with its interpretation as the binomial variance of the classes) as a split rule is a purification of the class populations, i.e. a separation of the classes, in the subsamples after the split. Similarly, when the variance in energy is used as a splitting criterion, the subsamples are purified with respect to their energy distribution:

σ²(E) = (1/(N−1)) Σ_{i=1}^{N} (E_i − Ē)² = (1/(N−1)) [ (Σ_{i=1}^{N} E_i²) − N Ē² ].    (8)

In analogy to the Gini index of the split, the variance of the split is calculated by adding the subsample energy variances, taking the node populations into account as weights:

σ²(E) = (1/(N_L + N_R)) (N_L σ_L²(E) + N_R σ_R²(E)).    (9)
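Both energy-estimation ingredients, the node energy of Eq. (7) and the split variance of Eq. (9), can be sketched as follows (illustrative Python; function names are assumptions):

```python
import numpy as np

def node_energy(class_energies, class_counts):
    """Population-weighted mean energy of a terminal node, Eq. (7)."""
    e = np.asarray(class_energies, dtype=float)
    n = np.asarray(class_counts, dtype=float)
    return float(np.sum(e * n) / np.sum(n))

def split_variance(energies_left, energies_right):
    """Population-weighted variance of a candidate split, Eq. (9);
    the best split minimizes this quantity. ddof=1 matches the
    1/(N-1) normalization of Eq. (8)."""
    nl, nr = len(energies_left), len(energies_right)
    var_l = np.var(energies_left, ddof=1) if nl > 1 else 0.0
    var_r = np.var(energies_right, ddof=1) if nr > 1 else 0.0
    return (nl * var_l + nr * var_r) / (nl + nr)

def forest_energy(tree_node_energies):
    """Final RF estimate: mean of the per-tree terminal-node energies."""
    return float(np.mean(tree_node_energies))
```

A pure split (all low energies on one side, all high on the other) has a lower split variance than a mixed one, which is exactly what drives the tree growth toward energy-homogeneous terminal nodes.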


We have used both approaches on a set of Monte Carlo data. With 100 classes, the first (classification) method produces results nearly identical to those of the second (regression) approach. The results of the latter RF approximation for the energy can be seen in figure 8. The linearity is perfect, and the energy resolution (defined by the rms error σ_E/E) comes out at 26% at 100 GeV and 19% at 1 TeV, very fair values for a single telescope (telescope arrays can reach better resolution). We have not found an analytical parameterization of the energy in terms of image parameters giving a better result than the RF representation; with extensive tuning, though, results of comparable quality have been found.

Fig. 8. Left: The relation between the RF-estimated energy (horizontal axis) and the initial Monte Carlo energy (vertical axis) is perfectly linear. Right: The rms error σ_E/E as a function of the initial energy.

6 Conclusions

The Random Forest (RF) method, based on multiple decision trees, was extensively tested as an analysis tool for the γ/h separation of data obtained with the MAGIC telescope. In this paper we discuss many implementation details and the parameters a user has to become familiar with. We also compare the performance of RF with the more conventional technique of cuts in scaled image parameters, using MC data. In this comparison RF was shown to be superior to the classical method. This does not imply a general superiority of the RF method; practical experience shows that for a given data sample the conventional methods (like dynamical cuts or cuts in scaled image parameters) may be tuned to give results comparable (but not superior) to the RF method. A dedicated comparative study using MAGIC experimental data is still under way.

The RF method produces stable results and is robust with respect to the input parameters, even if they are strongly correlated. The method adjusts itself to the available multi-dimensional space with a minimum of human intervention: there are only a few tunable parameters, which can be chosen according to simple criteria (number of trees, number of trials in the random split selection, and final node size). This simpler control and tuning can be seen as a general advantage over conventional methods. Proper training samples are, however, important, as in any advanced method requiring a training process, i.e. one has to rely on a good Monte Carlo simulation. Using OFF or ON data as the hadron sample limits the MC dependence to the gamma showers, which are better understood than hadron showers. There remains, however, the need to correctly treat atmospheric conditions at different zenith angles, and a good knowledge of the detector is required.

Training and classification are fast: benchmarks on a 1.5 GHz PC (Athlon XP), with training and test samples each containing 10000 events, a total of 10 image parameters, 100 trees used for classification, each tree completely grown (nodesize = 1), and 3 trials in the random split selection, give one minute for training and 2 ms/event for classification. A comparable analysis technique like Neural Networks demands substantially more computer time for training.

Acknowledgement

We thank Jens Zimmermann for fruitful discussions about the RF method and for comparisons of the RF method with a Neural Net approach.

References

[1] A. M. Hillas: Proceedings of the 19th International Cosmic Ray Conference, ICRC 1985, La Jolla, 3 (1985) 445

[2] A. M. Hillas: Space Science Rev. 75 (1996) 17

[3] D. J. Fegan: J. Phys. G, Nucl. Part. Phys. 23 (1997) 1013

[4] F. Aharonian et al.: Astropart. Phys. 6 (1997) 343

[5] H. Krawczynski et al.: Astropart. Phys. 25 (2006) 380

[6] R. K. Bock, A. Chilingarian, M. Gaug, et al.: Nucl. Instr. and Methods A 516 (2004) 511

[7] T. Hengstebeck: PhD thesis, Mathematisch-Naturwissenschaftliche Fakultät I, Humboldt-Universität zu Berlin, März 2007. Available at http://edoc.hu-berlin.de/docviews/abstract.php?id=28015

[8] E. Lorenz: New Astron. Rev. 48 (2004) 339

[9] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone: Classification and Regression Trees, Wadsworth, 1983

[10] J. Albert et al.: Astrophys. Journal 664 (2007) L87

[11] J. Albert et al.: Astrophys. Journal 665 (2007) L51

[12] J. Albert et al.: Astrophys. Journal 669 (2007) 1143

[13] J. Albert et al.: to be published in Astrophys. Journal; preprint available at http://de.arxiv.org/abs/0705.3244

[14] L. Breiman: FORTRAN program Random Forests, Version 3.1, and L. Breiman: Manual On Setting Up, Using, And Understanding Random Forests V3.1, both available at http://oz.berkeley.edu/users/breiman

[15] R. Brun, F. Rademakers: http://root.cern.ch/

[16] J. Zimmermann: PhD thesis, Fakultät für Physik, Ludwig-Maximilians-Universität München, Juni 2005. Available at http://edoc.mpg.de/274832
