
Kaggle - Higgs Boson ML Challenge project report


This report describes the solution adopted in the Higgs Boson ML Challenge that won the 65th position on the competition leaderboard. The project was done by a team from the Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka.


Milinda Fernando, Tharindu Rusira, Chalitha Perera, Janindu Arukgoda, Shemil Hashan
Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka.


1. Introduction

The ATLAS experiment and the Compact Muon Solenoid (CMS) experiment recently claimed the discovery of the Higgs boson, the particle believed to be responsible for the mass of the other elementary particles. These experiments on subatomic particles run at the Large Hadron Collider (LHC) at CERN. From basic physics we know that unstable particles decay into more stable particles. The Higgs boson has many different processes through which it can decay, producing other particles. (In physics, a channel refers to the particles an unstable particle decays into.) Experiments have shown that the Higgs boson decays mainly into three channels, all of which are boson pairs. Most importantly, experiments have shown that the Higgs boson also decays into fermion pairs, namely tau-leptons or b-quarks. The main task of this experiment is to determine the characteristics of the tau and lepton sub-particles generated as a result of Higgs boson decay.

The Large Hadron Collider (LHC) accelerates bunches of protons and collides them, producing hundreds of millions of proton-proton collisions per second. Each collision is recorded as an event. The parameter values corresponding to these events are measured by the collider's sensors. The quantities measured directly by the sensors are known as primitive attributes, and these can be used to derive further attribute values. All the events generated by the collisions are written to a large CPU farm, producing petabytes of data per year. Of the saved events, the large majority are uninteresting; these are called background events. The goal of this challenge is to find a region in the feature space in which there is a significant excess of events (interesting events, called signal events) compared to what known background processes would produce.

The goal of the challenge is to develop a procedure that produces this selection region. Kaggle provides a training set with signal/background labels and weights, a test set (without weights and labels), and a formal objective representing an approximation of the median significance (AMS) of the counting test. The objective is a function of the weights of the selected events. In summary, our main goal is to perform a classification task that labels events as background or signal based on the given attributes. The performance of the classifier is measured by the AMS score it achieves.

This report presents our approach to solving the “Higgs Boson Machine Learning Challenge” on Kaggle [1]. We present a brief background of the problem and the evidence of submission of our solutions. We then discuss in detail the data analysis, the selection of machine learning algorithms, and the optimization of the XGboost algorithm using OpenTuner [2].

2. Background


The Higgs mechanism is an essential concept in particle physics that explains how matter gets mass. According to the Higgs mechanism, the Higgs field is a field that takes a constant value almost everywhere and acts like a glue that binds all matter together. The Higgs boson is the building block of the Higgs field. The Higgs mechanism was first proposed in 1964, and experiments to discover the Higgs boson have been carried out since then, which led to the construction of the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN). A new particle was discovered on 4th July 2012 at the LHC and was confirmed to have the expected behavior of the Higgs boson in March 2013, but it is yet to be determined whether it is an exact fit for the Higgs mechanism. ATLAS is one of the particle physics experiments constructed at the LHC. It has observed a signal of the Higgs boson decaying into two tau particles, hidden in background noise. Since the decay rate is an important property of a particle, the scientific value of this observation is significant.

2.1 Problem Statement [6]

The problem can be formally stated as follows. Let $D = \{(x_1, y_1, w_1), \dots, (x_n, y_n, w_n)\}$ be the training sample, where $x_i \in \mathbb{R}^d$ is a d-dimensional feature vector, $y_i \in \{b, s\}$ is the label, and $w_i \in \mathbb{R}^+$ is a non-negative weight. Let $S = \{i : y_i = s\}$ and $B = \{i : y_i = b\}$ be the index sets of signal and background events, and let $n_s = |S|$ and $n_b = |B|$ be the numbers of simulated signal and background events. There are two important differences between the simulated training set and a training set sampled in a natural way. First, since as many events as we want can be simulated given enough computational resources, the proportion $n_s/n_b$ of points in the two classes does not have to reflect the proportion of the prior class probabilities $P(y = s)/P(y = b)$. Given $P(y = s) \ll P(y = b)$, if $n_s$ and $n_b$ were proportional to the prior class probabilities, the training sample would be very unbalanced. Second, the events in the simulated training set are importance-weighted. Since the objective function (7) depends on the unnormalized sum of weights, to make the setup invariant to the numbers of simulated events $n_s$ and $n_b$, the sum of weights across each set (training, public test, private test, etc.) and each class (signal and background) is kept fixed:

$$\sum_{i \in S} w_i = N_s \quad \text{and} \quad \sum_{i \in B} w_i = N_b \qquad (1)$$

The physical meanings of $N_s$ and $N_b$ are the expected total numbers of signal and background events during the time interval of the data gathering. The individual weights are proportional to the conditional densities divided by the instrumental densities used by the simulator:


$$w_i \propto \begin{cases} p_s(x_i)/q_s(x_i) & \text{if } y_i = s \\ p_b(x_i)/q_b(x_i) & \text{if } y_i = b \end{cases} \qquad (2)$$

where $p_s(x_i) = p(x_i \mid y = s)$ and $p_b(x_i) = p(x_i \mid y = b)$ are the conditional signal and background densities and $q_s(x_i)$ and $q_b(x_i)$ are the instrumental densities. Let $g : \mathbb{R}^d \to \{b, s\}$ be an arbitrary classifier. Let the selection region $G = \{x : g(x) = s\}$ be the set of points classified as signal, and let $\hat{G} = \{i : g(x_i) = s\}$ denote the index set of points that are classified as signal.

Then, from equations (1) and (2),

$$s = \sum_{i \in S \cap \hat{G}} w_i \qquad (3)$$

is an unbiased estimator of the expected number of signal events selected by g,

$$\mu_s = N_s \int_G p_s(x)\,\mathrm{d}x \qquad (4)$$

and similarly,

$$b = \sum_{i \in B \cap \hat{G}} w_i \qquad (5)$$

is an unbiased estimator of the expected number of background events selected by g,

$$\mu_b = N_b \int_G p_b(x)\,\mathrm{d}x \qquad (6)$$

Here, $s$ and $b$ are luminosity-normalized estimates of the numbers of selected signal and background events. For a classifier $g$, the Gaussian significance of discovery of an experiment with $n$ observed events selected by $g$ is given by $(n - \mu_b)/\sqrt{\mu_b}$. Since $n$ can be estimated by $s + b$ and $\mu_b$ by $b$, a natural objective function for training $g$ is $s/\sqrt{b}$. Indeed, the first-order behavior of most significance measures is $s/\sqrt{b}$, but this approximation is only valid when $s \ll b$ and $b \gg 1$. Since this is often not the case in practice, the Approximate Median Significance (AMS) function, defined as follows, is used instead.


$$\mathrm{AMS} = \sqrt{2\left((s + b + b_{\mathrm{reg}})\ln\left(1 + \frac{s}{b + b_{\mathrm{reg}}}\right) - s\right)} \qquad (7)$$

where $s$ and $b$ are defined in equations (3) and (5) and $b_{\mathrm{reg}}$ is a regularization term set to the constant $b_{\mathrm{reg}} = 10$ in the challenge. Equation (7) with $b_{\mathrm{reg}} = 0$ is often used by high-energy physicists for optimizing the selection region for discovery significance. The participants' task in the challenge was to train a classifier $g$ on the training data $D$ with the goal of maximizing the AMS (equation (7)) on the test data sets.
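As a concrete illustration, the AMS of equation (7) can be computed directly from the weighted sums of equations (3) and (5). The following is a minimal Python sketch; the function names and the use of plain lists are ours, not part of the competition kit.

    import math

    def ams(s, b, b_reg=10.0):
        # Approximate Median Significance, equation (7)
        return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

    def ams_of_selection(weights, labels, selected, b_reg=10.0):
        # s and b as in equations (3) and (5): weighted counts of the
        # selected signal and background events
        s = sum(w for w, y, sel in zip(weights, labels, selected) if sel and y == 's')
        b = sum(w for w, y, sel in zip(weights, labels, selected) if sel and y == 'b')
        return ams(s, b, b_reg)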

3. Solution

In this section we explain the approaches we followed to solve the Higgs Boson challenge.

3.1 Data Analysis

The provided data set contains 250,000 records of collision events. Each instance is given as an event id, 30 attributes, and a weight column. The 30 attributes contain both primitive attributes (with the PRI prefix) and derived attributes (with the DER prefix). A preliminary analysis of the training data was done using the Weka visualization tool. In Figure 1, red represents the data values corresponding to background events while blue represents the data values corresponding to signal events.


Figure 1. Attribute value distributions as shown in WEKA visualizer

Here is a summary of all the attribute values that each tuple consists of. Variables prefixed with PRI (primitive) are raw quantities about the bunch collision as measured by the detector. Variables prefixed with DER (derived) are quantities computed from the primitive features.

Variables are floating point unless specified otherwise. All azimuthal φ angles are in radians in the [−π, +π] range. Energy, mass, and momentum are all in GeV; all other variables are unitless. Variables are indicated as “may be undefined” when they can be meaningless or cannot be computed; in this case, their value is −999.0, which is outside the normal range of all variables. The mass of the particles has not been provided, as it can safely be neglected for the challenge.
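For instance, the prevalence of the −999.0 placeholder can be checked directly on the training file with pandas. This is a small sketch assuming the Kaggle file layout ('training.csv' with one column per attribute):

    import pandas as pd

    df = pd.read_csv('training.csv')
    # count the -999.0 (undefined) entries per attribute
    undefined = (df == -999.0).sum()
    print(undefined[undefined > 0].sort_values(ascending=False))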


EventId: A unique integer identifier of the event. Not to be used as a feature.

DER_mass_MMC: The estimated mass $m_H$ of the Higgs boson candidate, obtained through a probabilistic phase space integration (may be undefined if the topology of the event is too far from the expected topology).

DER_mass_transverse_met_lep: The transverse mass between the missing transverse energy and the lepton.

DER_mass_vis: The invariant mass of the hadronic tau and the lepton.

DER_pt_h: The modulus of the vector sum of the transverse momentum of the hadronic tau, the lepton, and the missing transverse energy vector.

DER_deltaeta_jet_jet: The absolute value of the pseudorapidity separation between the two jets (undefined if PRI_jet_num ≤ 1).

DER_mass_jet_jet: The invariant mass of the two jets (undefined if PRI_jet_num ≤ 1).

DER_prodeta_jet_jet: The product of the pseudorapidities of the two jets (undefined if PRI_jet_num ≤ 1).

DER_deltar_tau_lep: The R separation between the hadronic tau and the lepton.

DER_pt_tot: The modulus of the vector sum of the missing transverse momenta and the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num ≥ 1) and the subleading jet (if PRI_jet_num = 2) (but not of any additional jets).

DER_sum_pt: The sum of the moduli of the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num ≥ 1), the subleading jet (if PRI_jet_num = 2) and the other jets (if PRI_jet_num = 3).

DER_pt_ratio_lep_tau: The ratio of the transverse momenta of the lepton and the hadronic tau.

DER_met_phi_centrality: The centrality of the azimuthal angle of the missing transverse energy vector w.r.t. the hadronic tau and the lepton.

DER_lep_eta_centrality: The centrality of the pseudorapidity of the lepton w.r.t. the two jets (undefined if PRI_jet_num ≤ 1).

PRI_tau_pt: The transverse momentum $\sqrt{p_x^2 + p_y^2}$ of the hadronic tau.

PRI_tau_eta: The pseudorapidity η of the hadronic tau.

PRI_tau_phi: The azimuth angle φ of the hadronic tau.

PRI_lep_pt: The transverse momentum $\sqrt{p_x^2 + p_y^2}$ of the lepton (electron or muon).

PRI_lep_eta: The pseudorapidity η of the lepton.

PRI_lep_phi: The azimuth angle φ of the lepton.


PRI_met: The missing transverse energy $E_T^{\mathrm{miss}}$.

PRI_met_phi: The azimuth angle φ of the missing transverse energy.

PRI_met_sumet: The total transverse energy in the detector.

PRI_jet_num: The number of jets (integer with value of 0, 1, 2 or 3; possible larger values have been capped at 3).

PRI_jet_leading_pt: The transverse momentum $\sqrt{p_x^2 + p_y^2}$ of the leading jet, that is, the jet with the largest transverse momentum (undefined if PRI_jet_num = 0).

PRI_jet_leading_eta: The pseudorapidity η of the leading jet (undefined if PRI_jet_num = 0).

PRI_jet_leading_phi: The azimuth angle φ of the leading jet (undefined if PRI_jet_num = 0).

PRI_jet_subleading_pt: The transverse momentum $\sqrt{p_x^2 + p_y^2}$ of the subleading jet, that is, the jet with the second largest transverse momentum (undefined if PRI_jet_num ≤ 1).

PRI_jet_subleading_eta: The pseudorapidity η of the subleading jet (undefined if PRI_jet_num ≤ 1).

PRI_jet_subleading_phi: The azimuth angle φ of the subleading jet (undefined if PRI_jet_num ≤ 1).

PRI_jet_all_pt: The scalar sum of the transverse momenta of all the jets of the event.

Label: The event label (string), $y_i \in \{s, b\}$ (s for signal, b for background). Not to be used as a feature. Not available in the test sample.

3.2 Missing/Invalid values

We observed that the value −999.0 (the undefined value) is frequent in many data instances. One simple way to overcome this problem is to remove the data records that contain this invalid value for at least one attribute, but since the number of such records is quite large, we might not have enough training data left after removing them. The other option was to remove the affected attributes (columns), but removing these attributes lowered the AMS score of the trained models because all of them are important in the classification process. Since we could not eliminate the rows with missing values, we declared the value as missing to the machine learning algorithm we used, so that the algorithm internally handles the missing values (for example by choosing default split directions or by imputation) and builds the classification model.
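In XGboost's Python interface this amounts to declaring −999.0 as the missing-value marker when building the data matrix. A minimal sketch, assuming the Kaggle training.csv layout (the column names are the ones listed in section 3.1):

    import pandas as pd
    import xgboost as xgb

    df = pd.read_csv('training.csv')
    X = df.filter(regex='^(DER|PRI)_').values    # the 30 feature columns
    y = (df['Label'] == 's').astype(int).values  # s -> 1, b -> 0
    w = df['Weight'].values                      # importance weights

    # -999.0 is treated as "missing"; the tree booster then learns a
    # default direction for missing entries at every split
    dtrain = xgb.DMatrix(X, label=y, weight=w, missing=-999.0)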

3.3 Machine learning and algorithm selection

Initially, we considered common, rudimentary classifiers such as the J48 (C4.5) decision trees available in WEKA, with 10-fold cross validation, to build our models, but the results were not satisfactory and the training phase took a long time to complete.


In the Kaggle forums, the XGboost (eXtreme Gradient Boosting) [4] algorithm was discussed in detail and considered to produce impressive results for the Higgs Boson challenge. Furthermore, XGboost has been seen to outperform R's gbm and scikit-learn's sklearn.ensemble.GradientBoostingClassifier. XGboost is an optimized gradient boosting library.

3.3.1 eXtreme Gradient Boosting

The XGboost tree booster algorithm is a variant of the well-known ensemble algorithm, gradient boosting. The tree booster is much like a decision tree model, except that the XGboost tree model uses a user-specified objective function rather than the traditional information gain or Gini index measures widely used in decision trees. The tree booster in XGboost is particularly appropriate for the Higgs Boson challenge because this implementation [4] supports instance weighting, where the weight reflects how important a given data row is in the training process. Furthermore, it was regarded as a potential solution in the public discussion forums. As a model validation technique, 10-fold cross validation was used at first. In later stages of the experiment, however, we managed to eliminate the time overhead of the cumbersome cross-validation process by using the AMS approximation file published here [7]. With the latter approach we could use the entire training data set for the training phase and measure model accuracy against the actual test data set. This gave us a more realistic view of the quality of each model, because the Kaggle platform measures model quality using the same test data set we used to test our models locally. In other words, it eliminated the need to rely on held-out portions of the training data set to measure model accuracy.
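A minimal weighted training and scoring loop then looks as follows. This is a sketch under the same assumptions as the snippet in section 3.2 (dtrain, y, w, and the ams() helper from section 2.1); the parameter values and the top-15% selection threshold are illustrative, not our tuned configuration:

    import numpy as np
    import xgboost as xgb

    params = {'objective': 'binary:logitraw', 'eval_metric': 'auc',
              'eta': 0.1, 'max_depth': 6}
    booster = xgb.train(params, dtrain, num_boost_round=300)

    # rank events by score and select the most signal-like ones
    scores = booster.predict(dtrain)
    cut = np.percentile(scores, 85)          # keep the top 15% as "signal"
    selected = scores > cut

    s = w[(y == 1) & selected].sum()         # equation (3)
    b = w[(y == 0) & selected].sum()         # equation (5)
    print('AMS =', ams(s, b))

In practice we scored candidate models against the approximated test-set AMS [7] rather than against the training set itself, as described above.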

3.4 Parameters for XGboost

The XGboost algorithm operates with three types of parameters:

- general parameters
- booster parameters
- task parameters

In this project, our main concern was to select a good set of values for these parameters instead of experimenting over a large set of different algorithms, because we already knew that XGboost was capable of producing good results. In typical machine learning applications, selecting a good algorithm and selecting optimal parameter values for it are equally important for obtaining a quality model. Here we adopted a novel approach to tuning the algorithm parameters. The typical method is to determine the parameter values after a careful study of the training data set; however, we decided to follow a different procedure, because fully comprehending the given data requires a sound knowledge of particle physics. We used OpenTuner [2], a framework for building program auto-tuners, to implement a tuner for the XGboost algorithm. We provided all tunable parameters and their legal domains to the auto-tuner, and then defined the quality measurement, the AMS score, as specified in the ATLAS experiment:


$$\mathrm{AMS} = \sqrt{2\left((s + b + b_{\mathrm{reg}})\ln\left(1 + \frac{s}{b + b_{\mathrm{reg}}}\right) - s\right)}$$

where all symbols have their usual meanings as specified in the problem statement (section 2.1). The auto-tuner tries to find a parameter configuration that maximizes the AMS score of a given model. This approach was so successful that the optimized solution was ranked 65th on the Kaggle leaderboard. The auto-tuning process improved model accuracy dramatically: the optimized model produced an AMS score of 3.712, whereas XGboost with default parameter values scored 3.60. We made only 19 submissions to reach the final position, which shows how fast the incremental optimizations converge the XGboost parameters towards a higher AMS score.

3.4.1 Algorithm (XGboost) optimization

OpenTuner uses a number of search strategies in its search for an optimal parameter configuration. Among these techniques [2], meta search techniques such as the multi-armed bandit with sliding window and area-under-the-curve credit assignment (AUC Bandit) are prominent, along with many others such as differential evolution and variants of the Nelder-Mead and Torczon hill climbers. In addition, OpenTuner allows user-defined search heuristics and domain-specific search algorithms to be implemented. The following shows how the selected parameters and their domains are fed into the auto-tuner. We selected the following set of parameters because auto-tuning all the parameters tended to overfit the model:

max_depth: [5, 10]
eta: [0.01, 0.5]
subsample: [0, 1]
min_child_weight: [0.1, 1]
colsample_bytree: [0.1, 1]
base_score: [0.1, 1]
gamma: [0.5, 3.5]

The portion of the source code that handles the parameter configuration is given below.

TREE_PARAM_INT = [('max_depth', 5, 10)]
TREE_PARAM_FLOAT = [('eta', 0.01, 0.5),
                    ('subsample', 0, 1),
                    ('min_child_weight', 0.5, 5),
                    ('colsample_bytree', 0.1, 1),
                    ('base_score', 0.1, 1),
                    ('gamma', 0.5, 3.5)]


def manipulator(self):
    manipulator = ConfigurationManipulator()
    for param, min, max in TREE_PARAM_INT:
        manipulator.add_parameter(IntegerParameter(param, min, max))
    for param, min, max in TREE_PARAM_FLOAT:
        manipulator.add_parameter(FloatParameter(param, min, max))
    manipulator.add_parameter(EnumParameter('objective',
        ['binary:logistic', 'binary:logitraw', 'rank:pairwise']))
    manipulator.add_parameter(EnumParameter('eval_metric', ['auc', 'error']))
    manipulator.add_parameter(IntegerParameter('num_trees', 100, 500))
    return manipulator

3.4.2 Run function of the auto-tuner

Figure 2. Workflow of the auto-tuner's run function

As mentioned in the previous section, we implemented an auto-tuner based on the OpenTuner framework in order to find an optimum parameter configuration for the XGboost algorithm. Given the parameter domains and the quality measurement, OpenTuner tries to find a parameter configuration in the search space that optimizes the quality function.
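The run function is where a candidate configuration is turned into an AMS score. The following condensed sketch shows the idea, assuming OpenTuner's MeasurementInterface and the xgboost API; evaluate_ams is a hypothetical stand-in for our actual scoring against the approximate solution file (section 3.3.1), and dtrain is as in section 3.2:

    import opentuner
    import xgboost as xgb
    from opentuner import MeasurementInterface, Result
    from opentuner.search.objective import MaximizeAccuracy

    class XgboostTuner(MeasurementInterface):

        def objective(self):
            # maximize the AMS score instead of the default (minimize time)
            return MaximizeAccuracy()

        def run(self, desired_result, input, limit):
            cfg = desired_result.configuration.data
            params = {k: v for k, v in cfg.items() if k != 'num_trees'}
            booster = xgb.train(params, dtrain,
                                num_boost_round=cfg['num_trees'])
            return Result(accuracy=evaluate_ams(booster))

    if __name__ == '__main__':
        XgboostTuner.main(opentuner.default_argparser().parse_args())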


Luboš Motl, a fellow competitor (the leader on the public leaderboard), published an approximate solution file for the Higgs Kaggle challenge [5]. The file includes approximated AMS values for the test set. We used this file to measure the quality of the models we built. As explained in section 3.3.1, in the actual implementation we did not use cross-validation for the models we built, but used an AMS approximation for the original test set, which gave us a metric to measure model quality.

4. Evidence for submission

Figure 3. Final standing of team SapientS in the Kaggle leaderboard as of 19th September, 2014

Figure 4. Team members


Figure 5. Ranked 65th out of 1785 contestants/teams (top 4%)

Figure 6.a Submissions (part1)


Figure 6.b Submissions (part2)

Figure 6.c Submissions (part3)


5. Results and Discussion

Figure 7 shows how the AMS varied with different parameter configurations of the XGboost algorithm. It can be seen that as the auto-tuner runs, the AMS tends to stabilize around an optimized value (3.71).

Figure 7. Variation of AMS against each tuning iteration

There were several reasons why we adopted this methodology for solving the Higgs Boson ML challenge. It is quite challenging to experiment with a large number of different algorithms to find the one that outputs the 'best' classifier; this process is purely experimental and consumes a lot of time. During the experiment, it occurred to us that instead of searching for the best algorithm, it is more productive to utilize an existing solution (i.e., the XGboost tree classifier) in the best possible way. So we devised a different experiment: tune the parameters of the algorithm so that it performs better than instances using the default parameter configuration. Furthermore, analyzing the data and attributes was a hard task due to our lack of domain knowledge in particle physics. However, we performed some statistical tests to check the importance of each attribute, and we observed that simply eliminating attributes did not make the solution better. So we decided to use all 30 attributes.
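For illustration, one quick way to sanity-check attribute importance with the xgboost API is via split counts; this is a sketch, not necessarily the statistical tests mentioned above (booster is a trained model as in section 3.3.1):

    # how often each feature is used for a split across the boosted trees
    importance = booster.get_fscore()
    for feature, count in sorted(importance.items(), key=lambda kv: -kv[1]):
        print(feature, count)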


This solution is actually an amalgamation of machine learning techniques and high performance computing strategies where the fundamental objective was to build an effective model in a very limited timeframe.

6. Future Work

In this project we experimented with a classical machine learning approach extended with a novel application of auto-tuning. Auto-tuning is a common practice in the fields of high performance and scientific computing. To the best of our knowledge, this is the first time OpenTuner was applied to a real-world machine learning problem, and the results have been promising. We wish to research the possibility of extending the solution into a machine learning framework based on OpenTuner. This will be an interesting topic due to OpenTuner's fast convergence and the system's capability to make the best use of a selected machine learning algorithm by optimizing its tunable parameters.

References

[1] Kaggle Higgs Boson ML Challenge competition link: https://www.kaggle.com/c/higgs-boson

[2] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, Saman Amarasinghe. OpenTuner: An Extensible Framework for Program Autotuning. International Conference on Parallel Architectures and Compilation Techniques. Edmonton, Canada. August 2014.

[3] Competition documentation. Accessed on 18th September 2014. Available [online]: http://higgsml.lal.in2p3.fr/files/2014/04/documentation_v1.8.pdf

[4] XGboost implementation on GitHub. Accessed on 18th September 2014. Available [online]: https://github.com/tqchen/xgboost

[5] Higgs Kaggle approximate solutions file by Luboš Motl. Accessed on 18th September 2014. Available [online]: https://onedrive.live.com/?cid=9cd81cfa06ff7718&id=9CD81CFA06FF7718!68751&ithint=folder,zip&authkey=!s4NY6ZGgB84$

[6] Claire Adam-Bourdarios, Glen Cowan, Cécile Germain, Isabelle Guyon, Balázs Kégl, David Rousseau. Learning to Discover.

[7] Luboš Motl's report on the Higgs Boson ML challenge and the use of AMS approximations for the test data set. Accessed on 18th September 2014. Available [online]: http://motls.blogspot.com/2014/08/kaggle-higgs-contest-solution-file-for.html