An introduction to machine learning for particle physics

1

AN INTRODUCTION TO MACHINE LEARNING FORPARTICLE PHYSICS

ANDREW JOHN LOWE3 AUGUST 2016

2

ABOUT THE SPEAKER

Research Fellow, Wigner Research Centre for Physics, Hungary (2013–present)

Postdoctoral Fellow, California State University, Fresno, USA (2010–2012)

Postdoctoral Fellow, Indiana University, USA (2008–2009)

PhD student, Royal Holloway, University of London, UK (2001–2008)

MSc student, Royal Holloway, University of London, UK (2000–2001)

Assistant Research Scientist, National Physical Laboratory, UK (1998–2000)

BSc student, Royal Holloway, University of London, UK (1993–1996)

3

WHAT IS DATA SCIENCE?

The Data Science Venn Diagram, by Drew Conway

4

ANOTHER INTERPRETATION

People with all these skills are rare!

5

WHAT DOES THE “BIG DATA” LANDSCAPE LOOK LIKEOUTSIDE PARTICLE PHYSICS?

67

WHERE IS MACHINE LEARNING USED?

89

WHAT IS MACHINE LEARNING?Arthur Samuel (1959): Field of study that gives computers theability to learn without being explicitly programmedTom Mitchell (1998): A computer program is said to learn fromexperience with respect to some task and some performancemeasure , if its performance on , as measured by , improveswith experience Traditional programming versus machine learning:

E TP T P

E

10

MACHINE LEARNING & PARTICLE PHYSICSMachine learning is more or less what is commonly known inparticle physics as multivariate analysis (MVA)Used for many years but faced widespread scepticismUse of multivariate pattern recognition algorithms was basicallytaboo in new particle searches until recentlyMuch prejudice against using what were considered “black box”selection algorithmsArtificial neural networks and Fisher discriminants (lineardiscriminant analysis) were used somewhat in the 1990’sBoosted Decision Trees (AdaBoost, 1996) is the favouritealgorithm used for many analyses (1st use: 2004)

, Helge Voss, 2015 J. Phys.: Conf. Ser.608 (2015) 012058; , 19 May 2015, CERN;

, Hai-Jun Yang et al., Nucl.Instrum.Meth.A543 (2005) 577-584

Successes, Challenges and Future Outlook of Multivariate Analysis In HEPHiggs Machine Learning Challenge visits CERN Boosted Decision Trees

as an Alternative to Artificial Neural Networks for Particle Identification

http://iopscience.iop.org/article/10.1088/1742-6596/608/1/012058/meta

http://indico.cern.ch/event/382895/

http://arxiv.org/abs/physics/0408124

11

STATISTICAL LEARNING VERSUS MACHINE LEARNINGMachine learning arose as a subfield of Artificial Intelligence.Statistical learning arose as a subfield of StatisticsThere is much overlapMachine learning has a greater emphasis on large scaleapplications and prediction accuracyStatistical learning emphasizes models and their interpretability,and precision and uncertaintyBut the distinction has become more and more blurred, and thereis a great deal of “cross-fertilisation”In the following, we’ll use the term “machine learning”, regardlessof the origin of specific methods

12

TYPES OF MACHINE LEARNINGSupervised learning: learn from examples and make predictionsfor new dataUnsupervised learning: looking for patterns in the dataOther types exist (e.g., reinforcement learning, semi-supervisedlearning, recommender systems)

13

SUPERVISED LEARNINGWe have an outcome measurement (also called dependentvariable, response, target)We have a vector of predictor measurements (also calledregressors, covariates, features, attributes, independentvariables)In the regression problem, is quantitative (e.g., price, bloodpressure, voltage)In the classification problem, is categorical (e.g., dead/alive,signal/background, digit 0–9, particle type, malignant/benign)We have training data : observations(examples, instances, cases, events) of these measurementsLearns a mapping from the inputs to the outputs

Y

p X

Y

Y

( , ), … , ( , )x1 y1 xN yN N

14

SUPERVISED LEARNING OBJECTIVESOn the basis of the training data we would like to:

Accurately predict unseen test casesUnderstand which inputs affect the outcomeAssess the quality of our predictions and inferences

15

UNSUPERVISED LEARNINGNo outcome variable, just a set of predictors (features) measuredon a set of samplesObjective is more fuzzy: find groups of samples that behavesimilarly, find features that behave similarly, find linearcombinations of features with the most variationDifficult to know how well your are doingDifferent from supervised learning, but can be useful as a pre-processing step for supervised learning

16

AN EXAMPLE OF UNSUPERVISED LEARNING IN HEPJET RECONSTRUCTION

A jet is a cone-shaped spray of hadrons and other particlesproduced by the fragmentation of a quark or a gluonJet finding involves clustering (gathering together) chargedparticle tracks and/or calorimeter energy deposits in a detectorExact definition of what constitutes a jet in the detector dependsa lot on the specific algorithm used to build jets from the hadronsthat are detectedJet finding is the approximate attempt to reverse-engineer theprocess of hadronisationSeveral different approaches and algorithms exist, but the mostpopular are sequential recombination algorithms (akahierarchical agglomerative clustering)Cluster analysis is a large sub-field of machine learning

17

JET RECONSTRUCTION

Jets are viewed as a proxy to the initial quarks and gluons that wecan’t measure and are a common feature in high-energy particle

collisions

18

A SIX-JET EVENT SEEN BY CMS

View along detector beam axis

19

LINEAR REGRESSION WITH ONE VARIABLEIn regression problems, we are taking input variables and trying tofit the output onto a continuous expected result functionWe have a single output (we are doing supervised learning) anda single feature Our hypothesis function has the general form: Choose so that is close to for training examples For notational compactness, we can arrange the parameters in avector:

yx

= + xy θ0 θ1,θ0 θ1 y y (x, y)

Θ = [ ]θ0

θ1

20

COST (OR LOSS) FUNCTIONWe can measure the accuracy of our hypothesis function by usinga cost function (also known as a loss function)Estimate parameters as the values that minimise the followingcost function (sum of squared residuals):

J(Θ) = ∑i=1

N

( + )yi yi2

Squaring the residuals ensures their contribution to isalways positive

is a measure of how bad our fit is to the training data

+yi yi J

J

21

GRADIENT DESCENTWe want to find the minimum of the cost function The gradient descent algorithm is:Start at some Repeat until convergence (measured by some stopping criterion):

J

= ( , )Θ0 θ0 θ1

= + α J( )Θj+1 Θj��Θj

Θj

is the learning rateIntuitively, what we are doing is walking down the cost function in the direction of the tangent to , with step size controlled bythe learning rate , to find the minimum of

αJ

Jα J

22

PROBLEMS WITH GRADIENT DESCENTIf is too small, gradient descent can be slowIf is too big, gradient descent might overshoot the minimum, failto converge, or even divergeIf has more than one minimum, gradient descent might get stuckin a local minimum instead of finding the global minimum: thesolution can be sensitive to starting point

αα

J

Θ0

23 . 1

IMPROVEMENTS TO GRADIENT DESCENTMore sophisticated algorithms than gradient descent are used in practice,but the general principle is the sameIf you are performing linear regression and your training sample has manyobservations, gradient descent would require a large summation beforegradient descent can make a single step can be very slow!

We can parallelise the computation by breaking the summation intochunks and spreading them across several CPUs

For each step of gradient descent, we can perform the summation on arandomly chosen single training example; this is stochastic gradientdescent (SGD)

SGD’s path to a minimum jitters and is less direct,SGD unlikely to converge at a minimum and will instead wanderaround it randomly, but usually result is close enough

Other, more advanced, algorithms are available that are often faster thangradient descent and have the advantage that there is no need tomanually pick

→

α

23 . 2

COMPARISON OF GRADIENT-BASED MINIMISATION METHODS

MULTIVARIATE LINEAR REGRESSION

24

MULTIVARIATE LINEAR REGRESSION Linear regression with multiple variables is a trivial extension2 We have a feature vector — our set of features:2

X = [ , , ⋯ , ]x0 x1 xp

For convenience of notation, we define 2 = 1x0

We define our hypothesis function as follows:2

= + + ⋯ + = Xy θ0 θ1x1 θpxp ΘT

Our cost function is (where is the vector of all output values):2 y

J(Θ) = (y + XΘ (y + XΘ))T

Our gradient descent rule becomes:2= + α J( )Θj+1 Θj Θj

25

FEATURE SCALINGSuppose we are performing multivariate linear regression withtwo features which differ greatly in scale

will have the shape of a bowl that has been “stretched” —boat shaped, or like a long thin valleyGradient descent can be slow as it oscillates inefficiently down tothe minimum, bouncing between opposite sides of the “valley”Speed up gradient descent by making the features similar in scaleWe can make the ranges of the features the sameUsually apply z-score normalisation: , where

and are the sample mean and standard deviation of feature Some machine learning methods (e.g., support vector machines)require feature scaling, and it’s good practice to do it anyway,although it’s not always necessaryOften log transform features with skewed distributions

,x1 x2J(Θ)

= ( + )/x′i xi xi si xi

si i

26

POLYNOMIAL REGRESSIONWe can improve our features and the form of our hypothesisfunction in a couple different ways.Our hypothesis function need not be linear if that does not fit thedata wellWe can add higher-order terms that can also account forinteractions between featuresFeature scaling is required to avoid features blowing up (e.g., if has range – then range of becomes – )

For a single variable, our hypothesis function looks like this:

x11 103 x2

1 1 106

= + + + ⋯ +y θ0 θ1x1 θ2x21 θpxp

1

27

POLYNOMIAL REGRESSION ON CAR DATA

What degree polynomial is best? We’ll discuss how this choice is made later.

28

LOGISTIC REGRESSIONNow switching from regression problems to classificationproblemsDon’t be confused by the name “logistic regression”; it is namedthat way for historical reasonsTo start with, consider a binary classification problem where

, where 0 is usually taken as the “negative class” and 1as the “positive class” and

0 = background, 1 = signal0 = good email, 1 = spametc.

We could use linear regression to predict the class probabilities,and map all predictions 0.5 as 1 and 0.5 as 1, but we run intoproblems with probabilities less than 0 or greater than 1Our hypothesis function should satisfy:

y ! {0, 1}

< >

0 } } 1y

29

LOGISTIC FUNCTIONThe function

shown here, maps any real number to the (0,1) interval, making ituseful for transforming an arbitrary-valued function into afunction better suited for classification

f (x) =1

1 + e+x

30

LOGISTIC REGRESSION HYPOTHESIS REPRESENTATION Logistic regression uses the function:2

P(X) =e + +⋯+θ0 θ1x1 θpxp

1 + e + +⋯+θ0 θ1x1 θpxp

or, more compactly:2

P(X) =e XΘT

1 + e XΘT

A bit of rearrangement gives:2

log( ) = + + ⋯ + = XP(X)

1 + P(X)θ0 θ1x1 θpxp ΘT

This transformation is called the log odds or logit of 2 P(X)

31

FROM PROBABILITIES TO CLASS PREDICTIONSTo estimate the parameters , we can use gradient descent withthe cost function:

Θ

J(Θ) = (+ log( ) + (1 + y log(1 + ))yT y )T y

In order to get our discrete 0 or 1 classification, we can translatethe output of the hypothesis function as follows:

~ 0.5 ⟹ y = 1y

< 0.5 ⟹ y = 0y

The decision boundary is the line that separates the area where and where y = 0 y = 1

32

LOGISTIC REGRESSION DECISION BOUNDARYLogistic regression classifies new samples based on any thresholdyou want, so it doesn’t inherently have one decision boundary; herewe show a contour plot for a range of threshold values:

33

CONFUSION MATRIXA confusion matrix is a (contingency) table that is often used todescribe the performance of a classification model, for example:

Types of errors:False positive rate: The fraction of negative examples that areclassified as positiveFalse negative rate: The fraction of positive examples that areclassified as negative

We can change the two error rates by changing the thresholdfrom 0.5 to some other value in [0, 1]

34

ROC CURVEA receiver operating characteristic (ROC), or ROC curve, is a graphicalplot that illustrates the performance of a binary classifier:

Displays both the false positive and false negative ratesThe ROC curve is traced out as we change the thresholdSometimes we use the AUC or area under the curve to summarisethe overall performance (higher AUC is good)

35

THE PROBLEM OF OVERFITTINGIf we have too many features, the learned hypothesis may fit thetraining data very well but fail to generalize to new examplesExample: linear regression

36

AN EXAMPLE OF OVERFITTING IN CLASSIFICATION

37

REGULARISATIONRegularization is designed to address the problem of overfittingThis technique that constrains or regularises the coefficientestimates, or equivalently, that shrinks the coefficient estimatestowards zeroIt may not be immediately obvious why such a constraint shouldimprove the fit, but it turns out that shrinking the coefficientestimates can significantly reduce their variance

38

RIDGE REGRESSIONThe least squares fitting procedure in linear regression estimatesthe parameters for features using the values that minimise:Θ p

J(Θ) = =∑i=1

N

( + )yi yi2

∑i=1

N

+ +⎛⎝⎜⎜yi θ0 ∑

j=1

p

θjxij

⎞⎠⎟⎟

2

In contrast, the ridge regression coefficient estimates are thevalues that minimize (where is a tuning parameter to bedetermined separately):

λ ~ 0

J(Θ) = + λ∑i=1

N

+ +⎛⎝⎜⎜yi θ0 ∑

j=1

p

θjxij

⎞⎠⎟⎟

2

∑j=1

p

θ2j

39

RIDGE REGRESSION: CONTINUEDAs with least squares, ridge regression seeks coefficient estimatesthat fit the data well, by making the RSS smallHowever, the second term in the cost function, , called

the shrinkage penalty, is small when the are close to zero, and soit has the effect of shrinking the estimates of towards zeroThe tuning parameter serves to control the relative impact ofthese two terms on the regression coefficient estimatesThe shrinkage penalty sets the budget for how large thecoefficient estimates can be; we minimise the RSS subject to

Selecting a good value for is critical; use cross-validationRegularisation used for both regression and classificationFeature scaling must be used due to the sum of squaredcoefficients in the shrinkage term

λ*j θ2j

ΘΘ

λ

} s*j θ2j

λ

40

REGULARISATION: SOME INTUITIONBy shrinking some of the coefficient estimates towards zero, wereduce the influence of some of the features on the model,thereby making it less complex; the model flexibilityWe can use regularisation to smooth the output of our hypothesisfunction to reduce overfittingIf is chosen to be too large, it may smooth out the function toomuch and cause underfittingTo help you think about this:

Imagine you’re performing a least squares fit to some data withsticks of spaghetti; how long you cook the spaghetti forcontrols how well you can bend it to fit the data!If you overcook, you’ll be able to fit every data point, in effectmemorising the training data — but the resultant model will notgeneralise to new data points, and is useless

λ

λ

41

SELECTING FOR RIDGE REGRESSIONλ(A BRIEF LOOK-AHEAD AT CROSS-VALIDATION, BEFORE WE EXPLORE IN MORE DEPTH)

Ridge regression requires a method to determine which of themodels under consideration is bestThat is, we require a method selecting a value for the tuningparameter Cross-validation provides a simple way to tackle this problem. Wechoose a grid of values, and compute the cross-validation errorrate for each value of We then select the tuning parameter value for which the cross-validation error is smallestFinally, the model is re-fit using all of the available observationsand the selected value of the tuning parameter

λ

λλ

42

TRAINING ERROR VERSUS TEST ERRORThe test error is the average error that results from using astatistical learning method to predict the response on a newobservation, one that was not used in training the methodIn contrast, the training error can be easily calculated by applyingthe statistical learning method to the observations used in itstrainingBut the training error rate often is quite different from the testerror rate, and in particular the former can dramaticallyunderestimate the latter

43

TRAINING- VERSUS TEST-SET PERFORMANCE

Left: underfitting, right: overfittingOptimal model is where the test error is at a minimum

44

VALIDATION SET APPROACH TO ESTIMATE TEST ERRORHere we randomly divide the available set of samples into twoparts: a training set and a validation or hold-out setThe model is fit on the training set, and the fitted model is used topredict the responses for the observations in the validation setValidation set error provides an estimate of the test errorMost common scheme used in HEP, if any is used at all!

45

MODEL SELECTION AND TRAIN/VALIDATION/TEST SETSIf we perform model selection or tune paramaters such aspolynomial degree (for polynomial regression) or theregularisation parameter using the test set, the error estimate isvery likely to be overly optimisticTo solve this, we can introduce a third set, the cross-validation set,to serve as an intermediate set that we can use to perform tuningand model selection, e.g.:

Optimise the parameters in using the training set for eachtuning parameter valuePick the tuning parameter with the lowest error evaluated onthe CV setEstimate the generalisation error using the test set

A typical train/CV/test split is 60%/20%/20%

λ

Θ

46

DRAWBACKS OF VALIDATION SET APPROACHThe validation estimate of the test error can be highly variable,depending on precisely which observations are included in thetraining set and which observations are included in the validationsetIn the validation approach, only a subset of the observations —those that are included in the training set rather than in thevalidation set — are used to fit the modelThis suggests that the validation set error may tend tooverestimate the test error for the model fit on the entire data set

47

-FOLD CROSS-VALIDATIONKWidely used approach for estimating test errorEstimates can be used to select best model, and to give an idea ofthe test error of the final chosen modelIdea is to randomly divide the data into equal-sized parts. Weleave out part , fit the model to the other parts(combined), and then obtain predictions for the left-out th partThis is done in turn for each part The results from the folds can then be averaged to produce asingle estimationThe advantage of this method over the simple hold-out method isthat all observations are used for both training and validation, andeach observation is used for validation exactly onceTypical values for are 5 or 10

Kk K + 1

kk = 1, 2,… , K

k

K

48

-FOLD CROSS-VALIDATION ILLUSTRATEDK

49

WHAT TO DO NEXT?What do we do next, after having used cross-validation to choosea value of a tuning parameter (e.g., )?It may be an obvious point, but worth being clear: we now fit ourestimator to the entire training set using the tuning parametervalue we obtainedDeploy model!

λ

50

MODEL SELECTION VERSUS MODEL EVALUATIONCross-validation is used for both model selection/tuning and toevaluate the test error of the final chosen modelShould we use the error obtained on the CV set during modelselection/tuning as an estimate of the test error of the finalchosen model? No!The CV error estimate is likely to be overly optimistic, because weobtained it using the same observations as those used toselect/tune the model to optimise performance!The solution is to use an outer loop of cross-validation to estimatethe performance of the final model

51

NESTED -FOLD CROSS VALIDATION ILLUSTRATEDK

52

NESTED CROSS VALIDATION: SOME FINAL DETAILSIf you use cross validation to estimate the hyperparameters of amodel and then use those hyperparameters to fit a model to thewhole dataset, then that is fine, provided that you recognise thatthe cross validation estimate of performance is likely to be(possibly substantially) optimistically biasedThis is because part of the model (the hyperparameters) havebeen selected to optimise the cross validation performance, so ifthe cross validation statistic has a non-zero variance (and it will)there is the possibility of overfitting the model selection criterionIf you want to choose the hyperparameters and estimate theperformance of the resulting model then you need to perform anested cross-validation, where the outer cross validation is usedto assess the performance of the model, and the inner crossvalidation is used to determine the hyperparameters

53

INFORMATION LEAKAGEInformation leakage is the creation of unexpected additionalinformation in the training data, allowing a model or machinelearning algorithm to make unrealistically good predictionsLeakage is a pervasive challenge in applied machine learning,causing models to over-represent their generalization error andoften rendering them useless in the real worldGiven that methods such as cross validation and other steps toprotect against information leakage are largely unknown in theHEP community, it’s reasonable to assume that it occurs in HEPanalyses frequently

54

INFORMATION LEAKAGE EXAMPLESIf you evaluate the performance of your final model using testdata, and tune your model to get a better result, you are using thetest data for tuning!If you apply feature scaling on the whole data set before creatingpartitions or folds, the means and standard deviations (and anydata transformation parameters) have seen the test data!If you choose the features for your model before creatingpartitions or folds, you are using information from the test data,and you get choose uninformative features that are spuriouslycorrelated with the output useless final model!If you tune any cuts using the test data, your final model is likely tohave overly optimistic performance!

This is true even if you are using simple one-dimensionalrectangular cuts on variables — i.e., the traditional cut-basedmethod commonly used in HEP!

→

55

FEATURE SELECTIONPROBLEMS OF TOO MANY FEATURES

Correlated features can skew predictionIrrelevant features (not correlated to class variable) causeunnecessary blowup of the model spaceIrrelevant features can drown the information provided byinformative features in noiseIrrelevant features in a model reduce its explanatory value (alsowhen predictive accuracy is not reduced)Training may be slower and more computationally expensiveIncreased risk of overfitting

56

REDUNDANT & IRRELEVANT FEATURESWhat should we do when it is likely that the data contains manyredundant or irrelevant features?Redundant features are those which provide no moreinformation than the currently selected featuresIrrelevant features provide no useful information in any contextVarious methods exist to remove them:

Wrapper methods consider the selection of a set of features as asearch problem, where different combinations are prepared,evaluated and compared to other combinations — typicallyslow but give good performanceFilter methods apply a statistical measure to assign a scoring toeach feature — fast, but consider each feature independentlyEmbedded methods learn which features best contribute to theaccuracy of the model while the model is being created

57

CHOOSING A CLASSIFIERWe have only considered simple linear models so farLinear models are seldom correct, but they can be informative!In some cases, the true relationship between the input and outputmight actually be close to linear — so a more complex model willnot increase performance muchSimple models may do as well as complex models if the featuresare uniformativeIn any case, your “go-to” algorithm before trying anything elseshould be a simple linear model; this will give you a baseline withwhich to compare againstBoosted decision trees and neural nets are fashionable in HEPnowadays, but in some cases a simple model might get reasonableperformance and be more interpretable, and is probably easy totune and faster to train

58

One advantage of crude models is that we know they are crude andwill not try to read too much from them. With more sophisticated

models,

… there is an awful temptation to squeeze thelemon until it is dry and to present a picture of the

future which through its very precision andverisimilitude carries conviction. Yet a man who

uses an imaginary map, thinking it is a true one, islike to be worse off than someone with no map atall; for he will fail to inquire whenever he can, toobserve every detail on his way, and to search

continuously with all his senses and all hisintelligence for indications of where he should go.

From Small is Beautiful by E. F. Schumacher

58From Small is Beautiful by E. F. Schumacher

5960 . 1

THAT’S PROBABLY ENOUGH THEORY FOR NOW!