
Ensemble Learning: The Pros and Cons of Combining Multiple Models


Page 1: Ensemble Learning: The Pros and Cons of Combining Multiple Models

SOA Predictive Analytics Symposium
September 25, 2020

Ensemble Learning

Page 2: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Sometimes one model can’t do it all…


Page 3: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Ensemble Learning

Machine Learning algorithms are getting better all the time…

But each model is constrained to learn only some part of the structure of your data (e.g. the linear structure of a regression, or the nested dependencies inherent in trees).

Combining the strengths of different models can produce superior predictions


Page 4: Ensemble Learning: The Pros and Cons of Combining Multiple Models

What is an Ensemble?

• Combine predictions from multiple models generated from training data

• Models may be of the same class (e.g. trees) or of different classes

• Empirical studies show predictions from combinations of models often perform better than any single model: the “wisdom of crowds”


Page 5: Ensemble Learning: The Pros and Cons of Combining Multiple Models

One Model Is Rarely Uniformly Superior


Page 6: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Model Supervision vs Model Ensembles

The traditional approach is to hone a single model using the data, e.g. adding predictors to a regression model.

Complexity can be controlled by regularization and train/test splits.

Model ensembles approach this differently – instead of a single “master” model, we combine multiple weaker models.

The hope is that each model accurately captures some aspect of the data’s structure.

Page 7: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Why does ensemble learning work?

• Statistical: training data is small relative to the size of the space we want to search, so averaging reduces the risk of choosing the wrong classifier.

• Computational: individual models get stuck in local optima; averaging can reach a better overall prediction.

• Representational: individual learners may not span a large enough space to contain the true solution; combining them enables search in an expanded space.

Page 8: Ensemble Learning: The Pros and Cons of Combining Multiple Models

How to Build an Ensemble?

An ensemble of classifiers should be accurate (low bias) and diverse (so the average has lower variance).

• “Accurate” means better than random guessing – not a high bar.
• “Diverse” can mean the predictions are uncorrelated, or that the class of possible models spans the space well.

Ensemble Method:

Step 1: Develop a population of base learners (usually “weak” ones); trees are often used.
Step 2: Combine them to form a prediction.


Page 9: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Ensemble Learning: Sequential vs Parallel

Sequential ensemble: base learners are generated sequentially (e.g. AdaBoost).
• The basic motivation of sequential methods is to exploit the dependence between the base learners.
• Overall performance may be improved by weighting previously mislabeled examples more heavily.

Parallel ensemble: base learners are generated simultaneously without knowledge of each other (e.g. Random Forest).
• The basic motivation of parallel methods is to exploit independence between the base learners, since then the error can be reduced by averaging.
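As a minimal sketch of the two styles (scikit-learn on a synthetic dataset; all names and parameters here are illustrative), AdaBoost builds its learners one after another while a random forest grows them independently:

```python
# Sequential vs parallel ensembles: a minimal illustrative sketch.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Sequential: each new learner focuses on examples its predecessors mislabeled.
sequential = AdaBoostClassifier(n_estimators=100, random_state=0)

# Parallel: each tree is grown independently on its own bootstrap sample.
parallel = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("AdaBoost", sequential), ("Random Forest", parallel)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```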


Page 10: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Bias Variance Tradeoff

MSE, the expected squared error of the model, decomposes into bias and variance components:

MSE = Bias² + Variance + Irreducible Error

• Overfitting = high variance
• Underfitting = high bias
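A minimal simulation sketch of the tradeoff (the sine target, noise level, and tree depths are illustrative assumptions): refit a tree on many independent training sets and estimate the two components at a grid of test points.

```python
# Estimate bias^2 and variance of decision trees of several depths by
# refitting on many fresh training sets (illustrative setup).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.linspace(0, 2 * np.pi, 50)
f_true = np.sin(x_test)                          # noiseless target

for depth in (1, 3, 10):
    preds = []
    for _ in range(200):                         # 200 independent training sets
        x = rng.uniform(0, 2 * np.pi, 100)
        y = np.sin(x) + rng.normal(0, 0.3, 100)
        tree = DecisionTreeRegressor(max_depth=depth)
        preds.append(tree.fit(x[:, None], y).predict(x_test[:, None]))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"depth={depth}: bias^2={bias2:.3f}, variance={variance:.3f}")
```

Shallow trees show high bias and low variance; deep trees show the reverse.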


Page 11: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Bias Variance Tradeoff

E.g., a decision tree: with too few nodes, the piecewise-constant approximation is too crude (“bias”).

If we grow the tree too large, we could continue until each node has only one observation (“variance”) – future data sets are very unlikely to have the same exact characteristics.

Page 12: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Ensembles for classification vs regression

Once we have an ensemble of models, what do we do with it?

Classification: majority vote
Regression: averaging of predictions

Intelligent selection of models is better than averaging all the predictors we can cook up (see the sketch below):
• Select based on accuracy and diversity
• Remove the weakest learners and average the rest
• Not uniformly accurate: models will represent different aspects of the target function better
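A minimal numpy sketch of the two combination rules (the stacked prediction arrays are made up for illustration); for binary labels, a majority vote is simply a mean above one half:

```python
# Combining an ensemble's outputs: majority vote for classification
# (binary labels) and simple averaging for regression.
import numpy as np

# Each row is one model's class predictions for four cases.
class_preds = np.array([[0, 1, 1, 0],
                        [0, 1, 0, 0],
                        [1, 1, 1, 0]])
majority = (class_preds.mean(axis=0) > 0.5).astype(int)
print(majority)                     # -> [0 1 1 0]

# Each row is one model's numeric predictions for two cases.
reg_preds = np.array([[1.0, 2.0],
                      [1.2, 1.8],
                      [0.8, 2.2]])
print(reg_preds.mean(axis=0))       # -> [1. 2.]
```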


Page 13: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Ensemble Learning – Bagging

Bootstrap Aggregation (Bagging): sample uniformly and with replacement from the training set.

The size of each sample can be small or large.

Build a predictor on each sample and average the results.

Works well for unstable learning algorithms, e.g. decision trees and neural networks.
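A minimal from-scratch sketch of the procedure (the function name and sample sizes are illustrative):

```python
# Bagging from scratch: bootstrap-sample the training set, fit one tree
# per sample, and average the predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_new, n_models=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # sample uniformly WITH replacement
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_new))
    return np.mean(preds, axis=0)          # aggregate by averaging
```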

Page 14: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Ensemble Learning – Random Forests
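The slide's graphic is not reproduced here. As a minimal scikit-learn sketch (synthetic data for illustration), a random forest is bagging of decision trees plus a random subset of candidate features considered at each split:

```python
# Random forest: bagged trees plus random feature subsetting per split.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
forest = RandomForestRegressor(
    n_estimators=200,        # number of bootstrapped trees
    max_features="sqrt",     # features considered per split (decorrelates trees)
    random_state=0,
).fit(X, y)
print(forest.predict(X[:3]))
```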


Page 15: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Ensemble Learning – Extremely Randomized Trees

Randomness goes one step further: the splitting thresholds themselves are randomized.

Instead of looking for the most discriminative threshold, a threshold is drawn at random for each candidate feature, and the best of these becomes the splitting rule.

This reduces the variance of the model, at the expense of a slightly greater increase in bias.
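A minimal scikit-learn sketch (synthetic data for illustration); note that by default each extra-tree sees the full training set rather than a bootstrap sample:

```python
# Extremely randomized trees: split thresholds drawn at random per
# candidate feature; no bootstrap sampling by default.
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
extra = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, y)
print(extra.predict(X[:3]))
```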


Page 16: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Ensemble Learning – Boosting

• A family of learning algorithms that convert weak learners into strong ones
• Data points are initially equally weighted
• Apply weak learners (e.g. shallow decision trees) and calculate the error
• Increase the weight of misclassified points, decrease the weight of correctly classified points
• Iterate, and finally predict using a weighted average of the most recently constructed N models, weighted by overall accuracy (see the sketch below)


(Figure: boosting illustration by Sirakorn – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=85888769)
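A from-scratch sketch of the reweighting loop above (discrete AdaBoost for labels in {-1, +1}, with stumps as the weak learners; function names are illustrative):

```python
# Discrete AdaBoost sketch: reweight points after each round so the next
# stump concentrates on the mistakes; combine by accuracy-weighted vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):       # y must be in {-1, +1}
    w = np.full(len(y), 1.0 / len(y))      # start with equal weights
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight from accuracy
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified points
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)                      # weighted majority vote
```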

Page 17: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Ensemble of Neural Networks – “Committee of Networks”

• A collection of networks with the same configuration and different initial random weights is trained on the same dataset.

• Each model is then used to make a prediction, and the final prediction is calculated as the average of the individual predictions.

• The number of models in the ensemble is often kept small, both because of the computational expense of training models and because of the diminishing returns in performance from adding more ensemble members.

• Ensembles may be as small as three, five, or ten trained models.
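A minimal sketch of a committee (network size, seeds, and data are illustrative assumptions):

```python
# Committee of networks: identical architecture, different random initial
# weights, predictions averaged.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

committee = [
    MLPRegressor(hidden_layer_sizes=(32,), random_state=seed,
                 max_iter=2000).fit(X, y)
    for seed in range(5)     # same configuration, different initializations
]
avg_pred = np.mean([net.predict(X) for net in committee], axis=0)
```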

Page 18: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Bayesian Ensemble Learning

Bayesian Adaptive Regression Trees (BART)

BART uses a prior and a likelihood to construct a sequence of trees using a Markov chain Monte Carlo (MCMC) method.

Prior distributions are defined for tree growth, the distribution in the terminal nodes, and the residual error.

Four possible moves at each step: grow, prune, swap terminal nodes, or switch a parent/child node.

Empirical studies show high predictive accuracy


Page 19: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Injecting randomness to improve estimates and avoid the “local minima” pitfall (see the sketch after this list):

Initial weights for a neural network or for boosting can be set randomly… different classifiers can be produced with different seeds.

A decision tree can select its splitting node among a random sample of variables or cut points.

Apply random weights to the training set.

Randomly select train/test sets.

Randomly choose subsets of predictors in a regression model.
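A minimal sketch of the last idea (sizes and data are illustrative assumptions): fit regressions on randomly chosen predictor subsets and average their predictions.

```python
# Randomness injection via random predictor subsets for regression.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=400, n_features=12, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

preds = []
for _ in range(25):
    cols = rng.choice(X.shape[1], size=6, replace=False)  # random subset
    model = LinearRegression().fit(X[:, cols], y)
    preds.append(model.predict(X[:, cols]))
ensemble_pred = np.mean(preds, axis=0)   # average across the random models
```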


Page 20: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Evaluation of Ensemble Model

• Harder to inspect these models and check them for reasonability
• Risk of overfitting the training data (so many parameters!)
• Train/test/validation samples and/or cross-validation become more important
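A minimal sketch of those guardrails (illustrative data and model): cross-validate on the training portion only, then score once on a held-out test set.

```python
# Evaluating an ensemble: cross-validation for selection, held-out test
# set for the final check.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = GradientBoostingClassifier(random_state=0)
print(cross_val_score(model, X_tr, y_tr, cv=5).mean())  # model selection
print(model.fit(X_tr, y_tr).score(X_te, y_te))          # final held-out check
```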


Page 21: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Summary Statistics for Ensemble Models

• Importance scores: which variables are most relevant, and the relative influence or contribution of each variable

• Shapley values: a game-theory application that identifies each member's contribution to the prediction

• Interaction statistics: which variables interact with which others, and the strength and degree of those interactions

• Partial dependence plots: the nature of the dependence of the response on influential inputs, e.g. whether the response increases monotonically with a predictor


Page 22: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Variable Importance

In a regression model, a coefficient's magnitude relative to its standard error can indicate variable importance.

The average improvement in purity or reduction in error from a decision tree split (e.g. Gini or entropy) is another measure.

Randomly permute the values of a feature and see by how much the estimation error of the model increases. We don't want to delete the feature outright, since then we would need to reconstruct the model.

These metrics can be evaluated on training or test data.
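A minimal sketch of the permutation approach using scikit-learn's implementation (data and model are illustrative):

```python
# Permutation importance: shuffle one feature at a time and measure how
# much the model's score degrades.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=0)
print(result.importances_mean)   # mean score drop per permuted feature
```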


Page 23: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Shapley Statistic

From game theory: it estimates the reward earned by players in a cooperative game (meaning, proportional to their contribution).

The reward can be construed as the reduction in mean squared error relative to a naïve model.

It calculates the marginal contribution of each predictor to every coalition excluding that predictor, weighted relative to the total number of predictors (see the formula below).
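In symbols (the standard Shapley value, with F the full predictor set and v(S) the payoff, e.g. the MSE reduction, achieved by coalition S):

```latex
\phi_j \;=\; \sum_{S \subseteq F \setminus \{j\}}
\frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
\,\Bigl( v\bigl(S \cup \{j\}\bigr) - v(S) \Bigr)
```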


Page 24: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Variable Interaction

Variable importance calculations will be misleading if there are significant interactions.

Partial dependence plots marginalize out the other variables: the feature of interest is swept over a grid while predictions are averaged over the observed values of the remaining variables (see the sketch below).


Friedman’s H-Statistic identifies whether higher-order interactions exist (do features j and k interact in any way?)
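A minimal from-scratch sketch of a partial-dependence calculation (illustrative data and model):

```python
# Partial dependence: sweep feature j over a grid, keep all other columns
# at their observed values, and average the predictions at each grid point.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=400, n_features=6, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

j = 0                                        # feature of interest
grid = np.linspace(X[:, j].min(), X[:, j].max(), 20)
pd_curve = []
for v in grid:
    X_mod = X.copy()
    X_mod[:, j] = v                          # force feature j to grid value v
    pd_curve.append(model.predict(X_mod).mean())
# pd_curve traces the average prediction as a function of feature j
```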

Page 25: Ensemble Learning: The Pros and Cons of Combining Multiple Models

Actuarial Applications

• Pricing estimation – ensemble learning can be used both to create new estimates and to indicate the optimal way to combine them

• Actuaries are already familiar with a simple ensemble: the blending of experience and manual rates

• From too few estimators to too many

• Partial and Semi-Partial correlation
