Practical Advice for Debugging ML Algorithms
Stephen Gould, Cheng Soon Ong, Mark Reid
November 2015
What is 255 + 1?

import numpy as np
x = np.array([0, 1, 41, 255], dtype='uint8')
x += 1
print(x)

Answer: 0 (unsigned 8-bit arithmetic wraps around, so this prints [1 2 42 0])
What is 16,777,216 + 1?

import numpy as np
x = np.array([16777216], dtype='float32')
x += 1
print(x)

Answer: 16,777,216 (16,777,216 = 2^24; a float32 significand has only 24 bits, so 2^24 + 1 is not representable and rounds back down)
Implementation Issues
• Numerical calculations on a computer are always subject to errors
• Machine learning algorithms are full of numerical calculations
• Numerical errors can be due to:
  • limited precision arithmetic (we just saw two examples)
  • algorithmic limitations (e.g., generating true random numbers)
  • careless implementations (we will see some examples soon)
  • bugs!!!
Numerical Robustness Example
Consider the simple problem of computing a vector norm,

‖x‖ = sqrt( Σ_i x_i² )

Problem: numerical overflow or underflow (each x_i² can overflow or underflow even when ‖x‖ itself is representable)
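A minimal sketch (not from the slides) of the standard fix: rescale by the largest magnitude before squaring, so the intermediate values stay in range.

import numpy as np

def naive_norm(x):
    # squaring first can overflow to inf (or underflow to 0) even when
    # the norm itself is representable
    return np.sqrt(np.sum(x * x))

def robust_norm(x):
    # rescale by the largest magnitude so every squared term is <= 1
    s = np.max(np.abs(x))
    if s == 0.0:
        return 0.0
    return s * np.sqrt(np.sum((x / s) ** 2))

x = np.array([1e200, 1e200])
print(naive_norm(x))   # inf (overflow)
print(robust_norm(x))  # approximately 1.414e200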
Numerical Robustness Example (2)
The standard deviation of a set of measurements x_1, …, x_N can be calculated as

σ = sqrt( (1/N) Σ_i (x_i − μ)² ), where μ = (1/N) Σ_i x_i,

but this takes two passes through the data.
Numerical Robustness Example (3)
A “better approach” is to perform the following equivalent calculation,

σ = sqrt( (1/N) Σ_i x_i² − μ² ),

which only requires one pass through the data.
What can go wrong with this implementation?
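One answer, sketched below on a hypothetical large-mean dataset: the two accumulated terms nearly cancel, and in finite precision the subtraction loses all significant digits (it can even go negative, making the square root NaN).

import numpy as np

def two_pass_std(x):
    mu = np.mean(x)                          # pass 1: the mean
    return np.sqrt(np.mean((x - mu) ** 2))   # pass 2: spread about the mean

def one_pass_std(x):
    # E[x^2] - E[x]^2: mathematically equivalent, numerically fragile
    return np.sqrt(np.mean(x * x) - np.mean(x) ** 2)

x = 1e8 + np.random.randn(1000)   # large mean, unit variance
print(two_pass_std(x))            # close to 1.0
print(one_pass_std(x))            # wildly inaccurate, possibly nan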
Bugs in ML Algorithms are Hard to Find
As an example, let's say we are trying to minimise a scalar function f(x) using (damped) Newton's method, which updates iterates as

x_{t+1} = x_t − η f′(x_t) / f″(x_t)

for some step size 0 < η ≤ 1.
Bugs in ML Algorithms are Hard to Find
What happens if we introduce a small bug?
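A sketch with a hypothetical objective (not the function from the slides): a one-character slip in the gradient still lets damped Newton converge smoothly to a plausible-looking answer, which is exactly why such bugs are hard to spot.

import numpy as np

# hypothetical objective: f(x) = log(1 + exp(-x)) + x^2 / 2 (strictly convex)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
f = lambda x: np.log1p(np.exp(-x)) + 0.5 * x ** 2
df = lambda x: x - sigma(-x)                        # correct gradient
df_bug = lambda x: x - sigma(x)                     # subtle sign slip
d2f = lambda x: 1.0 + sigma(x) * (1.0 - sigma(x))   # second derivative

def damped_newton(grad, x=2.0, eta=0.8, iters=25):
    for _ in range(iters):
        x -= eta * grad(x) / d2f(x)   # x_{t+1} = x_t - eta f'(x_t) / f''(x_t)
    return x

for grad in (df, df_bug):
    x_star = damped_newton(grad)
    print(x_star, f(x_star))   # both runs converge; only the first is correct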
Feature Scaling
• Numerical algorithms work best on well-scaled data. A common pre-processing step is to scale the input to have zero mean and unit variance (sometimes called whitening), e.g., x̃ = (x − μ) / σ (see the sketch below).
• Note. Estimating the scaling parameters must not use test set data.
• Feature scaling does not change the “strength” of a classifier but it does help with convergence during training.
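A minimal sketch (with hypothetical arrays X_train and X_test) that respects the note above by estimating μ and σ on the training set only:

import numpy as np

X_train = np.random.randn(100, 4) * 10 + 3   # hypothetical raw features
X_test = np.random.randn(20, 4) * 10 + 3

mu = X_train.mean(axis=0)            # estimated on the training set only
sigma = X_train.std(axis=0) + 1e-8   # guard against constant features

X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma   # reuse the training-set parameters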
Feature Scaling Example
• Dataset: Iris [Fisher, 1936], three classes, four features, 50 examples per class
• Feature vector: squared raw features plus bias term
• Classifier: multi-class logistic (a.k.a. soft-max)
Numerical Issues Take Home Message
Wherever possible, use tried-and-tested third-party implementations
(but remember that not all open source code is created equal)
Machine Learning Pipeline
Regression Example
A classic regression problem is to fit a curve (e.g., a polynomial) to a set of points
Adapted from http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html
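A minimal curve-fitting sketch in the spirit of that scikit-learn example (the ground-truth function, noise level, and sample size here are assumptions):

import numpy as np

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.cos(1.5 * np.pi * x) + rng.randn(30) * 0.1   # noisy samples of a curve

for degree in (1, 4, 15):
    coeffs = np.polyfit(x, y, degree)    # least-squares polynomial fit
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(degree, mse)   # training error keeps falling even as degree 15 overfits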
Dataset Partitioning
• training set: learn model parameters
• validation set: tune meta-parameters (e.g., regularisation strength, number of iterations, etc.)
• test/evaluation set: report performance (ideally used only once!)
[Figure: a dataset partitioned into training, validation, and test sets]
[Figure: a 10-fold split of the data, with fold 6 held out as the test fold]
Cross-validation
Cross-validation is a common method used to estimate how well a model will generalise to unseen data.
• K-fold: Split the data into K sets of roughly equal size. For the k-th fold, train the model on the other K − 1 parts and test on the k-th part. We can now use all the data to estimate the prediction error (see the sketch after this list).
• Leave-one-out (LOOCV): set K to the size of the dataset.
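A minimal K-fold sketch; fit and error are hypothetical stand-ins for your training and evaluation routines:

import numpy as np

def k_fold_cv(X, y, fit, error, K=10, seed=0):
    idx = np.random.RandomState(seed).permutation(len(y))   # shuffle once
    folds = np.array_split(idx, K)                          # K roughly equal parts
    errs = []
    for k in range(K):
        train = np.hstack([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])                     # train on K - 1 parts
        errs.append(error(model, X[folds[k]], y[folds[k]])) # test on the k-th part
    return np.mean(errs)   # prediction error estimated using all the data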
Dataset Bias
Every dataset is biased.
Don’t think dataset bias won’t happen to you
Sampling Strategies on Regression Example
Adapted from http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html
• random sampling
  • Example: curve fitting
  • Example: classifying video frames
• unbalanced datasets
  • stratified sampling
  • data weighting (re-sampling with replacement)
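A stratified split sketch (y holds integer class labels; the 20% test fraction is an assumption):

import numpy as np

def stratified_split(y, test_frac=0.2, seed=0):
    rng = np.random.RandomState(seed)
    train_idx, test_idx = [], []
    for c in np.unique(y):   # sample within each class separately
        idx = rng.permutation(np.where(y == c)[0])
        n_test = int(round(test_frac * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)   # class proportions preserved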
Unbalanced Datasets (Random Sampling)
[Figure: an unbalanced dataset split into training and test sets by random sampling]
Unbalanced Datasets (Stratified Sampling)
[Figure: the same unbalanced dataset split by stratified sampling, preserving class proportions in both sets]
Confusion Matrix for Classification Problems
• (i, j) entry: number of examples of class i that were predicted as class j
• row sum: number of ground-truth examples of class i
• column sum: number of examples predicted as class j
• diagonal sum: number of correctly classified examples
• total sum: number of total examples
• Need not have equal number of rows and columns.
• Sometimes you will see the matrix transposed.
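A from-scratch sketch of the convention above (rows index ground truth, columns index predictions):

import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    C = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1   # (i, j): class i examples predicted as class j
    return C

C = confusion_matrix(np.array([0, 0, 1, 2, 2]), np.array([0, 1, 1, 2, 0]), 3)
print(C)
print(np.trace(C) / C.sum())   # diagonal sum over total sum = accuracy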
Accuracy: Macro vs. Micro Averaging
• Often we care about overall classification accuracy. This is an example of micro-averaging,
  accuracy_micro = ( Σ_i C_ii ) / ( Σ_ij C_ij )
• However, sometimes we have an unbalanced dataset but wish to treat each class equally. This is an example of macro-averaging,
  accuracy_macro = (1/K) Σ_i ( C_ii / Σ_j C_ij )
• More generally, we may also want to compute weighted accuracy.
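Both averages computed from a confusion matrix, on hypothetical unbalanced two-class counts:

import numpy as np

C = np.array([[90, 10],   # 100 examples of the majority class
              [5, 5]])    # 10 examples of the minority class

micro = np.trace(C) / C.sum()                 # every example counts equally
macro = np.mean(np.diag(C) / C.sum(axis=1))   # every class counts equally
print(micro)   # 0.864: dominated by the majority class
print(macro)   # 0.700: (0.9 + 0.5) / 2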
Precision and Recall
• Terminology:
  • true positive (TP), hit, detection
  • true negative (TN), correct rejection
  • false positive (FP), false alarm, type I error
  • false negative (FN), miss, type II error
Derived statistic                                     Equation
recall, true positive rate, sensitivity, hit rate     TP / (TP + FN)
positive predictive power, precision                  TP / (TP + FP)
true negative rate, specificity                       TN / (TN + FP)
accuracy                                              (TP + TN) / (TP + FP + TN + FN)
F1-score                                              2 · precision · recall / (precision + recall)
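The table's formulas computed directly on hypothetical binary labels:

import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))   # hits
fp = np.sum((y_pred == 1) & (y_true == 0))   # false alarms
fn = np.sum((y_pred == 0) & (y_true == 1))   # misses

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)   # 0.75 0.75 0.75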
Precision-Recall Curves
Precision-Recall Curve Operating Points
classification rule: Pr(y = 1 | x) > t
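Sweeping the threshold t traces out the curve, one operating point per threshold; the scores and labels below are hypothetical:

import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2])   # Pr(y = 1 | x)
labels = np.array([1, 1, 0, 1, 0, 1, 0])

for t in np.unique(scores):
    pred = scores > t   # one operating point per threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    if tp + fp > 0:
        print(t, tp / (tp + fp), tp / (tp + fn))   # threshold, precision, recall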
Comparing PR Curves
Below we plot precision-recall curves for two algorithms; which is better?
Other Ways to Compare Algorithms
Analysis Take Home Message
1. Measure everything
   • This will save analysis and debugging time later
2. Choice of metrics can have a huge effect on interpretation of results
3. Ask questions of your metrics
Repeatability: Controlling Randomness
• Often you will want to compare different variants of an algorithm
• However, comparing different runs can be difficult if the algorithm is stochastic
• transform the random algorithm A(x) into a deterministic algorithm A′(x, r), where r is a sequence of random numbers
• one way to do this is to use random seeds (np.random.seed(0))
“random chance seems to have operated in our favour”
[Figure: a randomized algorithm maps x → y; the equivalent deterministic algorithm maps (x, r) → y]
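A minimal sketch of the A(x) → A′(x, r) idea, using a seed to stand in for r (noisy_eval is a hypothetical stochastic routine):

import numpy as np

def noisy_eval(x, seed):
    rng = np.random.RandomState(seed)   # the seed makes the randomness part of the input
    return x + rng.randn()

print(noisy_eval(1.0, seed=0) == noisy_eval(1.0, seed=0))   # True: repeatable
print(noisy_eval(1.0, seed=0) == noisy_eval(1.0, seed=1))   # False (almost surely)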
Exploring Features and Meta-parameters
[Figure: a tree of experiments branching out from a baseline model: change regularisation, change label weights, collect more data, add features, run for more iterations]
Feature Selection
Often the number of available features is very large but only a small number of them are relevant. Some recent methods try to learn features directly from data; often, however, we are faced with the task of coming up with a good set of features manually.
• Filter Feature Selection: Use a computationally cheap heuristic to evaluate features, e.g., mutual information between a feature and the class labels.
• Wrapper Feature Selection: Incrementally add the best feature to a feature set (forward feature selection) or remove the worst features from the feature set (backward feature selection).
Example: Forward Feature Selection
• Start with an empty feature set, F = {}
• Repeatedly try each feature i ∉ F, create Fi = F ∪ {i} and use cross-validation to evaluate Fi. Set F to the best Fi.
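A greedy sketch of the loop above; cv_score is a hypothetical function returning cross-validated accuracy for a feature subset:

import numpy as np

def forward_selection(num_features, cv_score, max_size=5):
    F = set()   # start with an empty feature set
    while len(F) < max_size:
        candidates = [F | {i} for i in range(num_features) if i not in F]
        scores = [cv_score(Fi) for Fi in candidates]
        F = candidates[int(np.argmax(scores))]   # keep the best single addition
    return F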
Diagnostics
• Diagnostics are about working out why your algorithm is not giving you the performance you want. What could the problem be?
  • problem statement
  • data
  • features
  • algorithm/model
  • implementation
  • something else
• Take time to set up a good experimental framework for repeated experiments
“give me six hours to cut down a tree and I will spend the first four sharpening my axe”
Visualise Your Data
Diagnostic Example
Suppose that our test error is unacceptably high and we suspect the problem is either that the model is overfitting or the features are not good enough.
Diagnostic:
• The first hypothesis (overfitting) suggests that the training error will be much lower than the test error
• The second hypothesis (features) suggests that the training error and test error will both be high
Learning Curves
[Figure: learning curves plotting error rate against training set size for the training set and the test set]
Learning Curves: Bias vs. Variance
[Figure: two learning-curve panels (error rate vs. training set size, with a target error rate): high variance shows a large gap between training and test error; high bias shows both errors levelling off above the target]
Bias/Variance Trade-off
[Figure: error rate vs. model complexity, with training and test curves and a target error rate; the high-bias regime lies at low complexity and the high-variance regime at high complexity]
Fixes for Bias/Variance Problems
Diagnosing bias and variance problems provides us with hints as to what to try next.
For bias problems:
• try a larger set of features
• try a richer model class

For variance problems:
• try getting more training examples
• try a smaller set of features
Objective/Optimisation Problems
We may suspect that our poor performance is due to either a problem with our optimisation algorithm (e.g., not running for long enough) or a problem with our objective.
Unfortunately it is often very difficult to determine whether an iterative algorithm has converged.
[Figure: error rate vs. iteration count, flattening out: converged?]
Diagnosing Optimisation Problems
Suppose we care about maximising some accuracy measure, perf(θ), and a learning algorithm is trying to minimise a surrogate, loss(θ).
• Let θ* be the parameters returned by our learning algorithm
• Let θ† be any other parameters (e.g., guesses or obtained from a different learning algorithm)

                        perf(θ†) > perf(θ*)    perf(θ†) < perf(θ*)
loss(θ*) < loss(θ†)     wrong objective        no problem (?)
loss(θ*) > loss(θ†)     poor optimisation      poor optimisation (got lucky)

(columns: what we care about, higher is better; rows: what we optimise, lower is better)
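The table expressed as a sketch; loss, perf, and the two parameter settings are hypothetical stand-ins:

def diagnose(loss, perf, theta_star, theta_dagger):
    # theta_star: returned by our learner; theta_dagger: any alternative
    if perf(theta_dagger) > perf(theta_star):
        if loss(theta_star) < loss(theta_dagger):
            return "wrong objective"   # lower loss, yet worse performance
        return "poor optimisation"     # the better solution also has lower loss
    if loss(theta_star) < loss(theta_dagger):
        return "no problem (?)"
    return "poor optimisation (got lucky)"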
Fixes for Optimisation/Objective Problems
Diagnosing optimisation versus objective problems provides us with hints as to what to try next.
For optimisation problems:
• try running for more iterations
• try using a different algorithm (e.g., Newton’s method instead of gradient descent)
• try random restarts (e.g., for non-convex objectives)
• try smoothing
For objective problems:
• try different regularisation
• try weighting training examples
• try a different loss function
• change the model
Approximate Search Algorithms
Suppose we are using an approximate nearest neighbour algorithm to find similar objects. We define a similarity measure that our algorithm can use.
How can we tell if we have a problem with the nearest neighbour algorithm or our similarity measure?
• Let x† be a match found by the algorithm
• Let x* be a hand selected match (ground-truth)
• If similarity(x, x†) < similarity(x, x*) then the problem is with the search algorithm (it failed to find a higher-scoring match that exists)
• Otherwise, initialise the approximate nearest neighbour algorithm with the true solution:
  • If the algorithm moves away from the true solution then the problem is with the measure
  • Otherwise the problem is with the nearest neighbour algorithm
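The procedure as a sketch; similarity and ann_search (with an optional warm start) are hypothetical stand-ins for your components:

def diagnose_nn(x, x_dagger, x_star, similarity, ann_search):
    # x_dagger: match found by the algorithm; x_star: hand-selected ground truth
    if similarity(x, x_dagger) < similarity(x, x_star):
        return "problem with the search algorithm"   # it missed a higher-scoring match
    if ann_search(x, init=x_star) != x_star:
        return "problem with the measure"   # it moves away from the true solution
    return "problem with the nearest neighbour algorithm"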
Diagnostic Summary
• Diagnostics are an important tool when developing your machine learning algorithms
  • We showed examples for bias/variance, optimisation/objective, and search/score, but there are many others
• Diagnostics can save a lot of wasted effort by guiding your choice of what to try next
• They also allow you to develop insights into your particular application and justify your design decisions
• Diagnostics often involve repeated experiments with different parameter settings while keeping everything else fixed
• Another important diagnostic tool is that of error analysis, i.e., understanding where your algorithm is making mistakes
Error Analysis
Error analysis tries to explain the difference between current performance and perfect performance.
• How much error is due to the various machine learning components in the application?
  Plug the ground truth (if available) into each component of the application and see how it affects accuracy. Alternatively, we could add noise to each component and, again, see how it affects accuracy.
• Does the algorithm fail on a particular subclass of examples?
  Visualise the data and results!
Ablative Analysis
Ablative analysis tries to explain the difference between some baseline performance and the current performance.
Example: You’ve been working on your application for the past several months and now have a number of sophisticated features that you pass to a classifier. Which features account for the good performance of your classifier over a baseline classifier with some simple features?
Ablative analysis removes features from the application one at a time and sees which removal results in the biggest decrease in performance, similar to backward feature selection.
• Note that the order of removal matters.
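A sketch of the loop; components and evaluate are hypothetical stand-ins for your feature groups and scoring routine:

def ablative_analysis(components, evaluate):
    full_score = evaluate(components)
    for i, c in enumerate(components):
        reduced = components[:i] + components[i + 1:]   # remove one at a time
        print(c, full_score - evaluate(reduced))        # performance drop without it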
Diagnosing Your Implementation
Whenever you write some code or assemble machine learning components into a pipeline you’ll want to test your implementation.
• run against small synthetic test cases
• see what happens with ground-truth features
• see what happens with random features
• check boundary cases
• re-use known working components
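A tiny synthetic test in this spirit: plant known parameters, generate noise-free data, and check that the implementation recovers them (the least-squares solver below stands in for whatever component you are testing):

import numpy as np

rng = np.random.RandomState(0)
w_true = np.array([2.0, -3.0, 0.5])   # planted ground-truth parameters
X = rng.randn(50, 3)
y = X @ w_true                        # noise-free synthetic labels

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_true), "failed on a trivially easy synthetic case"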
Diagnostics Take Home Messages
1. Visualise your data and learning progress
2. Develop diagnostic tests
3. Use good software development practices
   • Revision control, revision control, revision control
“experimental confirmation of a prediction is merely a measurement; experimental disproving of a prediction is a discovery”