Practical Advice for Debugging ML Algorithms
Stephen Gould, Cheng Soon Ong, Mark Reid
November 2015
What is 255 + 1?

import numpy as np
x = np.array([0, 1, 41, 255], dtype='uint8')
x += 1
print(x)

Answer: 0 (unsigned 8-bit arithmetic wraps around, so this prints [1 2 42 0])
What is 16,777,216 + 1?

import numpy as np
x = np.array([16777216], dtype='float32')
x += 1
print(x)

Answer: 16,777,216 (16,777,216 = 2^24; a float32 significand has only 24 bits, so 2^24 + 1 is not representable and rounds back down)
Implementation Issues
• Numerical calculations on a computer are always subject to errors
• Machine learning algorithms are full of numerical calculations
• Numerical errors can be due to:
  • limited precision arithmetic (we just saw two examples)
  • algorithmic limitations (e.g., generating true random numbers)
  • careless implementations (we will see some examples soon)
  • bugs!!!
Numerical Robustness Example
Consider the simple problem of computing a vector norm,

‖x‖ = sqrt( Σ_i x_i² )

Problem: numerical overflow or underflow (each x_i² can overflow or underflow even when ‖x‖ itself is representable)
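A minimal sketch (not from the slides) of the standard fix: rescale by the largest magnitude before squaring, so the intermediate values stay in range.

import numpy as np

def naive_norm(x):
    # squaring first can overflow to inf (or underflow to 0) even when
    # the norm itself is representable
    return np.sqrt(np.sum(x * x))

def robust_norm(x):
    # rescale by the largest magnitude so every squared term is <= 1
    s = np.max(np.abs(x))
    if s == 0.0:
        return 0.0
    return s * np.sqrt(np.sum((x / s) ** 2))

x = np.array([1e200, 1e200])
print(naive_norm(x))   # inf (overflow)
print(robust_norm(x))  # approximately 1.414e200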
Numerical Robustness Example (2)
The standard deviation of a set of measurements x_1, …, x_N can be calculated as

σ = sqrt( (1/N) Σ_i (x_i − μ)² ), where μ = (1/N) Σ_i x_i,

but this takes two passes through the data.
Numerical Robustness Example (3)
A “better approach” is to perform the following equivalent calculation,

σ = sqrt( (1/N) Σ_i x_i² − μ² ),

which only requires one pass through the data.
What can go wrong with this implementation?
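One answer, sketched below on a hypothetical large-mean dataset: the two accumulated terms nearly cancel, and in finite precision the subtraction loses all significant digits (it can even go negative, making the square root NaN).

import numpy as np

def two_pass_std(x):
    mu = np.mean(x)                          # pass 1: the mean
    return np.sqrt(np.mean((x - mu) ** 2))   # pass 2: spread about the mean

def one_pass_std(x):
    # E[x^2] - E[x]^2: mathematically equivalent, numerically fragile
    return np.sqrt(np.mean(x * x) - np.mean(x) ** 2)

x = 1e8 + np.random.randn(1000)   # large mean, unit variance
print(two_pass_std(x))            # close to 1.0
print(one_pass_std(x))            # wildly inaccurate, possibly nan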
Bugs in ML Algorithms are Hard to Find
As an example, let's say we are trying to minimise a scalar function f(x) using (damped) Newton's method, which updates iterates as

x_{t+1} = x_t − η f′(x_t) / f″(x_t)

for some step size 0 < η ≤ 1.
Bugs in ML Algorithms are Hard to Find
What happens if we introduce a small bug?
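A sketch with a hypothetical objective (not the function from the slides): a one-character slip in the gradient still lets damped Newton converge smoothly to a plausible-looking answer, which is exactly why such bugs are hard to spot.

import numpy as np

# hypothetical objective: f(x) = log(1 + exp(-x)) + x^2 / 2 (strictly convex)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
f = lambda x: np.log1p(np.exp(-x)) + 0.5 * x ** 2
df = lambda x: x - sigma(-x)                        # correct gradient
df_bug = lambda x: x - sigma(x)                     # subtle sign slip
d2f = lambda x: 1.0 + sigma(x) * (1.0 - sigma(x))   # second derivative

def damped_newton(grad, x=2.0, eta=0.8, iters=25):
    for _ in range(iters):
        x -= eta * grad(x) / d2f(x)   # x_{t+1} = x_t - eta f'(x_t) / f''(x_t)
    return x

for grad in (df, df_bug):
    x_star = damped_newton(grad)
    print(x_star, f(x_star))   # both runs converge; only the first is correct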
Feature Scaling
• Numerical algorithms work best on well-scaled data. A common pre-processing step is to scale the input to have zero mean and unit variance (sometimes called whitening), e.g., x̃ = (x − μ) / σ (see the sketch below).
• Note. Estimating the scaling parameters must not use test set data.
• Feature scaling does not change the “strength” of a classifier but it does help with convergence during training.
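A minimal sketch (with hypothetical arrays X_train and X_test) that respects the note above by estimating μ and σ on the training set only:

import numpy as np

X_train = np.random.randn(100, 4) * 10 + 3   # hypothetical raw features
X_test = np.random.randn(20, 4) * 10 + 3

mu = X_train.mean(axis=0)            # estimated on the training set only
sigma = X_train.std(axis=0) + 1e-8   # guard against constant features

X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma   # reuse the training-set parameters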
Feature Scaling Example
• Dataset: Iris [Fisher, 1936], three classes, four features, 50 examples per class
• Feature vector: squared raw features plus bias term
• Classifier: multi-class logistic (a.k.a. soft-max)
Numerical Issues Take Home Message
Wherever possible, use tried-and-tested third-party implementations
(but remember that not all open source code is created equal)
Machine Learning Pipeline
Regression Example
A classic regression problem is to fit a curve (e.g., a polynomial) to a set of points
Adapted from http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html
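A minimal curve-fitting sketch in the spirit of that scikit-learn example (the ground-truth function, noise level, and sample size here are assumptions):

import numpy as np

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.cos(1.5 * np.pi * x) + rng.randn(30) * 0.1   # noisy samples of a curve

for degree in (1, 4, 15):
    coeffs = np.polyfit(x, y, degree)    # least-squares polynomial fit
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(degree, mse)   # training error keeps falling even as degree 15 overfits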
Dataset Partitioning
• training set: learn model parameters
• validation set: tune meta-parameters (e.g., regularisation strength, number of iterations, etc.)
• test/evaluation set: report performance (ideally used only once!)
[Figure: a dataset partitioned into training, validation, and test sets]
[Figure: a 10-fold split of the data, with fold 6 held out as the test fold]
Cross-validation
Cross-validation is a common method used to estimate how well a model will generalise to unseen data.
• K-fold: Split the data into K sets of roughly equal size. For the k-th fold, train the model on the other K − 1 parts and test on the k-th part. We can now use all the data to estimate the prediction error (see the sketch after this list).
• Leave-one-out (LOOCV): set K to the size of the dataset.
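A minimal K-fold sketch; fit and error are hypothetical stand-ins for your training and evaluation routines:

import numpy as np

def k_fold_cv(X, y, fit, error, K=10, seed=0):
    idx = np.random.RandomState(seed).permutation(len(y))   # shuffle once
    folds = np.array_split(idx, K)                          # K roughly equal parts
    errs = []
    for k in range(K):
        train = np.hstack([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])                     # train on K - 1 parts
        errs.append(error(model, X[folds[k]], y[folds[k]])) # test on the k-th part
    return np.mean(errs)   # prediction error estimated using all the data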
Dataset Bias
Every dataset is biased.
Don’t think dataset bias won’t happen to you
Sampling Strategies on Regression Example
Adapted from http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html
• random sampling
  • Example: curve fitting
  • Example: classifying video frames
• unbalanced datasets
  • stratified sampling
  • data weighting (re-sampling with replacement)
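A stratified split sketch (y holds integer class labels; the 20% test fraction is an assumption):

import numpy as np

def stratified_split(y, test_frac=0.2, seed=0):
    rng = np.random.RandomState(seed)
    train_idx, test_idx = [], []
    for c in np.unique(y):   # sample within each class separately
        idx = rng.permutation(np.where(y == c)[0])
        n_test = int(round(test_frac * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)   # class proportions preserved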
Unbalanced Datasets (Random Sampling)
[Figure: an unbalanced dataset split into training and test sets by random sampling]
Unbalanced Datasets (Stratified Sampling)
[Figure: the same unbalanced dataset split by stratified sampling, preserving class proportions in both sets]
Confusion Matrix for Classification Problems
• (i, j) entry: number of examples of class i that were predicted as class j
• row sum: number of ground-truth examples of class i
• column sum: number of examples predicted as class j
• diagonal sum: number of correctly classified examples
• total sum: number of total examples
• Need not have equal number of rows and columns.
• Sometimes you will see the matrix transposed.
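A from-scratch sketch of the convention above (rows index ground truth, columns index predictions):

import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    C = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1   # (i, j): class i examples predicted as class j
    return C

C = confusion_matrix(np.array([0, 0, 1, 2, 2]), np.array([0, 1, 1, 2, 0]), 3)
print(C)
print(np.trace(C) / C.sum())   # diagonal sum over total sum = accuracy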
Accuracy: Macro vs. Micro Averaging
• Often we care about overall classification accuracy. This is an example of micro-averaging,
  accuracy_micro = ( Σ_i C_ii ) / ( Σ_ij C_ij )
• However, sometimes we have an unbalanced dataset but wish to treat each class equally. This is an example of macro-averaging,
  accuracy_macro = (1/K) Σ_i ( C_ii / Σ_j C_ij )
• More generally, we may also want to compute weighted accuracy.
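Both averages computed from a confusion matrix, on hypothetical unbalanced two-class counts:

import numpy as np

C = np.array([[90, 10],   # 100 examples of the majority class
              [5, 5]])    # 10 examples of the minority class

micro = np.trace(C) / C.sum()                 # every example counts equally
macro = np.mean(np.diag(C) / C.sum(axis=1))   # every class counts equally
print(micro)   # 0.864: dominated by the majority class
print(macro)   # 0.700: (0.9 + 0.5) / 2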
Precision and Recall
• Terminology:
  • true positive (TP), hit, detection
  • true negative (TN), correct rejection
  • false positive (FP), false alarm, type I error
  • false negative (FN), miss, type II error
Derived statistic                                     Equation
recall, true positive rate, sensitivity, hit rate     TP / (TP + FN)
positive predictive power, precision                  TP / (TP + FP)
true negative rate, specificity                       TN / (TN + FP)
accuracy                                              (TP + TN) / (TP + FP + TN + FN)
F1-score                                              2 · precision · recall / (precision + recall)
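The table's formulas computed directly on hypothetical binary labels:

import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))   # hits
fp = np.sum((y_pred == 1) & (y_true == 0))   # false alarms
fn = np.sum((y_pred == 0) & (y_true == 1))   # misses

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)   # 0.75 0.75 0.75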
Precision-Recall Curves
Precision-Recall Curve Operating Points
classification rule: Pr(y = 1 | x) > t
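Sweeping the threshold t traces out the curve, one operating point per threshold; the scores and labels below are hypothetical:

import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2])   # Pr(y = 1 | x)
labels = np.array([1, 1, 0, 1, 0, 1, 0])

for t in np.unique(scores):
    pred = scores > t   # one operating point per threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    if tp + fp > 0:
        print(t, tp / (tp + fp), tp / (tp + fn))   # threshold, precision, recall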
Comparing PR Curves
Below we plot precision-recall curves for two algorithms; which is better?
Other Ways to Compare Algorithms
Analysis Take Home Message
1. Measure everything
   • This will save analysis and debugging time later
2. Choice of metrics can have a huge effect on interpretation of results
3. Ask questions of your metrics
Repeatability: Controlling Randomness
• Often you will want to compare different variants of an algorithm
• However, comparing different runs can be difficult if the algorithm is stochastic
• transform the random algorithm A(x) into a deterministic algorithm A′(x, r), where r is a sequence of random numbers
• one way to do this is to use random seeds (np.random.seed(0))
“random chance seems to have operated in our favour”
[Figure: a randomized algorithm maps x → y; the equivalent deterministic algorithm maps (x, r) → y]
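A minimal sketch of the A(x) → A′(x, r) idea, using a seed to stand in for r (noisy_eval is a hypothetical stochastic routine):

import numpy as np

def noisy_eval(x, seed):
    rng = np.random.RandomState(seed)   # the seed makes the randomness part of the input
    return x + rng.randn()

print(noisy_eval(1.0, seed=0) == noisy_eval(1.0, seed=0))   # True: repeatable
print(noisy_eval(1.0, seed=0) == noisy_eval(1.0, seed=1))   # False (almost surely)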
Exploring Features and Meta-parameters
[Figure: a tree of experiments branching out from a baseline model: change regularisation, change label weights, collect more data, add features, run for more iterations]
Feature Selection
Often the number of available features is very large but only a small number of them are relevant. Some recent methods try to learn features directly from data; often, however, we are faced with the task of coming up with a good set of features manually.
• Filter Feature Selection: Use a computationally cheap heuristic to evaluate features, e.g., mutual information between a feature and the class labels.
• Wrapper Feature Selection: Incrementally add the best feature to a feature set (forward feature selection) or remove the worst features from the feature set (backward feature selection).
Example: Forward Feature Selection
• Start with an empty feature set, F = {}
• Repeatedly try each feature i ∉ F, create Fi = F ∪ {i} and use cross-validation to evaluate Fi. Set F to the best Fi.
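A greedy sketch of the loop above; cv_score is a hypothetical function returning cross-validated accuracy for a feature subset:

import numpy as np

def forward_selection(num_features, cv_score, max_size=5):
    F = set()   # start with an empty feature set
    while len(F) < max_size:
        candidates = [F | {i} for i in range(num_features) if i not in F]
        scores = [cv_score(Fi) for Fi in candidates]
        F = candidates[int(np.argmax(scores))]   # keep the best single addition
    return F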
Diagnostics
• Diagnostics are about working out why your algorithm is not giving you the performance you want. What could the problem be?
  • problem statement
  • data
  • features
  • algorithm/model
  • implementation
  • something else
• Take time to set up a good experimental framework for repeated experiments
“give me six hours to cut down a tree and I will spend the first four sharpening my axe”
Visualise Your Data
Diagnostic Example
Suppose that our test error is unacceptably high and we suspect the problem is either that the model is overfitting or the features are not good enough.
Diagnostic:
• The first hypothesis (overfitting) suggests that the training error will be much lower than the test error
• The second hypothesis (features) suggests that the training error and test error will both be high
Learning Curves
[Figure: learning curves plotting error rate against training set size for the training set and the test set]
Learning Curves: Bias vs. Variance
[Figure: two learning-curve panels (error rate vs. training set size, with a target error rate): high variance shows a large gap between training and test error; high bias shows both errors levelling off above the target]
Bias/Variance Trade-off
[Figure: error rate vs. model complexity, with training and test curves and a target error rate; the high-bias regime lies at low complexity and the high-variance regime at high complexity]
Fixes for Bias/Variance Problems
Diagnosing bias and variance problems provides us with hints as to what to try next.
For bias problems:
• try a larger set of features
• try a richer model class

For variance problems:
• try getting more training examples
• try a smaller set of features
Objective/Optimisation Problems
We may suspect that our poor performance is due to either a problem with our optimisation algorithm (e.g., not running for long enough) or a problem with our objective.
Unfortunately it is often very difficult to determine whether an iterative algorithm has converged.
[Figure: error rate vs. iteration count, flattening out: converged?]
Diagnosing Optimisation Problems
Suppose we care about maximising some accuracy measure, perf(θ), and a learning algorithm is trying to minimise a surrogate, loss(θ).
• Let θ* be the parameters returned by our learning algorithm
• Let θ† be any other parameters (e.g., guesses or obtained from a different learning algorithm)

                        perf(θ†) > perf(θ*)    perf(θ†) < perf(θ*)
loss(θ*) < loss(θ†)     wrong objective        no problem (?)
loss(θ*) > loss(θ†)     poor optimisation      poor optimisation (got lucky)

(columns: what we care about, higher is better; rows: what we optimise, lower is better)
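The table expressed as a sketch; loss, perf, and the two parameter settings are hypothetical stand-ins:

def diagnose(loss, perf, theta_star, theta_dagger):
    # theta_star: returned by our learner; theta_dagger: any alternative
    if perf(theta_dagger) > perf(theta_star):
        if loss(theta_star) < loss(theta_dagger):
            return "wrong objective"   # lower loss, yet worse performance
        return "poor optimisation"     # the better solution also has lower loss
    if loss(theta_star) < loss(theta_dagger):
        return "no problem (?)"
    return "poor optimisation (got lucky)"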
Fixes for Optimisation/Objective Problems
Diagnosing optimisation versus objective problems provides us with hints as to what to try next.
For optimisation problems:
• try running for more iterations
• try using a different algorithm (e.g., Newton’s method instead of gradient descent)
• try random restarts (e.g., for non-convex objectives)
• try smoothing
For objective problems:
• try different regularisation
• try weighting training examples
• try a different loss function
• change the model
Approximate Search Algorithms
Suppose we are using an approximate nearest neighbour algorithm to find similar objects. We define a similarity measure that our algorithm can use.
How can we tell if we have a problem with the nearest neighbour algorithm or our similarity measure?
• Let x† be a match found by the algorithm
• Let x* be a hand selected match (ground-truth)
• If similarity(x, x†) < similarity(x, x*) then the problem is with the search algorithm (it failed to find a higher-scoring match that exists)
• Otherwise, initialise the approximate nearest neighbour algorithm with the true solution:
  • If the algorithm moves away from the true solution then the problem is with the measure
  • Otherwise the problem is with the nearest neighbour algorithm
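The procedure as a sketch; similarity and ann_search (with an optional warm start) are hypothetical stand-ins for your components:

def diagnose_nn(x, x_dagger, x_star, similarity, ann_search):
    # x_dagger: match found by the algorithm; x_star: hand-selected ground truth
    if similarity(x, x_dagger) < similarity(x, x_star):
        return "problem with the search algorithm"   # it missed a higher-scoring match
    if ann_search(x, init=x_star) != x_star:
        return "problem with the measure"   # it moves away from the true solution
    return "problem with the nearest neighbour algorithm"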
Diagnostic Summary
• Diagnostics are an important tool when developing your machine learning algorithms
  • We showed examples for bias/variance, optimisation/objective, and search/score, but there are many others
• Diagnostics can save a lot of wasted effort by guiding your choice of what to try next
• They also allow you to develop insights into your particular application and justify your design decisions
• Diagnostics often involve repeated experiments with different parameter settings while keeping everything else fixed
• Another important diagnostic tool is that of error analysis, i.e., understanding where your algorithm is making mistakes
Error Analysis
Error analysis tries to explain the difference between current performance and perfect performance.
• How much error is due to the various machine learning components in the application?
  Plug the ground truth (if available) into each component of the application and see how it affects accuracy. Alternatively, we could add noise to each component and, again, see how it affects accuracy.
• Does the algorithm fail on a particular subclass of examples?
  Visualise the data and results!
Ablative Analysis
Ablative analysis tries to explain the difference between some baseline performance and the current performance.
Example: You’ve been working on your application for the past several months and now have a number of sophisticated features that you pass to a classifier. Which features account for the good performance of your classifier over a baseline classifier with some simple features?
Ablative analysis removes features from the application one at a time and sees which removal results in the biggest decrease in performance, similar to backward feature selection.
• Note that the order of removal matters.
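A sketch of the loop; components and evaluate are hypothetical stand-ins for your feature groups and scoring routine:

def ablative_analysis(components, evaluate):
    full_score = evaluate(components)
    for i, c in enumerate(components):
        reduced = components[:i] + components[i + 1:]   # remove one at a time
        print(c, full_score - evaluate(reduced))        # performance drop without it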
Diagnosing Your Implementation
Whenever you write some code or assemble machine learning components into a pipeline you’ll want to test your implementation.
• run against small synthetic test cases
• see what happens with ground-truth features
• see what happens with random features
• check boundary cases
• re-use known working components
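A tiny synthetic test in this spirit: plant known parameters, generate noise-free data, and check that the implementation recovers them (the least-squares solver below stands in for whatever component you are testing):

import numpy as np

rng = np.random.RandomState(0)
w_true = np.array([2.0, -3.0, 0.5])   # planted ground-truth parameters
X = rng.randn(50, 3)
y = X @ w_true                        # noise-free synthetic labels

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_true), "failed on a trivially easy synthetic case"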
Diagnostics Take Home Messages
1. Visualise your data and learning progress
2. Develop diagnostic tests
3. Use good software development practices
   • Revision control, revision control, revision control
“experimental confirmation of a prediction is merely a measurement; experimental disproving of a prediction is a discovery”