
Instructor’s notes Ch.2 - Supervised Learning

Notation: □ means a pencil-and-paper QUIZ, ► means a coding QUIZ

VI. SVMs with (non-linear) kernels (pp.94-106)

Remember the blobs dataset:

Idea: Apply a transformation to the data that “pulls” the classes apart in a new dimension:

Note: Use visualization code provided by instructor!
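For concreteness, a minimal sketch of the blobs data and the transformation, assuming (as the parabola discussed below suggests) that the new dimension is the square of feature 1:

    # Sketch only; the instructor's visualization code is not reproduced here.
    import numpy as np
    from sklearn.datasets import make_blobs

    X, y = make_blobs(centers=4, random_state=8)
    y = y % 2                                # collapse four blobs into two classes
    X_new = np.hstack([X, X[:, 1:] ** 2])    # append feature1**2 as a third feature
    print(X.shape, X_new.shape)              # (100, 2) (100, 3)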


Now it is possible to separate the classes linearly, with a (hyper)plane:

Note: Use visualization code provided by instructor!

When projected onto the original plane of feature0 and feature1, the boundary is, of course, a parabola:

□ What is the best feature to add in order to make the data linearly separable?

(a) x^2 + y^2 (b) x^2 (c) |x| (d) |y| (e) sqrt(x)


Answer:

□ What function do you think was used for the Z dimension above?

(Not in the textbook) What exactly are the “kernel” and the “kernel trick”?

The kernel is a shortcut (“trick”) used to avoid calculating the new feature or features, as we did above. At the core of finding the support vectors and the optimal separating hyperplane is solving a quadratic optimization problem. The equation of the separating hyperplane in a linear problem has the form:¹
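Following the Hofmann notes cited below, this is presumably the dual-form decision function, in which the training points x_i enter only through dot products:

    $$ f(\mathbf{x}) = \operatorname{sign}\Big( \sum_i \alpha_i\, y_i\, \langle \mathbf{x}_i, \mathbf{x} \rangle + b \Big) $$

where the \alpha_i are the multipliers found by the quadratic optimization (non-zero only for the support vectors) and the y_i are the class labels.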


¹ Martin Hofmann, Support Vector Machines — Kernels and the Kernel Trick, 2006.


If we replace the dot products with a kernel function K(x_i, x_j), the optimization problem can be solved directly in terms of the coordinates x_i. As far as we are concerned, this is done “under the hood”, in the classifier code.
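In symbols, the kernelized decision function then reads:

    $$ f(\mathbf{x}) = \operatorname{sign}\Big( \sum_i \alpha_i\, y_i\, K(\mathbf{x}_i, \mathbf{x}) + b \Big) $$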

Instead of LinearSVC, we now use the more general SVC:
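A minimal sketch of the call (the hyper-parameter values here are illustrative, not prescribed):

    from sklearn.svm import SVC

    # Note: fit on the original 2-feature X; the kernel handles the rest.
    svm = SVC(kernel='rbf', C=10, gamma=0.1)   # C and gamma are explained below
    svm.fit(X, y)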

C is the cost of misclassification, the same regularization hyper-parameter we had in previous classifiers.

□ Does large C mean less or more complexity? Less or more regularization?

Gamma needs some explaining:

RBF = Radial Basis Function, a.k.a. Gaussian kernel:
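In scikit-learn's parametrization:

    $$ K_{\mathrm{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\big( -\gamma\, \lVert \mathbf{x} - \mathbf{x}' \rVert^2 \big) $$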

□ A large gamma means that the RBF has a long-range influence (it is wide) or a short-range influence (it is narrow)?

□ A large gamma means less or more complexity? Less or more regularization?

□ Read the documentation to find out the default value of gamma.


Remember the explanation of support vectors in part 2 of this chapter:

□ Why is the boundary not going in between the support vectors?

A: Because of the regularization parameters C and gamma.


Conclusion: Both C and gamma control complexity in the same direction (larger values → more complexity), but through different mechanisms.

Not in Ch.2 of text (see Ch.5): Grid search to find the best C-gamma combination:²

C = 100, gamma = 0.1
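A sketch of such a search with scikit-learn's GridSearchCV (the value grids are illustrative, and X_train, y_train are assumed to come from a train/test split):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {'C': [0.1, 1, 10, 100, 1000],
                  'gamma': [0.001, 0.01, 0.1, 1, 10]}
    grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_)    # the notes report C=100, gamma=0.1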

² This example was taken from http://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/auto_examples/svm/plot_svm_parameters_selection.html


Applying non-linear SVM to the cancer dataset
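The experiment is presumably along these lines (gamma='auto' was SVC's default when these notes were written; newer scikit-learn versions default to gamma='scale', which largely avoids the problem diagnosed below):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    cancer = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, random_state=0)

    svc = SVC(kernel='rbf', gamma='auto')   # the pre-0.22 default
    svc.fit(X_train, y_train)
    print(svc.score(X_train, y_train))      # score on training data
    print(svc.score(X_test, y_test))        # score on testing data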

□ What is your diagnostic?

A: Massive overfitting. The score is perfect on the training data (the algorithm has memorized it!), but very poor on the testing data (generalization). This is the worst out-of-the-box classifier we’ve had so far, and by a large margin! (All the others were in the > 0.9 range.)

What is going on?


The problem of scale

SVMs are very vulnerable to scale mismatch among features!

□ Do you remember which classification algorithm is at the opposite end of the spectrum, i.e. immune to scale mismatch?

A: Decision trees (DTs), because the decisions are made on one feature at a time, so the features do not “interact”.

We can use matplotlib’s boxplot to visualize the ranges of the features in the cancer dataset. Note that the vertical axis is logarithmic (actually “symlog” - keep reading!), so features can be as far as five orders of magnitude (a factor of 100,000) apart!
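A sketch of the plot (reusing X_train from the cancer split above):

    import matplotlib.pyplot as plt

    plt.boxplot(X_train)      # one box per feature (30 columns)
    plt.yscale('symlog')      # explained in the digression below
    plt.xlabel("Feature index")
    plt.ylabel("Feature magnitude")
    plt.show()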

Let us explain the plot above: statistics digression (not in text)

Box-and-whiskers plots have a “box” between the upper and lower quartiles of the data. The line inside the box shows the median.³


³ Image source: https://en.wikipedia.org/wiki/Box_plot


Definition: The median is the numerical value with the property that 50% of the data points are less than it and 50% greater than it.

If the nr. of points is odd, the median is the point in the middle.

If the nr. of points is even, the median is the mean (average) of the two middle points.

Do not confuse the median with the other two “m” descriptors, the mean (a.k.a. average) or the mode (the most frequent value in the dataset)! This figure shows two distributions of points, one in which the three “m” are close to each other (continuous lines), and one in which they are far apart (dashed lines):⁴

□ Calculate the median, mean, and mode for this set of points: { 1, 2, 2, 3, 4, 7, 9 }

A: median = 3, mean = 4, mode = 2.
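A quick way to check this with Python's statistics module:

    import statistics

    data = [1, 2, 2, 3, 4, 7, 9]
    print(statistics.median(data))   # 3: the middle of the 7 sorted points
    print(statistics.mean(data))     # 4: the sum 28 divided by 7 points
    print(statistics.mode(data))     # 2: the only value that appears twice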

Definition: The quartiles are the three numerical values that have 25%, 50%, and 75% of the data points belowthem.

The first or lower quartile has 25% below and 75% above.

The second quartile is the median.

The third or upper quartile has 75% below and 25% above.

⁴ Image source: https://en.wikipedia.org/wiki/Median


Definition: The inter-quartile range (IQR) is Q3 - Q1.

□ Calculate the median, quartiles, and IQR for this set of points:

The standard whiskers extend to the lowest and highest point in the dataset.

The whiskers in the Pyplot boxplot have a different default behavior: they extend to the most extreme data points within 1.5·IQR of the box, and the points beyond them are considered outliers (a.k.a. fliers):⁵

► Display the plot above:
• with the outliers as red plus signs.
• with the whiskers extending from the smallest to the largest data point.

‘symlog’ means a y axis with a “symmetrical logarithmic” scale. This is like a logarithmic scale, but with two twists, made in order to accommodate very small and even negative values:

• There is a range around zero where the scale is linear instead of logarithmic. This avoids the “explosion” caused by the logarithm when very small numbers need to be represented.

• If present, negative values are also represented (the cancer dataset does not have any).

► Display the plot above with logarithmic (‘log’) instead of ‘symlog’ scale on y.

End of statistics digression

⁵ https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html


Conclusion: Some features in the cancer dataset are 5 orders of magnitude apart!

One way to rescale (a.k.a. normalize, or preprocess) the data is so that each feature ends up between 0 and 1:

► Display the box plot for the rescaled data. Hint: we don’t need a log axis anymore.

Of course, the test data has to be rescaled identically:

Is rescaling any use? Let’s find out:
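A sketch of the full sequence (MinMaxScaler is one way to get the [0, 1] rescaling; the exact scores depend on the scikit-learn version):

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    scaler = MinMaxScaler()                          # rescales each feature to [0, 1]
    X_train_scaled = scaler.fit_transform(X_train)   # min/range learned on TRAIN only
    X_test_scaled = scaler.transform(X_test)         # the SAME transform on the test set

    svc = SVC(kernel='rbf', gamma='auto')
    svc.fit(X_train_scaled, y_train)
    print(svc.score(X_train_scaled, y_train))
    print(svc.score(X_test_scaled, y_test))
    # Both scores should now be high and close together.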


Not in text (from the scikit-learn documentation on SVM complexity):⁷

We see that the nr. of samples (datapoints) is the limiting factor in the SVM algorithm.

The size of the kernel cache has a strong impact on run times for larger problems. If you have enough RAM available, it is recommended to set cache_size to a higher value than the default of 200 (MB), such as 500 (MB) or 1000 (MB).⁸
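In code, this is just a constructor argument:

    from sklearn.svm import SVC

    svc = SVC(kernel='rbf', cache_size=500)   # 500 MB kernel cache instead of 200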

Conclusions on non-linear SVMs:

• They work well with low numbers of features, if complexity is increased (within reason).

• They can deal with large numbers of features, but not with large numbers of data points (stop in the 10,000 range).

• They are sensitive to scale mismatch, so it is highly recommended to rescale our data. For example, rescale each attribute of the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1. The same scaling must be applied to the test vector to obtain meaningful results.

• They are hard to understand and interpret.

• The regularization hyper-parameters C (for all kernels) and gamma (only for the rbf kernel) control the complexity in the same direction, and they should be adjusted together, through grid search.

⁷ https://scikit-learn.org/stable/modules/svm.html#complexity
⁸ https://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use


Solutions:

□ Read the documentation to find out the default value of gamma.

A: At the time of these notes, SVC’s default was gamma='auto', i.e. 1/n_features; since scikit-learn 0.22 the default is gamma='scale', i.e. 1/(n_features · X.var()).

□ Calculate the median, quartiles, and IQR for this set of points:

► Display the plot above:
• with the outliers as red plus signs.
• with the whiskers extending from the smallest to the largest data point.
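A possible solution, using boxplot’s sym and whis arguments (one way among several):

    import matplotlib.pyplot as plt

    # (a) Fliers drawn as red plus signs ('r+' is a matplotlib format string):
    plt.boxplot(X_train, sym='r+')
    plt.yscale('symlog')
    plt.show()

    # (b) Whiskers spanning the full data range: whis given as a percentile
    #     pair (0th to 100th), so no point is left over as a flier:
    plt.boxplot(X_train, whis=(0, 100))
    plt.yscale('symlog')
    plt.show()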


► Display the box plot for the rescaled data.
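A possible solution, reusing X_train_scaled from the rescaling sketch above:

    plt.boxplot(X_train_scaled)   # all features in [0, 1]: a linear axis suffices
    plt.show()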

► Display the plot above with logarithmic (‘log’) instead of ‘symlog’ scale on y.
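A possible solution (all the cancer features are positive, so a plain log scale works):

    plt.boxplot(X_train)
    plt.yscale('log')    # 'log' instead of 'symlog'
    plt.show()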