Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors

by Dennis DeCoste and Dominic Mazzoni

International Conference on Machine Learning (ICML-03), August 2003

Presented by Despina Kontos

CIS 525 Neural Computation, Spring 2004

Instructor: S. Vucetic

Overview

Introduction: motivation and the main idea.

Background and related work: a little bit about Kernel Machines (KMs) and previous work.

Methodology: the Nearest Support Vectors (NSVs) and some enhancements.

Experiments and results

Discussion

Introduction

Why Kernel Machines? They overcome the “curse of dimensionality” by using kernel functions while exploring large nonlinear feature spaces.

What is the problem? The tradeoff for this power is that a KM's query-time complexity scales linearly with the number of Support Vectors, making KMs often orders of magnitude more expensive at query time than other popular machine learning alternatives.

KM costs are identical for each query, even for “easy” ones that alternatives (e.g. decision trees) can classify much faster than harder ones.

Introduction: So, what would be an ideal approach?

Use a simple linear classifier for the (majority of) queries it is likely to classify correctly.

Pay the full query-time cost of the exact KM only for those queries for which such precision likely matters.

For the rest of the cases, use something in between, with complexity proportional to the difficulty of the query.

A new idea! One can often achieve the same classification as the exact KM by using only a small fraction of the nearest support vectors (SVs) of a query.

Approximate the exact KM with a k nearest-neighbor (k-NN) KM, whose output sums only over the (weighted) kernel values involving the k nearest (according to some distance) selected SVs.

Background: Kernel Machines summary

A binary SVM classifier is trained by optimizing an n-by-1 weighting vector to satisfy the Quadratic Programming (QP) dual form:
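In standard soft-margin SVM notation (assumed here, not taken from the slide: weighting vector $\alpha$, labels $y_i \in \{-1,+1\}$, training examples $X_i$, regularization constant $C$, kernel $K$), the dual form is usually written as:

```latex
\begin{aligned}
\text{minimize over } \alpha:\quad
  & \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(X_i, X_j)
    \;-\; \sum_{i=1}^{n} \alpha_i \\
\text{subject to:}\quad
  & 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 .
\end{aligned}
```

The training examples that end up with $\alpha_i > 0$ are the support vectors.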

The kernel avoids the curse of dimensionality by implicitly projecting any two d-dimensional example vectors into feature-space vectors and returning their dot product:

Popular kernels include:
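Commonly used kernels include the linear, polynomial and Gaussian RBF kernels. The Python sketch below is illustrative only: the function names and the parameters `degree`, `coef0` and `sigma` are assumptions, not taken from the slides. The last function shows, for the degree-2 polynomial kernel without offset in two dimensions, that the kernel value equals a dot product of explicit feature-space vectors.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, degree=2, coef0=1.0):
    return (dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def phi_quadratic(u):
    """Explicit feature map for (u.v)^2 in 2-D: (u1^2, sqrt(2)*u1*u2, u2^2)."""
    return (u[0] ** 2, math.sqrt(2.0) * u[0] * u[1], u[1] ** 2)

u, v = (1.0, 2.0), (3.0, -1.0)
implicit = polynomial_kernel(u, v, degree=2, coef0=0.0)   # kernel in input space
explicit = dot(phi_quadratic(u), phi_quadratic(v))        # dot product in feature space
```

The two values `implicit` and `explicit` agree up to rounding, which is the point of the kernel trick: the feature-space vectors never need to be built.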

The exact KM output f(x) is computed via:
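A common way to write this output is f(x) = Σi βi K(Xi,x) + b, summing over all SVs Xi, with βi = αi yi in the usual SVM notation (assumed here). A minimal Python sketch with an RBF kernel follows; the names `exact_km_output`, `svs`, `betas` and `b` are illustrative.

```python
import math

def rbf_kernel(u, v, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def exact_km_output(x, svs, betas, b, kernel=rbf_kernel):
    """f(x) = sum_i beta_i * K(X_i, x) + b, using every support vector."""
    return sum(beta * kernel(sv, x) for sv, beta in zip(svs, betas)) + b

# Toy model (made-up numbers): one positive SV at the origin, two negative SVs.
svs = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
betas = [1.0, -0.5, -0.5]
f = exact_km_output((0.0, 0.0), svs, betas, b=0.0)
label = 1 if f > 0 else -1
```

Each query costs one kernel evaluation per SV, which is the linear query-time scaling noted in the introduction.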

Some related work

Early methods compressed a KM's SVs into a reduced set in order to reduce query-time costs.

When a small approximation error ρ ≈ 0 can be achieved with nz ≪ n reduced-set vectors, speedups with little loss of classification accuracy have been reported.
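In the usual reduced-set formulation (assumed here from the reduced-set literature, not from this slide), the SV expansion in feature space is replaced by a shorter expansion over $n_z$ vectors $Z_j$ with weights $\gamma_j$, and $\rho$ measures how far apart the two are. Some presentations use the norm itself rather than its square:

```latex
\Psi = \sum_{i=1}^{n} \beta_i \,\Phi(X_i), \qquad
\Psi' = \sum_{j=1}^{n_z} \gamma_j \,\Phi(Z_j), \qquad
\rho = \lVert \Psi - \Psi' \rVert^{2} .
```

Query cost then scales with $n_z$ instead of $n$.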

Problem: all such reduced-set approaches provide no guarantees or control over how much classification error the approximation might introduce.

Methodology

The intuition behind the new idea:

Order the SVs for each query using a distance metric and use the k nearest-neighboring (w.r.t. the query sample) SVs. The largest terms tend to get added first.

During incremental computation of the KM output, once the partial output leans “strongly enough” either positively or negatively, it will not be able to completely change sign as the remaining βi K(Xi,x) terms are added.

Small k nearest-neighbor classifiers can often classify well, but the best k will vary from query to query.

Methodology: Nearest Support Vectors (NSV)

Let NSV's distance-like scoring be defined as:

The βi K(Xi,x) terms corresponding to the NNscore-ordered SVs tend to follow a steady progression, such that soon the remaining terms become too small to overcome any strong leanings.

Methodology: The main algorithm
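The incremental loop described on the surrounding slides can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' listing: the SVs are ordered by decreasing |βi K(Xi,x)| as a stand-in for the NNscore ordering, `low[k]` and `high[k]` hold the thresholds Lk and Hk after k+1 terms, and all names are illustrative.

```python
import math

def rbf_kernel(u, v, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def nsv_classify(x, svs, betas, b, low, high, kernel=rbf_kernel):
    """Return (label, number_of_terms_used) for query x."""
    # Stand-in ordering (assumption): largest |beta_i * K(X_i, x)| first.
    # Sorting this way evaluates every kernel, so the sketch shows the
    # stopping rule only, not the speedup; a cheap ordering is needed in practice.
    terms = sorted((beta * kernel(sv, x) for sv, beta in zip(svs, betas)),
                   key=abs, reverse=True)
    g = b  # partial output
    for k, term in enumerate(terms):
        g += term
        if g < low[k]:       # leans negative more than any positive sample did
            return -1, k + 1
        if g > high[k]:      # leans positive more than any negative sample did
            return 1, k + 1
    # All terms added: g is now the exact KM output f(x).
    return (1 if g > 0 else -1), len(terms)

svs = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
betas = [1.0, -0.5, -0.5]
early = nsv_classify((0.0, 0.0), svs, betas, 0.0,
                     low=[-0.2, -0.1, 0.0], high=[0.2, 0.1, 0.0])
inf = float("inf")
full = nsv_classify((0.0, 0.0), svs, betas, 0.0,
                    low=[-inf] * 3, high=[inf] * 3)
```

With the finite thresholds the query is decided after a single term; with infinite thresholds the loop degenerates to the exact KM and uses all three terms.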

Methodology: Statistical thresholds for NSV

Derive thresholds Lk and Hk by running the algorithm over a large representative sample of pre-query data. Here gk(x) denotes the partial KM output after the first k terms.

Compute Lk as the minimum value of gk(x) over all x such that gk(x) < 0 and f(x) > 0. This identifies Lk as the worst-case wrong-way leaning of any sample that the exact KM classifies as positive. Similarly, Hk is assigned the maximum gk(x) such that gk(x) > 0 and f(x) < 0.
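The threshold rule above can be sketched in Python as follows. The input layout (`partial_outputs[s][k]` for gk of sample s, `exact_outputs[s]` for f) and the default of 0.0 for steps where no wrong-way leaning was observed are assumptions made for this illustration.

```python
def compute_thresholds(partial_outputs, exact_outputs):
    """Return (low, high): per-step thresholds L_k and H_k.

    partial_outputs[s][k] is the partial output of sample s at step k,
    exact_outputs[s] is the exact KM output f of sample s.
    """
    n_steps = len(partial_outputs[0])
    low = [0.0] * n_steps    # assumption: 0.0 when no wrong-way leaning is seen
    high = [0.0] * n_steps
    for g_row, f in zip(partial_outputs, exact_outputs):
        for k, g in enumerate(g_row):
            if f > 0 and g < 0:        # positive sample leaning negative
                low[k] = min(low[k], g)
            elif f < 0 and g > 0:      # negative sample leaning positive
                high[k] = max(high[k], g)
    return low, high

# Three samples, three steps (made-up numbers).
partial = [[-0.3, 0.1, 0.5],
           [0.2, -0.1, -0.4],
           [-0.1, -0.2, 0.3]]
exact = [0.5, -0.4, 0.3]
low, high = compute_thresholds(partial, exact)
```

On this toy sample, any query whose first-step partial output falls below -0.3 or above 0.2 could be classified immediately.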

In practice, the test and training data distributions will not be identical. We can replace each Hk (Lk) with the maximum (minimum) of all threshold values over adjacent steps k-w through k+w (a variation using a window w).
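The window variation can be sketched in Python as below. The function name `widen_thresholds` and the choice to clip the window at the first and last step are assumptions for this illustration.

```python
def widen_thresholds(low, high, w):
    """Replace each L_k / H_k by the min / max over steps k-w .. k+w."""
    n = len(low)
    wide_low, wide_high = [], []
    for k in range(n):
        start = max(0, k - w)        # clip the window at the first step
        stop = min(n, k + w + 1)     # and at the last step
        wide_low.append(min(low[start:stop]))
        wide_high.append(max(high[start:stop]))
    return wide_low, wide_high

low = [-0.3, -0.2, 0.0, -0.1]
high = [0.2, 0.0, 0.4, 0.0]
wide_low, wide_high = widen_thresholds(low, high, w=1)
```

A larger w gives more conservative thresholds: fewer queries stop early, but an early stop is less likely to disagree with the exact KM.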

Methodology

Problem: sorting NSVs by NNscorei(x) leads to relatively wide and skewed thresholds whenever there is an imbalance in the number of positive SVs versus negative SVs.

Solution: adjust the NNscore-based ordering so that the cumulative sums of the positive β and the negative β at each step k are as equal as possible.
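One simple way to approximate such a balanced ordering is a greedy interleave: keep the NNscore order within the positive and within the negative SVs, and at each step take the next SV from whichever side currently has the smaller cumulative |β|. This heuristic and the name `balanced_order` are assumptions for illustration; the slides do not spell out the exact procedure.

```python
from collections import deque

def balanced_order(ordered_betas):
    """ordered_betas: beta values already sorted by NNscore (nearest first).

    Returns indices into ordered_betas, re-ordered so that the running sums
    of positive betas and of |negative betas| stay close to each other.
    """
    pos = deque(i for i, beta in enumerate(ordered_betas) if beta > 0)
    neg = deque(i for i, beta in enumerate(ordered_betas) if beta <= 0)
    pos_sum = neg_sum = 0.0
    order = []
    while pos or neg:
        # Take a positive SV if the negatives are exhausted or the positive
        # running sum is not ahead; otherwise take a negative SV.
        take_pos = bool(pos) and (not neg or pos_sum <= neg_sum)
        if take_pos:
            i = pos.popleft()
            pos_sum += ordered_betas[i]
        else:
            i = neg.popleft()
            neg_sum += -ordered_betas[i]
        order.append(i)
    return order

# Four positive SVs and two negative SVs, in NNscore order (made-up numbers).
order = balanced_order([1.0, 1.0, 1.0, -1.0, 1.0, -1.0])
```

Here the two negative SVs are pulled forward, so the first four steps alternate between the two classes before the surplus positive SVs are added.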

Problem: a full linear scan to find the k nearest neighbors can be very computationally expensive, even when using indexing techniques.

Solution: perform pre-query principal component analysis (PCA) on the matrix of SVs. Use the resulting small k-dimensional vectors to approximate kernels and to order NSVs for each query Q as needed.
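The PCA idea can be illustrated with a small pure-Python sketch. For brevity it keeps only the first principal direction, found by power iteration, as a 1-D stand-in for the k-dimensional vectors; the function names and the toy data are assumptions for this illustration.

```python
import math

def mean_center(rows):
    d = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    return mean, [[r[j] - mean[j] for j in range(d)] for r in rows]

def top_component(centered, iters=200):
    """Leading principal direction via power iteration on X^T X."""
    d = len(centered[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]       # X v
        w = [sum(p * r[j] for p, r in zip(proj, centered)) for j in range(d)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def project(x, mean, v):
    return sum((x[j] - mean[j]) * v[j] for j in range(len(x)))

# Pre-query: project every SV once onto the principal direction.
svs = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 2.0, 0.0), (3.0, 3.0, 0.0)]
mean, centered = mean_center(svs)
v = top_component(centered)
sv_coords = [project(sv, mean, v) for sv in svs]

# Query time: project the query and order SVs by distance in the reduced space.
query = (0.9, 1.2, 5.0)
q = project(query, mean, v)
order = sorted(range(len(svs)), key=lambda i: abs(sv_coords[i] - q))
```

In this toy case the ordering in the reduced space matches the ordering by full Euclidean distance, while each distance needs one subtraction instead of a pass over all input dimensions.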

Methodology: Some enhancements

Use a linear SVM as an initial filter. Compute the threshold bounds as before, except using the linear SVM’s output for the first step of the computation.
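The filtering step can be sketched in Python as follows. It covers only the filter itself: the names `linear_filter`, `w`, `b_lin`, `low0` and `high0` are illustrative, and how the linear output is combined with the later NSV steps is not spelled out on the slide.

```python
def linear_filter(x, w, b_lin, low0, high0):
    """Return -1 or +1 if the linear SVM output is decisive, else None.

    low0 / high0 are the first-step thresholds, assumed to be derived from
    the representative sample using the linear SVM's output.
    """
    g0 = sum(wi * xi for wi, xi in zip(w, x)) + b_lin
    if g0 < low0:
        return -1
    if g0 > high0:
        return 1
    return None  # not decisive: continue with the incremental NSV steps

w, b_lin = (1.0, 1.0), 0.0
easy_pos = linear_filter((2.0, 0.0), w, b_lin, low0=-0.5, high0=0.5)
easy_neg = linear_filter((-1.0, 0.0), w, b_lin, low0=-0.5, high0=0.5)
hard = linear_filter((0.1, 0.1), w, b_lin, low0=-0.5, high0=0.5)
```

Queries far from the linear decision boundary are answered with a single dot product; only the ambiguous ones go on to kernel evaluations.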

Generate additional “difficult” data in order to obtain better threshold levels from the representative sample.

Experiments and results

Data: MNIST dataset (digit recognition), with large input dimensionality and a large number of SVs.

Results: speedup advantage compared to accuracy loss.

Conclusions

A new Kernel Machine that applies a k nearest-neighbor approach at query time to improve performance.

The approach is applicable to any form of Kernel Machine classifier, regardless of the way it is trained.

Some exciting speedup results are reported without significant loss in accuracy.

Future work aims at combining the machine learning methods of kernels, nearest neighbors and decision trees.

Any questions???

.....THANK YOU!!!!
