Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors
by Dennis DeCoste and Dominic Mazzoni
International Conference on Machine Learning (ICML-03), August 2003
Presented by Despina Kontos
CIS 525 Neural Computation, Spring 2004
Instructor: S. Vucetic
Overview
Introduction: motivation and the main idea.
Background and related work: a little bit about Kernel Machines (KMs) and previous work.
Methodology: the Nearest Support Vectors (NSVs); some enhancements.
Experiments and results
Discussion
Introduction
Why Kernel Machines? They overcome the “curse of dimensionality” by using kernel functions while exploring large nonlinear feature spaces.
What is the problem? The tradeoff for this power is that a KM's query-time complexity scales linearly with the number of support vectors (SVs), often making KMs orders of magnitude more expensive at query time than other popular machine learning alternatives.
KM costs are identical for every query, even “easy” ones. Alternatives (e.g. decision trees) can classify easy queries much faster than hard ones.
Introduction
So, what would be an ideal approach?
Use a simple linear classifier for the (majority of) queries it is likely to classify correctly.
Incur the query-time cost of the exact KM only for those queries for which such precision likely matters.
For the remaining cases, use something in between, with complexity proportional to the difficulty of the query.
A new idea! One can often achieve the same classification as the exact KM by using only a small fraction of the nearest support vectors (SVs) of a query.
Approximate the exact KM with a k-nearest-neighbor (k-NN) KM, whose output sums only over the (weighted) kernel values involving the k nearest (according to some distance) selected SVs.
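A minimal Python sketch of the k-NN KM idea, for illustration only. It assumes an RBF kernel, squared Euclidean distance as the "some distance", and random toy data; the names rbf_kernel, exact_km, knn_km and gamma are hypothetical and do not come from the paper.

```python
import numpy as np

def rbf_kernel(X, x, gamma=0.5):
    """RBF kernel values K(X_i, x) for every row X_i of X (assumed kernel)."""
    return np.exp(-gamma * np.sum((X - x) ** 2, axis=1))

def exact_km(X_sv, beta, b, x, gamma=0.5):
    """Exact KM output: weighted kernel sum over ALL support vectors."""
    return float(beta @ rbf_kernel(X_sv, x, gamma) + b)

def knn_km(X_sv, beta, b, x, k, gamma=0.5):
    """Approximate KM output: same sum, restricted to the k nearest SVs."""
    dist = np.sum((X_sv - x) ** 2, axis=1)   # assumed distance: squared Euclidean
    nearest = np.argsort(dist)[:k]           # indices of the k nearest SVs
    return float(beta[nearest] @ rbf_kernel(X_sv[nearest], x, gamma) + b)

# Toy data (random, purely illustrative).
rng = np.random.default_rng(0)
X_sv = rng.normal(size=(200, 5))   # 200 support vectors in 5 dimensions
beta = rng.normal(size=200)        # signed SV weights
b = 0.1
x = rng.normal(size=5)             # one query

f_exact = exact_km(X_sv, beta, b, x)
f_full = knn_km(X_sv, beta, b, x, k=200)   # k = n recovers the exact output
f_approx = knn_km(X_sv, beta, b, x, k=20)  # uses only 10% of the SVs
```

With k equal to the number of SVs the approximation coincides with the exact KM; smaller k trades accuracy for fewer kernel evaluations.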
Background
Kernel Machines Summary
A binary SVM classifier is trained by optimizing an n-by-1 weighting vector to satisfy the Quadratic Programming (QP) dual form:
The kernel avoids the curse of dimensionality by implicitly projecting any two d-dimensional example vectors into feature-space vectors and returning their dot product:
Popular kernels include:
The exact KM output f(x) is computed via:
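For reference, the standard textbook forms of the four expressions named on this slide can be sketched as follows. This is an assumption about what the slide showed, not a recovered copy: the symbols α, y, C, Φ, b, p and σ are introduced here, with β_i = α_i y_i chosen to match the βi K(Xi, x) terms used on the later slides.

```latex
\begin{align*}
\text{QP dual:}\quad
  & \min_{\alpha}\ \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
    \alpha_i \alpha_j y_i y_j K(X_i, X_j) - \sum_{i=1}^{n}\alpha_i
    \quad \text{s.t. } 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{n}\alpha_i y_i = 0 \\
\text{Kernel:}\quad
  & K(u, v) = \Phi(u)\cdot\Phi(v) \\
\text{Popular kernels:}\quad
  & K(u, v) = u\cdot v, \qquad
    K(u, v) = (u\cdot v + 1)^{p}, \qquad
    K(u, v) = \exp\!\left(-\frac{\lVert u - v\rVert^{2}}{2\sigma^{2}}\right) \\
\text{Exact output:}\quad
  & f(x) = \sum_{i=1}^{n}\beta_i K(X_i, x) + b, \qquad \beta_i = \alpha_i y_i
\end{align*}
```

Only examples with nonzero α_i (the SVs) contribute to f(x), which is why query cost grows with the number of SVs.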
Some related work
Early methods compressed a KM's SVs into a reduced set in order to reduce query-time costs.
When a small ρ ≈ 0 can be achieved with nz ≪ n, speedups with little loss of classification accuracy have been reported.
Problem: A key problem with all such reduced-set approaches is that they provide no guarantees or control over how much classification error such approximations might introduce.
Methodology
The intuition behind the NEW idea:
Order the SVs for each query using a distance metric and use the k nearest-neighboring (w.r.t. the query sample) SVs. The largest terms tend to get added first.
During incremental computation of the KM, once the partial KM output leans “strongly enough” either positively or negatively, it will not be able to completely change sign as the remaining βi K(Xi, x) terms are added.
Small-k nearest-neighbor classifiers can often classify well, but the best k varies from query to query.
Methodology
Nearest Support Vectors (NSV)
Let NSV's distance-like scoring be defined as:
The βi K(Xi, x) terms corresponding to the NNscore-ordered SVs tend to follow a steady progression, such that the remaining terms soon become too small to overcome any strong leanings.
Methodology
The main algorithm:
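A minimal Python sketch of the incremental idea, under stated assumptions rather than the authors' exact algorithm: SVs are ordered by squared Euclidean distance as a stand-in for NNscore, the kernel is RBF, and the per-step thresholds L and H are supplied as arrays. The names nsv_classify, rbf and gamma are hypothetical.

```python
import numpy as np

def rbf(u, v, gamma=0.5):
    """RBF kernel between two vectors (assumed kernel choice)."""
    return float(np.exp(-gamma * np.sum((u - v) ** 2)))

def nsv_classify(X_sv, beta, b, x, L, H, gamma=0.5):
    """Add beta_i * K(X_i, x) terms nearest-first and stop as soon as the
    partial output g_k falls outside [L[k-1], H[k-1]].
    Returns (label, number of kernel evaluations used)."""
    # Assumption: plain Euclidean ordering stands in for the NNscore ordering.
    order = np.argsort(np.sum((X_sv - x) ** 2, axis=1))
    g = b
    for k, i in enumerate(order, start=1):
        g += beta[i] * rbf(X_sv[i], x, gamma)
        if g < L[k - 1]:      # leans negative beyond any seen wrong-way leaning
            return -1, k
        if g > H[k - 1]:      # leans positive beyond any seen wrong-way leaning
            return +1, k
    return (1 if g > 0 else -1), len(order)   # fell through: g equals exact f(x)

# Toy data (random, purely illustrative).
rng = np.random.default_rng(1)
X_sv = rng.normal(size=(50, 3))
beta = rng.normal(size=50)
b = 0.0
x = rng.normal(size=3)
n = len(beta)

f_exact = b + sum(beta[i] * rbf(X_sv[i], x) for i in range(n))
# Infinite thresholds never trigger: all n terms are used (exact KM).
label_full, used_full = nsv_classify(X_sv, beta, b, x,
                                     np.full(n, -np.inf), np.full(n, np.inf))
# Zero thresholds trigger immediately: decision after one term.
label_fast, used_fast = nsv_classify(X_sv, beta, b, x, np.zeros(n), np.zeros(n))
```

Note that this sketch still scans all SVs to compute the ordering; it only illustrates how the kernel sum can stop early once the thresholds are crossed.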
Methodology
Statistical thresholds for NSV
Derive thresholds Lk and Hk by running the algorithm over a large representative sample of pre-query data.
Compute Lk as the minimum value of gk(x), the partial KM output after k steps, over all x such that gk(x) < 0 and f(x) > 0. This identifies Lk as the worst-case wrong-way leaning of any sample that the exact KM classifies as positive. Similarly, Hk is assigned the maximum gk(x) such that gk(x) > 0 and f(x) < 0.
In practice, the test and training data distributions will not be identical. To allow for this, replace each Hk (Lk) with the maximum (minimum) of all threshold values over adjacent steps k-w through k+w (a variation using a window w).
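The threshold rules above can be sketched in Python as follows. This is an illustrative sketch: the function names derive_thresholds and widen, the matrix layout G[j, k], and the default of 0 for steps where no sample leans the wrong way are all assumptions, not details taken from the paper.

```python
import numpy as np

def derive_thresholds(G, f):
    """G[j, k] holds the partial output g_{k+1}(x_j) for sample j;
    f[j] holds the exact output f(x_j). Returns per-step arrays (L, H)."""
    pos = (f > 0)[:, None]
    neg = (f < 0)[:, None]
    # L_k: most negative partial output among samples the exact KM calls positive.
    # Assumption: 0 when no sample leans the wrong way at that step.
    L = np.where(pos & (G < 0), G, 0.0).min(axis=0)
    # H_k: most positive partial output among samples the exact KM calls negative.
    H = np.where(neg & (G > 0), G, 0.0).max(axis=0)
    return L, H

def widen(L, H, w):
    """Replace each L_k (H_k) by the min (max) over steps k-w .. k+w."""
    n = len(L)
    Lw = np.array([L[max(0, k - w):k + w + 1].min() for k in range(n)])
    Hw = np.array([H[max(0, k - w):k + w + 1].max() for k in range(n)])
    return Lw, Hw

# Three samples, three steps; the last column equals the exact output f.
G = np.array([[-0.5,  0.2,  1.0],
              [ 0.3, -0.1, -0.8],
              [-0.2, -0.4,  0.5]])
f = G[:, -1]

L, H = derive_thresholds(G, f)   # L = [-0.5, -0.4, 0.0], H = [0.3, 0.0, 0.0]
Lw, Hw = widen(L, H, w=1)        # Lw = [-0.5, -0.5, -0.4], Hw = [0.3, 0.3, 0.0]
```

Widening can only make the thresholds more conservative, so it trades some speedup for robustness to distribution shift.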
Methodology
Sorting NSVs by NNscorei(x) leads to relatively wide and skewed thresholds whenever there is an imbalance between the number of positive SVs and the number of negative SVs.
Adjust the NNscore-based ordering so that the cumulative sums of the positive β and the negative β at each step k are as equal as possible.
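One way to picture this adjustment is a greedy interleaving, sketched below in Python. This is an assumed heuristic for illustration, not necessarily the adjustment used in the paper; the name balanced_order and the tie-breaking rule are hypothetical.

```python
import numpy as np

def balanced_order(scores, beta):
    """Greedy reordering: walk the positive and negative SVs in score order,
    each time taking the next SV from whichever sign currently has the
    smaller cumulative |beta| (ties go to the positive side)."""
    by_score = np.argsort(scores)                 # smaller score = nearer
    pos = [i for i in by_score if beta[i] > 0]
    neg = [i for i in by_score if beta[i] <= 0]
    order, sum_pos, sum_neg = [], 0.0, 0.0
    p = q = 0
    while p < len(pos) or q < len(neg):
        take_pos = q >= len(neg) or (p < len(pos) and sum_pos <= sum_neg)
        if take_pos:
            i = pos[p]
            p += 1
            sum_pos += beta[i]
        else:
            i = neg[q]
            q += 1
            sum_neg += -beta[i]
        order.append(int(i))
    return order

scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
beta = np.array([1.0, 1.0, 1.0, -1.0, -2.0])   # three positive SVs, two negative
order = balanced_order(scores, beta)            # [0, 3, 1, 4, 2]
```

In the toy example, pure score ordering would add all three positive SVs first; the balanced ordering alternates signs so the partial sums stay less skewed.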
Searching for the k nearest neighbors (in the worst case a full linear scan) can be very computationally expensive, even when using indexing techniques.
Perform pre-query principal component analysis (PCA) on the matrix of SVs. Use the resulting small k-dimensional vectors to approximate kernels and to order NSVs for each query Q as needed.
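A minimal Python sketch of the PCA step, assuming PCA is computed via an SVD of the centered SV matrix and that ordering uses squared Euclidean distance in the projected space. The names fit_pca and project, the choice k=5, and the toy data are hypothetical.

```python
import numpy as np

def fit_pca(X_sv, k):
    """Pre-query: mean and top-k principal directions of the SV matrix."""
    mean = X_sv.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_sv - mean, full_matrices=False)
    return mean, Vt[:k]            # rows of Vt are orthonormal directions

def project(X, mean, components):
    """Map d-dimensional vectors to k-dimensional PCA coordinates."""
    return (X - mean) @ components.T

# Toy data whose variance is concentrated in the first few dimensions.
rng = np.random.default_rng(2)
scale = np.linspace(3.0, 0.1, 20)
X_sv = rng.normal(size=(300, 20)) * scale
x = rng.normal(size=20) * scale

mean, comps = fit_pca(X_sv, k=5)
Z_sv = project(X_sv, mean, comps)          # done once, before any query
z = project(x, mean, comps)                # done per query

approx_d2 = np.sum((Z_sv - z) ** 2, axis=1)   # cheap 5-dimensional distances
exact_d2 = np.sum((X_sv - x) ** 2, axis=1)    # full 20-dimensional distances
approx_order = np.argsort(approx_d2)          # approximate nearest-SV ordering
```

Because the projection is onto orthonormal directions, the projected distances never exceed the full distances, so they serve as cheap lower bounds when ordering SVs.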
Methodology
Some enhancements
Use a linear SVM as an initial filter. Compute the threshold bounds as before, except using the linear SVM’s output for the first step of the computation.
Generate additional “difficult” data in order to obtain better threshold levels from the representative sample.
Experiments and results
Data: MNIST dataset (digit recognition)
Large input dimensionality
Large number of SVs
Experiments and results
Speedup advantage compared to accuracy loss
Conclusions
A new Kernel Machine method that applies a k-nearest-neighbor approach at query time to improve performance.
The approach is applicable to any form of Kernel Machine classifier, regardless of the way it is trained.
Some exciting speedup results are reported without significant loss in accuracy.
Future work aims at combining the machine learning methods of kernels, nearest neighbors and decision trees.
Any questions?
Thank you!