Nonlinear Knowledge in Kernel Machines
Olvi Mangasarian, UW Madison & UCSD La Jolla
Edward Wild, UW Madison
Data Mining and Mathematical Programming Workshop, Centre de Recherches Mathématiques
Université de Montréal, Québec, October 10-13, 2006
Objectives
Primary objective: incorporate prior knowledge over completely arbitrary sets into function approximation and classification, without transforming (kernelizing) the knowledge
Secondary objective: achieve transparency of the prior knowledge for practical applications
Graphical Example of Knowledge Incorporation
[Figure: knowledge regions defined by g(x) ≤ 0, h1(x) ≤ 0, and h2(x) ≤ 0, together with + points and the nonlinear surface K(x′, B′)u = γ. A similar approach applies to approximation.]
Outline
Kernels in classification and function approximation
Incorporation of prior knowledge
  Previous approaches: require transformation of the knowledge
  New approach: requires no transformation of the knowledge
  • Knowledge given over completely arbitrary sets
Fundamental tool used: theorem of the alternative for convex functions
Experimental results
  Four synthetic examples and two examples related to breast cancer prognosis
  Classifiers and approximations with prior knowledge are more accurate than those without prior knowledge
Theorem of the Alternative for Convex Functions: Generalization of the Farkas Theorem
Farkas: For A ∈ R^{k×n}, b ∈ R^n, exactly one of the following must hold:
I. Ax ≤ 0, b′x > 0 has a solution x ∈ R^n, or
II. A′v = b, v ≥ 0 has a solution v ∈ R^k.
To get II of Farkas: v ≥ 0, v′Ax − b′x ≥ 0 ∀x ∈ R^n ⇔ v ≥ 0, A′v − b = 0.
Convex generalization: Let h: Γ → R^k, θ: Γ → R, with Γ ⊂ R^n a convex set, h and θ convex on Γ, and h(x) < 0 for some x ∈ Γ. Then exactly one of the following must hold:
I. h(x) ≤ 0, θ(x) < 0 has a solution x ∈ Γ, or
II. ∃v ∈ R^k, v ≥ 0: v′h(x) + θ(x) ≥ 0 ∀x ∈ Γ.
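The Farkas alternative can be checked numerically on small instances: system I is solvable iff the homogeneous program max{b′x : Ax ≤ 0, ‖x‖∞ ≤ 1} has a positive value, and system II is a plain linear feasibility problem. A sketch with `scipy.optimize.linprog` (the helper name and the box-constraint trick are illustrative, not part of the theorem):

```python
import numpy as np
from scipy.optimize import linprog

def farkas_alternative(A, b):
    """Return (I_holds, II_holds) for the Farkas pair:
    I:  Ax <= 0, b'x > 0 solvable;   II: A'v = b, v >= 0 solvable."""
    k, n = A.shape
    # I: maximize b'x over Ax <= 0 with a box |x| <= 1; the system is
    # homogeneous, so a positive boxed optimum certifies solvability.
    res1 = linprog(-b, A_ub=A, b_ub=np.zeros(k),
                   bounds=[(-1, 1)] * n, method="highs")
    I_holds = res1.success and -res1.fun > 1e-9
    # II: feasibility of A'v = b, v >= 0 (zero objective).
    res2 = linprog(np.zeros(k), A_eq=A.T, b_eq=b,
                   bounds=[(0, None)] * k, method="highs")
    II_holds = res2.status == 0
    return I_holds, II_holds
```

Exactly one of the two flags comes back true, matching the theorem.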
Classification and Function Approximation
Given a set of m points in n-dimensional real space R^n with corresponding labels:
  labels in {+1, −1} for classification problems
  labels in R for approximation problems
Points are represented by the rows of a matrix A ∈ R^{m×n}
Corresponding labels or function values are given by a vector y:
  classification: y ∈ {+1, −1}^m
  approximation: y ∈ R^m
Find a function f with f(A_i) ≈ y_i based on the given data points A_i:
  f: R^n → {+1, −1} for classification
  f: R^n → R for approximation
Classification and Function Approximation
Problem: using only the given data may result in a poor classifier or approximation
  points may be noisy
  sampling may be costly
Solution: use prior knowledge to improve the classifier or approximation
Adding Prior Knowledge
Standard approximation and classification: fit a function at the given data points, without knowledge
Constrained approximation: satisfy inequalities at the given points
Previous approaches (FMS 2001, FMS 2003, MSW 2004): satisfy linear inequalities over polyhedral regions
Proposed new approach: satisfy nonlinear inequalities over arbitrary regions, without kernelizing (transforming) the knowledge
Kernel Machines
Approximate f by a nonlinear kernel function K, using parameters u ∈ R^k and γ ∈ R
A kernel function is a nonlinear generalization of the scalar product
f(x) ≈ K(x′, B′)u − γ, x ∈ R^n, K: R^n × R^{n×k} → R^k
Gaussian kernel: K(x′, B′)_i = exp(−μ‖x − B_i‖²), i = 1, …, k
B ∈ R^{k×n} is a basis matrix
  usually B = A ∈ R^{m×n}, the input data matrix
  in Reduced Support Vector Machines, B is a small subset of the rows of A
  B may be any matrix with n columns
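The Gaussian kernel above can be computed for a whole data matrix at once. A minimal NumPy sketch (the function name and the sample width μ = 0.1 are illustrative, not from the talk):

```python
import numpy as np

def gaussian_kernel(X, B, mu=0.1):
    """Gaussian kernel matrix: K[i, j] = exp(-mu * ||X_i - B_j||^2).

    X is (m, n) with data points as rows; B is (k, n), the basis matrix
    (usually B = A, or a subset of its rows as in Reduced SVMs).
    """
    # ||x - b||^2 expanded as ||x||^2 - 2 x.b + ||b||^2, computed in bulk
    sq = (X ** 2).sum(axis=1)[:, None] - 2.0 * X @ B.T + (B ** 2).sum(axis=1)[None, :]
    return np.exp(-mu * sq)
```

With B = A this yields the m × m kernel matrix K(A, A′) that appears in the classification and approximation linear programs.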
Kernel Machines
Introduce a slack variable s to measure the error in classification or approximation
Error s in kernel approximation of the given data: −s ≤ K(A, B′)u − eγ − y ≤ s, where e is a vector of ones in R^m
Function approximation: f(x) ≈ K(x′, B′)u − γ
Error s in kernel classification of the given data:
  K(A+, B′)u − eγ + s+ ≥ e, s+ ≥ 0
  K(A−, B′)u − eγ − s− ≤ −e, s− ≥ 0
More succinctly, let D = diag(y), the m×m matrix with the diagonal y of ±1's; then: D(K(A, B′)u − eγ) + s ≥ e, s ≥ 0
Classifier: f(x) = sign(K(x′, B′)u − γ)
Trade off solution complexity (‖u‖₁) against data fitting (‖s‖₁); at the solution, e′a = ‖u‖₁ and e′s = ‖s‖₁, where a is a vector bounding the absolute values of u (−a ≤ u ≤ a)
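The 1-norm trade-off above is a linear program. A minimal sketch with `scipy.optimize.linprog`, using variables (u, γ, a, s) with −a ≤ u ≤ a so that e′a = ‖u‖₁ at the solution (the function name and the sample values ν = 10, μ = 1 are illustrative choices, not the talk's):

```python
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(X, B, mu):
    # K[i, j] = exp(-mu * ||X_i - B_j||^2)
    sq = (X ** 2).sum(axis=1)[:, None] - 2.0 * X @ B.T + (B ** 2).sum(axis=1)[None, :]
    return np.exp(-mu * sq)

def train_lp_classifier(A, y, mu=1.0, nu=10.0):
    """1-norm kernel classifier with B = A:
    min e'a + nu e's  s.t.  D(K(A, B')u - e*gamma) + s >= e, -a <= u <= a, s >= 0."""
    m = A.shape[0]
    K = gaussian_kernel(A, A, mu)
    k = K.shape[1]
    D = np.diag(y.astype(float))
    # variable order: z = [u (k), gamma (1), a (k), s (m)]
    c = np.concatenate([np.zeros(k), [0.0], np.ones(k), nu * np.ones(m)])
    rows = np.vstack([
        # D(Ku - e*gamma) + s >= e   <=>   -DK u + y*gamma - s <= -e
        np.hstack([-D @ K, y.reshape(-1, 1).astype(float),
                   np.zeros((m, k)), -np.eye(m)]),
        # u - a <= 0 and -u - a <= 0 encode |u| <= a, so e'a = ||u||_1 at the solution
        np.hstack([np.eye(k), np.zeros((k, 1)), -np.eye(k), np.zeros((k, m))]),
        np.hstack([-np.eye(k), np.zeros((k, 1)), -np.eye(k), np.zeros((k, m))]),
    ])
    rhs = np.concatenate([-np.ones(m), np.zeros(2 * k)])
    bounds = [(None, None)] * (k + 1) + [(0, None)] * (k + m)
    res = linprog(c, A_ub=rows, b_ub=rhs, bounds=bounds, method="highs")
    return res.x[:k], res.x[k]
```

The resulting classifier is f(x) = sign(K(x′, A′)u − γ); on well-separated toy clusters the LP recovers a slack-free separator.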
Kernel Machines in Approximation OR Classification
Kernel Machines in Approximation
Incorporating Nonlinear Prior Knowledge: Previous Approaches
Linear knowledge implication: Cx ≤ d ⇒ w′x − γ ≥ h′x + β
Need to "kernelize" the knowledge from the input space to the transformed (feature) space of the kernel
Requires the change of variables x = A′t, w = A′u:
  CA′t ≤ d ⇒ u′AA′t − γ ≥ h′A′t + β
  kernelized: K(C, A′)t ≤ d ⇒ u′K(A, A′)t − γ ≥ h′A′t + β
Uses a linear theorem of the alternative in the t space
Loses the connection with the original knowledge
Achieves good numerical results, but is not readily interpretable in the original space
Nonlinear Prior Knowledge in Function Approximation: New Approach
Start with an arbitrary nonlinear knowledge implication:
  g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ h(x), ∀x ∈ Γ ⊂ R^n
g, h are arbitrary functions on Γ: g: Γ → R^k, h: Γ → R
A sufficient condition, linear in (v, u, γ): ∃v ≥ 0: v′g(x) + K(x′, B′)u − γ − h(x) ≥ 0, ∀x ∈ Γ
Assume that g(x), K(x′, B′)u − γ, and −h(x) are convex functions of x, that Γ is convex, and that ∃x ∈ Γ: g(x) < 0. Then exactly one of the following holds:
I. g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0 has a solution x ∈ Γ, or
II. ∃v ∈ R^k, v ≥ 0: K(x′, B′)u − γ − h(x) + v′g(x) ≥ 0, ∀x ∈ Γ.
But never both.
If we can find v ≥ 0 with K(x′, B′)u − γ − h(x) + v′g(x) ≥ 0 ∀x ∈ Γ, then by the above theorem
g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0 has no solution x ∈ Γ, or equivalently:
  g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ h(x), ∀x ∈ Γ
Theorem of the Alternative for Convex Functions
Proof that ¬I ⇒ II (not needed for the present application):
Follows from a fundamental theorem of Fan-Glicksberg-Hoffman for convex functions [1957] and the existence of an x ∈ Γ such that g(x) < 0.
Proof that I ⇒ ¬II (requires no assumptions on g, h, K, or u whatsoever):
Suppose not; that is, there exist x ∈ Γ and v ∈ R^k, v ≥ 0, such that:
  g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0, (I)
  v′g(x) + K(x′, B′)u − γ − h(x) ≥ 0, ∀x ∈ Γ. (II)
Then, since v ≥ 0 and g(x) ≤ 0, we have the contradiction:
  0 ≤ v′g(x) + K(x′, B′)u − γ − h(x) < v′g(x) ≤ 0
Incorporating Prior Knowledge
The condition above is a linear semi-infinite program: a linear program with an infinite number of constraints
Discretize to obtain a finite linear program: g(x_i) ≤ 0 ⇒ K(x_i′, B′)u − γ ≥ h(x_i), i = 1, …, k
Slacks allow the knowledge to be satisfied inexactly; a term is added to the objective function to drive the slacks to zero
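The discretized knowledge constraints can be appended directly to the 1-norm approximation LP. The sketch below uses a toy problem that is illustrative and not from the talk: data y = x observed only on [0, 1], with prior knowledge f(x) ≥ x enforced at five discretization points of [2, 3]; the knowledge slacks t are penalized with weight σ in the objective, and all parameter values are assumptions:

```python
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(X, B, mu):
    sq = (X ** 2).sum(axis=1)[:, None] - 2.0 * X @ B.T + (B ** 2).sum(axis=1)[None, :]
    return np.exp(-mu * sq)

# Toy problem: data y = x only on [0, 1]; knowledge f(x) >= x on [2, 3].
A = np.linspace(0.0, 1.0, 6).reshape(-1, 1); y = A.ravel()
Xd = np.linspace(2.0, 3.0, 5).reshape(-1, 1); hd = Xd.ravel()
B = np.vstack([A, Xd])              # basis: data points plus knowledge points
mu, nu, sigma = 2.0, 5.0, 100.0     # kernel width, fit weight, knowledge weight
K = gaussian_kernel(A, B, mu)
Kd = gaussian_kernel(Xd, B, mu)
m, k, p = len(y), B.shape[0], len(hd)

# variable order: z = [u (k), gamma (1), a (k), s (m), t (p)]
c = np.concatenate([np.zeros(k), [0.0], np.ones(k),
                    nu * np.ones(m), sigma * np.ones(p)])
e_m, e_p, Z = np.ones((m, 1)), np.ones((p, 1)), np.zeros
rows = np.vstack([
    np.hstack([K, -e_m, Z((m, k)), -np.eye(m), Z((m, p))]),   #  Ku - e*gamma - y <= s
    np.hstack([-K, e_m, Z((m, k)), -np.eye(m), Z((m, p))]),   # -(Ku - e*gamma - y) <= s
    np.hstack([-Kd, e_p, Z((p, k)), Z((p, m)), -np.eye(p)]),  # knowledge: Kd u - gamma + t >= hd
    np.hstack([np.eye(k), Z((k, 1)), -np.eye(k), Z((k, m)), Z((k, p))]),   # |u| <= a
    np.hstack([-np.eye(k), Z((k, 1)), -np.eye(k), Z((k, m)), Z((k, p))]),
])
rhs = np.concatenate([y, -y, -hd, np.zeros(2 * k)])
bounds = [(None, None)] * (k + 1) + [(0, None)] * (k + m + p)
res = linprog(c, A_ub=rows, b_ub=rhs, bounds=bounds, method="highs")
u, gamma = res.x[:k], res.x[k]

def f(Xq):
    return gaussian_kernel(Xq, B, mu) @ u - gamma
```

Because the basis B includes the knowledge points, the fitted f can honor the knowledge in a region where no data were given.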
Numerical Experience: Approximation
Evaluate on three datasets:
  two synthetic datasets
  Wisconsin Prognostic Breast Cancer Database (WPBC): 194 patients × 2 histological features (tumor size and number of metastasized lymph nodes)
Compare approximation with prior knowledge to approximation without prior knowledge
  prior knowledge leads to improved accuracy
  the general prior knowledge used cannot be handled exactly by previous work (MSW 2004)
  no kernelization of the knowledge set is needed
Two-Dimensional Hyperboloid
f(x1, x2) = x1x2
Two-Dimensional Hyperboloid
Given exact values only at 11 points along the line x1 = x2, at x1 ∈ {−5, −4, …, 4, 5}
[Figure: the 11 data points in the (x1, x2) plane]
Two-Dimensional Hyperboloid Approximation without Prior Knowledge
Two-Dimensional Hyperboloid
Add prior (inexact) knowledge: x1x2 ≥ 1 ⇒ f(x1, x2) ≥ x1x2
The nonlinear term x1x2 cannot be handled exactly by any previous approach
Discretization used only 11 points along the line x1 = −x2, x1 ∈ {−5, −4, …, 4, 5}
Two-Dimensional Hyperboloid Approximation with Prior Knowledge
Two-Dimensional Tower Function
Two-Dimensional Tower Function (Misleading) Data
Given 400 points on the grid [−4, 4] × [−4, 4]
Given values are min{g(x), 2}, where g(x) is the exact tower function; the data are thus misleading wherever the tower exceeds height 2
Two-Dimensional Tower Function Approximation without Prior Knowledge
Two-Dimensional Tower Function: Prior Knowledge
Add prior knowledge: (x1, x2) ∈ [−4, 4] × [−4, 4] ⇒ f(x) = g(x)
Here the prior knowledge is the exact function value, enforced at 2500 points on the grid [−4, 4] × [−4, 4] through the above implication
The principal objective of the prior knowledge here is to overcome the poor given data
Two-Dimensional Tower Function Approximation with Prior Knowledge
Breast Cancer Application: Predicting Lymph Node Metastasis as a Function of Tumor Size
The number of metastasized lymph nodes is an important prognostic indicator for breast cancer recurrence
It is determined by surgery, in addition to the removal of the tumor
The surgery is optional, especially if the tumor size is small
Wisconsin Prognostic Breast Cancer (WPBC) data: lymph node metastasis and tumor size for 194 patients
Task: predict the number of metastasized lymph nodes given tumor size alone
Predicting Lymph Node Metastasis
Split the data into two portions:
  past data: 20%, used to find prior knowledge
  present data: 80%, used to evaluate performance
Past data simulates prior knowledge obtained from an expert
Prior Knowledge for Lymph Node Metastasis as a Function of Tumor Size
Generate prior knowledge by fitting the past data: h(x) := K(x′, B′)u − γ, where B is the matrix of the past data points
Use density estimation to decide where to enforce the knowledge; p(x) is the empirical density of the past data
Prior knowledge utilized on the approximating function f(x): the number of metastasized lymph nodes is greater than the value predicted from the past data, with a tolerance of 0.01:
  p(x) ≥ 0.1 ⇒ f(x) ≥ h(x) − 0.01
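One simple way to realize the density condition p(x) ≥ 0.1 is a kernel density estimate over the past data. The talk does not specify which estimator was used, so the sketch below, with synthetic "past" tumor sizes, is only illustrative:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
past = rng.normal(loc=2.5, scale=0.8, size=50)  # hypothetical "past" tumor sizes (cm)
p = gaussian_kde(past)                          # empirical density estimate p(x)
grid = np.linspace(0.0, 10.0, 201)
knowledge_pts = grid[p(grid) >= 0.1]            # enforce knowledge only where p(x) >= 0.1
```

The implication p(x) ≥ 0.1 ⇒ f(x) ≥ h(x) − 0.01 is then discretized at `knowledge_pts`, restricting the knowledge to regions well supported by past data.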
Predicting Lymph Node Metastasis: Results
RMSE: root-mean-squared error; LOO: leave-one-out error
Improvement due to knowledge: 14.9%

  Approximation                                         Error
  Prior knowledge h(x), based on past data (20%)        6.12 RMSE
  f(x) without knowledge, based on present data (80%)   5.92 LOO
  f(x) with knowledge, based on present data (80%)      5.04 LOO
Incorporating Prior Knowledge in Classification (Very Similar)
Implication for a positive region: g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ 1, ∀x ∈ Γ ⊂ R^n
Sufficient linear condition: K(x′, B′)u − γ − 1 + v′g(x) ≥ 0, v ≥ 0, ∀x ∈ Γ
A similar implication holds for negative regions
Add the discretized constraints to the linear program
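A sketch of a positive-region knowledge constraint added to the classification LP: at discretization points x_i of the region, enforce K(x_i′, B′)u − γ + t_i ≥ 1 with penalized slacks t_i ≥ 0. The toy data (negative labeled points near the origin, a positive knowledge region around (3, 3)) and all parameter values are illustrative assumptions, not the talk's examples:

```python
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(X, B, mu):
    sq = (X ** 2).sum(axis=1)[:, None] - 2.0 * X @ B.T + (B ** 2).sum(axis=1)[None, :]
    return np.exp(-mu * sq)

# Labeled data: only negative points near the origin.  Prior knowledge:
# the region around (3, 3) is positive, discretized at three points.
A = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]])
y = np.array([-1.0, -1.0, -1.0])
Xd = np.array([[3.0, 3.0], [2.5, 3.0], [3.0, 2.5]])
B = np.vstack([A, Xd])              # basis: labeled plus knowledge points
mu, nu, sigma = 1.0, 10.0, 100.0
K = gaussian_kernel(A, B, mu)
Kd = gaussian_kernel(Xd, B, mu)
m, k, p = len(y), B.shape[0], len(Xd)
D = np.diag(y)

# variable order: z = [u (k), gamma (1), a (k), s (m), t (p)]
c = np.concatenate([np.zeros(k), [0.0], np.ones(k),
                    nu * np.ones(m), sigma * np.ones(p)])
e_p, Z = np.ones((p, 1)), np.zeros
rows = np.vstack([
    np.hstack([-D @ K, y.reshape(-1, 1), Z((m, k)), -np.eye(m), Z((m, p))]),  # D(Ku - e*gamma) + s >= e
    np.hstack([-Kd, e_p, Z((p, k)), Z((p, m)), -np.eye(p)]),                  # knowledge: Kd u - gamma + t >= 1
    np.hstack([np.eye(k), Z((k, 1)), -np.eye(k), Z((k, m)), Z((k, p))]),      # |u| <= a
    np.hstack([-np.eye(k), Z((k, 1)), -np.eye(k), Z((k, m)), Z((k, p))]),
])
rhs = np.concatenate([-np.ones(m), -np.ones(p), np.zeros(2 * k)])
bounds = [(None, None)] * (k + 1) + [(0, None)] * (k + m + p)
res = linprog(c, A_ub=rows, b_ub=rhs, bounds=bounds, method="highs")
u, gamma = res.x[:k], res.x[k]

def f(Xq):
    return gaussian_kernel(Xq, B, mu) @ u - gamma
```

As in the spiral example, the classifier labels the knowledge region positive even though no positive points were ever labeled there.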
Incorporating Prior Knowledge: Classification
Numerical Experience: Classification
Evaluate on three datasets:
  two synthetic datasets
  Wisconsin Prognostic Breast Cancer Database
Compare the classifier with prior knowledge to one without prior knowledge
  prior knowledge leads to improved accuracy
  the general prior knowledge used cannot be handled exactly by previous work (FMS 2001, FMS 2003)
  no kernelization of the knowledge set is needed
Checkerboard Dataset: Black & White Points in R²
Checkerboard Classifier Without Knowledge Using 16 Center Points
Prior Knowledge for 16-Point Checkerboard Classifier
Checkerboard Classifier With Knowledge Using 16 Center Points
Spiral Dataset: 194 Points in R²
Labels given only at 100 correctly classified circled points
Spiral Classifier Without Knowledge, Based on 100 Labeled Points
  Note the many incorrectly classified +'s
Prior Knowledge Function for Spiral
  White ⇒ +, Gray ⇒ •
  Prior knowledge imposed at 291 points in each region
Spiral Classifier With Knowledge
  No misclassified points
Predicting Breast Cancer Recurrence Within 24 Months
Wisconsin Prognostic Breast Cancer (WPBC) dataset: 155 patients monitored for recurrence within 24 months; 30 cytological features; 2 histological features (number of metastasized lymph nodes and tumor size)
Predict whether or not a patient remains cancer free after 24 months
  82% of patients remain disease free
  86% accuracy (Bennett, 1992) is the best previously attained
  prior knowledge allows us to incorporate additional information to improve accuracy
Generating WPBC Prior Knowledge
Gray regions indicate areas where g(x) ≤ 0
Simulate an oncological surgeon's advice about recurrence
[Figure: knowledge regions plotted against tumor size in centimeters (horizontal axis) and number of metastasized lymph nodes (vertical axis); + marks patients who recur within 24 months, • marks patients who remain cancer free]
Knowledge imposed at dataset points inside the given regions
WPBC Results
49.7% improvement due to knowledge
35.7% improvement over the best previous predictor

  Classifier           Misclassification Rate
  Without knowledge    18.1%
  With knowledge       9.0%
Conclusion
General nonlinear prior knowledge incorporated into kernel classification and approximation
Implemented as linear inequalities in a linear programming problem
Knowledge appears transparently
Demonstrated effectiveness:
  four synthetic examples
  two real-world problems from breast cancer prognosis
Future work:
  prior knowledge with more general implications
  user-friendly interface for knowledge specification
  generate prior knowledge for real-world datasets
Website Links to Talk & Papers
http://www.cs.wisc.edu/~olvi
http://www.cs.wisc.edu/~wildt