Support Vector Machines for Data Fitting and Classification

David R. Musicant

with Olvi L. Mangasarian

UW-Madison Data Mining Institute
Annual Review
June 2, 2000

Overview
Regression and its role in data mining
Robust support vector regression
– Our general formulation
Tolerant support vector regression
– Our contributions
– Massive support vector regression
– Integration with data mining tools
Active support vector machines
Other research and future directions

What is regression?
Regression forms a rule for predicting an unknown numerical feature from known ones.
Example: Predicting purchase habits. Can we use...
– age, income, level of education
To predict...
– purchasing patterns?
And simultaneously...
– avoid the “pitfalls” that standard statistical regression falls into?

Regression example

Can we use:

Age   Income         Years of Education   $ spent on software
30    $56,000 / yr   16                   $800
50    $60,000 / yr   12                   $0
16    $2,000 / yr    11                   $200

To predict:

Age   Income         Years of Education   $ spent on software
40    $48,000 / yr   17                   ?
29    $60,000 / yr   18                   ?

Role in data mining
Goal: Find new relationships in data
– e.g. customer behavior, scientific experimentation
Regression explores the importance of each known feature in predicting the unknown one.
– Feature selection
Regression is a form of supervised learning
– Use data where the predictive value is known for given instances to form a rule
Massive datasets

Regression is a fundamental task in data mining.

Part I: Robust Regression

a.k.a. Huber Regression


“Standard” Linear Regression

Predicted values: ŷ = Aw + be
Find w, b such that: Aw + be ≈ y
(A: data matrix; y: observed values; b: intercept; e: vector of ones)

Optimization problem
Find w, b such that: Aw + be ≈ y
Bound the error by s: −s ≤ Aw + be − y ≤ s
Minimize the error:

    min  Σ sᵢ²
    s.t. −s ≤ Aw + be − y ≤ s

Traditional approach: minimize squared error.
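As a concrete illustration of the squared-error fit, here is a minimal NumPy sketch (the helper name and the trick of appending a column of ones for the intercept b are ours, not from the talk):

```python
import numpy as np

def least_squares_fit(A, y):
    """Minimize sum_i (A_i w + b - y_i)^2 by ordinary least squares."""
    m = A.shape[0]
    A1 = np.hstack([A, np.ones((m, 1))])   # extra column of ones carries the intercept b
    coef, *_ = np.linalg.lstsq(A1, y, rcond=None)
    return coef[:-1], coef[-1]              # w, b
```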

Examining the loss function
Standard regression uses a squared error loss function.
– Points which are far from the predicted line (outliers) are overemphasized.

[Plot: squared error loss function]

Alternative loss function
Instead of squared error, try the absolute value of the error:

    min  Σ |sᵢ|
    s.t. −s ≤ Aw + be − y ≤ s

[Plot: absolute value loss function]

This is called the 1-norm loss function.
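Because the 1-norm loss formulates well as a linear program, it can be solved with an off-the-shelf LP solver. A minimal sketch using scipy.optimize.linprog (the variable stacking x = [w, b, s] and the function name are our own choices):

```python
import numpy as np
from scipy.optimize import linprog

def one_norm_regression(A, y):
    """Minimize sum_i |A_i w + b - y_i| as a linear program.
    Decision variables are stacked as x = [w (n), b (1), s (m)]."""
    m, n = A.shape
    e = np.ones((m, 1))
    I = np.eye(m)
    # Encode -s <= Aw + be - y <= s as two blocks of <= constraints.
    A_ub = np.block([[ A,  e, -I],    #  Aw + be - s <=  y
                     [-A, -e, -I]])   # -Aw - be - s <= -y
    b_ub = np.concatenate([y, -y])
    c = np.concatenate([np.zeros(n + 1), np.ones(m)])   # minimize sum of s
    bounds = [(None, None)] * (n + 1) + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.x[n]   # w, b
```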

1-Norm Problems and Solution
– Overemphasizes error on points close to the predicted line
Solution: Huber loss function, a hybrid approach
– Quadratic for small errors, linear for large ones

Many practitioners prefer the Huber loss function.

[Plot: Huber loss function]

Mathematical Formulation
γ indicates the switchover from quadratic to linear:

    ρ(t) = t²/2          if |t| ≤ γ
    ρ(t) = γ|t| − γ²/2   if |t| > γ

Larger γ means “more quadratic.”
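A direct NumPy transcription of this loss (gamma plays the role of the switchover parameter γ; the function name is ours):

```python
import numpy as np

def huber_loss(t, gamma):
    """Huber loss: quadratic for |t| <= gamma, linear with matched slope beyond it."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= gamma,
                    t**2 / 2.0,
                    gamma * np.abs(t) - gamma**2 / 2.0)
```

Setting gamma very large recovers the squared loss on essentially all residuals; setting it very small behaves like the 1-norm loss.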

Regression Approach Summary
Quadratic loss function
– Standard method in statistics
– Over-emphasizes outliers
Linear loss function (1-norm)
– Formulates well as a linear program
– Over-emphasizes small errors
Huber loss function (hybrid approach)
– Appropriate emphasis on large and small errors


Previous attempts complicated
Earlier efforts to solve Huber regression:
– Huber: Gauss-Seidel method
– Madsen/Nielsen: Newton method
– Li: conjugate gradient method
– Smola: dual quadratic program

Our new approach: a convex quadratic program

    min over w ∈ R^d, z ∈ R^l, t ∈ R^l:   Σ zᵢ²/2 + γ Σ |tᵢ|
    s.t.  z − t ≤ Aw + be − y ≤ z + t

Our new approach is simpler and faster.
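A minimal sketch of this convex quadratic program using cvxpy (the talk's experiments used CPLEX; cvxpy and its default solver are our substitution here, and gamma is the Huber switchover parameter):

```python
import numpy as np
import cvxpy as cp

def huber_regression_qp(A, y, gamma):
    """min  sum(z_i^2)/2 + gamma * sum(t_i)
       s.t. z - t <= Aw + be - y <= z + t
    The constraints force t >= |Aw + be - y - z|, so sum(t) acts as a 1-norm term."""
    m, n = A.shape
    w, b = cp.Variable(n), cp.Variable()
    z, t = cp.Variable(m), cp.Variable(m)
    residual = A @ w + b - y
    constraints = [z - t <= residual, residual <= z + t]
    objective = cp.Minimize(0.5 * cp.sum_squares(z) + gamma * cp.sum(t))
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```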

Experimental Results: Census20k
20,000 points, 11 features
[Bar chart: CPU time (sec, 0 to 600) for the Li, Madsen/Nielsen, Huber, Smola, and MM methods at γ = 0.1, 1, and 1.345; the MM method (ours) is faster.]

Experimental Results: CPUSmall
8,192 points, 12 features
[Bar chart: CPU time (sec, 0 to 200) for the Li, Madsen/Nielsen, Huber, Smola, and MM methods at γ = 0.1, 1, and 1.345; the MM method (ours) is faster.]

Introduce nonlinear kernel!
Begin with the previous formulation:

    min over w ∈ R^d, z ∈ R^l, t ∈ R^l:   Σ zᵢ²/2 + γ Σ |tᵢ|
    s.t.  z − t ≤ Aw + be − y ≤ z + t

Substitute w = A'α and minimize over α instead:

    z − t ≤ AA'α + be − y ≤ z + t

Substitute K(A, A') for AA':

    z − t ≤ K(A, A')α + be − y ≤ z + t

A kernel is a nonlinear function.
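For concreteness, the Gaussian (radial basis) kernel used in the results below can be computed as follows; the bandwidth parameter mu and the helper name are our choices:

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """K(A, B')_{ij} = exp(-mu * ||A_i - B_j||^2)."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-mu * np.maximum(sq_dists, 0.0))   # clamp tiny negatives from roundoff
```

With K(A, A') in place of AA', the fitted surface at a new point x is evaluated as K(x', A')α + b.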

Nonlinear results

Dataset          Kernel     Training Accuracy   Testing Accuracy
CPUSmall         Linear     94.50%              94.06%
CPUSmall         Gaussian   97.26%              95.90%
Boston Housing   Linear     85.60%              83.81%
Boston Housing   Gaussian   92.36%              88.15%

Nonlinear kernels improve accuracy.

Part II: Tolerant Regression

a.k.a. Tolerant Training

Regression Approach Summary
Quadratic loss function
– Standard method in statistics
– Over-emphasizes outliers
Linear loss function (1-norm)
– Formulates well as a linear program
– Over-emphasizes small errors
Huber loss function (hybrid approach)
– Appropriate emphasis on large and small errors


Optimization problem (1-norm)
Find w, b such that: Aw + be ≈ y
Bound the error by s: −s ≤ Aw + be − y ≤ s
Minimize the error:

    min  Σ |sᵢ|
    s.t. −s ≤ Aw + be − y ≤ s

Minimize the magnitude of the error.

The overfitting issue
Noisy training data can be fitted “too well”
– leads to poor generalization on future data
Prefer simpler regressions, i.e. where
– some w coefficients are zero
– the line is “flatter”

[Plot: fitted line ŷ = Aw + be through noisy training points A]

Reducing overfitting
To achieve both goals:
– minimize the magnitude of the w vector

    min  Σ |wᵢ| + C Σ sᵢ
    s.t. −s ≤ Aw + be − y ≤ s

C is a parameter to balance the two goals
– chosen by experimentation
Reduces overfitting due to points far from the surface.

Overfitting again: “close” points
“Close” points may be wrong due to noise only
– Line should be influenced by “real” data, not noise
Ignore errors from those points which are close!

[Plot: fitted line ŷ = Aw + be with a tolerance interval of width ε around it]

Tolerant regression

    min  Σ |wᵢ| + C Σ sᵢ
    s.t. −s ≤ Aw + be − y ≤ s

Allow an interval of size ε with uniform error:  eε ≤ s
How large should ε be?
– As large as possible, while preserving accuracy

    min  Σ |wᵢ| + C Σ sᵢ − Cμε
    s.t. −s ≤ Aw + be − y ≤ s,   eε ≤ s

How about a nonlinear surface?

Introduce nonlinear kernel!
Begin with the previous formulation:

    min  Σ |wᵢ| + C Σ sᵢ − Cμε
    s.t. −s ≤ Aw + be − y ≤ s,   eε ≤ s

Substitute w = A'α and minimize over α instead:

    −s ≤ AA'α + be − y ≤ s

Substitute K(A, A') for AA':

    −s ≤ K(A, A')α + be − y ≤ s

A kernel is a nonlinear function.

Our improvements
This formulation and interpretation are new!
– Improves intuition from prior results
– Uses fewer variables
– Solves faster!
Computational tests run on DMI Locop2
– Dell PowerEdge 6300 server with four gigabytes of memory and 36 gigabytes of disk space
– Windows NT Server 4.0
– CPLEX 6.5 solver
– Donated to UW by Microsoft Corporation

Comparison Results

Census
  μ                   0        0.1      0.2 ... 0.7
  Tuning set error    5.10%    4.74%    ...
  ε                   0.00     0.02     ...
  SSR time (sec)      980      935      ...          Total: 5086
  MM time (sec)       199      294      ...          Total: 3765
  Time improvement: 79.7% max, 26.0% avg

CompActiv
  Tuning set error    6.60%    6.32%    ...
  ε                   0.00     3.09     ...
  SSR time (sec)      1364     1286     ...          Total: 7604
  MM time (sec)       468      660      ...          Total: 6533
  Time improvement: 65.7% max, 14.1% avg

Boston Housing
  Tuning set error    14.69%   14.62%   ...
  ε                   0.00     0.42     ...
  SSR time (sec)      36       34       ...          Total: 170
  MM time (sec)       17       23       ...          Total: 140
  Time improvement: 52.0% max, 17.6% avg


Problem size concerns
How does the problem scale?
– m = number of points
– n = number of features
For a linear kernel, the problem size is O(mn):

    −s ≤ Aw + be − y ≤ s

For a nonlinear kernel, the problem size is O(m²):

    −s ≤ K(A, A')α + be − y ≤ s

Thousands of data points ==> massive problem!

Need an algorithm that will scale well.

Chunking approach
Idea: Use a chunking method
– Bring as much into memory as possible
– Solve this subset of the problem
– Retain solution and integrate into next subset
Explored in depth by Paul Bradley and O.L. Mangasarian for linear kernels

Solve in pieces, one chunk at a time.
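A schematic of the row-chunking loop, not the exact algorithm from the talk: solve_subproblem is a hypothetical solver that returns the current fit plus a mask of rows whose constraints are active (the support rows), and those rows are carried into the next chunk.

```python
import numpy as np

def row_chunking_fit(A, y, chunk_rows, solve_subproblem):
    """Solve a large regression in pieces: each pass fits one chunk of rows
    together with the support rows retained from earlier chunks."""
    m = A.shape[0]
    support = np.array([], dtype=int)       # row indices retained so far
    w, b = None, None
    for start in range(0, m, chunk_rows):
        new_rows = np.arange(start, min(start + chunk_rows, m))
        rows = np.union1d(support, new_rows)
        w, b, active_mask = solve_subproblem(A[rows], y[rows])
        support = rows[active_mask]          # keep only rows still active in the solution
    return w, b
```

Column chunking applies the same idea to the columns of the kernel matrix K(A, A'), which is why both are needed in the nonlinear case.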

Row-Column Chunking
Why column chunking also?
– If a nonlinear kernel is used, chunks are very wide.
– A wide chunk must have a small number of rows to fit in memory.

Both these chunks use the same memory!

Chunking Experimental Results

Dataset: 16,000-point subset of Census in R^11, plus noise
Kernel: Gaussian radial basis kernel
LP size: 32,000 nonsparse rows and columns
Problem size: 1.024 billion nonzero values
Time to termination: 18.8 days
Number of SVs: 1,621 support vectors
Solution variables: 33 nonzero components
Final tuning set error: 9.8%
Tuning set error on first chunk (1,000 points): 16.2%

Objective Value & Tuning Set Errorfor Billion-Element MatrixObjective Value

0

5000

10000

15000

20000

25000

00

5005

10009

150013

200018

Row-Column Chunk Iteration NumberTime in Days

Objective Value

Tuning Set Error

8%

10%

12%

14%

16%

18%

20%

00

5005

10009

150013

200018

Row-Column Chunk Iteration NumberTime in Days

Tuning Set Error

Given enough time, we find the right answer!

Integration into data mining tools
Method runs as a stand-alone application, with data resident on disk
With minimal effort, it could sit on top of an RDBMS to manage data input/output
– Queries select a subset of data: easily SQLable
Database queries occur “infrequently”
– Data mining can be performed on a different machine from the one maintaining the DBMS
Licensing of a linear program solver is necessary

Algorithm can integrate with data mining tools.

Part III: Active Support Vector Machines

a.k.a. ASVM

The Classification Problem

Separating surface:  x'w = γ
Bounding planes:  x'w = γ + 1  and  x'w = γ − 1
(A+ and A− are the two classes of points; w is the normal to the separating surface)

Find the surface that best separates the two classes.
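For reference, the standard soft-margin linear SVM for this separating-surface problem, sketched with cvxpy (ASVM itself solves a different reformulation that needs no LP or QP solver; d holds the +1/-1 labels and C is a trade-off parameter of our choosing):

```python
import numpy as np
import cvxpy as cp

def linear_svm(A, d, C):
    """Find w, gamma for the separating surface x'w = gamma, with bounding
    planes x'w = gamma +/- 1 and slack xi for points on the wrong side."""
    m, n = A.shape
    w = cp.Variable(n)
    gamma = cp.Variable()
    xi = cp.Variable(m, nonneg=True)
    margins = cp.multiply(d, A @ w - gamma)      # d_i * (A_i w - gamma)
    constraints = [margins + xi >= 1]
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return w.value, gamma.value
```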

Active Support Vector Machine
Features
– Solves classification problems
– No special software tools necessary! No LP or QP!
– FAST. Works on very large problems.
– Web page: www.cs.wisc.edu/~musicant/asvm
  • Available for download and can be integrated into data mining tools
  • MATLAB integration already provided

# of points   Features   Iterations   Time (CPU min)
4 million     32         5            38.04
7 million     32         5            95.57

Summary and Future Work
Summary
– Robust regression can be modeled simply and efficiently as a quadratic program
– Tolerant regression can be used to solve massive regression problems
– ASVM can solve massive classification problems quickly
Future work
– Parallel approaches
– Distributed approaches
– ASVM for various types of regression

Questions?
