Data Mining with Statistical Learning
Theodoros Evgeniou, Massachusetts Institute of Technology
CBCL, MIT / INSEAD, March 8, 2000
Outline
I. What is data mining?
   - Industry: why data mining?
II. Data mining projects
   - E-support system
   - Detecting patterns in multimedia data
III. Mathematics for complex data mining
   - Statistical Learning Theory
   - Data mining tools
Concluding remarks
Part I
I. What is data mining?
   - Industry: why data mining?
II. Data mining projects
   - E-support system
   - Detecting patterns in multimedia data
III. Mathematics for complex data mining
   - Statistical Learning Theory
   - Data mining tools
Concluding remarks
What is Data Mining?
Goal: to classify or find trends in data in order to improve future decisions.
Examples:
- financial data modeling / forecasting
- customer profiling
- fraud detection
Example: Fraud Detection
Training examples:
Age: 24, Occ.: student,  Spend: $100,  Buy: ...  ->  OK
Age: 39, Occ.: engineer, Spend: $5000, Buy: ...  ->  OK
Age: 27, Occ.: ???????,  Spend: $400,  Buy: ...  ->  FRAUD
Age: 53, Occ.: small b., Spend: $1300, Buy: ...  ->  OK
...
[Diagram: examples -> data mining -> fraud system; a new record (Age: .., Occ.: ..) is classified as FRAUD? or OK?]
Example: Customer Profiling
Training examples:
Age: 24, Occ.: student,  Spend: $100,  Buy: ...  ->  NO
Age: 39, Occ.: engineer, Spend: $5000, Buy: ...  ->  NO
Age: 27, Occ.: ???????,  Spend: $400,  Buy: ...  ->  BUY
Age: 53, Occ.: small b., Spend: $1300, Buy: ...  ->  NO
...
[Diagram: examples -> data mining -> profiling system; a new record (Age: .., Occ.: ..) is classified as BUY? or NO?]
Data Mining: More Examples
• Sales analysis for inventory control
• Diagnostics (manufacturing, health, ...)
• Information filtering/retrieval (e.g. emails, multimedia)
• E-Customer Relationship Management:
  - E-customer profiling (personalization, marketing, ...)
  - E-customer support
Market Interest
• Only 30% of Fortune 500 companies using email respond to it on time (IDC).
• Email filtering/response software: $20M now, $350M in 2003 (IDC); Kana, eGain, aptex, ...: ~$10b market cap.
• Personalization, targeted marketing, collaborative filtering, ... (privacy?); engage, netperceptions, ...: ~$10b market cap.
• 20% of e-companies use internet customer info, 70% will by 2001 (Forrester R.).
• US 1999: $12b credit card fraud, 50% of it on the internet (IDC); similar issues in insurance, telecom, ...
• Fraud detection using data mining: HNC/eHNC market cap ~$500m in 1999, ~$2b in 2000.
Part II
I. What is data mining?
   - Industry: why data mining?
II. Data mining projects
   - E-support system
   - Detecting patterns in multimedia data
III. Mathematics for complex data mining
   - Statistical Learning Theory
   - Data mining tools
Concluding remarks
An E-Support System
Companies need to respond efficiently and accurately to customers' emails...
...how can they manage this when they receive thousands of emails a day?
1 trillion emails/year in 1999, 5 trillion by 2003 (IDC)
An Email Classification System
Training examples:
...bought a piece of... some broken part...       ->  PROBLEM
...would like to return... not satisfied with...  ->  PROBLEM
...send a receipt... previous payment...          ->  ACCOUNT
...request a copy of the report... balance of...  ->  ACCOUNT
...
[Diagram: labeled emails -> data mining -> e-support system; a new email is classified as PROBLEM, ACCOUNT, ...]
An Image Mining System
How can we detect objects in an image?
An Image Mining System
[Diagram: example images -> data mining -> image system; a new image is classified as Pedestrian, Car, ...]
General System Architecture
[Diagram: example data -> data mining -> system; new data is mapped to Decision A, Decision B, ...]
A Data Mining Process
Data exist in many different forms (text, images, web clicks, ...).
STEP 1: Represent the data in numerical form (feature vectors). This step is problem specific.
Raw data (text, images) -> feature extraction -> feature vector, e.g. (12, 3, ...)
A Data Mining Process (cont.)
STEP 2: Statistical analysis of the numerical data (feature vectors): regression, classification, clustering.
Step 1: Text Representation
Example: "...drive..far..see.. later... left.. drive.." -> (2, 0, 1, 1, 1, 1, ...)
WHAT IS THE REPRESENTATION?
• Bag of words
• Bag of combinations of words
• Natural language processing features
(Yang, McCallum, Joachims, ...)
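To make the bag-of-words representation concrete, here is a minimal Python sketch; the tiny vocabulary and the whitespace tokenizer are illustrative assumptions, not part of the original system.

# Minimal bag-of-words sketch: map a text to a vector of word counts
# over a fixed vocabulary (assumed given; real systems build it from data).
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in `text`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["drive", "far", "see", "later", "left"]               # illustrative vocabulary
print(bag_of_words("drive far see later left drive", vocab))   # [2, 1, 1, 1, 1]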
Step 1: Image Representation
(Papageorgiou et al., 1999; Evgeniou et al., 2000)
Example: image -> (12, 92, 74, 0, 12, ..., 124)
WHAT IS THE REPRESENTATION?
• Pixel values
• Projections on filters (wavelets)
• PCA
Feature selection
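As a sketch of two of the representations listed above, the following assumes NumPy and toy random images; the number of principal components k is an arbitrary choice here.

# Sketch of two image representations: raw pixels and PCA projections.
import numpy as np

def pixel_features(image):
    """Raw pixel values flattened into one feature vector."""
    return image.reshape(-1)

def pca_features(images, k):
    """Project each flattened image onto the top-k principal components."""
    X = images.reshape(len(images), -1).astype(float)
    X = X - X.mean(axis=0)                        # center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T                           # k numbers per image

images = np.random.randint(0, 256, size=(100, 16, 16))   # toy 16x16 images
print(pixel_features(images[0]).shape)            # (256,)
print(pca_features(images, k=10).shape)           # (100, 10)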
Step 2: "Learn" a Decision Surface
[Figure: feature vectors (1,13,...), (92,10,...), (41,11,...), (19,3,...), (4,24,...), (7,33,...), (4,71,...) plotted as points, with a decision surface separating the two classes]
Learning Methods
Other approaches:
• Bayesian methods
• Nearest neighbor
• Neural networks
• Decision trees
• Expert systems
New approach:
• The Statistical Learning approach
Part III
I. What is data mining?
   - Industry: why data mining?
II. Data mining projects
   - E-support system
   - Detecting patterns in multimedia data
III. Mathematics for complex data mining
   - Statistical Learning Theory
   - Data mining tools
Concluding remarks
Roadmap
• Formal setting of learning from examples
• Standard learning methods
• The Statistical Learning approach
• Tools and contributions
Formal Setting of the Problem
Given a set of examples (data)
$(x_1, y_1), (x_2, y_2), \ldots, (x_\ell, y_\ell)$
Question: find a function $f$ such that $f(x) = \hat{y}$ is a good predictor of $y$ for a future input $x$.
The Ideal Solution
What is a "good predictor"?
If data $(x, y)$ appear according to an (unknown) probability distribution $P(x, y)$, then we want our solution to minimize the Expected Error
$R[f] = \int V(y, f(x))\, P(x, y)\, dx\, dy$
$V(y, f(x))$: loss function measuring the "cost" of predicting $f(x)$ instead of $y$ (e.g. $(y - f(x))^2$).
(I) Empirical Error Minimization
We only have example data, so go for the obvious: minimize the Empirical Error
$\min_f \; R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$
...and hope that the solution has a small expected error.
(Where do we take $f$ from?)
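As a minimal sketch, the empirical error is just an average loss over the examples; the squared loss and the toy data below are illustrative assumptions.

# Empirical error R_emp[f] = (1/l) * sum_i V(y_i, f(x_i)),
# here with the squared loss V(y, f(x)) = (y - f(x))**2 as an example.

def empirical_error(f, xs, ys, V=lambda y, fx: (y - fx) ** 2):
    """Average loss of predictor f over the example data."""
    return sum(V(y, f(x)) for x, y in zip(xs, ys)) / len(xs)

f = lambda x: 2 * x + 1                                 # a toy linear predictor
print(empirical_error(f, xs=[0, 1, 2], ys=[1, 3, 6]))   # (0 + 0 + 1) / 3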
(II) Function Space
Where do we choose $f$ from?
Can $f$ be any constant function? Can $f$ be any polynomial?
Standard Learning Methods
A standard way of building learning methods:
• Step 1: define a function space $H$
• Step 2: define the loss function $V(y, f(x))$
• Step 3: find the $f$ in $H$ that minimizes the empirical error
$\min_{f \in H} \; R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$
Standard Learning Methods
A standard way of building learning methods:
• Step 1: define a function space $H$  (How?)
• Step 2: define the loss function $V(y, f(x))$
• Step 3: find the $f$ in $H$ that minimizes the empirical error  (Ok?)
$\min_{f \in H} \; R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$  (Enough?)
The Central Questions
I. How do we choose the function space $H$?
II. What if there are many solutions in $H$ minimizing the empirical error (ill-posed problem)?
III. Does the function $f$ in $H$ that minimizes the empirical error also minimize the expected error?
Statistical Learning Approach (Vapnik, Chervonenkis, 1968- )
I. Choose the function space $H$ according to its complexity. Formal measures are provided (e.g. the VC-dimension).
II. With appropriate control of the complexity of the function space, the problem becomes well-posed: there is a unique solution.
III. The theory provides necessary and sufficient conditions for the uniform convergence of the empirical error to the expected error in a function space, in terms of the complexity of the space.
Important Bound (Vapnik, Chervonenkis, 1971)
The theory provides bounds on the distance between the expected and the empirical error:
$R[f] \le R_{emp}[f] + O\left(\sqrt{h/\ell}\right)$
where
$R[f] = \int V(y, f(x))\, P(x, y)\, dx\, dy$ : Expected error
$R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$ : Empirical error
$h$ : complexity of the function space, $\ell$ : number of data
These bounds can be used to choose the function space $H$.
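A sketch of how such a bound can guide the choice of H: among nested spaces of increasing complexity h, pick the one minimizing empirical error plus the penalty. The constant c and the toy error curve are assumptions, and `train_and_get_emp_error` is a hypothetical stand-in for training on each space.

import math

def choose_space(hs, num_examples, train_and_get_emp_error, c=1.0):
    """Return the complexity h minimizing R_emp + c * sqrt(h / l)."""
    def bound(h):
        return train_and_get_emp_error(h) + c * math.sqrt(h / num_examples)
    return min(hs, key=bound)

# Toy empirical error that shrinks as complexity grows (cf. the next slides):
emp = lambda h: 1.0 / (1.0 + h)
print(choose_space(range(1, 50), num_examples=100, train_and_get_emp_error=emp))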
Using the Bound
[Figure: two fits to the same data, one underfitting and one overfitting]
Using the Bound
[Figure: error vs. complexity h; the empirical error decreases with h while the expected error is minimized at an intermediate h_opt]
Standard Approaches
A standard way of building learning methods:
• Step 1: define a function space $H$  (How?)
• Step 2: define the loss function $V(y, f(x))$
• Step 3: find the $f$ in $H$ that minimizes the empirical error  (Ok?)
$\min_{f \in H} \; R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$  (Enough?)
The Statistical Learning Approach
The new way of building learning methods:
Minimize: Empirical Error + Complexity
$\min_{f \in H} \; \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda\, \mathrm{complexity}(f)$
by trying many $H$, using the bound
$R[f] \le R_{emp}[f] + O\left(\sqrt{h/\ell}\right)$
The Statistical Learning Approach
It solves the problems of the standard methods:
• Step 1: define a function space $H$
• Step 2: define the loss function $V(y, f(x))$
• Step 3: find the $f$ in $H$ that minimizes the empirical error
$\min_{f \in H} \; R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$
Example
$H = \{ f(x) = w \cdot x + b \}$
$V(y, f(x)) = 0$ if $x$ is classified correctly, $1$ if not
aka Perceptron (Neural Network)
$\min_{f \in H} \; R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$
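A minimal perceptron sketch for this example, assuming NumPy, labels in {-1, +1}, and an arbitrary learning rate and epoch count:

import numpy as np

def perceptron(X, y, epochs=100, lr=0.1):
    """Learn a line w.x + b by updating on misclassified examples."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:    # misclassified: move the line
                w += lr * yi * xi
                b += lr * yi
    return w, b

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])             # a linearly separable toy problem
w, b = perceptron(X, y)
print(np.sign(X @ w + b))                 # [-1. -1. -1.  1.]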
Statistical Learning Approach
What if we restrict the set of lines, i.e. the function space (and thereby control complexity)?
Statistical Learning Approach
$H = \{ f(x) = w \cdot x + b \; ; \; \|w\|^2 \le A \}$, plus scaling
The distance of a point $x_0$ from the separating line is $d(x_0; w, b) = \frac{|w \cdot x_0 + b|}{\|w\|}$
Benefits of Statistical Learning
a) The problem becomes well-posed.
b) The solution has a smaller expected error.
Empirical Error vs Complexity
What if we further restrict complexity?
Benefits of Statistical Learning
Avoid overfitting.
(Important for high-dimensional data!)
Support Vector Machines (Vapnik, Cortes, 1995)
$\min_{w} \; \frac{1}{\ell} \sum_{i=1}^{\ell} |1 - y_i\,(w \cdot x_i)|_{+} \; + \; \lambda\, \|w\|^2$
(first term: Empirical Error; second term: Complexity)
where $|a|_{+} = a$ if $a > 0$, and $0$ if $a \le 0$.
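This trade-off can be sketched directly with subgradient descent on the hinge loss plus the L2 penalty; actual SVM training is a quadratic program (see the later slides), so the optimizer, step size, and lambda below are simplifying assumptions, and the bias term is omitted for brevity.

import numpy as np

def svm_subgradient(X, y, lam=0.01, lr=0.1, iters=1000):
    """Minimize (1/l) * sum |1 - y_i (w.x_i)|_+  +  lam * ||w||^2."""
    w = np.zeros(X.shape[1])
    l = len(y)
    for _ in range(iters):
        margins = y * (X @ w)
        active = margins < 1              # examples violating the margin
        grad = -(y[active, None] * X[active]).sum(axis=0) / l + 2 * lam * w
        w -= lr * grad
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(np.sign(X @ svm_subgradient(X, y)))   # [ 1.  1. -1. -1.]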
Non-linear Function Spaces
Generally $f$ can be any "linear" function in some very complex feature space:
$H = \left\{ f(x) = \sum_{n=1}^{N} w_n\, \phi_n(x) \; ; \; \sum_{n=1}^{N} \frac{w_n^2}{\lambda_n} \le A \right\}$
$\phi_n(x)$ : some complex feature
$N$ : number of features (possibly very large)
Example: Second Order Features
For $x = (x_1, x_2)$:
$\phi_1(x) = x_1,\; \phi_2(x) = x_2,\; \phi_3(x) = x_1^2,\; \phi_4(x) = x_2^2,\; \phi_5(x) = x_1 x_2$
$f(x) = w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2$
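A tiny sketch of this feature map, with illustrative weights:

# Second-order feature map: (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2).
# A function linear in these features is quadratic in the original inputs.
def second_order_features(x1, x2):
    return (x1, x2, x1 ** 2, x2 ** 2, x1 * x2)

def f(x, w):
    """f(x) = sum_n w_n * phi_n(x), linear in the features."""
    return sum(wn * pn for wn, pn in zip(w, second_order_features(*x)))

print(second_order_features(2.0, 3.0))    # (2.0, 3.0, 4.0, 9.0, 6.0)
print(f((2.0, 3.0), w=[1, 0, 0, 0, 1]))   # 2.0 + 6.0 = 8.0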
Second Order Polynomials
Using more complex features (second order features).
Reproducing Kernel Hilbert Space
RKHS: a space of linear functions in a feature space satisfying some conditions (functional analysis...).
Examples of features: coordinates $\phi_j(x) = x_j$, exponential features $e^{x_n}$, trigonometric features $\cos(n\, x_j)$.
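In practice an RKHS is usually specified through its kernel K(x, x') = sum_n lambda_n phi_n(x) phi_n(x'), which can be evaluated without enumerating the features. Two standard choices, shown here as assumed examples rather than the slide's own:

import numpy as np

def polynomial_kernel(x, z, degree=2):
    """Inner product in the space of all monomials up to `degree`."""
    return (1.0 + np.dot(x, z)) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """Kernel of an infinite-dimensional RKHS (N infinite)."""
    d = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(d, d) / (2 * sigma ** 2))

print(polynomial_kernel([1.0, 2.0], [3.0, 4.0]))  # (1 + 11)^2 = 144.0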
Support Vector Machines: General
$\min_{f \in H} \; \frac{1}{\ell} \sum_{i=1}^{\ell} |1 - y_i\, f(x_i)|_{+} \; + \; \lambda\, \|f\|_H^2$
(first term: Empirical Error; second term: Complexity)
Training: Quadratic Programming
Kernel Machines
$\min_{f \in H} \; \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) \; + \; \lambda\, \|f\|_H^2$
(first term: Empirical Error; second term: Complexity)
Choices to make: the loss $V$, the features $\phi_n$ (the kernel), and the regularization parameter $\lambda$.
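A sketch of one such kernel machine, taking V to be the squared loss (regularized least squares): by the representer theorem the minimizer can be written f(x) = sum_i c_i K(x_i, x), and the coefficients then solve a linear system. The kernel and lambda below are assumptions.

import numpy as np

def train_rls(X, y, kernel, lam=0.1):
    """Kernel machine with squared loss: solve (K + lam*l*I) c = y."""
    l = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    c = np.linalg.solve(K + lam * l * np.eye(l), y)
    return lambda x: sum(ci * kernel(xi, x) for ci, xi in zip(c, X))

X = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
y = np.array([0.0, 1.0, 4.0])
f = train_rls(X, y, lambda a, b: float(np.exp(-np.sum((a - b) ** 2))))
print(round(f(np.array([1.0])), 2))       # a smoothed fit near y = 1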
Some Kernel Machines (Vapnik 1998, Evgeniou et al. 1999)
With appropriate choices of the complex features and the loss function V we can get:
• Support Vector Machines (SVM)
• A type of multi-layer perceptrons
• A type of radial basis functions
• A type of spline models
• A type of additive models
• A type of ridge regression models
Kernel Machines Analysis (the difficult questions)
Does the empirical error of general kernel machines converge to the expected error?
What is the distance between the empirical and the expected error for these machines?
Are these machines well-posed?
Convergence of Kernel Machines
Theorem (Evgeniou, Pontil, Poggio, 1999). For any $L_p$ loss function (i.e. $|y - f(x)|^p$) and for the SVM loss function, the learning machine with hypothesis space a RKHS,
$H = \left\{ f(x) = \sum_{n=1}^{N} w_n\, \phi_n(x) \; ; \; \sum_{n=1}^{N} \frac{w_n^2}{\lambda_n} \le A \right\}$,
converges for any $A$ (also for infinite $N$).
Implications of the Theorem
The empirical error converges to the expected one for:
• Support Vector Machines
• A type of:
  - multi-layer perceptrons (i.e. neural networks)
  - radial basis functions
  - spline models (i.e. piece-wise linear functions)
  - ...
Bounds on Expected Error (Evgeniou, Pontil, 2000)
Furthermore, we can get bounds on the distance between the expected error and the empirical error "of the form"
$R[f] \le R_{emp}[f] + O\left(\sqrt{h/\ell}\right)$
by measuring $h$, the complexity of sets in a RKHS (Evgeniou, Pontil, 1999).
Kernel Machines: Contributions
Does the empirical error of general kernel machines converge to the expected error? YES!
What is the distance between the empirical and the expected error for these machines? BOUNDS!
Are these machines well-posed? YES
Characteristics of Kernel Machines
• Automatic complexity control
• Guaranteed bounds on expected error
• Unique optimal solution
• Good with very high-dimensional data
• Little parameter tuning (V, RKHS, $\lambda$)
The Email Classification System
...bought a piece of... some broken part...       ->  problem
...would like to return... not satisfied with...  ->  problem
...send a receipt... previous payment...          ->  account
...request a copy of the report... balance of...  ->  account
...
[Diagram: the labeled emails are the data to mine]
The Email Classification System
Representation: high-dimensional feature vectors.
[Diagram: the labeled emails (problem, account, ...) are mapped to feature vectors and fed to the text system]
The Image Mining System
Representation: high-dimensional feature vectors.
[Diagram: example images are mapped to feature vectors and fed to the image system]
Image System Performance
Comparing representations.
Train size: 700 pedestrians, 6000 non-pedestrians.
Test size: 224 pedestrians, 3000 non-pedestrians.
[Bar chart: % detection (50-100%) for the representations Pixels, Wav.29a, Wav.29b, Wav.1326]
Collaboration with C. Papageorgiou and M. Pontil.
Image System Performance
Comparing learning methods (29 wavelets).
[Bar chart: % correct (50-100%) for CART vs. SVM]
Collaboration with L. Perez-Breva.
Some Text System Performance
Preliminary results on a 2-class newsgroup email classification problem (800 training examples, 1200 test examples).
[Bar chart: % correct (50-100%) for BAYES vs. SVM]
Collaboration with R. Rifkin and C. Papageorgiou (in progress).
Preliminary Multi-Class Text System
20-class (multi-class) categorization: how is it done? (One common scheme is sketched below.)
[Bar chart: % correct (0-100%) for Chance, Bayes, SVM]
Collaboration with R. Rifkin and C. Papageorgiou (in progress).
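One common answer (an assumption here; the slide does not say which scheme was used) is one-vs-rest: train one binary machine per class and predict the class whose machine scores highest.

import numpy as np

def train_one_vs_rest(X, labels, classes, train_binary):
    """train_binary(X, y) -> real-valued scorer, with y in {-1, +1}."""
    # With 20 classes this trains 20 binary machines (e.g. SVMs).
    scorers = {c: train_binary(X, np.where(labels == c, 1, -1))
               for c in classes}
    return lambda x: max(classes, key=lambda c: scorers[c](x))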
Examples of the Image System
Summary and Contributions
I. The importance of data mining
II. Text and image systems
   - Choosing representations, feature selection
III. Statistical Learning: powerful tools
   - Theoretical analysis of kernel learning machines
   - Unified analysis of many "standard" methods
   - Important conceptual and formal tools
Further Plans
• Choosing data representations
• Multi-class categorization
• Unsupervised learning
• Text / multimedia systems
• Web click analysis
• Intelligent agents
• E-Customer support
• Personalization
• Fraud / trust control
[Diagram: these directions span THEORY, SYSTEMS, and MARKET]
Image System ROC Curves
[Figure: ROC curves of the image detection system]
SVM vs Neural Networks
SVM:
• Complexity control
• Quadratic programming
• Unique solution
• Few parameters to tune
• Guaranteed performance
Neural Networks:
• Empirical error control
• Difficult training
• Many local optima
• Often many parameters
• Asymptotic analysis
Convergence of Learning Machines
$\min_{f \in H} R_{emp}[f] \;\longrightarrow\; \min_{f \in H} R[f]$, i.e. $P\big(\sup_{f \in H} |R_{emp}[f] - R[f]| > \varepsilon\big) \to 0$
where
$R_{emp}[f]$ : empirical error of $f$
$R[f]$ : expected error of $f$
$H$ : $\left\{ f(x) = \sum_{n=1}^{N} w_n\, \phi_n(x) \; ; \; \sum_{n=1}^{N} \frac{w_n^2}{\lambda_n} \le A \right\}$
$\ell$ : number of examples
Statistical Learning Theory
Learning from examples = given examples of an input/output relation, find a function $f$ such that
output = $f$(input)
[Diagram: INPUT -> Function -> OUTPUT]
Developed mainly by Vapnik and Chervonenkis in the late 60's, 70's, 80's, 90's, ...
Data Mining: Driving Forces
• 1 trillion emails/year, 5 trillion by 2003 (IDC)
• Yahoo! collects 400 GB/day of web-click info (Business Week)
• $150 billion e-commerce, $1.3 trillion by 2003 (IDC)
• 1 billion web pages, growing 50% per year (IDC)
• Hundreds of millions of digital images, audio, video, ...
Driving Forces
[Chart, 1997-2005: DATA (emails, multimedia, e-commerce) grows much faster than MEMORY, COMPUTATION, and DATA TOOLS]
Further Studies
Bounds on the expected risk of kernel machines are developed (Evgeniou, Pontil, 2000).
Connections with other learning methods are made (Evgeniou, Pontil, Poggio, 1999).
Combinations of learning machines (e.g. many machines, each using different information) are studied (Evgeniou, Perez-Breva, Pontil, Poggio, 2000).
Text Mining
How can we reliably decide if a text is about a particular topic?
[Diagram: a text ("...drive...used... ..broke...send...") is assigned to one of the topics CAR, COMPUTER, TRAVEL, COMMERCE]
What is Data Mining?
Data Mine -> Data Money