Data Mining with Statistical Learning
Theodoros Evgeniou, Massachusetts Institute of Technology
CBCL, MIT / INSEAD, March 8, 2000
Outline
I. What is data mining?
   - Industry: why data mining?
II. Data mining projects
   - E-support system
   - Detecting patterns in multimedia data
III. Mathematics for complex data mining
   - Statistical Learning Theory
   - Data mining tools
Concluding remarks
Part I
I. What is data mining?
   - Industry: why data mining?
II. Data mining projects
   - E-support system
   - Detecting patterns in multimedia data
III. Mathematics for complex data mining
   - Statistical Learning Theory
   - Data mining tools
Concluding remarks
What is Data Mining?
Goal: to classify or find trends in data in order to improve future decisions.
Examples:
- financial data modeling / forecasting
- customer profiling
- fraud detection
Example: Fraud Detection
Training examples:
Age: 24, Occ.: student,  Spend: $100,  Buy: ...  ->  OK
Age: 39, Occ.: engineer, Spend: $5000, Buy: ...  ->  OK
Age: 27, Occ.: ???????,  Spend: $400,  Buy: ...  ->  FRAUD
Age: 53, Occ.: small b., Spend: $1300, Buy: ...  ->  OK
...
[Diagram: examples -> data mining -> fraud system; a new record (Age: .., Occ.: ..) is classified as FRAUD? or OK?]
Example: Customer Profiling
Training examples:
Age: 24, Occ.: student,  Spend: $100,  Buy: ...  ->  NO
Age: 39, Occ.: engineer, Spend: $5000, Buy: ...  ->  NO
Age: 27, Occ.: ???????,  Spend: $400,  Buy: ...  ->  BUY
Age: 53, Occ.: small b., Spend: $1300, Buy: ...  ->  NO
...
[Diagram: examples -> data mining -> profiling system; a new record (Age: .., Occ.: ..) is classified as BUY? or NO?]
Data Mining: More Examples
• Sales analysis for inventory control
• Diagnostics (manufacturing, health, ...)
• Information filtering/retrieval (e.g. emails, multimedia)
• E-Customer Relationship Management:
  - E-customer profiling (personalization, marketing, ...)
  - E-customer support
Market Interest
• Only 30% of Fortune 500 companies using email respond to it on time (IDC).
• Email filtering/response software: $20M now, $350M in 2003 (IDC); Kana, eGain, aptex, ...: ~$10b market cap.
• Personalization, targeted marketing, collaborative filtering, ... (privacy?); engage, netperceptions, ...: ~$10b market cap.
• 20% of e-companies use internet customer info, 70% will by 2001 (Forrester R.).
• US 1999: $12b credit card fraud, 50% of it on the internet (IDC); similar issues in insurance, telecom, ...
• Fraud detection using data mining: HNC/eHNC market cap ~$500m in 1999, ~$2b in 2000.
Part II
I. What is data mining?
   - Industry: why data mining?
II. Data mining projects
   - E-support system
   - Detecting patterns in multimedia data
III. Mathematics for complex data mining
   - Statistical Learning Theory
   - Data mining tools
Concluding remarks
An E-Support System
Companies need to respond efficiently and accurately to customers' emails...
...how can they manage this when they receive thousands of emails a day?
1 trillion emails/year in 1999, 5 trillion by 2003 (IDC)
An Email Classification System
Training examples:
...bought a piece of... some broken part...       ->  PROBLEM
...would like to return... not satisfied with...  ->  PROBLEM
...send a receipt... previous payment...          ->  ACCOUNT
...request a copy of the report... balance of...  ->  ACCOUNT
...
[Diagram: labeled emails -> data mining -> e-support system; a new email is classified as PROBLEM, ACCOUNT, ...]
An Image Mining System
How can we detect objects in an image?
An Image Mining System
[Diagram: example images -> data mining -> image system; a new image is classified as Pedestrian, Car, ...]
General System Architecture
[Diagram: example data -> data mining -> system; new data is mapped to Decision A, Decision B, ...]
A Data Mining Process
Data exist in many different forms (text, images, web clicks, ...).
STEP 1: Represent the data in numerical form (feature vectors). This step is problem specific.
Raw data (text, images) -> feature extraction -> feature vector, e.g. (12, 3, ...)
A Data Mining Process (cont.)
STEP 2: Statistical analysis of the numerical data (feature vectors): regression, classification, clustering.
Step 1: Text Representation
Example: "...drive..far..see.. later... left.. drive.." -> (2, 0, 1, 1, 1, 1, ...)
WHAT IS THE REPRESENTATION?
• Bag of words
• Bag of combinations of words
• Natural language processing features
(Yang, McCallum, Joachims, ...)
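To make the bag-of-words representation concrete, here is a minimal Python sketch; the tiny vocabulary and the whitespace tokenizer are illustrative assumptions, not part of the original system.

# Minimal bag-of-words sketch: map a text to a vector of word counts
# over a fixed vocabulary (assumed given; real systems build it from data).
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in `text`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["drive", "far", "see", "later", "left"]               # illustrative vocabulary
print(bag_of_words("drive far see later left drive", vocab))   # [2, 1, 1, 1, 1]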
Step 1: Image Representation
(Papageorgiou et al., 1999; Evgeniou et al., 2000)
Example: image -> (12, 92, 74, 0, 12, ..., 124)
WHAT IS THE REPRESENTATION?
• Pixel values
• Projections on filters (wavelets)
• PCA
Feature selection
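As a sketch of two of the representations listed above, the following assumes NumPy and toy random images; the number of principal components k is an arbitrary choice here.

# Sketch of two image representations: raw pixels and PCA projections.
import numpy as np

def pixel_features(image):
    """Raw pixel values flattened into one feature vector."""
    return image.reshape(-1)

def pca_features(images, k):
    """Project each flattened image onto the top-k principal components."""
    X = images.reshape(len(images), -1).astype(float)
    X = X - X.mean(axis=0)                        # center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T                           # k numbers per image

images = np.random.randint(0, 256, size=(100, 16, 16))   # toy 16x16 images
print(pixel_features(images[0]).shape)            # (256,)
print(pca_features(images, k=10).shape)           # (100, 10)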
Step 2: "Learn" a Decision Surface
[Figure: feature vectors (1,13,...), (92,10,...), (41,11,...), (19,3,...), (4,24,...), (7,33,...), (4,71,...) plotted as points, with a decision surface separating the two classes]
Learning Methods
Other approaches:
• Bayesian methods
• Nearest neighbor
• Neural networks
• Decision trees
• Expert systems
New approach:
• The Statistical Learning approach
Part III
I. What is data mining?
   - Industry: why data mining?
II. Data mining projects
   - E-support system
   - Detecting patterns in multimedia data
III. Mathematics for complex data mining
   - Statistical Learning Theory
   - Data mining tools
Concluding remarks
Roadmap
• Formal setting of learning from examples
• Standard learning methods
• The Statistical Learning approach
• Tools and contributions
Formal Setting of the Problem
Given a set of examples (data)
$(x_1, y_1), (x_2, y_2), \ldots, (x_\ell, y_\ell)$
Question: find a function $f$ such that $f(x) = \hat{y}$ is a good predictor of $y$ for a future input $x$.
The Ideal Solution
What is a "good predictor"?
If data $(x, y)$ appear according to an (unknown) probability distribution $P(x, y)$, then we want our solution to minimize the Expected Error
$R[f] = \int V(y, f(x))\, P(x, y)\, dx\, dy$
$V(y, f(x))$: loss function measuring the "cost" of predicting $f(x)$ instead of $y$ (e.g. $(y - f(x))^2$).
(I) Empirical Error Minimization
We only have example data, so go for the obvious: minimize the Empirical Error
$\min_f \; R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$
...and hope that the solution has a small expected error.
(Where do we take $f$ from?)
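As a minimal sketch, the empirical error is just an average loss over the examples; the squared loss and the toy data below are illustrative assumptions.

# Empirical error R_emp[f] = (1/l) * sum_i V(y_i, f(x_i)),
# here with the squared loss V(y, f(x)) = (y - f(x))**2 as an example.

def empirical_error(f, xs, ys, V=lambda y, fx: (y - fx) ** 2):
    """Average loss of predictor f over the example data."""
    return sum(V(y, f(x)) for x, y in zip(xs, ys)) / len(xs)

f = lambda x: 2 * x + 1                                 # a toy linear predictor
print(empirical_error(f, xs=[0, 1, 2], ys=[1, 3, 6]))   # (0 + 0 + 1) / 3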
(II) Function Space
Where do we choose $f$ from?
Can $f$ be any constant function? Can $f$ be any polynomial?
Standard Learning Methods
A standard way of building learning methods:
• Step 1: define a function space $H$
• Step 2: define the loss function $V(y, f(x))$
• Step 3: find the $f$ in $H$ that minimizes the empirical error
$\min_{f \in H} \; R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$
Standard Learning Methods
A standard way of building learning methods:
• Step 1: define a function space $H$  (How?)
• Step 2: define the loss function $V(y, f(x))$
• Step 3: find the $f$ in $H$ that minimizes the empirical error  (Ok?)
$\min_{f \in H} \; R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$  (Enough?)
The Central Questions
I. How do we choose the function space $H$?
II. What if there are many solutions in $H$ minimizing the empirical error (ill-posed problem)?
III. Does the function $f$ in $H$ that minimizes the empirical error also minimize the expected error?
Statistical Learning Approach (Vapnik, Chervonenkis, 1968- )
I. Choose the function space $H$ according to its complexity. Formal measures are provided (e.g. the VC-dimension).
II. With appropriate control of the complexity of the function space, the problem becomes well-posed: there is a unique solution.
III. The theory provides necessary and sufficient conditions for the uniform convergence of the empirical error to the expected error in a function space, in terms of the complexity of the space.
Important Bound (Vapnik, Chervonenkis, 1971)
The theory provides bounds on the distance between the expected and the empirical error:
$R[f] \le R_{emp}[f] + O\left(\sqrt{h/\ell}\right)$
where
$R[f] = \int V(y, f(x))\, P(x, y)\, dx\, dy$ : Expected error
$R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$ : Empirical error
$h$ : complexity of the function space, $\ell$ : number of data
These bounds can be used to choose the function space $H$.
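A sketch of how such a bound can guide the choice of H: among nested spaces of increasing complexity h, pick the one minimizing empirical error plus the penalty. The constant c and the toy error curve are assumptions, and `train_and_get_emp_error` is a hypothetical stand-in for training on each space.

import math

def choose_space(hs, num_examples, train_and_get_emp_error, c=1.0):
    """Return the complexity h minimizing R_emp + c * sqrt(h / l)."""
    def bound(h):
        return train_and_get_emp_error(h) + c * math.sqrt(h / num_examples)
    return min(hs, key=bound)

# Toy empirical error that shrinks as complexity grows (cf. the next slides):
emp = lambda h: 1.0 / (1.0 + h)
print(choose_space(range(1, 50), num_examples=100, train_and_get_emp_error=emp))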
Using the Bound
[Figure: two fits to the same data, one underfitting and one overfitting]
Using the Bound
[Figure: error vs. complexity h; the empirical error decreases with h while the expected error is minimized at an intermediate h_opt]
Standard Approaches
A standard way of building learning methods:
• Step 1: define a function space $H$  (How?)
• Step 2: define the loss function $V(y, f(x))$
• Step 3: find the $f$ in $H$ that minimizes the empirical error  (Ok?)
$\min_{f \in H} \; R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$  (Enough?)
The Statistical Learning Approach
The new way of building learning methods:
Minimize: Empirical Error + Complexity
$\min_{f \in H} \; \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda\, \mathrm{complexity}(f)$
by trying many $H$, using the bound
$R[f] \le R_{emp}[f] + O\left(\sqrt{h/\ell}\right)$
The Statistical Learning Approach
It solves the problems of the standard methods:
• Step 1: define a function space $H$
• Step 2: define the loss function $V(y, f(x))$
• Step 3: find the $f$ in $H$ that minimizes the empirical error
$\min_{f \in H} \; R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$
Example
$H = \{ f(x) = w \cdot x + b \}$
$V(y, f(x)) = 0$ if $x$ is classified correctly, $1$ if not
aka Perceptron (Neural Network)
$\min_{f \in H} \; R_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$
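A minimal perceptron sketch for this example, assuming NumPy, labels in {-1, +1}, and an arbitrary learning rate and epoch count:

import numpy as np

def perceptron(X, y, epochs=100, lr=0.1):
    """Learn a line w.x + b by updating on misclassified examples."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:    # misclassified: move the line
                w += lr * yi * xi
                b += lr * yi
    return w, b

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])             # a linearly separable toy problem
w, b = perceptron(X, y)
print(np.sign(X @ w + b))                 # [-1. -1. -1.  1.]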
Statistical Learning Approach
What if we restrict the set of lines, i.e. the function space (and thereby control complexity)?
Statistical Learning Approach
$H = \{ f(x) = w \cdot x + b \; ; \; \|w\|^2 \le A \}$, plus scaling
The distance of a point $x_0$ from the separating line is $d(x_0; w, b) = \frac{|w \cdot x_0 + b|}{\|w\|}$
Benefits of Statistical Learning
a) The problem becomes well-posed.
b) The solution has a smaller expected error.
Empirical Error vs Complexity
What if we further restrict complexity?
Benefits of Statistical Learning
Avoid overfitting.
(Important for high-dimensional data!)
Support Vector Machines (Vapnik, Cortes, 1995)
$\min_{w} \; \frac{1}{\ell} \sum_{i=1}^{\ell} |1 - y_i\,(w \cdot x_i)|_{+} \; + \; \lambda\, \|w\|^2$
(first term: Empirical Error; second term: Complexity)
where $|a|_{+} = a$ if $a > 0$, and $0$ if $a \le 0$.
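This trade-off can be sketched directly with subgradient descent on the hinge loss plus the L2 penalty; actual SVM training is a quadratic program (see the later slides), so the optimizer, step size, and lambda below are simplifying assumptions, and the bias term is omitted for brevity.

import numpy as np

def svm_subgradient(X, y, lam=0.01, lr=0.1, iters=1000):
    """Minimize (1/l) * sum |1 - y_i (w.x_i)|_+  +  lam * ||w||^2."""
    w = np.zeros(X.shape[1])
    l = len(y)
    for _ in range(iters):
        margins = y * (X @ w)
        active = margins < 1              # examples violating the margin
        grad = -(y[active, None] * X[active]).sum(axis=0) / l + 2 * lam * w
        w -= lr * grad
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(np.sign(X @ svm_subgradient(X, y)))   # [ 1.  1. -1. -1.]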
Non-linear Function Spaces
Generally $f$ can be any "linear" function in some very complex feature space:
$H = \left\{ f(x) = \sum_{n=1}^{N} w_n\, \phi_n(x) \; ; \; \sum_{n=1}^{N} \frac{w_n^2}{\lambda_n} \le A \right\}$
$\phi_n(x)$ : some complex feature
$N$ : number of features (possibly very large)
Example: Second Order Features
For $x = (x_1, x_2)$:
$\phi_1(x) = x_1,\; \phi_2(x) = x_2,\; \phi_3(x) = x_1^2,\; \phi_4(x) = x_2^2,\; \phi_5(x) = x_1 x_2$
$f(x) = w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2$
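A tiny sketch of this feature map, with illustrative weights:

# Second-order feature map: (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2).
# A function linear in these features is quadratic in the original inputs.
def second_order_features(x1, x2):
    return (x1, x2, x1 ** 2, x2 ** 2, x1 * x2)

def f(x, w):
    """f(x) = sum_n w_n * phi_n(x), linear in the features."""
    return sum(wn * pn for wn, pn in zip(w, second_order_features(*x)))

print(second_order_features(2.0, 3.0))    # (2.0, 3.0, 4.0, 9.0, 6.0)
print(f((2.0, 3.0), w=[1, 0, 0, 0, 1]))   # 2.0 + 6.0 = 8.0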
Second Order Polynomials
Using more complex features (second order features).
Reproducing Kernel Hilbert Space
RKHS: a space of linear functions in a feature space satisfying some conditions (functional analysis...).
Examples of features: coordinates $\phi_j(x) = x_j$, exponential features $e^{x_n}$, trigonometric features $\cos(n\, x_j)$.
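In practice an RKHS is usually specified through its kernel K(x, x') = sum_n lambda_n phi_n(x) phi_n(x'), which can be evaluated without enumerating the features. Two standard choices, shown here as assumed examples rather than the slide's own:

import numpy as np

def polynomial_kernel(x, z, degree=2):
    """Inner product in the space of all monomials up to `degree`."""
    return (1.0 + np.dot(x, z)) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """Kernel of an infinite-dimensional RKHS (N infinite)."""
    d = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(d, d) / (2 * sigma ** 2))

print(polynomial_kernel([1.0, 2.0], [3.0, 4.0]))  # (1 + 11)^2 = 144.0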
Support Vector Machines: General
$\min_{f \in H} \; \frac{1}{\ell} \sum_{i=1}^{\ell} |1 - y_i\, f(x_i)|_{+} \; + \; \lambda\, \|f\|_H^2$
(first term: Empirical Error; second term: Complexity)
Training: Quadratic Programming
Kernel Machines
$\min_{f \in H} \; \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) \; + \; \lambda\, \|f\|_H^2$
(first term: Empirical Error; second term: Complexity)
Choices to make: the loss $V$, the features $\phi_n$ (the kernel), and the regularization parameter $\lambda$.
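A sketch of one such kernel machine, taking V to be the squared loss (regularized least squares): by the representer theorem the minimizer can be written f(x) = sum_i c_i K(x_i, x), and the coefficients then solve a linear system. The kernel and lambda below are assumptions.

import numpy as np

def train_rls(X, y, kernel, lam=0.1):
    """Kernel machine with squared loss: solve (K + lam*l*I) c = y."""
    l = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    c = np.linalg.solve(K + lam * l * np.eye(l), y)
    return lambda x: sum(ci * kernel(xi, x) for ci, xi in zip(c, X))

X = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
y = np.array([0.0, 1.0, 4.0])
f = train_rls(X, y, lambda a, b: float(np.exp(-np.sum((a - b) ** 2))))
print(round(f(np.array([1.0])), 2))       # a smoothed fit near y = 1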
Some Kernel Machines (Vapnik 1998, Evgeniou et al. 1999)
With appropriate choices of the complex features and the loss function V we can get:
• Support Vector Machines (SVM)
• A type of multi-layer perceptrons
• A type of radial basis functions
• A type of spline models
• A type of additive models
• A type of ridge regression models
Kernel Machines Analysis (the difficult questions)
Does the empirical error of general kernel machines converge to the expected error?
What is the distance between the empirical and the expected error for these machines?
Are these machines well-posed?
Convergence of Kernel Machines
Theorem (Evgeniou, Pontil, Poggio, 1999). For any $L_p$ loss function (i.e. $|y - f(x)|^p$) and for the SVM loss function, the learning machine with hypothesis space a RKHS,
$H = \left\{ f(x) = \sum_{n=1}^{N} w_n\, \phi_n(x) \; ; \; \sum_{n=1}^{N} \frac{w_n^2}{\lambda_n} \le A \right\}$,
converges for any $A$ (also for infinite $N$).
Implications of the Theorem
The empirical error converges to the expected one for:
• Support Vector Machines
• A type of:
  - multi-layer perceptrons (i.e. neural networks)
  - radial basis functions
  - spline models (i.e. piece-wise linear functions)
  - ...
Bounds on Expected Error (Evgeniou, Pontil, 2000)
Furthermore, we can get bounds on the distance between the expected error and the empirical error "of the form"
$R[f] \le R_{emp}[f] + O\left(\sqrt{h/\ell}\right)$
by measuring $h$, the complexity of sets in a RKHS (Evgeniou, Pontil, 1999).
Kernel Machines: Contributions
Does the empirical error of general kernel machines converge to the expected error? YES!
What is the distance between the empirical and the expected error for these machines? BOUNDS!
Are these machines well-posed? YES
Characteristics of Kernel Machines
• Automatic complexity control
• Guaranteed bounds on expected error
• Unique optimal solution
• Good with very high-dimensional data
• Little parameter tuning (V, RKHS, $\lambda$)
The Email Classification System
...bought a piece of... some broken part...       ->  problem
...would like to return... not satisfied with...  ->  problem
...send a receipt... previous payment...          ->  account
...request a copy of the report... balance of...  ->  account
...
[Diagram: the labeled emails are the data to mine]
The Email Classification System
Representation: high-dimensional feature vectors.
[Diagram: the labeled emails (problem, account, ...) are mapped to feature vectors and fed to the text system]
The Image Mining System
Representation: high-dimensional feature vectors.
[Diagram: example images are mapped to feature vectors and fed to the image system]
Image System Performance
Comparing representations.
Train size: 700 pedestrians, 6000 non-pedestrians.
Test size: 224 pedestrians, 3000 non-pedestrians.
[Bar chart: % detection (50-100%) for the representations Pixels, Wav.29a, Wav.29b, Wav.1326]
Collaboration with C. Papageorgiou and M. Pontil.
Image System Performance
Comparing learning methods (29 wavelets).
[Bar chart: % correct (50-100%) for CART vs. SVM]
Collaboration with L. Perez-Breva.
Some Text System Performance
Preliminary results on a 2-class newsgroup email classification problem (800 training examples, 1200 test examples).
[Bar chart: % correct (50-100%) for BAYES vs. SVM]
Collaboration with R. Rifkin and C. Papageorgiou (in progress).
Preliminary Multi-Class Text System
20-class (multi-class) categorization: how is it done? (One common scheme is sketched below.)
[Bar chart: % correct (0-100%) for Chance, Bayes, SVM]
Collaboration with R. Rifkin and C. Papageorgiou (in progress).
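One common answer (an assumption here; the slide does not say which scheme was used) is one-vs-rest: train one binary machine per class and predict the class whose machine scores highest.

import numpy as np

def train_one_vs_rest(X, labels, classes, train_binary):
    """train_binary(X, y) -> real-valued scorer, with y in {-1, +1}."""
    # With 20 classes this trains 20 binary machines (e.g. SVMs).
    scorers = {c: train_binary(X, np.where(labels == c, 1, -1))
               for c in classes}
    return lambda x: max(classes, key=lambda c: scorers[c](x))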
Examples of the Image System
Summary and Contributions
I. The importance of data mining
II. Text and image systems
   - Choosing representations, feature selection
III. Statistical Learning: powerful tools
   - Theoretical analysis of kernel learning machines
   - Unified analysis of many "standard" methods
   - Important conceptual and formal tools
Further Plans
• Choosing data representations
• Multi-class categorization
• Unsupervised learning
• Text / multimedia systems
• Web click analysis
• Intelligent agents
• E-Customer support
• Personalization
• Fraud / trust control
[Diagram: these directions span THEORY, SYSTEMS, and MARKET]
Image System ROC Curves
[Figure: ROC curves of the image detection system]
SVM vs Neural Networks
SVM:
• Complexity control
• Quadratic programming
• Unique solution
• Few parameters to tune
• Guaranteed performance
Neural Networks:
• Empirical error control
• Difficult training
• Many local optima
• Often many parameters
• Asymptotic analysis
Convergence of Learning Machines
$\min_{f \in H} R_{emp}[f] \;\longrightarrow\; \min_{f \in H} R[f]$, i.e. $P\big(\sup_{f \in H} |R_{emp}[f] - R[f]| > \varepsilon\big) \to 0$
where
$R_{emp}[f]$ : empirical error of $f$
$R[f]$ : expected error of $f$
$H$ : $\left\{ f(x) = \sum_{n=1}^{N} w_n\, \phi_n(x) \; ; \; \sum_{n=1}^{N} \frac{w_n^2}{\lambda_n} \le A \right\}$
$\ell$ : number of examples
Statistical Learning Theory
Learning from examples = given examples of an input/output relation, find a function $f$ such that
output = $f$(input)
[Diagram: INPUT -> Function -> OUTPUT]
Developed mainly by Vapnik and Chervonenkis in the late 60's, 70's, 80's, 90's, ...
Data Mining: Driving Forces
• 1 trillion emails/year, 5 trillion by 2003 (IDC)
• Yahoo! collects 400 GB/day of web-click info (Business Week)
• $150 billion e-commerce, $1.3 trillion by 2003 (IDC)
• 1 billion web pages, growing 50% per year (IDC)
• Hundreds of millions of digital images, audio, video, ...
Driving Forces
[Chart, 1997-2005: DATA (emails, multimedia, e-commerce) grows much faster than MEMORY, COMPUTATION, and DATA TOOLS]
Further Studies
Bounds on the expected risk of kernel machines are developed (Evgeniou, Pontil, 2000).
Connections with other learning methods are made (Evgeniou, Pontil, Poggio, 1999).
Combinations of learning machines (e.g. many machines, each using different information) are studied (Evgeniou, Perez-Breva, Pontil, Poggio, 2000).
Text Mining
How can we reliably decide if a text is about a particular topic?
[Diagram: a text ("...drive...used... ..broke...send...") is assigned to one of the topics CAR, COMPUTER, TRAVEL, COMMERCE]
What is Data Mining?
Data Mine -> Data Money