Bell Laboratories
Data Complexity Analysis: Linkage between Context and Solution in Classification
Tin Kam Ho
With contributions from
Mitra Basu, Ester Bernado-Mansilla, Richard Baumgartner, Martin Law,Erinija Pranckeviciene, Albert Orriols-Puig, Nuria Macia
All Rights Reserved © Alcatel-Lucent 2008
Pattern Recognition: Research vs. Practice
Steps to solve a practical pattern recognition problem
[Diagram] Pipeline: Sensory Data → Data Collection → Feature Extraction → Feature Vectors → Classifier Training → Classifier → Classification → Decision

The study of the problem context is the practical focus; the study of the mathematical solution is the research focus. Between the two lies the danger of disconnection.
Reconnecting Context and Solution
Data Complexity Analysis: analysis of the properties of feature vectors, reconnecting the study of the problem context with the study of the mathematical solution:
• To understand how such properties may impact the classification solution (its expectations, improvements, and limitations)
• To understand how changes in the problem set-up and data collection procedures may affect such properties
Focus is on Boundary Complexity
• Kolmogorov complexity of the class boundary
• Boundary length can be exponential in dimensionality
• A trivial description is to list all points & class labels
• Is there a shorter description?
Early Discoveries
•Problems distribute in a continuum in complexity space
•Several key measures provide independent characterization
•There exist identifiable domains of classifiers' dominant competence
•Feature selection and transformation induce variability in complexity estimates
Parameterization of Data Complexity
Complexity Classes vs. Complexity Scales
•Study is driven by observed limits in classifier accuracy, even with new, sophisticated methods (e.g., ensembles, SVM, …)
•Analysis is needed for each instance of a classification problem, not just the worst case of a family of problems
•Linear separability: the earliest attempt to address classification complexity
•Observed in real-world problems: different degrees of linear non-separability
•Continuous scale is needed
Some Useful Measures of Geometric Complexity
Fisher's Discriminant Ratio
f = (μ1 - μ2)² / (σ1² + σ2²)
Classical measure of class separability
Maximize over all features to find the most discriminating
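A minimal sketch of this measure in Python (the function and toy data are my own, not the talk's code): compute the ratio per feature and take the maximum over features.

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-feature Fisher discriminant ratio f = (mu1 - mu2)^2 / (s1^2 + s2^2)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0)
    return num / den

# Toy data: classes separated along feature 0, overlapping on feature 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (100, 2)), rng.normal([5, 0], 1, (100, 2))])
y = np.repeat([0, 1], 100)
f = fisher_ratio(X, y)
most_discriminating = int(np.argmax(f))   # feature 0 in this toy example
```

A high maximum ratio over features signals an easy problem for a univariate split; a low maximum does not yet mean the problem is hard, only that no single feature separates the classes.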
Degree of Linear Separability
Find separating hyper-plane by linear programming
Error counts and distances to plane measure separability
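One way to sketch this measure in code (my own formulation, not the talk's implementation) is to minimize the total hinge slack with a linear program: a near-zero optimum means linearly separable, and larger optima grade the degree of non-separability.

```python
import numpy as np
from scipy.optimize import linprog

def total_lp_slack(X, y):
    """Minimize sum(t) s.t. s_i (w . x_i + b) >= 1 - t_i, t >= 0.
    A near-zero optimum means the classes are linearly separable;
    larger values indicate increasing degrees of non-separability."""
    n, d = X.shape
    s = np.where(y == 1, 1.0, -1.0)
    c = np.concatenate([np.zeros(d + 1), np.ones(n)])     # objective: sum of slacks
    # Row per point: -s_i x_i . w - s_i b - t_i <= -1
    A_ub = np.hstack([-s[:, None] * X, -s[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(None, None)] * (d + 1) + [(0, None)] * n   # w, b free; t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.fun

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
separable = total_lp_slack(X, np.array([0, 0, 1, 1]))   # classes split by x = 1
xor_like  = total_lp_slack(X, np.array([0, 1, 1, 0]))   # no separating line exists
```

The optimal slack plays the role of the "error counts and distances to plane" mentioned above: it is zero exactly when a separating hyperplane exists.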
Length of Class Boundary
Compute minimum spanning tree
Count class-crossing edges
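A sketch of the MST-based boundary measure using SciPy (variable names and toy data are mine):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def boundary_fraction(X, y):
    """Fraction of minimum-spanning-tree edges joining points of
    opposite classes; higher values suggest a longer class boundary."""
    mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()
    return float(np.mean(y[mst.row] != y[mst.col]))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
y = np.repeat([0, 1], 50)
easy = boundary_fraction(X, y)                        # two tight clusters
hard = boundary_fraction(X, rng.integers(0, 2, 100))  # random labels
```

For two tight, well-separated clusters only one of the n-1 tree edges crosses the boundary; under random labeling roughly half of them do, which is what places randomly labeled problems at the hard end of this measure.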
Shapes of Class Manifolds
Cover same-class pts with maximal balls
Ball counts describe shape of class manifold
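A rough greedy sketch of the ball-covering idea (a simplification with my own names; the actual measure in the literature uses maximal adherence subsets):

```python
import numpy as np
from scipy.spatial.distance import cdist

def covering_ball_count(X, y):
    """Greedily cover each class with balls containing only same-class
    points; each ball is grown until it would touch the other class.
    Fewer balls suggest a more compact class manifold."""
    counts = {}
    for c in np.unique(y):
        own, other = X[y == c], X[y != c]
        uncovered = np.ones(len(own), dtype=bool)
        n_balls = 0
        while uncovered.any():
            seed = np.flatnonzero(uncovered)[0]
            radius = cdist(own[seed:seed + 1], other).min()    # stop at opposite class
            covered = cdist(own[seed:seed + 1], own)[0] < radius
            uncovered &= ~covered
            uncovered[seed] = False    # always retire the seed point
            n_balls += 1
        counts[c] = n_balls
    return counts

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(10, 0.3, (30, 2))])
y = np.repeat([0, 1], 30)
counts = covering_ball_count(X, y)   # one ball per tight, well-separated cluster
```

An elongated or interleaved class needs many small balls, so the ball count (relative to the number of points) describes the shape of the class manifold.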
Continuous Distributions in Complexity Space

Real-World Data Sets: benchmarking data from the UC-Irvine archive; 844 two-class problems, of which 452 are linearly separable and 392 non-separable.

Synthetic Data Sets: random labeling of randomly located points; 100 problems in 1-100 dimensions.

[Scatter plot: Complexity Metric 1 vs. Metric 2, showing random labeling, linearly separable real-world data, and linearly non-separable real-world data]
Measures of Geometrical Complexity
The First 6 Principal Components
Interpretation of the First 4 PCs
PC 1: 50% of variance: Linearity of boundary and proximity of opposite class neighbor
PC 2: 12% of variance: Balance between within-class scatter and between-class distance
PC 3: 11% of variance: Concentration & orientation of intrusion into opposite class
PC 4: 9% of variance: Within-class scatter
Problem Distribution in 1st & 2nd Principal Components
• Continuous distribution
• Known easy & difficult problems occupy opposite ends
• Few outliers
• Empty regions
[Scatter plot annotated with "Random labels" and "Linearly separable" at opposite ends]
Apparent vs. True Complexity: Uncertainty in Measures due to Sampling Density
[Figure: the same problem sampled with 2, 10, 100, 500, and 1000 points]
Problem may appear deceptively simple or complex with small samples
Observations
•Problems distribute in a continuum in complexity space
•Several key measures/dimensions provide independent characterization
•Need further analysis on uncertainty in complexity estimates due to small sample size effects
Relating Classifier Behavior to Data Complexity
Class Boundaries Inferred by Different Classifiers
XCS: a genetic-algorithm-based classifier
Nearest neighbor classifier
Linear classifier
Accuracy Depends on the Goodness of Match between Classifiers and Problems
[Figure: on Problem A the two classifiers' errors are 0.06% vs. 1.9%; on Problem B they are 0.6% vs. 0.7%; XCS and NN each come out better on one of the two problems]
Domains of Competence of Classifiers
Given a classification problem, we want to determine which classifier is best for it.
Can data complexity give us a hint?
[Diagram: complexity space (Complexity metric 1 vs. Metric 2) partitioned into regions where NN, LC, XCS, or Decision Forest dominates; a new problem ("Here is my problem!") arrives at an unknown position]
Domain of Competence Experiment
Use a set of 9 complexity measures: Boundary, Pretop, IntraInter, NonLinNN, NonLinLP, Fisher, MaxEff, VolumeOverlap, Npts/Ndim

Characterize 392 two-class problems from UCI data, all shown to be linearly non-separable

Evaluate 6 classifiers:
• NN (1-nearest neighbor)
• LP (linear classifier by linear programming)
• Odt (oblique decision tree)
• Pdfc (random subspace decision forest; ensemble method)
• Bdfc (bagging-based decision forest; ensemble method)
• XCS (a genetic-algorithm based classifier; ensemble method)
Identifiable Domains of Competence by NN and LP
Best Classifier for Benchmarking Data
Less Identifiable Domains of Competence
Regions in complexity space where the best classifier is (nn, lp, or odt) vs. an ensemble technique
[Plots: Boundary vs. NonLinNN, IntraInter vs. Pretop, and MaxEff vs. VolumeOverlap, with markers for "ensemble" and "nn, lp, odt"]
Uncertainty of Estimates at Two Levels
Sparse training data in each problem & complex geometry cause ill-posedness of class boundaries
(uncertainty in feature space)
Sparse sample of problems causes difficulty in identifying regions of dominant competence
(uncertainty in complexity space)
Complexity and Data Dimensionality: Class Separability after Dimensionality Reduction

Feature selection/transformation may change the difficulty of a classification problem by:
• Widening the gap between classes
• Compressing the discriminatory information
• Removing irrelevant dimensions

It is often unclear to what extent these happen. We seek a quantitative description of such changes.

[Diagram: feature selection → discrimination]
Spread of classification accuracy and geometrical complexity due to forward feature selection

[Scatter plot: Boundary measure (x-axis, 10-90) vs. 1NN error (y-axis, 0-0.7) for forward-feature-selection subsets of several datasets: spectra1, colon, spectra2, eogat, ovarian, spectra3]
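To make the procedure concrete, here is a minimal forward-feature-selection loop scored by leave-one-out 1NN error (a sketch on made-up toy data, not the experiments behind the plot):

```python
import numpy as np

def loo_1nn_error(X, y):
    """Leave-one-out error rate of the 1-nearest-neighbour classifier."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)           # a point may not vote for itself
    return float(np.mean(y[D.argmin(axis=1)] != y))

def forward_feature_selection(X, y, k):
    """Greedily add the feature that most reduces the LOO 1NN error."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        errs = [loo_1nn_error(X[:, selected + [j]], y) for j in remaining]
        best = remaining[int(np.argmin(errs))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: feature 1 carries the class signal, features 0 and 2 are noise.
rng = np.random.default_rng(3)
y = np.repeat([0, 1], 40)
X = rng.normal(0, 1, (80, 3))
X[:, 1] += 6 * y                          # shift class 1 on feature 1 only
chosen = forward_feature_selection(X, y, 2)
```

Tracking a complexity measure such as Boundary alongside the error at each step of this loop yields exactly the kind of error-vs-complexity trajectory shown in the plot.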
Designing a Strategy for Classifier Evaluation
A Complete Platform for Evaluating Learning Algorithms
To facilitate progress on learning algorithms:
• Need a way to systematically create learning problems
• Provide a complete coverage of the complexity space
• Be representative of all known problems, i.e., every classification problem arising in the real world should have a close neighbor representing it in the complexity space.
Is this possible?
Ways to Synthesize Classification Problems
• Synthesizing data with targeted levels of complexity
• e.g. compute MST over a uniform point distribution, then assign class-crossing edges randomly [Macia et al. 2008]
• or, create partitions with increasing resolution
• can create continuous cover of complexity space
• but, are the data similar to those arising from reality?
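A sketch of the MST-based construction (my own reading of the idea attributed to Macia et al. 2008; the label-propagation scheme and all names are assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order

def synthesize_problem(n, dim, p_cross, seed=0):
    """Scatter points uniformly, build their MST, then walk the tree and
    flip the class label across each edge with probability p_cross, so
    the expected fraction of class-crossing MST edges is about p_cross."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, dim))
    mst = minimum_spanning_tree(squareform(pdist(X)))
    order, parent = breadth_first_order(mst, i_start=0, directed=False)
    y = np.zeros(n, dtype=int)
    for node in order[1:]:                       # root keeps label 0
        y[node] = y[parent[node]] ^ int(rng.random() < p_cross)
    return X, y

X0, y0 = synthesize_problem(200, 2, 0.0)            # no crossing edges: one class
X1, y1 = synthesize_problem(200, 2, 0.5, seed=1)    # roughly half the edges cross
```

Sweeping p_cross from 0 to 1 moves the synthesized problems from trivially easy toward random labeling, giving a dial over one axis of the complexity space.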
Ways to Synthesize Classification Problems
• Synthesizing data to simulate natural processes
  • e.g. Neyman-Scott process
  • how many such processes have explicit models?
  • how many are needed to cover all real-world problems?
• Systematically degrade real-world datasets
  • increase noise, reduce image resolution, …
Simplification of Class Geometry
Manifold Learning and Dimensionality Reduction
• Manifold learning techniques that highlight intrinsic dimensions
• But the class boundary may not follow the intrinsic dimensions
Manifold Learning and Dimensionality Reduction
• Supervised manifold learning: seek mappings that exaggerate class separation [de Ridder et al., 2003]
• Ideally, the mapping should be sought to directly minimize some measure of data complexity
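As a small illustration of a mapping chosen to optimize a complexity measure directly, the classical LDA direction maximizes the Fisher ratio in one dimension (a textbook sketch, not the method of de Ridder et al.; data and names are mine):

```python
import numpy as np

def fisher_direction(X, y):
    """1-D projection maximizing the Fisher ratio:
    w proportional to inv(Sw) (mu1 - mu0), with Sw the within-class scatter."""
    X0, X1 = X[y == 0], X[y == 1]
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    return np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1, (100, 2)), rng.normal([3, 3], 1, (100, 2))])
y = np.repeat([0, 1], 100)
w = fisher_direction(X, y)
z = X @ w                     # projected data; classes separate along z
```

The same template generalizes: replace the Fisher ratio with any differentiable (or search-friendly) complexity measure and optimize the mapping against it.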
Seeking Optimizations Upstream
Back to the application context:
• Use data complexity measures for guidance
• Change the setup and definition of the classification problem
• Collect more samples, in finer resolution, extract more features …
• Alternative representations: dissimilarity-based? [Pekalska & Duin 2005]
Data complexity gives an operational definition of learnability
Upstream optimization: formalize the intuition of seeking invariance, and systematically optimize the problem setup and data acquisition scenario to reduce data complexity
Recent Examples from the Internet
CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart
Also known as
• Reverse Turing Test
• Human Interactive Proofs
[von Ahn et al., CMU 2000]
Exploit limitations in accuracy of machine pattern recognition
The Netflix Challenge
• $1 Million Prize for the first team to improve 10% over the company’s own recommender system
• But is the goal achievable? Do the training data support such a possibility?
Amazon’s Mechanical Turk
• “Crowd-sourcing” tedious human intelligence (pattern recognition) tasks
• Which ones are doable by machines?
Conclusions
Summary
Automatic classification is useful, but can be very difficult. We know the key steps and many promising methods, but we have not fully understood how they work or what else is needed.
We found measures for geometric complexity that are useful to characterize difficulties of classification problems and classifier domains of competence.
Better understanding of how data and classifiers interact can guide practice, and re-establish the linkage between context and solution.
For the Future
Further progress in statistical and machine learning will need systematic, scientific evaluation of the algorithms with problems that are difficult for different reasons.
A “problem synthesizer” will be useful to provide a complete evaluation platform, and reveal the “blind spots” of current learning algorithms.
Rigorous statistical characterization of complexity estimates from limited training data will help gauge the uncertainty, and determine applicability of data complexity methods.