View
218
Download
0
Embed Size (px)
Citation preview
Apr. 26, 2010CMU
Density Ratio Estimation:A New Versatile Toolfor Machine Learning
Density Ratio Estimation:A New Versatile Toolfor Machine Learning
Department of Computer ScienceTokyo Institute of Technology
Masashi [email protected]
http://sugiyama-www.cs.titech.ac.jp/~sugi
2Overview of My Talk (1)Consider the ratio of two probability densities.
If the ratio is estimated, various machine learning problems can be solved!Non-stationarity adaptation, domain adaptation,
multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, two-sample test
3Overview of My Talk (2)
Estimating density ratio is substantially easier than estimating densities!
Various direct density ratio estimation methods have been developed recently.
Vapnik said: When solving a problem of interest,one should not solve a more general problem
as an intermediate step
Knowing densities Knowing ratio
4Organization of My Talk
1. Usage of Density Ratios:Covariate shift adaptationOutlier detectionMutual information estimationConditional probability estimation
2. Methods of Density Ratio Estimation
5Covariate Shift Adaptation
Trainingsamples
Testsamples
Function
Targetfunction
Training/test input distributions are different, but target function remains unchanged:
“extrapolation”
Input density
Shimodaira (JSPI2000)
6 Adaptation Using Density Ratios Ordinary least-squares
is not consistent.
Applicable to any likelihood-based methods!
Density-ratio weighted least-squares is consistent.
7Model SelectionControlling bias-variance trade-off is importa
nt in covariate shift adaptation.No weighting: low-variance, high-biasDensity ratio weighting: low-bias, high-variance
Density ratio weighting plays an important role for unbiased model selection:Akaike information criterion (regular models)
Subspace information criterion (linear models)
Cross-validation (arbitrary models)
Shimodaira (JSPI2000)
Sugiyama & Müller (Stat&Dec.2005)
Sugiyama, Krauledat & Müller (JMLR2007)
8Active Learning
Covariate shift naturally occurs in active learning scenarios: is designed by users. is determined by environment.
Density ratio weighting plays an important role for low-bias active learning.Population-based methods:
Pool-based methods:
Wiens (JSPI2000)Kanamori & Shimodaira (JSPI2003)
Sugiyama, (JMLR2006)
Kanamori, (NeuroComp2007),Sugiyama & Nakajima (MLJ2009)
9Real-world ApplicationsAge prediction from faces (with NECsoft):
Illumination and angle change (KLS&CV)
Speaker identification (with Yamaha):Voice quality change (KLR&CV)
Brain-computer interface:Mental condition change (LDA&CV)
Ueki, Sugiyama & Ihara (ICPR2010)
Yamada, Sugiyama & Matsui (SP2010)
Sugiyama, Krauledat & Müller (JMLR2007)Li, Kambara, Koike & Sugiyama (IEEE-TBE2010)
10Real-world Applications (cont.)Japanese text segmentation (with IBM):
Domain adaptation from general conversation to medicine (CRF&CV)
Semiconductor wafer alignment (with Nikon):Optical marker allocation (AL)
Robot control:Dynamic motion update (KLS&CV)
Optimal exploration (AL)
Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2008)
Sugiyama & Nakajima (MLJ2009)
Akiyama, Hachiya, & Sugiyama (NN2010)
Hachiya, Akiyama, Sugiyama & Peters (NN2009) Hachiya, Peters & Sugiyama (ECML2009)
こんな・失敗・は・ご・愛敬・だ・よ・.
11Organization of My Talk
1. Usage of Density Ratios:Covariate shift adaptationOutlier detectionMutual information estimationConditional probability estimation
2. Methods of Density Ratio Estimation
12Inlier-based Outlier Detection
Test samples having different tendency from regular ones are regarded as outliers.
Outlier
Hido, Tsuboi, Kashima, Sugiyama & Kanamori (ICDM2008)Smola, Song & Teo (AISTATS2009)
13Real-world ApplicationsSteel plant diagnosis (with JFE Steel)
Printer roller quality control (with Canon)
Loan customer inspection (with IBM)
Sleep therapy from biometric data
Takimoto, Matsugu & Sugiyama (DMSS2009)
Hido, Tsuboi, Kashima, Sugiyama & Kanamori (KAIS2010)
Kawahara & Sugiyama (SDM2009)
14Organization of My Talk
1. Usage of Density Ratios:Covariate shift adaptationOutlier detectionMutual information estimationConditional probability estimation
2. Methods of Density Ratio Estimation
15Mutual Information EstimationMutual information (MI):
MI as an independence measure:
MI can be approximated using density ratio:
and arestatistically
independent
Suzuki, Sugiyama & Tanaka (ISIT2009)
16Usage of MI EstimatorIndependence between input and output:
Feature selectionSufficient dimension reduction
Independence between inputs:Independent component analysis
Independence between input and residual:Causal inference
x x’
yεFeature selection,
Sufficient dimension reduction
Independent component analysis
Causal inference
Suzuki & Sugiyama (AISTATS2010)
Yamada & Sugiyama (AAAI2010)
Suzuki, Sugiyama, Sese & Kanamori(BMC Bioinformatics 2009)
Suzuki & Sugiyama(ICA2009)
17Organization of My Talk
1. Usage of Density Ratios:Covariate shift adaptationOutlier detectionMutual information estimationConditional probability estimation
2. Methods of Density Ratio Estimation
18Conditional Probability Estimation
When is continuous:Conditional density
estimationMobile robots’
transition estimation
When is categorical:Probabilistic classification
Pattern
Class 1
Class 2
80%
20%Sugiyama (Submitted)
Sugiyama, Takeuchi, Suzuki, Kanamori, Hachiya & Okanohara (AISTATS2010)
19Organization of My Talk
1. Usage of Density Ratios
2. Methods of Density Ratio Estimation:A) Probabilistic Classification Approach
B) Moment matching Approach
C) Ratio matching Approach
D) Comparison
E) Dimensionality Reduction
20Density Ratio Estimation
Density ratios are shown to be versatile.In practice, however, the density ratio
should be estimated from data.
Naïve approach: Estimate two densities separately and take their ratio
21Vapnik’s Principle
Estimating density ratio is substantially easier than estimating densities!
We estimate density ratio without going through density estimation.
When solving a problem, don’t solvemore difficult problems as an intermediate step
Knowing densities Knowing ratio
22Organization of My Talk
1. Usage of Density Ratios
2. Methods of Density Ratio Estimation:A) Probabilistic Classification Approach
B) Moment Matching Approach
C) Ratio Matching Approach
D) Comparison
E) Dimensionality Reduction
23Probabilistic ClassificationSeparate numerator and denominator
samples by a probabilistic classifier:
Logistic regression achieves the minimum asymptotic variance when model is correct.
Qin (Biometrika1998)
24Organization of My Talk
1. Usage of Density Ratios
2. Methods of Density Ratio Estimation:A) Probabilistic Classification Approach
B) Moment Matching Approach
C) Ratio Matching Approach
D) Comparison
E) Dimensionality Reduction
25Moment Matching
Match moments of and :Ex) Matching the mean:
26Kernel Mean Matching
Matching a finite number of moments does not yield the true ratio even asymptotically.
Kernel mean matching: All the moments are efficiently matched using Gaussian RKHS:
Ratio values at samples can be estimated via quadratic program.
Variants for learning entire ratio function and other loss functions are also available.
Huang, Smola, Gretton, Borgwardt & Schölkopf (NIPS2006)
Kanamori, Suzuki & Sugiyama (arXiv2009)
27Organization of My Talk
1. Usage of Density Ratios
2. Methods of Density Ratio Estimation:A) Probabilistic Classification Approach
B) Moment Matching Approach
C) Ratio Matching Approachi. Kullback-Leibler Divergence
ii. Squared Distance
D) Comparison
E) Dimensionality Reduction
28Ratio Matching
Match a ratio model with true ratio under the Bregman divergence:
Its empirical approximation is given by
(Constant is ignored)
Sugiyama, Suzuki & Kanamori (RIMS2010)
29Organization of My Talk
1. Usage of Density Ratios
2. Methods of Density Ratio Estimation:A) Probabilistic Classification Approach
B) Moment Matching Approach
C) Ratio Matching Approachi. Kullback-Leibler Divergence
ii. Squared Distance
D) Comparison
E) Dimensionality Reduction
30Kullback-Leibler Divergence (KL)
Put . Then BR yields
For a linear model ,
minimization of KL is convex.
Sugiyama, Nakajima, Kashima, von Bünau & Kawanabe (NIPS2007)Nguyen, Wainwright & Jordan (NIPS2007)
(ex. Gauss kernel)
31Properties
Parametric case:
Learned parameter converge to the optimal value with order .
Non-parametric case:
Learned function converges to the optimal function with order slightly slower than (depending on the covering number or the bracketing entropy of the function class).
Nguyen, Wainwright & Jordan (NIPS2007)Sugiyama, Suzuki, Nakajima, Kashima, von Bünau & Kawanabe (AISM2008)
32Experiments Setup: d-dim. Gaussian with covariance identity and
Denominator: mean (0,0,0,…,0)Numerator: mean (1,0,0,…,0)
Ratio of kernel density estimators (RKDE):Estimate two densities separately and take ratio.Gaussian width is chosen by CV.
KL method:Estimate density ratio directly.Gaussian width is chosen by CV.
33
RKDE: “curse of dimensionality”KL: works well
RKDE
KLNor
mal
ized
MS
E
Dim.
Accuracy as a Function
of Input Dimensionality
34Variations
EM algorithms for KL withLog-linear models
Gaussian mixture models
Probabilistic PCA mixture models
are also available.
Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2009)
Yamada & Sugiyama (IEICE2009)
Yamada, Sugiyama, Wichern & Jaak (Submitted)
35Organization of My Talk
1. Usage of Density Ratios
2. Methods of Density Ratio Estimation:A) Probabilistic Classification Approach
B) Moment Matching Approach
C) Ratio Matching Approachi. Kullback-Leibler Divergence
ii. Squared Distance
D) Comparison
E) Dimensionality Reduction
36Squared Distance (SQ)
Put . Then BR yields
For a linear model ,
minimizer of SQ is analytic:
Computationally efficient!
Kanamori, Hido & Sugiyama (JMLR2009)
37Properties
Optimal parametric/non-parametric rates of convergence are achieved.
Leave-one-out cross-validation score can be computed analytically.
SQ has the smallest condition number.Analytic solution allows computation of the
derivative of mutual information estimator:Sufficient dimension reductionIndependent component analysisCausal inference
Kanamori, Suzuki & Sugiyama (arXiv2009)Kanamori, Hido & Sugiyama (JMLR2009)
38Organization of My Talk
1. Usage of Density Ratios
2. Methods of Density Ratio Estimation:A) Probabilistic Classification Approach
B) Moment Matching Approach
C) Ratio Matching Approach
D) Comparison
E) Dimensionality Reduction
39Density Ratio Estimation Methods
MethodDensity
estimationDomains of
de/nuModel
selectionComputation
time
RKDE Involved Could differ Possible Very fast
LR Free Same? Possible Slow
KMM Free Same? ??? Slow
KL Free Could differ Possible Slow
SQ Free Could differ Possible Fast
SQ would be preferable in practice.
40Organization of My Talk
1. Usage of Density Ratios
2. Methods of Density Ratio Estimation:A) Probabilistic Classification Approach
B) Moment Matching Approach
C) Ratio Matching Approach
D) Comparison
E) Dimensionality Reduction
41Direct Density Ratio Estimationwith Dimensionality Reduction (D3)
Directly density ratio estimation without going through density estimation is promising.
However, for high-dimensional data, density ratio estimation is still challenging.
We combine direct density ratio estimation with dimensionality reduction!
42Hetero-distributional Subspace (HS)Key assumption: and are
different only in a subspace (called HS).
This allows us to estimate the density ratio only within the low-dimensional HS!
: Full-rank and orthogonal
HS
43HS Search Based onSupervised Dimension
ReductionIn HS, and are different, so and would be separable.
We may use any supervised dimension reduction method for HS search.
Local Fisher discriminant analysis (LFDA):Can handle within-class multimodalityNo limitation on reduced dimensionsAnalytic solution available
Sugiyama (JMLR2007)
Sugiyama, Kawanabe & Chui (NN2010)
44Pros and Cons of SupervisedDimension Reduction
ApproachPros:SQ+LFDA gives analytic solution for D3.
Cons:Rather restrictive implicit assumption that
and are independent.
Separability does not always lead to HS (e.g., two overlapped, but different densities)
Separable HS
45Characterization of HS
HS is given as the maximizer of the Pearson divergence :
Thus, we can identify HS by finding maximizer of PE with respect to .
PE can be analytically approximated by SQ!
Sugiyama, Hara, von Bünau, Suzuki, Kanamori & Kawanabe (SDM2010)
46HS Search Gradient descent:Natural gradient descent:
Utilize the Stiefel-manifold structure for efficiency
Givens rotation:Choose a pair of axes and rotate in the subspace
Subspace rotation:Ignore rotation within subspaces for efficiency
Hetero-distributionalsubspace
Rotation acrossthe subspace
Rotation withinthe subspace
Amari (NeuralComp1998)
Plumbley (NeuroComp2005)
47
Samples (2d) True ratio (2d)
D3-SQ (2d)
Plain SQ (2d)
Experiments
Increasing dimensionality(by adding noisy dims)
Plain SQ
D3-SQ
48ConclusionsMany ML tasks can be solved via density rati
o estimation:Non-stationarity adaptation, domain adaptation,
multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, two-sample test
Directly estimating density ratios without going through density estimation is the key.LR, KMM, KL, SQ, and D3 .
49BooksQuiñonero-Candela, Sugiyama, Schwaighofe
r & Lawrence (Eds.), Dataset Shift in Machine Learning, MIT Press, 2009.
Sugiyama, von Bünau, Kawanabe & Müller, Covariate Shift Adaptation in Machine Learning, MIT Press (coming soon!)
Sugiyama, Suzuki & Kanamori, Density Ratio Estimation in Machine Learning, Cambridge University Press (coming soon!)
50The World of Density Ratios
Theoretical analysis:Convergence, Information criteria, Numerical stability
Density ratio estimation:Fundamental algorithms (LR, KMM, KL, SQ)
Large-scale, High-dimensionality, Stabilization, Robustification
Machine learning algorithms:Covariate shift adaptation, Outlier detection,
Conditional probability estimation, Mutual information estimation
Real-world applications:Brain-computer interface, Robot control, Image understanding,
Speech recognition, Natural language processing, Bioinformatics