7/6/2003 ICME Tutorial, Baltimore 1
Statistical Methods for Learning Multimedia Semantics
Edward Chang
Associate Professor, Electrical Engineering, UC Santa Barbara
CTO, VIMA Technologies
7/6/2003 ICME Tutorial, Baltimore 2
Outline
- Statistical Learning
- Multimedia Applications' Data Characteristics
- Classical Models
- Kernel Methods
  - Linear Model View
  - Nearest Neighbor View
  - Geometric View
- Dimension Reduction Methods
7/6/2003 ICME Tutorial, Baltimore 3
Statistical Learning
- Program the computers to learn!
- Computers improve performance with experience at some task
- Example:
  - Task: classify images
  - Performance: prediction accuracy
  - Experience: labeled images
7/6/2003 ICME Tutorial, Baltimore 4
Definition
- X: data pool
- U: unlabeled pool; L: labeled pool
- G: labels
  - Regression: G → R
  - Classification: G → {+1, -1}
- H: learning algorithm
7/6/2003 ICME Tutorial, Baltimore 5
Statistical Learning
- Experience
  - Characterized by training data L
- Training
  - f = H(L)
- Task (e.g., prediction)
  - ŷ = f(u), u ∈ U
- Performance
  - Measured by some error function, e.g., maximizing y·f(u)
7/6/2003 ICME Tutorial, Baltimore 6
Learning Algorithms (H)
- Linear Regression
- K-NN
- Bayesian Analysis
- Neural Networks
- Decision Trees
- Kernel Methods
- Etc.
7/6/2003 ICME Tutorial, Baltimore 7
- H: having a hypothesis space, find the "best" hypothesis based on the training data (L) efficiently
- Best solution
  - Fitting L well? Predicting U accurately!
- Efficiency
  - Computational complexity and resource requirements
7/6/2003 ICME Tutorial, Baltimore 8
Classical Model [Donoho 2000]
- N: number of training instances (N = |L|)
  - N+, N-: number of positive / negative instances
- D: dimensionality
- N >> D, N → ∞
  - E.g., PAC learnability
- N- ≈ N+
7/6/2003 ICME Tutorial, Baltimore 9
Emerging MM Applications
- N < D
- N+ << N-
- Examples
  - Information retrieval with relevance feedback
  - K-class classification
    - Image classification
    - Gene profiling
7/6/2003 ICME Tutorial, Baltimore 10
Gene Profiling Example
N = 59 cases, D = 4026 genes
7/6/2003 ICME Tutorial, Baltimore 11
Image Retrieval Demo
- N < D
  - N < 50, D = 150
- N+ << N-
- ACM SIGMOD 01; ACM MM 01, 02; IEEE CVPR 03
- Also see my Web site
7/6/2003 ICME Tutorial, Baltimore 12
SVMactive
7/6/2003 ICME Tutorial, Baltimore 13
SVMactive
7/6/2003 ICME Tutorial, Baltimore 14
SVMactive
7/6/2003 ICME Tutorial, Baltimore 15
SVMactive
7/6/2003 ICME Tutorial, Baltimore 16
Ranking
7/6/2003 ICME Tutorial, Baltimore 17
Solution Summary
- N < D
  - ACM MM 2001 (SVM Active): make each u in U most informative
  - PCM 2002, ICIP 2003: increase N- through co-training
  - ACM MM 2002 (DPF): reduce D
- N+ << N-
  - ACM MM 2003, ICML 2003: conformal transformation, kernel boundary alignment
7/6/2003 ICME Tutorial, Baltimore 18
Outline
- Statistical Learning
- MM Applications' Data Characteristics
- Classical Models (Classification)
- Kernel Methods
  - Linear Model View
  - Nearest Neighbor View
  - Geometric View
- Dimension Reduction Methods
7/6/2003 ICME Tutorial, Baltimore 19
Classical Methods
- Linear Model
  - Least Square
  - Maximum Likelihood
  - Naïve Bayesian
  - LDA
  - Maximum Margin Hyperplane
- Nearest Neighbor
7/6/2003 ICME Tutorial, Baltimore 20
Linear Regression
7/6/2003 ICME Tutorial, Baltimore 21
Least Square
Y = β0 + Σj=1..D βj Xj, i.e., Y = Xᵀβ
RSS(β) = (Y − Xβ)ᵀ(Y − Xβ)
RSS: Residual Sum of Squares
β = (XᵀX)⁻¹ XᵀY
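A minimal NumPy sketch of the closed-form estimate above, on synthetic data (names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
X = np.hstack([np.ones((N, 1)), X])           # leading 1-column so beta[0] plays the role of beta_0
true_beta = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ true_beta + 0.1 * rng.normal(size=N)  # noisy linear responses

# beta = (X^T X)^{-1} X^T y, computed via solve() for numerical stability
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                               # close to true_beta
```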
7/6/2003 ICME Tutorial, Baltimore 22
Maximum Likelihood
Y = β0 + Σj=1..D βj Xj, i.e., Y = Xᵀβ
With noise: Y = Xᵀβ + ε
- ε (noise) terms are independent, ε ~ N(0, σ²)
- P(y | βx) has a normal distribution with
  - mean βx
  - variance σ²
7/6/2003 ICME Tutorial, Baltimore 23
Maximum Likelihood
P(y | βx) ~ N(βx, σ²)
Training
- Given (x1, y1), (x2, y2), …, (xn, yn)
- Infer P(β | x1, x2, …, xn, y1, y2, …, yn)
  - by Bayes rule, or
  - by Maximum Likelihood Estimation
7/6/2003 ICME Tutorial, Baltimore 24
Maximum Likelihood
For what β is
- P(y1, y2, …, yn | x1, x2, …, xn, β) maximized?
- Π P(yi | βxi) maximized?
- Π exp(−½((yi − βxi)/σ)²) maximized?
- Σ −½((yi − βxi)/σ)² maximized?
- Σ (yi − βxi)² minimized?
7/6/2003 ICME Tutorial, Baltimore 25
Least Square Linear Model
- Solution Method #1
  - RSS(β) = (Y − Xβ)ᵀ(Y − Xβ)
  - β = (XᵀX)⁻¹ XᵀY
- Solution Method #2 (for D > N)
  - Gradient descent
  - Perceptron
7/6/2003 ICME Tutorial, Baltimore 26
Other Linear Models
- LDA
  - Find the projection direction which minimizes the overlap of the two Gaussian distributions
- Separating Hyperplane
7/6/2003 ICME Tutorial, Baltimore 27
LDA
7/6/2003 ICME Tutorial, Baltimore 28
7/6/2003 ICME Tutorial, Baltimore 29
Separating Hyperplane
7/6/2003 ICME Tutorial, Baltimore 30
Separating Hyperplane
7/6/2003 ICME Tutorial, Baltimore 31
Maximum Margin Hyperplane
Only support vectors are involved in class prediction!
7/6/2003 ICME Tutorial, Baltimore 32
Linear Models
- N ≥ D
  - Least Square
  - LDA
- D > N
  - Perceptron (using gradient descent)
  - Maximum Margin Hyperplane
- Generative vs. Discriminative Model
7/6/2003 ICME Tutorial, Baltimore 33
Linear Model Fits All Data?
7/6/2003 ICME Tutorial, Baltimore 34
How about Joining the Dots?
Y(x) = (1/k) Σ yi,  xi ∈ Nk(x)
k = 1
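A minimal sketch of this "join the dots" estimator, assuming Euclidean distance and plain NumPy:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Average the labels of the k nearest training points of x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()   # sign() of this gives a +1/-1 class decision

# toy usage
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([-1.0, 1.0, 1.0])
print(knn_predict(X_train, y_train, np.array([0.9, 1.1]), k=1))   # -> 1.0
```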
7/6/2003 ICME Tutorial, Baltimore 35
Linear Model Fits All?
7/6/2003 ICME Tutorial, Baltimore 36
NN with k = 1
7/6/2003 ICME Tutorial, Baltimore 37
Nearest Neighbor
Four Things Make a Memory-Based Learner
1. A distance function?
2. K: the number of neighbors to consider?
3. A weighting function (optional)?
4. How to fit with the local points?
7/6/2003 ICME Tutorial, Baltimore 38
Problems of K=1
- Fitting noise
- Jagged boundaries
7/6/2003 ICME Tutorial, Baltimore 39
Solutions
- Fitting noise
  - Pick a larger K?
7/6/2003 ICME Tutorial, Baltimore 40
NN with k = 15
7/6/2003 ICME Tutorial, Baltimore 41
NN
7/6/2003 ICME Tutorial, Baltimore 42
Solutions
- Fitting noise
  - Pick a larger K?
- Jagged boundaries
  - Introduce a kernel as a weighting function
7/6/2003 ICME Tutorial, Baltimore 43
Nearest Neighbor → Kernel Method
Four Things Make a Memory-Based Learner
1. A distance function
2. K: the number of neighbors to consider? All
3. A weighting function: RBF kernels
4. How to fit with the local points? Predict with the weights
7/6/2003 ICME Tutorial, Baltimore 44
Kernel Method
- RBF weighting function
  - Kernel width holds the key
    - Implies K
  - Use cross-validation to find the "optimal" width
- Fitting with the local points
  - Where NN meets the Linear Model
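A minimal sketch of the idea, assuming Gaussian RBF weights over all training points (a Nadaraya-Watson-style weighted average); the width sigma is the knob cross-validation would tune:

```python
import numpy as np

def rbf_weighted_predict(X_train, y_train, x, sigma=1.0):
    """Weight every training label by an RBF kernel of its distance to x (K = 'all')."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))   # small sigma ~ small effective K, large sigma ~ large K
    return np.dot(w, y_train) / np.sum(w)  # kernel-weighted average; sign() gives the class

X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([-1.0, 1.0, 1.0])
print(rbf_weighted_predict(X_train, y_train, np.array([0.8]), sigma=0.5))
```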
7/6/2003 ICME Tutorial, Baltimore 45
LM vs. NN
- Linear Model
  - f(x) is approximated by a global linear function
  - More stable, less flexible
- Nearest Neighbor
  - K-NN assumes f(x) is well approximated by a locally constant function
  - Less stable, more flexible
- Between LM and NN
  - The other models…
7/6/2003 ICME Tutorial, Baltimore 46
Decision Theories
- Bias & Variance Tradeoff
- Bayes Prediction
- VC Dimensionality
- PAC Learnability
7/6/2003 ICME Tutorial, Baltimore 47
Variance vs. Bias
MSE(x) = E_T[f(x) − ŷ]²
       = E_T[ŷ − E_T(ŷ)]² + [E_T(ŷ) − f(x)]²
Error  = Var_T(ŷ) + Bias²(ŷ)
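The step left implicit above (a worked expansion, assuming ŷ is the only quantity that depends on the training set T): add and subtract E_T(ŷ), and the cross term vanishes because E_T[ŷ − E_T(ŷ)] = 0.

```latex
\begin{aligned}
\mathrm{MSE}(x) &= E_T\!\left[(\hat{y} - f(x))^2\right]
  = E_T\!\left[(\hat{y} - E_T\hat{y} + E_T\hat{y} - f(x))^2\right] \\
&= E_T\!\left[(\hat{y} - E_T\hat{y})^2\right]
  + 2\,(E_T\hat{y} - f(x))\,\underbrace{E_T\!\left[\hat{y} - E_T\hat{y}\right]}_{=\,0}
  + (E_T\hat{y} - f(x))^2 \\
&= \mathrm{Var}_T(\hat{y}) + \mathrm{Bias}^2(\hat{y}).
\end{aligned}
```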
7/6/2003 ICME Tutorial, Baltimore 48
Variance vs. Bias
7/6/2003 ICME Tutorial, Baltimore 49
Outline
- Statistical Learning
- Emerging Applications' Data Characteristics
- Classical Models (Classification)
- Kernel Methods
- Dimension Reduction Methods
7/6/2003 ICME Tutorial, Baltimore 50
Where Are We and Where Am I Heading To?
- LM and NN
- Kernel Methods: Three Views
  - LM view
  - NN view
  - Geometric view
7/6/2003 ICME Tutorial, Baltimore 51
Linear Model View
Y = β0 + Σ β X
Separating Hyperplane
  max_{||β||=1} C
  subject to yi f(xi) ≥ C, or
             yi (β0 + β·xi) ≥ C
7/6/2003 ICME Tutorial, Baltimore 52
Classifier Margin
- Margin
  - Defined as the width of the boundary before hitting a data object
- Maximum Margin
  - Tends to minimize classification variance
  - No formal theory for this yet
7/6/2003 ICME Tutorial, Baltimore 53
Separating Hyperplane
7/6/2003 ICME Tutorial, Baltimore 54
M’s Mathematical Representation
- Plus-plane: {x: w·x + b = +1}
- Minus-plane: {x: w·x + b = -1}
- w ⊥ plus-plane
  - w·(u − v) = 0 if u and v are on the plus-plane
- w ⊥ minus-plane
7/6/2003 ICME Tutorial, Baltimore 55
Separating Hyperplane
7/6/2003 ICME Tutorial, Baltimore 56
M
- Let x- be any point on the minus-plane
- Let x+ be the closest plus-plane point to x-
- x+ = x- + λw. Why? The line (x+, x-) ⊥ minus-plane
- M = |x+ - x-|
7/6/2003 ICME Tutorial, Baltimore 57
M
1. w·x- + b = -1
2. w·x+ + b = 1
3. x+ = x- + λw
4. M = |x+ - x-|
5. w·(x- + λw) + b = 1   (from 2 & 3)
6. w·x- + b + λw·w = 1
7. λw·w = 2
7/6/2003 ICME Tutorial, Baltimore 58
M
1. λw·w = 2
2. λ = 2/(w·w)
3. M = |x+ - x-| = |λw| = λ|w| = 2/|w|
4. Max M
   Gradient descent, simulated annealing, EM, Newton's method…
7/6/2003 ICME Tutorial, Baltimore 59
Max M
Max M = 2/|w|  ⇔  Min |w|/2  ⇔  Min |w|²/2
subject to yi(xi·w + b) ≥ 1, i = 1, …, N
Quadratic criterion with linear inequality constraints
7/6/2003 ICME Tutorial, Baltimore 60
Max M
Min |w|²/2
subject to yi(xi·w + b) ≥ 1, i = 1, …, N

Lp = min_{w,b} |w|²/2 − Σi=1..N αi [yi(xi·w + b) − 1]
Setting the derivatives to zero:
w = Σi=1..N αi yi xi
0 = Σi=1..N αi yi
7/6/2003 ICME Tutorial, Baltimore 61
Wolfe Dual
Ld = Σi=1..N αi − ½ Σi,j=1..N αi αj yi yj xi·xj
Subject to
- αi ≥ 0
- αi [yi(xi·w + b) − 1] = 0 (KKT conditions)
  - αi > 0 ⇒ yi(xi·w + b) = 1 (support vectors)
  - αi = 0 ⇒ yi(xi·w + b) > 1
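For concreteness, a small sketch using scikit-learn's SVC as a stand-in solver (a very large C approximates the hard-margin problem above); the nonzero dual coefficients αi·yi it exposes correspond exactly to the support vectors singled out by the KKT conditions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(20, 2)),
               rng.normal(loc=+2.0, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

print(clf.support_)                  # indices i with alpha_i > 0: the support vectors
print(clf.dual_coef_)                # the values alpha_i * y_i for those points
print(clf.coef_, clf.intercept_)     # w = sum_i alpha_i y_i x_i and b
```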
7/6/2003 ICME Tutorial, Baltimore 62
Class Prediction
yq = w·xq + b
w = Σi=1..N αi yi xi
yq = sign(Σi=1..N αi yi (xi·xq) + b)
7/6/2003 ICME Tutorial, Baltimore 63
Non-separable Classes
- Soft Margin Hyperplane
- Basis Expansion
7/6/2003 ICME Tutorial, Baltimore 64
Non-separable Case
7/6/2003 ICME Tutorial, Baltimore 65
Soft Margin SVMs
Min |w|²/2
subject to yi(xi·w + b) ≥ 1, i = 1, …, N

becomes

Min |w|²/2 + C Σ εi
subject to
- xi·w + b ≥ +1 − εi if yi = +1
- xi·w + b ≤ −1 + εi if yi = -1
- εi ≥ 0
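A small sketch of the C trade-off, again assuming scikit-learn as the solver and synthetic overlapping classes (so the slack variables εi are actually needed):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.5, size=(50, 2)),
               rng.normal(+1.0, 1.5, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {len(clf.support_)} support vectors")  # smaller C -> more slack, more SVs
```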
7/6/2003 ICME Tutorial, Baltimore 66
Non-separable Case
7/6/2003 ICME Tutorial, Baltimore 67
Wolfe Dual
Ld = Σi=1..N αi − ½ Σi,j=1..N αi αj yi yj xi·xj
Subject to
- C ≥ αi ≥ 0
- Σ αi yi = 0
- KKT conditions

yq = sign(Σi=1..N αi yi (xi·xq) + b)
7/6/2003 ICME Tutorial, Baltimore 68
Basis Function
7/6/2003 ICME Tutorial, Baltimore 69
Harder 1D Example
7/6/2003 ICME Tutorial, Baltimore 70
Basis Function
Φ(x) = (x, x²)
7/6/2003 ICME Tutorial, Baltimore 71
Harder 1D Example
7/6/2003 ICME Tutorial, Baltimore 72
Some Basis Functions
Φ(X) = Σm γm hm(X),  hm: Rᵖ → R
Common functions:
- Polynomial
- Radial basis functions
- Sigmoid functions
7/6/2003 ICME Tutorial, Baltimore 73
Kernel Function
Ld = Σi=1..N αi − ½ Σi,j=1..N αi αj yi yj Φ(xi)·Φ(xj)
Subject to
- C ≥ αi ≥ 0
- Σ αi yi = 0
- KKT conditions

yq = sign(Σi=1..N αi yi (Φ(xi)·Φ(xq)) + b)
K(xi, xj) = Φ(xi)·Φ(xj) — the kernel function!
7/6/2003 ICME Tutorial, Baltimore 74
Quadratic Basis Functions
Φ(a) = {1, ai, ai·aj}, i, j = 1..D
- (D+1)(D+2)/2 ≈ D² terms
- O(D²) computational cost per dot product
It is equivalent to (a·b + 1)²
- O(D) computational cost per dot product
Total computational cost: O(N²D)
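A quick numerical check of this equivalence (one caveat: the explicit map needs a √2 factor on the linear terms for exact equality):

```python
import numpy as np

def phi(a):
    """Explicit quadratic features whose dot product equals (a.b + 1)**2."""
    return np.concatenate([[1.0],
                           np.sqrt(2.0) * a,           # D linear terms (sqrt(2)-scaled)
                           np.outer(a, a).ravel()])    # D^2 pairwise products a_i * a_j

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)
print(np.dot(phi(a), phi(b)))       # O(D^2) work in the explicit feature space
print((np.dot(a, b) + 1.0) ** 2)    # the same number with O(D) work via the kernel
```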
7/6/2003 ICME Tutorial, Baltimore 75
Dot Product Saves the Day
- With the dot-product (kernel) form: O(N²D), for any polynomial degree
- With explicit basis functions:
  - Quadratic: O(N²D²)
  - Cubic: O(N²D³)
  - Quartic: O(N²D⁴)
7/6/2003 ICME Tutorial, Baltimore 76
Quiz
What is a degree-d polynomial kernel function's signature?
(a·b + 1)^d
7/6/2003 ICME Tutorial, Baltimore 77
Outline
- LM and NN
- Kernel Methods: Three Views
  - LM view
  - NN view
  - Geometric view
7/6/2003 ICME Tutorial, Baltimore 78
Nearest Neighbor View
Z, a set of zero mean jointly Gaussian random variables,
Each Zi corresponds to one example Xi
Cov(zi, zj) = K(xi, xj)
yi, the label of zi, is +1 or -1
P(yi | zi) = σ(yi zi)
7/6/2003 ICME Tutorial, Baltimore 79
Training Data
7/6/2003 ICME Tutorial, Baltimore 80
General Kernel Classifier [Jaakkola et al. 99]
MAP Classification for xt
- yt = sign(Σ αi yi K(xt, xi))
- K(xi, xj) = Cov(zi, zj) (some similarity function)
Supervised training: compute αi
- given X and y, and
- an error function such as J(α) = −½ Σ αi αj yi yj K(xi, xj) + Σ F(αi)
7/6/2003 ICME Tutorial, Baltimore 81
Leave One Out
7/6/2003 ICME Tutorial, Baltimore 82
SVMs
yt = sign(Σ αi yi K(xt, xi))
- (yi, xi): training data; αi nonnegative; kernel K positive definite
- αi is obtained by maximizing
  J(α) = −½ Σ αi αj yi yj K(xi, xj) + Σ F(αi), with F(αi) = αi
  subject to αi ≥ 0, Σ yi αi = 0
7/6/2003 ICME Tutorial, Baltimore 83
Important Insight
K(xi, xj) = Cov(zi, zj)
To design a kernel is to design a similarity function that produces a positive definite covariance matrix on the training instances
7/6/2003 ICME Tutorial, Baltimore 84
Basis Function Selection
Three general approaches
- Restriction methods
  - Limit the class of functions
- Selection methods
  - Scan the dictionary adaptively (Boosting)
- Regularization methods
  - Use the entire dictionary but restrict coefficients (Ridge Regression)
7/6/2003 ICME Tutorial, Baltimore 85
Overfitting?
Probably not, because
- N free parameters (not D)
- Maximizing the margin
7/6/2003 ICME Tutorial, Baltimore 86
Geometrical View
- S = w·X + b, with |w| = 1, b = 0
- Version space V = {w : yi f(xi) > 0, i = 1..n, |w| = 1}
- The SVM solution is the center of the largest sphere contained in V
7/6/2003 ICME Tutorial, Baltimore 87
SVMs
7/6/2003 ICME Tutorial, Baltimore 88
BPMs
- Bayes objective function
  Ŝt = Bayes_Z(Xt) = argmin_{Si ∈ S} E_{H|Z=x}[ l(H(x), Si) ]
- BPMs [Herbrich et al. 2001]
  A_bp = argmin_{h ∈ H} E_x[ E_{H|Z=x}[ l(H(x), h(x)) ] ]
7/6/2003 ICME Tutorial, Baltimore 89
BPMs
- Linear classifier
- Input X possesses a spherical Gaussian density
- BP is the center of mass of the version space
7/6/2003 ICME Tutorial, Baltimore 90
BPMs vs. SVMs
7/6/2003 ICME Tutorial, Baltimore 91
BPMs
- Use SVMs to find a good h in H
- Find the BP
  - Billiard Algorithm [Herbrich et al. 2001]
  - Perceptron Algorithm [Herbrich et al. 2001]
7/6/2003 ICME Tutorial, Baltimore 92
Billiard Ball Algorithm (R. Herbrich)
7/6/2003 ICME Tutorial, Baltimore 93
Outline
- Statistical Learning
- Emerging Applications' Data Characteristics
- Classical Models (Classification)
- Kernel Methods
- Dimension Reduction Methods
7/6/2003 ICME Tutorial, Baltimore 94
Similarity Measurement
7/6/2003 ICME Tutorial, Baltimore 95
Perceptual Distance Function
Two monumental challenges:
- Formulating a perceptual feature space
- Formulating a perceptual distance function
7/6/2003 ICME Tutorial, Baltimore 96
Dimensionality Curse
D: data dimension
When D increases:
- Nearest neighbors are not local
- All points are equally distanced
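A small simulation of the "equally distanced" effect, assuming uniform points in the unit hypercube: the gap between the nearest and farthest neighbor shrinks relative to the nearest distance as D grows.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
for D in (2, 10, 100, 1000):
    X = rng.uniform(size=(N, D))     # N uniform points in the unit hypercube
    q = rng.uniform(size=D)          # a query point
    d = np.linalg.norm(X - q, axis=1)
    contrast = (d.max() - d.min()) / d.min()
    print(f"D={D:>4}: relative distance contrast = {contrast:.3f}")
```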
7/6/2003 ICME Tutorial, Baltimore 97
Sparse High-D Space [C. Aggarwal et al., ICDT 2001]
Hyper-cube Range Queries
P[point falls in a hyper-cube query with side s] = s^d
7/6/2003 ICME Tutorial, Baltimore 98
7/6/2003 ICME Tutorial, Baltimore 99
Sparse High-D Space
Spherical Range Queries
7/6/2003 ICME Tutorial, Baltimore 100
P[R ∈ sp(Q, 0.5)] = (0.5)^d · π^(d/2) / Γ(d/2 + 1)
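Evaluating this formula with Python's math.gamma shows how quickly the inscribed sphere empties out:

```python
import math

def p_in_sphere(d, r=0.5):
    """Volume of a d-ball of radius r = P[uniform point in the unit cube lands inside it]."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

for d in (2, 3, 10, 20, 50):
    print(d, p_in_sphere(d))   # ~0.785, ~0.524, ~0.0025, ~2.5e-08, ~1.5e-28
```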
7/6/2003 ICME Tutorial, Baltimore 101
7/6/2003 ICME Tutorial, Baltimore 102
Dimensionality Curse
7/6/2003 ICME Tutorial, Baltimore 103
So?
Is the nearest-neighbor estimate cursed in high-D spaces?
Yes! When D is large and N is relatively small, the estimate is off!
7/6/2003 ICME Tutorial, Baltimore 104
Are We Doomed?
How does the curse affect classification?
- Similar objects tend to cluster together
- Classification makes a binary prediction
7/6/2003 ICME Tutorial, Baltimore 105
Distribution of Distances
7/6/2003 ICME Tutorial, Baltimore 106
Some Solutions to High-D
- Restricted estimators
  - Specifying the nature of the local neighborhood
- Adaptive feature reduction
  - PCA, LDA
- Dynamic Partial Function (DPF)
7/6/2003 ICME Tutorial, Baltimore 107
Three Major Paradigms
- Preserve data description in a lower-dimensional space
  - PCA
- Maximize discriminability in a lower-dimensional space
  - LDA
- Activate only similar channels
  - DPF
7/6/2003 ICME Tutorial, Baltimore 108
Minkowski Distance
Objects P and Q:
D(P, Q) = (Σi=1..M (pi − qi)ⁿ)^(1/n)
Similar images are similar in all M features
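A plain-NumPy sketch of the Minkowski distance (the slide's n is the usual order parameter; n = 2 is Euclidean, n = 1 is Manhattan):

```python
import numpy as np

def minkowski(p, q, n=2):
    """D = (sum_i |p_i - q_i|^n)^(1/n): every one of the M features contributes."""
    return np.sum(np.abs(p - q) ** n) ** (1.0 / n)

p, q = np.array([0.1, 0.4, 0.9]), np.array([0.2, 0.1, 0.5])
print(minkowski(p, q, n=2), minkowski(p, q, n=1))   # Euclidean and Manhattan distances
```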
7/6/2003 ICME Tutorial, Baltimore 109
[Figure: histograms of feature distances — Frequency (log scale) vs. Feature Distance]
7/6/2003 ICME Tutorial, Baltimore 110
Weighted Minkowski Distance
D(P, Q) = (Σi=1..M wi (pi − qi)ⁿ)^(1/n)
Similar images are similar in the same subset of the M features
7/6/2003 ICME Tutorial, Baltimore 111
[Figures: Average Distance vs. Feature Number under GIF conversion, scaling up/down, cropping, and rotation]
7/6/2003 ICME Tutorial, Baltimore 112
Similarity Theories
- Objects are similar in all respects (Richardson 1928)
- Objects are similar in some respects (Tversky 1977)
- Similarity is a process of determining respects, rather than using predefined respects (Goldstone 94)
7/6/2003 ICME Tutorial, Baltimore 113
DPF
Which place is similar to DC?
- Partial
- Dynamic
- Dynamic Partial Function (DPF)
- See ACM MM 2002, ICIP 2002, ACM MM Journal
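A sketch of the idea, assuming the DPF formulation in reference 5 (aggregate only the m smallest per-feature differences; the m and r values here are illustrative, not the paper's tuned settings):

```python
import numpy as np

def dpf(p, q, m=100, r=2):
    """Dynamic-partial-style distance: keep only the m smallest feature differences,
    so the 'respects' in which two images are compared are chosen per pair."""
    diffs = np.abs(p - q)
    activated = np.sort(diffs)[:m]            # the m most similar channels for this pair
    return np.sum(activated ** r) ** (1.0 / r)

p, q = np.random.default_rng(0).uniform(size=(2, 144))
print(dpf(p, q, m=100), dpf(p, q, m=144))     # m = all features recovers plain Minkowski
```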
7/6/2003 ICME Tutorial, Baltimore 114
Precision/Recall
7/6/2003 ICME Tutorial, Baltimore 115
Summary
- Statistical Learning
- Emerging Applications' Data Characteristics
- Classical Models (Classification)
- Kernel Methods
  - Linear Model View
  - Nearest Neighbor View
  - Geometric View
- Dimension Reduction Methods
7/6/2003 ICME Tutorial, Baltimore 116
Advanced Topics
- Imbalanced Data Learning
  - N- >> N+
  - See our ICML 2003 papers
- Sequence-data Kernel
- Kernel Alignment & Boosting
7/6/2003 ICME Tutorial, Baltimore 117
Useful Links
- Related publications: http://www-db.stanford.edu/~echang/
- Online demo: VIMA Technologies
  - Six deployments as of July 2003
  - www.vimatech.com
7/6/2003 ICME Tutorial, Baltimore 118
References
1. The Elements of Statistical Learning, T. Hastie, R. Tibshirani, and J. Friedman, Springer, N.Y., 2001
2. Machine Learning, T. Mitchell, 1997
3. High-dimensional Data Analysis, D. Donoho, American Math. Society Lecture, 2000
4. Support Vector Machine Active Learning for Image Retrieval, S. Tong and E. Chang, ACM MM, 2001
5. Dynamic Partial Function, B. Li and E. Chang, ACM Multimedia Journal, 2003
6. Pattern Discovery in Sequences under a Markov Assumption, D. Chudova and P. Smyth, ACM KDD 2002
7. Bayes Point Machines, R. Herbrich, T. Graepel, and C. Campbell, Journal of Machine Learning Research, 2001
8. The Nature of Statistical Learning Theory, V. Vapnik, Springer, N.Y., 1995
9. Probabilistic Kernel Regression Models, T. Jaakkola and D. Haussler, Conference on AI and Statistics, 1999
10. Support Vector Machines, Lecture Notes, A. Moore, CMU
11. On the Surprising Behavior of Distance Metrics in High-dimensional Space, C. Aggarwal, A. Hinneburg, and D. Keim, ICDT 2001
12. Adaptive Conformal Transformation for Learning Imbalanced Data, G. Wu and E. Chang, International Conference on Machine Learning, August 2003