7/6/2003 ICME Tutorial, Baltimore 1
Statistical Methods for Learning Multimedia Semantics
Edward Chang
Associate Professor, Electrical Engineering, UC Santa Barbara
CTO, VIMA Technologies
7/6/2003 ICME Tutorial, Baltimore 2
Outline
- Statistical Learning
- Multimedia Applications' Data Characteristics
- Classical Models
- Kernel Methods
  - Linear Model View
  - Nearest Neighbor View
  - Geometric View
- Dimension Reduction Methods
7/6/2003 ICME Tutorial, Baltimore 3
Statistical Learning
- Program the computers to learn!
- Computers improve performance with experience at some task
- Example:
  - Task: classify images
  - Performance: prediction accuracy
  - Experience: labeled images
7/6/2003 ICME Tutorial, Baltimore 4
Definition
- X: data pool
- U: unlabeled pool; L: labeled pool
- G: labels
  - Regression: G → R
  - Classification: G → {+1, -1}
- H: learning algorithm
7/6/2003 ICME Tutorial, Baltimore 5
Statistical Learning
- Experience
  - Characterized by training data L
- Training
  - f = H(L)
- Task (e.g., prediction)
  - ŷ = f(u), u ∈ U
- Performance
  - Measured by some error function, e.g., maximizing y·f(u)
7/6/2003 ICME Tutorial, Baltimore 6
Learning Algorithms (H)
- Linear Regression
- K-NN
- Bayesian Analysis
- Neural Networks
- Decision Trees
- Kernel Methods
- Etc.
7/6/2003 ICME Tutorial, Baltimore 7
- H: having a hypothesis space, find the "best" hypothesis based on the training data (L) efficiently
- Best solution
  - Fitting L well? Predicting U accurately!
- Efficiency
  - Computational complexity and resource requirements
7/6/2003 ICME Tutorial, Baltimore 8
Classical Model [Donoho 2000]
- N: number of training instances (N = |L|)
  - N+, N-: number of positive / negative instances
- D: dimensionality
- N >> D, N → ∞
  - E.g., PAC learnability
- N- ≈ N+
7/6/2003 ICME Tutorial, Baltimore 9
Emerging MM Applications
- N < D
- N+ << N-
- Examples
  - Information retrieval with relevance feedback
  - K-class classification
    - Image classification
    - Gene profiling
7/6/2003 ICME Tutorial, Baltimore 10
Gene Profiling Example
N = 59 cases, D = 4026 genes
7/6/2003 ICME Tutorial, Baltimore 11
Image Retrieval Demo
- N < D
  - N < 50, D = 150
- N+ << N-
- ACM SIGMOD 01; ACM MM 01, 02; IEEE CVPR 03
- Also see my Web site
7/6/2003 ICME Tutorial, Baltimore 12
SVMactive
7/6/2003 ICME Tutorial, Baltimore 13
SVMactive
7/6/2003 ICME Tutorial, Baltimore 14
SVMactive
7/6/2003 ICME Tutorial, Baltimore 15
SVMactive
7/6/2003 ICME Tutorial, Baltimore 16
Ranking
7/6/2003 ICME Tutorial, Baltimore 17
Solution Summary
- N < D
  - ACM MM 2001 (SVM Active): make each u in U most informative
  - PCM 2002, ICIP 2003: increase N- through co-training
  - ACM MM 2002 (DPF): reduce D
- N+ << N-
  - ACM MM 2003, ICML 2003: conformal transformation, kernel boundary alignment
7/6/2003 ICME Tutorial, Baltimore 18
Outline
- Statistical Learning
- MM Applications' Data Characteristics
- Classical Models (Classification)
- Kernel Methods
  - Linear Model View
  - Nearest Neighbor View
  - Geometric View
- Dimension Reduction Methods
7/6/2003 ICME Tutorial, Baltimore 19
Classical Methods
- Linear Model
  - Least Square
  - Maximum Likelihood
  - Naïve Bayesian
  - LDA
  - Maximum Margin Hyperplane
- Nearest Neighbor
7/6/2003 ICME Tutorial, Baltimore 20
Linear Regression
7/6/2003 ICME Tutorial, Baltimore 21
Least Square
Y = β0 + Σj=1..D βj Xj, i.e., Y = Xᵀβ
RSS(β) = (Y − Xβ)ᵀ(Y − Xβ)
RSS: Residual Sum of Squares
β = (XᵀX)⁻¹ XᵀY
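A minimal NumPy sketch of the closed-form estimate above, on synthetic data (names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
X = np.hstack([np.ones((N, 1)), X])           # leading 1-column so beta[0] plays the role of beta_0
true_beta = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ true_beta + 0.1 * rng.normal(size=N)  # noisy linear responses

# beta = (X^T X)^{-1} X^T y, computed via solve() for numerical stability
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                               # close to true_beta
```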
7/6/2003 ICME Tutorial, Baltimore 22
Maximum Likelihood
Y = β0 + Σj=1..D βj Xj, i.e., Y = Xᵀβ
With noise: Y = Xᵀβ + ε
- ε (noise) terms are independent, ε ~ N(0, σ²)
- P(y | βx) has a normal distribution with
  - mean βx
  - variance σ²
7/6/2003 ICME Tutorial, Baltimore 23
Maximum Likelihood
P(y | βx) ~ N(βx, σ²)
Training
- Given (x1, y1), (x2, y2), …, (xn, yn)
- Infer P(β | x1, x2, …, xn, y1, y2, …, yn)
  - by Bayes rule, or
  - by Maximum Likelihood Estimation
7/6/2003 ICME Tutorial, Baltimore 24
Maximum Likelihood
For what β is
- P(y1, y2, …, yn | x1, x2, …, xn, β) maximized?
- Π P(yi | βxi) maximized?
- Π exp(−½((yi − βxi)/σ)²) maximized?
- Σ −½((yi − βxi)/σ)² maximized?
- Σ (yi − βxi)² minimized?
7/6/2003 ICME Tutorial, Baltimore 25
Least Square Linear Model
- Solution Method #1
  - RSS(β) = (Y − Xβ)ᵀ(Y − Xβ)
  - β = (XᵀX)⁻¹ XᵀY
- Solution Method #2 (for D > N)
  - Gradient descent
  - Perceptron
7/6/2003 ICME Tutorial, Baltimore 26
Other Linear Models
- LDA
  - Find the projection direction which minimizes the overlap of the two Gaussian distributions
- Separating Hyperplane
7/6/2003 ICME Tutorial, Baltimore 27
LDA
7/6/2003 ICME Tutorial, Baltimore 28
7/6/2003 ICME Tutorial, Baltimore 29
Separating Hyperplane
7/6/2003 ICME Tutorial, Baltimore 30
Separating Hyperplane
7/6/2003 ICME Tutorial, Baltimore 31
Maximum Margin Hyperplane
Only support vectors are involved in class prediction!
7/6/2003 ICME Tutorial, Baltimore 32
Linear Models
- N ≥ D
  - Least Square
  - LDA
- D > N
  - Perceptron (using gradient descent)
  - Maximum Margin Hyperplane
- Generative vs. Discriminative Model
7/6/2003 ICME Tutorial, Baltimore 33
Linear Model Fits All Data?
7/6/2003 ICME Tutorial, Baltimore 34
How about Joining the Dots?
Y(x) = (1/k) Σ yi,  xi ∈ Nk(x)
k = 1
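A minimal sketch of this "join the dots" estimator, assuming Euclidean distance and plain NumPy:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Average the labels of the k nearest training points of x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()   # sign() of this gives a +1/-1 class decision

# toy usage
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([-1.0, 1.0, 1.0])
print(knn_predict(X_train, y_train, np.array([0.9, 1.1]), k=1))   # -> 1.0
```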
7/6/2003 ICME Tutorial, Baltimore 35
Linear Model Fits All?
7/6/2003 ICME Tutorial, Baltimore 36
NN with k = 1
7/6/2003 ICME Tutorial, Baltimore 37
Nearest Neighbor
Four Things Make a Memory-Based Learner
1. A distance function?
2. K: the number of neighbors to consider?
3. A weighting function (optional)?
4. How to fit with the local points?
7/6/2003 ICME Tutorial, Baltimore 38
Problems of K=1
- Fitting noise
- Jagged boundaries
7/6/2003 ICME Tutorial, Baltimore 39
Solutions
- Fitting noise
  - Pick a larger K?
7/6/2003 ICME Tutorial, Baltimore 40
NN with k = 15
7/6/2003 ICME Tutorial, Baltimore 41
NN
7/6/2003 ICME Tutorial, Baltimore 42
Solutions
- Fitting noise
  - Pick a larger K?
- Jagged boundaries
  - Introduce a kernel as a weighting function
7/6/2003 ICME Tutorial, Baltimore 43
Nearest Neighbor → Kernel Method
Four Things Make a Memory-Based Learner
1. A distance function
2. K: the number of neighbors to consider? All
3. A weighting function: RBF kernels
4. How to fit with the local points? Predict with the weights
7/6/2003 ICME Tutorial, Baltimore 44
Kernel Method
- RBF weighting function
  - Kernel width holds the key
    - Implies K
  - Use cross-validation to find the "optimal" width
- Fitting with the local points
  - Where NN meets the Linear Model
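A minimal sketch of the idea, assuming Gaussian RBF weights over all training points (a Nadaraya-Watson-style weighted average); the width sigma is the knob cross-validation would tune:

```python
import numpy as np

def rbf_weighted_predict(X_train, y_train, x, sigma=1.0):
    """Weight every training label by an RBF kernel of its distance to x (K = 'all')."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))   # small sigma ~ small effective K, large sigma ~ large K
    return np.dot(w, y_train) / np.sum(w)  # kernel-weighted average; sign() gives the class

X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([-1.0, 1.0, 1.0])
print(rbf_weighted_predict(X_train, y_train, np.array([0.8]), sigma=0.5))
```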
7/6/2003 ICME Tutorial, Baltimore 45
LM vs. NN
- Linear Model
  - f(x) is approximated by a global linear function
  - More stable, less flexible
- Nearest Neighbor
  - K-NN assumes f(x) is well approximated by a locally constant function
  - Less stable, more flexible
- Between LM and NN
  - The other models…
7/6/2003 ICME Tutorial, Baltimore 46
Decision Theories
- Bias & Variance Tradeoff
- Bayes Prediction
- VC Dimensionality
- PAC Learnability
7/6/2003 ICME Tutorial, Baltimore 47
Variance vs. Bias
MSE(x) = E_T[f(x) − ŷ]²
       = E_T[ŷ − E_T(ŷ)]² + [E_T(ŷ) − f(x)]²
Error  = Var_T(ŷ) + Bias²(ŷ)
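The step left implicit above (a worked expansion, assuming ŷ is the only quantity that depends on the training set T): add and subtract E_T(ŷ), and the cross term vanishes because E_T[ŷ − E_T(ŷ)] = 0.

```latex
\begin{aligned}
\mathrm{MSE}(x) &= E_T\!\left[(\hat{y} - f(x))^2\right]
  = E_T\!\left[(\hat{y} - E_T\hat{y} + E_T\hat{y} - f(x))^2\right] \\
&= E_T\!\left[(\hat{y} - E_T\hat{y})^2\right]
  + 2\,(E_T\hat{y} - f(x))\,\underbrace{E_T\!\left[\hat{y} - E_T\hat{y}\right]}_{=\,0}
  + (E_T\hat{y} - f(x))^2 \\
&= \mathrm{Var}_T(\hat{y}) + \mathrm{Bias}^2(\hat{y}).
\end{aligned}
```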
7/6/2003 ICME Tutorial, Baltimore 48
Variance vs. Bias
7/6/2003 ICME Tutorial, Baltimore 49
Outline
- Statistical Learning
- Emerging Applications' Data Characteristics
- Classical Models (Classification)
- Kernel Methods
- Dimension Reduction Methods
7/6/2003 ICME Tutorial, Baltimore 50
Where Are We and Where Am I Heading To?
- LM and NN
- Kernel Methods: Three Views
  - LM view
  - NN view
  - Geometric view
7/6/2003 ICME Tutorial, Baltimore 51
Linear Model View
Y = β0 + Σ β X
Separating Hyperplane
  max_{||β||=1} C
  subject to yi f(xi) ≥ C, or
             yi (β0 + β·xi) ≥ C
7/6/2003 ICME Tutorial, Baltimore 52
Classifier Margin
- Margin
  - Defined as the width of the boundary before hitting a data object
- Maximum Margin
  - Tends to minimize classification variance
  - No formal theory for this yet
7/6/2003 ICME Tutorial, Baltimore 53
Separating Hyperplane
7/6/2003 ICME Tutorial, Baltimore 54
M’s Mathematical Representation
- Plus-plane: {x: w·x + b = +1}
- Minus-plane: {x: w·x + b = -1}
- w ⊥ plus-plane
  - w·(u − v) = 0 if u and v are on the plus-plane
- w ⊥ minus-plane
7/6/2003 ICME Tutorial, Baltimore 55
Separating Hyperplane
7/6/2003 ICME Tutorial, Baltimore 56
M
- Let x- be any point on the minus-plane
- Let x+ be the closest plus-plane point to x-
- x+ = x- + λw. Why? The line (x+, x-) ⊥ minus-plane
- M = |x+ - x-|
7/6/2003 ICME Tutorial, Baltimore 57
M
1. w·x- + b = -1
2. w·x+ + b = 1
3. x+ = x- + λw
4. M = |x+ - x-|
5. w·(x- + λw) + b = 1   (from 2 & 3)
6. w·x- + b + λw·w = 1
7. λw·w = 2
7/6/2003 ICME Tutorial, Baltimore 58
M
1. λw·w = 2
2. λ = 2/(w·w)
3. M = |x+ - x-| = |λw| = λ|w| = 2/|w|
4. Max M
   Gradient descent, simulated annealing, EM, Newton's method…
7/6/2003 ICME Tutorial, Baltimore 59
Max M
Max M = 2/|w|  ⇔  Min |w|/2  ⇔  Min |w|²/2
subject to yi(xi·w + b) ≥ 1, i = 1, …, N
Quadratic criterion with linear inequality constraints
7/6/2003 ICME Tutorial, Baltimore 60
Max M
Min |w|²/2
subject to yi(xi·w + b) ≥ 1, i = 1, …, N

Lp = min_{w,b} |w|²/2 − Σi=1..N αi [yi(xi·w + b) − 1]
Setting the derivatives to zero:
w = Σi=1..N αi yi xi
0 = Σi=1..N αi yi
7/6/2003 ICME Tutorial, Baltimore 61
Wolfe Dual
Ld = Σi=1..N αi − ½ Σi,j=1..N αi αj yi yj xi·xj
Subject to
- αi ≥ 0
- αi [yi(xi·w + b) − 1] = 0 (KKT conditions)
  - αi > 0 ⇒ yi(xi·w + b) = 1 (support vectors)
  - αi = 0 ⇒ yi(xi·w + b) > 1
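For concreteness, a small sketch using scikit-learn's SVC as a stand-in solver (a very large C approximates the hard-margin problem above); the nonzero dual coefficients αi·yi it exposes correspond exactly to the support vectors singled out by the KKT conditions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(20, 2)),
               rng.normal(loc=+2.0, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

print(clf.support_)                  # indices i with alpha_i > 0: the support vectors
print(clf.dual_coef_)                # the values alpha_i * y_i for those points
print(clf.coef_, clf.intercept_)     # w = sum_i alpha_i y_i x_i and b
```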
7/6/2003 ICME Tutorial, Baltimore 62
Class Prediction
yq = w·xq + b
w = Σi=1..N αi yi xi
yq = sign(Σi=1..N αi yi (xi·xq) + b)
7/6/2003 ICME Tutorial, Baltimore 63
Non-separable Classes
- Soft Margin Hyperplane
- Basis Expansion
7/6/2003 ICME Tutorial, Baltimore 64
Non-separable Case
7/6/2003 ICME Tutorial, Baltimore 65
Soft Margin SVMs
Min |w|²/2
subject to yi(xi·w + b) ≥ 1, i = 1, …, N

becomes

Min |w|²/2 + C Σ εi
subject to
- xi·w + b ≥ +1 − εi if yi = +1
- xi·w + b ≤ −1 + εi if yi = -1
- εi ≥ 0
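A small sketch of the C trade-off, again assuming scikit-learn as the solver and synthetic overlapping classes (so the slack variables εi are actually needed):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.5, size=(50, 2)),
               rng.normal(+1.0, 1.5, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {len(clf.support_)} support vectors")  # smaller C -> more slack, more SVs
```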
7/6/2003 ICME Tutorial, Baltimore 66
Non-separable Case
7/6/2003 ICME Tutorial, Baltimore 67
Wolfe Dual
Ld = Σi=1..N αi − ½ Σi,j=1..N αi αj yi yj xi·xj
Subject to
- C ≥ αi ≥ 0
- Σ αi yi = 0
- KKT conditions

yq = sign(Σi=1..N αi yi (xi·xq) + b)
7/6/2003 ICME Tutorial, Baltimore 68
Basis Function
7/6/2003 ICME Tutorial, Baltimore 69
Harder 1D Example
7/6/2003 ICME Tutorial, Baltimore 70
Basis Function
Φ(x) = (x, x²)
7/6/2003 ICME Tutorial, Baltimore 71
Harder 1D Example
7/6/2003 ICME Tutorial, Baltimore 72
Some Basis Functions
Φ(X) = Σm γm hm(X),  hm: Rᵖ → R
Common functions:
- Polynomial
- Radial basis functions
- Sigmoid functions
7/6/2003 ICME Tutorial, Baltimore 73
Kernel Function
Ld = Σi=1..N αi − ½ Σi,j=1..N αi αj yi yj Φ(xi)·Φ(xj)
Subject to
- C ≥ αi ≥ 0
- Σ αi yi = 0
- KKT conditions

yq = sign(Σi=1..N αi yi (Φ(xi)·Φ(xq)) + b)
K(xi, xj) = Φ(xi)·Φ(xj) — the kernel function!
7/6/2003 ICME Tutorial, Baltimore 74
Quadratic Basis Functions
Φ(a) = {1, ai, ai·aj}, i, j = 1..D
- (D+1)(D+2)/2 ≈ D² terms
- O(D²) computational cost per dot product
It is equivalent to (a·b + 1)²
- O(D) computational cost per dot product
Total computational cost: O(N²D)
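A quick numerical check of this equivalence (one caveat: the explicit map needs a √2 factor on the linear terms for exact equality):

```python
import numpy as np

def phi(a):
    """Explicit quadratic features whose dot product equals (a.b + 1)**2."""
    return np.concatenate([[1.0],
                           np.sqrt(2.0) * a,           # D linear terms (sqrt(2)-scaled)
                           np.outer(a, a).ravel()])    # D^2 pairwise products a_i * a_j

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)
print(np.dot(phi(a), phi(b)))       # O(D^2) work in the explicit feature space
print((np.dot(a, b) + 1.0) ** 2)    # the same number with O(D) work via the kernel
```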
7/6/2003 ICME Tutorial, Baltimore 75
Dot Product Saves the Day
- With the dot-product (kernel) form: O(N²D), for any polynomial degree
- With explicit basis functions:
  - Quadratic: O(N²D²)
  - Cubic: O(N²D³)
  - Quartic: O(N²D⁴)
7/6/2003 ICME Tutorial, Baltimore 76
Quiz
What is a degree-d polynomial kernel function's signature?
(a·b + 1)^d
7/6/2003 ICME Tutorial, Baltimore 77
Outline
- LM and NN
- Kernel Methods: Three Views
  - LM view
  - NN view
  - Geometric view
7/6/2003 ICME Tutorial, Baltimore 78
Nearest Neighbor View
Z, a set of zero mean jointly Gaussian random variables,
Each Zi corresponds to one example Xi
Cov(zi, zj) = K(xi, xj)
yi, the label of zi, is +1 or -1
P(yi | zi) = σ(yi zi)
7/6/2003 ICME Tutorial, Baltimore 79
Training Data
7/6/2003 ICME Tutorial, Baltimore 80
General Kernel Classifier [Jaakkola et al. 99]
MAP Classification for xt
- yt = sign(Σ αi yi K(xt, xi))
- K(xi, xj) = Cov(zi, zj) (some similarity function)
Supervised training: compute αi
- given X and y, and
- an error function such as J(α) = −½ Σ αi αj yi yj K(xi, xj) + Σ F(αi)
7/6/2003 ICME Tutorial, Baltimore 81
Leave One Out
7/6/2003 ICME Tutorial, Baltimore 82
SVMs
yt = sign(Σ αi yi K(xt, xi))
- (yi, xi): training data; αi nonnegative; kernel K positive definite
- αi is obtained by maximizing
  J(α) = −½ Σ αi αj yi yj K(xi, xj) + Σ F(αi), with F(αi) = αi
  subject to αi ≥ 0, Σ yi αi = 0
7/6/2003 ICME Tutorial, Baltimore 83
Important Insight
K(xi, xj) = Cov(zi, zj)
To design a kernel is to design a similarity function that produces a positive definite covariance matrix on the training instances
7/6/2003 ICME Tutorial, Baltimore 84
Basis Function Selection
Three general approaches
- Restriction methods
  - Limit the class of functions
- Selection methods
  - Scan the dictionary adaptively (Boosting)
- Regularization methods
  - Use the entire dictionary but restrict coefficients (Ridge Regression)
7/6/2003 ICME Tutorial, Baltimore 85
Overfitting?
Probably not, because
- N free parameters (not D)
- Maximizing the margin
7/6/2003 ICME Tutorial, Baltimore 86
Geometrical View
- S = w·X + b, with |w| = 1, b = 0
- Version space V = {w : yi f(xi) > 0, i = 1..n, |w| = 1}
- The SVM solution is the center of the largest sphere contained in V
7/6/2003 ICME Tutorial, Baltimore 87
SVMs
7/6/2003 ICME Tutorial, Baltimore 88
BPMs
- Bayes objective function
  Ŝt = Bayes_Z(Xt) = argmin_{Si ∈ S} E_{H|Z=x}[ l(H(x), Si) ]
- BPMs [Herbrich et al. 2001]
  A_bp = argmin_{h ∈ H} E_x[ E_{H|Z=x}[ l(H(x), h(x)) ] ]
7/6/2003 ICME Tutorial, Baltimore 89
BPMs
- Linear classifier
- Input X possesses a spherical Gaussian density
- BP is the center of mass of the version space
7/6/2003 ICME Tutorial, Baltimore 90
BPMs vs. SVMs
7/6/2003 ICME Tutorial, Baltimore 91
BPMs
- Use SVMs to find a good h in H
- Find the BP
  - Billiard Algorithm [Herbrich et al. 2001]
  - Perceptron Algorithm [Herbrich et al. 2001]
7/6/2003 ICME Tutorial, Baltimore 92
Billiard Ball Algorithm (R. Herbrich)
7/6/2003 ICME Tutorial, Baltimore 93
Outline
- Statistical Learning
- Emerging Applications' Data Characteristics
- Classical Models (Classification)
- Kernel Methods
- Dimension Reduction Methods
7/6/2003 ICME Tutorial, Baltimore 94
Similarity Measurement
7/6/2003 ICME Tutorial, Baltimore 95
Perceptual Distance Function
Two monumental challenges:
- Formulating a perceptual feature space
- Formulating a perceptual distance function
7/6/2003 ICME Tutorial, Baltimore 96
Dimensionality Curse
D: data dimension
When D increases:
- Nearest neighbors are not local
- All points are equally distanced
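A small simulation of the "equally distanced" effect, assuming uniform points in the unit hypercube: the gap between the nearest and farthest neighbor shrinks relative to the nearest distance as D grows.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
for D in (2, 10, 100, 1000):
    X = rng.uniform(size=(N, D))     # N uniform points in the unit hypercube
    q = rng.uniform(size=D)          # a query point
    d = np.linalg.norm(X - q, axis=1)
    contrast = (d.max() - d.min()) / d.min()
    print(f"D={D:>4}: relative distance contrast = {contrast:.3f}")
```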
7/6/2003 ICME Tutorial, Baltimore 97
Sparse High-D Space [C. Aggarwal et al., ICDT 2001]
Hyper-cube Range Queries
P[point falls in a hyper-cube query with side s] = s^d
7/6/2003 ICME Tutorial, Baltimore 98
7/6/2003 ICME Tutorial, Baltimore 99
Sparse High-D Space
Spherical Range Queries
7/6/2003 ICME Tutorial, Baltimore 100
P[R ∈ sp(Q, 0.5)] = (0.5)^d · π^(d/2) / Γ(d/2 + 1)
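Evaluating this formula with Python's math.gamma shows how quickly the inscribed sphere empties out:

```python
import math

def p_in_sphere(d, r=0.5):
    """Volume of a d-ball of radius r = P[uniform point in the unit cube lands inside it]."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

for d in (2, 3, 10, 20, 50):
    print(d, p_in_sphere(d))   # ~0.785, ~0.524, ~0.0025, ~2.5e-08, ~1.5e-28
```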
7/6/2003 ICME Tutorial, Baltimore 101
7/6/2003 ICME Tutorial, Baltimore 102
Dimensionality Curse
7/6/2003 ICME Tutorial, Baltimore 103
So?
Is the nearest-neighbor estimate cursed in high-D spaces?
Yes! When D is large and N is relatively small, the estimate is off!
7/6/2003 ICME Tutorial, Baltimore 104
Are We Doomed?
How does the curse affect classification?
- Similar objects tend to cluster together
- Classification makes a binary prediction
7/6/2003 ICME Tutorial, Baltimore 105
Distribution of Distances
7/6/2003 ICME Tutorial, Baltimore 106
Some Solutions to High-D
- Restricted estimators
  - Specifying the nature of the local neighborhood
- Adaptive feature reduction
  - PCA, LDA
- Dynamic Partial Function (DPF)
7/6/2003 ICME Tutorial, Baltimore 107
Three Major Paradigms
- Preserve data description in a lower-dimensional space
  - PCA
- Maximize discriminability in a lower-dimensional space
  - LDA
- Activate only similar channels
  - DPF
7/6/2003 ICME Tutorial, Baltimore 108
Minkowski Distance
Objects P and Q:
D(P, Q) = (Σi=1..M (pi − qi)ⁿ)^(1/n)
Similar images are similar in all M features
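A plain-NumPy sketch of the Minkowski distance (the slide's n is the usual order parameter; n = 2 is Euclidean, n = 1 is Manhattan):

```python
import numpy as np

def minkowski(p, q, n=2):
    """D = (sum_i |p_i - q_i|^n)^(1/n): every one of the M features contributes."""
    return np.sum(np.abs(p - q) ** n) ** (1.0 / n)

p, q = np.array([0.1, 0.4, 0.9]), np.array([0.2, 0.1, 0.5])
print(minkowski(p, q, n=2), minkowski(p, q, n=1))   # Euclidean and Manhattan distances
```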
7/6/2003 ICME Tutorial, Baltimore 109
[Figure: histograms of feature distances — Frequency (log scale) vs. Feature Distance]
7/6/2003 ICME Tutorial, Baltimore 110
Weighted Minkowski Distance
D(P, Q) = (Σi=1..M wi (pi − qi)ⁿ)^(1/n)
Similar images are similar in the same subset of the M features
7/6/2003 ICME Tutorial, Baltimore 111
[Figures: Average Distance vs. Feature Number under GIF conversion, scaling up/down, cropping, and rotation]
7/6/2003 ICME Tutorial, Baltimore 112
Similarity Theories
- Objects are similar in all respects (Richardson 1928)
- Objects are similar in some respects (Tversky 1977)
- Similarity is a process of determining respects, rather than using predefined respects (Goldstone 94)
7/6/2003 ICME Tutorial, Baltimore 113
DPF
Which place is similar to DC?
- Partial
- Dynamic
- Dynamic Partial Function (DPF)
- See ACM MM 2002, ICIP 2002, ACM MM Journal
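A sketch of the idea, assuming the DPF formulation in reference 5 (aggregate only the m smallest per-feature differences; the m and r values here are illustrative, not the paper's tuned settings):

```python
import numpy as np

def dpf(p, q, m=100, r=2):
    """Dynamic-partial-style distance: keep only the m smallest feature differences,
    so the 'respects' in which two images are compared are chosen per pair."""
    diffs = np.abs(p - q)
    activated = np.sort(diffs)[:m]            # the m most similar channels for this pair
    return np.sum(activated ** r) ** (1.0 / r)

p, q = np.random.default_rng(0).uniform(size=(2, 144))
print(dpf(p, q, m=100), dpf(p, q, m=144))     # m = all features recovers plain Minkowski
```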
7/6/2003 ICME Tutorial, Baltimore 114
Precision/Recall
7/6/2003 ICME Tutorial, Baltimore 115
Summary
- Statistical Learning
- Emerging Applications' Data Characteristics
- Classical Models (Classification)
- Kernel Methods
  - Linear Model View
  - Nearest Neighbor View
  - Geometric View
- Dimension Reduction Methods
7/6/2003 ICME Tutorial, Baltimore 116
Advanced Topics
- Imbalanced Data Learning
  - N- >> N+
  - See our ICML 2003 papers
- Sequence-data Kernel
- Kernel Alignment & Boosting
7/6/2003 ICME Tutorial, Baltimore 117
Useful Links
- Related publications: http://www-db.stanford.edu/~echang/
- Online demo: VIMA Technologies
  - Six deployments as of July 2003
  - www.vimatech.com
7/6/2003 ICME Tutorial, Baltimore 118
References
1. The Elements of Statistical Learning, T. Hastie, R. Tibshirani, and J. Friedman, Springer, N.Y., 2001
2. Machine Learning, T. Mitchell, 1997
3. High-dimensional Data Analysis, D. Donoho, American Math. Society Lecture, 2000
4. Support Vector Machine Active Learning for Image Retrieval, S. Tong and E. Chang, ACM MM, 2001
5. Dynamic Partial Function, B. Li and E. Chang, ACM Multimedia Journal, 2003
6. Pattern Discovery in Sequences under a Markov Assumption, D. Chudova and P. Smyth, ACM KDD 2002
7. Bayes Point Machines, R. Herbrich, T. Graepel, and C. Campbell, Journal of Machine Learning Research, 2001
8. The Nature of Statistical Learning Theory, V. Vapnik, Springer, N.Y., 1995
9. Probabilistic Kernel Regression Models, T. Jaakkola and D. Haussler, Conference on AI and Statistics, 1999
10. Support Vector Machines, Lecture Notes, A. Moore, CMU
11. On the Surprising Behavior of Distance Metrics in High-dimensional Space, C. Aggarwal, A. Hinneburg, and D. Keim, ICDT 2001
12. Adaptive Conformal Transformation for Learning Imbalanced Data, G. Wu and E. Chang, International Conference on Machine Learning, August 2003