Kernels Between Distributions & Sets
Tony Jebara, Risi Kondor
Machine Learning Lab, Columbia University
June 2003

Kernels Between Distributions & Sets - Columbia …jebara/talks/snowbird03.pdf



Outline
Vectors vs. Sets of Vectors
Generative Models of Sets
Kernels on Sets
Kernels on Probability Models
Hellinger & Bhattacharyya
Probability Product Kernels:
  Exponential Family
  Bernoulli & Multinomial
  Gaussian
  Kernelized Gaussian
  Mixture Models
  Hidden Markov Models
  Bayesian Networks
  Sampling Methods


Non-Vector Data
Part 1: Generative Models on Sets
Part 2: Kernels on Sets & Distributions


Part 1:

Generative Models on Sets of Vectors

(...as opposed to Vectors)


Modeling Data as Vectors & Vectorization

Many learning methods expect datum as vector

But, vectorizing an object into a vector is dangerous
  Images: morph, rotate, translate, zoom…
  Audio: pitch changes, ambient acoustics…
  Video: motion, camera view, angles…
  Gene Data: proteins fold, insertions, deletions…
Want Invariance: factor out certain variations (translations)
Want Linearity: model desired variations (identity) linearly
But, above variations are highly nonlinear in vector representation, i.e. image translation


Alternative: Avoid Vectorizing, Use “Bag of Vectors”
Since vectorization is so nonlinear, avoid it from the outset
View a datum as a “Bag of Vectors” instead of a single Vector
i.e. grayscale image = Set of Vectors or Bag of Pixels (N pixels, each is a D=3 XYI tuple)
Why? Image vectorization dumps a DxN dimensional vector by lexicographic ordering. Over many images, the ordering stays constant and only intensities vary. All variation is captured only by intensity changes.
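A minimal sketch of the bag-of-pixels representation described above, assuming NumPy and a hypothetical helper name `image_to_bag` (neither appears in the slides). It turns a grayscale image into N tuples (x, y, intensity) and shows that shuffling the rows leaves the bag, viewed as a set, unchanged.

```python
import numpy as np

def image_to_bag(img):
    """Return an (N, 3) array of (x, y, intensity) tuples, one row per pixel."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]          # row (y) and column (x) index of each pixel
    return np.column_stack([xs.ravel(), ys.ravel(), img.ravel()]).astype(float)

# Toy 2x3 "image" used only for illustration.
img = np.arange(6, dtype=float).reshape(2, 3)
bag = image_to_bag(img)

# Any reordering of the rows describes the same bag of pixels.
rng = np.random.default_rng(0)
shuffled = bag[rng.permutation(len(bag))]
same_set = set(map(tuple, bag)) == set(map(tuple, shuffled))
```

The row order carries no meaning here, which is exactly why a later step has to either recover a correspondence or use a similarity that ignores order.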


Why Bags of Pixels or Sets of Vectors?
Vectorization / Rasterization: uses index in image to sort pixels into large vector. Dataset only shows variations in I entries of large vector (spatial is nonlinear)
If we knew “optimal” correspondence: could sort pixels in the bag into large vector more appropriately. Dataset shows jointly linear variations in X, Y and I entries


Why Bags of Pixels or Sets of Vectors?

But, we don’t know optimal correspondence, must learn it

As vector images, linear changes & eigenvectors are additions and deletions of intensities (awkward). Translating, raising eyebrows, etc. involve erasing & redrawing

In bag of pixels (vectorized only after optimal correspondence), linear changes and eigenvectors are morphings, warpings, jointly spatial and intensity change


Bag Representation → Permutation → Manifold
Assume order unknown. “Set of Vectors” or “Bag of Pixels”
Get permutational invariance (order doesn’t matter)
Can’t represent invariance by single ‘X’ vector point in DxN space since we don’t know the ordering
Get permutation invariance by ‘X’ spanning all possible reorderings
Multiply X by unknown A matrix (permutation or doubly-stochastic)


Invariant Paths as Matrix Operators on Vectors
Move vector along a manifold by multiplying by a matrix: $x \to A x$
Restrict A to be permutation matrix
Resulting manifold of configurations is an “orbit” if the A matrices form a group
Or, for smooth manifold, A is doubly-stochastic matrix
Endow each image in dataset with its own transformation matrix A
Each image is now a bag or manifold

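The action of A on a bag can be sketched as follows, assuming NumPy and a toy 3-pixel bag (all variable names are illustrative). A permutation matrix reorders the rows, and a convex combination of permutation matrices is doubly-stochastic, which is the "soft permutation" the slides refer to.

```python
import numpy as np

# Toy bag: three (x, y, intensity) tuples.
bag = np.array([[0.0, 0.0, 10.0],
                [1.0, 0.0, 20.0],
                [0.0, 1.0, 30.0]])

# Hard permutation: rows of the identity matrix, reordered.
perm = [2, 0, 1]
P = np.eye(3)[perm]
reordered = P @ bag                      # same as bag[perm]

# Soft permutation: a convex combination of two permutation matrices.
Q = np.eye(3)[[1, 0, 2]]
A = 0.7 * P + 0.3 * Q                    # non-negative, rows and columns sum to 1
moved = A @ bag                          # a point between the two reorderings
```

Because the columns of A sum to one, `moved` has the same mean tuple as `bag`, so the soft permutation slides the data along a path without changing its centroid.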

Modeling a Dataset of Invariant Manifolds

Example: assume model is PCA, learn 2D subspace of 3D data

Permutation indicates points can move independently along paths

Find PCA after moving to form ‘tight’ 2D subspace

More generally, move along manifolds to improve fit of any model (PCA, SVM, probability density, etc.)


Explicitly Handling Permutations (AISTATS 2003)
Borrow from SVM: regularization cost + linear constraints on model
Here have: modeling cost + linear constraints on transformations
Estimate transformation parameters $A_1, \dots, A_T$ and model parameters (PCA, Gaussian, SVM)
Cost function on matrices A emerges from modeling criterion
Min Convex Cost with Convex Hull of Constraints (Unique!)
Since A matrices are soft permutation matrices (doubly-stochastic) we have:
$$\min_{A_1,\dots,A_T} C(A_1,\dots,A_T) \quad \text{subject to} \quad \sum_i (A_t)_{ij} = 1, \quad \sum_j (A_t)_{ij} = 1, \quad (A_t)_{ij} \ge 0$$


Cost C(A)? Gaussian Mean
Maximum Likelihood Gaussian Mean Model (T bags $X_t$, each with its own $A_t$):
$$C(A) = \operatorname{tr}\big(\operatorname{Cov}(AX)\big) = \frac{1}{T}\sum_{t=1}^{T} \Big\| A_t X_t - \frac{1}{T}\sum_{s=1}^{T} A_s X_s \Big\|^2$$
Theorem 1: C(A) is convex in A (Convex Program)
Can solve via a quadratic program on the A matrices
Minimizing the trace of a covariance tries to pull the data spherically towards a common mean

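To illustrate what "pulling the data towards a common mean" does to the cost, here is a small sketch. It is not the convex program from the slides: it swaps in a brute-force search over hard permutations, which is only feasible for tiny bags, and all names and the toy data are assumptions made for this example.

```python
import itertools
import numpy as np

def trace_cov(bags):
    """Trace of the covariance of the vectorized bags (the Gaussian-mean cost)."""
    V = np.stack([b.ravel() for b in bags])
    return float(np.sum(np.var(V, axis=0)))

def best_permutation(bag, target):
    """Brute-force the row ordering of `bag` closest to `target` (tiny bags only)."""
    perms = itertools.permutations(range(len(bag)))
    return min((bag[list(p)] for p in perms),
               key=lambda b: np.sum((b - target) ** 2))

# Five noisy copies of the same 4-point shape, each stored in a random order.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 2))
bags = [base[rng.permutation(4)] + 0.01 * rng.normal(size=(4, 2)) for _ in range(5)]

# Align every bag to the first one, then compare the cost before and after.
aligned = [best_permutation(b, bags[0]) for b in bags]
before, after = trace_cov(bags), trace_cov(aligned)
```

After alignment the remaining variance is only the added noise, which is the effect the convex cost is designed to reward without enumerating permutations.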

Cost C(A)? Gaussian Mean & Covariance
Theorem 2: Regularized log determinant of a covariance is convex. Equivalently, minimize the regularized determinant itself
Theorem 3: Cost not quadratic but can be upper bounded by a quadratic
Iteratively solve quadratic program with variational bound
Minimizing determinant flattens data into a pancake of low volume


Cost C(A)? Fisher Discriminant
Find linear discriminant model ‘w’ that maximizes between / within class scatter:
$$J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w}$$
For discriminative invariance learning, estimate transformation matrices to:
  increase between-class scatter (numerator)
  reduce within-class scatter (denominator)
Minimizing the resulting cost C(A) tries to permute data to make classification easy


Cost Function C(A) Interpretations
Maximum Likelihood Mean
  Permute data towards common mean
Maximum Likelihood Mean & Covariance
  Permute data towards flat subspace
  Pushes energy into few eigenvectors
  Great as pre-processing before PCA
Fisher Discriminant
  Permute data towards two flat subspaces while repelling away from each other’s means


Practical Optimization of Quadratic Programs
Quadratic Programming used for all C(A) since:
  Gaussian Mean quadratic
  Gaussian Covariance upper boundable by quadratic
  Fisher Discriminant upper boundable by quadratic
Use Sequential Minimal Optimization: axis parallel optimization, pick axes to update, ensure constraints not violated
Soft permutation matrix: 4 constraints or 4 entries at a time

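Why four entries at a time? A sketch under stated assumptions (NumPy, a hypothetical helper `four_entry_update`, and a fixed step instead of the step an SMO-style solver would derive from the cost): changing a 2x2 block of A by +d on its diagonal and -d on its anti-diagonal keeps every row and column sum fixed, so the matrix stays doubly-stochastic as long as the entries stay non-negative.

```python
import numpy as np

def four_entry_update(A, m, p, n, q, delta):
    """Shift mass `delta` around the 2x2 block with rows (m, p) and columns (n, q).

    The step is clipped so that all four touched entries remain non-negative.
    """
    lo = -min(A[m, n], A[p, q])          # most negative step allowed
    hi = min(A[m, q], A[p, n])           # most positive step allowed
    d = float(np.clip(delta, lo, hi))
    B = A.copy()
    B[m, n] += d
    B[p, q] += d
    B[m, q] -= d
    B[p, n] -= d
    return B

A = np.full((3, 3), 1.0 / 3.0)           # uniform doubly-stochastic start
B = four_entry_update(A, m=0, p=1, n=0, q=2, delta=0.5)
```

In a real solver the value of `delta` would be chosen to decrease C(A); here it is an arbitrary number that only demonstrates that the constraints survive the update.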

Digits: Image = Bag of XY Vectors
Figure panels: Original, PCA, Permuted PCA
20 Images of ‘3’ and ‘9’
Each is 70 (x,y) dots
No order on the ‘dots’
PCA compress with same number of Eigenvectors
Convex Program first estimates the permutation → better reconstruction


Linear Interpolation

Intermediate images are smooth morphs

Points nicely corresponded

Spatial morphing versus ‘redrawing’

No ghosting


Single Person Faces: Image = Bag of XYI Pixels

Figure panels: Original, PCA, Permuted Bag-of-Pixels PCA
2000 XYI Pixels: Compress to 20 dims
Improve squared error of PCA by almost 3 orders of magnitude (x10^3)


Multi-Person Faces: Bag-of-Pixels Eigenvectors
+/- Scaling on Eigenvector
Top 5 Eigenvectors
All just linear variations in bag of XYI pixels
Vectorization nonlinear, needs huge # of eigenvectors


Multi-Person Faces: Bag-of-Pixels Eigenvectors
+/- Scaling on Eigenvector
Next 5 Eigenvectors


Classification: Distances & Affinity between Two Bags?
For nearest neighbor classification: need distances between two bags
For SVM classification: need kernel or affinity between two bags
Could solve permutations, find closest point between two manifolds:
$$\min_{A, A'} \| A X - A' X' \|$$
But, can we avoid computing optimal permutations (work)?
Implicitly compute distances or affinities?


Implicitly Handling Permutations: Kernels on PDFs
Idea:
1) Each bag has IID vectors in it
2) Model each bag using a probability density (e.g. Gaussian)
3) Build kernel classifier by measuring affinity between PDFs


Part 2: Kernelizing …

Kernels on Sets of Vectors & Kernels on PDFs


Kernels from Generative Models
Combines Generative & Discriminative Tools
Use a Kernel derived from generative model
  Fisher Kernels: Jaakkola & Haussler
  Convolution / Transducer Kernels: Haussler, Watkins, Cortes
  Diffusion Kernels: Kondor, Lafferty
  Heat Kernels: Lafferty, Lebanon
Model distribution of points (one, some or all)
Density helps compute affinities & kernels in machine (SVM)


Fisher Kernels & Kullback-Leibler Divergence
Fisher Kernel: approximate distance between two generative models on statistical manifold using Kullback-Leibler
$$KL(p \,\|\, p') = \int p(x) \log \frac{p(x)}{p'(x)}\, dx$$
Approximate KL by quadratic form in local tangent space at ML estimate $\theta^*$
From distance get affinity via Fisher Info & gradients
$$k(\chi, \chi') = U_\chi^\top I_{\theta^*}^{-1} U_{\chi'}, \qquad U_\chi = \nabla_\theta \log p(\chi \mid \theta)\big|_{\theta^*}$$
Heat Kernel: (Lafferty & Lebanon) better geodesic distance & affinity. Only solvable for multinomial (sphere) or covariance (hyperbolic)


Hellinger Divergence & Bhattacharyya (COLT’03)
KL not symmetric, needs approximation, lose interesting nonlinearities
Instead: START with nice divergence, avoid approximation
Other choices & bounds for divergence (Topsoe, Dragomir): Triangular, L1, Hellinger, Harmonic, Variational, Csiszar’s f-Div, …
Hellinger Divergence:
$$H(p, p') = \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{p'(x)} \right)^2 dx$$
Bhattacharyya Affinity:
$$B(p, p') = \int \sqrt{p(x)}\, \sqrt{p'(x)}\, dx, \qquad H(p, p') = 1 - B(p, p')$$
Desiderata:
  Mercer Kernel? +ve? +ve definite? YES
  Symmetric? YES
  Computable for many distributions? YES
  Interesting nonlinear behavior? YES

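A quick numerical check of the Hellinger / Bhattacharyya pair on discrete distributions, assuming NumPy and using the convention H = (1/2) Σ (√p − √p')², under which H = 1 − B. The three toy distributions are made up for this sketch.

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya affinity: sum of sqrt(p_i * q_i)."""
    return float(np.sum(np.sqrt(p * q)))

def hellinger(p, q):
    """Hellinger divergence with the 1/2 convention."""
    return float(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
r = np.array([0.25, 0.25, 0.5])
dists = [p, q, r]

# Gram matrix of affinities; it is an inner product of sqrt-vectors, hence PSD.
gram = np.array([[bhattacharyya(a, b) for b in dists] for a in dists])
eigvals = np.linalg.eigvalsh(gram)
```

The affinity is just a dot product between the vectors √p and √q, which is why the symmetric and positive definite boxes in the desiderata can be ticked.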

Probability Product Kernels
Computing the kernel: given inputs χ and χ’
1) Estimate Densities (i.e. ML): $\chi \to p(x)$, $\chi' \to p'(x)$
2) Compute Bhattacharyya Affinity: $k(\chi, \chi') = \int \sqrt{p(x)}\, \sqrt{p'(x)}\, dx$
Probability Product Kernel: $k(\chi, \chi') = \int p(x)^{\rho}\, p'(x)^{\rho}\, dx$
Bhattacharyya Kernel: $\rho = 1/2$
Expected Likelihood Kernel: $\rho = 1$, $k(\chi, \chi') = \int p(x)\, p'(x)\, dx = E_{p}[p'(x)] = E_{p'}[p(x)]$

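The two steps above (estimate a density per input, then integrate the product of powers) can be sketched numerically. This is an illustrative 1-D version that assumes NumPy, Gaussian densities, and a plain Riemann sum on a grid; the function names and the sample data are hypothetical.

```python
import numpy as np

def fit_gaussian_1d(samples):
    """Step 1: maximum likelihood mean and variance (variance divides by n)."""
    return float(np.mean(samples)), float(np.var(samples))

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def product_kernel(params_a, params_b, rho, grid):
    """Step 2: integrate p(x)^rho * p'(x)^rho with a Riemann sum over `grid`."""
    integrand = gaussian_pdf(grid, *params_a) ** rho * gaussian_pdf(grid, *params_b) ** rho
    return float(np.sum(integrand) * (grid[1] - grid[0]))

rng = np.random.default_rng(0)
chi = rng.normal(0.0, 1.0, size=200)          # first input: a bag of scalars
chi_prime = rng.normal(1.0, 1.5, size=200)    # second input
grid = np.linspace(-15.0, 15.0, 20001)

pa, pb = fit_gaussian_1d(chi), fit_gaussian_1d(chi_prime)
k_bhatt = product_kernel(pa, pb, 0.5, grid)   # rho = 1/2
k_el = product_kernel(pa, pb, 1.0, grid)      # rho = 1
```

For Gaussians the integral has a closed form, so the grid is only there to make the definition concrete; for rho = 1 the result matches a Gaussian density evaluated at the difference of the means with the variances added.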

Exponential Family Product Kernels
$$p(x \mid \theta) = \exp\big( A(x) + T(x)^\top \theta - K(\theta) \big)$$
Many Properties, includes Gaussian Mean, Covariance, Multinomial, Binomial, Poisson, Exponential, Gamma, Bernoulli, Dirichlet, …
Maximum likelihood is straightforward:
$$\frac{\partial K(\theta)}{\partial \theta} = \frac{1}{n} \sum_{i=1}^{n} T(x_i)$$
All have above form but different A(x), T(x), convex K(θ)


Exponential Family Product Kernels
$$p(x \mid \theta) = \exp\big( A(x) + T(x)^\top \theta - K(\theta) \big)$$
Compute the Bhattacharyya Kernel for the e-family:
$$k(\chi, \chi') = \int \sqrt{p(x \mid \theta)}\, \sqrt{p(x \mid \theta')}\, dx$$
Analytic solution for e-family:
$$k(\chi, \chi') = \exp\Big( K\big(\tfrac{1}{2}\theta + \tfrac{1}{2}\theta'\big) - \tfrac{1}{2} K(\theta) - \tfrac{1}{2} K(\theta') \Big)$$
Only depends on convex cumulant-generating function K(θ)
Meanwhile, Fisher Kernel is always linear in sufficient stats…

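As a sanity check on the exponential-family closed form exp(K(θ/2 + θ'/2) − K(θ)/2 − K(θ')/2), the sketch below evaluates it for a single Bernoulli variable, where the natural parameter is the log-odds and K(θ) = log(1 + e^θ), and compares it with the direct sum over the two outcomes. NumPy and the parameter values are assumptions of this example.

```python
import numpy as np

def K(theta):
    """Cumulant-generating function of the Bernoulli family: log(1 + e^theta)."""
    return np.log1p(np.exp(theta))

def bhattacharyya_closed_form(theta_a, theta_b):
    """Closed form that only needs K at three points."""
    return float(np.exp(K(0.5 * theta_a + 0.5 * theta_b)
                        - 0.5 * K(theta_a) - 0.5 * K(theta_b)))

def bhattacharyya_direct(g_a, g_b):
    """Direct sum over x in {0, 1} of sqrt(p(x) p'(x))."""
    return float(np.sqrt(g_a * g_b) + np.sqrt((1.0 - g_a) * (1.0 - g_b)))

g_a, g_b = 0.2, 0.7                               # success probabilities
theta_a = np.log(g_a / (1.0 - g_a))               # natural parameters (log-odds)
theta_b = np.log(g_b / (1.0 - g_b))

closed = bhattacharyya_closed_form(theta_a, theta_b)
direct = bhattacharyya_direct(g_a, g_b)
```

The two numbers agree to machine precision, which shows that no integration is needed once K is known.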

Multinomial and Bernoulli Product Kernels
Bernoulli (binary):
$$p(x) = \prod_{d} \gamma_d^{x_d} (1 - \gamma_d)^{1 - x_d}, \qquad k(\chi, \chi') = \prod_{d} \Big[ (\gamma_d \gamma'_d)^{\rho} + \big( (1 - \gamma_d)(1 - \gamma'_d) \big)^{\rho} \Big]$$
Multinomial (discrete):
$$p(x) = \prod_{d} \alpha_d^{x_d}, \quad \sum_d \alpha_d = 1, \qquad k(\chi, \chi') = \sum_{d} (\alpha_d \alpha'_d)^{\rho}$$
For multinomial counts (for N words):
$$k(\chi, \chi') = \Big[ \sum_{d} (\alpha_d \alpha'_d)^{1/2} \Big]^{N}$$
Fisher for Multinomial is linear


Multinomial Product Kernel for Text
WebKB dataset: Faculty vs. student web page SVM kernel classifier
20-Fold Cross-Validation, 1641 student & 1124 faculty, ...
Use Bhattacharyya Kernel on multinomial (word frequency)
Figure panels: Training Set Size = 77, Training Set Size = 622
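A sketch of the Bhattacharyya kernel on word frequencies, assuming NumPy and three made-up count vectors over a four-word vocabulary (this is not the WebKB data). Each document is mapped to its maximum likelihood multinomial, and the kernel is the sum of square roots of products of the frequencies.

```python
import numpy as np

def word_frequencies(counts):
    """ML multinomial estimate: counts divided by their total."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def bhattacharyya_kernel(counts_a, counts_b):
    a, b = word_frequencies(counts_a), word_frequencies(counts_b)
    return float(np.sum(np.sqrt(a * b)))

# Hypothetical term-count vectors; rows are documents, columns are words.
docs = np.array([[3, 0, 1, 2],
                 [2, 1, 0, 3],
                 [0, 4, 2, 0]])

gram = np.array([[bhattacharyya_kernel(x, y) for y in docs] for x in docs])
```

The resulting Gram matrix is symmetric and positive semi-definite, so it can be handed to any SVM implementation that accepts a precomputed kernel.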


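Gaussian Product Kernel
Gaussian Mean (continuous, any ρ), with $p(x) = N(x; \mu, \sigma^2 I)$ in D dimensions:
$$k(\chi, \chi') = (2\rho)^{-D/2}\, (2\pi\sigma^2)^{(1 - 2\rho) D / 2} \exp\Big( -\frac{\rho\, \| \mu - \mu' \|^2}{4 \sigma^2} \Big)$$
If µ=χ, µ’=χ’ get RBF Kernel
Fisher here is linear
Gaussian Covariance (any ρ), with $\Sigma^{\dagger} = (\Sigma^{-1} + \Sigma'^{-1})^{-1}$ and $\mu^{\dagger} = \Sigma^{-1}\mu + \Sigma'^{-1}\mu'$:
$$k(\chi, \chi') = (2\pi)^{(1 - 2\rho) D / 2}\, \rho^{-D/2}\, |\Sigma^{\dagger}|^{1/2}\, |\Sigma|^{-\rho/2}\, |\Sigma'|^{-\rho/2} \exp\Big( -\frac{\rho}{2} \big( \mu^\top \Sigma^{-1} \mu + \mu'^\top \Sigma'^{-1} \mu' - \mu^{\dagger\top} \Sigma^{\dagger} \mu^{\dagger} \big) \Big)$$
Fisher here is quadratic
But, how do we get a non-degenerate covariance from 1 data point?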


Gaussian Product Kernels on Bags of Vectors
Instead of a single χ & χ’
Construct p & p’ from many χ & χ’
i.e. use bag of vectors $\{\chi_1, \dots, \chi_N\}$:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} \chi_i, \qquad \Sigma = \frac{1}{N} \sum_{i=1}^{N} (\chi_i - \mu)(\chi_i - \mu)^\top$$
$$k(\chi, \chi') = \int p(x)^{\rho}\, p'(x)^{\rho}\, dx, \qquad p = N(\mu, \Sigma), \quad p' = N(\mu', \Sigma')$$

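A sketch of a Gaussian kernel between two bags for ρ = 1/2, assuming NumPy. It fits a mean and covariance to each bag and then uses the standard closed-form Bhattacharyya distance between two Gaussians. The small ridge term `reg` is an assumption added here to keep the covariances invertible; it is not something the slides specify.

```python
import numpy as np

def fit_gaussian(bag, reg=1e-3):
    """ML mean and covariance of an (N, D) bag, plus an assumed ridge `reg`."""
    mu = bag.mean(axis=0)
    centered = bag - mu
    cov = centered.T @ centered / len(bag) + reg * np.eye(bag.shape[1])
    return mu, cov

def bhattacharyya_gaussian(bag_a, bag_b, reg=1e-3):
    """exp(-D_B), with D_B the Bhattacharyya distance between the two fitted Gaussians."""
    mu_a, cov_a = fit_gaussian(bag_a, reg)
    mu_b, cov_b = fit_gaussian(bag_b, reg)
    cov_m = 0.5 * (cov_a + cov_b)
    diff = mu_a - mu_b
    maha = diff @ np.linalg.solve(cov_m, diff)
    _, logdet_m = np.linalg.slogdet(cov_m)
    _, logdet_a = np.linalg.slogdet(cov_a)
    _, logdet_b = np.linalg.slogdet(cov_b)
    dist = 0.125 * maha + 0.5 * (logdet_m - 0.5 * (logdet_a + logdet_b))
    return float(np.exp(-dist))

rng = np.random.default_rng(1)
bag_a = rng.normal(size=(50, 3))                  # 50 toy (x, y, I) tuples
bag_b = rng.normal(loc=0.5, size=(40, 3))         # bags may differ in size

k_ab = bhattacharyya_gaussian(bag_a, bag_b)
k_shuffled = bhattacharyya_gaussian(bag_a[rng.permutation(50)], bag_b)
```

Since the mean and covariance ignore row order, `k_shuffled` equals `k_ab`: the permutation invariance comes for free, without solving for any correspondence.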

Kernelized Gaussian Product Kernel (ICML’03)
Bhattacharyya affinity on Gaussians on bags {χ,...} & {χ’,...}
Invariant: to order of tuples in each bag
But too simple: overlap of two Gaussian distributions on images
Need more detail than mean and covariance of pixels…
Use Kernel Trick again when computing Gaussian mean & covariance
Never compute outer products, use kernel, i.e. infinite RBF: $\kappa(x, x') = \exp\big( -\| x - x' \|^2 / (2\sigma^2) \big)$
Compute mini-kernel between each pixel in a given image…
Gives kernelized or augmented Gaussian µ and Σ via Gram


Kernelized Gaussian Product Kernel
Previously: $\mu = \frac{1}{N} \sum_i \chi_i$, $\Sigma = \frac{1}{N} \sum_i (\chi_i - \mu)(\chi_i - \mu)^\top$
Now have: $\mu = \frac{1}{N} \sum_i \Phi(\chi_i)$, $\Sigma = \frac{1}{N} \sum_i (\Phi(\chi_i) - \mu)(\Phi(\chi_i) - \mu)^\top$, with $\kappa(x, x') = \Phi(x)^\top \Phi(x')$
Still invariant to order of pixels
Compute Hilbert Gaussian’s mean & covariance of each image
Bag or image is N x N pixel Gram matrix using kappa kernel
Use kernel PCA to regularize infinite dimensional RBF Gaussian
Puts all dimensions (X,Y,I) on an equal footing

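The ingredients named above (an N x N Gram matrix of the mini-kernel κ between pixels, followed by kernel PCA) can be sketched as below. This only computes the leading feature-space variances of one bag; it is not the full kernelized Bhattacharyya kernel, and NumPy, the bandwidth, and the random toy bag are assumptions of this example.

```python
import numpy as np

def rbf_gram(bag, sigma=1.0):
    """N x N Gram matrix of the RBF mini-kernel between the tuples of one bag."""
    sq = np.sum(bag ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * bag @ bag.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def kernel_pca_spectrum(gram, n_components):
    """Top eigenvalues/eigenvectors of the centered Gram matrix.

    Dividing the eigenvalues by N gives the variances of the feature-space
    covariance along its leading directions.
    """
    n = len(gram)
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    eigvals, eigvecs = np.linalg.eigh(H @ gram @ H)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order] / n, eigvecs[:, order]

rng = np.random.default_rng(2)
bag = rng.uniform(size=(30, 3))                   # 30 hypothetical (x, y, I) pixels
gram = rbf_gram(bag, sigma=0.5)
variances, directions = kernel_pca_spectrum(gram, n_components=3)

# Reordering the pixels permutes the Gram matrix but not its spectrum.
perm = rng.permutation(30)
variances_perm, _ = kernel_pca_spectrum(rbf_gram(bag[perm], sigma=0.5), 3)
```

Keeping only a few components is one way to regularize a Gaussian that would otherwise live in an infinite-dimensional feature space.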

Kernelized Gaussian Product Kernel
Figure panels:
  Reconstruction of Letter ‘R’ with 1-4 KPCA with RBF Kernel
  Reconstruction of Letter ‘R’ with 3 KPCA with RBF Kernel + Smoothing
  Letter ‘R’ with 3 KPCA Components of RBF Kernel


Kernelized Gaussian Product Kernel
100 40x40 monochromatic images of crosses & squares translating & scaling
SVM: Train on 50, Test on 50
Fisher for Gaussian is Quadratic Kernel
RBF Kernel (red): 67% accuracy
Bhattacharyya (blue): 90% accuracy


Kernelized Gaussian Product Kernel
SVM Classification of NIST digit images 0,1,…,9
Sample each image to get bag of 30 (X,Y) pixels
Train on random 120, test on random 80
Bag-of-vectors Bhattacharyya outperforms standard RBF due to built-in invariance
Fisher Kernel for Gaussian is quadratic


Mixture Model Product Kernels
Beyond Exponential Family: Mixtures and Hidden Variables
Easier for ρ=1 Expected Likelihood kernel…
Mixture:
$$p(x) = \sum_{m=1}^{M} p(m)\, p(x \mid m), \qquad p'(x) = \sum_{n=1}^{N} p'(n)\, p'(x \mid n)$$
Kernel:
$$k(\chi, \chi') = \int p(x)\, p'(x)\, dx = \sum_{m=1}^{M} \sum_{n=1}^{N} p(m)\, p'(n) \int p(x \mid m)\, p'(x \mid n)\, dx$$
Use M*N subkernel evaluations from our previous repertoire

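The M*N decomposition can be checked numerically for 1-D Gaussian mixtures with ρ = 1, where each subkernel is the integral of a product of two Gaussians and equals a Gaussian density in the difference of the means with the variances added. NumPy, the mixture parameters, and the grid used for the comparison are all assumptions of this sketch.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def expected_likelihood_gaussians(mu_a, var_a, mu_b, var_b):
    """Closed form of the integral of N(x; mu_a, var_a) * N(x; mu_b, var_b)."""
    return gaussian_pdf(mu_a, mu_b, var_a + var_b)

def mixture_kernel(mix_a, mix_b):
    """Each mixture is a list of (weight, mean, variance); uses M*N subkernels."""
    return float(sum(wa * wb * expected_likelihood_gaussians(ma, va, mb, vb)
                     for wa, ma, va in mix_a
                     for wb, mb, vb in mix_b))

mix_a = [(0.6, -1.0, 0.5), (0.4, 2.0, 1.0)]                       # M = 2
mix_b = [(0.3, 0.0, 1.0), (0.5, 1.5, 0.3), (0.2, -2.0, 0.8)]      # N = 3
k_closed = mixture_kernel(mix_a, mix_b)

# Brute-force comparison: integrate the product of the two densities on a grid.
grid = np.linspace(-20.0, 20.0, 40001)

def density(mix):
    return sum(w * gaussian_pdf(grid, m, v) for w, m, v in mix)

k_numeric = float(np.sum(density(mix_a) * density(mix_b)) * (grid[1] - grid[0]))
```

Only 2 x 3 = 6 closed-form evaluations are needed, and they reproduce the grid integral.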

Hidden Markov Model Product Kernels
Hidden Markov Models (sequences):
$$p(x) = \sum_{S} p(s_0)\, p(x_0 \mid s_0) \prod_{t=1}^{T} p(s_t \mid s_{t-1})\, p(x_t \mid s_t)$$
# of hidden configurations large
Kernel:
$$k(\chi, \chi') = \int p(x)\, p'(x)\, dx = \sum_{S} \sum_{U} p(S)\, p'(U) \int p(x \mid S)\, p'(x \mid U)\, dx$$
Do we need to consider raw cross-product of hiddens?


Hidden Markov Model Product Kernels
Do we need to consider cross-product of hiddens? NO!
Take advantage of structure in HMMs via Bayesian network
Only compute subkernels for common parents:
$$\psi(s_t, u_t) = \int p(x_t \mid s_t)\, p'(x_t \mid u_t)\, dx_t$$
Evaluate total of subkernels
Form +ve clique potential functions, sum via junction tree algorithm:
$$k(\chi, \chi') = \sum_{S} \sum_{U} p(s_0)\, p'(u_0)\, \psi(s_0, u_0) \prod_{t=1}^{T} p(s_t \mid s_{t-1})\, p'(u_t \mid u_{t-1})\, \psi(s_t, u_t)$$

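A sketch of why the cross-product of hidden sequences never has to be enumerated, for the ρ = 1 case with discrete emissions. It assumes NumPy and two small random HMMs (hypothetical parameters), runs a forward-style recursion over pairs of hidden states, and compares the result with a brute-force sum over all observation sequences.

```python
import itertools
import numpy as np

def sequence_likelihood(x, pi, A, B):
    """Forward algorithm: p(x) for prior pi, transitions A[r, s], emissions B[s, symbol]."""
    alpha = pi * B[:, x[0]]
    for symbol in x[1:]:
        alpha = (alpha @ A) * B[:, symbol]
    return float(alpha.sum())

def hmm_expected_likelihood(hmm_a, hmm_b, length):
    """Sum over all sequences of p(x) p'(x) via a recursion on pairs of hidden states."""
    (pi_a, A_a, B_a), (pi_b, A_b, B_b) = hmm_a, hmm_b
    psi = B_a @ B_b.T                     # psi[s, u] = sum_x p(x|s) p'(x|u)
    phi = np.outer(pi_a, pi_b) * psi
    for _ in range(length - 1):
        phi = (A_a.T @ phi @ A_b) * psi
    return float(phi.sum())

rng = np.random.default_rng(3)

def random_hmm(n_states, n_symbols):
    def normalize(m):
        return m / m.sum(axis=-1, keepdims=True)
    return (normalize(rng.uniform(size=n_states)),
            normalize(rng.uniform(size=(n_states, n_states))),
            normalize(rng.uniform(size=(n_states, n_symbols))))

hmm_a, hmm_b = random_hmm(2, 3), random_hmm(3, 3)
T = 4
k_fast = hmm_expected_likelihood(hmm_a, hmm_b, T)
k_brute = sum(sequence_likelihood(x, *hmm_a) * sequence_likelihood(x, *hmm_b)
              for x in itertools.product(range(3), repeat=T))
```

The recursion costs a few small matrix products per time step, while the brute-force sum grows exponentially with the sequence length.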

Hidden Markov Model Product Kernels
Protein Dataset: 480 protein sequences from SCOP dataset
Each ~200 discrete symbols (alphabet ~ 20)
Train 2-state HMM on each sequence (over-fitting)
SVM two class problem: family 2.1.1.4 vs. family 2.1
Training: 120 +ve, 120 -ve
Testing: 120 +ve, 120 -ve


Bayesian Network Product Kernels
Probability product over common sample space between any pair of Latent Bayesian Networks
Compute subkernels over all latent parents
Computations grow tractably with enlarged clique size of joint parents. Won’t get loopy if original graphs non-loopy


Intractable Distributions Product Kernels (Sampling)
What if non-parametric or intractable (loopy) distributions?
Sampling: approximate $k(\chi, \chi') = \int p(x)^{\rho}\, p'(x)^{\rho}\, dx$
By definition, generative models can:
1) Generate a Sample
2) Compute Likelihood of a Sample
Thus, approximate probability product via sampling:
$$k(\chi, \chi') \approx \frac{\beta}{N} \sum_{x_i \sim p} p(x_i)^{\rho - 1}\, p'(x_i)^{\rho} + \frac{1 - \beta}{N'} \sum_{x'_i \sim p'} p(x'_i)^{\rho}\, p'(x'_i)^{\rho - 1}$$
Beta controls how much sampling from each distribution…

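The sampling approximation can be sketched with two 1-D Gaussians standing in for the "intractable" models, so that the estimate can be compared with a known answer. NumPy, the parameter values, and the helper names are assumptions of this example; the estimator only uses the two abilities listed above, drawing samples and evaluating likelihoods.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def sampled_product_kernel(sample_p, sample_q, pdf_p, pdf_q, rho=1.0, beta=0.5):
    """Monte Carlo estimate of the probability product kernel.

    `beta` weights the average over samples from p against the one from p'.
    """
    term_p = np.mean(pdf_p(sample_p) ** (rho - 1.0) * pdf_q(sample_p) ** rho)
    term_q = np.mean(pdf_p(sample_q) ** rho * pdf_q(sample_q) ** (rho - 1.0))
    return float(beta * term_p + (1.0 - beta) * term_q)

rng = np.random.default_rng(4)
mu_p, var_p, mu_q, var_q = 0.0, 1.0, 1.0, 2.0

def pdf_p(x):
    return gaussian_pdf(x, mu_p, var_p)

def pdf_q(x):
    return gaussian_pdf(x, mu_q, var_q)

sample_p = rng.normal(mu_p, np.sqrt(var_p), size=100_000)
sample_q = rng.normal(mu_q, np.sqrt(var_q), size=100_000)

estimate = sampled_product_kernel(sample_p, sample_q, pdf_p, pdf_q, rho=1.0, beta=0.5)
exact = float(gaussian_pdf(mu_p, mu_q, var_p + var_q))   # closed form for rho = 1
```

Setting `beta` to 1 or 0 uses samples from only one of the two models, which is useful when one of them is much cheaper to sample from.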

Discussion
Use Generative Models & Sets in SVMs and Kernel Machines via Hellinger, Bhattacharyya, Expected Likelihood (Less Kernel Voodoo)
Avoid Approximation, Mercer Kernel, Symmetric, Nonlinear Behavior, and Computable for many distributions:
Exponential Family, Bernoulli, Multinomial, Gaussian, Kernelized Gaussian, Mixture Models, HMMs, Latent Bayes Nets, Sampling Methods, …
Future Work: Getting aggregated maximum likelihood solution to influence kernel