Kernels Between Distributions & Sets - Columbia …jebara/talks/snowbird03.pdf

1

Kernels Between Distributions & Sets

Tony Jebara
Risi Kondor
Machine Learning Lab
Columbia University
June 2003

2

Outline
Vectors vs. Sets of Vectors
Generative Models of Sets
Kernels on Sets
Kernels on Probability Models
Hellinger & Bhattacharyya

Probability Product Kernels:
  Exponential Family
  Bernoulli & Multinomial
  Gaussian
  Kernelized Gaussian
  Mixture Models
  Hidden Markov Models
  Bayesian Networks
  Sampling Methods

3

Non-Vector Data

Part 1: Generative Models on Sets

Part 2: Kernels on Sets & Distributions

4

Part 1:

Generative Models on Sets of Vectors

(...as opposed to Vectors)

5

Modeling Data as Vectors & Vectorization

Many learning methods expect each datum as a vector

But vectorizing an object into a vector is dangerous:
  Images: morph, rotate, translate, zoom…
  Audio: pitch changes, ambient acoustics…
  Video: motion, camera view, angles…
  Gene Data: proteins fold, insertions, deletions…

Want Invariance: factor out certain variations (translations)
Want Linearity: model desired variations (identity) linearly
But the above variations are highly nonlinear in the vector representation

e.g. image translation [figure]

6

Alternative: Avoid Vectorizing, Use “Bag of Vectors”
Since vectorization is so nonlinear, avoid it from the outset
View a datum as a “Bag of Vectors” instead of a single Vector

e.g. grayscale image = Set of Vectors or Bag of Pixels
(N pixels, each is a D=3 XYI tuple)

Why? Image vectorization dumps a DxN dimensional vector by lexicographic ordering. Over many images, the ordering stays constant and only the intensities vary. All variation is captured only by intensity changes.
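A minimal Python sketch of the bag-of-pixels view described above, assuming an image is given as a list of rows of intensities. The helper names `image_to_bag` and `translate_bag` are illustrative and not from the talk. It shows that a translation, which erases and redraws intensities in the raster vector, is a simple additive change on the X and Y entries of the bag.

```python
def image_to_bag(image):
    """Convert a grayscale image (list of rows of intensities) into a
    bag of (x, y, intensity) tuples, one tuple per pixel."""
    return [(x, y, intensity)
            for y, row in enumerate(image)
            for x, intensity in enumerate(row)]


def translate_bag(bag, dx, dy):
    """Translating the image only adds a constant to the X and Y entries."""
    return [(x + dx, y + dy, intensity) for (x, y, intensity) in bag]


image = [[0, 9],
         [5, 0]]
bag = image_to_bag(image)                  # N = 4 pixels, D = 3 entries each
shifted = translate_bag(bag, dx=1, dy=0)   # intensities are untouched
```

The order of the tuples in `bag` carries no information, which is what the later slides exploit.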

7

Why Bags of Pixels or Sets of Vectors?
Vectorization / Rasterization: uses the index in the image to sort pixels into a large vector. The dataset only shows variations in the I entries of the large vector (spatial variation is nonlinear).

If we knew the “optimal” correspondence: could sort the pixels in the bag into a large vector more appropriately. The dataset then shows jointly linear variations in the X, Y and I entries.

8

Why Bags of Pixels or Sets of Vectors?

But we don’t know the optimal correspondence, must learn it

As vector images, linear changes & eigenvectors are additions and deletions of intensities (awkward).
Translating, raising eyebrows, etc. involve erasing & redrawing

In bag of pixels (vectorized only after optimal correspondence), linear changes and eigenvectors are morphings and warpings: joint spatial and intensity change

9

Bag Representation → Permutation → Manifold
Assume order unknown. “Set of Vectors” or “Bag of Pixels”
Get permutational invariance (order doesn’t matter)

Can’t represent invariance by single ‘X’ vector point in DxN space since we don’t know the ordering

Get permutation invariance by ‘X’ spanning all possible reorderings
Multiply X by unknown A matrix (permutation or doubly-stochastic)
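A small Python sketch of multiplying a bag by a permutation or doubly-stochastic matrix. It assumes the bag X is stored as an N x D list of rows (one tuple per row) rather than one long vector. `apply_matrix` and `is_doubly_stochastic` are illustrative helper names.

```python
def apply_matrix(A, X):
    """Return A @ X for an N x N matrix A and an N x D bag X (lists of lists)."""
    return [[sum(A[i][k] * X[k][d] for k in range(len(X)))
             for d in range(len(X[0]))]
            for i in range(len(A))]


def is_doubly_stochastic(A, tol=1e-9):
    """True if A is non-negative and every row and column sums to 1."""
    n = len(A)
    nonneg = all(a >= -tol for row in A for a in row)
    rows = all(abs(sum(row) - 1.0) <= tol for row in A)
    cols = all(abs(sum(A[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    return nonneg and rows and cols


X = [[0, 0, 7], [1, 0, 3], [0, 1, 5]]      # bag of three XYI tuples
P = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]      # a hard permutation
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]     # identity (no reordering)
soft = [[0.5 * (p + e) for p, e in zip(p_row, e_row)]
        for p_row, e_row in zip(P, I3)]    # average of two permutations

reordered = apply_matrix(P, X)             # same tuples, new order
blended = apply_matrix(soft, X)            # a point between two orderings
```

Averaging permutation matrices stays inside the doubly-stochastic set, which is why the soft matrices trace out a smooth set of configurations between hard reorderings.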

10

Invariant Paths as Matrix Operators on Vectors

Move vector along a manifold by multiplying by a matrix: $x \rightarrow A x$

Restrict A to be a permutation matrix
Resulting manifold of configurations is an “orbit” if the A matrices form a group
Or, for a smooth manifold, A is a doubly-stochastic matrix
Endow each image in the dataset with its own transformation matrix A
Each image is now a bag or manifold

11

Modeling a Dataset of Invariant Manifolds

Example: assume model is PCA, learn 2D subspace of 3D data

Permutation indicates points can move independently along paths

Find PCA after moving to form ‘tight’ 2D subspace

More generally, move along manifolds to improve fit of any model (PCA, SVM, probability density, etc.)

12

Explicitly Handling Permutations (AISTATS 2003)
Borrow from SVM: regularization cost + linear constraints on model
Here have: modeling cost + linear constraints on transformations
Estimate transformation parameters and model parameters (PCA, Gaussian, SVM)

Cost function on matrices A emerges from modeling criterion
Min Convex Cost with Convex Hull of Constraints (Unique!)

Since A matrices are soft permutation matrices (doubly-stochastic) we have:

$\min_{A_1,\dots,A_T} C(A_1,\dots,A_T)$ subject to $\sum_i A_t^{ij} = 1$, $\sum_j A_t^{ij} = 1$, $A_t^{ij} \geq 0$

13

Cost C(A)? Gaussian Mean

Maximum Likelihood Gaussian Mean Model: cost is the trace of the covariance of the transformed data, $C(A) = \mathrm{tr}(\mathrm{Cov}(AX))$

Theorem 1: C(A) is convex in A (Convex Program)

Can solve via a quadratic program on the A matrices

Minimizing the trace of a covariance tries to pull the data spherically towards a common mean
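A toy Python illustration of the trace-of-covariance cost, assuming each bag is flattened into one long vector in its current order. It only evaluates the cost for two hand-picked orderings and does not solve the quadratic program. The function names are illustrative.

```python
def flatten(bag):
    """Concatenate the tuples of an ordered bag into one long vector."""
    return [value for tup in bag for value in tup]


def trace_of_covariance(vectors):
    """Trace of the maximum-likelihood covariance of a list of vectors,
    i.e. the mean squared distance of the vectors to their mean."""
    count, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / count for d in range(dim)]
    return sum((v[d] - mean[d]) ** 2 for v in vectors for d in range(dim)) / count


bag_a = [(0, 0), (1, 1)]
bag_b = [(1, 1), (0, 0)]                 # same set of points, different order
bag_b_aligned = [bag_b[1], bag_b[0]]     # reordered to correspond with bag_a

cost_before = trace_of_covariance([flatten(bag_a), flatten(bag_b)])
cost_after = trace_of_covariance([flatten(bag_a), flatten(bag_b_aligned)])
```

Here the two bags hold identical points, so the right permutation drives the cost from 1.0 to 0.0: all apparent variation was an artifact of ordering.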

14

Cost C(A)? Gaussian Mean & Covariance

Theorem 2: Regularized log determinant of a covariance is convex
Equivalently, minimize: [equation not recoverable]

Theorem 3: Cost is not quadratic but can be upper bounded by a quadratic
Iteratively solve a quadratic program with a variational bound

Minimizing the determinant flattens data into a pancake of low volume

15

Cost C(A)? Fisher Discriminant

[Figure: two classes of points and a discriminant direction]

Find linear discriminant model ‘w’ that maximizes between / within class scatter

For discriminative invariance learning, estimate transformation matrices to:
  increase between-class scatter (numerator)
  reduce within-class scatter (denominator)

Minimizing this tries to permute data to make classification easy
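A short Python sketch of the between / within class scatter ratio for data projected onto a direction w. It uses the standard two-class Fisher criterion on 1-D projections. The data and the name `fisher_ratio` are made up for illustration, and the permutation step itself is not shown.

```python
def fisher_ratio(class_a, class_b, w):
    """Between-class over within-class scatter of the data projected on w."""
    def project(points):
        return [sum(wi * xi for wi, xi in zip(w, p)) for p in points]

    def mean(values):
        return sum(values) / len(values)

    def scatter(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values)

    pa, pb = project(class_a), project(class_b)
    between = (mean(pa) - mean(pb)) ** 2      # numerator
    within = scatter(pa) + scatter(pb)        # denominator
    return between / within


class_a = [(0.0, 0.0), (0.0, 1.0)]
class_b = [(3.0, 0.0), (3.0, 1.0)]
good = fisher_ratio(class_a, class_b, w=(1.0, 0.1))   # mostly along the class gap
poor = fisher_ratio(class_a, class_b, w=(0.1, 1.0))   # mostly along the class spread
```

Transformations that raise this ratio for a given w are the ones the slide describes as making classification easy.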

16

Cost Function C(A) Interpretations

Maximum Likelihood Mean
  Permute data towards common mean

Maximum Likelihood Mean & Covariance
  Permute data towards flat subspace
  Pushes energy into few eigenvectors
  Great as pre-processing before PCA

Fisher Discriminant
  Permute data towards two flat subspaces while repelling away from each other’s means

17

Practical Optimization of Quadratic Programs

Quadratic Programming used for all C(A) since:
  Gaussian Mean is quadratic
  Gaussian Covariance is upper boundable by a quadratic
  Fisher Discriminant is upper boundable by a quadratic

Use Sequential Minimal Optimization: axis-parallel optimization, pick axes to update, ensure constraints are not violated

Soft permutation matrix: 4 constraints or 4 entries at a time
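One plausible reading of “4 entries at a time”, sketched in Python: moving mass among four entries at the corners of a 2 x 2 sub-block leaves every row and column sum unchanged, so the matrix stays doubly-stochastic. This is an assumption about the update, not a transcription of the talk's optimizer. `four_entry_update` is a hypothetical name, and the step size is given rather than chosen by minimizing C(A).

```python
def four_entry_update(A, i, k, j, l, delta):
    """Shift mass `delta` between four entries of a doubly-stochastic matrix:
    A[i][j] and A[k][l] gain delta, A[i][l] and A[k][j] lose delta.
    Row and column sums are unchanged; delta is clipped so entries stay >= 0.
    Requires i != k and j != l."""
    delta = min(delta, A[i][l], A[k][j])       # losing entries must stay >= 0
    delta = max(delta, -A[i][j], -A[k][l])     # gaining entries must stay >= 0
    B = [row[:] for row in A]
    B[i][j] += delta
    B[k][l] += delta
    B[i][l] -= delta
    B[k][j] -= delta
    return B


A = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]]
B = four_entry_update(A, i=0, k=1, j=2, l=0, delta=0.2)
```

In an actual solver the value of `delta` would come from minimizing the quadratic cost along this direction before clipping.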

18

Digits: Image = Bag of XY Vectors

[Figure panels: Original, PCA, Permuted PCA]

20 Images of ‘3’ and ‘9’
Each is 70 (x,y) dots
No order on the ‘dots’

PCA compress with same number of Eigenvectors

Convex Program first estimates the permutation → better reconstruction

19

Linear Interpolation

Intermediate images are smooth morphs

Points nicely corresponded

Spatial morphing versus ‘redrawing’

No ghosting

20

Single Person Faces: Image = Bag of XYI Pixels

[Figure panels: Original, PCA, Permuted Bag-of-Pixels PCA]

2000 XYI Pixels: Compress to 20 dims

Improve squared error of PCA by almost 3 orders of magnitude (x10^3)

21

Multi-Person Faces: Bag-of-Pixels Eigenvectors
+/- Scaling on Eigenvector

Top 5 Eigenvectors

All just linear variations in bag of XYI pixels

Vectorization is nonlinear, needs huge # of eigenvectors

22

Multi-Person Faces: Bag-of-Pixels Eigenvectors
+/- Scaling on Eigenvector

Next 5 Eigenvectors

23

Classification: Distances & Affinity between Two Bags?
For nearest neighbor classification: need distances between two bags

For SVM classification: need kernel or affinity between two bags

Could solve permutations, find closest point between two manifolds: $D(X, X') = \min_{A, A'} \|A X - A' X'\|^2$

But, can we avoid computing optimal permutations (work)?
Implicitly compute distances or affinities?
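To make the cost of the explicit route concrete, here is a brute-force Python sketch that finds the closest ordering of one bag to another by trying every hard permutation. Exhaustive search is swapped in purely for illustration (it is factorial in the bag size and is not the convex program from the earlier slides). The function names are illustrative.

```python
from itertools import permutations


def squared_distance(bag_a, bag_b):
    """Sum of squared distances between tuples in corresponding positions."""
    return sum((a - b) ** 2
               for ta, tb in zip(bag_a, bag_b)
               for a, b in zip(ta, tb))


def closest_permutation_distance(bag_a, bag_b):
    """Minimum squared distance over all reorderings of bag_b.
    Exhaustive search: only feasible for a handful of tuples."""
    return min(squared_distance(bag_a, reordered)
               for reordered in permutations(bag_b))


bag_a = [(0, 0), (1, 0), (0, 1)]
bag_b = [(0, 1), (0, 0), (1, 0.5)]
naive = squared_distance(bag_a, bag_b)              # compares in stored order
best = closest_permutation_distance(bag_a, bag_b)   # order-invariant distance
```

Reordering only one of the two bags is enough here, since permuting both is equivalent to permuting one relative to the other.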

24

Implicitly Handling Permutations: Kernels on PDFs

Idea:
1) Each bag has IID vectors in it
2) Model each bag using a probability density (e.g. Gaussian)
3) Build kernel classifier by measuring affinity between PDFs

25

Part 2: Kernelizing …

Kernels on Sets of Vectors & Kernels on PDFs

26

Kernels from Generative Models
Combines Generative & Discriminative Tools

Use a Kernel derived from generative model

Fisher Kernels: Jaakkola & Haussler
Convolution / Transducer Kernels: Haussler, Watkins, Cortes
Diffusion Kernels: Kondor, Lafferty
Heat Kernels: Lafferty, Lebanon

Model distribution of points (one, some or all)

Density helps compute affinities & kernels in machine (SVM)

27

Fisher Kernels & Kullback-Leibler Divergence

Fisher Kernel: approximate distance between two generative models on statistical manifold using Kullback-Leibler

Approximate KL by quadratic form in local tangent space at ML estimate

From distance get affinity via Fisher Info & gradients: $K(\chi,\chi') = U_\chi^T I_{\theta^*}^{-1} U_{\chi'}$ with $U_\chi = \nabla_\theta \log p(\chi|\theta)\,|_{\theta^*}$

Heat Kernel: (Lafferty & Lebanon) better geodesic distance & affinity. Only solvable for multinomial (sphere) or covariance (hyperbolic)

[Figure: statistical manifold with models θ, θ’ and ML estimate θ*]

28

Hellinger Divergence & Bhattacharyya (COLT’03)
KL not symmetric, needs approximation, lose interesting nonlinearities

Instead: START with nice divergence, avoid approximation
Other choices & bounds for divergence (Topsoe, Dragomir):
Triangular, L1, Hellinger, Harmonic, Variational, Csiszar’s f-Div, …

Hellinger Divergence: $H(p,p') = \frac{1}{2}\int \left(\sqrt{p(x)} - \sqrt{p'(x)}\right)^2 dx$
Bhattacharyya Affinity: $B(p,p') = \int \sqrt{p(x)}\,\sqrt{p'(x)}\,dx$, so that $H(p,p') = 1 - B(p,p')$

Desiderata:
  Mercer Kernel? +ve? +ve definite? YES
  Symmetric? YES
  Computable for many distributions? YES
  Interesting nonlinear behavior? YES

29

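Probability Product Kernels
Computing the kernel: given inputs χ and χ’

1) Estimate Densities (i.e. ML): $\chi \rightarrow p(x)$, $\chi' \rightarrow p'(x)$

2) Compute Bhattacharyya Affinity: $K(\chi,\chi') = \int \sqrt{p(x)}\,\sqrt{p'(x)}\,dx$

Probability Product Kernel: $K(\chi,\chi') = \int p(x)^{\rho}\,p'(x)^{\rho}\,dx$

Bhattacharyya Kernel ($\rho = 1/2$): $K(\chi,\chi') = \int \sqrt{p(x)}\,\sqrt{p'(x)}\,dx$

Expected Likelihood Kernel ($\rho = 1$): $K(\chi,\chi') = \int p(x)\,p'(x)\,dx = E_{p}[p'(x)] = E_{p'}[p(x)]$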
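The two steps above can be sketched in Python for bags of tuples, under a simplifying assumption not made in the talk: each bag is modeled by a Gaussian with diagonal covariance, so the Bhattacharyya integral factorizes into 1-D closed forms. Function names are illustrative. Each bag needs nonzero spread in every dimension.

```python
import math


def fit_diagonal_gaussian(bag):
    """Step 1: maximum-likelihood mean and per-dimension variance of a bag."""
    n, dim = len(bag), len(bag[0])
    mean = [sum(t[d] for t in bag) / n for d in range(dim)]
    var = [sum((t[d] - mean[d]) ** 2 for t in bag) / n for d in range(dim)]
    return mean, var


def bhattacharyya_kernel(bag_a, bag_b):
    """Step 2: closed-form integral of sqrt(p(x) p'(x)) for the two
    diagonal Gaussians fitted to the bags."""
    mu_a, var_a = fit_diagonal_gaussian(bag_a)
    mu_b, var_b = fit_diagonal_gaussian(bag_b)
    value = 1.0
    for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
        value *= math.sqrt(2.0 * math.sqrt(va * vb) / (va + vb))
        value *= math.exp(-((ma - mb) ** 2) / (4.0 * (va + vb)))
    return value


bag_a = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
bag_a_shuffled = list(reversed(bag_a))
bag_b = [(3.0, 3.0), (5.0, 3.0), (3.0, 6.0), (5.0, 6.0)]

k_same = bhattacharyya_kernel(bag_a, bag_a)
k_shuffled = bhattacharyya_kernel(bag_a, bag_a_shuffled)   # order does not matter
k_diff = bhattacharyya_kernel(bag_a, bag_b)
```

Because the density estimate ignores tuple order, the kernel is permutation invariant without ever solving for a correspondence.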

30

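Exponential Family Product Kernels

Exponential family: $p(x|\theta) = \exp\left(A(x) + T(x)^T\theta - \mathcal{K}(\theta)\right)$

Many properties; includes Gaussian Mean, Covariance, Multinomial, Binomial, Poisson, Exponential, Gamma, Bernoulli, Dirichlet, …

Maximum likelihood is straightforward: $\frac{\partial \mathcal{K}(\theta)}{\partial \theta} = \frac{1}{N}\sum_{n} T(x_n)$

All have above form but different A(x), T(x), convex cumulant-generating function $\mathcal{K}(\theta)$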

31

Exponential Family Product Kernels

Compute the Bhattacharyya Kernel for the e-family: $K(\chi,\chi') = \int p(x|\theta)^{1/2}\,p(x|\theta')^{1/2}\,dx$

Analytic solution for e-family: $K(\chi,\chi') = \exp\left(\mathcal{K}\left(\tfrac{1}{2}\theta + \tfrac{1}{2}\theta'\right) - \tfrac{1}{2}\mathcal{K}(\theta) - \tfrac{1}{2}\mathcal{K}(\theta')\right)$

Only depends on convex cumulant-generating function $\mathcal{K}(\theta)$

Meanwhile, Fisher Kernel is always linear in sufficient stats…
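As a concrete check of the analytic e-family solution, here is a Python sketch for the Poisson distribution, chosen because it has a scalar natural parameter θ = log(rate) and cumulant function exp(θ). The Poisson example and the function names are illustrative choices. The direct summation is included only as an independent check.

```python
import math


def bhattacharyya_from_cumulant(cumulant, theta_a, theta_b):
    """exp(K((theta_a + theta_b)/2) - K(theta_a)/2 - K(theta_b)/2)
    for a scalar natural parameter and cumulant function K."""
    mid = 0.5 * (theta_a + theta_b)
    return math.exp(cumulant(mid) - 0.5 * cumulant(theta_a) - 0.5 * cumulant(theta_b))


def poisson_bhattacharyya(rate_a, rate_b):
    """Poisson case: natural parameter log(rate), cumulant K(theta) = exp(theta)."""
    return bhattacharyya_from_cumulant(math.exp, math.log(rate_a), math.log(rate_b))


def poisson_bhattacharyya_by_summation(rate_a, rate_b, terms=200):
    """Direct sum of sqrt(p(k) p'(k)) over counts k, as an independent check."""
    total = 0.0
    for k in range(terms):
        log_pa = -rate_a + k * math.log(rate_a) - math.lgamma(k + 1)
        log_pb = -rate_b + k * math.log(rate_b) - math.lgamma(k + 1)
        total += math.exp(0.5 * (log_pa + log_pb))
    return total


closed_form = poisson_bhattacharyya(2.0, 5.0)
summed = poisson_bhattacharyya_by_summation(2.0, 5.0)
```

Only the cumulant function is evaluated in the closed form. No integral or sum over the sample space is needed.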

32

Multinomial and Bernoulli Product Kernels

Bernoulli (binary): $p(x) = \prod_{d} \gamma_d^{x_d}(1-\gamma_d)^{1-x_d}$
$K(\chi,\chi') = \prod_{d} \left[(\gamma_d\gamma_d')^{\rho} + \left((1-\gamma_d)(1-\gamma_d')\right)^{\rho}\right]$

Multinomial (discrete): $p(x) = \prod_{d} \alpha_d^{x_d}$ with $\sum_d \alpha_d = 1$ and $\sum_d x_d = 1$
$K(\chi,\chi') = \sum_{d} (\alpha_d\alpha_d')^{\rho}$

For multinomial counts (for N words), with $\rho = 1/2$: $K(\chi,\chi') = \left[\sum_{d} (\alpha_d\alpha_d')^{1/2}\right]^{N}$

Fisher for Multinomial is linear

33

Multinomial Product Kernel for Text
WebKB dataset: Faculty vs. student web page SVM kernel classifier

20-Fold Cross-Validation, 1641 student & 1124 faculty, ...
Use Bhattacharyya Kernel on multinomial (word frequency)

[Figure panels: Training Set Size = 77; Training Set Size = 622]
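A minimal Python sketch of the Bhattacharyya kernel on word-frequency multinomials. The two toy “pages” are invented for illustration and have nothing to do with WebKB. Tokenization by whitespace and the function names are likewise assumptions.

```python
import math
from collections import Counter


def word_frequencies(text):
    """Maximum-likelihood multinomial parameters: relative word frequencies."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def bhattacharyya_multinomial(freq_a, freq_b):
    """Sum over the shared vocabulary of sqrt(alpha_w * alpha'_w).
    Words missing from either document contribute zero."""
    shared = set(freq_a) & set(freq_b)
    return sum(math.sqrt(freq_a[w] * freq_b[w]) for w in shared)


student = word_frequencies("homework course exam course")
faculty = word_frequencies("research course grant research")
k_sf = bhattacharyya_multinomial(student, faculty)
k_ss = bhattacharyya_multinomial(student, student)
```

The resulting Gram matrix over documents can be handed to any SVM implementation that accepts a precomputed kernel.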

34

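Gaussian Product Kernel

Gaussian Mean (continuous, any ρ), for $p = \mathcal{N}(\mu, \sigma^2 I)$ and $p' = \mathcal{N}(\mu', \sigma^2 I)$ in D dimensions:
$K(\chi,\chi') = (2\rho)^{-D/2}\,(2\pi\sigma^2)^{(1-2\rho)D/2}\,\exp\left(-\rho\,\|\mu-\mu'\|^2 / (4\sigma^2)\right)$

If µ=χ, µ’=χ’ get RBF Kernel
Fisher here is linear

Gaussian Covariance (any ρ), with $\Sigma^{\dagger} = (\Sigma^{-1} + \Sigma'^{-1})^{-1}$ and $\mu^{\dagger} = \Sigma^{-1}\mu + \Sigma'^{-1}\mu'$:
$K(\chi,\chi') = (2\pi)^{(1-2\rho)D/2}\,\rho^{-D/2}\,|\Sigma^{\dagger}|^{1/2}\,|\Sigma|^{-\rho/2}\,|\Sigma'|^{-\rho/2}\,\exp\left(-\tfrac{\rho}{2}\left(\mu^T\Sigma^{-1}\mu + \mu'^T\Sigma'^{-1}\mu' - \mu^{\dagger T}\Sigma^{\dagger}\mu^{\dagger}\right)\right)$

Fisher here is quadratic

But, how do we get a non-degenerate covariance from 1 data point?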
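A Python sketch of the Gaussian mean case in one dimension, assuming both densities share a fixed standard deviation `sigma`. It evaluates the closed form of the integral of p(x)^ρ p'(x)^ρ and shows that ρ = 1/2 gives an RBF kernel with bandwidth tied to `sigma`. The function names are illustrative.

```python
import math


def gaussian_mean_product_kernel(mu_a, mu_b, sigma, rho):
    """Integral of p(x)^rho * p'(x)^rho for 1-D Gaussians N(mu_a, sigma^2)
    and N(mu_b, sigma^2) sharing the same fixed standard deviation."""
    scale = (2.0 * rho) ** -0.5 * (2.0 * math.pi * sigma ** 2) ** (0.5 - rho)
    return scale * math.exp(-rho * (mu_a - mu_b) ** 2 / (4.0 * sigma ** 2))


def rbf(x_a, x_b, sigma):
    """RBF kernel with the bandwidth implied by rho = 1/2."""
    return math.exp(-((x_a - x_b) ** 2) / (8.0 * sigma ** 2))


# With the means set to the data points themselves, rho = 1/2 is exactly RBF.
k_half = gaussian_mean_product_kernel(0.0, 2.0, sigma=1.5, rho=0.5)
k_rbf = rbf(0.0, 2.0, sigma=1.5)
k_one = gaussian_mean_product_kernel(0.0, 2.0, sigma=1.5, rho=1.0)
```

For other values of ρ only the prefactor and the effective bandwidth change. The kernel remains a function of the squared distance between the means.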

35

Gaussian Product Kernels on Bags of Vectors
Instead of a single χ & χ’
Construct p & p’ from many χ & χ’
i.e. use bag of vectors

36

Kernelized Gaussian Product Kernel (ICML’03)

Bhattacharyya affinity on Gaussians on bags {χ,...} & {χ’,...}
Invariant: to order of tuples in each bag
But too simple: overlap of two Gaussian distributions on images

Need more detail than mean and covariance of pixels…

Use Kernel Trick again when computing Gaussian mean & covariance

Never compute outer products, use kernel, i.e. infinite RBF: $\kappa(\chi_i,\chi_j) = \exp\left(-\frac{1}{2\sigma^2}\|\chi_i-\chi_j\|^2\right)$

Compute mini-kernel between each pixel in a given image…
Gives kernelized or augmented Gaussian µ and Σ via Gram matrix

37

Kernelized Gaussian Product Kernel

Previously: Gaussian mean & covariance computed on the raw tuples χ

Now have: Gaussian mean & covariance computed in the Hilbert space of the κ kernel

Still invariant to order of pixels
Compute Hilbert Gaussian’s mean & covariance of each image
Bag or image is an N x N pixel Gram matrix using the κ kernel
Use kernel PCA to regularize infinite dimensional RBF Gaussian

Puts all dimensions (X,Y,I) on an equal footing

38

Kernelized Gaussian Product Kernel

Letter ‘R’ with 3 KPCA Components of RBF Kernel

Reconstruction of Letter ‘R’ with 1-4 KPCA with RBF Kernel

Reconstruction of Letter ‘R’ with 3 KPCA with RBF Kernel + Smoothing

39

Kernelized Gaussian Product Kernel
100 40x40 monochromatic images of crosses & squares translating & scaling

SVM: Train on 50, Test on 50

Fisher for Gaussian is Quadratic Kernel

RBF Kernel (red): 67% accuracy

Bhattacharyya (blue): 90% accuracy

40

Kernelized Gaussian Product Kernel
SVM Classification of NIST digit images 0,1,…,9
Sample each image to get bag of 30 (X,Y) pixels
Train on random 120, test on random 80

Bag-of-vectors Bhattacharyya outperforms standard RBF due to built-in invariance

Fisher Kernel for Gaussian is quadratic

41

Mixture Model Product Kernels
Beyond Exponential Family: Mixtures and Hidden Variables
Easier for ρ=1 Expected Likelihood kernel…

Mixture: $p(x) = \sum_{m=1}^{M} p(m)\,p(x|m)$, $p'(x) = \sum_{n=1}^{N} p'(n)\,p'(x|n)$

Kernel: $K(\chi,\chi') = \int p(x)\,p'(x)\,dx = \sum_{m=1}^{M}\sum_{n=1}^{N} p(m)\,p'(n) \int p(x|m)\,p'(x|n)\,dx$

Use M*N subkernel evaluations from our previous repertoire
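A Python sketch of the mixture kernel for ρ = 1, assuming 1-D Gaussian components so that each subkernel has a closed form (the integral of a product of two Gaussians is a Gaussian density evaluated at the difference of means). The mixtures and function names are made up for illustration.

```python
import math


def gaussian_expected_likelihood(mu_a, var_a, mu_b, var_b):
    """Integral of N(x; mu_a, var_a) * N(x; mu_b, var_b) over x (1-D),
    which equals the density N(mu_a; mu_b, var_a + var_b)."""
    var = var_a + var_b
    return math.exp(-((mu_a - mu_b) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_expected_likelihood(mix_a, mix_b):
    """Expected likelihood kernel between two 1-D Gaussian mixtures, each a
    list of (weight, mean, variance). Uses M * N component subkernels."""
    return sum(wa * wb * gaussian_expected_likelihood(ma, va, mb, vb)
               for wa, ma, va in mix_a
               for wb, mb, vb in mix_b)


mix_a = [(0.3, -1.0, 0.5), (0.7, 2.0, 1.0)]                     # M = 2
mix_b = [(0.5, 0.0, 1.0), (0.2, 1.0, 0.25), (0.3, 3.0, 2.0)]    # N = 3
k_ab = mixture_expected_likelihood(mix_a, mix_b)                # 6 subkernels
```

The double sum works because ρ = 1 keeps the integrand linear in each mixture. For other ρ the power of a sum does not split this way.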

42

Hidden Markov Model Product Kernels
Hidden Markov Models (sequences): $p(x) = \sum_{S} p(s_0)\,p(x_0|s_0) \prod_{t=1}^{T} p(s_t|s_{t-1})\,p(x_t|s_t)$

# of hidden configurations large

Kernel ($\rho = 1$): $K(\chi,\chi') = \sum_{S}\sum_{U} \int p(x, S)\,p'(x, U)\,dx$, summing over hidden sequences S of p and U of p’

Do we need to consider raw cross-product of hiddens?

43

Hidden Markov Model Product Kernels
Do we need to consider cross-product of hiddens? NO!

Take advantage of structure in HMMs via Bayesian network

Only compute subkernels for common parents: $\psi(s_t,u_t) = \int p(x_t|s_t)\,p'(x_t|u_t)\,dx_t$
Evaluate total of subkernels
Form +ve clique potential functions, sum via junction tree algorithm:
$K(\chi,\chi') = \sum_{S}\sum_{U} p(s_0)\,p'(u_0)\,\psi(s_0,u_0) \prod_{t=1}^{T} p(s_t|s_{t-1})\,p'(u_t|u_{t-1})\,\psi(s_t,u_t)$, with S and U the hidden sequences of p and p’
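A Python sketch of this idea for ρ = 1 and discrete emissions, written as a forward-style recursion over pairs of hidden states (on a chain this plays the role of the junction tree sum). The representation of an HMM as an `(initial, transition, emission)` triple and all function names are assumptions for illustration. The brute-force enumeration is only there to check the recursion.

```python
from itertools import product


def hmm_expected_likelihood(hmm_a, hmm_b, length):
    """Sum over all observation sequences of the given length of
    p(x) * p'(x), for two discrete-emission HMMs given as
    (initial, transition, emission) with emission[state][symbol]."""
    (pi_a, tr_a, em_a), (pi_b, tr_b, em_b) = hmm_a, hmm_b
    states_a, states_b = range(len(pi_a)), range(len(pi_b))
    symbols = range(len(em_a[0]))
    # Subkernel psi(s, u): overlap of the two emission distributions.
    psi = [[sum(em_a[s][x] * em_b[u][x] for x in symbols) for u in states_b]
           for s in states_a]
    # phi[s][u]: total weight of all joint paths ending in state pair (s, u).
    phi = [[pi_a[s] * pi_b[u] * psi[s][u] for u in states_b] for s in states_a]
    for _ in range(length - 1):
        phi = [[psi[s][u] * sum(phi[sp][up] * tr_a[sp][s] * tr_b[up][u]
                                for sp in states_a for up in states_b)
                for u in states_b] for s in states_a]
    return sum(phi[s][u] for s in states_a for u in states_b)


def hmm_likelihood(hmm, sequence):
    """Forward algorithm: probability of one observation sequence."""
    pi, tr, em = hmm
    states = range(len(pi))
    alpha = [pi[s] * em[s][sequence[0]] for s in states]
    for x in sequence[1:]:
        alpha = [em[s][x] * sum(alpha[sp] * tr[sp][s] for sp in states)
                 for s in states]
    return sum(alpha)


def brute_force_kernel(hmm_a, hmm_b, length):
    """Reference value: enumerate every observation sequence explicitly."""
    symbols = range(len(hmm_a[2][0]))
    return sum(hmm_likelihood(hmm_a, seq) * hmm_likelihood(hmm_b, seq)
               for seq in product(symbols, repeat=length))


hmm_a = ([0.6, 0.4], [[0.7, 0.3], [0.2, 0.8]], [[0.9, 0.1], [0.3, 0.7]])
hmm_b = ([0.5, 0.5], [[0.5, 0.5], [0.1, 0.9]], [[0.6, 0.4], [0.2, 0.8]])
k_fast = hmm_expected_likelihood(hmm_a, hmm_b, length=4)
k_slow = brute_force_kernel(hmm_a, hmm_b, length=4)
```

The recursion costs time linear in the sequence length and polynomial in the number of state pairs, whereas the enumeration grows exponentially with length.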

44

Hidden Markov Model Product Kernels
Protein Dataset: 480 protein sequences from SCOP dataset
Each ~200 discrete symbols (alphabet ~ 20)
Train 2-state HMM on each sequence (over-fitting)
SVM two class problem: family 2.1.1.4 vs. family 2.1
Training: 120 +ve, 120 -ve
Testing: 120 +ve, 120 -ve

45

Bayesian Network Product Kernels
Probability product over common sample space between any pair of Latent Bayesian Networks

Compute subkernels over all latent parents

Computations grow tractably with enlarged clique size of joint parents. Won’t get loopy if original graphs are non-loopy

[Figure: two latent Bayesian networks and their combined graph]

46

Intractable Distributions Product Kernels (Sampling)
What if non-parametric or intractable (loopy) distributions?

Sampling: approximate $\int p(x)\,f(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)$ with $x_i \sim p(x)$

By definition, generative models can:
1) Generate a Sample
2) Compute Likelihood of a Sample

Thus, approximate probability product via sampling:
$K(\chi,\chi') \approx \frac{\beta}{N}\sum_{x_i \sim p} p(x_i)^{\rho-1}\,p'(x_i)^{\rho} + \frac{1-\beta}{N'}\sum_{x_i' \sim p'} p(x_i')^{\rho}\,p'(x_i')^{\rho-1}$

Beta controls how much sampling from each distribution…
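A Python sketch of the sampling approximation for ρ = 1, where each term reduces to the average likelihood of one model's samples under the other. Two 1-D Gaussians stand in for the “intractable” models so the estimate can be compared to a closed form. The function names, sample sizes and seed are arbitrary choices for illustration.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def sampled_expected_likelihood(sample_a, pdf_a, sample_b, pdf_b, n_a, n_b, beta=0.5):
    """Monte Carlo estimate of the rho = 1 probability product kernel.
    sample_a()/sample_b() draw one point; pdf_a/pdf_b evaluate the likelihood."""
    term_a = sum(pdf_b(sample_a()) for _ in range(n_a)) / n_a   # E_p[p'(x)]
    term_b = sum(pdf_a(sample_b()) for _ in range(n_b)) / n_b   # E_p'[p(x)]
    return beta * term_a + (1.0 - beta) * term_b


random.seed(0)
estimate = sampled_expected_likelihood(
    sample_a=lambda: random.gauss(0.0, 1.0), pdf_a=lambda x: gaussian_pdf(x, 0.0, 1.0),
    sample_b=lambda: random.gauss(1.0, 1.0), pdf_b=lambda x: gaussian_pdf(x, 1.0, 1.0),
    n_a=20000, n_b=20000, beta=0.5)
exact = gaussian_pdf(0.0, 1.0, math.sqrt(2.0))   # closed form for these two Gaussians
```

Both terms are unbiased estimates of the same integral, so `beta` only trades off how the sampling budget and variance are split between the two models.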

47

Discussion
Use Generative Models & Sets in SVMs and Kernel Machines via Hellinger, Bhattacharyya, Expected Likelihood (Less Kernel Voodoo)

Avoid Approximation, Mercer Kernel, Symmetric, Nonlinear Behavior, and Computable for many distributions:

Exponential Family, Bernoulli, Multinomial, Gaussian, Kernelized Gaussian, Mixture Models, HMMs, Latent Bayes Nets, Sampling Methods, …

Future Work: Getting aggregated maximum likelihood solution to influence kernel
