A new implementation of k-MLE for mixture modelling of Wishart distributions


Christophe Saint-Jean, Frank Nielsen

Geometric Science of Information 2013

August 28, 2013 - Mines ParisTech

Application Context (1)


We are interested in clustering varying-length sets of multivariate observations of the same dimension p.

X₁ = [ 3.6  0.05  −4.…
       …                ]  (3 × 3),
…,
X_N = [ 5.3  −0.5  2.5
        3.6   0.5  3.5
        1.6  −0.5  4.6
       −1.6   0.5  5.1
       −2.9  −0.5  6.1 ]  (5 × 3)

The sample mean is a good feature, but not discriminative enough.

Second-order cross-product matrices ᵗXᵢXᵢ may capture some relations between (column) variables.

Application Context (2)


The problem is now the clustering of a set of p × p PSD matrices:

χ = { x₁ = ᵗX₁X₁, x₂ = ᵗX₂X₂, …, x_N = ᵗX_N X_N }

Examples of applications: multispectral/DTI/radar imaging, motion retrieval systems, ...


Outline of this talk


1. MLE and Wishart Distribution
   - Exponential Family and Maximum Likelihood Estimate
   - Wishart Distribution
   - Two sub-families of the Wishart Distribution

2. Mixture modeling with k-MLE
   - Original k-MLE
   - k-MLE for Wishart distributions
   - Heuristics for the initialization

3. Application to motion retrieval

Reminder : Exponential Family (EF)


An exponential family is a set of parametric probability distributions

EF = { p(x; λ) = p_F(x; θ) = exp{ ⟨t(x), θ⟩ + k(x) − F(θ) } | θ ∈ Θ }

Terminology:

λ source parameters.

θ natural parameters.

t(x) sufficient statistic.

k(x) auxiliary carrier measure.

F(θ) the log-normalizer: differentiable, strictly convex

Θ = { θ ∈ ℝ^D | F(θ) < ∞ } is an open convex set

Almost all commonly used distributions are EF members, with exceptions such as the uniform and Cauchy distributions.
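To make the canonical decomposition concrete, here is a minimal Python sketch (not part of pyMEF) that evaluates log p_F(x; θ) = ⟨t(x), θ⟩ + k(x) − F(θ) from user-supplied t, k and F, illustrated with the unit-variance Gaussian as a toy family.

```python
# Minimal sketch: evaluating log p_F(x; theta) = <t(x), theta> + k(x) - F(theta)
# for a generic EF, illustrated with the univariate Gaussian of unit variance
# (t(x) = x, k(x) = -x^2/2 - log(2*pi)/2, F(theta) = theta^2/2).
import numpy as np

def ef_logpdf(x, theta, t, k, F):
    return np.dot(np.atleast_1d(t(x)), np.atleast_1d(theta)) + k(x) - F(theta)

t = lambda x: x
k = lambda x: -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)
F = lambda th: 0.5 * th**2
print(ef_logpdf(1.0, 0.0, t, k, F))   # standard normal log-density at x = 1
```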

Reminder : Maximum Likelihood Estimate (MLE)


The Maximum Likelihood Estimate principle is a very common approach for fitting the parameters of a distribution:

θ̂ = argmax_θ L(θ; χ) = argmax_θ ∏_{i=1}^N p(xᵢ; θ) = argmin_θ −(1/N) ∑_{i=1}^N log p(xᵢ; θ)

assuming a sample χ = {x1, x2, ..., xN} of i.i.d observations.

The log-density has a convenient expression for EF members:

log p_F(x; θ) = ⟨t(x), θ⟩ + k(x) − F(θ)

It follows

θ̂ = argmax_θ ∑_{i=1}^N log p_F(xᵢ; θ) = argmax_θ ( ⟨ ∑_{i=1}^N t(xᵢ), θ ⟩ − N F(θ) )

MLE with EF


Since F is a strictly convex, differentiable function, the MLE exists and is unique:

∇F(θ̂) = (1/N) ∑_{i=1}^N t(xᵢ)

Ideally, we have a closed form :

θ̂ = (∇F)⁻¹( (1/N) ∑_{i=1}^N t(xᵢ) )

Otherwise, numerical methods including Newton-Raphson can be successfully applied.
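A minimal sketch of this numerical route, using a toy Poisson family (t(x) = x, F(θ) = exp θ) for which the closed form θ̂ = log η̄ is available as a check; the helper names are ours, not the paper's.

```python
# Minimal sketch: solve the MLE equation grad F(theta) = mean(t(x)) numerically.
# Toy example: Poisson family, t(x) = x, F(theta) = exp(theta)  (theta = log lambda).
import numpy as np
from scipy.optimize import root_scalar

def grad_F(theta):
    return np.exp(theta)            # gradient of the log-normalizer

x = np.random.default_rng(0).poisson(lam=3.5, size=1000)
eta_bar = np.mean(x)                # (1/N) sum_i t(x_i)

# Solve grad F(theta) = eta_bar  (here the closed form is theta = log(eta_bar)).
sol = root_scalar(lambda th: grad_F(th) - eta_bar, bracket=[-10.0, 10.0])
print(sol.root, np.log(eta_bar))    # both values should match
```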

Wishart Distribution


Definition (Central Wishart distribution)

The Wishart distribution characterizes empirical covariance matrices for zero-mean Gaussian samples:

W_d(X; n, S) = |X|^{(n−d−1)/2} exp{ −½ tr(S⁻¹X) } / ( 2^{nd/2} |S|^{n/2} Γ_d(n/2) )

where, for x > 0, Γ_d(x) = π^{d(d−1)/4} ∏_{j=1}^d Γ( x − (j−1)/2 ) is the multivariate gamma function.

Remarks : n > d − 1, E[X ] = nS

The multivariate generalization of the chi-square distribution.
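A minimal Python sketch of this density, written directly from the formula above and checked against scipy.stats.wishart; wishart_logpdf is our own name, not code from the paper.

```python
# Minimal sketch of the Wishart log-density, written from the formula above.
import numpy as np
from scipy.special import multigammaln
from scipy.stats import wishart

def wishart_logpdf(X, n, S):
    d = X.shape[0]
    _, logdet_X = np.linalg.slogdet(X)
    _, logdet_S = np.linalg.slogdet(S)
    return (0.5 * (n - d - 1) * logdet_X
            - 0.5 * np.trace(np.linalg.solve(S, X))   # -1/2 tr(S^{-1} X)
            - 0.5 * n * d * np.log(2.0)
            - 0.5 * n * logdet_S
            - multigammaln(0.5 * n, d))               # log Gamma_d(n/2)

d, n = 3, 7.0
S = np.eye(d)
X = wishart(df=n, scale=S).rvs(random_state=0)
print(wishart_logpdf(X, n, S), wishart(df=n, scale=S).logpdf(X))  # should agree
```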

Wishart Distribution as an EF


It’s an exponential family:

log W_d(X; θ_n, θ_S) = ⟨θ_n, log|X|⟩_ℝ + ⟨θ_S, −½X⟩_HS + k(X) − F(θ_n, θ_S)

with k(X) = 0 and

(θ_n, θ_S) = ( (n − d − 1)/2, S⁻¹ ),    t(X) = ( log|X|, −½X ),

F(θ_n, θ_S) = ( θ_n + (d+1)/2 ) ( d log 2 − log|θ_S| ) + log Γ_d( θ_n + (d+1)/2 )
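A minimal sketch of the log-normalizer F(θ_n, θ_S) and of the EF decomposition above; the function names are ours, and plugging θ_n = (n−d−1)/2, θ_S = S⁻¹ should reproduce the Wishart log-density of the previous sketch.

```python
# Minimal sketch of the log-normalizer F(theta_n, theta_S) of the Wishart as an EF.
import numpy as np
from scipy.special import multigammaln

def F(theta_n, theta_S):
    d = theta_S.shape[0]
    _, logdet = np.linalg.slogdet(theta_S)
    a = theta_n + 0.5 * (d + 1)
    return a * (d * np.log(2.0) - logdet) + multigammaln(a, d)

def log_wishart_from_EF(X, theta_n, theta_S):
    # <t(X), theta> + k(X) - F(theta), with t(X) = (log|X|, -X/2) and k(X) = 0
    _, logdet_X = np.linalg.slogdet(X)
    inner = theta_n * logdet_X + np.sum(theta_S * (-0.5 * X))   # HS inner product
    return inner - F(theta_n, theta_S)

# e.g. with theta_n = (n - d - 1)/2 and theta_S = inv(S), this matches wishart_logpdf above.
```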

MLE for Wishart Distribution


In the case of the Wishart distribution, a closed form would be obtained by solving the following system:

θ̂ = (∇F)⁻¹( (1/N) ∑_{i=1}^N t(xᵢ) )  ≡

    d log 2 − log|θ_S| + Ψ_d( θ_n + (d+1)/2 ) = η_n
    −( θ_n + (d+1)/2 ) θ_S⁻¹ = η_S                          (1)

with η_n and η_S the expectation parameters and Ψ_d the derivative of log Γ_d. Unfortunately, no closed-form solution is known.

Two sub-families of the Wishart Distribution (1)


Case n fixed (n = 2θ_n + d + 1):

F_n(θ_S) = (nd/2) log 2 − (n/2) log|θ_S| + log Γ_d(n/2)

k_n(X) = ((n − d − 1)/2) log|X|

Case S fixed (S = θ_S⁻¹):

F_S(θ_n) = ( θ_n + (d+1)/2 ) log|2S| + log Γ_d( θ_n + (d+1)/2 )

k_S(X) = −½ tr(S⁻¹X)

Two sub-families of the Wishart Distribution (2)


Both are exponential families and their MLE equations are solvable!

Case n fixed:

−(n/2) θ_S⁻¹ = (1/N) ∑_{i=1}^N ( −½ Xᵢ )   ⟹   θ̂_S = N n ( ∑_{i=1}^N Xᵢ )⁻¹        (2)

Case S fixed:

θ̂_n = Ψ_d⁻¹( (1/N) ∑_{i=1}^N log|Xᵢ| − log|2S| ) − (d+1)/2,    θ̂_n > 0              (3)

with Ψ_d⁻¹ the functional reciprocal of Ψ_d.
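A minimal sketch of the two closed-form updates, with Ψ_d written as a sum of univariate digammas and Ψ_d⁻¹ inverted numerically; the helper names and the bracketing interval are our assumptions.

```python
# Minimal sketch of the two closed-form updates (Eqs. 2 and 3).
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def Psi_d(x, d):
    # derivative of log Gamma_d: sum of univariate digammas psi(x - (j-1)/2), j = 1..d
    return sum(digamma(x - 0.5 * j) for j in range(d))

def update_theta_S(Xs, n):
    # Eq. (2): theta_S = N * n * (sum_i X_i)^{-1}, for n fixed
    N = len(Xs)
    return N * n * np.linalg.inv(sum(Xs))

def update_theta_n(Xs, S):
    # Eq. (3): theta_n = Psi_d^{-1}(mean_i log|X_i| - log|2S|) - (d+1)/2, for S fixed
    d = S.shape[0]
    target = (np.mean([np.linalg.slogdet(X)[1] for X in Xs])
              - np.linalg.slogdet(2.0 * S)[1])
    # numerical inversion of Psi_d; the argument must stay > (d-1)/2 and the
    # bracket assumes a reasonable target (the constraint theta_n > 0 is not enforced here)
    a = brentq(lambda v: Psi_d(v, d) - target, 0.5 * (d - 1) + 1e-6, 1e6)
    return a - 0.5 * (d + 1)
```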

An iterative estimator for the Wishart Distribution


Algorithm 1: An estimator for the parameters of the Wishart

Input: A sample X₁, X₂, ..., X_N of S_d⁺⁺
Output: Final values of θ_n and θ_S

Initialize θ_n with some value > 0;
repeat
    Update θ_S using Eq. 2 with n = 2θ_n + d + 1;
    Update θ_n using Eq. 3 with S the inverse matrix of θ_S;
until convergence of the likelihood;
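A minimal sketch of this loop, assuming the helpers update_theta_S, update_theta_n and log_wishart_from_EF from the previous sketches are in scope.

```python
# Minimal sketch of Algorithm 1, reusing update_theta_S / update_theta_n /
# log_wishart_from_EF from the earlier sketches (an assumption of this snippet).
import numpy as np

def fit_wishart(Xs, theta_n_init=1.0, n_iter=50, tol=1e-8):
    d = Xs[0].shape[0]
    theta_n, prev = theta_n_init, -np.inf
    for _ in range(n_iter):
        theta_S = update_theta_S(Xs, n=2.0 * theta_n + d + 1)    # Eq. (2)
        theta_n = update_theta_n(Xs, S=np.linalg.inv(theta_S))   # Eq. (3)
        ll = sum(log_wishart_from_EF(X, theta_n, theta_S) for X in Xs)
        if abs(ll - prev) < tol:          # convergence of the likelihood
            break
        prev = ll
    return theta_n, theta_S
```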

Questions and open problems


From a sample of Wishart matrices, the distribution parameters are recovered in a few iterations.

Major question: do we have an MLE? Probably ...

Minor question: what about sample size N = 1?

- Under-determined system
- Regularization by sampling around X₁

Mixture Models (MM)


An additive (finite) mixture is a flexible tool to model a more complex distribution m:

m(x) = ∑_{j=1}^k w_j p_j(x),    0 ≤ w_j ≤ 1,    ∑_{j=1}^k w_j = 1

where p_j are the component distributions of the mixture and w_j the mixing proportions.

In our case, we consider each p_j as a member of some parametric family (EF):

m(x; Ψ) = ∑_{j=1}^k w_j p_{F_j}(x; θ_j)

with Ψ = (w₁, w₂, ..., w_{k−1}, θ₁, θ₂, ..., θ_k)
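A minimal sketch of evaluating such a mixture at a point, assuming per-component log-densities are supplied as callables (the names are hypothetical, not pyMEF's API).

```python
# Minimal sketch: evaluating log m(x; Psi) = logsumexp_j( log w_j + log p_{F_j}(x; theta_j) ).
import numpy as np
from scipy.special import logsumexp

def mixture_logpdf(x, weights, log_pdfs):
    # weights: mixing proportions w_j; log_pdfs: callables returning log p_{F_j}(x; theta_j)
    return logsumexp([np.log(w) + lp(x) for w, lp in zip(weights, log_pdfs)])
```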

Expectation-Maximization is not fast enough [5] ...

Original k-MLE (primal form.) in one slide


Algorithm 2: k-MLE

Input: A sample χ = {x₁, x₂, ..., x_N}, F₁, F₂, ..., F_k Bregman generators
Output: Estimate Ψ̂ of the mixture parameters

A good initialization for Ψ̂ (see later);
repeat
    repeat
        foreach xᵢ ∈ χ do zᵢ = argmax_j log w_j p_{F_j}(xᵢ; θ_j);
        foreach C_j := {xᵢ ∈ χ | zᵢ = j} do θ̂_j = MLE_{F_j}(C_j);
    until convergence of the complete likelihood;
    Update mixing proportions: w_j = |C_j| / N;
until further convergence of the complete likelihood;
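A minimal sketch of this primal loop, assuming user-supplied log_pdf(x, θ) and mle(cluster) callables for the component family; for simplicity it uses fixed iteration counts instead of likelihood-based stopping and assumes non-empty clusters.

```python
# Minimal sketch of the primal k-MLE loop (Algorithm 2), with hypothetical
# log_pdf / mle callables supplied by the caller.
import numpy as np

def k_mle(X, thetas, weights, log_pdf, mle, n_outer=20, n_inner=20):
    X = np.asarray(X)
    k = len(thetas)
    for _ in range(n_outer):
        for _ in range(n_inner):
            # assignment: z_i = argmax_j ( log w_j + log p_{F_j}(x_i; theta_j) )
            scores = np.array([[np.log(w) + log_pdf(x, th)
                                for w, th in zip(weights, thetas)] for x in X])
            z = scores.argmax(axis=1)
            # MLE step per cluster (clusters assumed non-empty in this sketch;
            # see the Hartigan/Wong discussion later for the empty-cluster issue)
            thetas = [mle(X[z == j]) for j in range(k)]
        weights = [np.mean(z == j) for j in range(k)]   # update mixing proportions
    return thetas, weights, z
```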

k-MLE’s properties


Another formulation comes with the connection between EF and Bregman divergences [3]:

log p_F(x; θ) = −B_{F*}(t(x) : η) + F*(t(x)) + k(x)

The Bregman divergence B_F(· : ·) associated with a strictly convex and differentiable function F is B_F(p : q) = F(p) − F(q) − ⟨p − q, ∇F(q)⟩.
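A minimal sketch of this definition, using the squared Euclidean norm as a toy generator, for which the Bregman divergence reduces to the squared distance.

```python
# Minimal sketch: Bregman divergence B_F(p : q) = F(p) - F(q) - <p - q, grad F(q)>,
# with F(x) = ||x||^2 as a toy generator (then B_F is the squared distance).
import numpy as np

def bregman(p, q, F, grad_F):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return F(p) - F(q) - np.dot(p - q, grad_F(q))

F = lambda x: np.dot(x, x)
grad_F = lambda x: 2.0 * x
print(bregman([1.0, 2.0], [0.0, 0.0], F, grad_F))  # squared distance: 5.0
```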

Original k-MLE (dual form.) in one slide


Algorithm 3: k-MLE (dual formulation)

Input: A sample χ = {y₁ = t(x₁), y₂ = t(x₂), ..., y_N = t(x_N)}, F₁*, F₂*, ..., F_k* Bregman generators
Output: Ψ̂ = (w₁, w₂, ..., w_{k−1}, θ̂₁ = ∇F*(η̂₁), ..., θ̂_k = ∇F*(η̂_k))

A good initialization for Ψ̂ (see later);
repeat
    repeat
        foreach xᵢ ∈ χ do zᵢ = argmin_j [ B_{F_j*}(yᵢ : η_j) − log w_j ];
        foreach C_j := {xᵢ ∈ χ | zᵢ = j} do η̂_j = ∑_{xᵢ∈C_j} yᵢ / |C_j|;
    until convergence of the complete likelihood;
    Update mixing proportions: w_j = |C_j| / N;
until further convergence of the complete likelihood;

k-MLE for Wishart distributions


Practical considerations impose modifications of the algorithm:

During the assignment step, empty clusters may appear (high-dimensional data makes this worse).

A possible solution is to consider Hartigan and Wong's strategy [6] instead of Lloyd's strategy:

- Optimally transfer one observation at a time.
- Update the parameters of the involved clusters.
- Stop when no transfer is possible.

This should guarantee non-empty clusters [7], but it does not work when considering weighted clusters...

Fall back to an “old school” criterion: |C_{z_i}| > 1.

Experimentally shown to perform better in high dimension than Lloyd's strategy.

k-MLE - Hartigan and Wong


Criterion for potential transfer (Max):

log( w_{z_i} p_{F_{z_i}}(xᵢ; θ_{z_i}) ) / log( w_{z*_i} p_{F_{z*_i}}(xᵢ; θ_{z*_i}) ) < 1

with z*_i = argmax_j log w_j p_{F_j}(xᵢ; θ_j)

Update rules:

θ̂_{z_i} = MLE_{F_{z_i}}(C_{z_i} \ {xᵢ})
θ̂_{z*_i} = MLE_{F_{z*_i}}(C_{z*_i} ∪ {xᵢ})

OR

Criterion for potential transfer (Min):

( B_{F*}(yᵢ : η_{z*_i}) − log w_{z*_i} ) / ( B_{F*}(yᵢ : η_{z_i}) − log w_{z_i} ) < 1

with z*_i = argmin_j ( B_{F*}(yᵢ : η_j) − log w_j )

Update rules:

η_{z_i} = ( |C_{z_i}| η_{z_i} − yᵢ ) / ( |C_{z_i}| − 1 )
η_{z*_i} = ( |C_{z*_i}| η_{z*_i} + yᵢ ) / ( |C_{z*_i}| + 1 )
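A minimal sketch of one Hartigan-style transfer in the dual (expectation) parameters, using the incremental centroid updates above; the function and variable names are ours.

```python
# Minimal sketch of a single-point transfer (dual form): remove y_i from its
# current cluster z_i and add it to the best cluster z_star.
import numpy as np

def transfer_point(y_i, z_i, z_star, etas, counts):
    # incremental centroid updates: eta = (|C| * eta -/+ y_i) / (|C| -/+ 1)
    etas[z_i]    = (counts[z_i] * etas[z_i] - y_i) / (counts[z_i] - 1)
    etas[z_star] = (counts[z_star] * etas[z_star] + y_i) / (counts[z_star] + 1)
    counts[z_i] -= 1
    counts[z_star] += 1
    return etas, counts
```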

Towards a good initialization...


Classical initialization methods: random centers, random partition, furthest point (2-approximation), ...

A better approach is k-means++ [8]: “Sampling proportionally to the squared distance to the nearest center.”


Fast and greedy approximation: Θ(kN)

Probabilistic guarantee of a good initialization:

OPT_F ≤ k-means_F ≤ O(log k) OPT_F

The dual Bregman divergence B_{F*} may replace the squared distance.
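A minimal sketch of k-means++-style seeding with a Bregman divergence in place of the squared distance; bregman_div is a caller-supplied (hypothetical) divergence.

```python
# Minimal sketch of Bregman k-means++ seeding: sample each new seed with
# probability proportional to the divergence to the nearest already-chosen seed.
import numpy as np

def bregman_kmeanspp_seeds(Y, k, bregman_div, rng=np.random.default_rng(0)):
    Y = np.asarray(Y)
    seeds = [Y[rng.integers(len(Y))]]                      # first seed uniformly at random
    while len(seeds) < k:
        d = np.array([min(bregman_div(y, c) for c in seeds) for y in Y])
        probs = d / d.sum()                                # prop. to divergence to nearest seed
        seeds.append(Y[rng.choice(len(Y), p=probs)])
    return np.array(seeds)
```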

Heuristic to avoid fixing k


k-means requires fixing k, the number of clusters, in advance.

We propose on-the-fly cluster creation together with k-MLE++ (inspired by DP-k-means [9]):

“Create a cluster when there exist observations contributing too much to the loss function with the already selected centers.”


It may overestimate the number of clusters...

Initialization with DP-k-MLE++


Algorithm 4: DP-k-MLE++

Input: A sample y₁ = t(X₁), ..., y_N = t(X_N), F, λ > 0
Output: C, a subset of y₁, ..., y_N; k, the number of clusters

Choose the first seed C = {y_j}, for j uniformly random in {1, 2, ..., N};
repeat
    foreach yᵢ do compute pᵢ = B_{F*}(yᵢ : C) / ∑_{i′=1}^N B_{F*}(y_{i′} : C)
        where B_{F*}(yᵢ : C) = min_{c∈C} B_{F*}(yᵢ : c);
    if ∃ pᵢ > λ then
        Choose the next seed s among y₁, y₂, ..., y_N with probability pᵢ;
        Add the selected seed to C: C = C ∪ {s};
until all pᵢ ≤ λ;
k = |C|;
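A minimal sketch of this seeding rule; bregman_div is again a caller-supplied (hypothetical) divergence, and lam plays the role of λ in Algorithm 4.

```python
# Minimal sketch of DP-k-MLE++ seeding: keep adding seeds while some point still
# contributes more than a fraction lam of the total loss.
import numpy as np

def dp_kmle_pp(Y, bregman_div, lam, rng=np.random.default_rng(0)):
    Y = np.asarray(Y)
    C = [Y[rng.integers(len(Y))]]                       # first seed chosen uniformly
    while True:
        d = np.array([min(bregman_div(y, c) for c in C) for y in Y])
        p = d / d.sum()                                 # contribution of each point to the loss
        if np.all(p <= lam):
            break
        C.append(Y[rng.choice(len(Y), p=p)])            # sample the next seed with prob. p_i
    return np.array(C), len(C)                          # seeds and the number of clusters k
```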

Motion capture


Real dataset: motion capture of contemporary dancers (15 sensors in 3D).

Application to motion retrieval (1)


Motion capture data can be viewed as matrices Xᵢ with different row sizes but the same column size d.

The idea is to describe each Xᵢ through the parameters Ψᵢ of one mixture model.

Remark: the size of each sub-motion is known (and so is its θ_n).

Mixture parameters can be viewed as a sparse representation of the local dynamics in Xᵢ.


Application to motion retrieval (2)


Comparing two movements amounts to computing a dissimilarity measure between Ψᵢ and Ψⱼ.

Remark 1: with DP-k-MLE++, the two mixtures would probably not have the same number of components.

Remark 2: when both mixtures have one component, a natural choice is

KL( W_d(·; θ) ‖ W_d(·; θ′) ) = B_{F*}(η : η′) = B_F(θ′ : θ)

A closed form is always available!

No closed form exists for the KL divergence between general mixtures.

Application to motion retrieval (3)


A possible solution is to use the Cauchy-Schwarz (CS) divergence [10]:

CS(m : m′) = − log [ ∫ m(x) m′(x) dx / √( ∫ m(x)² dx · ∫ m′(x)² dx ) ]

It has an analytic formula for

∫ m(x) m′(x) dx = ∑_{j=1}^k ∑_{j′=1}^{k′} w_j w′_{j′} exp( F(θ_j + θ′_{j′}) − ( F(θ_j) + F(θ′_{j′}) ) )

Note that this expression is well defined since the natural parameter space Θ = ℝ₊* × S_p⁺⁺ is a convex cone.
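A minimal sketch of the CS divergence between two Wishart mixtures, reusing the log-normalizer F(θ_n, θ_S) from the earlier sketch (an assumption of this snippet); component parameters are passed as (θ_n, θ_S) pairs.

```python
# Minimal sketch of the CS divergence between two Wishart mixtures, reusing F from
# the earlier log-normalizer sketch. thetas = list of (theta_n, theta_S) pairs.
import numpy as np

def integral_mm(weights1, thetas1, weights2, thetas2):
    # sum_j sum_j' w_j w'_j' exp( F(theta_j + theta'_j') - F(theta_j) - F(theta'_j') )
    total = 0.0
    for w1, (tn1, tS1) in zip(weights1, thetas1):
        for w2, (tn2, tS2) in zip(weights2, thetas2):
            total += w1 * w2 * np.exp(F(tn1 + tn2, tS1 + tS2) - F(tn1, tS1) - F(tn2, tS2))
    return total

def cs_divergence(w1, th1, w2, th2):
    num = integral_mm(w1, th1, w2, th2)
    den = np.sqrt(integral_mm(w1, th1, w1, th1) * integral_mm(w2, th2, w2, th2))
    return -np.log(num / den)
```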

Implementation


Early specific code in Matlab™.

Today, the implementation is in Python (based on pyMEF [2]).

Ongoing proof of concept (with Herranz F., Beurive A.)

Conclusions - Future works


Still some mathematical work to be done:

Solve the MLE equations to get ∇F* = (∇F)⁻¹, then F*.

Characterize our estimator for the full Wishart distribution.

Complete and validate the prototype system for motion retrieval.

Speed up the algorithm: computational/numerical/algorithmic tricks.

A library for Bregman divergence learning?

Possible extensions:

- Reintroduce the mean vector in the model: Gaussian-Wishart
- Online k-means -> online k-MLE ...

References I


Nielsen, F.: k-MLE: A fast algorithm for learning statistical mixture models. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2012) pp. 869–872

Schwander, O., Nielsen, F.: pyMEF - A framework for Exponential Families in Python. In: Proceedings of the 2011 IEEE Workshop on Statistical Signal Processing

Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6 (2005) 1705–1749

Nielsen, F., Garcia, V.: Statistical exponential families: A digest with flash cards. http://arxiv.org/abs/0911.4863 (11 2009)

Hidot, S., Saint-Jean, C.: An Expectation-Maximization algorithm for the Wishart mixture model: Application to movement clustering. Pattern Recognition Letters 31(14) (2010) 2318–2324

References II


Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1) (1979) 100–108

Telgarsky, M., Vattani, A.: Hartigan's method: k-means clustering without Voronoi. In: Proc. of International Conference on Artificial Intelligence and Statistics (AISTATS). (2010) pp. 820–827

Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms (2007) pp. 1027–1035

Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics. In: International Conference on Machine Learning (ICML). (2012)

Nielsen, F.: Closed-form information-theoretic divergences for statistical mixtures. In: International Conference on Pattern Recognition (ICPR). (2012) pp. 1723–1726
