1
Information Geometry of
Self-organizing maximum likelihood
Shinto Eguchi ISM, GUAS
This talk is based on joint research with Dr. Yutaka Kano, Osaka University.
Bernoulli 2000 Conference at RIKEN, 27 October 2000
2
Maximum Likelihood Estimation (MLE)

Consider a statistical model:  M = { f(x, θ) : θ ∈ Θ }.

Let (x_1, …, x_n) be data from f(x, θ).

MLE (Fisher, 1922):

    θ̂ = argmax_θ (1/n) Σ_{i=1}^n log f(x_i, θ)

Consistency, efficiency, sufficiency, unbiasedness, invariance, information.

Ψ-MLE: take an increasing function Ψ and maximize the transformed log-likelihood:

    θ̂_Ψ = argmax_θ (1/n) Σ_{i=1}^n Ψ(log f(x_i, θ))
3
Normal density:  f(x, θ) = (1/√(2π)) exp(−(x − θ)²/2)

MLE:  θ̂ = (1/n) Σ_{i=1}^n x_i

Ψ-MLE:

    θ̂_Ψ = argmax_θ Σ_{i=1}^n Ψ(log f(x_i, θ))
        = arg solve_θ  Σ_{i=1}^n φ(x_i, θ)(x_i − θ) = 0,   φ(x, θ) = Ψ'(log f(x, θ)),

so that the estimate is a weighted mean,

    θ̂_Ψ = Σ_{i=1}^n φ(x_i, θ̂_Ψ) x_i / Σ_{i=1}^n φ(x_i, θ̂_Ψ).

Given data {x_i : i = 1, …, n}, iterate

    θ_{t+1} = Σ_{i=1}^n φ(x_i, θ_t) x_i / Σ_{i=1}^n φ(x_i, θ_t)   (t = 1, 2, …).
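The weighted-mean iteration is easy to sketch in code. Taking Ψ_β(z) = (e^{βz} − 1)/β, the weight is φ(x, θ) = Ψ'_β(log f(x, θ)) = f(x, θ)^β (function names and data below are illustrative, not from the talk):

```python
import numpy as np

def normal_pdf(x, theta):
    # f(x, theta): N(theta, 1) density
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

def psi_mle_normal_mean(x, beta=0.5, n_iter=50):
    """Fixed-point iteration theta_{t+1} = sum w_i x_i / sum w_i,
    with weights w_i = f(x_i, theta_t)^beta (the Psi'-weights for Psi_beta)."""
    theta = np.median(x)            # robust starting point
    for _ in range(n_iter):
        w = normal_pdf(x, theta) ** beta
        theta = np.sum(w * x) / np.sum(w)
    return theta

# bulk of the data near 0, plus one gross outlier
x = np.array([-0.6, -0.3, -0.1, 0.0, 0.2, 0.4, 0.5, 10.0])
mle = x.mean()                                   # pulled toward the outlier
robust = psi_mle_normal_mean(x, beta=0.5)        # stays near 0
print(mle, robust)
```

The outlier's weight f(10, θ)^β is essentially zero once θ sits near the bulk, so the iteration self-organizes around the main cluster.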
4
[Figure: normal density fitted to data containing an outlier. The MLE θ̂ is pulled toward the outlier, while the Ψ-MLE θ̂_1 stays near the center of the main data.]
5
For an increasing function Ψ, define

    Ψ*(z) := ∫_0^z e^s dΨ(s),

    d_Ψ(a, b) := a{Ψ(log a) − Ψ(log b)} − {Ψ*(log a) − Ψ*(log b)}
               = ∫_{log b}^{log a} (a − e^z) dΨ(z).

Ψ-divergence:

    D_Ψ(g, f) := ∫ d_Ψ(g(x), f(x)) dν(x)
               = ∫ g{Ψ(log g) − Ψ(log f)} dν − ∫ {Ψ*(log g) − Ψ*(log f)} dν

    d_Ψ(a, b) ≥ 0 (= 0 iff a = b);   D_Ψ(g, f) ≥ 0 (= 0 iff f = g a.e.)
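The building block d_Ψ(a, b) = a{Ψ(log a) − Ψ(log b)} − {Ψ*(log a) − Ψ*(log b)} = ∫_{log b}^{log a} (a − e^z) dΨ(z) can be checked numerically; a sketch with Ψ_β(z) = (e^{βz} − 1)/β (all names illustrative):

```python
import numpy as np

beta = 0.5

def psi(z):        # Psi_beta
    return (np.exp(beta * z) - 1.0) / beta

def psi_star(z):   # Psi*_beta = int_0^z e^s Psi'(s) ds
    return (np.exp((beta + 1) * z) - 1.0) / (beta + 1)

def d_closed(a, b):
    # a{Psi(log a) - Psi(log b)} - {Psi*(log a) - Psi*(log b)}
    return (a * (psi(np.log(a)) - psi(np.log(b)))
            - (psi_star(np.log(a)) - psi_star(np.log(b))))

def d_integral(a, b, m=200001):
    # trapezoid rule for int_{log b}^{log a} (a - e^z) Psi'(z) dz,
    # with Psi'_beta(z) = e^{beta z}
    lo, hi = np.log(b), np.log(a)
    z = np.linspace(lo, hi, m)
    f = (a - np.exp(z)) * np.exp(beta * z)
    return (hi - lo) / (m - 1) * (f.sum() - 0.5 * (f[0] + f[-1]))

for a, b in [(0.2, 1.3), (2.0, 0.5), (1.0, 1.0)]:
    assert abs(d_closed(a, b) - d_integral(a, b)) < 1e-4
    assert d_closed(a, b) >= -1e-12   # d_Psi >= 0, equality iff a = b
print("ok")
```

The integral form makes the nonnegativity visible: between log b and log a the factor (a − e^z) never changes sign against the direction of integration.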
6
Examples
(1)  Ψ(z) = z,  Ψ*(z) = e^z:

     D_Ψ(g, f) = KL(g, f) = ∫ g(y) log( g(y)/f(y) ) dν(y)

(2)  Ψ_β(z) = (e^{βz} − 1)/β,  Ψ*_β(z) = (e^{(β+1)z} − 1)/(β+1):

     D_β(g, f) = (1/β) ∫ {g(y)^β − f(y)^β} g(y) dν(y)
                 − (1/(β+1)) ∫ {g(y)^{β+1} − f(y)^{β+1}} dν(y)   (β-divergence)

(3)  Ψ_η(z) = log( (e^z + η)/(1 + η) ),  Ψ*_η(z) = e^z − 1 − η log( (e^z + η)/(1 + η) ):

     D_η(g, f) = ∫ {g(y) + η} log( (g(y) + η)/(f(y) + η) ) dν(y)   (η-divergence)

As β → 0 and η → 0,  Ψ_β(z) → z  and  Ψ_η(z) → z:
both the β-divergence and the η-divergence tend to the KL-divergence.
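The limit β → 0 can be illustrated numerically on a discretized pair of densities (a sketch; the grid discretization and names are assumptions, not from the talk):

```python
import numpy as np

# two densities on a grid: standard normal g vs. shifted normal f
y = np.linspace(-10, 10, 4001)
dy = y[1] - y[0]
g = np.exp(-0.5 * y ** 2) / np.sqrt(2 * np.pi)
f = np.exp(-0.5 * (y - 1.0) ** 2) / np.sqrt(2 * np.pi)

def kl(g, f):
    # KL(g, f) = int g log(g/f) dy  (Riemann sum)
    return np.sum(g * np.log(g / f)) * dy

def d_beta(g, f, beta):
    # (1/beta) int (g^b - f^b) g dy - (1/(b+1)) int (g^{b+1} - f^{b+1}) dy
    t1 = np.sum((g ** beta - f ** beta) * g) * dy / beta
    t2 = np.sum(g ** (beta + 1) - f ** (beta + 1)) * dy / (beta + 1)
    return t1 - t2

# KL(N(0,1), N(1,1)) = 0.5; D_beta approaches it as beta shrinks
print(kl(g, f), d_beta(g, f, 0.3), d_beta(g, f, 0.01))
```

For this pair KL(g, f) = 0.5 exactly, and D_β with β = 0.01 already agrees to a few hundredths.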
7
[Figure: graphs of Ψ_β, Ψ*_β and of Ψ_η, Ψ*_η over −1 ≤ z ≤ 1 for several parameter values (β, η = 0.9, 0.5, 0.3, and values near 0); as the parameter tends to 0 the curves approach Ψ(z) = z and Ψ*(z) = e^z − 1.]
8
Pythagorean theorem

If  D_Ψ(g, h) = D_Ψ(g, f) + D_Ψ(f, h),  then

    D_Ψ(g_t, h_s) = D_Ψ(g_t, f) + D_Ψ(f, h_s)   (0 < t < 1, 0 < s < 1),

where

    g_t(x) = (1 − t) f(x) + t g(x)                     (mixture geodesic from f to g),
    Ψ(log h_s(x)) = (1 − s) Ψ(log f(x)) + s Ψ(log h(x))  (Ψ-exponential geodesic from f to h).

[Diagram: the rectangle of densities indexed by (t, s), with f at (0,0), g at (1,0), h at (0,1); g_t runs along one side and h_s along the other.]
9
(Pf)  From the definition of D_Ψ,

    D_Ψ(g, h) − D_Ψ(g, f) − D_Ψ(f, h) = ∫ (g − f){Ψ(log f) − Ψ(log h)} dν,

since  D_Ψ(g, f) = ∫ g{Ψ(log g) − Ψ(log f)} dν − ∫ {Ψ*(log g) − Ψ*(log f)} dν,  etc.

Hence, along the geodesics g_t and h_s,

    D_Ψ(g_t, h_s) − D_Ψ(g_t, f) − D_Ψ(f, h_s)
      = ∫ (g_t − f){Ψ(log f) − Ψ(log h_s)} dν
      = t s ∫ (g − f){Ψ(log f) − Ψ(log h)} dν
      = t s { D_Ψ(g, h) − D_Ψ(g, f) − D_Ψ(f, h) } = 0.
10
Differential geometry of D_Ψ

The Ψ-divergence D_Ψ(g, f) on M = { f(x, θ) } induces a triple (G*, ∇, ∇*):

Riemann metric:      G*_ij(θ) = −(∂²/∂θ_i ∂θ'_j) D_Ψ(f_θ, f_{θ'}) |_{θ' = θ}

Affine connection:   Γ_ij,k(θ) = −(∂³/∂θ_i ∂θ_j ∂θ'_k) D_Ψ(f_θ, f_{θ'}) |_{θ' = θ}

Conjugate affine connection:
                     Γ*_ij,k(θ) = −(∂³/∂θ'_i ∂θ'_j ∂θ_k) D_Ψ(f_θ, f_{θ'}) |_{θ' = θ}

cf. Eguchi (1992).  Compare Csiszár's divergence
    D_Φ(g, f) = ∫ f(x) Φ( g(x)/f(x) ) dν(x)   (Φ convex).
11
β-divergence:

    D_β(g, f) = (1/β) ∫ (g^β − f^β) g dν − (1/(β+1)) ∫ (g^{β+1} − f^{β+1}) dν

β-metric:           G^β_ij(θ) = E_f[ f^β S_i S_j ],   S_i = ∂ log f / ∂θ_i
β-skewness tensor:  T^β_ij,k(θ) = Γ_ij,k(θ) − Γ*_ij,k(θ) = (1 − β) E_f[ f^β S_i S_j S_k ]
(cf. Lauritzen, 1987)

information metric:  G_ij(θ) = E_f[ S_i S_j ]
skewness tensor:     T_ij,k(θ) = E_f[ S_i S_j S_k ]

mixture geodesic:  f_t(x) = (1 − t) f(x) + t g(x)   (0 < t < 1)
β-geodesic:        f_t(x) = { (1 − t) f(x)^β + t g(x)^β }^{1/β}   (0 < t < 1)

Compare the β-divergence with Amari's α-divergence.
12
Ψ-likelihood function

    L_Ψ(θ) = (1/n) Σ_{i=1}^n Ψ(log f(x_i, θ)) − b_Ψ(θ),   b_Ψ(θ) = ∫ Ψ*(log f(y, θ)) dν(y)

M-estimation (Huber, 1964, 1983):  ρ(x, θ) = Ψ(log f(x, θ)) − b_Ψ(θ)

    D_Ψ(g, f_θ) = ∫ g{Ψ(log g) − Ψ(log f_θ)} dν − ∫ {Ψ*(log g) − Ψ*(log f_θ)} dν
                = const(g) − { E_g[Ψ(log f_θ)] − b_Ψ(θ) },

so minimizing D_Ψ(g, f_θ) in θ is maximizing the expected Ψ-likelihood.

For Ψ(z) = z:  b_Ψ(θ) = ∫ f(y, θ) dν(y) = 1,  L(θ) = (1/n) Σ_{i=1}^n log f(x_i, θ) − 1,
and  KL(g, f) = ∫ g log g dν − ∫ g log f dν:
Kullback-Leibler and maximum likelihood.
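For the normal model N(μ, σ²) with Ψ_β the bias term has a closed form, b_β(μ, σ) = (2πσ²)^{−β/2} (β+1)^{−3/2} (dropping additive constants in θ), so L_Ψ can be maximized directly. A grid-search sketch, with illustrative names and synthetic data:

```python
import numpy as np
from itertools import product

def beta_lik(x, mu, sigma, beta):
    # (1/n) sum f(x_i)^beta / beta  -  b_beta(mu, sigma), constants dropped;
    # for N(mu, sigma^2): b_beta = (2 pi sigma^2)^(-beta/2) (beta+1)^(-3/2)
    f = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    b = (2 * np.pi * sigma ** 2) ** (-beta / 2) * (beta + 1) ** (-1.5)
    return np.mean(f ** beta) / beta - b

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 95),
                    rng.normal(8.0, 0.5, 5)])    # 5% outliers near 8

grid = product(np.linspace(-1.0, 2.0, 61), np.linspace(0.3, 4.0, 75))
mu_b, sg_b = max(grid, key=lambda p: beta_lik(x, p[0], p[1], beta=0.5))
mu_ml, sg_ml = x.mean(), x.std()                 # the (non-robust) MLE
print((mu_b, sg_b), (mu_ml, sg_ml))
```

The bias term is what prevents degenerate solutions (e.g. a tiny σ spiked on the outlier cluster): b_β blows up as σ → 0 faster than the data term.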
13
Another definition of Ψ-likelihood

Take a positive function ξ(x, θ) and define

    g_Ψ(θ) = (1/n) Σ_{i=1}^n { ξ(x_i, θ) S(x_i, θ) − E_f[ ξ(y, θ) S(y, θ) ] },

where  S(x, θ) = (∂/∂θ) log f(x, θ).

If  ∂g_Ψ(θ)/∂θ^T = ( ∂g_Ψ(θ)/∂θ^T )^T  (g_Ψ is integrable as a vector field),
then g_Ψ has a potential function:  g_Ψ(θ) = (∂/∂θ) L_Ψ(θ),  with  ξ(x, θ) = Ψ'(log f(x, θ)).

The Ψ-likelihood equation is a weighted score equation with integrability.
14
Consistency of Ψ-MLE

Let x_1, …, x_n be iid with density f(x, θ_0).  Then  L_Ψ(θ) → E_{θ_0}[ L_Ψ(θ) ]  a.s., and

    E_{θ_0}[ L_Ψ(θ) ] − E_{θ_0}[ L_Ψ(θ_0) ]
      = ∫ { Ψ(log f(x, θ)) − Ψ(log f(x, θ_0)) } f(x, θ_0) dν(x)
        − ∫ { Ψ*(log f(x, θ)) − Ψ*(log f(x, θ_0)) } dν(x)
      = −D_Ψ(f_{θ_0}, f_θ) ≤ 0,

so  θ_0 = argmax_θ E_{θ_0}[ L_Ψ(θ) ],  and hence  θ̂_Ψ = argmax_θ L_Ψ(θ) → θ_0  a.s.
15
Influence function

    θ̂(G) := argmax_θ L_Ψ(θ, G),   L_Ψ(θ, G) = ∫ Ψ(log f(y, θ)) dG(y) − b_Ψ(θ)

Fisher consistency:  θ̂(F_θ) = θ.

ε-contamination model of F:  G_ε = (1 − ε) F + ε Δ_y   (Δ_y the point mass at y)

    IF(y, θ̂) = lim_{ε→0} { θ̂(G_ε) − θ̂(F) } / ε = J^{−1} { ξ(y, θ) S(y, θ) − E_F[ ξ S ] },
    J = E[ ξ(y, θ) S(y, θ) S(y, θ)^T ]

Asymptotic efficiency:  √n (θ̂_Ψ − θ) →_D N(0, V_Ψ),   V_Ψ = J^{−1} Var( ξ(y, θ) S(y, θ) ) J^{−1}

Robustness or Efficiency
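For the normal-mean case with Ψ_β we have ξ(y, θ) = f(y, θ)^β and S(y, θ) = y − θ, with E_F[ξS] = 0 by symmetry, so the influence function is proportional to f(y, θ)^β (y − θ): bounded and redescending, unlike the MLE's y − θ. A sketch (names illustrative):

```python
import numpy as np

theta, beta = 0.0, 0.5

def f(y):
    # N(theta, 1) density
    return np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)

def if_mle(y):
    # influence function of the MLE, up to J^{-1}: the score y - theta
    return y - theta

def if_beta(y):
    # Psi-MLE: xi(y) * score = f(y)^beta * (y - theta), redescending
    return f(y) ** beta * (y - theta)

ys = np.array([1.0, 2.0, 5.0, 10.0])
print(if_mle(ys))    # grows without bound
print(if_beta(ys))   # rises, then decays back toward 0
```

This is the robustness/efficiency trade-off in miniature: a larger β flattens the influence of distant points at the price of some efficiency at the model.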
16
Generalized linear model

    f(y, θ) = exp{ (yθ − c(θ)) / a(φ) + b(y, φ) }

    E(y) = c'(θ) = μ,   Var(y) = a(φ) c''(θ) = a(φ) V(μ)

Regression model:  E(y | x) = μ(x) = h(θ^T x),  or  θ^T x = h^{−1}( E(y | x) )

Ψ-likelihood function, given data (y_i, x_i) (i = 1, …, n):

    L_Ψ(θ) = (1/n) Σ_{i=1}^n Ψ(log f(y_i, x_i^T θ)) − (1/n) Σ_{i=1}^n ∫ Ψ*(log f(y, x_i^T θ)) dν(y)

Estimating equation:

    Σ_{i=1}^n ξ(y_i, x_i^T θ)(y_i − μ(x_i^T θ)) x_i
      − Σ_{i=1}^n ∫ ξ(y, x_i^T θ)(y − μ(x_i^T θ)) f(y, x_i^T θ) dν(y) x_i = 0

Variance functions V(μ):
    Normal            1
    Poisson           μ
    Bernoulli         μ(1 − μ)
    Gamma             μ²
    Inverse Gaussian  μ³
17
Bernoulli regression
    f(y | x, θ) = p(x, θ)^{(1+y)/2} (1 − p(x, θ))^{(1−y)/2},   y ∈ {−1, +1},
    p(x, θ) = Prob(y = 1 | x)

MLE:    argmax_θ Σ_{i=1}^n { ((1+y_i)/2) log p(x_i, θ) + ((1−y_i)/2) log(1 − p(x_i, θ)) }

Ψ-MLE:  argmax_θ Σ_{i=1}^n { ((1+y_i)/2) Ψ(log p(x_i, θ)) + ((1−y_i)/2) Ψ(log(1 − p(x_i, θ))) }

Logistic regression:  p(x, θ) = 1 / (1 + exp(−θ^T x))
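A minimal sketch of the Ψ-MLE for logistic regression with Ψ_β, fitted by plain gradient ascent; β = 0 recovers the ordinary MLE, and Ψ'_β(log p) = p^β supplies the downweighting of improbable labels (data, names, and step sizes below are illustrative, not from the talk):

```python
import numpy as np

def fit_logistic(X, y, beta=0.0, lr=0.1, n_iter=4000):
    # gradient ascent on sum_i {((1+y_i)/2) Psi(log p_i) + ((1-y_i)/2) Psi(log(1-p_i))}
    # with Psi = Psi_beta, so Psi'(log p) = p**beta; beta = 0 is the ordinary MLE
    th = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ th))
        grad_i = (1 + y) / 2 * p ** beta * (1 - p) - (1 - y) / 2 * (1 - p) ** beta * p
        th += lr * (grad_i @ X) / len(y)
    return th

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x1])
y = np.where(rng.random(n) < 1 / (1 + np.exp(-3.0 * x1)), 1.0, -1.0)
y[np.argsort(x1)[-10:]] = -1.0            # mislabel the 10 largest-x points

th_mle = fit_logistic(X, y, beta=0.0)     # slope attenuated by the mislabels
th_psi = fit_logistic(X, y, beta=0.5)     # mislabels get weight (1-p)^beta * p, near 0
print(th_mle, th_psi)
```

A mislabeled point deep inside the opposite class has p near 1, so its Ψ-gradient contribution (1 − p)^β p is small, while for the MLE it is p, which is what drags the MLE's slope down.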
18
Misclassification model

    f*(y | x, θ) = (1 − ε) p(x, θ)^{(1+y)/2} (1 − p(x, θ))^{(1−y)/2}
                   + ε p(x, θ)^{(1−y)/2} (1 − p(x, θ))^{(1+y)/2}

i.e.  y_i ~ f*(y | x_i, θ*): each label is flipped with probability ε.  Writing
p*(x, θ) = (1 − ε) p(x, θ) + ε (1 − p(x, θ)),

    θ̂_ε = arg solve_θ Σ_{i=1}^n (∂/∂θ) { ((1+y_i)/2) log p*(x_i, θ)
                                          + ((1−y_i)/2) log(1 − p*(x_i, θ)) } = 0,

with  θ̂_ε → θ̂*  as  ε → 0.

    ‖ IF(θ̂_j, z) ‖ → 0   as   z → ∞

[Figure: the influence of an extreme observation on the misclassification-model estimate redescends to 0, whereas ‖IF(θ̂_j, z)‖ for the logistic MLE is unbounded.]
19
Logistic Discrimination

Group I:  {x_1, …, x_180} from a bivariate normal N(μ_I, V_I) centered at the origin;
Group II: from a bivariate normal N(μ_II, V_II) centered away from the origin;
labels {y_1, …, y_200}.  Mislabels: 5 observations in Group I and 35 in Group II.

[Figure: scatter plot of the two groups with the mislabeled points marked, and the fitted linear classifiers y = sgn(θ̂_0 + θ̂^T x): the logistic classifier, logistic with mislabels, and Ψ-logistic with mislabels.]
20
[Figure: fitted discriminant lines for Ψ-ML with the mislabeled data, ML with the mislabeled data, and ML with the proper data.  Misclassification: 5 data in Group I, 35 data in Group II.]
21
Poisson regression

    f(y | x, θ) = exp{ y θ(x) − e^{θ(x)} } / y!   (y = 0, 1, …)

Ψ-likelihood function:

    L_Ψ(θ) = (1/n) Σ_{i=1}^n Ψ(log f(y_i | x_i, θ)) − (1/n) Σ_{i=1}^n Σ_{y=0}^∞ Ψ*(log f(y | x_i, θ))

Canonical link:  θ(x) = θ^T x.

    IF(θ̂_Ψ, (x, y)) = J^{−1} { ξ(y, x, θ)(y − e^{θ^T x}) x − E[ ξ (y − e^{θ^T x}) x ] },
    ξ(y, x, θ) = Ψ'(log f(y | x, θ))

ε-contamination model:  f_ε(y | x, θ) = (1 − ε) f(y | x, θ) + ε δ_z(y),  and

    lim_{z→∞} ‖ IF(θ̂_Ψ, z) ‖ = 0.
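A sketch of the β-version for Poisson regression with canonical link, maximizing the Ψ-likelihood by grid search, with the bias term's infinite sum truncated at a large ymax (all names, data, and the grid itself are illustrative assumptions):

```python
import numpy as np
from math import lgamma
from itertools import product

def log_pmf(y, lam):
    # Poisson log pmf, elementwise over the array y for rates lam
    lg = np.array([lgamma(v + 1.0) for v in y])
    return y * np.log(lam) - lam - lg

def mean_loglik(theta, x, y):
    lam = np.exp(theta[0] + theta[1] * x)
    return np.mean(log_pmf(y, lam))

def beta_lik(theta, x, y, beta=0.5, ymax=80):
    # (1/n) sum_i f(y_i|x_i)^beta / beta
    #   - (1/(beta+1)) (1/n) sum_i sum_{y=0}^{ymax} f(y|x_i)^{beta+1}
    lam = np.exp(theta[0] + theta[1] * x)
    lead = np.mean(np.exp(beta * log_pmf(y, lam))) / beta
    yg = np.arange(ymax + 1.0)
    lgy = np.array([lgamma(v + 1.0) for v in yg])
    lg_all = yg * np.log(lam[:, None]) - lam[:, None] - lgy   # (n, ymax+1)
    bias = np.mean(np.exp((beta + 1) * lg_all).sum(axis=1)) / (beta + 1)
    return lead - bias

rng = np.random.default_rng(2)
n = 60
x = rng.uniform(0.0, 2.0, n)
y = rng.poisson(np.exp(0.5 + 1.0 * x)).astype(float)   # true theta = (0.5, 1.0)
y[:5] += 30.0                                          # 5 gross outliers

grid = list(product(np.linspace(-0.5, 1.5, 41), np.linspace(0.0, 2.0, 41)))
th_mle = max(grid, key=lambda t: mean_loglik(t, x, y))
th_rob = max(grid, key=lambda t: beta_lik(t, x, y))
print(th_mle, th_rob)
```

The outliers have Poisson probability essentially zero under any reasonable λ(x), so their Ψ'-weight f^β vanishes and the β-fit stays near (0.5, 1.0), while the plain likelihood fit drifts toward larger rates.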
22
Neural network

    y = f(x_1, x_2, θ) + ε,

    f(x_1, x_2, θ) = u_1 / (1 + exp(w_11 x_1 + w_12 x_2 + …))
                     + u_2 / (1 + exp(w_21 x_1 + w_22 x_2 + …)),

    θ = (u_1, u_2, w_11, w_12, w_21, w_22, …)

[Figure: surface plot of f(x_1, x_2, θ_0) for θ_0 = (1, 1, 1, 1, 1, 1, 1, 1): a sum of two sigmoidal ridge surfaces over −10 ≤ x_1, x_2 ≤ 10.]
23
Input:    x = (x_1, x_2) ~ N( (0, 0), [[3, 0], [0, 3]] )
Output:   y_obs = f(x_1, x_2, θ_0) + ε,  ε ~ N(0, 0.25)   (200 obs)
Outliers: 50 additional observations with the output shifted by about 4.

[Figure: surface plots of the true response surface and of the observed data, with and without the outliers, over −10 ≤ x_1, x_2 ≤ 10.]
24
[Figure: response surfaces estimated by maximum likelihood and by Ψ-maximum likelihood (β = 0.8) from the contaminated data; the ML surface is distorted by the outliers, while the Ψ-ML surface stays close to the true one.]
25
Let x_1, …, x_n ∈ R^p be off-line data.

Classical procedure for PCA

    (μ̂, Γ̂) = argmin { Σ_{i=1}^n ‖ x_i − μ − Γ z_i ‖² :  μ ∈ R^p,  Γ ∈ O(p, k),  z_i ∈ R^k }

    μ̂ = (1/n) Σ_{i=1}^n x_i,   S = (1/n) Σ_{i=1}^n (x_i − μ̂)(x_i − μ̂)^T,   Γ̂ = Eigenvectors_k(S)

Self-organizing procedure

    μ̂ = Σ_{i=1}^n w(x_i, μ̂, Γ̂) x_i / Σ_{i=1}^n w(x_i, μ̂, Γ̂),

    S_w = Σ_{i=1}^n w(x_i, μ̂, Γ̂)(x_i − μ̂)(x_i − μ̂)^T / Σ_{i=1}^n w(x_i, μ̂, Γ̂),

    Γ̂ = Eigenvectors_k(S_w)   — an iteratively reweighted algorithm.
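The self-organizing procedure can be sketched as an iteratively reweighted mean/covariance followed by an eigendecomposition. Here the weight w_i = exp(−(β/2) d_i²), with d_i² the current Mahalanobis distance, which is the Ψ_β-weight under a Gaussian working model up to factors that cancel in the ratios (a sketch under these assumptions, not the talk's exact weight function):

```python
import numpy as np

def self_organizing_pca(X, beta=0.5, k=1, n_iter=40):
    # iteratively reweighted mean and covariance, then eigenvectors
    mu = np.median(X, axis=0)
    S = np.cov(X.T)
    for _ in range(n_iter):
        Xc = X - mu
        d2 = np.einsum('ij,ji->i', Xc, np.linalg.solve(S, Xc.T))
        w = np.exp(-0.5 * beta * d2)     # Gaussian^beta weight, outliers -> ~0
        w = w / w.sum()
        mu = w @ X
        Xc = X - mu
        S = (w[:, None] * Xc).T @ Xc
    vals, vecs = np.linalg.eigh(S)
    return mu, vecs[:, -k:]              # top-k principal directions

rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal(0, 1, (200, 2)) * [3.0, 1.0],            # bulk: long axis = e1
    np.column_stack([rng.normal(0, 0.5, 30),
                     rng.choice([-15.0, 15.0], 30)]),   # outliers along e2
])

v_classic = np.linalg.eigh(np.cov(X.T))[1][:, -1]       # classical first PC
mu_rob, V = self_organizing_pca(X)
v_rob = V[:, 0]                                          # self-organizing first PC
print(v_classic, v_rob)
```

On this data the classical first principal axis locks onto the outlier direction e2, while the reweighted iteration drives the outliers' weights toward 0 and recovers the bulk's long axis e1.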
26
[Figure: principal-axis estimates and observation weights z_i(x, μ̂, Γ̂) for data x_obs plus outliers x_out:
the classical axis moves from γ̂_classic(x_obs) to γ̂_classic(x_obs, x_out) when the outliers are added,
while γ̂_self-org(x_obs, x_out) ≈ γ̂_classic(x_obs), with the weights of the outlying points falling toward 0.]
27
    x_1, …, x_n ~ g(x) = 0.5 N(0, V_1) + 0.5 N(0, V_2)

[Figure: scatter plots of the two-component mixture with the leading axes found by the classic procedure and by the self-organizing procedure; the self-organizing procedure recovers the principal axes of the individual components.]
28
Independent Component Analysis (Minami & Eguchi, 2000)

Semiparametric model: assume x = W^{−1} y + μ, where y has independent components,
i.e.  q(y) = q_1(y_1) ⋯ q_m(y_m).  The density of x is

    q̃(x, W, μ, q) = | det W | q( W(x − μ) ),   W ∈ GL(m),  μ ∈ R^m.

Fix  q_0(y) = q_01(y_1) ⋯ q_0m(y_m)  and write  r(x, W, μ) = q_0( W(x − μ) ).

Ψ-likelihood:

    L_Ψ(W, μ) = (1/n) Σ_{i=1}^n Ψ( log{ | det W | r(x_i, W, μ) } ) − c_Ψ(W, μ)

Ψ-likelihood equation:

    (1/n) Σ_{i=1}^n ξ_i { I_m − h( W(x_i − μ) ) ( W(x_i − μ) )^T } W^{−T} − E[ ⋯ ] = 0,

where  h(y) = ( −q'_01(y_1)/q_01(y_1), …, −q'_0m(y_m)/q_0m(y_m) )  and  ξ_i  is the Ψ'-weight of x_i.
29
Theorem (Semiparametric consistency)

Let  F_S := { q = q_1 ⋯ q_m :  ∫ q_i(y_i) q_0i(y_i) h_0i(y_i) dy_i = 0  (i = 1, …, m) }.

Then, for every q ∈ F_S, the Ψ-likelihood equation remains unbiased under r(W, μ, q)
(a Ψ-ancillary leaf), so the Ψ-estimator of (W, μ) is consistent whatever the true
marginals q_i ∈ F_S.

(Pf)  In the expectation of the estimating function, the off-diagonal terms
E[ w_i^T(x − μ) h_j( w_j^T(x − μ) ) ]  (i ≠ j)  vanish by the independence of the
components, and the diagonal terms vanish by the defining condition of F_S.
30
ε-contamination:  r_ε(x, W, μ) = (1 − ε) r(x, W, μ) + ε δ_ξ

    IF(Ŵ, ξ) = J^{−1} Ψ'( log{ |det W| r(ξ, W, μ) } ) { I_m − h( W(ξ − μ) )( W(ξ − μ) )^T } W^{−T} − ⋯

Robust method:  sup_ξ ‖ IF(W, ξ) ‖ < ∞.

The Ψ-likelihood method is semiparametrically consistent, and for suitable Ψ it is also a robust method.
31
Example:

    x_i ~ (1 − ε) f(x_1) f(x_2) + ε N(μ, V),

    f(x) = 1 (0 ≤ x ≤ 1), 0 otherwise;   ε = 0.2,   μ = (0, 0),   V = [[1, 0], [0, 1]].

[Figure: scatter plot of the contaminated uniform sample.]
32
[Figure: estimated independent-component directions w_1 and w_2 by the usual method and by the self-organizing method, fitted to the clean data (blue dots) and to the contaminated data (blue & red dots).]
33
[Figure: self-organizing ICA fits with alpha = 0.0000 and alpha = 0.2200: estimated directions w_1, w_2 for 150 observations from the exponential power family plus 50 observations from N((0, 0), V).]

Data: http://www.ai.mit.edu/people/fisher/ica_data/
34
Concluding remark

def:  x is an outlier for f(·, θ)  ⇔  log f(x, θ) → −∞.
If Ψ(z) → c (a constant) as z → −∞, the Ψ-likelihood contribution of an outlier stays bounded.

[Figure: graph of the bias potential function Ψ(z) for z < 0, flattening to a constant as z → −∞; Ψ-divergence and bias potential function.]

Likelihood inference  →  Ψ-likelihood inference ?!

Likelihood inference:    sufficiency, factorizability, exponential family, EM algorithm.
Ψ-likelihood inference:  regression analysis, discriminant analysis, PCA, ICA.