1
Information Geometry of
Self-organizing maximum likelihood
Shinto Eguchi ISM, GUAS
This talk is based on joint research with Dr. Yutaka Kano, Osaka University.
Bernoulli 2000 Conference at RIKEN, 27 October 2000
2
Maximum Likelihood Estimation (MLE)

Consider a statistical model:  M = { f(x, θ) : θ ∈ Θ }.

Let (x_1, …, x_n) be data from f(x, θ).

MLE (Fisher, 1922):

    θ̂ = argmax_θ (1/n) Σ_{i=1}^n log f(x_i, θ)

Consistency, efficiency, sufficiency, unbiasedness, invariance, information.

Ψ-MLE: take an increasing function Ψ and maximize the transformed log-likelihood:

    θ̂_Ψ = argmax_θ (1/n) Σ_{i=1}^n Ψ(log f(x_i, θ))
3
Normal density:  f(x, θ) = (1/√(2π)) exp(−(x − θ)²/2)

MLE:  θ̂ = (1/n) Σ_{i=1}^n x_i

Ψ-MLE:

    θ̂_Ψ = argmax_θ Σ_{i=1}^n Ψ(log f(x_i, θ))
        = arg solve_θ  Σ_{i=1}^n φ(x_i, θ)(x_i − θ) = 0,   φ(x, θ) = Ψ'(log f(x, θ)),

so that the estimate is a weighted mean,

    θ̂_Ψ = Σ_{i=1}^n φ(x_i, θ̂_Ψ) x_i / Σ_{i=1}^n φ(x_i, θ̂_Ψ).

Given data {x_i : i = 1, …, n}, iterate

    θ_{t+1} = Σ_{i=1}^n φ(x_i, θ_t) x_i / Σ_{i=1}^n φ(x_i, θ_t)   (t = 1, 2, …).
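The weighted-mean iteration is easy to sketch in code. Taking Ψ_β(z) = (e^{βz} − 1)/β, the weight is φ(x, θ) = Ψ'_β(log f(x, θ)) = f(x, θ)^β (function names and data below are illustrative, not from the talk):

```python
import numpy as np

def normal_pdf(x, theta):
    # f(x, theta): N(theta, 1) density
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

def psi_mle_normal_mean(x, beta=0.5, n_iter=50):
    """Fixed-point iteration theta_{t+1} = sum w_i x_i / sum w_i,
    with weights w_i = f(x_i, theta_t)^beta (the Psi'-weights for Psi_beta)."""
    theta = np.median(x)            # robust starting point
    for _ in range(n_iter):
        w = normal_pdf(x, theta) ** beta
        theta = np.sum(w * x) / np.sum(w)
    return theta

# bulk of the data near 0, plus one gross outlier
x = np.array([-0.6, -0.3, -0.1, 0.0, 0.2, 0.4, 0.5, 10.0])
mle = x.mean()                                   # pulled toward the outlier
robust = psi_mle_normal_mean(x, beta=0.5)        # stays near 0
print(mle, robust)
```

The outlier's weight f(10, θ)^β is essentially zero once θ sits near the bulk, so the iteration self-organizes around the main cluster.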
4
[Figure: normal density fitted to data containing an outlier. The MLE θ̂ is pulled toward the outlier, while the Ψ-MLE θ̂_1 stays near the center of the main data.]
5
For an increasing function Ψ, define

    Ψ*(z) := ∫_0^z e^s dΨ(s),

    d_Ψ(a, b) := a{Ψ(log a) − Ψ(log b)} − {Ψ*(log a) − Ψ*(log b)}
               = ∫_{log b}^{log a} (a − e^z) dΨ(z).

Ψ-divergence:

    D_Ψ(g, f) := ∫ d_Ψ(g(x), f(x)) dν(x)
               = ∫ g{Ψ(log g) − Ψ(log f)} dν − ∫ {Ψ*(log g) − Ψ*(log f)} dν

    d_Ψ(a, b) ≥ 0 (= 0 iff a = b);   D_Ψ(g, f) ≥ 0 (= 0 iff f = g a.e.)
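The building block d_Ψ(a, b) = a{Ψ(log a) − Ψ(log b)} − {Ψ*(log a) − Ψ*(log b)} = ∫_{log b}^{log a} (a − e^z) dΨ(z) can be checked numerically; a sketch with Ψ_β(z) = (e^{βz} − 1)/β (all names illustrative):

```python
import numpy as np

beta = 0.5

def psi(z):        # Psi_beta
    return (np.exp(beta * z) - 1.0) / beta

def psi_star(z):   # Psi*_beta = int_0^z e^s Psi'(s) ds
    return (np.exp((beta + 1) * z) - 1.0) / (beta + 1)

def d_closed(a, b):
    # a{Psi(log a) - Psi(log b)} - {Psi*(log a) - Psi*(log b)}
    return (a * (psi(np.log(a)) - psi(np.log(b)))
            - (psi_star(np.log(a)) - psi_star(np.log(b))))

def d_integral(a, b, m=200001):
    # trapezoid rule for int_{log b}^{log a} (a - e^z) Psi'(z) dz,
    # with Psi'_beta(z) = e^{beta z}
    lo, hi = np.log(b), np.log(a)
    z = np.linspace(lo, hi, m)
    f = (a - np.exp(z)) * np.exp(beta * z)
    return (hi - lo) / (m - 1) * (f.sum() - 0.5 * (f[0] + f[-1]))

for a, b in [(0.2, 1.3), (2.0, 0.5), (1.0, 1.0)]:
    assert abs(d_closed(a, b) - d_integral(a, b)) < 1e-4
    assert d_closed(a, b) >= -1e-12   # d_Psi >= 0, equality iff a = b
print("ok")
```

The integral form makes the nonnegativity visible: between log b and log a the factor (a − e^z) never changes sign against the direction of integration.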
6
Examples
(1)  Ψ(z) = z,  Ψ*(z) = e^z:

     D_Ψ(g, f) = KL(g, f) = ∫ g(y) log( g(y)/f(y) ) dν(y)

(2)  Ψ_β(z) = (e^{βz} − 1)/β,  Ψ*_β(z) = (e^{(β+1)z} − 1)/(β+1):

     D_β(g, f) = (1/β) ∫ {g(y)^β − f(y)^β} g(y) dν(y)
                 − (1/(β+1)) ∫ {g(y)^{β+1} − f(y)^{β+1}} dν(y)   (β-divergence)

(3)  Ψ_η(z) = log( (e^z + η)/(1 + η) ),  Ψ*_η(z) = e^z − 1 − η log( (e^z + η)/(1 + η) ):

     D_η(g, f) = ∫ {g(y) + η} log( (g(y) + η)/(f(y) + η) ) dν(y)   (η-divergence)

As β → 0 and η → 0,  Ψ_β(z) → z  and  Ψ_η(z) → z:
both the β-divergence and the η-divergence tend to the KL-divergence.
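The limit β → 0 can be illustrated numerically on a discretized pair of densities (a sketch; the grid discretization and names are assumptions, not from the talk):

```python
import numpy as np

# two densities on a grid: standard normal g vs. shifted normal f
y = np.linspace(-10, 10, 4001)
dy = y[1] - y[0]
g = np.exp(-0.5 * y ** 2) / np.sqrt(2 * np.pi)
f = np.exp(-0.5 * (y - 1.0) ** 2) / np.sqrt(2 * np.pi)

def kl(g, f):
    # KL(g, f) = int g log(g/f) dy  (Riemann sum)
    return np.sum(g * np.log(g / f)) * dy

def d_beta(g, f, beta):
    # (1/beta) int (g^b - f^b) g dy - (1/(b+1)) int (g^{b+1} - f^{b+1}) dy
    t1 = np.sum((g ** beta - f ** beta) * g) * dy / beta
    t2 = np.sum(g ** (beta + 1) - f ** (beta + 1)) * dy / (beta + 1)
    return t1 - t2

# KL(N(0,1), N(1,1)) = 0.5; D_beta approaches it as beta shrinks
print(kl(g, f), d_beta(g, f, 0.3), d_beta(g, f, 0.01))
```

For this pair KL(g, f) = 0.5 exactly, and D_β with β = 0.01 already agrees to a few hundredths.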
7
[Figure: graphs of Ψ_β, Ψ*_β and of Ψ_η, Ψ*_η over −1 ≤ z ≤ 1 for several parameter values (β, η = 0.9, 0.5, 0.3, and values near 0); as the parameter tends to 0 the curves approach Ψ(z) = z and Ψ*(z) = e^z − 1.]
8
Pythagorean theorem

If  D_Ψ(g, h) = D_Ψ(g, f) + D_Ψ(f, h),  then

    D_Ψ(g_t, h_s) = D_Ψ(g_t, f) + D_Ψ(f, h_s)   (0 < t < 1, 0 < s < 1),

where

    g_t(x) = (1 − t) f(x) + t g(x)                     (mixture geodesic from f to g),
    Ψ(log h_s(x)) = (1 − s) Ψ(log f(x)) + s Ψ(log h(x))  (Ψ-exponential geodesic from f to h).

[Diagram: the rectangle of densities indexed by (t, s), with f at (0,0), g at (1,0), h at (0,1); g_t runs along one side and h_s along the other.]
9
(Pf)  From the definition of D_Ψ,

    D_Ψ(g, h) − D_Ψ(g, f) − D_Ψ(f, h) = ∫ (g − f){Ψ(log f) − Ψ(log h)} dν,

since  D_Ψ(g, f) = ∫ g{Ψ(log g) − Ψ(log f)} dν − ∫ {Ψ*(log g) − Ψ*(log f)} dν,  etc.

Hence, along the geodesics g_t and h_s,

    D_Ψ(g_t, h_s) − D_Ψ(g_t, f) − D_Ψ(f, h_s)
      = ∫ (g_t − f){Ψ(log f) − Ψ(log h_s)} dν
      = t s ∫ (g − f){Ψ(log f) − Ψ(log h)} dν
      = t s { D_Ψ(g, h) − D_Ψ(g, f) − D_Ψ(f, h) } = 0.
10
Differential geometry of D_Ψ

The Ψ-divergence D_Ψ(g, f) on M = { f(x, θ) } induces a triple (G*, ∇, ∇*):

Riemann metric:      G*_ij(θ) = −(∂²/∂θ_i ∂θ'_j) D_Ψ(f_θ, f_{θ'}) |_{θ' = θ}

Affine connection:   Γ_ij,k(θ) = −(∂³/∂θ_i ∂θ_j ∂θ'_k) D_Ψ(f_θ, f_{θ'}) |_{θ' = θ}

Conjugate affine connection:
                     Γ*_ij,k(θ) = −(∂³/∂θ'_i ∂θ'_j ∂θ_k) D_Ψ(f_θ, f_{θ'}) |_{θ' = θ}

cf. Eguchi (1992).  Compare Csiszár's divergence
    D_Φ(g, f) = ∫ f(x) Φ( g(x)/f(x) ) dν(x)   (Φ convex).
11
β-divergence:

    D_β(g, f) = (1/β) ∫ (g^β − f^β) g dν − (1/(β+1)) ∫ (g^{β+1} − f^{β+1}) dν

β-metric:           G^β_ij(θ) = E_f[ f^β S_i S_j ],   S_i = ∂ log f / ∂θ_i
β-skewness tensor:  T^β_ij,k(θ) = Γ_ij,k(θ) − Γ*_ij,k(θ) = (1 − β) E_f[ f^β S_i S_j S_k ]
(cf. Lauritzen, 1987)

information metric:  G_ij(θ) = E_f[ S_i S_j ]
skewness tensor:     T_ij,k(θ) = E_f[ S_i S_j S_k ]

mixture geodesic:  f_t(x) = (1 − t) f(x) + t g(x)   (0 < t < 1)
β-geodesic:        f_t(x) = { (1 − t) f(x)^β + t g(x)^β }^{1/β}   (0 < t < 1)

Compare the β-divergence with Amari's α-divergence.
12
Ψ-likelihood function

    L_Ψ(θ) = (1/n) Σ_{i=1}^n Ψ(log f(x_i, θ)) − b_Ψ(θ),   b_Ψ(θ) = ∫ Ψ*(log f(y, θ)) dν(y)

M-estimation (Huber, 1964, 1983):  ρ(x, θ) = Ψ(log f(x, θ)) − b_Ψ(θ)

    D_Ψ(g, f_θ) = ∫ g{Ψ(log g) − Ψ(log f_θ)} dν − ∫ {Ψ*(log g) − Ψ*(log f_θ)} dν
                = const(g) − { E_g[Ψ(log f_θ)] − b_Ψ(θ) },

so minimizing D_Ψ(g, f_θ) in θ is maximizing the expected Ψ-likelihood.

For Ψ(z) = z:  b_Ψ(θ) = ∫ f(y, θ) dν(y) = 1,  L(θ) = (1/n) Σ_{i=1}^n log f(x_i, θ) − 1,
and  KL(g, f) = ∫ g log g dν − ∫ g log f dν:
Kullback-Leibler and maximum likelihood.
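For the normal model N(μ, σ²) with Ψ_β the bias term has a closed form, b_β(μ, σ) = (2πσ²)^{−β/2} (β+1)^{−3/2} (dropping additive constants in θ), so L_Ψ can be maximized directly. A grid-search sketch, with illustrative names and synthetic data:

```python
import numpy as np
from itertools import product

def beta_lik(x, mu, sigma, beta):
    # (1/n) sum f(x_i)^beta / beta  -  b_beta(mu, sigma), constants dropped;
    # for N(mu, sigma^2): b_beta = (2 pi sigma^2)^(-beta/2) (beta+1)^(-3/2)
    f = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    b = (2 * np.pi * sigma ** 2) ** (-beta / 2) * (beta + 1) ** (-1.5)
    return np.mean(f ** beta) / beta - b

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 95),
                    rng.normal(8.0, 0.5, 5)])    # 5% outliers near 8

grid = product(np.linspace(-1.0, 2.0, 61), np.linspace(0.3, 4.0, 75))
mu_b, sg_b = max(grid, key=lambda p: beta_lik(x, p[0], p[1], beta=0.5))
mu_ml, sg_ml = x.mean(), x.std()                 # the (non-robust) MLE
print((mu_b, sg_b), (mu_ml, sg_ml))
```

The bias term is what prevents degenerate solutions (e.g. a tiny σ spiked on the outlier cluster): b_β blows up as σ → 0 faster than the data term.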
13
Another definition of Ψ-likelihood

Take a positive function ξ(x, θ) and define

    g_Ψ(θ) = (1/n) Σ_{i=1}^n { ξ(x_i, θ) S(x_i, θ) − E_f[ ξ(y, θ) S(y, θ) ] },

where  S(x, θ) = (∂/∂θ) log f(x, θ).

If  ∂g_Ψ(θ)/∂θ^T = ( ∂g_Ψ(θ)/∂θ^T )^T  (g_Ψ is integrable as a vector field),
then g_Ψ has a potential function:  g_Ψ(θ) = (∂/∂θ) L_Ψ(θ),  with  ξ(x, θ) = Ψ'(log f(x, θ)).

The Ψ-likelihood equation is a weighted score equation with integrability.
14
Consistency of Ψ-MLE

Let x_1, …, x_n be iid with density f(x, θ_0).  Then  L_Ψ(θ) → E_{θ_0}[ L_Ψ(θ) ]  a.s., and

    E_{θ_0}[ L_Ψ(θ) ] − E_{θ_0}[ L_Ψ(θ_0) ]
      = ∫ { Ψ(log f(x, θ)) − Ψ(log f(x, θ_0)) } f(x, θ_0) dν(x)
        − ∫ { Ψ*(log f(x, θ)) − Ψ*(log f(x, θ_0)) } dν(x)
      = −D_Ψ(f_{θ_0}, f_θ) ≤ 0,

so  θ_0 = argmax_θ E_{θ_0}[ L_Ψ(θ) ],  and hence  θ̂_Ψ = argmax_θ L_Ψ(θ) → θ_0  a.s.
15
Influence function

    θ̂(G) := argmax_θ L_Ψ(θ, G),   L_Ψ(θ, G) = ∫ Ψ(log f(y, θ)) dG(y) − b_Ψ(θ)

Fisher consistency:  θ̂(F_θ) = θ.

ε-contamination model of F:  G_ε = (1 − ε) F + ε Δ_y   (Δ_y the point mass at y)

    IF(y, θ̂) = lim_{ε→0} { θ̂(G_ε) − θ̂(F) } / ε = J^{−1} { ξ(y, θ) S(y, θ) − E_F[ ξ S ] },
    J = E[ ξ(y, θ) S(y, θ) S(y, θ)^T ]

Asymptotic efficiency:  √n (θ̂_Ψ − θ) →_D N(0, V_Ψ),   V_Ψ = J^{−1} Var( ξ(y, θ) S(y, θ) ) J^{−1}

Robustness or Efficiency
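For the normal-mean case with Ψ_β we have ξ(y, θ) = f(y, θ)^β and S(y, θ) = y − θ, with E_F[ξS] = 0 by symmetry, so the influence function is proportional to f(y, θ)^β (y − θ): bounded and redescending, unlike the MLE's y − θ. A sketch (names illustrative):

```python
import numpy as np

theta, beta = 0.0, 0.5

def f(y):
    # N(theta, 1) density
    return np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)

def if_mle(y):
    # influence function of the MLE, up to J^{-1}: the score y - theta
    return y - theta

def if_beta(y):
    # Psi-MLE: xi(y) * score = f(y)^beta * (y - theta), redescending
    return f(y) ** beta * (y - theta)

ys = np.array([1.0, 2.0, 5.0, 10.0])
print(if_mle(ys))    # grows without bound
print(if_beta(ys))   # rises, then decays back toward 0
```

This is the robustness/efficiency trade-off in miniature: a larger β flattens the influence of distant points at the price of some efficiency at the model.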
16
Generalized linear model

    f(y, θ) = exp{ (yθ − c(θ)) / a(φ) + b(y, φ) }

    E(y) = c'(θ) = μ,   Var(y) = a(φ) c''(θ) = a(φ) V(μ)

Regression model:  E(y | x) = μ(x) = h(θ^T x),  or  θ^T x = h^{−1}( E(y | x) )

Ψ-likelihood function, given data (y_i, x_i) (i = 1, …, n):

    L_Ψ(θ) = (1/n) Σ_{i=1}^n Ψ(log f(y_i, x_i^T θ)) − (1/n) Σ_{i=1}^n ∫ Ψ*(log f(y, x_i^T θ)) dν(y)

Estimating equation:

    Σ_{i=1}^n ξ(y_i, x_i^T θ)(y_i − μ(x_i^T θ)) x_i
      − Σ_{i=1}^n ∫ ξ(y, x_i^T θ)(y − μ(x_i^T θ)) f(y, x_i^T θ) dν(y) x_i = 0

Variance functions V(μ):
    Normal            1
    Poisson           μ
    Bernoulli         μ(1 − μ)
    Gamma             μ²
    Inverse Gaussian  μ³
17
Bernoulli regression
    f(y | x, θ) = p(x, θ)^{(1+y)/2} (1 − p(x, θ))^{(1−y)/2},   y ∈ {−1, +1},
    p(x, θ) = Prob(y = 1 | x)

MLE:    argmax_θ Σ_{i=1}^n { ((1+y_i)/2) log p(x_i, θ) + ((1−y_i)/2) log(1 − p(x_i, θ)) }

Ψ-MLE:  argmax_θ Σ_{i=1}^n { ((1+y_i)/2) Ψ(log p(x_i, θ)) + ((1−y_i)/2) Ψ(log(1 − p(x_i, θ))) }

Logistic regression:  p(x, θ) = 1 / (1 + exp(−θ^T x))
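A minimal sketch of the Ψ-MLE for logistic regression with Ψ_β, fitted by plain gradient ascent; β = 0 recovers the ordinary MLE, and Ψ'_β(log p) = p^β supplies the downweighting of improbable labels (data, names, and step sizes below are illustrative, not from the talk):

```python
import numpy as np

def fit_logistic(X, y, beta=0.0, lr=0.1, n_iter=4000):
    # gradient ascent on sum_i {((1+y_i)/2) Psi(log p_i) + ((1-y_i)/2) Psi(log(1-p_i))}
    # with Psi = Psi_beta, so Psi'(log p) = p**beta; beta = 0 is the ordinary MLE
    th = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ th))
        grad_i = (1 + y) / 2 * p ** beta * (1 - p) - (1 - y) / 2 * (1 - p) ** beta * p
        th += lr * (grad_i @ X) / len(y)
    return th

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x1])
y = np.where(rng.random(n) < 1 / (1 + np.exp(-3.0 * x1)), 1.0, -1.0)
y[np.argsort(x1)[-10:]] = -1.0            # mislabel the 10 largest-x points

th_mle = fit_logistic(X, y, beta=0.0)     # slope attenuated by the mislabels
th_psi = fit_logistic(X, y, beta=0.5)     # mislabels get weight (1-p)^beta * p, near 0
print(th_mle, th_psi)
```

A mislabeled point deep inside the opposite class has p near 1, so its Ψ-gradient contribution (1 − p)^β p is small, while for the MLE it is p, which is what drags the MLE's slope down.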
18
Misclassification model

    f*(y | x, θ) = (1 − ε) p(x, θ)^{(1+y)/2} (1 − p(x, θ))^{(1−y)/2}
                   + ε p(x, θ)^{(1−y)/2} (1 − p(x, θ))^{(1+y)/2}

i.e.  y_i ~ f*(y | x_i, θ*): each label is flipped with probability ε.  Writing
p*(x, θ) = (1 − ε) p(x, θ) + ε (1 − p(x, θ)),

    θ̂_ε = arg solve_θ Σ_{i=1}^n (∂/∂θ) { ((1+y_i)/2) log p*(x_i, θ)
                                          + ((1−y_i)/2) log(1 − p*(x_i, θ)) } = 0,

with  θ̂_ε → θ̂*  as  ε → 0.

    ‖ IF(θ̂_j, z) ‖ → 0   as   z → ∞

[Figure: the influence of an extreme observation on the misclassification-model estimate redescends to 0, whereas ‖IF(θ̂_j, z)‖ for the logistic MLE is unbounded.]
19
Logistic Discrimination

Group I:  {x_1, …, x_180} from a bivariate normal N(μ_I, V_I) centered at the origin;
Group II: from a bivariate normal N(μ_II, V_II) centered away from the origin;
labels {y_1, …, y_200}.  Mislabels: 5 observations in Group I and 35 in Group II.

[Figure: scatter plot of the two groups with the mislabeled points marked, and the fitted linear classifiers y = sgn(θ̂_0 + θ̂^T x): the logistic classifier, logistic with mislabels, and Ψ-logistic with mislabels.]
20
[Figure: fitted discriminant lines for Ψ-ML with the mislabeled data, ML with the mislabeled data, and ML with the proper data.  Misclassification: 5 data in Group I, 35 data in Group II.]
21
Poisson regression

    f(y | x, θ) = exp{ y θ(x) − e^{θ(x)} } / y!   (y = 0, 1, …)

Ψ-likelihood function:

    L_Ψ(θ) = (1/n) Σ_{i=1}^n Ψ(log f(y_i | x_i, θ)) − (1/n) Σ_{i=1}^n Σ_{y=0}^∞ Ψ*(log f(y | x_i, θ))

Canonical link:  θ(x) = θ^T x.

    IF(θ̂_Ψ, (x, y)) = J^{−1} { ξ(y, x, θ)(y − e^{θ^T x}) x − E[ ξ (y − e^{θ^T x}) x ] },
    ξ(y, x, θ) = Ψ'(log f(y | x, θ))

ε-contamination model:  f_ε(y | x, θ) = (1 − ε) f(y | x, θ) + ε δ_z(y),  and

    lim_{z→∞} ‖ IF(θ̂_Ψ, z) ‖ = 0.
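A sketch of the β-version for Poisson regression with canonical link, maximizing the Ψ-likelihood by grid search, with the bias term's infinite sum truncated at a large ymax (all names, data, and the grid itself are illustrative assumptions):

```python
import numpy as np
from math import lgamma
from itertools import product

def log_pmf(y, lam):
    # Poisson log pmf, elementwise over the array y for rates lam
    lg = np.array([lgamma(v + 1.0) for v in y])
    return y * np.log(lam) - lam - lg

def mean_loglik(theta, x, y):
    lam = np.exp(theta[0] + theta[1] * x)
    return np.mean(log_pmf(y, lam))

def beta_lik(theta, x, y, beta=0.5, ymax=80):
    # (1/n) sum_i f(y_i|x_i)^beta / beta
    #   - (1/(beta+1)) (1/n) sum_i sum_{y=0}^{ymax} f(y|x_i)^{beta+1}
    lam = np.exp(theta[0] + theta[1] * x)
    lead = np.mean(np.exp(beta * log_pmf(y, lam))) / beta
    yg = np.arange(ymax + 1.0)
    lgy = np.array([lgamma(v + 1.0) for v in yg])
    lg_all = yg * np.log(lam[:, None]) - lam[:, None] - lgy   # (n, ymax+1)
    bias = np.mean(np.exp((beta + 1) * lg_all).sum(axis=1)) / (beta + 1)
    return lead - bias

rng = np.random.default_rng(2)
n = 60
x = rng.uniform(0.0, 2.0, n)
y = rng.poisson(np.exp(0.5 + 1.0 * x)).astype(float)   # true theta = (0.5, 1.0)
y[:5] += 30.0                                          # 5 gross outliers

grid = list(product(np.linspace(-0.5, 1.5, 41), np.linspace(0.0, 2.0, 41)))
th_mle = max(grid, key=lambda t: mean_loglik(t, x, y))
th_rob = max(grid, key=lambda t: beta_lik(t, x, y))
print(th_mle, th_rob)
```

The outliers have Poisson probability essentially zero under any reasonable λ(x), so their Ψ'-weight f^β vanishes and the β-fit stays near (0.5, 1.0), while the plain likelihood fit drifts toward larger rates.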
22
Neural network

    y = f(x_1, x_2, θ) + ε,

    f(x_1, x_2, θ) = u_1 / (1 + exp(w_11 x_1 + w_12 x_2 + …))
                     + u_2 / (1 + exp(w_21 x_1 + w_22 x_2 + …)),

    θ = (u_1, u_2, w_11, w_12, w_21, w_22, …)

[Figure: surface plot of f(x_1, x_2, θ_0) for θ_0 = (1, 1, 1, 1, 1, 1, 1, 1): a sum of two sigmoidal ridge surfaces over −10 ≤ x_1, x_2 ≤ 10.]
23
Input:    x = (x_1, x_2) ~ N( (0, 0), [[3, 0], [0, 3]] )
Output:   y_obs = f(x_1, x_2, θ_0) + ε,  ε ~ N(0, 0.25)   (200 obs)
Outliers: 50 additional observations with the output shifted by about 4.

[Figure: surface plots of the true response surface and of the observed data, with and without the outliers, over −10 ≤ x_1, x_2 ≤ 10.]
24
[Figure: response surfaces estimated by maximum likelihood and by Ψ-maximum likelihood (β = 0.8) from the contaminated data; the ML surface is distorted by the outliers, while the Ψ-ML surface stays close to the true one.]
25
Let x_1, …, x_n ∈ R^p be off-line data.

Classical procedure for PCA

    (μ̂, Γ̂) = argmin { Σ_{i=1}^n ‖ x_i − μ − Γ z_i ‖² :  μ ∈ R^p,  Γ ∈ O(p, k),  z_i ∈ R^k }

    μ̂ = (1/n) Σ_{i=1}^n x_i,   S = (1/n) Σ_{i=1}^n (x_i − μ̂)(x_i − μ̂)^T,   Γ̂ = Eigenvectors_k(S)

Self-organizing procedure

    μ̂ = Σ_{i=1}^n w(x_i, μ̂, Γ̂) x_i / Σ_{i=1}^n w(x_i, μ̂, Γ̂),

    S_w = Σ_{i=1}^n w(x_i, μ̂, Γ̂)(x_i − μ̂)(x_i − μ̂)^T / Σ_{i=1}^n w(x_i, μ̂, Γ̂),

    Γ̂ = Eigenvectors_k(S_w)   — an iteratively reweighted algorithm.
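The self-organizing procedure can be sketched as an iteratively reweighted mean/covariance followed by an eigendecomposition. Here the weight w_i = exp(−(β/2) d_i²), with d_i² the current Mahalanobis distance, which is the Ψ_β-weight under a Gaussian working model up to factors that cancel in the ratios (a sketch under these assumptions, not the talk's exact weight function):

```python
import numpy as np

def self_organizing_pca(X, beta=0.5, k=1, n_iter=40):
    # iteratively reweighted mean and covariance, then eigenvectors
    mu = np.median(X, axis=0)
    S = np.cov(X.T)
    for _ in range(n_iter):
        Xc = X - mu
        d2 = np.einsum('ij,ji->i', Xc, np.linalg.solve(S, Xc.T))
        w = np.exp(-0.5 * beta * d2)     # Gaussian^beta weight, outliers -> ~0
        w = w / w.sum()
        mu = w @ X
        Xc = X - mu
        S = (w[:, None] * Xc).T @ Xc
    vals, vecs = np.linalg.eigh(S)
    return mu, vecs[:, -k:]              # top-k principal directions

rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal(0, 1, (200, 2)) * [3.0, 1.0],            # bulk: long axis = e1
    np.column_stack([rng.normal(0, 0.5, 30),
                     rng.choice([-15.0, 15.0], 30)]),   # outliers along e2
])

v_classic = np.linalg.eigh(np.cov(X.T))[1][:, -1]       # classical first PC
mu_rob, V = self_organizing_pca(X)
v_rob = V[:, 0]                                          # self-organizing first PC
print(v_classic, v_rob)
```

On this data the classical first principal axis locks onto the outlier direction e2, while the reweighted iteration drives the outliers' weights toward 0 and recovers the bulk's long axis e1.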
26
[Figure: principal-axis estimates and observation weights z_i(x, μ̂, Γ̂) for data x_obs plus outliers x_out:
the classical axis moves from γ̂_classic(x_obs) to γ̂_classic(x_obs, x_out) when the outliers are added,
while γ̂_self-org(x_obs, x_out) ≈ γ̂_classic(x_obs), with the weights of the outlying points falling toward 0.]
27
    x_1, …, x_n ~ g(x) = 0.5 N(0, V_1) + 0.5 N(0, V_2)

[Figure: scatter plots of the two-component mixture with the leading axes found by the classic procedure and by the self-organizing procedure; the self-organizing procedure recovers the principal axes of the individual components.]
28
Independent Component Analysis (Minami & Eguchi, 2000)

Semiparametric model: assume x = W^{−1} y + μ, where y has independent components,
i.e.  q(y) = q_1(y_1) ⋯ q_m(y_m).  The density of x is

    q̃(x, W, μ, q) = | det W | q( W(x − μ) ),   W ∈ GL(m),  μ ∈ R^m.

Fix  q_0(y) = q_01(y_1) ⋯ q_0m(y_m)  and write  r(x, W, μ) = q_0( W(x − μ) ).

Ψ-likelihood:

    L_Ψ(W, μ) = (1/n) Σ_{i=1}^n Ψ( log{ | det W | r(x_i, W, μ) } ) − c_Ψ(W, μ)

Ψ-likelihood equation:

    (1/n) Σ_{i=1}^n ξ_i { I_m − h( W(x_i − μ) ) ( W(x_i − μ) )^T } W^{−T} − E[ ⋯ ] = 0,

where  h(y) = ( −q'_01(y_1)/q_01(y_1), …, −q'_0m(y_m)/q_0m(y_m) )  and  ξ_i  is the Ψ'-weight of x_i.
29
Theorem (Semiparametric consistency)

Let  F_S := { q = q_1 ⋯ q_m :  ∫ q_i(y_i) q_0i(y_i) h_0i(y_i) dy_i = 0  (i = 1, …, m) }.

Then, for every q ∈ F_S, the Ψ-likelihood equation remains unbiased under r(W, μ, q)
(a Ψ-ancillary leaf), so the Ψ-estimator of (W, μ) is consistent whatever the true
marginals q_i ∈ F_S.

(Pf)  In the expectation of the estimating function, the off-diagonal terms
E[ w_i^T(x − μ) h_j( w_j^T(x − μ) ) ]  (i ≠ j)  vanish by the independence of the
components, and the diagonal terms vanish by the defining condition of F_S.
30
ε-contamination:  r_ε(x, W, μ) = (1 − ε) r(x, W, μ) + ε δ_ξ

    IF(Ŵ, ξ) = J^{−1} Ψ'( log{ |det W| r(ξ, W, μ) } ) { I_m − h( W(ξ − μ) )( W(ξ − μ) )^T } W^{−T} − ⋯

Robust method:  sup_ξ ‖ IF(W, ξ) ‖ < ∞.

The Ψ-likelihood method is semiparametrically consistent, and for suitable Ψ it is also a robust method.
31
Example:

    x_i ~ (1 − ε) f(x_1) f(x_2) + ε N(μ, V),

    f(x) = 1 (0 ≤ x ≤ 1), 0 otherwise;   ε = 0.2,   μ = (0, 0),   V = [[1, 0], [0, 1]].

[Figure: scatter plot of the contaminated uniform sample.]
32
[Figure: estimated independent-component directions w_1 and w_2 by the usual method and by the self-organizing method, fitted to the clean data (blue dots) and to the contaminated data (blue & red dots).]
33
[Figure: self-organizing ICA fits with alpha = 0.0000 and alpha = 0.2200: estimated directions w_1, w_2 for 150 observations from the exponential power family plus 50 observations from N((0, 0), V).]

Data: http://www.ai.mit.edu/people/fisher/ica_data/
34
Concluding remark

def:  x is an outlier for f(·, θ)  ⇔  log f(x, θ) → −∞.
If Ψ(z) → c (a constant) as z → −∞, the Ψ-likelihood contribution of an outlier stays bounded.

[Figure: graph of the bias potential function Ψ(z) for z < 0, flattening to a constant as z → −∞; Ψ-divergence and bias potential function.]

Likelihood inference  →  Ψ-likelihood inference ?!

Likelihood inference:    sufficiency, factorizability, exponential family, EM algorithm.
Ψ-likelihood inference:  regression analysis, discriminant analysis, PCA, ICA.