Information Theoretic Signal Processing and Machine Learning
Jose C. Principe
Computational NeuroEngineering Laboratory
Electrical and Computer Engineering Department
University of Florida
www.cnel.ufl.edu
Acknowledgments
Dr. John Fisher, Dr. Dongxin Xu, Dr. Ken Hild, Dr. Deniz Erdogmus
My students: Puskal Pokharel, Weifeng Liu, Jianwu Xu, Kyu-Hwa Jeong, Sudhir Rao, Seungju Han
NSF ECS – 0300340 and 0601271 (Neuroengineering program) and DARPA
Outline
• Information Theoretic Learning
• A RKHS for ITL
• Correntropy as Generalized Correlation
• Applications
  Matched filtering
  Wiener filtering and regression
  Nonlinear PCA
Information Filtering: From Data to Information
Information filters: given data pairs {xi, di}, build optimal adaptive systems using information measures.
Embed the information in the weights of the adaptive system; more formally, use optimization to perform Bayesian estimation.
[Diagram: an adaptive system f(x,w) receives data x and produces output y; the error e = d − y against desired data d drives an information measure, whose output adapts the system.]
Information Theoretic Learning (ITL) book, 2010
Tutorial: IEEE SP Magazine, Nov. 2006
Or the ITL resource: www.cnel.ufl.edu
What is Information Theoretic Learning?
ITL is a methodology to adapt linear or nonlinear systems using criteria based on the information descriptors of entropy and divergence.
The centerpiece is a non-parametric estimator for entropy that:
• Does not require an explicit estimation of the pdf
• Uses the Parzen window method, which is known to be consistent and efficient
• Is smooth
• Is readily integrated in conventional gradient-descent learning
• Provides a link between information theory and kernel learning.
ITL is a different way of thinking about data quantification
Moment expansions, in particular second-order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g., similarity, Hebb's postulate of learning) into second-order statistical equivalents.
ITL replaces second-order moments with a geometric statistical interpretation of data in probability spaces:
• Variance by entropy
• Correlation by correntropy
• Mean square error (MSE) by minimum error entropy (MEE)
• Distances in data space by distances in probability spaces
• Fully exploits the structure of RKHS.
Information Theoretic Learning: Entropy
Not all random variables (r.v.) are equally random! Entropy quantifies the degree of uncertainty in a r.v.; Claude Shannon defined it as
$$H_S(X) = -\sum_k p(x_k)\log p(x_k) \qquad\qquad H_S(X) = -\int f_X(x)\log f_X(x)\,dx$$

[Figure: two pdfs f_X(x) over the same range, one concentrated and one spread out, illustrating different degrees of uncertainty.]
[Figure: probability simplexes for p = [p1 p2] and p = [p1 p2 p3], with contours of the entropy α-norm $\|p\|_\alpha = \big(\sum_{k=1}^{n} p_k^\alpha\big)^{1/\alpha}$ (the α-norm of p raised to the power α).]
Information Theoretic Learning: Renyi's Entropy

Norm of the pdf:

$$H_\alpha(X) = \frac{1}{1-\alpha}\log\sum_k p^\alpha(x_k) \qquad\qquad H_\alpha(X) = \frac{1}{1-\alpha}\log\int f_X^\alpha(x)\,dx$$

Renyi's entropy equals Shannon's as α → 1. The quantity

$$V_\alpha = \int f_X^\alpha(x)\,dx$$

is the information potential, so $H_\alpha(X) = \frac{1}{1-\alpha}\log V_\alpha$.
Information Theoretic Learning: Norm of the PDF (Information Potential)

$V_2(X) = E[p(X)] = \int p^2(x)\,dx$, the 2-norm of the pdf (the information potential), is one of the central concepts in ITL.

[Diagram: E[p(x)] and its estimators connect information theory (Renyi's entropy), cost functions for adaptive systems, and reproducing kernel Hilbert spaces.]
Information Theoretic Learning: Parzen windowing

Given only samples $\{x_1, \dots, x_N\}$ drawn from the distribution $p(x)$, the Parzen estimate with kernel function $G_\sigma$ is

$$\hat p(x) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x_i)$$

Convergence:

$$\lim_{N\to\infty}\hat p_N(x) = p(x) * G_{\sigma_N}(x), \quad\text{provided that } \sigma_N \to 0 \text{ and } N\sigma_N \to \infty$$

[Figure: Parzen estimates for N = 10 and N = 1000 samples; the estimate becomes smoother and closer to the true pdf as N grows.]
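For concreteness, here is a minimal NumPy sketch of the Parzen estimator above; the function name, the sample source, and the bandwidth value are illustrative assumptions, not from the original deck:

```python
import numpy as np

def parzen_pdf(x, samples, sigma):
    # Gaussian kernel between each evaluation point and each sample
    diffs = np.atleast_1d(x)[:, None] - samples[None, :]
    gauss = np.exp(-diffs**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    # Parzen estimate: average of the kernels placed on the samples
    return gauss.mean(axis=1)

# Example: estimate a standard normal pdf from N = 1000 samples
rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)
grid = np.linspace(-5, 5, 200)
p_hat = parzen_pdf(grid, samples, sigma=0.3)
```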
Information Theoretic Learning Information Potential
Order-2 entropy with Gaussian kernels:

$$\hat H_2(X) = -\log\int \hat p^2(x)\,dx = -\log\int\Big(\frac{1}{N}\sum_{i=1}^{N} G_\sigma(x-x_i)\Big)^2 dx = -\log\frac{1}{N^2}\sum_{j=1}^{N}\sum_{i=1}^{N} G_{\sigma\sqrt2}(x_j - x_i)$$

Pairwise interactions between samples: O(N²).

The information potential estimator $\hat V_2(X)$ provides a potential field over the space of the samples, parameterized by the kernel size.

Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
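A minimal NumPy sketch of this O(N²) estimator (function names are illustrative; the √2 factor comes from convolving two Gaussian kernels, as in the derivation above):

```python
import numpy as np

def information_potential(x, sigma):
    # Pairwise differences between all samples: O(N^2) interactions
    d = x[:, None] - x[None, :]
    s2 = 2 * sigma**2  # variance of the convolved kernel G_{sigma*sqrt(2)}
    g = np.exp(-d**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return g.mean()    # V2_hat(X)

def renyi_quadratic_entropy(x, sigma):
    # H2_hat(X) = -log V2_hat(X)
    return -np.log(information_potential(x, sigma))
```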
Information Theoretic Learning: Central Moments Estimators

Mean:
$$E[X] \approx \frac{1}{N}\sum_{i=1}^{N} x_i$$

Variance:
$$E\big[(X - E[X])^2\big] \approx \frac{1}{N}\sum_{i=1}^{N}\Big(x_i - \frac{1}{N}\sum_{j=1}^{N} x_j\Big)^2$$

2-norm of the pdf (information potential):
$$E[f(X)] \approx \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}(x_j - x_i)$$

Renyi's quadratic entropy:
$$\hat H_2(X) = -\log\Big(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}(x_j - x_i)\Big)$$
Information Theoretic Learning Information Force
In adaptation, samples become information particles that interact through information forces.
Information potential field:
$$\hat V_2(x_j) = \frac{1}{N^2}\sum_i G_{\sigma\sqrt2}(x_j - x_i)$$

Information force:
$$\frac{\partial}{\partial x_j}\hat V_2(x_j) = \frac{1}{N^2}\sum_i G'_{\sigma\sqrt2}(x_j - x_i)$$

[Figure: the information force exerted on sample x_j by sample x_i.]

Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
Erdogmus, Principe, Hild, Natural Computing, 2002.
Information Theoretic Learning Error Entropy Criterion (EEC)
We will use iterative algorithms for optimization of a linear system with steepest descent
Given a batch of N samples, the IP of the error is

$$\hat V_2(E) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\big(e(n-i) - e(n-j)\big)$$

For an FIR filter the gradient becomes

$$\nabla\hat V_2(n) = \frac{1}{2\sigma^2 N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\big(e(n-i)-e(n-j)\big)\,\big(e(n-i)-e(n-j)\big)\,\big(\mathbf x(n-i) - \mathbf x(n-j)\big)$$

and the weights are updated by steepest ascent on the IP (equivalently, descent on the error entropy):

$$\mathbf w(n+1) = \mathbf w(n) + \eta\,\nabla\hat V_2(n)$$
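A minimal NumPy sketch of one such batch update, assuming the errors e and their input vectors X are already available; the function name and step size are illustrative:

```python
import numpy as np

def mee_fir_step(w, X, e, eta, sigma):
    # One steepest-ascent step on the error IP (= descent on error entropy).
    # X[i] is the input vector that produced error e[i]; the kernel variance
    # 2*sigma**2 comes from the convolved Gaussian G_{sigma*sqrt(2)}.
    N = len(e)
    s2 = 2.0 * sigma**2
    grad = np.zeros_like(w, dtype=float)
    for i in range(N):
        for j in range(N):
            de = e[i] - e[j]
            g = np.exp(-de**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
            grad += g * de * (X[i] - X[j])   # pairwise information force
    grad /= s2 * N**2
    return w + eta * grad
```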
Information Theoretic Learning Backpropagation of Information Forces
Information forces become the injected error to the dual or adjoint network that determines the weight updates for adaptation.
[Diagram: input and desired signals feed the adaptive system; the IT criterion at the output produces information forces, which the adjoint network backpropagates to form the weight updates.]

$$\frac{\partial J}{\partial w_{ij}} = \sum_{n=1}^{N}\sum_{p=1}^{P}\frac{\partial J}{\partial e_p(n)}\,\frac{\partial e_p(n)}{\partial w_{ij}}$$
Information Theoretic Learning: Any α, any kernel

Recall Renyi's entropy:
$$H_\alpha(X) = \frac{1}{1-\alpha}\log\int p^\alpha(x)\,dx = \frac{1}{1-\alpha}\log E_X\big[p^{\alpha-1}(X)\big]$$

Parzen windowing:
$$\hat p(x) = \frac{1}{N}\sum_{i=1}^{N} K_\sigma(x - x_i)$$

Approximating $E_X[\cdot]$ by the empirical mean gives the nonparametric α-entropy estimator:
$$\hat H_\alpha(X) = \frac{1}{1-\alpha}\log\frac{1}{N^\alpha}\sum_{j=1}^{N}\Big(\sum_{i=1}^{N} K_\sigma(x_j - x_i)\Big)^{\alpha-1}$$
Erdogmus, Principe, IEEE Transactions on Neural Networks, 2002.
Pairwise interactions
between samples
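As a sketch, here is the estimator above in NumPy (Gaussian kernel; the function name and default α are illustrative, and α must differ from 1):

```python
import numpy as np

def renyi_entropy(x, sigma, alpha=2.0):
    # Kernel matrix of all pairwise sample differences
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    inner = k.sum(axis=1)            # sum_i K(x_j - x_i), one value per j
    # H_alpha = 1/(1-alpha) * log( (1/N^alpha) * sum_j inner_j^(alpha-1) )
    return np.log((inner**(alpha - 1)).sum() / len(x)**alpha) / (1 - alpha)
```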
Information Theoretic Learning Quadratic divergence measures
Kullback-Leibler divergence:
$$D_{KL}(X;Y) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$$

Renyi's divergence:
$$D_\alpha(X;Y) = \frac{1}{\alpha-1}\log\int p(x)\Big(\frac{p(x)}{q(x)}\Big)^{\alpha-1} dx$$

Euclidean distance:
$$D_{ED}(X;Y) = \int\big(p(x) - q(x)\big)^2\,dx$$

Cauchy-Schwarz divergence:
$$D_{CS}(X;Y) = -\log\frac{\int p(x)\,q(x)\,dx}{\sqrt{\int p^2(x)\,dx\,\int q^2(x)\,dx}}$$

Mutual information is a special case of each (the distance between the joint pdf and the product of the marginals).
Information Theoretic Learning How to estimate Euclidean Distance
Euclidean distance:
$$D_{ED}(p;q) = \int\big(p(x) - q(x)\big)^2 dx = \int p^2(x)\,dx - 2\int p(x)q(x)\,dx + \int q^2(x)\,dx$$

The cross term $\int p(x)q(x)\,dx$ is called the cross information potential (CIP). From samples $\{x_i\}_{i=1}^{N_p} \sim p$ and $\{x_a\}_{a=1}^{N_q} \sim q$:

$$\hat D_{ED}(p;q) = \frac{1}{N_p^2}\sum_{i=1}^{N_p}\sum_{j=1}^{N_p} G(x_i - x_j) - \frac{2}{N_p N_q}\sum_{i=1}^{N_p}\sum_{a=1}^{N_q} G(x_i - x_a) + \frac{1}{N_q^2}\sum_{a=1}^{N_q}\sum_{b=1}^{N_q} G(x_a - x_b)$$

So $D_{ED}$ can be readily computed with the information potential; likewise for the Cauchy-Schwarz divergence and the quadratic mutual information.

Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
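A minimal NumPy sketch of this estimator (function names are illustrative; a single kernel size is assumed for all three information potentials):

```python
import numpy as np

def gaussian_gram(a, b, sigma):
    # Gaussian kernel between every pair of 1-D samples from a and b
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def euclidean_divergence(xp, xq, sigma):
    # D_ED(p;q) = V(p,p) - 2 V(p,q) + V(q,q); the middle term is the CIP
    vpp = gaussian_gram(xp, xp, sigma).mean()
    vpq = gaussian_gram(xp, xq, sigma).mean()   # cross information potential
    vqq = gaussian_gram(xq, xq, sigma).mean()
    return vpp - 2 * vpq + vqq
```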
[Diagram: learning system Y = q(X, W) maps the input signal X to the output signal Y; an information measure I(Y, D) against the desired signal D is optimized.]
Information Theoretic Learning Unifies supervised and unsupervised learning
[Diagram: the same criterion serves three switch positions. Switch 1: filtering/classification; Switch 2: InfoMax feature extraction; Switch 3: ICA, clustering (also with entropy), NLPCA (with H(Y)).]
ITL - Applications
ITL
• System identification • Feature extraction • Blind source separation • Clustering
The ITL page at www.cnel.ufl.edu has examples and MATLAB code.
ITL – Applications: Nonlinear system identification
Minimize information content of the residual error
Equivalently provides the best density matching between the output and the desired signals.
[Diagram: adaptive system with input x, output y, desired d, and error e; the criterion is minimum entropy of the error.]
Erdogmus, Principe, IEEE Trans. Signal Processing, 2002. (IEEE SP 2003 Young Author Award)
$$\min_{\mathbf w} H_\alpha(e) = \min_{\mathbf w}\frac{1}{1-\alpha}\log V_\alpha(e;\mathbf w) \;\Longleftrightarrow\; \min_{\mathbf w} D\big(p(y, x;\mathbf w)\,;\,p(d, x)\big)$$
ITL – Applications Time-series prediction
Chaotic Mackey-Glass (MG-30) series. Compare two criteria:
• Minimum squared error (MSE)
• Minimum error entropy (MEE)
System: TDNN (6+4+1)
[Figure: probability density of the signal value for the data and the MEE- and MSE-trained predictors (left); probability density of the error value for MEE and MSE (right).]
ITL - Applications
ITL
• System identification • Feature extraction • Blind source separation • Clustering
ITL – Applications Optimal feature extraction
Data processing inequality: mutual information is monotonically non-increasing along the processing chain.
Classification error inequality: the error probability is bounded from below and above by the mutual information.
PhD on feature extraction for sonar target recognition (2002).
[Diagram: the input x is mapped to features z by maximizing the mutual information between z and the class labels d, with information forces backpropagated; a classifier then maps z to y, which is compared with d.]
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
ITL – Applications Extract 2 nonlinear features
64x64 SAR images of 3 vehicles: BMP2, BTR70, T72

Classification results, P(Correct):
  MI+LR      94.89%
  SVM        94.60%
  Templates  90.40%

[Figure: information forces during training.]

Zhao, Xu and Principe, SPIE Automatic Target Recognition, 1999.
Hild, Erdogmus, Principe, IJCNN Tutorial on ITL, 2003.
ITL - Applications
ITL
• System identification • Feature extraction • Blind source separation • Clustering
ITL – Applications Independent component analysis
Observations are generated by an unknown mixture of statistically independent unknown sources.
[Diagram: independent sources s are mixed by H into x; sphering W yields uncorrelated signals y; a rotation R yields independent signals z. Minimize the mutual information of z.]

$$\mathbf x(k) = H\,\mathbf s(k) \qquad\qquad I(\mathbf z) = \sum_{c=1}^{n} H(z_c) - H(\mathbf z)$$

Ken Hild II, PhD on blind source separation (2003)
ITL – Applications On-line separation of mixed sounds
Observed mixtures and separated outputs
[Audio: observed mixtures X1, X2, X3 and separated outputs Z1, Z2, Z3.]

[Figure: on-line separation of a 3x3 mixture; SIR (dB) versus time (thousands of data samples) for MeRMaId-SIG, Infomax, FastICA, and Comon.]
[Figure: off-line separation of a 10x10 mixture; SIR (dB) versus data length (samples) for MeRMaId, Comon, FastICA, and Infomax.]
Hild, Erdogmus, Principe, IEEE Signal Processing Letters, 2001.
ITL - Applications
ITL
• System identification • Feature extraction • Blind source separation • Clustering
ITL – Applications Information theoretic clustering
Select clusters based on entropy and divergence:
• Minimize within-cluster entropy
• Maximize between-cluster divergence
Robert Jenssen PhD on information theoretic clustering
Jenssen, Erdogmus, Hild, Principe, Eltoft, IEEE Trans. Pattern Analysis and Machine Intelligence, 2005 (submitted).

$$\min_{\mathbf m}\frac{\int p(x)\,q(x)\,dx}{\Big(\int p^2(x)\,dx\,\int q^2(x)\,dx\Big)^{1/2}}$$

The numerator measures the between-cluster divergence, the denominator the within-cluster entropies, and $\mathbf m$ is the membership vector.
Reproducing Kernel Hilbert Spaces as a Tool for Nonlinear System Analysis
Fundamentals of Kernel Methods
Kernel methods are a very important class of algorithms for nonlinear optimal signal processing and machine learning. Effectively, for the Gaussian kernel they are shallow (one-layer) neural networks (RBFs).
They exploit the linear structure of reproducing kernel Hilbert spaces (RKHS) with very efficient computation. ANY (!) SP algorithm expressed in terms of inner products has, in principle, an equivalent representation in an RKHS and may correspond to a nonlinear operation in the input space. Solutions may be analytic instead of adaptive when the linear structure is used.
Fundamentals of Kernel Methods: Definition
A Hilbert space is a space of functions f(·). Given a continuous, symmetric, positive-definite kernel $\kappa: U\times U \to \mathbb R$, a mapping Φ, and an inner product $\langle\cdot,\cdot\rangle_H$, an RKHS H is the closure of the span of all Φ(u), with:

Reproducing property: $\langle f, \Phi(u)\rangle_H = f(u)$

Kernel trick: $\langle\Phi(u_1), \Phi(u_2)\rangle_H = \kappa(u_1, u_2)$

Induced norm: $\|f\|_H^2 = \langle f, f\rangle_H$
Fundamentals of Kernel Methods: RKHS induced by the Gaussian kernel
The Gaussian kernel

$$k(x, x') = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(x - x')^2}{2\sigma^2}\Big)$$

is symmetric and positive definite, and thus induces an RKHS on a sample set {x1, …, xN} of reals, denoted RKHS_K. Further, by Mercer's theorem, a kernel mapping Φ can be constructed which transforms data from the input space to RKHS_K, where

$$k(x, x_i) = \langle\Phi(x), \Phi(x_i)\rangle_K$$

and ⟨·,·⟩_K denotes the inner product in RKHS_K.
Wiley Book (2010)
The RKHS defined by the Gaussian kernel can be used for nonlinear signal processing and regression, extending linear adaptive filters with O(N) complexity.
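As a sketch of what such an extension looks like, here is a minimal kernel LMS (KLMS) loop in Python; the function names, step size, and kernel size are illustrative assumptions (KLMS itself is treated in the 2010 Wiley book):

```python
import numpy as np

def gauss(u, v, sigma):
    # Gaussian kernel between two input vectors
    return np.exp(-np.sum((np.asarray(u) - np.asarray(v))**2) / (2 * sigma**2))

def klms(X, d, eta=0.5, sigma=1.0):
    # Kernel LMS: the filter is a growing kernel expansion whose
    # coefficients are the scaled prediction errors.
    centers, coeffs, errors = [], [], []
    for x, dn in zip(X, d):
        y = sum(a * gauss(c, x, sigma) for a, c in zip(coeffs, centers))
        e = dn - y
        centers.append(x)       # allocate a new unit at every sample
        coeffs.append(eta * e)
        errors.append(e)
    return centers, coeffs, np.array(errors)
```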
A RKHS for ITL: RKHS induced by the cross information potential

Let E be the set of all square-integrable one-dimensional probability density functions, i.e., $f_i(x), i \in I$, with $\int f_i^2(x)\,dx < \infty$, where I is an index set. Form the linear manifold (similar to the simplex)

$$\Big\{\sum_i \alpha_i f_i(x)\Big\}$$

Close the set and define a proper inner product

$$\langle f_i, f_j\rangle_{L_2} = \int f_i(x)\,f_j(x)\,dx$$

L2(E) is a Hilbert space, but it is not reproducing. However, define on L2(E) the bivariate function (the cross information potential, CIP)

$$V(f_i, f_j) = \int f_i(x)\,f_j(x)\,dx$$

One can show that the CIP is a positive-definite function, so it defines an RKHS, H_V. Moreover, there is a congruence between L2(E) and H_V.
A RKHS for ITL: ITL cost functions in RKHS_V

Cross information potential (the natural inner product in H_V):
$$V(f,g) = \int f(x)\,g(x)\,dx = \langle V(f,\cdot),\, V(g,\cdot)\rangle_{H_V}$$

Information potential (the norm square in H_V):
$$V(f,f) = \int f(x)\,f(x)\,dx = \langle V(f,\cdot),\, V(f,\cdot)\rangle_{H_V} = \|V(f,\cdot)\|^2$$

Euclidean distance and QMI:
$$D_{ED}(f,g) = \|V(f,\cdot) - V(g,\cdot)\|^2_{H_V} \qquad QMI_{ED}(f_{1,2}) = \|V(f_{1,2},\cdot) - V(f_1 f_2,\cdot)\|^2$$

Cauchy-Schwarz divergence and QMI (the negative log cosine of the angle):
$$D_{CS}(f,g) = -\log\frac{\langle V(f,\cdot),\, V(g,\cdot)\rangle_{H_V}}{\|V(f,\cdot)\|\,\|V(g,\cdot)\|} \qquad QMI_{CS}(f_{1,2}) = -\log\frac{\langle V(f_{1,2},\cdot),\, V(f_1 f_2,\cdot)\rangle_{H_V}}{\|V(f_{1,2},\cdot)\|\,\|V(f_1 f_2,\cdot)\|}$$

Second-order statistics in H_V become higher-order statistics of the data (e.g., MSE in H_V includes HOS of the data). Members of H_V are deterministic quantities even when x is a r.v.
A RKHS for ITL: Relation between ITL and kernel methods through H_V
There is a very tight relationship between HV and HK: By ensemble averaging of HK we get estimators for the HV statistical quantities. Therefore statistics in kernel space can be computed by ITL operators.
Correntropy: A new generalized similarity measure
Correlation is one of the most widely used functions in signal processing.
But correlation only quantifies similarity fully if the random variables are Gaussian distributed.
Can we define a new function that measures similarity but is not restricted to second-order statistics?
Use the kernel framework
Correntropy: A new generalized similarity measure

Define the correntropy of a stationary random process {x_t} as

$$V_x(t,s) = E\big[\kappa_\sigma(x_t - x_s)\big] = \iint \kappa_\sigma(x_t - x_s)\,p(x_t, x_s)\,dx_t\,dx_s$$

The name correntropy comes from the fact that the average over the lags (or the dimensions) is the information potential (the argument of Renyi's quadratic entropy). For a strictly stationary and ergodic r.p.,

$$\hat V(m) = \frac{1}{N - m + 1}\sum_{n=m}^{N}\kappa_\sigma(x_n - x_{n-m})$$

Correntropy can also be defined for pairs of random variables.

Santamaria I., Pokharel P., Principe J., "Generalized Correlation Function: Definition, Properties and Application to Blind Equalization", IEEE Trans. Signal Processing, vol. 54, no. 6, pp. 2187-2197, 2006.
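A minimal NumPy sketch of the lag estimator above (the function name is illustrative; a Gaussian kernel is assumed):

```python
import numpy as np

def correntropy(x, lag, sigma):
    # Average Gaussian kernel of the lag-m increments of the series
    d = x[lag:] - x[:len(x) - lag] if lag > 0 else np.zeros(len(x))
    return np.mean(np.exp(-d**2 / (2 * sigma**2))) / (np.sqrt(2 * np.pi) * sigma)
```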
Correntropy: A new generalized similarity measure

What does it look like? [Figure: the correntropy of a sinewave.]
Correntropy: A new generalized similarity measure

Properties of correntropy:
• It has a maximum at the origin ($\kappa_\sigma(0) = 1/(\sqrt{2\pi}\sigma)$ for the Gaussian kernel)
• It is a symmetric positive function
• Its mean value is the information potential
• Correntropy includes higher-order moments of the data:

$$V(s,t) = \frac{1}{\sqrt{2\pi}\,\sigma}\sum_{n=0}^{\infty}\frac{(-1)^n}{2^n\,n!\,\sigma^{2n}}\,E\big[(x_s - x_t)^{2n}\big]$$

• The matrix whose elements are the correntropy at different lags is Toeplitz
Correntropy: A new generalized similarity measure

Correntropy as a cost function versus MSE:

$$MSE(X,Y) = E\big[(X-Y)^2\big] = \iint (x-y)^2\,f_{XY}(x,y)\,dx\,dy = \int e^2\,f_E(e)\,de$$

$$V(X,Y) = E\big[\kappa(X-Y)\big] = \iint \kappa(x-y)\,f_{XY}(x,y)\,dx\,dy = \int \kappa(e)\,f_E(e)\,de$$
Correntropy: A new generalized similarity measure

Correntropy induces a metric (CIM) in the sample space, defined by

$$CIM(X,Y) = \big(V(0,0) - V(X,Y)\big)^{1/2}$$

Therefore correntropy can be used as an alternative similarity criterion in the space of samples.

[Figure: contour plot of the CIM in the 2-D sample space over [-2, 2] x [-2, 2]; contour levels range from 0.05 near the origin to 0.55 far from it.]
Liu W., Pokharel P., Principe J., "Correntropy: Properties and Applications in Non-Gaussian Signal Processing", IEEE Trans. Signal Processing, vol. 55, no. 11, pp. 5286-5298, 2007.
Correntropy: A new generalized similarity measure

The correntropy criterion implements M-estimation of robust statistics; M-estimation is a generalized maximum likelihood method:

$$\hat\theta = \arg\min_\theta \sum_{i=1}^{N}\rho(x_i, \theta), \qquad \sum_{i=1}^{N}\psi(x_i, \hat\theta) = 0$$

In adaptation, the weighted square problem is defined as

$$\min \sum_{i=1}^{N} w(e_i)\,e_i^2, \qquad w(e) = \rho'(e)/e$$

When

$$\rho(e) = \big(1 - \exp(-e^2/2\sigma^2)\big)\big/\sqrt{2\pi}\,\sigma$$

this leads to maximizing the correntropy of the error at the origin:

$$\min\sum_{i=1}^{N}\big(1 - \exp(-e_i^2/2\sigma^2)\big)/\sqrt{2\pi}\sigma \;\Longleftrightarrow\; \max\sum_{i=1}^{N}\exp(-e_i^2/2\sigma^2)/\sqrt{2\pi}\sigma = \max\sum_{i=1}^{N}\kappa_\sigma(e_i)$$
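A minimal sketch of this robust criterion as iteratively reweighted least squares in NumPy, assuming a linear-in-parameters model; the function name and iteration count are illustrative:

```python
import numpy as np

def mcc_linear_fit(X, d, sigma, n_iter=50):
    # Maximum correntropy criterion via iteratively reweighted least squares:
    # the Gaussian kernel of the residual supplies the M-estimation weight
    # w(e), so outliers (large e) get exponentially small weight.
    w = np.linalg.lstsq(X, d, rcond=None)[0]    # MSE solution as initialization
    for _ in range(n_iter):
        e = d - X @ w
        q = np.exp(-e**2 / (2 * sigma**2))      # per-sample weights
        Xw = X * q[:, None]
        w = np.linalg.solve(X.T @ Xw, Xw.T @ d)  # weighted normal equations
    return w
```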
Correntropy:A new generalized similarity measure
Nonlinear regression with outliers (Middleton model)
Polynomial approximator
Correntropy: A new generalized similarity measure

Define the centered correntropy

$$U(X,Y) = E_{XY}\big[\kappa(X-Y)\big] - E_X E_Y\big[\kappa(X-Y)\big] = \int\kappa(x-y)\,\big\{dF_{XY}(x,y) - dF_X(x)\,dF_Y(y)\big\}$$

$$\hat U(x,y) = \frac{1}{N}\sum_{i=1}^{N}\kappa(x_i - y_i) - \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(x_i - y_j)$$

Define the correntropy coefficient

$$\eta(X,Y) = \frac{U(X,Y)}{\sqrt{U(X,X)\,U(Y,Y)}}, \qquad \hat\eta(x,y) = \frac{\hat U(x,y)}{\sqrt{\hat U(x,x)\,\hat U(y,y)}}$$

Define the parametric correntropy, with $a, b \in \mathbb R$, $a \neq 0$:

$$V_{a,b}(X,Y) = E_{XY}\big[\kappa(aX + b - Y)\big] = \int\kappa(ax + b - y)\,dF_{XY}(x,y)$$

Define the parametric centered correntropy

$$U_{a,b}(X,Y) = E_{XY}\big[\kappa(aX + b - Y)\big] - E_X E_Y\big[\kappa(aX + b - Y)\big]$$

and the parametric correntropy coefficient

$$\eta_{a,b}(X,Y) = \eta(aX + b,\, Y)$$

Rao M., Xu J., Seth S., Chen Y., Tagare M., Principe J., "Correntropy Dependence Measure", submitted to Metrika.
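A minimal NumPy sketch of the centered correntropy and correntropy coefficient estimators above (function names are illustrative; a Gaussian kernel is assumed):

```python
import numpy as np

def centered_correntropy(x, y, sigma):
    kappa = lambda d: np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    joint = kappa(x - y).mean()                       # E_XY[kappa(X - Y)] term
    marginal = kappa(x[:, None] - y[None, :]).mean()  # E_X E_Y[kappa(X - Y)] term
    return joint - marginal

def correntropy_coefficient(x, y, sigma):
    # eta(X,Y) = U(X,Y) / sqrt(U(X,X) U(Y,Y))
    return centered_correntropy(x, y, sigma) / np.sqrt(
        centered_correntropy(x, x, sigma) * centered_correntropy(y, y, sigma))
```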
Correntropy: Correntropy Dependence Measure

Theorem: Given two random variables X and Y, the parametric centered correntropy $U_{a,b}(X,Y) = 0$ for all $a, b \in \mathbb R$ if and only if X and Y are independent.

Theorem: Given two random variables X and Y, the parametric correntropy coefficient $\eta_{a,b}(X,Y) = 1$ for certain $a = a_0$ and $b = b_0$ if and only if $Y = a_0 X + b_0$.

Definition: Given two r.v. X and Y, the correntropy dependence measure is defined as

$$\Gamma(X,Y) = \sup_{a,b \in \mathbb R,\; a \neq 0} \eta_{a,b}(X,Y)$$
Applications of Correntropy Correntropy based correlograms
Correntropy can be used in computational auditory scene analysis (CASA), providing much better frequency resolution.

[Figure: correlograms as a function of lag from a 30-channel cochlea model (pitch = 100 Hz); autocorrelation function (left) versus autocorrentropy function (right).]
Applications of Correntropy Correntropy based correlograms
ROC curves for noiseless (left) and noisy (right) double-vowel discrimination.
Applications of Correntropy: Matched Filtering

The matched filter computes the inner product between the received signal r(n) and the template s(n) ($R_{sr}(0)$). The correntropy MF instead computes (patent pending)

$$V_{rs}(0) = \frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(r_i - s_i)$$
Hypothesis | Received signal | Similarity value
H0 | $r_k = n_k$ | $V_0 = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\sqrt{2\pi}\sigma}\,e^{-(s_i - n_i)^2/2\sigma^2}$
H1 | $r_k = s_k + n_k$ | $V_1 = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\sqrt{2\pi}\sigma}\,e^{-n_i^2/2\sigma^2}$
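A minimal NumPy sketch of the correntropy MF statistic (the function name and threshold policy are illustrative):

```python
import numpy as np

def correntropy_mf(r, s, sigma):
    # V_rs(0): mean Gaussian kernel between received signal and template
    return np.mean(np.exp(-(r - s)**2 / (2 * sigma**2))) / (np.sqrt(2 * np.pi) * sigma)

# Decide H1 (signal present) when V_rs(0) exceeds a threshold chosen for
# the desired ROC operating point.
```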
Applications of Correntropy: Matched Filtering

Linear channels. [Figure: detection performance with white Gaussian noise (left) and impulsive noise (right). Template: binary sequence of length 20; kernel size set using Silverman's rule.]
Applications of Correntropy: Matched Filtering

Alpha-stable noise (α = 1.1) and the effect of the kernel size. Template: binary sequence of length 20; kernel size set using Silverman's rule.
Conclusions
Information Theoretic Learning took us beyond Gaussian statistics and MSE as cost functions.
ITL generalizes many of the statistical concepts we take for granted.
Kernel methods implement shallow neural networks (RBFs) and easily extend the linear algorithms we all know.
KLMS is a simple algorithm for on-line learning of nonlinear systems
Correntropy defines a new RKHS that seems to be very appropriate for nonlinear system identification and robust control
Correntropy may take us out of the local minimum of the (adaptive) design of optimum linear systems
For more information, go to the ITL resource at www.cnel.ufl.edu for tutorials, demos, and downloadable MATLAB code.
Applications of Correntropy Nonlinear temporal PCA
The Karhunen-Loeve transform performs principal component analysis (PCA) of the autocorrelation of the r.p. With the L x N data matrix

$$X = \begin{bmatrix} x(1) & \cdots & x(N)\\ \vdots & \ddots & \vdots\\ x(L) & \cdots & x(N+L-1)\end{bmatrix}$$

the L x L autocorrelation matrix and the N x N Gram matrix are both Toeplitz:

$$R = XX^T = \begin{bmatrix} r(0) & r(1) & \cdots & r(L-1)\\ r(1) & r(0) & \ddots & \vdots\\ \vdots & \ddots & \ddots & r(1)\\ r(L-1) & \cdots & r(1) & r(0)\end{bmatrix} \qquad K = X^T X = \begin{bmatrix} r(0) & r(1) & \cdots & r(N-1)\\ r(1) & r(0) & \ddots & \vdots\\ \vdots & \ddots & \ddots & r(1)\\ r(N-1) & \cdots & r(1) & r(0)\end{bmatrix}$$

$$R = XX^T = UDD^TU^T \qquad K = X^TX = VD^TDV^T \qquad u_i^T X = \sigma_i v_i^T,\; i = 1, 2, \dots, L$$

where $D$ is the L x N diagonal matrix of singular values $\{\sigma_1, \sigma_2, \dots, \sigma_L\}$. KL can also be done by decomposing the Gram matrix K directly.
Applications of Correntropy Nonlinear KL transform
Since the autocorrelation function of the projected data in RKHS is given by correntropy, we can directly construct K with correntropy.
Example: $x(m) = A\sin(2\pi f m) + z(m)$, where the noise is the Gaussian mixture $p_z = 0.8\,N(0, 0.1) + 0.1\,N(4, 0.1) + 0.1\,N(-4, 0.1)$.
[Figure: FFT of the clean signal and the noisy observation (top); FFT of the first PC using the N x N autocorrelation matrix (middle); FFT of the 2nd PC using the N x N correntropy matrix (bottom).]
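A minimal sketch of building the correntropy matrix and extracting its principal components in NumPy/SciPy (the function name is illustrative, and the centering used in the full method is omitted for brevity):

```python
import numpy as np
from scipy.linalg import toeplitz

def correntropy_kl(x, L, sigma):
    # Correntropy at lags 0 .. L-1
    v = np.empty(L)
    for m in range(L):
        d = x[m:] - x[:len(x) - m] if m > 0 else np.zeros(len(x))
        v[m] = np.mean(np.exp(-d**2 / (2 * sigma**2))) / (np.sqrt(2 * np.pi) * sigma)
    V = toeplitz(v)                       # L-by-L Toeplitz correntropy matrix
    eigvals, eigvecs = np.linalg.eigh(V)  # ascending eigenvalue order
    return eigvals[::-1], eigvecs[:, ::-1]  # principal components first
```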
A      | VPCA (2nd PC) | PCA by N-by-N (N=256) | PCA by L-by-L (L=4) | PCA by L-by-L (L=100)
0.2    | 100%          | 15%                   | 3%                  | 8%
0.25   | 100%          | 27%                   | 6%                  | 17%
0.5    | 100%          | 99%                   | 47%                 | 90%

1,000 Monte Carlo runs; kernel size σ = 1.
Applications of Correntropy Correntropy Spectral Density
[Figure: average normalized amplitude versus frequency as a function of the kernel size; time-bandwidth product versus kernel size.]

The CSD is a function of the kernel size and shows the difference between the PSD (σ large) and the new spectral measure.