Information Theoretic Signal Processing and Machine Learning
Jose C. Principe
Computational NeuroEngineering Laboratory
Electrical and Computer Engineering Department
University of Florida
www.cnel.ufl.edu
Acknowledgments
Dr. John Fisher, Dr. Dongxin Xu, Dr. Ken Hild, Dr. Deniz Erdogmus
My students: Puskal Pokharel, Weifeng Liu, Jianwu Xu, Kyu-Hwa Jeong, Sudhir Rao, Seungju Han
NSF ECS – 0300340 and 0601271 (Neuroengineering program) and DARPA
Outline
• Information Theoretic Learning
• A RKHS for ITL
• Correntropy as Generalized Correlation
• Applications
  Matched filtering
  Wiener filtering and regression
  Nonlinear PCA
Information Filtering: From Data to Information
Information filters: given data pairs {xi, di}, build optimal adaptive systems using information measures.
Embed the information in the weights of the adaptive system; more formally, use optimization to perform Bayesian estimation.
[Diagram: an adaptive system f(x,w) receives data x and produces output y; the error e = d − y against desired data d drives an information measure, whose output adapts the system.]
Information Theoretic Learning (ITL) book, 2010
Tutorial: IEEE SP Magazine, Nov. 2006
Or the ITL resource: www.cnel.ufl.edu
What is Information Theoretic Learning?
ITL is a methodology to adapt linear or nonlinear systems using criteria based on the information descriptors of entropy and divergence.
The centerpiece is a non-parametric estimator for entropy that:
• Does not require an explicit estimation of the pdf
• Uses the Parzen window method, which is known to be consistent and efficient
• Is smooth
• Is readily integrated in conventional gradient-descent learning
• Provides a link between information theory and kernel learning.
ITL is a different way of thinking about data quantification
Moment expansions, in particular second-order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g., similarity, Hebb's postulate of learning) into second-order statistical equivalents.
ITL replaces second-order moments with a geometric statistical interpretation of data in probability spaces:
• Variance by entropy
• Correlation by correntropy
• Mean square error (MSE) by minimum error entropy (MEE)
• Distances in data space by distances in probability spaces
• Fully exploits the structure of RKHS.
Information Theoretic Learning: Entropy
Not all random variables (r.v.) are equally random! Entropy quantifies the degree of uncertainty in a r.v.; Claude Shannon defined it as
$$H_S(X) = -\sum_k p(x_k)\log p(x_k) \qquad\qquad H_S(X) = -\int f_X(x)\log f_X(x)\,dx$$

[Figure: two pdfs f_X(x) over the same range, one concentrated and one spread out, illustrating different degrees of uncertainty.]
[Figure: probability simplexes for p = [p1 p2] and p = [p1 p2 p3], with contours of the entropy α-norm $\|p\|_\alpha = \big(\sum_{k=1}^{n} p_k^\alpha\big)^{1/\alpha}$ (the α-norm of p raised to the power α).]
Information Theoretic Learning: Renyi's Entropy

Norm of the pdf:

$$H_\alpha(X) = \frac{1}{1-\alpha}\log\sum_k p^\alpha(x_k) \qquad\qquad H_\alpha(X) = \frac{1}{1-\alpha}\log\int f_X^\alpha(x)\,dx$$

Renyi's entropy equals Shannon's as α → 1. The quantity

$$V_\alpha = \int f_X^\alpha(x)\,dx$$

is the information potential, so $H_\alpha(X) = \frac{1}{1-\alpha}\log V_\alpha$.
Information Theoretic Learning: Norm of the PDF (Information Potential)

$V_2(X) = E[p(X)] = \int p^2(x)\,dx$, the 2-norm of the pdf (the information potential), is one of the central concepts in ITL.

[Diagram: E[p(x)] and its estimators connect information theory (Renyi's entropy), cost functions for adaptive systems, and reproducing kernel Hilbert spaces.]
Information Theoretic Learning: Parzen windowing

Given only samples $\{x_1, \dots, x_N\}$ drawn from the distribution $p(x)$, the Parzen estimate with kernel function $G_\sigma$ is

$$\hat p(x) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x_i)$$

Convergence:

$$\lim_{N\to\infty}\hat p_N(x) = p(x) * G_{\sigma_N}(x), \quad\text{provided that } \sigma_N \to 0 \text{ and } N\sigma_N \to \infty$$

[Figure: Parzen estimates for N = 10 and N = 1000 samples; the estimate becomes smoother and closer to the true pdf as N grows.]
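For concreteness, here is a minimal NumPy sketch of the Parzen estimator above; the function name, the sample source, and the bandwidth value are illustrative assumptions, not from the original deck:

```python
import numpy as np

def parzen_pdf(x, samples, sigma):
    # Gaussian kernel between each evaluation point and each sample
    diffs = np.atleast_1d(x)[:, None] - samples[None, :]
    gauss = np.exp(-diffs**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    # Parzen estimate: average of the kernels placed on the samples
    return gauss.mean(axis=1)

# Example: estimate a standard normal pdf from N = 1000 samples
rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)
grid = np.linspace(-5, 5, 200)
p_hat = parzen_pdf(grid, samples, sigma=0.3)
```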
Information Theoretic Learning Information Potential
Order-2 entropy with Gaussian kernels:

$$\hat H_2(X) = -\log\int \hat p^2(x)\,dx = -\log\int\Big(\frac{1}{N}\sum_{i=1}^{N} G_\sigma(x-x_i)\Big)^2 dx = -\log\frac{1}{N^2}\sum_{j=1}^{N}\sum_{i=1}^{N} G_{\sigma\sqrt2}(x_j - x_i)$$

Pairwise interactions between samples: O(N²).

The information potential estimator $\hat V_2(X)$ provides a potential field over the space of the samples, parameterized by the kernel size.

Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
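A minimal NumPy sketch of this O(N²) estimator (function names are illustrative; the √2 factor comes from convolving two Gaussian kernels, as in the derivation above):

```python
import numpy as np

def information_potential(x, sigma):
    # Pairwise differences between all samples: O(N^2) interactions
    d = x[:, None] - x[None, :]
    s2 = 2 * sigma**2  # variance of the convolved kernel G_{sigma*sqrt(2)}
    g = np.exp(-d**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return g.mean()    # V2_hat(X)

def renyi_quadratic_entropy(x, sigma):
    # H2_hat(X) = -log V2_hat(X)
    return -np.log(information_potential(x, sigma))
```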
Information Theoretic Learning: Central Moments Estimators

Mean:
$$E[X] \approx \frac{1}{N}\sum_{i=1}^{N} x_i$$

Variance:
$$E\big[(X - E[X])^2\big] \approx \frac{1}{N}\sum_{i=1}^{N}\Big(x_i - \frac{1}{N}\sum_{j=1}^{N} x_j\Big)^2$$

2-norm of the pdf (information potential):
$$E[f(X)] \approx \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}(x_j - x_i)$$

Renyi's quadratic entropy:
$$\hat H_2(X) = -\log\Big(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}(x_j - x_i)\Big)$$
Information Theoretic Learning Information Force
In adaptation, samples become information particles that interact through information forces.
Information potential field:
$$\hat V_2(x_j) = \frac{1}{N^2}\sum_i G_{\sigma\sqrt2}(x_j - x_i)$$

Information force:
$$\frac{\partial}{\partial x_j}\hat V_2(x_j) = \frac{1}{N^2}\sum_i G'_{\sigma\sqrt2}(x_j - x_i)$$

[Figure: the information force exerted on sample x_j by sample x_i.]

Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
Erdogmus, Principe, Hild, Natural Computing, 2002.
Information Theoretic Learning Error Entropy Criterion (EEC)
We will use iterative algorithms for optimization of a linear system with steepest descent
Given a batch of N samples, the IP of the error is

$$\hat V_2(E) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\big(e(n-i) - e(n-j)\big)$$

For an FIR filter the gradient becomes

$$\nabla\hat V_2(n) = \frac{1}{2\sigma^2 N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\big(e(n-i)-e(n-j)\big)\,\big(e(n-i)-e(n-j)\big)\,\big(\mathbf x(n-i) - \mathbf x(n-j)\big)$$

and the weights are updated by steepest ascent on the IP (equivalently, descent on the error entropy):

$$\mathbf w(n+1) = \mathbf w(n) + \eta\,\nabla\hat V_2(n)$$
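A minimal NumPy sketch of one such batch update, assuming the errors e and their input vectors X are already available; the function name and step size are illustrative:

```python
import numpy as np

def mee_fir_step(w, X, e, eta, sigma):
    # One steepest-ascent step on the error IP (= descent on error entropy).
    # X[i] is the input vector that produced error e[i]; the kernel variance
    # 2*sigma**2 comes from the convolved Gaussian G_{sigma*sqrt(2)}.
    N = len(e)
    s2 = 2.0 * sigma**2
    grad = np.zeros_like(w, dtype=float)
    for i in range(N):
        for j in range(N):
            de = e[i] - e[j]
            g = np.exp(-de**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
            grad += g * de * (X[i] - X[j])   # pairwise information force
    grad /= s2 * N**2
    return w + eta * grad
```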
Information Theoretic Learning Backpropagation of Information Forces
Information forces become the injected error to the dual or adjoint network that determines the weight updates for adaptation.
[Diagram: input and desired signals feed the adaptive system; the IT criterion at the output produces information forces, which the adjoint network backpropagates to form the weight updates.]

$$\frac{\partial J}{\partial w_{ij}} = \sum_{n=1}^{N}\sum_{p=1}^{P}\frac{\partial J}{\partial e_p(n)}\,\frac{\partial e_p(n)}{\partial w_{ij}}$$
Information Theoretic Learning: Any α, any kernel

Recall Renyi's entropy:
$$H_\alpha(X) = \frac{1}{1-\alpha}\log\int p^\alpha(x)\,dx = \frac{1}{1-\alpha}\log E_X\big[p^{\alpha-1}(X)\big]$$

Parzen windowing:
$$\hat p(x) = \frac{1}{N}\sum_{i=1}^{N} K_\sigma(x - x_i)$$

Approximating $E_X[\cdot]$ by the empirical mean gives the nonparametric α-entropy estimator:
$$\hat H_\alpha(X) = \frac{1}{1-\alpha}\log\frac{1}{N^\alpha}\sum_{j=1}^{N}\Big(\sum_{i=1}^{N} K_\sigma(x_j - x_i)\Big)^{\alpha-1}$$
Erdogmus, Principe, IEEE Transactions on Neural Networks, 2002.
Pairwise interactions
between samples
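As a sketch, here is the estimator above in NumPy (Gaussian kernel; the function name and default α are illustrative, and α must differ from 1):

```python
import numpy as np

def renyi_entropy(x, sigma, alpha=2.0):
    # Kernel matrix of all pairwise sample differences
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    inner = k.sum(axis=1)            # sum_i K(x_j - x_i), one value per j
    # H_alpha = 1/(1-alpha) * log( (1/N^alpha) * sum_j inner_j^(alpha-1) )
    return np.log((inner**(alpha - 1)).sum() / len(x)**alpha) / (1 - alpha)
```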
Information Theoretic Learning Quadratic divergence measures
Kullback-Leibler divergence:
$$D_{KL}(X;Y) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$$

Renyi's divergence:
$$D_\alpha(X;Y) = \frac{1}{\alpha-1}\log\int p(x)\Big(\frac{p(x)}{q(x)}\Big)^{\alpha-1} dx$$

Euclidean distance:
$$D_{ED}(X;Y) = \int\big(p(x) - q(x)\big)^2\,dx$$

Cauchy-Schwarz divergence:
$$D_{CS}(X;Y) = -\log\frac{\int p(x)\,q(x)\,dx}{\sqrt{\int p^2(x)\,dx\,\int q^2(x)\,dx}}$$

Mutual information is a special case of each (the distance between the joint pdf and the product of the marginals).
Information Theoretic Learning How to estimate Euclidean Distance
Euclidean distance:
$$D_{ED}(p;q) = \int\big(p(x) - q(x)\big)^2 dx = \int p^2(x)\,dx - 2\int p(x)q(x)\,dx + \int q^2(x)\,dx$$

The cross term $\int p(x)q(x)\,dx$ is called the cross information potential (CIP). From samples $\{x_i\}_{i=1}^{N_p} \sim p$ and $\{x_a\}_{a=1}^{N_q} \sim q$:

$$\hat D_{ED}(p;q) = \frac{1}{N_p^2}\sum_{i=1}^{N_p}\sum_{j=1}^{N_p} G(x_i - x_j) - \frac{2}{N_p N_q}\sum_{i=1}^{N_p}\sum_{a=1}^{N_q} G(x_i - x_a) + \frac{1}{N_q^2}\sum_{a=1}^{N_q}\sum_{b=1}^{N_q} G(x_a - x_b)$$

So $D_{ED}$ can be readily computed with the information potential; likewise for the Cauchy-Schwarz divergence and the quadratic mutual information.

Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
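A minimal NumPy sketch of this estimator (function names are illustrative; a single kernel size is assumed for all three information potentials):

```python
import numpy as np

def gaussian_gram(a, b, sigma):
    # Gaussian kernel between every pair of 1-D samples from a and b
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def euclidean_divergence(xp, xq, sigma):
    # D_ED(p;q) = V(p,p) - 2 V(p,q) + V(q,q); the middle term is the CIP
    vpp = gaussian_gram(xp, xp, sigma).mean()
    vpq = gaussian_gram(xp, xq, sigma).mean()   # cross information potential
    vqq = gaussian_gram(xq, xq, sigma).mean()
    return vpp - 2 * vpq + vqq
```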
[Diagram: learning system Y = q(X, W) maps the input signal X to the output signal Y; an information measure I(Y, D) against the desired signal D is optimized.]
Information Theoretic Learning Unifies supervised and unsupervised learning
[Diagram: the same criterion serves three switch positions. Switch 1: filtering/classification; Switch 2: InfoMax feature extraction; Switch 3: ICA, clustering (also with entropy), NLPCA (with H(Y)).]
ITL - Applications
ITL
• System identification • Feature extraction • Blind source separation • Clustering
The ITL page at www.cnel.ufl.edu has examples and MATLAB code.
ITL – Applications: Nonlinear system identification
Minimize information content of the residual error
Equivalently provides the best density matching between the output and the desired signals.
[Diagram: adaptive system with input x, output y, desired d, and error e; the criterion is minimum entropy of the error.]
Erdogmus, Principe, IEEE Trans. Signal Processing, 2002. (IEEE SP 2003 Young Author Award)
$$\min_{\mathbf w} H_\alpha(e) = \min_{\mathbf w}\frac{1}{1-\alpha}\log V_\alpha(e;\mathbf w) \;\Longleftrightarrow\; \min_{\mathbf w} D\big(p(y, x;\mathbf w)\,;\,p(d, x)\big)$$
ITL – Applications Time-series prediction
Chaotic Mackey-Glass (MG-30) series. Compare two criteria:
• Minimum squared error (MSE)
• Minimum error entropy (MEE)
System: TDNN (6+4+1)
[Figure: probability density of the signal value for the data and the MEE- and MSE-trained predictors (left); probability density of the error value for MEE and MSE (right).]
ITL - Applications
ITL
• System identification • Feature extraction • Blind source separation • Clustering
ITL – Applications Optimal feature extraction
Data processing inequality: mutual information is monotonically non-increasing along the processing chain.
Classification error inequality: the error probability is bounded from below and above by the mutual information.
PhD on feature extraction for sonar target recognition (2002).
[Diagram: the input x is mapped to features z by maximizing the mutual information between z and the class labels d, with information forces backpropagated; a classifier then maps z to y, which is compared with d.]
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
ITL – Applications Extract 2 nonlinear features
64x64 SAR images of 3 vehicles: BMP2, BTR70, T72

Classification results, P(Correct):
  MI+LR      94.89%
  SVM        94.60%
  Templates  90.40%

[Figure: information forces during training.]

Zhao, Xu and Principe, SPIE Automatic Target Recognition, 1999.
Hild, Erdogmus, Principe, IJCNN Tutorial on ITL, 2003.
ITL - Applications
ITL
• System identification • Feature extraction • Blind source separation • Clustering
ITL – Applications Independent component analysis
Observations are generated by an unknown mixture of statistically independent unknown sources.
[Diagram: independent sources s are mixed by H into x; sphering W yields uncorrelated signals y; a rotation R yields independent signals z. Minimize the mutual information of z.]

$$\mathbf x(k) = H\,\mathbf s(k) \qquad\qquad I(\mathbf z) = \sum_{c=1}^{n} H(z_c) - H(\mathbf z)$$

Ken Hild II, PhD on blind source separation (2003)
ITL – Applications On-line separation of mixed sounds
Observed mixtures and separated outputs
[Audio: observed mixtures X1, X2, X3 and separated outputs Z1, Z2, Z3.]

[Figure: on-line separation of a 3x3 mixture; SIR (dB) versus time (thousands of data samples) for MeRMaId-SIG, Infomax, FastICA, and Comon.]
[Figure: off-line separation of a 10x10 mixture; SIR (dB) versus data length (samples) for MeRMaId, Comon, FastICA, and Infomax.]
Hild, Erdogmus, Principe, IEEE Signal Processing Letters, 2001.
ITL - Applications
ITL
• System identification • Feature extraction • Blind source separation • Clustering
ITL – Applications Information theoretic clustering
Select clusters based on entropy and divergence:
• Minimize within-cluster entropy
• Maximize between-cluster divergence
Robert Jenssen PhD on information theoretic clustering
Jenssen, Erdogmus, Hild, Principe, Eltoft, IEEE Trans. Pattern Analysis and Machine Intelligence, 2005 (submitted).

$$\min_{\mathbf m}\frac{\int p(x)\,q(x)\,dx}{\Big(\int p^2(x)\,dx\,\int q^2(x)\,dx\Big)^{1/2}}$$

The numerator measures the between-cluster divergence, the denominator the within-cluster entropies, and $\mathbf m$ is the membership vector.
Reproducing Kernel Hilbert Spaces as a Tool for Nonlinear System Analysis
Fundamentals of Kernel Methods
Kernel methods are a very important class of algorithms for nonlinear optimal signal processing and machine learning. Effectively, for the Gaussian kernel they are shallow (one-layer) neural networks (RBFs).
They exploit the linear structure of reproducing kernel Hilbert spaces (RKHS) with very efficient computation. ANY (!) SP algorithm expressed in terms of inner products has, in principle, an equivalent representation in an RKHS and may correspond to a nonlinear operation in the input space. Solutions may be analytic instead of adaptive when the linear structure is used.
Fundamentals of Kernel Methods: Definition
A Hilbert space is a space of functions f(·). Given a continuous, symmetric, positive-definite kernel $\kappa: U\times U \to \mathbb R$, a mapping Φ, and an inner product $\langle\cdot,\cdot\rangle_H$, an RKHS H is the closure of the span of all Φ(u), with:

Reproducing property: $\langle f, \Phi(u)\rangle_H = f(u)$

Kernel trick: $\langle\Phi(u_1), \Phi(u_2)\rangle_H = \kappa(u_1, u_2)$

Induced norm: $\|f\|_H^2 = \langle f, f\rangle_H$
Fundamentals of Kernel Methods: RKHS induced by the Gaussian kernel
The Gaussian kernel

$$k(x, x') = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(x - x')^2}{2\sigma^2}\Big)$$

is symmetric and positive definite, and thus induces an RKHS on a sample set {x1, …, xN} of reals, denoted RKHS_K. Further, by Mercer's theorem, a kernel mapping Φ can be constructed which transforms data from the input space to RKHS_K, where

$$k(x, x_i) = \langle\Phi(x), \Phi(x_i)\rangle_K$$

and ⟨·,·⟩_K denotes the inner product in RKHS_K.
Wiley Book (2010)
The RKHS defined by the Gaussian kernel can be used for nonlinear signal processing and regression, extending linear adaptive filters with O(N) complexity.
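As a sketch of what such an extension looks like, here is a minimal kernel LMS (KLMS) loop in Python; the function names, step size, and kernel size are illustrative assumptions (KLMS itself is treated in the 2010 Wiley book):

```python
import numpy as np

def gauss(u, v, sigma):
    # Gaussian kernel between two input vectors
    return np.exp(-np.sum((np.asarray(u) - np.asarray(v))**2) / (2 * sigma**2))

def klms(X, d, eta=0.5, sigma=1.0):
    # Kernel LMS: the filter is a growing kernel expansion whose
    # coefficients are the scaled prediction errors.
    centers, coeffs, errors = [], [], []
    for x, dn in zip(X, d):
        y = sum(a * gauss(c, x, sigma) for a, c in zip(coeffs, centers))
        e = dn - y
        centers.append(x)       # allocate a new unit at every sample
        coeffs.append(eta * e)
        errors.append(e)
    return centers, coeffs, np.array(errors)
```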
A RKHS for ITL: RKHS induced by the cross information potential

Let E be the set of all square-integrable one-dimensional probability density functions, i.e., $f_i(x), i \in I$, with $\int f_i^2(x)\,dx < \infty$, where I is an index set. Form the linear manifold (similar to the simplex)

$$\Big\{\sum_i \alpha_i f_i(x)\Big\}$$

Close the set and define a proper inner product

$$\langle f_i, f_j\rangle_{L_2} = \int f_i(x)\,f_j(x)\,dx$$

L2(E) is a Hilbert space, but it is not reproducing. However, define on L2(E) the bivariate function (the cross information potential, CIP)

$$V(f_i, f_j) = \int f_i(x)\,f_j(x)\,dx$$

One can show that the CIP is a positive-definite function, so it defines an RKHS, H_V. Moreover, there is a congruence between L2(E) and H_V.
A RKHS for ITL: ITL cost functions in RKHS_V

Cross information potential (the natural inner product in H_V):
$$V(f,g) = \int f(x)\,g(x)\,dx = \langle V(f,\cdot),\, V(g,\cdot)\rangle_{H_V}$$

Information potential (the norm square in H_V):
$$V(f,f) = \int f(x)\,f(x)\,dx = \langle V(f,\cdot),\, V(f,\cdot)\rangle_{H_V} = \|V(f,\cdot)\|^2$$

Euclidean distance and QMI:
$$D_{ED}(f,g) = \|V(f,\cdot) - V(g,\cdot)\|^2_{H_V} \qquad QMI_{ED}(f_{1,2}) = \|V(f_{1,2},\cdot) - V(f_1 f_2,\cdot)\|^2$$

Cauchy-Schwarz divergence and QMI (the negative log cosine of the angle):
$$D_{CS}(f,g) = -\log\frac{\langle V(f,\cdot),\, V(g,\cdot)\rangle_{H_V}}{\|V(f,\cdot)\|\,\|V(g,\cdot)\|} \qquad QMI_{CS}(f_{1,2}) = -\log\frac{\langle V(f_{1,2},\cdot),\, V(f_1 f_2,\cdot)\rangle_{H_V}}{\|V(f_{1,2},\cdot)\|\,\|V(f_1 f_2,\cdot)\|}$$

Second-order statistics in H_V become higher-order statistics of the data (e.g., MSE in H_V includes HOS of the data). Members of H_V are deterministic quantities even when x is a r.v.
A RKHS for ITL: Relation between ITL and kernel methods through H_V
There is a very tight relationship between HV and HK: By ensemble averaging of HK we get estimators for the HV statistical quantities. Therefore statistics in kernel space can be computed by ITL operators.
Correntropy: A new generalized similarity measure
Correlation is one of the most widely used functions in signal processing.
But correlation only quantifies similarity fully if the random variables are Gaussian distributed.
Can we define a new function that measures similarity but is not restricted to second-order statistics?
Use the kernel framework
Correntropy: A new generalized similarity measure

Define the correntropy of a stationary random process {x_t} as

$$V_x(t,s) = E\big[\kappa_\sigma(x_t - x_s)\big] = \iint \kappa_\sigma(x_t - x_s)\,p(x_t, x_s)\,dx_t\,dx_s$$

The name correntropy comes from the fact that the average over the lags (or the dimensions) is the information potential (the argument of Renyi's quadratic entropy). For a strictly stationary and ergodic r.p.,

$$\hat V(m) = \frac{1}{N - m + 1}\sum_{n=m}^{N}\kappa_\sigma(x_n - x_{n-m})$$

Correntropy can also be defined for pairs of random variables.

Santamaria I., Pokharel P., Principe J., "Generalized Correlation Function: Definition, Properties and Application to Blind Equalization", IEEE Trans. Signal Processing, vol. 54, no. 6, pp. 2187-2197, 2006.
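A minimal NumPy sketch of the lag estimator above (the function name is illustrative; a Gaussian kernel is assumed):

```python
import numpy as np

def correntropy(x, lag, sigma):
    # Average Gaussian kernel of the lag-m increments of the series
    d = x[lag:] - x[:len(x) - lag] if lag > 0 else np.zeros(len(x))
    return np.mean(np.exp(-d**2 / (2 * sigma**2))) / (np.sqrt(2 * np.pi) * sigma)
```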
Correntropy: A new generalized similarity measure

What does it look like? [Figure: the correntropy of a sinewave.]
Correntropy: A new generalized similarity measure

Properties of correntropy:
• It has a maximum at the origin ($\kappa_\sigma(0) = 1/(\sqrt{2\pi}\sigma)$ for the Gaussian kernel)
• It is a symmetric positive function
• Its mean value is the information potential
• Correntropy includes higher-order moments of the data:

$$V(s,t) = \frac{1}{\sqrt{2\pi}\,\sigma}\sum_{n=0}^{\infty}\frac{(-1)^n}{2^n\,n!\,\sigma^{2n}}\,E\big[(x_s - x_t)^{2n}\big]$$

• The matrix whose elements are the correntropy at different lags is Toeplitz
Correntropy: A new generalized similarity measure

Correntropy as a cost function versus MSE:

$$MSE(X,Y) = E\big[(X-Y)^2\big] = \iint (x-y)^2\,f_{XY}(x,y)\,dx\,dy = \int e^2\,f_E(e)\,de$$

$$V(X,Y) = E\big[\kappa(X-Y)\big] = \iint \kappa(x-y)\,f_{XY}(x,y)\,dx\,dy = \int \kappa(e)\,f_E(e)\,de$$
Correntropy: A new generalized similarity measure

Correntropy induces a metric (CIM) in the sample space, defined by

$$CIM(X,Y) = \big(V(0,0) - V(X,Y)\big)^{1/2}$$

Therefore correntropy can be used as an alternative similarity criterion in the space of samples.

[Figure: contour plot of the CIM in the 2-D sample space over [-2, 2] x [-2, 2]; contour levels range from 0.05 near the origin to 0.55 far from it.]
Liu W., Pokharel P., Principe J., "Correntropy: Properties and Applications in Non-Gaussian Signal Processing", IEEE Trans. Signal Processing, vol. 55, no. 11, pp. 5286-5298, 2007.
Correntropy: A new generalized similarity measure

The correntropy criterion implements M-estimation of robust statistics; M-estimation is a generalized maximum likelihood method:

$$\hat\theta = \arg\min_\theta \sum_{i=1}^{N}\rho(x_i, \theta), \qquad \sum_{i=1}^{N}\psi(x_i, \hat\theta) = 0$$

In adaptation, the weighted square problem is defined as

$$\min \sum_{i=1}^{N} w(e_i)\,e_i^2, \qquad w(e) = \rho'(e)/e$$

When

$$\rho(e) = \big(1 - \exp(-e^2/2\sigma^2)\big)\big/\sqrt{2\pi}\,\sigma$$

this leads to maximizing the correntropy of the error at the origin:

$$\min\sum_{i=1}^{N}\big(1 - \exp(-e_i^2/2\sigma^2)\big)/\sqrt{2\pi}\sigma \;\Longleftrightarrow\; \max\sum_{i=1}^{N}\exp(-e_i^2/2\sigma^2)/\sqrt{2\pi}\sigma = \max\sum_{i=1}^{N}\kappa_\sigma(e_i)$$
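A minimal sketch of this robust criterion as iteratively reweighted least squares in NumPy, assuming a linear-in-parameters model; the function name and iteration count are illustrative:

```python
import numpy as np

def mcc_linear_fit(X, d, sigma, n_iter=50):
    # Maximum correntropy criterion via iteratively reweighted least squares:
    # the Gaussian kernel of the residual supplies the M-estimation weight
    # w(e), so outliers (large e) get exponentially small weight.
    w = np.linalg.lstsq(X, d, rcond=None)[0]    # MSE solution as initialization
    for _ in range(n_iter):
        e = d - X @ w
        q = np.exp(-e**2 / (2 * sigma**2))      # per-sample weights
        Xw = X * q[:, None]
        w = np.linalg.solve(X.T @ Xw, Xw.T @ d)  # weighted normal equations
    return w
```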
Correntropy:A new generalized similarity measure
Nonlinear regression with outliers (Middleton model)
Polynomial approximator
Correntropy: A new generalized similarity measure

Define the centered correntropy

$$U(X,Y) = E_{XY}\big[\kappa(X-Y)\big] - E_X E_Y\big[\kappa(X-Y)\big] = \int\kappa(x-y)\,\big\{dF_{XY}(x,y) - dF_X(x)\,dF_Y(y)\big\}$$

$$\hat U(x,y) = \frac{1}{N}\sum_{i=1}^{N}\kappa(x_i - y_i) - \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(x_i - y_j)$$

Define the correntropy coefficient

$$\eta(X,Y) = \frac{U(X,Y)}{\sqrt{U(X,X)\,U(Y,Y)}}, \qquad \hat\eta(x,y) = \frac{\hat U(x,y)}{\sqrt{\hat U(x,x)\,\hat U(y,y)}}$$

Define the parametric correntropy, with $a, b \in \mathbb R$, $a \neq 0$:

$$V_{a,b}(X,Y) = E_{XY}\big[\kappa(aX + b - Y)\big] = \int\kappa(ax + b - y)\,dF_{XY}(x,y)$$

Define the parametric centered correntropy

$$U_{a,b}(X,Y) = E_{XY}\big[\kappa(aX + b - Y)\big] - E_X E_Y\big[\kappa(aX + b - Y)\big]$$

and the parametric correntropy coefficient

$$\eta_{a,b}(X,Y) = \eta(aX + b,\, Y)$$

Rao M., Xu J., Seth S., Chen Y., Tagare M., Principe J., "Correntropy Dependence Measure", submitted to Metrika.
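A minimal NumPy sketch of the centered correntropy and correntropy coefficient estimators above (function names are illustrative; a Gaussian kernel is assumed):

```python
import numpy as np

def centered_correntropy(x, y, sigma):
    kappa = lambda d: np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    joint = kappa(x - y).mean()                       # E_XY[kappa(X - Y)] term
    marginal = kappa(x[:, None] - y[None, :]).mean()  # E_X E_Y[kappa(X - Y)] term
    return joint - marginal

def correntropy_coefficient(x, y, sigma):
    # eta(X,Y) = U(X,Y) / sqrt(U(X,X) U(Y,Y))
    return centered_correntropy(x, y, sigma) / np.sqrt(
        centered_correntropy(x, x, sigma) * centered_correntropy(y, y, sigma))
```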
Correntropy: Correntropy Dependence Measure

Theorem: Given two random variables X and Y, the parametric centered correntropy $U_{a,b}(X,Y) = 0$ for all $a, b \in \mathbb R$ if and only if X and Y are independent.

Theorem: Given two random variables X and Y, the parametric correntropy coefficient $\eta_{a,b}(X,Y) = 1$ for certain $a = a_0$ and $b = b_0$ if and only if $Y = a_0 X + b_0$.

Definition: Given two r.v. X and Y, the correntropy dependence measure is defined as

$$\Gamma(X,Y) = \sup_{a,b \in \mathbb R,\; a \neq 0} \eta_{a,b}(X,Y)$$
Applications of Correntropy Correntropy based correlograms
Correntropy can be used in computational auditory scene analysis (CASA), providing much better frequency resolution.

[Figure: correlograms as a function of lag from a 30-channel cochlea model (pitch = 100 Hz); autocorrelation function (left) versus autocorrentropy function (right).]
Applications of Correntropy Correntropy based correlograms
ROC curves for noiseless (left) and noisy (right) double-vowel discrimination.
Applications of Correntropy: Matched Filtering

The matched filter computes the inner product between the received signal r(n) and the template s(n) ($R_{sr}(0)$). The correntropy MF instead computes (patent pending)

$$V_{rs}(0) = \frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(r_i - s_i)$$
Hypothesis | Received signal | Similarity value
H0 | $r_k = n_k$ | $V_0 = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\sqrt{2\pi}\sigma}\,e^{-(s_i - n_i)^2/2\sigma^2}$
H1 | $r_k = s_k + n_k$ | $V_1 = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\sqrt{2\pi}\sigma}\,e^{-n_i^2/2\sigma^2}$
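A minimal NumPy sketch of the correntropy MF statistic (the function name and threshold policy are illustrative):

```python
import numpy as np

def correntropy_mf(r, s, sigma):
    # V_rs(0): mean Gaussian kernel between received signal and template
    return np.mean(np.exp(-(r - s)**2 / (2 * sigma**2))) / (np.sqrt(2 * np.pi) * sigma)

# Decide H1 (signal present) when V_rs(0) exceeds a threshold chosen for
# the desired ROC operating point.
```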
Applications of Correntropy: Matched Filtering

Linear channels. [Figure: detection performance with white Gaussian noise (left) and impulsive noise (right). Template: binary sequence of length 20; kernel size set using Silverman's rule.]
Applications of Correntropy: Matched Filtering

Alpha-stable noise (α = 1.1) and the effect of the kernel size. Template: binary sequence of length 20; kernel size set using Silverman's rule.
Conclusions
Information Theoretic Learning took us beyond Gaussian statistics and MSE as cost functions.
ITL generalizes many of the statistical concepts we take for granted.
Kernel methods implement shallow neural networks (RBFs) and easily extend the linear algorithms we all know.
KLMS is a simple algorithm for on-line learning of nonlinear systems
Correntropy defines a new RKHS that seems to be very appropriate for nonlinear system identification and robust control
Correntropy may take us out of the local minimum of the (adaptive) design of optimum linear systems
For more information, go to the ITL resource at www.cnel.ufl.edu for tutorials, demos, and downloadable MATLAB code.
Applications of Correntropy Nonlinear temporal PCA
The Karhunen-Loeve transform performs principal component analysis (PCA) of the autocorrelation of the r.p. With the L x N data matrix

$$X = \begin{bmatrix} x(1) & \cdots & x(N)\\ \vdots & \ddots & \vdots\\ x(L) & \cdots & x(N+L-1)\end{bmatrix}$$

the L x L autocorrelation matrix and the N x N Gram matrix are both Toeplitz:

$$R = XX^T = \begin{bmatrix} r(0) & r(1) & \cdots & r(L-1)\\ r(1) & r(0) & \ddots & \vdots\\ \vdots & \ddots & \ddots & r(1)\\ r(L-1) & \cdots & r(1) & r(0)\end{bmatrix} \qquad K = X^T X = \begin{bmatrix} r(0) & r(1) & \cdots & r(N-1)\\ r(1) & r(0) & \ddots & \vdots\\ \vdots & \ddots & \ddots & r(1)\\ r(N-1) & \cdots & r(1) & r(0)\end{bmatrix}$$

$$R = XX^T = UDD^TU^T \qquad K = X^TX = VD^TDV^T \qquad u_i^T X = \sigma_i v_i^T,\; i = 1, 2, \dots, L$$

where $D$ is the L x N diagonal matrix of singular values $\{\sigma_1, \sigma_2, \dots, \sigma_L\}$. KL can also be done by decomposing the Gram matrix K directly.
Applications of Correntropy Nonlinear KL transform
Since the autocorrelation function of the projected data in RKHS is given by correntropy, we can directly construct K with correntropy.
Example: $x(m) = A\sin(2\pi f m) + z(m)$, where the noise is the Gaussian mixture $p_z = 0.8\,N(0, 0.1) + 0.1\,N(4, 0.1) + 0.1\,N(-4, 0.1)$.
[Figure: FFT of the clean signal and the noisy observation (top); FFT of the first PC using the N x N autocorrelation matrix (middle); FFT of the 2nd PC using the N x N correntropy matrix (bottom).]
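A minimal sketch of building the correntropy matrix and extracting its principal components in NumPy/SciPy (the function name is illustrative, and the centering used in the full method is omitted for brevity):

```python
import numpy as np
from scipy.linalg import toeplitz

def correntropy_kl(x, L, sigma):
    # Correntropy at lags 0 .. L-1
    v = np.empty(L)
    for m in range(L):
        d = x[m:] - x[:len(x) - m] if m > 0 else np.zeros(len(x))
        v[m] = np.mean(np.exp(-d**2 / (2 * sigma**2))) / (np.sqrt(2 * np.pi) * sigma)
    V = toeplitz(v)                       # L-by-L Toeplitz correntropy matrix
    eigvals, eigvecs = np.linalg.eigh(V)  # ascending eigenvalue order
    return eigvals[::-1], eigvecs[:, ::-1]  # principal components first
```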
A      | VPCA (2nd PC) | PCA by N-by-N (N=256) | PCA by L-by-L (L=4) | PCA by L-by-L (L=100)
0.2    | 100%          | 15%                   | 3%                  | 8%
0.25   | 100%          | 27%                   | 6%                  | 17%
0.5    | 100%          | 99%                   | 47%                 | 90%

1,000 Monte Carlo runs; kernel size σ = 1.
Applications of Correntropy Correntropy Spectral Density
[Figure: average normalized amplitude versus frequency as a function of the kernel size; time-bandwidth product versus kernel size.]

The CSD is a function of the kernel size and shows the difference between the PSD (σ large) and the new spectral measure.