
Linear Techniques for Regression and Classification on Functional Data

Gilbert Saporta
Chaire de Statistique Appliquée & CEDRIC
Conservatoire National des Arts et Métiers
292 rue Saint Martin, F-75141 Paris Cedex 03
[email protected]
http://cedric.cnam.fr/~saporta

Joint work with D. Costanzo (U.Calabria) & C.Preda (U.Lille2)

Open University, Milton Keynes, May 17, 2007


Outline

1. Introduction
2. OLS regression on functional data
3. PLS functional regression
4. Clusterwise regression
5. Discrimination
6. Anticipated prediction
7. Conclusion and perspectives


1. Introduction

Very high-dimensional data: an infinite number of variables (p = ∞).

Regression on functional data

Example 1: Y = amount of crop, X_t = temperature curves

R.A. Fisher, « The Influence of Rainfall on the Yield of Wheat at Rothamsted », Philosophical Transactions of the Royal Society, B, 213, 89-142 (1924)


Example 2: Growth index of 84 shares at the Paris Stock Exchange during 60 minutes

How to predict X_t from t = 55 to t = 60 for a new share, knowing X_t from t = 0 to t = 55?


• Discrimination on functional data

Example 3: Kneading curves for cookies (Danone Vitapole)


After smoothing with cubic B-splines (Lévéder et al., 2004)

How to predict the quality of the cookies?


Linear combination: « integral regression » (Fisher 1924)

$$\hat Y = \int_0^T \beta(t)\,X_t\,dt$$

instead of a finite sum

$$\hat Y = \sum_{j=1}^{p} \beta_j X_j$$


Discrimination on functional data: a particular case of regression when the response is binary.

Anticipation: determine an optimal time t* < T giving a prediction based on [0; t*] almost as good as the prediction using all the data [0; T].


2. OLS regression on functional data

Y and X_t, t ∈ [0, T] (with zero mean)

2.1 The OLS problem

Minimizing

$$E\!\left[\left(Y - \int_0^T \beta(t)\,X_t\,dt\right)^2\right]$$

leads to the normal, or Wiener–Hopf, equations:

$$\operatorname{cov}(X_t, Y) = \int_0^T C(t,s)\,\beta(s)\,ds$$

where $C(t,s) = \operatorname{cov}(X_t, X_s) = E(X_t X_s)$.
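On a regular grid the integrals become Riemann sums and the Wiener–Hopf equation becomes a p × p linear system. A minimal numpy sketch (not part of the original slides; the function name, the regular-grid assumption and the data layout are illustrative):

```python
import numpy as np

def wiener_hopf_beta(X, y, dt):
    """Discretised functional OLS: cov(X_t, Y) = integral of C(t, s) beta(s) ds.

    X : (n, p) curves sampled on a regular grid with step dt; y : (n,) response.
    """
    Xc = X - X.mean(axis=0)                 # the model assumes zero means
    yc = y - y.mean()
    C = Xc.T @ Xc / len(y)                  # empirical covariance C(t, s) on the grid
    c = Xc.T @ yc / len(y)                  # cov(X_t, Y) on the grid
    # Riemann sum: the integral equation becomes (C * dt) @ beta = c.
    # The system is ill-posed (often p > n); lstsq returns a minimum-norm
    # solution for illustration only -- constrained/smoothed solutions are
    # needed in practice.
    beta, *_ = np.linalg.lstsq(C * dt, c, rcond=None)
    return beta
```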


2.2 Karhunen–Loève decomposition (functional PCA)

$$X_t = \sum_{i=1}^{\infty} f_i(t)\,\xi_i$$

factor loadings:

$$\int_0^T C(t,s)\,f_i(s)\,ds = \lambda_i\,f_i(t)$$

principal components:

$$\xi_i = \int_0^T f_i(t)\,X_t\,dt$$
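On the same discretised representation, the Karhunen–Loève decomposition reduces to an eigen-decomposition of the covariance matrix weighted by the grid step. A sketch (illustrative names, not the authors' code):

```python
import numpy as np

def functional_pca(X, dt):
    """Karhunen-Loeve decomposition by discretisation.

    Returns eigenvalues lambda_i, loadings f_i(t) (columns, normalised so the
    integral of f_i^2 over [0, T] equals 1) and principal components xi_i.
    """
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / X.shape[0]                   # covariance C(t, s) on the grid
    eigvals, eigvecs = np.linalg.eigh(C * dt)    # dt is the quadrature weight
    order = np.argsort(eigvals)[::-1]            # decreasing eigenvalues
    lambdas = eigvals[order]
    f = eigvecs[:, order] / np.sqrt(dt)          # loadings f_i(t) on the grid
    xi = Xc @ f * dt                             # xi_i = integral of f_i(t) X_t dt
    return lambdas, f, xi
```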


Picard's theorem: β is unique if and only if

$$\sum_{i=1}^{\infty} \frac{c_i^2}{\lambda_i^2} < \infty, \qquad \text{where } c_i = \operatorname{cov}(Y,\xi_i) = \operatorname{cov}\!\left(Y, \int_0^T f_i(t)\,X_t\,dt\right) = \int_0^T E(X_t Y)\,f_i(t)\,dt$$

Generally not true… especially when n is finite, since p > n: perfect fit when minimizing

$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \int_0^T \beta(t)\,x_i(t)\,dt\right)^2$$


Even if β is unique, the Wiener–Hopf equation is not an ordinary integral equation: the solution is more often a distribution than a function.

Constrained solutions are needed (cf. Green & Silverman 1994, Ramsay & Silverman 1997).


2.3 Regression on principal components

$$\hat Y = \sum_{i=1}^{\infty} \frac{\operatorname{cov}(Y,\xi_i)}{\lambda_i}\,\xi_i = \sum_{i=1}^{\infty} \frac{c_i}{\lambda_i}\,\xi_i, \qquad R^2(Y,\hat Y) = \sum_{i=1}^{\infty} R^2(Y,\xi_i) = \sum_{i=1}^{\infty} \frac{c_i^2}{\lambda_i\,V(Y)}$$

Rank q approximation:

$$\hat Y_{(q)} = \sum_{i=1}^{q} \frac{\operatorname{cov}(Y,\xi_i)}{\lambda_i}\,\xi_i, \qquad \hat\beta_{(q)}(t) = \sum_{i=1}^{q} \frac{\operatorname{cov}(Y,\xi_i)}{\lambda_i}\,f_i(t)$$
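Principal-component regression then only needs the covariances c_i = cov(Y, ξ_i). A sketch reusing the hypothetical `functional_pca` helper above:

```python
import numpy as np

def functional_pcr(X, y, dt, q):
    """Regression of y on the first q principal components.

    Returns beta_q(t) on the grid and the fitted values
    y_hat = mean(y) + integral of beta_q(t) (X_t - mean(X_t)) dt.
    """
    lambdas, f, xi = functional_pca(X, dt)       # see the sketch above
    yc = y - y.mean()
    c = xi.T @ yc / len(y)                       # c_i = cov(Y, xi_i)
    beta_q = f[:, :q] @ (c[:q] / lambdas[:q])    # sum_i (c_i / lambda_i) f_i(t)
    y_hat = y.mean() + (X - X.mean(axis=0)) @ beta_q * dt
    return beta_q, y_hat
```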


Numerical computations:

• In the general case: solve integral equations
• For step functions (finite number of variables and of units): operators are matrices, but of very large size
• Approximation by discretisation of time


Which principal components? First q? q best correlated with Y?

Principal components are computed irrespective of the response…


3. Functional PLS regression

Use PLS components instead of principal components.

The first PLS component maximizes

$$\max_{\|w\|=1} \operatorname{cov}^2\!\left(Y, \int_0^T w(t)\,X_t\,dt\right)$$

which gives

$$w_1(t) = \frac{\operatorname{cov}(X_t, Y)}{\sqrt{\displaystyle\int_0^T \operatorname{cov}^2(X_s, Y)\,ds}}, \qquad t_1 = \int_0^T w_1(t)\,X_t\,dt$$

Further PLS components are obtained as usual, by iterating on the residuals.


The order q approximation of Y by X_t:

$$\hat Y_{PLS(q)} = c_1 t_1 + \dots + c_q t_q = \int_0^T \hat\beta_{PLS(q)}(t)\,X_t\,dt$$

Convergence theorem:

$$\lim_{q\to\infty} E\!\left[\left(\hat Y_{PLS(q)} - Y\right)^2\right] = 0$$

but q has to be finite in order to get a formula! Usually q is selected by cross-validation (Preda & Saporta, 2005a).
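A sketch of the discretised PLS fit with q chosen by cross-validation, using scikit-learn's `PLSRegression` as a stand-in for the functional algorithm (the grid matrix `X` plays the role of the curves; the function name and the `max_q` range are illustrative):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def fit_functional_pls(X, y, max_q=10, cv=5):
    """PLS regression on discretised curves; q selected by cross-validation."""
    scores = []
    for q in range(1, max_q + 1):
        pls = PLSRegression(n_components=q)
        scores.append(cross_val_score(pls, X, y, cv=cv,
                                      scoring="neg_mean_squared_error").mean())
    best_q = int(np.argmax(scores)) + 1          # smallest mean MSE over the folds
    return PLSRegression(n_components=best_q).fit(X, y), best_q
```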


The first PLS component is easily interpretable: its coefficients have the same sign as r(Y; X_t).

No integral equation to solve, and PLS fits better than PCR (same proof as in De Jong, 1993):

$$R^2\!\left(Y, \hat Y_{PLS(q)}\right) \geq R^2\!\left(Y, \hat Y_{PCR(q)}\right)$$


4. Clusterwise regression

4.1 Model: G, a variable with K categories (sub-populations)

$$E(Y \mid \mathbf{X}=x,\, G=i) = \alpha_i + \beta_i' x, \qquad V(Y \mid \mathbf{X}=x,\, G=i) = \sigma_i^2$$


4.2 OLS and clusterwise regression

Residual variance of the global regression = within-cluster residual variance + variance due to the difference between the local (clusterwise) and the global (OLS) regressions.

$\hat Y$: OLS global estimate versus $\hat Y^{L}$: clusterwise « local » estimate.


4.3 Estimation (Charles, 1977)

• The number of clusters K needs to be known
• Alternating least squares: for a given partition, estimate a linear regression in each cluster, then reallocate each point to the closest regression line (or surface):

$$\hat G(j) = \arg\min_{i\in\{1,\dots,K\}} \left(y_j - \hat y_i(x_j)\right)^2$$

• Equivalent to ML for the fixed-regressors, fixed-partition model (Hennig, 2000)

4.4 Optimal K: AIC, BIC, cross-validation
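A compact sketch of the alternating least-squares loop from 4.3, with plain OLS in each cluster (the functional version in 4.5 replaces these local fits with PLS models); function and parameter names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def clusterwise_regression(X, y, k, n_iter=50, seed=0):
    """Alternating least squares for clusterwise linear regression (Charles, 1977)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(y))            # random initial partition
    for _ in range(n_iter):
        models = []
        for i in range(k):
            mask = labels == i
            if mask.sum() < 2:                       # revive an empty/degenerate cluster
                idx = rng.choice(len(y), size=max(2, len(y) // k), replace=False)
                mask = np.zeros(len(y), dtype=bool)
                mask[idx] = True
            models.append(LinearRegression().fit(X[mask], y[mask]))
        # Reallocation step: G_hat(j) = argmin_i (y_j - y_hat_i(x_j))^2
        residuals = np.column_stack([(y - m.predict(X)) ** 2 for m in models])
        new_labels = residuals.argmin(axis=1)
        if np.array_equal(new_labels, labels):       # partition stabilised
            break
        labels = new_labels
    return labels, models
```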


4.5 Clusterwise functional PLS regression

OLS functional regression is not adequate for estimating the model in each cluster.

Our proposal: estimate local models with functional PLS regression

Is the clusterwise algorithm still consistent?

Proof in Preda & Saporta, 2005b


Prediction: allocate a new observation to a cluster (nearest neighbour or another classification technique), then use the corresponding local model.

May be generalised if Y is itself a random vector:

$$Y = (X_t)_{\,t \in [T,\, T+a]}$$
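A sketch of the prediction step under the assumption that allocation is done by a 1-nearest-neighbour rule on the discretised curves (any classifier could be substituted); names are illustrative:

```python
import numpy as np

def predict_new_curve(x_new, X_train, labels, models):
    """Allocate a new curve to a cluster (1-NN on the L2 distance between
    discretised curves), then apply the corresponding local model."""
    nearest = np.argmin(((X_train - x_new) ** 2).sum(axis=1))
    cluster = labels[nearest]
    return models[cluster].predict(x_new.reshape(1, -1))[0]
```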


4.6 Application to stock market data

Growth index during 1 hour (between 10h and 11h) of 84 shares at Paris Stock Exchange

Goal: predict the evolution of a new share between 10h55 and 11h using the data between 10h and 10h55.


Exact computation would require 1366 variables (the number of intervals on which the 85 curves are constant); instead, discretisation into 60 intervals. Comparison between PCR and PLS.


Crash of share 85 not detected!


Clusterwise PLS: four clusters (17, 32, 10, 25). Number of PLS components in each cluster: 1, 3, 2, 2 (chosen by cross-validation).


Share 85 classified into cluster 1


5. Functional linear discrimination

LDA: find linear combinations

$$\Phi(X) = \int_0^T \beta(t)\,X_t\,dt$$

maximizing the ratio of between-group variance to within-group variance.

For two groups, Fisher's linear discriminant function is obtained by regressing a coded response Y on X_t, the two groups being coded as

$$\sqrt{\frac{p_1}{p_0}} \quad \text{and} \quad -\sqrt{\frac{p_0}{p_1}}$$

(e.g. Preda & Saporta, 2005a).


PLS regression with q components gives an approximation of β(t) and of the score:

$$\hat d_T(X) = \int_0^T \hat\beta_{PLS}(t)\,X_t\,dt$$

For more than 2 groups: PLS2 regression between the k − 1 indicators of Y and X_t; the first PLS component is given by the first eigenvector of the product of the Escoufier operators W_X W_Y (Preda & Saporta, 2002; Barker & Rayens, 2003).
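A sketch of the two-group case via PLS on a coded response, with `PLSRegression` again standing in for the functional algorithm; the coding follows the slide above and the function name is illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def pls_discriminant_score(X, y01, q=3):
    """Two-group discrimination via PLS regression on a coded response.

    y01 : array of 0/1 labels.  Group 0 is coded sqrt(p1/p0) and group 1 is
    coded -sqrt(p0/p1), so the regression reproduces Fisher's discriminant
    direction.  Returns the fitted model and the scores d(x)."""
    y01 = np.asarray(y01)
    p1 = y01.mean()
    p0 = 1.0 - p1
    y_coded = np.where(y01 == 0, np.sqrt(p1 / p0), -np.sqrt(p0 / p1))
    pls = PLSRegression(n_components=q).fit(X, y_coded)
    scores = pls.predict(X).ravel()      # threshold these scores to classify
    return pls, scores
```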


Quality measures

For k = 2: ROC curve and AUC. For a given threshold s, x is classified into G1 if d_T(x) > s.

• Sensitivity, or true positive rate: P(d_T(x) > s | Y = 1) = 1 − β
• 1 − specificity, or 1 − true negative rate: P(d_T(x) > s | Y = 0) = α


ROC curve

• Perfect discrimination: the ROC curve coincides with the left and upper edges of the unit square
• For identical conditional distributions, the ROC curve coincides with the diagonal


The ROC curve is invariant under any monotone increasing transformation of the score.

The area under the ROC curve (AUC) is a global measure of performance allowing (partial) model comparisons. With X1 drawn from G1 and X2 from G2,

$$AUC = \int_0^1 (1-\beta)\,d\alpha = P\!\left(d_T(X_1) > d_T(X_2)\right)$$

AUC is estimated by the proportion of concordant pairs,

$$\widehat{AUC} = \frac{n_c}{n_1 n_2}$$

where $n_c$ is given by the Wilcoxon–Mann–Whitney statistic: $U + W = n_1 n_2 + 0.5\,n_1(n_1+1)$ and $AUC = U/(n_1 n_2)$.
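A direct sketch of that estimator (ties counted as 1/2), with illustrative names:

```python
import numpy as np

def auc_concordant(scores_g1, scores_g2):
    """AUC = P(d(X1) > d(X2)) estimated by the proportion of concordant pairs,
    i.e. the Wilcoxon-Mann-Whitney statistic divided by n1 * n2."""
    s1 = np.asarray(scores_g1)[:, None]
    s2 = np.asarray(scores_g2)[None, :]
    concordant = (s1 > s2).sum() + 0.5 * (s1 == s2).sum()
    return concordant / (len(scores_g1) * len(scores_g2))
```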


6. Anticipated prediction

Find t* < T such that the analysis on [0; t*] gives predictions almost as good as with [0; T].

Solution: when increasing s from 0 to T, look for the first value such that AUC(s) does not differ significantly from AUC(T).


A bootstrap procedure:

• Stratified resampling of the data
• For each replication b, AUC_b(s) and AUC_b(T) are computed
• Student's t test or Wilcoxon test on the B paired differences δ_b = AUC_b(s) − AUC_b(T)
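A sketch of this procedure, assuming discriminant scores have been precomputed for a set of candidate times s (the `scores_by_time` dictionary, the `auc_concordant` helper above and the choice of the Wilcoxon test are illustrative, not the authors' implementation):

```python
import numpy as np
from scipy.stats import wilcoxon

def anticipated_time(scores_by_time, scores_full, y01, B=50, alpha=0.05, seed=0):
    """Return the first time s whose bootstrap AUC(s) does not differ
    significantly from AUC(T) (stratified resampling, paired Wilcoxon test).

    y01 : array of 0/1 labels; scores_full : scores computed on [0, T]."""
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.where(y01 == 0)[0], np.where(y01 == 1)[0]
    for s in sorted(scores_by_time):
        deltas = []
        for _ in range(B):
            boot = np.concatenate([rng.choice(idx0, len(idx0)),   # stratified bootstrap
                                   rng.choice(idx1, len(idx1))])
            yb = y01[boot]
            d_s, d_T = scores_by_time[s][boot], scores_full[boot]
            deltas.append(auc_concordant(d_s[yb == 1], d_s[yb == 0])
                          - auc_concordant(d_T[yb == 1], d_T[yb == 0]))
        if wilcoxon(deltas).pvalue > alpha:     # AUC(s) ~ AUC(T): stop at this s
            return s
    return max(scores_by_time)                  # fall back to the full interval
```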


6.1 Application: simulated data

Two classes with equal priors; W(t) is a Brownian motion.


With B = 50 bootstrap replications:


6.2 Application: kneading curves

• After T = 480 s of kneading one obtains cookies whose quality is Y
• 115 observations: 50 « good », 40 « bad » and 25 « adjustable »
• 241 equally spaced measurements per curve
• Smoothing with cubic B-splines, 16 knots


Performance for Y = {good, bad}:

• Repeat 100 times the split into learning and test samples of sizes (60, 30)
• Average error rate: 0.142 with principal components, 0.112 with PLS components
• Average AUC = 0.746

[Figure: estimated coefficient function β(t)]


Anticipated prediction: with B = 50, t* = 186 s.

The recording period of the dough resistance can be reduced to less than half of the current one.


7. Conclusions and perspectives

PLS regression is an efficient and simple way to get linear prediction for functional data

We have proposed a bootstrap procedure for the problem of anticipated prediction


Works in progress:

• « On-line » forecasting: instead of using the same anticipated decision time t* for all data, adapt t* to each new trajectory given its incoming measurements
• Clusterwise discrimination
• Comparison with functional logistic regression (Aguilera et al., 2006)


References

Aguilera A.M., Escabias M., Valderrama M.J. (2006) Using principal components for estimating logistic regression with high-dimensional multicollinear data. Computational Statistics & Data Analysis, 50, 1905-1924.

Barker M., Rayens W. (2003) Partial least squares for discrimination. Journal of Chemometrics, 17, 166-173.

Charles C. (1977) Régression typologique et reconnaissance des formes. Ph.D. thesis, Université Paris IX.

Costanzo D., Preda C., Saporta G. (2006) Anticipated prediction in discriminant analysis on functional data for binary response. In COMPSTAT 2006, 821-828, Physica-Verlag.

Hennig C. (2000) Identifiability of models for clusterwise linear regression. Journal of Classification, 17, 273-296.

Lévéder C., Abraham C., Cornillon P.A., Matzner-Lober E., Molinari N. (2004) Discrimination de courbes de pétrissage. Chimiométrie 2004, 37-43.

Preda C., Saporta G. (2005a) PLS regression on a stochastic process. Computational Statistics and Data Analysis, 48, 149-158.

Preda C., Saporta G. (2005b) Clusterwise PLS regression on a stochastic process. Computational Statistics and Data Analysis, 49, 99-108.

Preda C., Saporta G., Lévéder C. (2007) PLS classification of functional data. Computational Statistics.

Ramsay J.O., Silverman B.W. (1997) Functional Data Analysis. Springer.