Correlation Aware Feature Selection
http://mpa.itc.it
Annalisa Barla, Cesare Furlanello, Giuseppe Jurman, Stefano Merler, Silvano Paoli
Berlin – 8/10/2005
Overview
On Feature Selection
Correlation Aware Ranking
Synthetic Example
Feature Selection
Step-wise variable selection: select n* < N effective variables modeling the classification function.
[Figure: elimination scheme over N steps (Step 1 … Step N), from one feature up to N features.]
Feature Selection
Step-wise selection of the features.
[Figure: at each step the features are split into ranked and discarded sets.]
Ranking
Classifier-independent filters (ignoring labelling)
Prefiltering is risky: you might discard features that turn out to be important.
Induced by a classifier
Support Vector Machines
The SVM minimizes the regularized functional
$$\min_{f \in \mathcal{H}} \sum_{i=1}^{l} V\left(y_i, f(x_i)\right) + \lambda \, \|f\|_K^2$$
Classification function:
$$f(x) = \sum_{i=1}^{l} \alpha_i \, K(x, x_i)$$
Optimal Separating Hyperplane (OSH):
$$w \cdot x + b = 0, \qquad w = \sum_i \alpha_i y_i x_i$$
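As a concrete check of the dual relation above, here is a minimal sketch with scikit-learn's linear SVC (the library and the toy data are our assumptions, not part of the deck): the primal weight vector w of the OSH is recovered from the dual coefficients.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: 20 samples, 5 features (hypothetical example).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 1, (10, 5)), rng.normal(-1, 1, (10, 5))])
y = np.array([1] * 10 + [-1] * 10)

svc = SVC(kernel="linear").fit(X, y)

# Dual expansion of the OSH weights: w = sum_i alpha_i * y_i * x_i
# (sklearn's dual_coef_ already stores the products alpha_i * y_i).
w = svc.dual_coef_ @ svc.support_vectors_
assert np.allclose(w, svc.coef_)
```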
The classification/ranking machine: the RFE idea. Given N features (genes):
Train an SVM
Compute a cost function J from the weight coefficients of the SVM
Rank the features by their contribution to J
Discard the feature contributing least to J
Reapply the procedure on the remaining N-1 features
This is called Recursive Feature Elimination (RFE).
Features are ranked according to their contribution to the classification, given the training data.
Computationally and data intensive, and at risk of selection bias.
Guyon et al. 2002
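A minimal sketch of the RFE loop, eliminating one feature per step with the squared weight w_j^2 as the cost criterion J, as in Guyon et al. (2002); the scikit-learn SVM is an illustrative stand-in for the original implementation.

```python
import numpy as np
from sklearn.svm import SVC

def rfe_rank(X, y, C=1.0):
    """Recursive Feature Elimination with a linear SVM.

    Returns feature indices in elimination order: the first
    index is the least important feature, the last the most.
    """
    remaining = list(range(X.shape[1]))
    eliminated = []
    while remaining:
        svc = SVC(kernel="linear", C=C).fit(X[:, remaining], y)
        w = svc.coef_.ravel()
        # Criterion J_j = w_j^2: discard the least contributing feature.
        worst = int(np.argmin(w ** 2))
        eliminated.append(remaining.pop(worst))
    return eliminated
```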
RFE-based Methods
Eliminating chunks of features at a time:
Parametric: Sqrt(N)-RFE, Bisection-RFE
Non-parametric: E-RFE (adapting to the weight distribution): thresholding weights at a value w* (sketched below)
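E-RFE changes only the elimination rule: all features whose weight falls below an adaptive threshold w* are dropped in a single step. A sketch of that inner step follows; the "fraction of the mean weight" rule for w* is purely illustrative, since the actual threshold adapts to the observed weight distribution.

```python
import numpy as np

def erfe_step(weights, frac=0.5):
    """One E-RFE elimination step: threshold |w_j| at w*.

    The choice of w* as a fraction of the mean weight is an
    illustrative assumption, not the deck's exact rule.
    """
    w = np.abs(weights)
    w_star = frac * w.mean()
    keep = np.flatnonzero(w >= w_star)
    drop = np.flatnonzero(w < w_star)
    return keep, drop, w_star
```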
Variable Elimination
Given a family F = {x_1, x_2, …, x_H} such that:
$\rho(x_i, x_j) > T$ for a given threshold T (correlated genes),
$w(x_1) \approx w(x_2) \approx \dots \approx \varepsilon < w^*$,
BUT $w(x_1) + w(x_2) + \dots \gg w^*$.
Each single weight is negligible, yet the cumulative weight of the family is not.
Correlated Genes (1)
Correlated Genes (2)
Synthetic Data
Binary problem: 100 (50 + 50) samples of 1000 genes:
genes 1–50: randomly extracted from N(1,1) and N(-1,1) for the two classes respectively
genes 51–100: a single gene extracted from N(1,1) and N(-1,1) respectively, repeated 50 times
genes 101–1000: extracted from Unif(-4,4)
[Figure: layout of the 100 × 1000 data matrix: genes 1–50 drawn from N(1,1)/N(-1,1), one feature repeated 50 times in columns 51–100, Unif(-4,4) noise in columns 101–1000; 51 significant features in total.]
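The dataset can be reproduced in a few lines of NumPy (a sketch; the deck's experiments were run through the R/C code of BioDCV):

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class = 50

# Genes 1-50: N(1,1) for class 1, N(-1,1) for class 2.
informative = np.vstack([rng.normal(1, 1, (n_per_class, 50)),
                         rng.normal(-1, 1, (n_per_class, 50))])

# Genes 51-100: one N(+1,1)/N(-1,1) gene repeated 50 times,
# i.e. 50 perfectly correlated copies.
g = np.concatenate([rng.normal(1, 1, n_per_class),
                    rng.normal(-1, 1, n_per_class)])
repeated = np.tile(g[:, None], (1, 50))

# Genes 101-1000: pure noise from Unif(-4,4).
noise = rng.uniform(-4, 4, (2 * n_per_class, 900))

X = np.hstack([informative, repeated, noise])   # 100 x 1000 matrix
y = np.array([1] * n_per_class + [-1] * n_per_class)
```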
Our algorithm
At each E-RFE step j, before discarding the low-weight elimination set E_j, detect families F_l of correlated features within E_j: if the cumulative weight of F_l exceeds w*, the family is saved from elimination; else only the representative g_l is saved.
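A sketch of this correction under stated assumptions: inside the elimination set, features are grouped into families by pairwise correlation above T, and a family whose cumulative weight exceeds w* is rescued from elimination. The greedy grouping below is an illustrative stand-in; clustering and gene-function grouping are listed as work in progress later in the deck.

```python
import numpy as np

def correlation_correction(X, weights, drop, w_star, T=0.9):
    """Rescue correlated families from the E-RFE elimination set.

    X       : (samples x features) data matrix
    weights : array of |w_j| for every feature
    drop    : indices E-RFE would discard (weights below w*)
    Returns the indices still to be discarded after the correction.
    """
    corr = np.abs(np.corrcoef(X[:, drop], rowvar=False))
    unassigned = set(range(len(drop)))
    really_drop = []
    while unassigned:
        i = unassigned.pop()
        # Greedy family F_l: unassigned features correlated with i above T.
        family = [i] + [j for j in unassigned if corr[i, j] > T]
        unassigned -= set(family)
        members = [drop[k] for k in family]
        if weights[members].sum() <= w_star:
            really_drop.extend(members)   # negligible family: discard it
        # else: cumulative weight exceeds w*, the whole family is saved
    return np.array(really_drop, dtype=int)
```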
Methodology
Implemented within the BioDCV system (50 replicates)
Realized through R/C code interaction
Synthetic Data
[Figure: feature ranks over the elimination steps (genes 1–1000); gene 100 is consistently ranked 2nd.]
Work in Progress
Preservation of highly correlated genes with low initial weights on microarray datasets
Robust correlation measures
Different techniques to detect the F_l families (clustering, gene functions)
Synthetic Data
Features discarded at each step by plain E-RFE (first table) and by E-RFE with the correlation correction (second table); the two differ only at step 9, where the correction saves feature 100.

Step   features 1-50   features 51-100   features >100
0      0               0                 282
1      0               0                 158
2      0               0                 67
3      0               0                 59
4      0               0                 167
5      0               0                 39
6      0               0                 31
7      0               0                 13
8      0               0                 9
9      0               50                6
10     0               0                 55
11     0               0                 32
12     0               0                 20
13     0               0                 13
14     0               0                 18
SAVED  50              0

Step   features 1-50   features 51-100   features >100
0      0               0                 282
1      0               0                 158
2      0               0                 67
3      0               0                 59
4      0               0                 167
5      0               0                 39
6      0               0                 31
7      0               0                 13
8      0               0                 9
9      0               49                6
10     0               0                 55
11     0               0                 32
12     0               0                 20
13     0               0                 13
14     0               0                 18
SAVED  50              0
Synthetic Data
Features discarded at step 9 by the E-RFE procedure: genes 51–100, plus 227, 559, 864, 470, 363, 735.
Correlation correction: saves feature 100.
INFRASTRUCTURE
• MPACluster -> available for batch jobs
• Connecting with IFOM -> 2005
• Running at IFOM -> 2005/2006
• Production on GRID resources (spring 2005)
Challenges
ALGORITHMS II
1. Gene list fusion: suite of algebraic/statistical methods
2. Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large-scale semi-supervised analysis
3. New SVM Kernels for prediction on spectrometry data within complete validation
Challenges for predictive profiling
A few issues in feature selection, with a particular interest in the classification of genomic data
WHY?
To ease the computational burden: discard the (apparently) less significant features and train in a simplified space, alleviating the curse of dimensionality.
To enhance information: highlight (and rank) the most important features and improve the knowledge of the underlying process.
HOW?
As a pre-processing step: employ a statistical filter (t-test, S2N); both are sketched below.
As a learning step: link the feature ranking to the classification task (wrapper methods, …).
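For concreteness, a sketch of the two filters with their standard definitions (the S2N here is the usual signal-to-noise ratio |mu_1 - mu_2| / (sd_1 + sd_2)):

```python
import numpy as np
from scipy import stats

def filter_scores(X, y, method="s2n"):
    """Label-aware filter scores; larger means more relevant."""
    A, B = X[y == 1], X[y == -1]
    if method == "s2n":
        # Signal-to-noise ratio, feature by feature.
        return np.abs(A.mean(0) - B.mean(0)) / (A.std(0) + B.std(0))
    # Two-sample t-test statistic, feature by feature.
    t, _ = stats.ttest_ind(A, B, axis=0)
    return np.abs(t)
```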
Prefiltering is risky: you might discard features that turn out to be important. Nevertheless, wrapper methods are quite costly. Moreover, gene expression data present peculiar situations, such as clones or highly correlated features, that may be a pitfall for several selection methods.
A classic alternative is to map the data into linear combinations of features, and then select.
Principal Component Analysis
Metagenes
(a simplified model for pathways, but biological interpretations require caution)
But then we are no longer working with the original features.
[Figure: eigen-craters for unexploded-bomb risk maps.]
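A minimal metagene construction via PCA (a sketch: "metagene" here simply means a principal component of the expression matrix, and any biological reading deserves the caution noted above):

```python
import numpy as np
from sklearn.decomposition import PCA

def metagenes(X, k=5):
    """Project samples onto the top-k principal components.

    Selection then operates on linear combinations of genes,
    no longer on the original features.
    """
    pca = PCA(n_components=k)
    Z = pca.fit_transform(X)       # samples x k metagene activities
    loadings = pca.components_     # k x genes: each row defines a metagene
    return Z, loadings
```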
Feature Selection within Complete Validation Experimental Setups
Complete validation is needed to decouple model tuning from (ensemble) model accuracy estimation; otherwise selection bias effects arise.
[Diagram: the full dataset is split into n learning/test pairs (LS 1…LS n, TS 1…TS n); feature ranking and model fitting are repeated on each LS i, producing feature subset i and test error i on TS i; the final feature selection and error estimation aggregate over all n replicates.]
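A sketch of the loop in the diagram: feature ranking is recomputed inside every replicate, never on the full dataset, so the test errors are free of selection bias. The split counts, the classifier, and the hypothetical rank_features helper are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

def complete_validation(X, y, rank_features, n_splits=50, n_keep=20):
    """Estimate the error of 'select then classify' without selection bias.

    rank_features(X, y) is assumed to return feature indices,
    most important first (e.g. a reversed RFE elimination order).
    """
    errors, subsets = [], []
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.3,
                                      random_state=0)
    for learn, test in splitter.split(X, y):
        # Rank on the learning set ONLY (LS i in the diagram).
        top = rank_features(X[learn], y[learn])[:n_keep]
        model = SVC(kernel="linear").fit(X[learn][:, top], y[learn])
        errors.append(np.mean(model.predict(X[test][:, top]) != y[test]))
        subsets.append(top)
    return float(np.mean(errors)), subsets
```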
[Figure: dot plot of relative relevance (x-axis 0.020–0.040) for sensory attributes such as X-Firm, X-Roug, O-Buty, A-Rind, O-Ammo, X-Mois, V-Diam, A-Buty, V-Colo, O-Rind, among others.]
Relative relevance
Accumulating relative importance from Random Forest models for the identification of sensory drivers (with P. Granitto, IASMA).
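A sketch of the accumulation step; scikit-learn's impurity-based importances stand in for whatever relevance measure the actual study used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def accumulated_importance(X, y, n_models=50):
    """Average feature importances over replicated Random Forests."""
    total = np.zeros(X.shape[1])
    for seed in range(n_models):
        rf = RandomForestClassifier(n_estimators=500, random_state=seed)
        total += rf.fit(X, y).feature_importances_
    return total / n_models
```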