Correlation Aware Feature Selection
http://mpa.itc.it
Annalisa Barla, Cesare Furlanello, Giuseppe Jurman, Stefano Merler, Silvano Paoli
Berlin – 8/10/2005
Overview
On Feature Selection
Correlation Aware Ranking
Synthetic Example
Feature Selection
Step-wise variable selection: select n* < N effective variables modeling the classification function.
[Figure: elimination scheme over N steps (Step 1 … Step N), from one feature up to N features.]
Feature Selection
Step-wise selection of the features.
[Figure: at each step the features are split into ranked and discarded sets.]
Ranking
Classifier-independent filters (ignoring labelling)
Prefiltering is risky: you might discard features that turn out to be important.
Induced by a classifier
Support Vector Machines
The SVM minimizes the regularized functional
$$\min_{f \in \mathcal{H}} \sum_{i=1}^{l} V\left(y_i, f(x_i)\right) + \lambda \, \|f\|_K^2$$
Classification function:
$$f(x) = \sum_{i=1}^{l} \alpha_i \, K(x, x_i)$$
Optimal Separating Hyperplane (OSH):
$$w \cdot x + b = 0, \qquad w = \sum_i \alpha_i y_i x_i$$
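As a concrete check of the dual relation above, here is a minimal sketch with scikit-learn's linear SVC (the library and the toy data are our assumptions, not part of the deck): the primal weight vector w of the OSH is recovered from the dual coefficients.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: 20 samples, 5 features (hypothetical example).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 1, (10, 5)), rng.normal(-1, 1, (10, 5))])
y = np.array([1] * 10 + [-1] * 10)

svc = SVC(kernel="linear").fit(X, y)

# Dual expansion of the OSH weights: w = sum_i alpha_i * y_i * x_i
# (sklearn's dual_coef_ already stores the products alpha_i * y_i).
w = svc.dual_coef_ @ svc.support_vectors_
assert np.allclose(w, svc.coef_)
```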
The classification/ranking machine: the RFE idea. Given N features (genes):
Train an SVM
Compute a cost function J from the weight coefficients of the SVM
Rank the features by their contribution to J
Discard the feature contributing least to J
Reapply the procedure on the remaining N-1 features
This is called Recursive Feature Elimination (RFE).
Features are ranked according to their contribution to the classification, given the training data.
Computationally and data intensive, and at risk of selection bias.
Guyon et al. 2002
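A minimal sketch of the RFE loop, eliminating one feature per step with the squared weight w_j^2 as the cost criterion J, as in Guyon et al. (2002); the scikit-learn SVM is an illustrative stand-in for the original implementation.

```python
import numpy as np
from sklearn.svm import SVC

def rfe_rank(X, y, C=1.0):
    """Recursive Feature Elimination with a linear SVM.

    Returns feature indices in elimination order: the first
    index is the least important feature, the last the most.
    """
    remaining = list(range(X.shape[1]))
    eliminated = []
    while remaining:
        svc = SVC(kernel="linear", C=C).fit(X[:, remaining], y)
        w = svc.coef_.ravel()
        # Criterion J_j = w_j^2: discard the least contributing feature.
        worst = int(np.argmin(w ** 2))
        eliminated.append(remaining.pop(worst))
    return eliminated
```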
RFE-based Methods
Eliminating chunks of features at a time:
Parametric: Sqrt(N)-RFE, Bisection-RFE
Non-parametric: E-RFE (adapting to the weight distribution): thresholding weights at a value w* (sketched below)
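E-RFE changes only the elimination rule: all features whose weight falls below an adaptive threshold w* are dropped in a single step. A sketch of that inner step follows; the "fraction of the mean weight" rule for w* is purely illustrative, since the actual threshold adapts to the observed weight distribution.

```python
import numpy as np

def erfe_step(weights, frac=0.5):
    """One E-RFE elimination step: threshold |w_j| at w*.

    The choice of w* as a fraction of the mean weight is an
    illustrative assumption, not the deck's exact rule.
    """
    w = np.abs(weights)
    w_star = frac * w.mean()
    keep = np.flatnonzero(w >= w_star)
    drop = np.flatnonzero(w < w_star)
    return keep, drop, w_star
```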
Variable Elimination
Given a family F = {x_1, x_2, …, x_H} such that:
$\rho(x_i, x_j) > T$ for a given threshold T (correlated genes),
$w(x_1) \approx w(x_2) \approx \dots \approx \varepsilon < w^*$,
BUT $w(x_1) + w(x_2) + \dots \gg w^*$.
Each single weight is negligible, yet the cumulative weight of the family is not.
Correlated Genes (1)
Correlated Genes (2)
Synthetic Data
Binary problem: 100 (50 + 50) samples of 1000 genes:
genes 1–50: randomly extracted from N(1,1) and N(-1,1) for the two classes respectively
genes 51–100: a single gene extracted from N(1,1) and N(-1,1) respectively, repeated 50 times
genes 101–1000: extracted from Unif(-4,4)
[Figure: layout of the 100 × 1000 data matrix: genes 1–50 drawn from N(1,1)/N(-1,1), one feature repeated 50 times in columns 51–100, Unif(-4,4) noise in columns 101–1000; 51 significant features in total.]
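The dataset can be reproduced in a few lines of NumPy (a sketch; the deck's experiments were run through the R/C code of BioDCV):

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class = 50

# Genes 1-50: N(1,1) for class 1, N(-1,1) for class 2.
informative = np.vstack([rng.normal(1, 1, (n_per_class, 50)),
                         rng.normal(-1, 1, (n_per_class, 50))])

# Genes 51-100: one N(+1,1)/N(-1,1) gene repeated 50 times,
# i.e. 50 perfectly correlated copies.
g = np.concatenate([rng.normal(1, 1, n_per_class),
                    rng.normal(-1, 1, n_per_class)])
repeated = np.tile(g[:, None], (1, 50))

# Genes 101-1000: pure noise from Unif(-4,4).
noise = rng.uniform(-4, 4, (2 * n_per_class, 900))

X = np.hstack([informative, repeated, noise])   # 100 x 1000 matrix
y = np.array([1] * n_per_class + [-1] * n_per_class)
```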
Our algorithm
At each E-RFE step j, before discarding the low-weight elimination set E_j, detect families F_l of correlated features within E_j: if the cumulative weight of F_l exceeds w*, the family is saved from elimination; else only the representative g_l is saved.
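A sketch of this correction under stated assumptions: inside the elimination set, features are grouped into families by pairwise correlation above T, and a family whose cumulative weight exceeds w* is rescued from elimination. The greedy grouping below is an illustrative stand-in; clustering and gene-function grouping are listed as work in progress later in the deck.

```python
import numpy as np

def correlation_correction(X, weights, drop, w_star, T=0.9):
    """Rescue correlated families from the E-RFE elimination set.

    X       : (samples x features) data matrix
    weights : array of |w_j| for every feature
    drop    : indices E-RFE would discard (weights below w*)
    Returns the indices still to be discarded after the correction.
    """
    corr = np.abs(np.corrcoef(X[:, drop], rowvar=False))
    unassigned = set(range(len(drop)))
    really_drop = []
    while unassigned:
        i = unassigned.pop()
        # Greedy family F_l: unassigned features correlated with i above T.
        family = [i] + [j for j in unassigned if corr[i, j] > T]
        unassigned -= set(family)
        members = [drop[k] for k in family]
        if weights[members].sum() <= w_star:
            really_drop.extend(members)   # negligible family: discard it
        # else: cumulative weight exceeds w*, the whole family is saved
    return np.array(really_drop, dtype=int)
```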
Methodology
Implemented within the BioDCV system (50 replicates)
Realized through R/C code interaction
Synthetic Data
[Figure: feature ranks over the elimination steps (genes 1–1000); gene 100 is consistently ranked 2nd.]
Work in Progress
Preservation of highly correlated genes with low initial weights on microarray datasets
Robust correlation measures
Different techniques to detect the F_l families (clustering, gene functions)
Synthetic Data
Features discarded at each step by plain E-RFE (first table) and by E-RFE with the correlation correction (second table); the two differ only at step 9, where the correction saves feature 100.

Step   features 1-50   features 51-100   features >100
0      0               0                 282
1      0               0                 158
2      0               0                 67
3      0               0                 59
4      0               0                 167
5      0               0                 39
6      0               0                 31
7      0               0                 13
8      0               0                 9
9      0               50                6
10     0               0                 55
11     0               0                 32
12     0               0                 20
13     0               0                 13
14     0               0                 18
SAVED  50              0

Step   features 1-50   features 51-100   features >100
0      0               0                 282
1      0               0                 158
2      0               0                 67
3      0               0                 59
4      0               0                 167
5      0               0                 39
6      0               0                 31
7      0               0                 13
8      0               0                 9
9      0               49                6
10     0               0                 55
11     0               0                 32
12     0               0                 20
13     0               0                 13
14     0               0                 18
SAVED  50              0
Synthetic Data
Features discarded at step 9 by the E-RFE procedure: genes 51–100, plus 227, 559, 864, 470, 363, 735.
Correlation correction: saves feature 100.
INFRASTRUCTURE
• MPACluster -> available for batch jobs
• Connecting with IFOM -> 2005
• Running at IFOM -> 2005/2006
• Production on GRID resources (spring 2005)
Challenges
ALGORITHMS II
1. Gene list fusion: suite of algebraic/statistical methods
2. Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large-scale semi-supervised analysis
3. New SVM Kernels for prediction on spectrometry data within complete validation
Challenges for predictive profiling
A few issues in feature selection, with a particular interest in the classification of genomic data
WHY?
To ease the computational burden: discard the (apparently) less significant features and train in a simplified space, alleviating the curse of dimensionality.
To enhance information: highlight (and rank) the most important features and improve the knowledge of the underlying process.
HOW?
As a pre-processing step: employ a statistical filter (t-test, S2N); both are sketched below.
As a learning step: link the feature ranking to the classification task (wrapper methods, …).
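For concreteness, a sketch of the two filters with their standard definitions (the S2N here is the usual signal-to-noise ratio |mu_1 - mu_2| / (sd_1 + sd_2)):

```python
import numpy as np
from scipy import stats

def filter_scores(X, y, method="s2n"):
    """Label-aware filter scores; larger means more relevant."""
    A, B = X[y == 1], X[y == -1]
    if method == "s2n":
        # Signal-to-noise ratio, feature by feature.
        return np.abs(A.mean(0) - B.mean(0)) / (A.std(0) + B.std(0))
    # Two-sample t-test statistic, feature by feature.
    t, _ = stats.ttest_ind(A, B, axis=0)
    return np.abs(t)
```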
Prefiltering is risky: you might discard features that turn out to be important. Nevertheless, wrapper methods are quite costly. Moreover, gene expression data present peculiar situations, such as clones or highly correlated features, that may be a pitfall for several selection methods.
A classic alternative is to map the data into linear combinations of features, and then select.
Principal Component Analysis
Metagenes
(a simplified model for pathways, but biological interpretations require caution)
But then we are no longer working with the original features.
[Figure: eigen-craters for unexploded-bomb risk maps.]
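A minimal metagene construction via PCA (a sketch: "metagene" here simply means a principal component of the expression matrix, and any biological reading deserves the caution noted above):

```python
import numpy as np
from sklearn.decomposition import PCA

def metagenes(X, k=5):
    """Project samples onto the top-k principal components.

    Selection then operates on linear combinations of genes,
    no longer on the original features.
    """
    pca = PCA(n_components=k)
    Z = pca.fit_transform(X)       # samples x k metagene activities
    loadings = pca.components_     # k x genes: each row defines a metagene
    return Z, loadings
```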
Feature Selection within Complete Validation Experimental Setups
Complete validation is needed to decouple model tuning from (ensemble) model accuracy estimation; otherwise selection bias effects arise.
[Diagram: the full dataset is split into n learning/test pairs (LS 1…LS n, TS 1…TS n); feature ranking and model fitting are repeated on each LS i, producing feature subset i and test error i on TS i; the final feature selection and error estimation aggregate over all n replicates.]
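A sketch of the loop in the diagram: feature ranking is recomputed inside every replicate, never on the full dataset, so the test errors are free of selection bias. The split counts, the classifier, and the hypothetical rank_features helper are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

def complete_validation(X, y, rank_features, n_splits=50, n_keep=20):
    """Estimate the error of 'select then classify' without selection bias.

    rank_features(X, y) is assumed to return feature indices,
    most important first (e.g. a reversed RFE elimination order).
    """
    errors, subsets = [], []
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.3,
                                      random_state=0)
    for learn, test in splitter.split(X, y):
        # Rank on the learning set ONLY (LS i in the diagram).
        top = rank_features(X[learn], y[learn])[:n_keep]
        model = SVC(kernel="linear").fit(X[learn][:, top], y[learn])
        errors.append(np.mean(model.predict(X[test][:, top]) != y[test]))
        subsets.append(top)
    return float(np.mean(errors)), subsets
```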
[Figure: dot plot of relative relevance (x-axis 0.020–0.040) for sensory attributes such as X-Firm, X-Roug, O-Buty, A-Rind, O-Ammo, X-Mois, V-Diam, A-Buty, V-Colo, O-Rind, among others.]
Relative relevance
Accumulating relative importance from Random Forest models for the identification of sensory drivers (with P. Granitto, IASMA).
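A sketch of the accumulation step; scikit-learn's impurity-based importances stand in for whatever relevance measure the actual study used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def accumulated_importance(X, y, n_models=50):
    """Average feature importances over replicated Random Forests."""
    total = np.zeros(X.shape[1])
    for seed in range(n_models):
        rf = RandomForestClassifier(n_estimators=500, random_state=seed)
        total += rf.fit(X, y).feature_importances_
    return total / n_models
```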