
Page 1: An introduction to variable and feature selection

AN INTRODUCTION TO VARIABLE AND FEATURE SELECTION

Meoni Marco – UNIPI – March 30th 2016

Isabelle Guyon, Clopinet

André Elisseeff, Max Planck Institute for Biological Cybernetics

PhD course in Optimization for Machine Learning

Page 2: An introduction to variable and feature selection

Definition and Goal • Variable/Attribute/Dimension/Feature Selection/Reduction

•  “variables”: the raw input variables •  “features”: variables constructed from the input variables

•  Select a subset of features relevant to the learning algorithm •  Given a set of features, find a subset

that “maximizes the learner’s ability to classify patterns” •  Model simplification to make it easier to interpret by users •  Shorter training time to improve the learning algorithm’s performance •  Enhanced generalization to limit overfitting

$F = \{f_1, \dots, f_i, \dots, f_n\}, \quad F' \subseteq F$

Page 3: An introduction to variable and feature selection

Feature Selection in Biology • Monkey performing classification task

• Diagnostic features: eye separation and height • Non-Diagnostic features: mouth height, nose length


Page 4: An introduction to variable and feature selection

Feature Selection in Machine Learning •  Information about the target class is intrinsic to the variables •  More information does not necessarily mean more discriminative power •  Dimensionality and Performance

-  Required #samples grows exponentially with #variables -  Classifier’s performance degrades for a large number of features

Page 5: An introduction to variable and feature selection

Variable Ranking - Scoring • Order a set of features F by the value of a scoring function

S(fi) computed from the training data

• Select the k highest ranked features according to S (see the ranking sketch below) •  Computationally efficient: only requires computing and sorting n scores •  Statistically robust against overfitting, low variance

$F' = \{f_{i_1}, \dots, f_{i_j}, \dots, f_{i_n}\}$ with $S(f_{i_j}) \geq S(f_{i_{j+1}})$ for $j = 1, \dots, n-1$
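A minimal sketch of this ranking procedure (not part of the original slides), assuming NumPy; `score_fn` is a placeholder for any of the scoring criteria on the following slides (correlation, single-variable classifier, mutual information):

```python
import numpy as np

def rank_features(X, y, score_fn, k):
    """Rank the n columns of X by S(f_i) = score_fn(X[:, i], y) and
    return the indices of the k highest-ranked features."""
    scores = np.array([score_fn(X[:, i], y) for i in range(X.shape[1])])
    order = np.argsort(scores)[::-1]   # descending: S(f_i1) >= S(f_i2) >= ...
    return order[:k]
```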

Page 6: An introduction to variable and feature selection

Variable Ranking - Correlation • Criterion to detect linear dependency between features and target

•  Pearson correlation coefficient

•  Estimate for m samples:

•  Higher correlation means higher score

•  mostly R(xi,y)² or |R(xi,y)| is used in practice

Pearson correlation coefficient:

$R(f_i, y) = \dfrac{\operatorname{cov}(f_i, y)}{\sqrt{\operatorname{var}(f_i)\,\operatorname{var}(y)}}$

Estimate for m samples:

$R(f_i, y) = \dfrac{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)^2 \,\sum_{k=1}^{m} (y_k - \bar{y})^2}}$

$R(X_i, Y) \in [-1, 1]$
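As a hedged illustration (not from the slides), the sample estimate above could be computed as follows and plugged into the generic ranking sketch as `score_fn`:

```python
import numpy as np

def pearson_score(f, y):
    """Estimate R(f_i, y) over m samples and return R(f_i, y)^2,
    the variant most often used for ranking."""
    f_c, y_c = f - f.mean(), y - y.mean()
    r = (f_c * y_c).sum() / np.sqrt((f_c ** 2).sum() * (y_c ** 2).sum())
    return r ** 2
```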

Page 7: An introduction to variable and feature selection

Variable Ranking – Single Var Classifier • Select variables according to individual predictive power • Performance of a classifier built with 1 variable

•  e.g. the value of the variable itself (set threshold on the values) •  usually measured in terms of error rate (or criteria using fpr, fnr, …)
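A possible sketch of this criterion (an assumption, not shown in the slides): score each variable by the best accuracy of a threshold classifier on its values, so a higher score corresponds to a lower error rate:

```python
import numpy as np

def threshold_score(f, y):
    """Best training accuracy of a single-variable threshold classifier
    (y is assumed to be binary, coded as 0/1)."""
    best = 0.0
    for t in np.unique(f):
        pred = (f >= t).astype(int)
        # try both polarities of the decision; accuracy = 1 - error rate
        acc = max((pred == y).mean(), ((1 - pred) == y).mean())
        best = max(best, acc)
    return best
```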

Page 8: An introduction to variable and feature selection

Variable Ranking – Mutual Information • Empirical estimates of mutual information between each feature and the target: •  For discrete variables, probabilities are estimated from frequency counts:

$I(x_i, y) = \int_{x_i} \int_{y} p(x_i, y) \, \log \dfrac{p(x_i, y)}{p(x_i)\, p(y)} \, dx \, dy$

Discrete case:

$I(x_i, y) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y) \, \log \dfrac{P(X = x_i, Y = y)}{P(X = x_i)\, P(Y = y)}$
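For the discrete case, a minimal sketch (assuming NumPy) that estimates the probabilities from frequency counts exactly as in the formula above:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical I(x_i, y) for discrete x and y, with probabilities
    estimated from frequency counts."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi
```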

Page 9: An introduction to variable and feature selection

Questions • Correlation between a variable and the target is not enough to assess relevance • Do not discard variables with a small (or seemingly redundant) score •  Low-score variables can be useful in combination with others

Page 10: An introduction to variable and feature selection

Feature Subset Selection • Requirements:

•  Scoring function to assess the optimality of a feature subset •  Strategy to search the space of possible feature subsets

•  finding the optimal feature subset for arbitrary target is NP-hard

• Methods: •  Filters •  Wrappers •  Embedded

Page 11: An introduction to variable and feature selection

Feature Subset Selection - Filters •  Select subsets of variables as a pre-processing step,

independently of the classifier used •  Variable ranking with a score function is a filter method

•  Fast •  Generic selection of features, not optimized for the classifier used •  Sometimes used as a pre-processing step for other methods

Page 12: An introduction to variable and feature selection

Feature Subset Selection - Wrappers •  Score feature subsets based on learner predictive power •  Heuristic search strategies:

•  Forward selection: start with empty set and add features at each step •  Backward elimination: start with full set and discard features at each step

•  Predictive power measured on a validation set or by cross-validation •  Pro: treating the learner as a black box makes wrappers simple •  Cons: requires a large amount of computation and carries a risk of overfitting (a forward-selection sketch follows below)
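A minimal sketch of the forward-selection wrapper described above, assuming scikit-learn (not mentioned in the slides); the learner is treated as a black box and its predictive power is measured by cross-validation:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, learner, k):
    """Start with the empty set; at each step add the feature whose
    inclusion gives the best cross-validated score of the learner."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        candidates = [(cross_val_score(learner, X[:, selected + [j]], y, cv=5).mean(), j)
                      for j in remaining]
        _, best_j = max(candidates)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

Backward elimination would run the same loop starting from the full set and removing, at each step, the feature whose deletion hurts the cross-validated score the least.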

Page 13: An introduction to variable and feature selection

Feature Subset Selection - Embedded •  Performs feature selection during training •  Nested Subset Methods

•  Guide the search by predicting the change in the objective function when moving in the space of variable subsets:
1.  Finite-difference method: differences are calculated without retraining new models for each candidate variable
2.  Quadratic approximation of the cost function: used for backward elimination of variables
3.  Sensitivity of the objective function: used to devise a forward selection procedure

•  Direct Objective Optimization •  Formalize the objective function of variable selection and optimize

1.  the goodness-of-fit (to be maximized) 2.  the number of variables (to be minimized); a hedged sketch follows below
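The slide does not name a concrete formulation; one common embedded instance (an assumption here, not stated in the slides) replaces the count of variables with an L1 penalty, so fitting the model and selecting variables happen in a single optimization. A minimal sketch with scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only: 5 of 20 variables are informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Assumption: L1-regularized logistic regression as one instance of direct
# objective optimization -- the fit term is the goodness-of-fit, the L1 term
# acts as a convex surrogate for the number of variables.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_[0] != 0.0)   # variables kept during training
print(selected)
```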

Page 14: An introduction to variable and feature selection

Feature Selection - Summary •  Feature selection can increase performance of learning algos

•  Both accuracy and computation time, but it is not easy •  Ranking criteria for features

• Don’t automatically discard variables with small scores •  Filters, Wrappers, Embedded Methods

•  How to search the space of all feature subsets? •  How to assess the performance of a learner that uses a given feature subset?

Page 15: An introduction to variable and feature selection

Feature Extraction •  Feature Selection:

•  Feature Construction

Feature selection:  $\{f_1, \dots, f_i, \dots, f_n\} \xrightarrow{\ \text{f. selection}\ } \{f_{i_1}, \dots, f_{i_j}, \dots, f_{i_m}\}$

Feature construction:  $\{f_1, \dots, f_i, \dots, f_n\} \xrightarrow{\ \text{f. extraction}\ } \{g_1(f_1, \dots, f_n), \dots, g_j(f_1, \dots, f_n), \dots, g_m(f_1, \dots, f_n)\}$

Page 16: An introduction to variable and feature selection

Feature Construction • Goal: reduce data dimensionality • Methods

•  Clustering: replace a group of “similar” variables by a cluster centroid (K-means, Hierarchical clustering)

•  Linear transform of input variables (PCA/SVD, LDA); a PCA sketch follows below •  Matrix factorization of variable subsets
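A short sketch of the linear-transform option, assuming scikit-learn (an assumption, the slides name no library); PCA constructs m new features g_j(f_1, ..., f_n) along the directions of largest variance:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=200, n_features=20, random_state=0)  # toy data

pca = PCA(n_components=5)             # keep the 5 directions of largest variance
X_constructed = pca.fit_transform(X)  # constructed features g_1(f), ..., g_5(f)
print(X_constructed.shape)            # (200, 5)
```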

Page 17: An introduction to variable and feature selection

Validation Methods •  Issues on Generalization Prediction and Model Selection

•  Determine the number of variables that are “significant” •  Guide and halt the search for good variable subsets •  Choose hyper-parameters •  Evaluate the final performance of the system

•  Model Selection •  Compare training errors with statistical tests (Rivals & Personaz 2003) •  Estimate generalization error confidence intervals (Bengio & Chapados

2003) •  Choose what fraction of the data to split (leave-one-out cross-

validation, Monari & Dreyfus 2000)
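As one hedged illustration of model selection (the slide lists several alternatives), plain k-fold cross-validation can be used to decide how many top-ranked variables are “significant”:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)

# Filter ranking: squared Pearson correlation between each variable and the target.
scores = np.array([np.corrcoef(X[:, i], y)[0, 1] ** 2 for i in range(X.shape[1])])
order = np.argsort(scores)[::-1]

# Model selection: pick the number of variables with the best cross-validated score.
for k in (1, 2, 5, 10, 20, 30):
    cv = cross_val_score(LogisticRegression(max_iter=1000), X[:, order[:k]], y, cv=5).mean()
    print(k, round(cv, 3))
```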

Page 18: An introduction to variable and feature selection

Advanced Topics & Open Problems • Variance of Variable Subset Selection

•  Methods sensitive to perturbations of the experimental conditions •  Variance is often the symptom of a model that does not generalize

• Variable Ranking in the Context of Others •  Ranking a subset may infer different criteria than a single variable

•  Forward vs Backward •  Depending on applications

Page 19: An introduction to variable and feature selection

Advanced Topics & Open Problems • Multi-class Problem

•  Some variable selection methods handle the multi-class setting directly rather than decomposing it into several two-class problems

•  Methods based on mutual information criteria extend to this case •  Inverse Problems

•  Reverse engineering: find the reasons behind the results of a predictor •  E.g. identify factors that triggered a disease

•  Key issue: distinction between correlation and causality •  Method: use variables discarded by variable selection as additional outputs of a neural network

Page 20: An introduction to variable and feature selection

THANK YOU!