
Page 1: An introduction to variable and feature selection

AN INTRODUCTION TO VARIABLE AND FEATURE SELECTION

Meoni Marco – UNIPI – March 30th 2016

Isabelle Guyon, Clopinet

André Elisseeff, Max Planck Institute for Biological Cybernetics

PhD course in Optimization for Machine Learning

Page 2: An introduction to variable and feature selection

Definition and Goal • Variable/Attribute/Dimension/Feature Selection/Reduction

•  “variables”: the raw input variables •  “features”: variables constructed from the input variables

•  Select a subset of features relevant to the learning algorithm •  Given a set of features, find a subset

that “maximizes the learner’s ability to classify patterns” •  Model simplification to make it easier to interpret by users •  Shorter training time to improve the learning algorithm’s performance •  Enhanced generalization to limit overfitting

$F = \{f_1, \dots, f_i, \dots, f_n\}, \quad F' \subseteq F$

Page 3: An introduction to variable and feature selection

Feature Selection in Biology • Monkey performing classification task

• Diagnostic features: eye separation and height • Non-Diagnostic features: mouth height, nose length


Page 4: An introduction to variable and feature selection

Feature Selection in Machine Learning •  Information about the target class is intrinsic to the variables •  More information does not necessarily mean more discriminative power •  Dimensionality and Performance

-  Required #samples grows exponentially with #variables -  Classifier’s performance degrades for a large number of features

Page 5: An introduction to variable and feature selection

Variable Ranking - Scoring • Order a set of features F by the value of a scoring function

S(fi) computed from the training data

• Select the k highest ranked features according to S (see the ranking sketch below) •  Computationally efficient: only requires computing and sorting n scores •  Statistically robust against overfitting, low variance

$F' = \{f_{i_1}, \dots, f_{i_j}, \dots, f_{i_n}\}$ with $S(f_{i_j}) \geq S(f_{i_{j+1}})$ for $j = 1, \dots, n-1$
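A minimal sketch of this ranking procedure (not part of the original slides), assuming NumPy; `score_fn` is a placeholder for any of the scoring criteria on the following slides (correlation, single-variable classifier, mutual information):

```python
import numpy as np

def rank_features(X, y, score_fn, k):
    """Rank the n columns of X by S(f_i) = score_fn(X[:, i], y) and
    return the indices of the k highest-ranked features."""
    scores = np.array([score_fn(X[:, i], y) for i in range(X.shape[1])])
    order = np.argsort(scores)[::-1]   # descending: S(f_i1) >= S(f_i2) >= ...
    return order[:k]
```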

Page 6: An introduction to variable and feature selection

Variable Ranking - Correlation • Criterion to detect linear dependency between features and target

•  Pearson correlation coefficient

•  Estimate for m samples:

•  Higher correlation means higher score

•  mostly R(xi,y)² or |R(xi,y)| is used in practice

Pearson correlation coefficient:

$R(f_i, y) = \dfrac{\operatorname{cov}(f_i, y)}{\sqrt{\operatorname{var}(f_i)\,\operatorname{var}(y)}}$

Estimate for m samples:

$R(f_i, y) = \dfrac{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)^2 \,\sum_{k=1}^{m} (y_k - \bar{y})^2}}$

$R(X_i, Y) \in [-1, 1]$
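As a hedged illustration (not from the slides), the sample estimate above could be computed as follows and plugged into the generic ranking sketch as `score_fn`:

```python
import numpy as np

def pearson_score(f, y):
    """Estimate R(f_i, y) over m samples and return R(f_i, y)^2,
    the variant most often used for ranking."""
    f_c, y_c = f - f.mean(), y - y.mean()
    r = (f_c * y_c).sum() / np.sqrt((f_c ** 2).sum() * (y_c ** 2).sum())
    return r ** 2
```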

Page 7: An introduction to variable and feature selection

Variable Ranking – Single Var Classifier • Select variables according to individual predictive power • Performance of a classifier built with 1 variable

•  e.g. the value of the variable itself (set threshold on the values) •  usually measured in terms of error rate (or criteria using fpr, fnr, …)
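A possible sketch of this criterion (an assumption, not shown in the slides): score each variable by the best accuracy of a threshold classifier on its values, so a higher score corresponds to a lower error rate:

```python
import numpy as np

def threshold_score(f, y):
    """Best training accuracy of a single-variable threshold classifier
    (y is assumed to be binary, coded as 0/1)."""
    best = 0.0
    for t in np.unique(f):
        pred = (f >= t).astype(int)
        # try both polarities of the decision; accuracy = 1 - error rate
        acc = max((pred == y).mean(), ((1 - pred) == y).mean())
        best = max(best, acc)
    return best
```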

Page 8: An introduction to variable and feature selection

Variable Ranking – Mutual Information • Empirical estimates of mutual information between each feature and the target: •  For discrete variables, probabilities are estimated from frequency counts:

$I(x_i, y) = \int_{x_i} \int_{y} p(x_i, y) \, \log \dfrac{p(x_i, y)}{p(x_i)\, p(y)} \, dx \, dy$

Discrete case:

$I(x_i, y) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y) \, \log \dfrac{P(X = x_i, Y = y)}{P(X = x_i)\, P(Y = y)}$
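For the discrete case, a minimal sketch (assuming NumPy) that estimates the probabilities from frequency counts exactly as in the formula above:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical I(x_i, y) for discrete x and y, with probabilities
    estimated from frequency counts."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi
```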

Page 9: An introduction to variable and feature selection

Questions • Correlation between a variable and the target is not enough to assess relevance • Do not discard variables with a small (or seemingly redundant) score •  Low-score variables can be useful in combination with others

Page 10: An introduction to variable and feature selection

Feature Subset Selection • Requirements:

•  Scoring function to assess the optimality of a feature subset •  Strategy to search the space of possible feature subsets

•  finding the optimal feature subset for arbitrary target is NP-hard

• Methods: •  Filters •  Wrappers •  Embedded

Page 11: An introduction to variable and feature selection

Feature Subset Selection - Filters •  Select subsets of variables as a pre-processing step,

independently of the classifier used •  Variable ranking with a score function is a filter method

•  Fast •  Generic selection of features, not optimized for the classifier used •  Sometimes used as a pre-processing step for other methods

Page 12: An introduction to variable and feature selection

Feature Subset Selection - Wrappers •  Score feature subsets based on learner predictive power •  Heuristic search strategies:

•  Forward selection: start with empty set and add features at each step •  Backward elimination: start with full set and discard features at each step

•  Predictive power measured on a validation set or by cross-validation •  Pro: treating the learner as a black box makes wrappers simple •  Cons: requires a large amount of computation and carries a risk of overfitting (a forward-selection sketch follows below)
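A minimal sketch of the forward-selection wrapper described above, assuming scikit-learn (not mentioned in the slides); the learner is treated as a black box and its predictive power is measured by cross-validation:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, learner, k):
    """Start with the empty set; at each step add the feature whose
    inclusion gives the best cross-validated score of the learner."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        candidates = [(cross_val_score(learner, X[:, selected + [j]], y, cv=5).mean(), j)
                      for j in remaining]
        _, best_j = max(candidates)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

Backward elimination would run the same loop starting from the full set and removing, at each step, the feature whose deletion hurts the cross-validated score the least.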

Page 13: An introduction to variable and feature selection

Feature Subset Selection - Embedded •  Performs feature selection during training •  Nested Subset Methods

•  Guide the search by predicting the change in the objective function when moving in the space of variable subsets:
1.  Finite-difference method: differences are calculated without retraining new models for each candidate variable
2.  Quadratic approximation of the cost function: used for backward elimination of variables
3.  Sensitivity of the objective function: used to devise a forward selection procedure

•  Direct Objective Optimization •  Formalize the objective function of variable selection and optimize

1.  the goodness-of-fit (to be maximized) 2.  the number of variables (to be minimized); a hedged sketch follows below
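The slide does not name a concrete formulation; one common embedded instance (an assumption here, not stated in the slides) replaces the count of variables with an L1 penalty, so fitting the model and selecting variables happen in a single optimization. A minimal sketch with scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only: 5 of 20 variables are informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Assumption: L1-regularized logistic regression as one instance of direct
# objective optimization -- the fit term is the goodness-of-fit, the L1 term
# acts as a convex surrogate for the number of variables.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_[0] != 0.0)   # variables kept during training
print(selected)
```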

Page 14: An introduction to variable and feature selection

Feature Selection - Summary •  Feature selection can increase performance of learning algos

•  Both accuracy and computation time, but it is not easy •  Ranking criteria for features

• Don’t automatically discard variables with small scores •  Filters, Wrappers, Embedded Methods

•  How to search the space of all feature subsets? •  How to assess the performance of a learner that uses a given feature subset?

Page 15: An introduction to variable and feature selection

Feature Extraction •  Feature Selection:

•  Feature Construction

Feature selection:  $\{f_1, \dots, f_i, \dots, f_n\} \xrightarrow{\ \text{f. selection}\ } \{f_{i_1}, \dots, f_{i_j}, \dots, f_{i_m}\}$

Feature construction:  $\{f_1, \dots, f_i, \dots, f_n\} \xrightarrow{\ \text{f. extraction}\ } \{g_1(f_1, \dots, f_n), \dots, g_j(f_1, \dots, f_n), \dots, g_m(f_1, \dots, f_n)\}$

Page 16: An introduction to variable and feature selection

Feature Construction • Goal: reduce data dimensionality • Methods

•  Clustering: replace a group of “similar” variables by a cluster centroid (K-means, Hierarchical clustering)

•  Linear transform of input variables (PCA/SVD, LDA); a PCA sketch follows below •  Matrix factorization of variable subsets
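A short sketch of the linear-transform option, assuming scikit-learn (an assumption, the slides name no library); PCA constructs m new features g_j(f_1, ..., f_n) along the directions of largest variance:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=200, n_features=20, random_state=0)  # toy data

pca = PCA(n_components=5)             # keep the 5 directions of largest variance
X_constructed = pca.fit_transform(X)  # constructed features g_1(f), ..., g_5(f)
print(X_constructed.shape)            # (200, 5)
```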

Page 17: An introduction to variable and feature selection

Validation Methods •  Issues on Generalization Prediction and Model Selection

•  Determine the number of variables that are “significant” •  Guide and halt the search for good variable subsets •  Choose hyper-parameters •  Evaluate the final performance of the system

•  Model Selection •  Compare training errors with statistical tests (Rivals & Personaz 2003) •  Estimate generalization error confidence intervals (Bengio & Chapados

2003) •  Choose what fraction of the data to split (leave-one-out cross-

validation, Monari & Dreyfus 2000)
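As one hedged illustration of model selection (the slide lists several alternatives), plain k-fold cross-validation can be used to decide how many top-ranked variables are “significant”:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)

# Filter ranking: squared Pearson correlation between each variable and the target.
scores = np.array([np.corrcoef(X[:, i], y)[0, 1] ** 2 for i in range(X.shape[1])])
order = np.argsort(scores)[::-1]

# Model selection: pick the number of variables with the best cross-validated score.
for k in (1, 2, 5, 10, 20, 30):
    cv = cross_val_score(LogisticRegression(max_iter=1000), X[:, order[:k]], y, cv=5).mean()
    print(k, round(cv, 3))
```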

Page 18: An introduction to variable and feature selection

Advanced Topics & Open Problems • Variance of Variable Subset Selection

•  Methods sensitive to perturbations of the experimental conditions •  Variance is often the symptom of a model that does not generalize

• Variable Ranking in the Context of Others •  Ranking a subset may infer different criteria than a single variable

•  Forward vs Backward •  Depending on applications

Page 19: An introduction to variable and feature selection

Advanced Topics & Open Problems • Multi-class Problem

•  Some variable selection methods handle the multi-class setting directly rather than decomposing it into several two-class problems

•  Methods based on mutual information criteria extend to this case •  Inverse Problems

•  Reverse engineering: find the reasons behind the results of a predictor •  E.g. identify factors that triggered a disease

•  Key issue: distinction between correlation and causality •  Method: use variables discarded by variable selection as additional outputs of a neural network

Page 20: An introduction to variable and feature selection

THANK YOU!