
Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi


Page 1: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Feature Selection Methods
Part-I
By: Dr. Rajeev Srivastava
IIT(BHU), Varanasi

Page 2: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Introduction

A feature is defined as a function of one or more measurements, each of which specifies some quantifiable property of an image, and is computed such that it quantifies some significant characteristic of the object.

Feature selection is the process of selecting a subset of relevant features for use in model construction.

The features removed should be useless, redundant, or of the least possible use.

The goal of feature selection is to find the subset of features that produces the best target detection and recognition performance and requires the least computational effort.

Page 3: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Reasons for Feature Selection
• Feature selection is important to target detection and recognition systems mainly for three reasons:
• First, using more features can increase system complexity, yet it may not always lead to higher detection/recognition accuracy. Many features may be available to a detection/recognition system; these features are not independent and may be correlated. A bad feature may greatly degrade the performance of the system, so selecting a subset of good features is important.
• Second, selecting many features means that a complicated model is used to approximate the training data. According to the minimum description length principle (MDLP), a simple model is better than a complex one.
• Third, using fewer features reduces the computational cost, which is important for real-time applications. It may also lead to better classification accuracy because of the finite sample size effect.
• Feature selection techniques provide three main benefits when constructing predictive models:
  • Improved model interpretability
  • Shorter computation times
  • Enhanced generalisation by reducing overfitting

Page 4: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Advantages of Feature Selection
• It reduces the dimensionality of the feature space, which limits storage requirements and increases algorithm speed.
• It removes redundant, irrelevant or noisy data.
• The immediate effect for data analysis tasks is speeding up the running time of the learning algorithms.
• Improving the data quality.
• Increasing the accuracy of the resulting model.
• Feature set reduction, to save resources in the next round of data collection or during utilization.
• Performance improvement, to gain in predictive accuracy.
• Data understanding, to gain knowledge about the process that generated the data, or simply to visualize the data.

Page 5: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Taxonomy of Feature Selection

[Figure: taxonomy of feature selection methods. Annotations on the diagram: "Statistical pattern recognition"; "Produce same subset on a given problem every time".]

Page 6: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Feature Selection Approaches
• There are two approaches to feature selection:
• Forward Selection: Start with no variables and add them one by one, at each step adding the one that decreases the error the most, until any further addition does not significantly decrease the error.
• Backward Selection: Start with all the variables and remove them one by one, at each step removing the one that decreases the error the most (or increases it only slightly), until any further removal increases the error significantly.
• To reduce overfitting, the error referred to above is the error on a validation set that is distinct from the training set (a minimal sketch of forward selection with such a stopping rule follows below).
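As a concrete illustration (not part of the original slides), the sketch below implements forward selection with a validation-based stopping rule. The classifier (scikit-learn's LogisticRegression), the iris data, the tolerance tol, and the helper name forward_selection are all illustrative assumptions, not the author's prescribed choices.

# Illustrative sketch: forward selection driven by the error on a held-out validation set.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def forward_selection(X_train, y_train, X_val, y_val, tol=1e-3):
    n_features = X_train.shape[1]
    selected, best_err = [], np.inf
    while True:
        best_feat, best_new_err = None, np.inf
        for f in range(n_features):
            if f in selected:
                continue
            cols = selected + [f]
            clf = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
            err = 1.0 - clf.score(X_val[:, cols], y_val)   # validation error, not training error
            if err < best_new_err:
                best_feat, best_new_err = f, err
        # stop when no remaining feature decreases the validation error significantly
        if best_feat is None or best_err - best_new_err < tol:
            return selected
        selected.append(best_feat)
        best_err = best_new_err

X, y = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
print("selected features:", forward_selection(X_tr, y_tr, X_va, y_va))

Backward selection would mirror this sketch, starting from the full feature list and removing one feature per step.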

Page 7: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Schemes for Feature Selection
• The relationship between a feature selection algorithm (FSA) and the inducer chosen to evaluate the usefulness of the feature selection process can take three main forms: filter, wrapper and embedded methods.
• Filter Methods: These methods select features based on discriminating criteria that are relatively independent of classification.
• The minimum redundancy–maximum relevance (mRMR) method is an example of a filter method. It supplements the maximum-relevance criterion with a minimum-redundancy criterion, choosing additional features that are maximally dissimilar to those already identified.
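For concreteness, here is a minimal sketch of a generic univariate filter (not mRMR itself, which is sketched with the mRMR slide later): each feature is scored by its mutual information with the class label and the top k are kept. scikit-learn's mutual_info_classif and the iris data are illustrative assumptions; the cutoff k is exactly the arbitrary user choice that filters require.

# Filter-style ranking: score each feature by mutual information with the class label.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
relevance = mutual_info_classif(X, y, random_state=0)   # relevance of each feature to the class
top_k = np.argsort(relevance)[::-1][:2]                 # keep the 2 highest-scoring features
print("MI scores:", np.round(relevance, 3), "selected:", top_k)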

Page 8: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

• Wrapper Methods : These methods evaluate candidate feature subsets by the predictive performance of the classifier (the inducer) itself, typically estimated by cross-validation, so the selection is "wrapped around" the learning algorithm.

• Embedded Methods : The inducer has its own FSA (either explicit or implicit). Methods that induce logical conjunctions provide one example of this embedding; traditional machine learning tools such as decision trees or artificial neural networks also fall into this scheme.
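A minimal sketch of an embedded scheme, assuming a scikit-learn decision tree as the inducer (an illustrative choice, not the author's): the tree ranks the features while it is trained, and SelectFromModel keeps those whose importance exceeds the default (mean) threshold.

# Embedded selection sketch: the tree's feature_importances_ drive the selection.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
selector = SelectFromModel(DecisionTreeClassifier(random_state=0)).fit(X, y)
print("kept features:", selector.get_support(indices=True))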

Page 9: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Filters vs Wrappers
• Filters:
• Fast execution (+): Filters generally involve a non-iterative computation on the dataset, which can execute much faster than a classifier training session.
• Generality (+): Since filters evaluate the intrinsic properties of the data, rather than their interactions with a particular classifier, their results exhibit more generality: the solution will be “good” for a larger family of classifiers.
• Tendency to select large subsets (-): Since filter objective functions are generally monotonic, the filter tends to select the full feature set as the optimal solution. This forces the user to select an arbitrary cutoff on the number of features to be selected.
• Wrappers:
• Accuracy (+): Wrappers generally achieve better recognition rates than filters, since they are tuned to the specific interactions between the classifier and the dataset.
• Ability to generalize (+): Wrappers have a mechanism to avoid overfitting, since they typically use cross-validation measures of predictive accuracy.
• Slow execution (-): Since the wrapper must train a classifier for each feature subset (or several classifiers if cross-validation is used), the method can become infeasible for computationally intensive classifiers.
• Lack of generality (-): The solution lacks generality since it is tied to the bias of the classifier used in the evaluation function. The “optimal” feature subset will be specific to the classifier under consideration.

Page 10: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Sequential methods

• Begin with a single solution (feature subset) and iteratively add or remove features until some termination criterion is met.
• Bottom-up (forward method): begin with an empty set and add features.
• Top-down (backward method): begin with a full set and delete features.
• These “greedy” methods do not examine all possible subsets, so there is no guarantee of finding the optimal subset.

Naïve method

• Sort the given d features in order of their probability of correct recognition.
• Select the top m features from this sorted list (see the sketch below).
• Disadvantage: feature correlation is not considered; the best pair of features may not even contain the best individual feature.
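A minimal sketch of this naïve ranking, assuming scikit-learn and the iris data (illustrative choices): each feature is scored by its individual cross-validated accuracy and the top m are kept; correlation between features is ignored, which is exactly the disadvantage noted above.

# Naïve ranking sketch: rank features by their individual predictive accuracy.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
m = 2
scores = [cross_val_score(LogisticRegression(max_iter=1000), X[:, [i]], y, cv=5).mean()
          for i in range(X.shape[1])]
top_m = np.argsort(scores)[::-1][:m]
print("individual accuracies:", np.round(scores, 3), "top", m, "features:", top_m)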

Page 11: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Sequential Forward Selection (SFS)
1. Start with the empty set Y0 = ∅
2. Select the next best feature x+ = argmax J(Yk + x) over the features x not in Yk
3. Update Yk+1 = Yk + x+; k = k + 1
4. Go to step 2
• SFS performs best when the optimal subset is small. When the search is near the empty set, a large number of states can potentially be evaluated.
• Towards the full set, the region examined by SFS is narrower, since most features have already been selected.
• The search space is drawn like an ellipse to emphasize that there are fewer states towards the full or empty sets.
• Disadvantage: once a feature is retained, it cannot be discarded (the nesting problem).
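A minimal SFS sketch under an assumed criterion J (5-fold cross-validated accuracy of a 3-nearest-neighbour classifier on the iris data); the criterion, the dataset and the stopping size n_select are illustrative, not the author's prescription.

# SFS sketch: greedily add the feature that maximises the criterion J.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def J(subset):
    # illustrative criterion: 5-fold CV accuracy of a 3-NN classifier on the chosen columns
    return cross_val_score(KNeighborsClassifier(n_neighbors=3),
                           X[:, list(subset)], y, cv=5).mean()

def sfs(n_features, n_select):
    Y = []                                                   # Y0 = empty set
    while len(Y) < n_select:
        candidates = [f for f in range(n_features) if f not in Y]
        best = max(candidates, key=lambda f: J(Y + [f]))     # next best feature x+
        Y.append(best)                                       # Y(k+1) = Yk + x+
    return Y

print("SFS selection:", sfs(X.shape[1], 2))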

Page 12: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Sequential Backward Selection (SBS)
1. Start with the full set Y0 = X
2. Remove the worst feature x- = argmax J(Yk - x) over the features x in Yk
3. Update Yk+1 = Yk - x-; k = k + 1
4. Go to step 2
• Sequential backward elimination works in the opposite direction of SFS.
• SBS works best when the optimal feature subset is large, since SBS spends most of its time visiting large subsets.
• The main limitation of SBS is its inability to re-evaluate the usefulness of a feature after it has been discarded.
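The corresponding SBS sketch, under the same assumed criterion J as in the SFS sketch above: start from the full set and greedily drop the feature whose removal hurts J the least.

# SBS sketch: greedily remove the feature whose removal leaves the best criterion value.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def J(subset):
    return cross_val_score(KNeighborsClassifier(n_neighbors=3),
                           X[:, list(subset)], y, cv=5).mean()

def sbs(n_features, n_select):
    Y = list(range(n_features))                              # Y0 = full set
    while len(Y) > n_select:
        worst = max(Y, key=lambda f: J([g for g in Y if g != f]))   # x-: least useful feature
        Y.remove(worst)                                      # Y(k+1) = Yk - x-
    return Y

print("SBS selection:", sbs(X.shape[1], 2))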

Page 13: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Generalized Sequential Forward Selection
• Start with the empty set, X = ∅
• Repeatedly add the most significant m-subset of (Y - X), found through exhaustive search over the m-subsets (see the sketch below)

Generalized Sequential Backward Selection
• Start with the full set, X = Y
• Repeatedly delete the least significant m-subset of X, found through exhaustive search over the m-subsets
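A minimal sketch of the forward (GSFS) variant under the same assumed criterion J as before: at each step every m-subset of the remaining features is evaluated exhaustively and the best one is added; the backward variant mirrors it with deletion.

# Generalised SFS sketch: add the most significant m-subset at each step.
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def J(subset):
    return cross_val_score(KNeighborsClassifier(n_neighbors=3),
                           X[:, list(subset)], y, cv=5).mean()

def generalized_sfs(n_features, n_select, m=2):
    Y = []
    while len(Y) < n_select:
        remaining = [f for f in range(n_features) if f not in Y]
        best = max(combinations(remaining, min(m, len(remaining))),
                   key=lambda block: J(Y + list(block)))     # most significant m-subset
        Y.extend(best)
    return Y

print("GSFS selection:", generalized_sfs(X.shape[1], 2))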

Page 14: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Bidirectional Search (BDS)
• BDS is a parallel implementation of SFS and SBS:
• SFS is performed from the empty set
• SBS is performed from the full set
• To guarantee that SFS and SBS converge to the same solution:
• Features already selected by SFS are not removed by SBS
• Features already removed by SBS are not selected by SFS
Algorithm:
1. Start SFS with YF = ∅ and SBS with YB = X
2. Select the best feature x+; update YF(k+1) = YFk + x+; k = k + 1
3. Remove the worst feature x-; update YB(k+1) = YBk - x-; k = k + 1
4. Repeat steps 2-3 until YF = YB

Page 15: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Sequential Floating Selection (SFFS & SFBS)
• There are two floating methods:
• Sequential floating forward selection (SFFS) starts from the empty set. After each forward step, SFFS performs backward steps as long as the objective function increases.
• Sequential floating backward selection (SFBS) starts from the full set. After each backward step, SFBS performs forward steps as long as the objective function increases.
• SFFS algorithm (J(.) is the criterion function):
1. Start with the empty set Y0 = ∅
2. Select the best feature x+; update Yk+1 = Yk + x+; k = k + 1
3. Select the worst feature x- (the one whose removal degrades J the least)
4. If J(Yk - x-) > J(Yk), then Yk+1 = Yk - x-; k = k + 1; go to step 3
   Else go to step 2
(Some book-keeping is needed to avoid an infinite loop.)
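A minimal SFFS sketch under the same assumed criterion J as before. As the book-keeping against infinite loops, a backward step is accepted only if it beats the best subset of that size found so far, a standard choice that is slightly stricter than the condition stated on the slide.

# SFFS sketch: forward steps followed by conditional backward steps.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def J(subset):
    return cross_val_score(KNeighborsClassifier(n_neighbors=3),
                           X[:, list(subset)], y, cv=5).mean()

def sffs(n_features, n_select):
    Y, best = [], {}                      # best[k]: best J value seen for a subset of size k
    while len(Y) < n_select:
        # inclusion step: add the single best feature
        remaining = [f for f in range(n_features) if f not in Y]
        x_plus = max(remaining, key=lambda f: J(Y + [f]))
        Y = Y + [x_plus]
        best[len(Y)] = max(best.get(len(Y), -1.0), J(Y))
        # conditional exclusion steps: backtrack while it yields a genuinely better smaller subset
        while len(Y) > 2:
            worst = max(Y, key=lambda f: J([g for g in Y if g != f]))
            reduced = [g for g in Y if g != worst]
            if J(reduced) > best[len(reduced)]:
                Y, best[len(reduced)] = reduced, J(reduced)
            else:
                break
    return Y

print("SFFS selection:", sffs(X.shape[1], 3))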

Page 16: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Genetic Algorithm Feature Selection
• In a GA approach, a given feature subset is represented as a binary string, a "chromosome" of length n, with a zero or one in position i denoting the absence or presence of feature i in the set (n = total number of available features).
• A population of chromosomes is maintained.
• Each chromosome is evaluated using an evaluation function to determine its "fitness", which determines how likely the chromosome is to survive and breed into the next generation.
• New chromosomes are created from old chromosomes by two processes:
• Crossover, where parts of two different parent chromosomes are mixed to create offspring
• Mutation, where the bits of a single parent are randomly perturbed to create a child

• Choosing an appropriate evaluation function is an essential step for successful application of GAs to any problem domain
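A toy GA sketch of this scheme, assuming scikit-learn and the iris data: chromosomes are binary masks over the features, the evaluation function is cross-validated 3-NN accuracy, parent selection is fitness-proportional, and crossover/mutation are single-point and bit-flip respectively. Population size, generation count and mutation rate are arbitrary illustrative values, not the author's.

# Toy GA sketch for feature selection: evolve binary masks over the feature set.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n, pop_size, generations, p_mut = X.shape[1], 12, 15, 0.1   # arbitrary illustrative settings

def fitness(mask):
    # fitness of a chromosome = CV accuracy of a 3-NN classifier on the selected columns
    if not mask.any():
        return 0.0                                           # empty subsets get the worst fitness
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=5).mean()

pop = rng.integers(0, 2, size=(pop_size, n))                 # initial random population of bit strings
for _ in range(generations):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[rng.choice(pop_size, size=pop_size, p=scores / scores.sum())]  # fitness-proportional selection
    children = []
    while len(children) < pop_size:
        a, b = parents[rng.integers(pop_size)], parents[rng.integers(pop_size)]
        cut = rng.integers(1, n)                             # single-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n) < p_mut                         # mutation: random bit flips
        children.append(np.where(flip, 1 - child, child))
    pop = np.array(children)

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best))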

Page 17: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

Minimum Redundancy Maximum Relevance (mRMR) Feature Selection

• This approach is based on recognizing that combinations of individually good variables do not necessarily lead to good classification.

• To maximize the joint dependency of the top-ranking variables on the target variable, the redundancy among them must be reduced; so we select maximally relevant variables while avoiding redundant ones.

• First, the mutual information (MI) between the candidate variable and the target variable is calculated (the relevance term).

• Then the average MI between the candidate variable and the variables that are already selected is computed (the redundancy term).

• The entropy-based mRMR score (the higher it is for a feature, the more that feature is needed) is obtained by subtracting the redundancy from the relevance.

• Both relevance and redundancy estimation are low-dimensional problems (each involves only two variables). This is much easier than directly estimating multivariate density or mutual information in the high-dimensional space.

• mRMR only measures the quantity of redundancy between the candidate variables and the selected variables; it does not deal with the type of this redundancy.
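A minimal mRMR sketch, assuming scikit-learn's pairwise mutual-information estimators (mutual_info_classif for the relevance term, mutual_info_regression for the redundancy term) and using the difference form relevance - redundancy; the dataset and the number of features to select are illustrative.

# mRMR sketch: greedily pick the feature with the highest (relevance - redundancy) score.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_iris(return_X_y=True)
n_select = 2
relevance = mutual_info_classif(X, y, random_state=0)        # MI(feature; class) for every feature

selected = [int(np.argmax(relevance))]                        # start with the most relevant feature
while len(selected) < n_select:
    best_f, best_score = None, -np.inf
    for f in range(X.shape[1]):
        if f in selected:
            continue
        # redundancy: average MI between the candidate and the already-selected features
        redundancy = np.mean([mutual_info_regression(X[:, [f]], X[:, s], random_state=0)[0]
                              for s in selected])
        score = relevance[f] - redundancy                     # mRMR score, difference form
        if score > best_score:
            best_f, best_score = f, score
    selected.append(best_f)

print("mRMR selection:", selected)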

Page 18: Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

References

L. Ladha, "Feature Selection Methods and Algorithms", Department of Computer Science, Sri Ramakrishna College of Arts and Science for Women, Coimbatore, Tamil Nadu, India.

Anil Jain and Douglas Zongker, "Feature Selection: Evaluation, Application and Small Sample Performance", Michigan State University, USA.

Olcay Kursun, C. Okan Sakar and Oleg Favorov, "Using Covariates for Improving the Minimum Redundancy Maximum Relevance Feature Selection Method".