Graph-based Iterative Hybrid Feature Selection
Erheng Zhong† Sihong Xie† Wei Fan‡ Jiangtao Ren† Jing Peng# Kun Zhang$
†Sun Yat-sen University  ‡IBM T. J. Watson Research Center
#Montclair State University  $Xavier University of Louisiana
Where we are
Supervised feature selection
Unsupervised feature selection
Semi-supervised feature selection
Hybrid: supervised selection to include key features, then improve with a semi-supervised approach
Supervised Feature Selection
Sufficient labeled data -> feature selection -> effective features
Insufficient labeled data (the sample selection bias problem) -> feature selection -> ineffective features
With biased labeled data, only feature 2 will be selected, even though feature 1 is also useful!
Toy example (1)
Labeled data: A(1,1,1,1; red), B(1,-1,1,-1; blue). Unlabeled data: C(0,1,1,1; red), D(0,-1,1,1; red).
Both features 2 and 4 are correlated with the class based on A and B, so both are selected by supervised feature selection.
Semi-supervised Feature Selection
A few labeled data + many distinct unlabeled data -> feature selection -> effective features
Many unlabeled data, but indistinctive -> feature selection -> ineffective features
Toy example (2)
A semi-supervised approach, spectral-based feature selection: features are ranked according to the smoothness between data points and the consistency with label information.
Feature 2 will be selected if only one feature is desired.
Underlying assumption: instances that are closer should be in the same cluster.
[Figure: the clustering induced by the selected feature violates the label information.]
Solution: Hybrid
Labeled data insufficient -> sample selection bias -> supervised feature selection fails
Unlabeled data indistinct -> data from different classes are not separated -> semi-supervised feature selection fails
Both have disadvantages; how to address them? Combine!
Hybrid Feature Selection [IteraGraph_FS]
A few labeled data -> supervised feature selection -> most critical features -> better distance measure
Better distance measure -> many unlabeled data become distinct! -> semi-supervised feature selection -> effective features -> good distance measure
The two stages iterate: each round's selected features refine the distance measure for the next round.
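A minimal sketch of the iterative loop, assuming hypothetical helpers supervised_select, propagate_labels, and semi_supervised_select (the names and signatures are ours, not the paper's):

```python
import numpy as np

def itera_graph_fs(X_lab, y_lab, X_unlab, n_iters=5, s=0.1):
    """Sketch of the hybrid loop. supervised_select, propagate_labels and
    semi_supervised_select are hypothetical helpers standing in for any
    concrete supervised selector, label-propagation step and
    semi-supervised selector."""
    feats = supervised_select(X_lab, y_lab)          # most critical features
    for _ in range(n_iters):
        # propagate labels using distances in the selected subspace
        conf, y_pred = propagate_labels(X_lab[:, feats], y_lab,
                                        X_unlab[:, feats])
        # keep only the top s% most confident predictions
        top = np.argsort(conf)[-max(1, int(s * len(conf))):]
        X_new = np.vstack([X_lab, X_unlab[top]])
        y_new = np.concatenate([y_lab, y_pred[top]])
        feats = semi_supervised_select(X_new, y_new)  # refine the feature set
    return feats
```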
Toy example (3)
Feature 2 & 4 are selected based on A and B using a supervised approach
A(1,1;Red) B(-1,-1;Blue)C(1,1;Red) D(-1,1;Red)
Dimension Reduction
Prediction
A(1,1;Red) B(-1,-1;Blue)C(1,1;Red) D(-1,1;Red)
Feature Selection
Only feature 4 is useful
Supervised feature selection
Semi-supervised feature selection
Properties of feature selection
The distance between any two examples is approximately the same under a high-dimensional feature space. [Theorem 3.1]
Feature selection can obtain a more distinguishable distance measure, which leads to a better confidence estimate. [Theorem 3.2]
Theorems 3.1 and 3.2
3.1: As dimensionality increases, the nearest neighbor approaches the farthest neighbor.
3.2: A more distinguishable similarity measure yields a better classification confidence matrix.
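Theorem 3.1 mirrors the well-known distance-concentration effect. The quick simulation below (uniform random data, our choice of setup) shows the nearest-to-farthest distance ratio approaching 1 as dimensionality grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000, 10000):
    X = rng.random((500, d))                  # 500 uniform points in [0,1]^d
    q = rng.random(d)                         # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    print(f"d={d:>5}: nearest/farthest = {dist.min() / dist.max():.3f}")
# The ratio climbs toward 1: the nearest neighbor approaches the farthest.
```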
[Figure: a nearest-neighbor example over four points. Before feature selection, all pairwise distances equal 2, so the confidences for unlabeled points 2 and 4 are uninformative: 2: (0.5 vs 0.5), 4: (0.5 vs 0.5). After feature selection the distances become distinguishable (1, 1, 2, 2, 3), and the confidences become 2: (0.67 vs 0.33) and 4: (0.33 vs 0.67).]
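One plausible way to reproduce the figure's numbers is inverse-distance neighbor weighting (our assumption about how the confidences are computed):

```python
def confidence(d_red, d_blue):
    """Class confidence from inverse-distance weights of the two neighbors.
    (Assumed weighting scheme, chosen to match the figure's numbers.)"""
    w_red, w_blue = 1.0 / d_red, 1.0 / d_blue
    z = w_red + w_blue
    return round(w_red / z, 2), round(w_blue / z, 2)

print(confidence(2, 2))  # (0.5, 0.5)   -- equal distances carry no signal
print(confidence(1, 2))  # (0.67, 0.33) -- distinguishable distances do
```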
Semi-supervised Feature Selection
Graph-based [label propagation]: expand the labeled set by adding unlabeled data, together with their predicted labels, whose prediction confidence is in the top s%.
Then perform feature selection on the new labeled set.
[Figure: label propagation grows the labeled set from 4 to 6 labeled points.]
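A compact sketch of this expansion step, using the standard iterative propagation F <- alpha*S*F + (1-alpha)*Y of Zhou et al. (our choice; the paper's propagation rule may differ):

```python
import numpy as np

def propagate_and_expand(W, y, s=0.2, alpha=0.9, n_steps=50):
    """W: (n, n) affinity matrix; y: int labels in {0, 1}, -1 if unlabeled.
    Returns y with the top s% most confident unlabeled points filled in."""
    n = len(y)
    d = W.sum(axis=1)
    S = W / (np.sqrt(np.outer(d, d)) + 1e-12)  # symmetric normalization
    Y = np.zeros((n, 2))
    Y[y == 0, 0] = 1.0
    Y[y == 1, 1] = 1.0
    F = Y.copy()
    for _ in range(n_steps):                   # F <- alpha*S*F + (1-alpha)*Y
        F = alpha * (S @ F) + (1 - alpha) * Y
    P = F / (F.sum(axis=1, keepdims=True) + 1e-12)  # rows -> confidences
    unlab = np.where(y == -1)[0]
    top = unlab[np.argsort(P[unlab].max(axis=1))[-max(1, int(s * len(unlab))):]]
    y_new = y.copy()
    y_new[top] = P[top].argmax(axis=1)         # adopt high-confidence labels
    return y_new
```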
Confidence and Margin (Lemma 3.2)
[Figure: under a bad distance measure, a point's near hit and near miss are similarly far away, giving low confidence; under a better distance measure they are well separated, giving high confidence.]
A larger margin can be achieved via distance manipulation.
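Lemma 3.2's intuition can be expressed with the Relief-style hypothesis margin (our framing, not necessarily the paper's exact definition): the gap between a point's distance to its near miss and to its near hit:

```python
import numpy as np

def hypothesis_margin(X, y, i):
    """Margin of point i: distance to its near miss (closest other-class
    point) minus distance to its near hit (closest same-class point).
    A larger margin means a more confident classification of i."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                        # exclude the point itself
    near_hit = d[y == y[i]].min()
    near_miss = d[y != y[i]].min()
    return near_miss - near_hit
```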
Selection Strategy Comparison (Theorem 3.3)
Random selection: low average confidence, small margin.
Our confidence-based strategy: high average confidence, larger margin (by Lemma 3.2).
Experiment setup
Data sets: handwritten digit recognition, biomedical and gene expression data, text documents [Reuters-21578].
Compared approaches: supervised feature selection (SFFS); semi-supervised approach (sSelect) [SDM07].
Data Set Description
Feature Quality Study
Conclusions
Labeled information: yields the critical features and better confidence estimates.
Unlabeled data: improve the chosen feature set.
Flexible: can incorporate many feature selection methods that aim at revealing the relationship between data points.