Global Mutual Information Based Feature Selection By Quantum Annealing. Kotaro Tanahashi*, Shinichi Takayanagi*, Tomomitsu Motohashi*, Shu Tanaka✝ (* Recruit Communications Co., Ltd.; ✝ Waseda University, JST PRESTO). Apr. 11, 2018


Global Mutual Information Based Feature Selection By Quantum Annealing

Kotaro Tanahashi*, Shinichi Takayanagi*, Tomomitsu Motohashi*, Shu Tanaka✝ * Recruit Communications Co.,Ltd. ✝ Waseda University, JST PRESTO

Apr. 11, 2018


(C) Recruit Communications Co., Ltd.

Introduction of Recruit

We provide various kinds of online services from job search to hotel reservations across the world.

• Automobile
• Education
• Life & Local O2O
• Travel
• Beauty
• Housing
• Bridal & Baby
• Human Resources
• IT & Trends Media
• Dining

Icons: www.flaticon.com


Introduction of Recruit

[Diagram: Internet users and clients, connected through Recruit's services]

• We help users to find the best clients through our services.
• Data science plays an important role in the business.


Data Science at Recruit

Recruit has hosted two data mining competitions on Kaggle (www.kaggle.com):

• Recruit Restaurant Visitor Forecasting (2018)
• Coupon Purchase Prediction (2015)

Kaggle, KDD Cup: international competitions of data mining

Engineers at Recruit (as of March 2018)

We are passionate about data science. Some of us came in 1st and 2nd place in KDD Cup 2015 (www.kdd.org/kdd-cup).


Feature Selection: A Key Technique

Feature selection is essential in predictive analysis

• A key technique to win data mining competitions
• Find the most relevant features
• Balance the bias-variance trade-off

Benefits

✔ Improve prediction
✔ Reduce computational cost

[Figure: a data matrix whose rows are users (User 1 ... User n) and whose columns are features; only a subset of the features is relevant]

Reference: “Beating Kaggle the easy way”, https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf


Types of Feature Selection (FS) Algorithms

Wrapper methods: Iteratively evaluate a feature subset with a black-box learning algorithm

[Flow: Set of all features → Generate a subset ⇄ Learning Algorithm → Selecting the best subset → Performance]

Embedded methods: Train a model and select features at the same time

[Flow: Set of all features → Generate a subset ⇄ Learning Algorithm + Performance → Selecting the best subset]

Filter methods: Features are selected by some criterion such as Mutual Information

✔ Independent of learning algorithms
✔ Can be used as a pre-processing step

[Flow: Set of all features → Selecting the best subset → Learning Algorithm → Performance]

Filter methods are useful as a pre-processing step since they do not depend on the predictive model.


What is Mutual Information (MI)?

Mutual Information I(X;Y) is a measure of the mutual dependence between two random variables X and Y. It is defined in terms of Shannon entropy.

✔ MI can capture non-linear relationships, unlike Pearson’s correlation coefficient

[Figure: three scatter plots with (Pearson r = 0.8, MI = 0.5), (Pearson r = 0.0, MI = 0.7), and (Pearson r = 0.0, MI = 0.1). Figures are retrieved from http://minepy.readthedocs.io/en/latest/python.html]

• High I(X;Y): able to predict Y given X
• Low I(X;Y): hard to predict Y given X
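As a minimal sketch of the idea above, MI between two discrete variables can be estimated from samples with the standard identity I(X;Y) = H(X) + H(Y) − H(X,Y), where H is Shannon entropy. The function names and the toy data are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(V) in bits, estimated from a list of discrete samples."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Toy data (made up): y = x^2 is fully determined by x,
# yet the covariance of x and y is zero, so Pearson's r is 0.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
mi_dependent = mutual_information(xs, ys)

# Independent toy variables: every (a, b) pair occurs exactly once.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
mi_independent = mutual_information(a, b)

print(mi_dependent, mi_independent)
```

The first pair is a small instance of the "Pearson r = 0.0 but MI > 0" situation: MI is positive although there is no linear correlation.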


Mutual Information based Feature Selection (MIFS)

MIFS: using Mutual Information as a criterion in filter methods

General formulation of MIFS: MIFS selects a feature subset $S$ of size $k$ which maximizes the Mutual Information (MI) between the selected features $X_S$ and the target variable $Y$:

$\max_{S,\ |S| = k} I(X_S; Y)$

Unfortunately, the exact calculation of $I(X_S; Y)$ is intractable…
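One way to get a feel for the difficulty (an illustration under our own assumptions, not the argument on the slide): an exhaustive search would have to score every size-k subset, and for binary features the joint distribution of k features has up to 2^k states to estimate. The helper names below are hypothetical; n = 122 is the feature count of the a1a dataset used later in the deck.

```python
import math

def num_subsets(n, k):
    """Number of size-k feature subsets an exhaustive search would evaluate."""
    return math.comb(n, k)

def num_joint_states(k):
    """Upper bound on joint states of k binary features (assumes binary features)."""
    return 2 ** k

n_features = 122
for k in (5, 10, 20, 40):
    print(k, num_subsets(n_features, k), num_joint_states(k))
```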


Heuristic MIFS Algorithms

Some heuristic MIFS algorithms have been developed. However, these methods are greedy approximations [2].

Max Relevance method: select the most relevant feature iteratively (repeat k times).

Min Redundancy & Max Relevance method (MRMR) [1]: select the most relevant and least redundant feature iteratively (repeat k times).

[1] H. Peng et al., 2005 [2] J. R. Vergara & P. A. Estévez, 2015
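The two greedy heuristics can be sketched as follows. This is a minimal illustration assuming discrete features and the common "relevance minus mean redundancy" form of the MRMR score; the exact update rules shown on the slide were not preserved, and the toy data is made up.

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mi(xs, ys):
    """I(X;Y) estimated from paired discrete samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def max_relevance(columns, target, k):
    """Repeat k times: add the unselected feature with the highest I(X_i;Y)."""
    selected = []
    for _ in range(k):
        rest = [i for i in range(len(columns)) if i not in selected]
        selected.append(max(rest, key=lambda i: mi(columns[i], target)))
    return selected

def mrmr(columns, target, k):
    """Repeat k times: add the feature maximizing relevance minus mean redundancy."""
    selected = []
    for _ in range(k):
        rest = [i for i in range(len(columns)) if i not in selected]

        def score(i):
            if not selected:
                return mi(columns[i], target)
            redundancy = sum(mi(columns[i], columns[j]) for j in selected)
            return mi(columns[i], target) - redundancy / len(selected)

        selected.append(max(rest, key=score))
    return selected

# Toy data (made up): feature 1 duplicates feature 0; feature 2 is weaker but not redundant.
y = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 1, 0, 1],
]
picked_max_rel = max_relevance(columns, y, 2)
picked_mrmr = mrmr(columns, y, 2)
print(picked_max_rel, picked_mrmr)
```

On this toy data Max Relevance keeps the duplicated feature, while MRMR skips it in favor of the less redundant one. Both remain greedy: a choice made in an early step is never revisited.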


Our Contributions

(1) We reformulate MIFS as a QUBO (Quadratic Unconstrained Binary Optimization) problem.

[Diagram: MIFS optimization → (how?) → QUBO formulation of MIFS]

(2) We confirmed that optimization by D-Wave performs well in MIFS.

[Figure: MI increase (%) w.r.t. Linear versus #features (5, 6, 7, 8, 10, 15, 20, 25, 30, 40); higher is better]

D-Wave image is retrieved from https://www.dwavesys.com/resources/media-resources


Reformulation of MIFS by QUBO (1)

Expand the MI term

Proof.

Theorem 1.1: Chain theorem (chain rule) for Conditional Mutual Information

Using Theorem 1.1, an equation holds for every i ∈ S. Averaging that equation over all i ∈ S gives the expansion of the MI term.
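The equations on this slide were not preserved. A plausible reconstruction, assuming the standard chain rule for mutual information and the notation X_S for the selected features, Y for the target, and k = |S| (our notation, which may differ from the slide), is:

```latex
% Chain rule, applied to a subset S and any i \in S
I(X_S; Y) = I(X_i; Y) + I(X_{S \setminus i}; Y \mid X_i)

% Averaging over all i \in S, with |S| = k
I(X_S; Y) = \frac{1}{k} \sum_{i \in S}
    \left[ I(X_i; Y) + I(X_{S \setminus i}; Y \mid X_i) \right]
```

The second line is exact: it is the first line summed over the k choices of i and divided by k.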


Reformulation of MIFS by QUBO (2)

Approximate under the assumption of Conditional Independence (CI)

Proof. If we assume conditional independence, we can obtain the approximation.
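The approximation itself was not preserved. A hedged reconstruction, assuming that the features in S \ i are conditionally independent given X_i and also given (X_i, Y), which is one common form of the CI assumption and may not match the slide exactly, is:

```latex
% Under the assumed conditional independence, the conditional MI term decomposes
I(X_{S \setminus i}; Y \mid X_i) \approx \sum_{j \in S \setminus i} I(X_j; Y \mid X_i)

% Substituting into the averaged chain-rule expansion, with |S| = k
I(X_S; Y) \approx \frac{1}{k} \sum_{i \in S}
    \Big[ I(X_i; Y) + \sum_{j \in S \setminus i} I(X_j; Y \mid X_i) \Big]
```

The result contains only single-feature and pairwise terms, which is what makes a quadratic (QUBO) formulation possible.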


Reformulation of MIFS by QUBO (3)

Optimization of MIFS → QUBO formulation of MIFS

The QUBO objective has two parts:

• MI term
• Penalty for selecting only k features (α: penalty strength)

MIFS can be optimized by Ising annealing machines
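A minimal sketch of such a QUBO, assuming the energy E(x) = −Σ_i I(X_i;Y) x_i − Σ_{i≠j} I(X_j;Y|X_i) x_i x_j + α(Σ_i x_i − k)², to be minimized over binary x. The exact coefficients and scaling on the slide were not preserved, so this form, the function names, and the toy MI values are all assumptions. A brute-force search stands in for the annealer.

```python
from itertools import product

def build_qubo(relevance, cond_mi, k, alpha):
    """Return {(i, j): weight} such that E(x) = sum_ij Q_ij x_i x_j.

    relevance[i]  stands for I(X_i;Y); cond_mi[i][j] stands for I(X_j;Y|X_i).
    The penalty alpha * (sum_i x_i - k)^2 is expanded using x_i^2 = x_i;
    its constant alpha * k^2 is dropped because it does not change the minimizer.
    """
    n = len(relevance)
    Q = {}
    for i in range(n):
        Q[(i, i)] = -relevance[i] + alpha * (1 - 2 * k)
        for j in range(n):
            if i != j:
                Q[(i, j)] = -cond_mi[i][j] + alpha
    return Q

def energy(Q, x):
    return sum(w * x[i] * x[j] for (i, j), w in Q.items())

def brute_force_minimum(Q, n):
    """Exhaustive search over all 2^n bit strings (only feasible for tiny n)."""
    return min(product((0, 1), repeat=n), key=lambda x: energy(Q, x))

# Made-up MI values for 4 features.
relevance = [0.5, 0.4, 0.1, 0.3]
cond_mi = [
    [0.00, 0.05, 0.10, 0.30],
    [0.05, 0.00, 0.10, 0.20],
    [0.02, 0.03, 0.00, 0.05],
    [0.25, 0.15, 0.08, 0.00],
]
Q = build_qubo(relevance, cond_mi, k=2, alpha=10.0)
best = brute_force_minimum(Q, n=4)
print(best)
```

With α larger than any achievable MI gain, the minimizer selects exactly k features. In this toy case it picks features 0 and 3, whereas ranking by relevance alone would pick features 0 and 1.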


Interpretation of the Derived Formulation

Heuristic methods such as Max Relevance or MRMR are included in the derived formulation.

Expanding the derived formulation gives three kinds of terms: Relevance, Redundancy, and Complementarity.

• Increase: Relevance, Complementarity
• Reduce: Redundancy
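The expansion on the slide was not preserved. One standard identity that yields exactly these three kinds of terms is shown below; mapping the slide's labels onto it is our assumption.

```latex
I(X_j; Y \mid X_i)
  = \underbrace{I(X_j; Y)}_{\text{relevance}}
  - \underbrace{I(X_i; X_j)}_{\text{redundancy}}
  + \underbrace{I(X_i; X_j \mid Y)}_{\text{complementarity}}
```

Keeping only the first term corresponds to a Max Relevance style score, and keeping the first two corresponds to an MRMR style score, which is consistent with the statement that those heuristics are included in the derived formulation.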


Comparison of Optimization Methods

We compared several optimization methods for two types of formulations (BQP, QUBO).

Problem Formulation               Optimization Methods
Binary Quadratic Problem (BQP)    Linear Relaxation [1] (Linear)
                                  Truncated Power [1,2] (TPower)
QUBO                              Tabu Search by qbsolv [3]
                                  D-Wave 2000Q

[1] H. Venkateswara, et al., 2015 [2] X. T. Yuan & T. Zhang, 2013 [3] https://github.com/dwavesystems/qbsolv


Linear Relaxation Method (Linear)

[1] H. Venkateswara, et al., 2015

The computation of Linear is fast and the solution is tightly bounded.

Linearize the quadratic term by introducing new variables.

One of the optimality conditions is [equation not preserved], which leads to [equation not preserved].

Since Qij ≧ 0, the solution of this problem is given by the k features with the largest column sums of Q. This solution is tightly bounded [1]. Time complexity is O(nk).
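The final selection rule stated above (take the k largest column sums of Q) can be sketched as follows. The matrix layout, the function name, and the toy values are assumptions; the linearization steps themselves are not reproduced here.

```python
def linear_select(Q, k):
    """Select the k features whose columns of Q have the largest sums.

    Q is a list of rows with non-negative entries (Q[i][j] >= 0).
    """
    n = len(Q)
    col_sums = [sum(Q[i][j] for i in range(n)) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: col_sums[j], reverse=True)
    return sorted(ranked[:k])

# Made-up non-negative matrix for 4 features.
Q = [
    [0.50, 0.05, 0.10, 0.30],
    [0.05, 0.40, 0.10, 0.20],
    [0.02, 0.03, 0.10, 0.05],
    [0.25, 0.15, 0.08, 0.30],
]
selected = linear_select(Q, k=2)
print(selected)
```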


Truncated Power Method (TPower)

TPower is known to be the state-of-the-art method for BQP problems.

Finding the largest k-sparse eigenvector of Q is defined as

$\max_{x} x^\top Q x \quad \text{subject to} \quad \|x\|_2 = 1,\ \|x\|_0 \le k$

We select the i-th feature if $x_i > 0$. The eigenvector is calculated by an iterative procedure [1] that is repeated T times.

This method is confirmed to be the best-performing method for BQP problems with a non-negative matrix [2]. The time complexity of the algorithm is $O(Tn^2)$.

[1] X. T. Yuan & T. Zhang, 2013 [2] H. Venkateswara, et al., 2015
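The iteration was not preserved on the slide. A minimal sketch of a truncated power iteration in the spirit of [1] is given below, assuming a symmetric non-negative Q, a uniform starting vector, and truncation to the k entries of largest magnitude; these details and the toy matrix are our assumptions.

```python
import math

def tpower(Q, k, iterations=50):
    """Truncated power iteration: multiply by Q, keep the k largest entries, normalize."""
    n = len(Q)
    x = [1.0 / math.sqrt(n)] * n  # assumed initialization
    for _ in range(iterations):
        y = [sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)]
        keep = set(sorted(range(n), key=lambda i: abs(y[i]), reverse=True)[:k])
        y = [y[i] if i in keep else 0.0 for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            break
        x = [v / norm for v in y]
    # Select the i-th feature if x_i > 0.
    return sorted(i for i in range(n) if x[i] > 0)

# Made-up symmetric non-negative matrix for 4 features.
Q = [
    [0.500, 0.050, 0.060, 0.275],
    [0.050, 0.400, 0.065, 0.175],
    [0.060, 0.065, 0.100, 0.065],
    [0.275, 0.175, 0.065, 0.300],
]
selected = tpower(Q, k=2)
print(selected)
```

Each iteration costs one matrix-vector product, O(n²), which matches the stated O(Tn²) complexity for T iterations.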


Optimization by D-Wave Machine

We used the D-Wave machine with the following settings:

• Machine: D-Wave 2000Q
• Embedding: 64 bit full connection
• Annealing Time: 20 µs
• Annealing Repetitions: 10

[Figure: Full Connection Embedding for C(4,4,4)]

When the feature size n is larger than the hardware size h (= 64), we use Linear as a pre-processing step to narrow down the candidate features to h.
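The pre-processing step can be sketched as below: rank features by column sum (the Linear rule), keep the top h, and pass only the corresponding sub-matrix to the annealer. The function name and toy matrix are hypothetical, and h = 3 is used only to keep the example small (the slide uses h = 64).

```python
def reduce_to_hardware(Q, h):
    """Keep the h features with the largest column sums and return (indices, sub-matrix)."""
    n = len(Q)
    if n <= h:
        return list(range(n)), [row[:] for row in Q]
    col_sums = [sum(Q[i][j] for i in range(n)) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: col_sums[j], reverse=True)
    kept = sorted(ranked[:h])
    sub = [[Q[i][j] for j in kept] for i in kept]
    return kept, sub

# Made-up symmetric non-negative matrix for 4 features.
Q = [
    [0.500, 0.050, 0.060, 0.275],
    [0.050, 0.400, 0.065, 0.175],
    [0.060, 0.065, 0.100, 0.065],
    [0.275, 0.175, 0.065, 0.300],
]
kept, sub = reduce_to_hardware(Q, h=3)
print(kept)
```

Indices returned by the solver on the sub-matrix refer to positions in `kept`, so they need to be mapped back to the original feature indices.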


Comparison of Mutual Information Score

We compared the MI scores of each optimization method on a public dataset. The increases relative to Linear are shown in the graph.

Data Name: a1a, #features: 122, #data points: 8000

[Figure: Mutual Information Score. y-axis: MI increase (%) w.r.t. Linear; x-axis: #features (5, 6, 7, 8, 10, 15, 20, 25, 30, 40); higher is better]

D-Wave obtained the best MI scores among the compared methods.


Classification Accuracy

We calculated the classification accuracy for different #features. Accuracy is a good measure to evaluate the quality of a selected subset of features.

[Diagram: Original features → Selected k-features → Measure the classification accuracy by random forest classifiers]


Classification Accuracy

We evaluated each method by classification accuracy for different #features.

Data Name: a1a, #features: 122, #data points: 8000

[Figure: Classification accuracy (y-axis, 0.70 to 0.78) versus #features (x-axis, 5 to 40) for D-Wave, TPower, Tabu (qbsolv), and Linear; higher is better]

The accuracies of D-Wave are better when #features is small.


Summary

• We derived the QUBO formulation of MIFS so that the problem can be embedded in Ising machines

• We used the D-Wave quantum annealing machine as a solver in MIFS

• The optimization method by D-Wave outperformed TPower which is the state-of-the-art optimization method for BQP

• We are planning to use MIFS by D-Wave in Kaggle!


Thank you for listening


Runtime of Optimizations

Data Name: a1a, #features: 122, #data points: 8000

Method           Average Runtime
Linear           9.0 msec
TPower           26.1 msec
Tabu (qbsolv)    14.3 sec
D-Wave           9.0 msec (Linear) + 100 μsec (annealing)


Comparison to MRMR, Max Rel.

[Figure: Classification accuracy (y-axis, 0.70 to 0.78) versus #features (x-axis, 5 to 40) for D-Wave, MRMR, and Max Rel.]

Data Name: a1a, #features: 122, #data points: 8000