Between Science and Machine Learning: Design, Transformation, Selection, Interactions, and Linearity of Features
Ichigaku Takigawa (Hokkaido University / JST PRESTO), [email protected]
2016.11.18 @ the 19th Workshop on Information-Based Induction Sciences (IBIS2016)


  • 1/36  (Title slide) Ichigaku Takigawa, Hokkaido University / JST PRESTO, [email protected], 2016.11.18 @ the 19th IBIS Workshop (IBIS2016)

  • 2/36  (Lab: http://art.ist.hokudai.ac.jp) 1974 Turing Award Lecture, "Computer Programming as an Art" (Don Knuth): Science, Engineering, and Art. https://doi.org/10.1145/361604.361612

  • 3/36  Example task: PubChem BioAssay (PC3 cell line), active compounds: 1,737 / inactive: 26,895. https://pubchem.ncbi.nlm.nih.gov/bioassay/41

    Predicting activity from chemical structure (structure-activity relationships / QSAR)

  • 4/36  Representing a molecule: Atom + Bond tables (SYBYL MOL2), pharmacophores. [Figure: Imatinib (CID 5291), conformers 1, 2, 3, ..., 284 in PubChem3D]

    Molecular Graph Representations (Implicit Hydrogen)

    Molecular Descriptors (variations: constitutional, topological, atom pairs, geometrical, electronic, thermodynamical, physicochemical, WHIM, fingerprints, RDF, autocorrelations, functional groups, structural keys, properties, interaction fields, ...)

    Physicochemical properties such as lipophilicity (LogP) and HOMO/LUMO energies

  • 5/36  [Ramakrishnan+ 2014] Sci Data. 2014 Aug 5;1:140022, "Quantum chemistry structures and properties of 134 kilo molecules."

    133,885 molecules built from C, O, N, F (up to 9 heavy atoms); the C7H10O2 subset alone has 6,095 constitutional isomers

    Which representation is the "right" one? (cf. the Ugly Duckling Theorem, Watanabe 1969)

  • 6/36  Feature engineering: The Art of Feature Engineering

    "Applied machine learning" is basically feature engineering. (Andrew Ng)

    Feature Engineering is the next buzz word after big data. (Nayyar A. Zaid)

    Still largely an Art

  • 7/36  Are the selected variables real effects, or confounders?

  • 8/36  Overview of the talk:

    Abduction/Induction; interactions: FM, ...

    Trees/DAGs: RF, GBM/MART/AnyBoost, XGBoost, RGF, DJ

    Selection: Best Subset, LASSO, SCAD, MC+, SIS; Stability Selection (aka Randomized LASSO)

    Pitfalls: Chance Correlation, Concentration of Measure

    Transformation: PLS, PCA, t-SNE, Embedding (*2vec)

    Randomization: RP/ELM/RC, ExtraTrees, VR-Trees; Stacked Generalization (aka Stacking/Blending)

    Nonlinear transformation: ACE (Alternating Conditional Expectations)

    Validation: AD (Applicability Domain), Y-Scrambling Test

  • Leo Breiman (1928-2005)

    CART (Classification and Regression Trees), PIMPLE, Random Forest, Arcing (aka Boosting), Bagging, Pasting, ACE (Alternating Conditional Expectations), Stacked Generalization (aka Stacking/Blending), Nonnegative Garrote (a subset-selection precursor of the LASSO), Instability / Stabilization in Model Selection

    Shannon-McMillan-Breiman Theorem, Kelly-Breiman Strategy

    UC Berkeley; 2005 SIGKDD Innovation Award; started out as a Probability Theorist

    "If statistics is an applied field and not a minor branch of mathematics, then 99% of the published papers are useless exercises." ("Reflections after refereeing papers for NIPS," in The Mathematics of Generalization, ed. D.H. Wolpert, 1995)

    https://en.wikipedia.org/wiki/File:Leo_Breiman.jpg

  • 10/36  Abduction/Induction

    Hypotheses/Axioms --(deduction)--> Experimental Facts; abduction and induction run in the opposite direction

    "The grand aim of science is to cover the greatest number of experimental facts by logical deduction from the smallest number of hypotheses or axioms." (Albert Einstein)

  • 11/36  Case study: predicting crystal structure, zincblende (ZB) / wurtzite (WZ) vs rocksalt (RS), for 82 octet binary compounds

    14 primary features (atomic properties of elements A and B)

    ~10,000 candidate features generated by algebraic combinations (sums, differences, ratios, powers, etc.) of the primary features

    1. Pre-select with LASSO  2. Best Subset search over the survivors

    Case Study: PRL 114, 105503, 2015

  • 12/36  Case study (cont.): four requirements (1-4) on the descriptor; a kernel-ridge (KRR) model on the atomic numbers (ZA, ZB) alone violates requirements 2 and 4, and requirement 3 is also violated

    Case Study: PRL 114, 105503, 2015

  • 13/36  Case study (cont.): the three selected descriptors

    Case Study: PRL 114, 105503, 2015

  • 14/36  Interactions: the effect of feature i depends on feature j (e.g., XOR / parity), so no single feature is informative on its own
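The XOR point above can be made concrete. In this sketch (toy data and parameter choices are mine, not from the slides), a linear classifier cannot represent XOR, but adding the single interaction feature x1*x2 makes it linearly separable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# XOR: neither x1 nor x2 alone (nor any linear combination) separates the classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

linear = LogisticRegression(max_iter=1000).fit(X, y)
print("linear accuracy:", linear.score(X, y))          # at most 3/4 correct

# Adding the product feature x1*x2 makes XOR linearly separable.
X_int = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
with_int = LogisticRegression(C=1e6, max_iter=1000).fit(X_int, y)
print("with interaction:", with_int.score(X_int, y))   # 1.0
```

A separating hyperplane exists in the augmented space, e.g. 10*x1 + 10*x2 - 40*x1*x2 - 5.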

  • 15/36  Additive models (GAM), Factorization Machines (FM), polynomial regression

    Linear model + pairwise terms (e.g., Factorization Machines: low-rank factorization of the interaction weights)

    Polynomial regression (PolyReg): explicit product features

    Generalized Additive Models (GAM): nonlinear in each variable, but no interactions unless added explicitly
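As a minimal sketch of the PolyReg route mentioned above (synthetic data and regularization strengths are my choices), degree-2 polynomial features capture a pairwise interaction that a plain linear model misses entirely:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Target driven by a genuine pairwise interaction x0*x1.
y = 2.0 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200)

# Degree-2 polynomial regression: all pairwise products as explicit features.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      Ridge(alpha=1e-3))
model.fit(X, y)
print("R^2 with interaction terms:", model.score(X, y))

# A purely linear model cannot represent the product term.
lin = Ridge(alpha=1e-3).fit(X, y)
print("R^2 linear only:", lin.score(X, y))
```

FM would replace the full quadratic expansion with a low-rank factorization, which matters when the number of features is large.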

  • http://playground.tensorflow.org/ (by Big Picture group, Google)

  • From engineering the input features to Architecture Engineering: the XOR dataset, overfitting, ReLU activations, toggling input features on/off, number of layers and units; tuning the NN itself is nontrivial

  • 18/36  Selection pitfalls: with few samples (n) and many candidate features (m), Best Subset picks up chance correlations

    Known early in QSAR (Topliss 1972, 1979)

    J. Fan, "Features of Big Data and sparsest solution in high confidence set," 2014: even when all features are pure noise, the best subset (e.g., of size 5) can show a high apparent correlation with y

    Chance Correlation / Spurious Correlation
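The chance-correlation effect is easy to simulate (sample sizes here are my choices): with n = 50 samples and p = 10,000 pure-noise features, the best single feature already shows a sizable marginal correlation with a pure-noise target:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10_000
X = rng.normal(size=(n, p))        # pure noise features
y = rng.normal(size=n)             # pure noise target

# Marginal Pearson correlation of every feature with y.
Xc = (X - X.mean(0)) / X.std(0)
yc = (y - y.mean()) / y.std()
corr = Xc.T @ yc / n
print("max |corr| among pure-noise features:", np.abs(corr).max())
```

The maximum grows roughly like sqrt(2 log p / n), so screening thousands of candidates on few samples guarantees impressive-looking, meaningless correlations.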

  • 19/36  High dimensions: when is nearest neighbor meaningful?

    K. Beyer+, "When Is Nearest Neighbor Meaningful?" ICDT'99; V. Pestov, "On the geometry of similarity search: dimensionality curse and concentration of measure," Information Processing Letters, 1999.

    As the dimension d grows, the distances from a query point to its nearest and farthest neighbors become relatively indistinguishable

    Concentration of Measure Phenomena
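The concentration effect above can be checked numerically (the uniform-cube setup and point counts are my choices). The "relative contrast" (max - min)/min of distances from a query to random points collapses as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=1000):
    """(max - min) / min of distances from a random query to n random points in [0,1]^d."""
    pts = rng.random((n, d))
    q = rng.random(d)
    dist = np.linalg.norm(pts - q, axis=1)
    return (dist.max() - dist.min()) / dist.min()

for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  relative contrast = {relative_contrast(d):.3f}")
```

In low dimension the nearest point is much closer than the farthest; in high dimension all distances concentrate around the same value, which is the sense in which "nearest neighbor" loses meaning.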

  • 20/36  Selection: Best Subset vs LASSO

    (Filter scores: correlation / mutual information / RELIEFF / t-statistics etc.) Best Subset = L0-penalized selection

    LASSO (Tibshirani 1996): L1 penalty (Basis Pursuit Denoising)

    Best Subset: branch-and-bound, leaps (Furnival & Wilson 1974) or (Morgan & Tatar 1972); exact but combinatorial

    LASSO path: LARS (Efron+ 2003) or coordinate descent (Friedman+ 2007); LASSO does not coincide with Best Subset in general, and its estimates are biased

    glmnet (Friedman+ 2008): L1+L2 (Elastic-Net); note that with p > n, plain LASSO selects at most n variables
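The p > n limitation noted above can be seen directly with a LARS-based LASSO fit (the toy design and penalty value are my choices): the active set never exceeds n, no matter how small the penalty:

```python
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(0)
n, p = 30, 200                              # p > n
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.normal(size=n)

# LARS adds one variable at a time, so the LASSO active set is capped at n.
model = LassoLars(alpha=1e-3).fit(X, y)
nnz = int(np.sum(model.coef_ != 0))
print(f"nonzero coefficients: {nnz} (n = {n}, p = {p})")
```

Elastic-Net (the L1+L2 mix in glmnet) removes this cap, which is one of its selling points for wide data.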

  • 21/36  From LASSO to SIS

    Nonconvex penalties reduce the LASSO's bias (Fan & Li, 2001)

    Adaptive LASSO (Zou 2006): two-stage, reweighted LASSO; SCAD (Fan & Li 2001); MC+ (Zhang 2010): a continuum of penalties related to SCAD

    Stage 1: screening; Stage 2: refined selection

    Sure Independence Screening (SIS) (Fan & Lv 2008): rank features by marginal correlation and shrink p down below n, then pre-select with SIS and apply SCAD
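A minimal SIS sketch along the lines above (toy data, screening size, and the use of plain LASSO instead of SCAD for the second stage are my simplifications):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 5000
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=n)

# SIS stage: rank all p features by |marginal correlation| with y, keep top d < n.
Xs = (X - X.mean(0)) / X.std(0)
ys = (y - y.mean()) / y.std()
corr = np.abs(Xs.T @ ys) / n
keep = np.argsort(corr)[::-1][:20]          # screen 5000 -> 20 candidates

# Second stage: a refined sparse fit on the screened candidates only.
lasso = Lasso(alpha=0.05).fit(X[:, keep], y)
selected = sorted(keep[lasso.coef_ != 0])
print("true features survive screening:", sorted(set(keep.tolist()) & {0, 1}))
print("finally selected:", selected)
```

Screening by marginal correlation is cheap (one pass over the features) and, under SIS conditions, keeps the true support with high probability.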

  • 22/36  Resampling for selection: Randomized Sparse Models

    Stability Selection (Meinshausen & Buhlmann 2010): run the sparse model on many random subsamples and keep features selected with high frequency

    Randomized LASSO (Meinshausen & Buhlmann 2010): additionally perturb the per-feature penalty weights

    cf. Bolasso (Bach 2008): intersect LASSO supports over m Bootstrap samples

    Regularization Path (LASSO) vs Stability Path (selection frequency as a function of the penalty)
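A bare-bones stability-selection sketch (plain half-subsampling only; the Randomized-LASSO variant would also rescale the penalty per feature; all parameter values here are my choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n)

# Refit a sparse model on many subsamples; count how often each feature is selected.
B, freq = 100, np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)   # subsample half the data
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    freq += (coef != 0)
freq /= B

stable = np.where(freq >= 0.8)[0]                     # high-frequency features
print("stable features:", stable, "frequencies:", np.round(freq[stable], 2))
```

Truly relevant features are selected in nearly every subsample, while noise features come and go, so thresholding the frequency stabilizes the selection.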

  • 23/36  Decision trees

    CART (Breiman+ 1984), AID (Morgan & Sonquist 1963), CHAID (Kass 1980), CLS (Hunt 1966), ID3 (Quinlan 1986), C4.5/C5.0 (Quinlan 1993), VFDT/Hoeffding Trees (Domingos & Hulten 2000)

    Hyafil, Laurent; Rivest, RL (1976). "Constructing Optimal Binary Decision Trees is NP-complete". Information Processing Letters. 5 (1): 15-17. doi:10.1016/0020-0190(76)90095-8

    Known facts: AID stands for Automatic Interaction Detector; CART is Bayes-risk consistent (Gordon & Olshen 1978, 1980); finding the optimal binary tree is NP-complete (Hyafil & Rivest 1976), yet greedy growing + pruning works well in practice

    A tree is a disjunction of conjunctions (DNF)
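The "automatic interaction detection" idea above is worth seeing on XOR, the canonical interaction that defeats marginal screening (a toy sketch; the dataset is mine): a depth-2 greedy tree splits on one variable and then on the other within each branch, recovering the interaction without any hand-made feature:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# XOR: zero marginal signal in each variable, pure interaction.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print("depth-2 tree accuracy on XOR:", tree.score(X, y))
```

Note that the first split has zero impurity gain here, so greedy tree growers only find this by being willing to split on zero-gain candidates; this is exactly the kind of case where greedy heuristics can stumble even though the NP-hard optimal tree is trivial.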

  • 24/36  Transformation and randomization

    Trees/DAGs: RF, GBM/MART/AnyBoost, XGBoost, RGF, DJ

    Transformation: PLS, PCA, t-SNE, Embedding

    Randomization: RP/ELM/RC, ExtraTrees, VR-Trees; Stacked Generalization (aka Stacking/Blending)

    Nonlinear transformation: ACE (Alternating Conditional Expectations)

    MARS; linear stacking (aka Kantorovich)

  • 25/36  Dimension reduction and embeddings

    Supervised PCA (Bair+ 2006), Sparse PCA (Zou+ 2006), Sparse PLS (Le Cao+ 2008; Chun & Keles 2010), ICA (Comon 1994), ...

    Linear projections: PLS, PCA, ...

    Unsupervised (PCA) vs supervised (PLS)

    Manifold Learning: ISOMAP (Tenenbaum+ 2000), LLE (Roweis & Saul 2000), t-SNE (van der Maaten & Hinton 2008), ...

    Neural Networks: Embedding (*2vec), AutoEncoders, ...

  • 26/36  Transforming the response as well

    ACE (Alternating Conditional Expectations), Breiman & Friedman 1985: find transformations theta(y) and phi(x) that maximize their correlation
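A minimal ACE-style iteration, sketched with crude bin-average smoothers in place of the supersmoother used in the original paper (the cubic toy relation and bin counts are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = x**3 + 0.1 * rng.normal(size=n)          # nonlinear but monotone relation

def bin_smooth(u, v, bins=50):
    """Estimate E[v | u] by averaging v within quantile bins of u."""
    out = np.empty_like(v, dtype=float)
    for chunk in np.array_split(np.argsort(u), bins):
        out[chunk] = v[chunk].mean()
    return out

# Alternate the two conditional expectations, renormalizing theta each round.
theta = (y - y.mean()) / y.std()
for _ in range(20):
    phi = bin_smooth(x, theta)               # phi(x)   = E[theta(y) | x]
    theta = bin_smooth(y, phi)               # theta(y) = E[phi(x) | y]
    theta = (theta - theta.mean()) / theta.std()

print("raw corr(x, y)           =", round(np.corrcoef(x, y)[0, 1], 3))
print("corr(theta(y), phi(x))   =", round(np.corrcoef(theta, phi)[0, 1], 3))
```

For this monotone relation the transformed pair becomes nearly perfectly correlated, i.e. ACE effectively discovers the linearizing transformations of both sides.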

  • 27/36  Tree ensembles

    Random Forest (Breiman 2001): Bootstrap resampling + a random feature subset at each split

    GBM/MART/GBDT/AnyBoost (Friedman 1999; Mason+ NIPS'99): greedy stagewise additive boosting; XGBoost (Chen & Guestrin KDD'16) adds a second-order approximation + L2 regularization

    Regularized Greedy Forests, RGF (Johnson & Zhang 2014): Greedy + Fully Corrective updates

    Importance Sampled Learning Ensembles, ISLE (Friedman & Popescu 2003): generate trees on subsamples (1/2), then post-fit their combination weights with LASSO

    Decision Jungles (Shotton+ NIPS'13): DAGs instead of trees
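A quick sketch comparing the two ensemble families above on a synthetic task (dataset and hyperparameters are my choices; scikit-learn's GradientBoostingClassifier stands in for GBM/MART):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)   # bagging + random feature subsets
gbm = GradientBoostingClassifier(random_state=0)                # stagewise additive boosting

rf_acc = cross_val_score(rf, X, y, cv=5).mean()
gbm_acc = cross_val_score(gbm, X, y, cv=5).mean()
print(f"RF  CV accuracy: {rf_acc:.3f}")
print(f"GBM CV accuracy: {gbm_acc:.3f}")
```

Both reach high accuracy out of the box; the practical contrast is that RF trees are built independently (trivially parallel) while boosting builds each tree against the current residuals.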

  • 28/36  Feature Importance and Partial Dependence

    Which features matter, and how does the prediction depend on them?

    CART-style ensembles provide Feature Importance scores and Partial Dependence Plots (PDP), unlike Best Subset, which returns an explicit support

    Caveat: interpretability is lost once the inputs are transformed (e.g., after PCA or Blending the original variables lose their identity)

    See ESLII (Hastie+ 2009)

  • 29/36  Randomization instead of optimization

    Learning the representation is not mandatory: random features often work surprisingly well

    Randomized neural networks: fix random hidden layers, train only the final FC layer

    Extreme Learning Machine, ELM (Huang 2006): random hidden layer, closed-form output weights

    Reservoir Computing, RC (e.g. Schrauwen+ 2007): random recurrent reservoir, trained readout

    Randomized Trees: Extremely Randomized Trees (Geurts+ 2006), VR-Trees (Liu 2008): randomized split thresholds/features instead of CART-style optimized splits

    Random Projections
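A from-scratch ELM sketch of the "random hidden layer + trained readout" recipe above (architecture, regularization, and the toy regression target are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=200, reg=1e-3):
    """Random (untrained) tanh hidden layer; ridge-solve the output weights."""
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                   # random nonlinear features
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Nonlinear 1-D regression target.
X = np.linspace(-3, 3, 300).reshape(-1, 1)
y = np.sin(2 * X[:, 0]) + 0.05 * rng.normal(size=300)

W, b, beta = elm_fit(X, y)
rmse = float(np.sqrt(np.mean((y - elm_predict(X, W, b, beta)) ** 2)))
print("train RMSE:", round(rmse, 3))
```

Only the linear readout is fit, so "training" is a single regularized least-squares solve; the random layer just has to be diverse enough to span the target.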

  • 30/36  Stacking

    The Netflix Prize popularized Stacking/Blending

    Key point: the meta-features must be out-of-fold CV predictions

    Split the data into folds (1, 2, ..., 10); each base model predicts each fold while trained on the remaining folds (Out-of-Sample Estimate)

    A meta-model is then trained on these predictions

    Stacked Generalization (Wolpert 1992; Breiman 1996)
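The out-of-fold construction above can be sketched with `cross_val_predict` (base learners, fold count, and dataset are my choices; a proper evaluation would also hold out a final test set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Meta-features = OUT-OF-FOLD predictions: each base model's prediction for a
# sample always comes from folds that did not train on that sample.
base = [RandomForestClassifier(n_estimators=100, random_state=0),
        KNeighborsClassifier()]
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1] for m in base
])

stacker = LogisticRegression().fit(meta_X, y)    # simple linear stacker on top
acc = stacker.score(meta_X, y)
print("stacker accuracy on out-of-fold meta-features:", round(acc, 3))
```

Using in-sample base predictions instead would leak the training labels into the meta-features, which is exactly the failure mode the out-of-fold scheme prevents.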

  • 31/36  Linear stacking

    A linear meta-model over the base predictions (Linear Stacker)

    Nonlinear base learners + a linear stacked combiner = a flexible yet stable ensemble (aka Kantorovich)

  • 32/36  NIPS 2003 Feature Selection Challenge (Guyon+ NIPS'04)

    Winning entry: PCA or feature selection + Bayesian Neural Networks fit by MCMC, with ARD priors or Dirichlet Diffusion Tree priors (Neal & Zhang 2006); two hidden layers

    Guyon's follow-up analysis five years on (Guyon+ 2007)

    ESLII (Hastie+ 2009) also discusses the challenge (NN, RF)

    Many entries used kernel methods

  • 33/36  KDD Cup 2015

    https://speakerdeck.com/smly/techniques-tricks-for-data-mining-competitions

    Churn Prediction: predict dropout from a MOOC (XuetangX); prize $20,000

    Techniques (Tricks) for Data Mining Competitions (@smly)

    821 engineered features (some possibly leaky?)

    Linear Stacker on top of 3-level Stacking (Stackers at levels 1, 2): GBM and NN + LR stacks, with KRR and ET (Extra Trees) at level 2

  • 34/36  Validation practices from QSAR

    AD (Applicability Domain) = the region of feature space where the model's predictions can be trusted (interpolation vs extrapolation)

    C. Rücker+, J. Chem. Inf. Model., 2007, 47 (6), pp 2345-2357

    Y-Scrambling Test / Y-Randomization: refit the model after randomly permuting y; if performance barely drops, the original fit was chance correlation
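A Y-scrambling sketch of the test above (model, data, and the number of permutations are my choices): a genuinely predictive model collapses to chance-level scores once the targets are shuffled:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n)

# Cross-validated score on the real targets.
real = cross_val_score(Ridge(), X, y, cv=5).mean()

# Y-scrambling: repeat the fit on randomly permuted targets.
scrambled = np.mean([
    cross_val_score(Ridge(), X, rng.permutation(y), cv=5).mean()
    for _ in range(10)
])
print(f"real CV R^2 = {real:.2f}, scrambled CV R^2 = {scrambled:.2f}")
```

If the scrambled scores stayed close to the real one, the apparent fit would be attributable to chance correlation rather than signal, which is exactly what the QSAR community uses this test to detect.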

  • 35/36  Summary

    When a single model underfits: tree ensembles (RF, XGBoost, RGF, ET, DJ, ...) and Blending

    Interactions: explicit terms or FM; diverse base learners (random projections, NNs) combined by Stacking

    Selection: screen first (SIS, t-statistics); transform when helpful (t-SNE, PLS, PCA, etc.)

    Validate with Cross Validation, AD, and Y-Scrambling

    Exact subset search (Best Subset) rarely scales; prefer regularized and randomized alternatives

    Stabilize by aggregation (Boosting, Bagging, Stacking) and randomization (Stability Selection, Bagging/Feature Bagging, ELM, ExtraTrees, etc.)

  • 36/36  Acknowledgements: JST