Between Science and Machine Learning: Design, Transformation, Selection, Interactions, and Linearity of Features
Ichigaku Takigawa (Hokkaido University / JST PRESTO), [email protected]
2016.11.18 @ the 19th Workshop on Information-Based Induction Sciences (IBIS2016)


  • 1/36  (Title slide) Ichigaku Takigawa, Hokkaido University / JST PRESTO, [email protected], 2016.11.18 @ the 19th IBIS Workshop (IBIS2016)

  • 2/36  (Lab: http://art.ist.hokudai.ac.jp) 1974 Turing Award Lecture, "Computer Programming as an Art" (Don Knuth): Science, Engineering, and Art. https://doi.org/10.1145/361604.361612

  • 3/36  Example task: PubChem BioAssay (PC3 cell line), active compounds: 1,737 / inactive: 26,895. https://pubchem.ncbi.nlm.nih.gov/bioassay/41

    Predicting activity from chemical structure (structure-activity relationships / QSAR)

  • 4/36  Representing a molecule: Atom + Bond tables (SYBYL MOL2), pharmacophores. [Figure: Imatinib (CID 5291), conformers 1, 2, 3, ..., 284 in PubChem3D]

    Molecular Graph Representations (Implicit Hydrogen)

    Molecular Descriptors (variations: constitutional, topological, atom pairs, geometrical, electronic, thermodynamical, physicochemical, WHIM, fingerprints, RDF, autocorrelations, functional groups, structural keys, properties, interaction fields, ...)

    Physicochemical properties such as lipophilicity (LogP) and HOMO/LUMO energies

  • 5/36  [Ramakrishnan+ 2014] Sci Data. 2014 Aug 5;1:140022, "Quantum chemistry structures and properties of 134 kilo molecules."

    133,885 molecules built from C, O, N, F (up to 9 heavy atoms); the C7H10O2 subset alone has 6,095 constitutional isomers

    Which representation is the "right" one? (cf. the Ugly Duckling Theorem, Watanabe 1969)

  • 6/36  Feature engineering: The Art of Feature Engineering

    "Applied machine learning" is basically feature engineering. (Andrew Ng)

    Feature Engineering is the next buzz word after big data. (Nayyar A. Zaid)

    Still largely an Art

  • 7/36  Are the selected variables real effects, or confounders?

  • 8/36  Overview of the talk:

    Abduction/Induction; interactions: FM, ...

    Trees/DAGs: RF, GBM/MART/AnyBoost, XGBoost, RGF, DJ

    Selection: Best Subset, LASSO, SCAD, MC+, SIS; Stability Selection (aka Randomized LASSO)

    Pitfalls: Chance Correlation, Concentration of Measure

    Transformation: PLS, PCA, t-SNE, Embedding (*2vec)

    Randomization: RP/ELM/RC, ExtraTrees, VR-Trees; Stacked Generalization (aka Stacking/Blending)

    Nonlinear transformation: ACE (Alternating Conditional Expectations)

    Validation: AD (Applicability Domain), Y-Scrambling Test

  • Leo Breiman (1928-2005)

    CART (Classification and Regression Trees), PIMPLE, Random Forest, Arcing (aka Boosting), Bagging, Pasting, ACE (Alternating Conditional Expectations), Stacked Generalization (aka Stacking/Blending), Nonnegative Garrote (a subset-selection precursor of the LASSO), Instability / Stabilization in Model Selection

    Shannon-McMillan-Breiman Theorem, Kelly-Breiman Strategy

    UC Berkeley; 2005 SIGKDD Innovation Award; started out as a Probability Theorist

    "If statistics is an applied field and not a minor branch of mathematics, then 99% of the published papers are useless exercises." ("Reflections after refereeing papers for NIPS," in The Mathematics of Generalization, ed. D.H. Wolpert, 1995)

    https://en.wikipedia.org/wiki/File:Leo_Breiman.jpg

  • 10/36  Abduction/Induction

    Hypotheses/Axioms --(deduction)--> Experimental Facts; abduction and induction run in the opposite direction

    "The grand aim of science is to cover the greatest number of experimental facts by logical deduction from the smallest number of hypotheses or axioms." (Albert Einstein)

  • 11/36  Case study: predicting crystal structure, zincblende (ZB) / wurtzite (WZ) vs rocksalt (RS), for 82 octet binary compounds

    14 primary features (atomic properties of elements A and B)

    ~10,000 candidate features generated by algebraic combinations (sums, differences, ratios, powers, etc.) of the primary features

    1. Pre-select with LASSO  2. Best Subset search over the survivors

    Case Study: PRL 114, 105503, 2015

  • 12/36  Case study (cont.): four requirements (1-4) on the descriptor; a kernel-ridge (KRR) model on the atomic numbers (ZA, ZB) alone violates requirements 2 and 4, and requirement 3 is also violated

    Case Study: PRL 114, 105503, 2015

  • 13/36  Case study (cont.): the three selected descriptors

    Case Study: PRL 114, 105503, 2015

  • 14/36  Interactions: the effect of feature i depends on feature j (e.g., XOR / parity), so no single feature is informative on its own
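The XOR point above can be made concrete. In this sketch (toy data and parameter choices are mine, not from the slides), a linear classifier cannot represent XOR, but adding the single interaction feature x1*x2 makes it linearly separable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# XOR: neither x1 nor x2 alone (nor any linear combination) separates the classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

linear = LogisticRegression(max_iter=1000).fit(X, y)
print("linear accuracy:", linear.score(X, y))          # at most 3/4 correct

# Adding the product feature x1*x2 makes XOR linearly separable.
X_int = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
with_int = LogisticRegression(C=1e6, max_iter=1000).fit(X_int, y)
print("with interaction:", with_int.score(X_int, y))   # 1.0
```

A separating hyperplane exists in the augmented space, e.g. 10*x1 + 10*x2 - 40*x1*x2 - 5.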

  • 15/36  Additive models (GAM), Factorization Machines (FM), polynomial regression

    Linear model + pairwise terms (e.g., Factorization Machines: low-rank factorization of the interaction weights)

    Polynomial regression (PolyReg): explicit product features

    Generalized Additive Models (GAM): nonlinear in each variable, but no interactions unless added explicitly
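As a minimal sketch of the PolyReg route mentioned above (synthetic data and regularization strengths are my choices), degree-2 polynomial features capture a pairwise interaction that a plain linear model misses entirely:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Target driven by a genuine pairwise interaction x0*x1.
y = 2.0 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200)

# Degree-2 polynomial regression: all pairwise products as explicit features.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      Ridge(alpha=1e-3))
model.fit(X, y)
print("R^2 with interaction terms:", model.score(X, y))

# A purely linear model cannot represent the product term.
lin = Ridge(alpha=1e-3).fit(X, y)
print("R^2 linear only:", lin.score(X, y))
```

FM would replace the full quadratic expansion with a low-rank factorization, which matters when the number of features is large.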

  • http://playground.tensorflow.org/ (by Big Picture group, Google)

  • From engineering the input features to Architecture Engineering: the XOR dataset, overfitting, ReLU activations, toggling input features on/off, number of layers and units; tuning the NN itself is nontrivial

  • 18/36  Selection pitfalls: with few samples (n) and many candidate features (m), Best Subset picks up chance correlations

    Known early in QSAR (Topliss 1972, 1979)

    J. Fan, "Features of Big Data and sparsest solution in high confidence set," 2014: even when all features are pure noise, the best subset (e.g., of size 5) can show a high apparent correlation with y

    Chance Correlation / Spurious Correlation
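The chance-correlation effect is easy to simulate (sample sizes here are my choices): with n = 50 samples and p = 10,000 pure-noise features, the best single feature already shows a sizable marginal correlation with a pure-noise target:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10_000
X = rng.normal(size=(n, p))        # pure noise features
y = rng.normal(size=n)             # pure noise target

# Marginal Pearson correlation of every feature with y.
Xc = (X - X.mean(0)) / X.std(0)
yc = (y - y.mean()) / y.std()
corr = Xc.T @ yc / n
print("max |corr| among pure-noise features:", np.abs(corr).max())
```

The maximum grows roughly like sqrt(2 log p / n), so screening thousands of candidates on few samples guarantees impressive-looking, meaningless correlations.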

  • 19/36  High dimensions: when is nearest neighbor meaningful?

    K. Beyer+, "When Is Nearest Neighbor Meaningful?" ICDT'99; V. Pestov, "On the geometry of similarity search: dimensionality curse and concentration of measure," Information Processing Letters, 1999.

    As the dimension d grows, the distances from a query point to its nearest and farthest neighbors become relatively indistinguishable

    Concentration of Measure Phenomena
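The concentration effect above can be checked numerically (the uniform-cube setup and point counts are my choices). The "relative contrast" (max - min)/min of distances from a query to random points collapses as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=1000):
    """(max - min) / min of distances from a random query to n random points in [0,1]^d."""
    pts = rng.random((n, d))
    q = rng.random(d)
    dist = np.linalg.norm(pts - q, axis=1)
    return (dist.max() - dist.min()) / dist.min()

for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  relative contrast = {relative_contrast(d):.3f}")
```

In low dimension the nearest point is much closer than the farthest; in high dimension all distances concentrate around the same value, which is the sense in which "nearest neighbor" loses meaning.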

  • 20/36  Selection: Best Subset vs LASSO

    (Filter scores: correlation / mutual information / RELIEFF / t-statistics etc.) Best Subset = L0-penalized selection

    LASSO (Tibshirani 1996): L1 penalty (Basis Pursuit Denoising)

    Best Subset: branch-and-bound, leaps (Furnival & Wilson 1974) or (Morgan & Tatar 1972); exact but combinatorial

    LASSO path: LARS (Efron+ 2003) or coordinate descent (Friedman+ 2007); LASSO does not coincide with Best Subset in general, and its estimates are biased

    glmnet (Friedman+ 2008): L1+L2 (Elastic-Net); note that with p > n, plain LASSO selects at most n variables
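The p > n limitation noted above can be seen directly with a LARS-based LASSO fit (the toy design and penalty value are my choices): the active set never exceeds n, no matter how small the penalty:

```python
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(0)
n, p = 30, 200                              # p > n
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.normal(size=n)

# LARS adds one variable at a time, so the LASSO active set is capped at n.
model = LassoLars(alpha=1e-3).fit(X, y)
nnz = int(np.sum(model.coef_ != 0))
print(f"nonzero coefficients: {nnz} (n = {n}, p = {p})")
```

Elastic-Net (the L1+L2 mix in glmnet) removes this cap, which is one of its selling points for wide data.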

  • 21/36  From LASSO to SIS

    Nonconvex penalties reduce the LASSO's bias (Fan & Li, 2001)

    Adaptive LASSO (Zou 2006): two-stage, reweighted LASSO; SCAD (Fan & Li 2001); MC+ (Zhang 2010): a continuum of penalties related to SCAD

    Stage 1: screening; Stage 2: refined selection

    Sure Independence Screening (SIS) (Fan & Lv 2008): rank features by marginal correlation and shrink p down below n, then pre-select with SIS and apply SCAD
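A minimal SIS sketch along the lines above (toy data, screening size, and the use of plain LASSO instead of SCAD for the second stage are my simplifications):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 5000
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=n)

# SIS stage: rank all p features by |marginal correlation| with y, keep top d < n.
Xs = (X - X.mean(0)) / X.std(0)
ys = (y - y.mean()) / y.std()
corr = np.abs(Xs.T @ ys) / n
keep = np.argsort(corr)[::-1][:20]          # screen 5000 -> 20 candidates

# Second stage: a refined sparse fit on the screened candidates only.
lasso = Lasso(alpha=0.05).fit(X[:, keep], y)
selected = sorted(keep[lasso.coef_ != 0])
print("true features survive screening:", sorted(set(keep.tolist()) & {0, 1}))
print("finally selected:", selected)
```

Screening by marginal correlation is cheap (one pass over the features) and, under SIS conditions, keeps the true support with high probability.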

  • 22/36  Resampling for selection: Randomized Sparse Models

    Stability Selection (Meinshausen & Buhlmann 2010): run the sparse model on many random subsamples and keep features selected with high frequency

    Randomized LASSO (Meinshausen & Buhlmann 2010): additionally perturb the per-feature penalty weights

    cf. Bolasso (Bach 2008): intersect LASSO supports over m Bootstrap samples

    Regularization Path (LASSO) vs Stability Path (selection frequency as a function of the penalty)
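A bare-bones stability-selection sketch (plain half-subsampling only; the Randomized-LASSO variant would also rescale the penalty per feature; all parameter values here are my choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n)

# Refit a sparse model on many subsamples; count how often each feature is selected.
B, freq = 100, np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)   # subsample half the data
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    freq += (coef != 0)
freq /= B

stable = np.where(freq >= 0.8)[0]                     # high-frequency features
print("stable features:", stable, "frequencies:", np.round(freq[stable], 2))
```

Truly relevant features are selected in nearly every subsample, while noise features come and go, so thresholding the frequency stabilizes the selection.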

  • 23/36  Decision trees

    CART (Breiman+ 1984), AID (Morgan & Sonquist 1963), CHAID (Kass 1980), CLS (Hunt 1966), ID3 (Quinlan 1986), C4.5/C5.0 (Quinlan 1993), VFDT/Hoeffding Trees (Domingos & Hulten 2000)

    Hyafil, Laurent; Rivest, RL (1976). "Constructing Optimal Binary Decision Trees is NP-complete". Information Processing Letters. 5 (1): 15-17. doi:10.1016/0020-0190(76)90095-8

    Known facts: AID stands for Automatic Interaction Detector; CART is Bayes-risk consistent (Gordon & Olshen 1978, 1980); finding the optimal binary tree is NP-complete (Hyafil & Rivest 1976), yet greedy growing + pruning works well in practice

    A tree is a disjunction of conjunctions (DNF)
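The "automatic interaction detection" idea above is worth seeing on XOR, the canonical interaction that defeats marginal screening (a toy sketch; the dataset is mine): a depth-2 greedy tree splits on one variable and then on the other within each branch, recovering the interaction without any hand-made feature:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# XOR: zero marginal signal in each variable, pure interaction.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print("depth-2 tree accuracy on XOR:", tree.score(X, y))
```

Note that the first split has zero impurity gain here, so greedy tree growers only find this by being willing to split on zero-gain candidates; this is exactly the kind of case where greedy heuristics can stumble even though the NP-hard optimal tree is trivial.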

  • 24/36  Transformation and randomization

    Trees/DAGs: RF, GBM/MART/AnyBoost, XGBoost, RGF, DJ

    Transformation: PLS, PCA, t-SNE, Embedding

    Randomization: RP/ELM/RC, ExtraTrees, VR-Trees; Stacked Generalization (aka Stacking/Blending)

    Nonlinear transformation: ACE (Alternating Conditional Expectations)

    MARS; linear stacking (aka Kantorovich)

  • 25/36  Dimension reduction and embeddings

    Supervised PCA (Bair+ 2006), Sparse PCA (Zou+ 2006), Sparse PLS (Le Cao+ 2008; Chun & Keles 2010), ICA (Comon 1994), ...

    Linear projections: PLS, PCA, ...

    Unsupervised (PCA) vs supervised (PLS)

    Manifold Learning: ISOMAP (Tenenbaum+ 2000), LLE (Roweis & Saul 2000), t-SNE (van der Maaten & Hinton 2008), ...

    Neural Networks: Embedding (*2vec), AutoEncoders, ...

  • 26/36  Transforming the response as well

    ACE (Alternating Conditional Expectations), Breiman & Friedman 1985: find transformations theta(y) and phi(x) that maximize their correlation
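A minimal ACE-style iteration, sketched with crude bin-average smoothers in place of the supersmoother used in the original paper (the cubic toy relation and bin counts are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = x**3 + 0.1 * rng.normal(size=n)          # nonlinear but monotone relation

def bin_smooth(u, v, bins=50):
    """Estimate E[v | u] by averaging v within quantile bins of u."""
    out = np.empty_like(v, dtype=float)
    for chunk in np.array_split(np.argsort(u), bins):
        out[chunk] = v[chunk].mean()
    return out

# Alternate the two conditional expectations, renormalizing theta each round.
theta = (y - y.mean()) / y.std()
for _ in range(20):
    phi = bin_smooth(x, theta)               # phi(x)   = E[theta(y) | x]
    theta = bin_smooth(y, phi)               # theta(y) = E[phi(x) | y]
    theta = (theta - theta.mean()) / theta.std()

print("raw corr(x, y)           =", round(np.corrcoef(x, y)[0, 1], 3))
print("corr(theta(y), phi(x))   =", round(np.corrcoef(theta, phi)[0, 1], 3))
```

For this monotone relation the transformed pair becomes nearly perfectly correlated, i.e. ACE effectively discovers the linearizing transformations of both sides.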

  • 27/36  Tree ensembles

    Random Forest (Breiman 2001): Bootstrap resampling + a random feature subset at each split

    GBM/MART/GBDT/AnyBoost (Friedman 1999; Mason+ NIPS'99): greedy stagewise additive boosting; XGBoost (Chen & Guestrin KDD'16) adds a second-order approximation + L2 regularization

    Regularized Greedy Forests, RGF (Johnson & Zhang 2014): Greedy + Fully Corrective updates

    Importance Sampled Learning Ensembles, ISLE (Friedman & Popescu 2003): generate trees on subsamples (1/2), then post-fit their combination weights with LASSO

    Decision Jungles (Shotton+ NIPS'13): DAGs instead of trees
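A quick sketch comparing the two ensemble families above on a synthetic task (dataset and hyperparameters are my choices; scikit-learn's GradientBoostingClassifier stands in for GBM/MART):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)   # bagging + random feature subsets
gbm = GradientBoostingClassifier(random_state=0)                # stagewise additive boosting

rf_acc = cross_val_score(rf, X, y, cv=5).mean()
gbm_acc = cross_val_score(gbm, X, y, cv=5).mean()
print(f"RF  CV accuracy: {rf_acc:.3f}")
print(f"GBM CV accuracy: {gbm_acc:.3f}")
```

Both reach high accuracy out of the box; the practical contrast is that RF trees are built independently (trivially parallel) while boosting builds each tree against the current residuals.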

  • 28/36  Feature Importance and Partial Dependence

    Which features matter, and how does the prediction depend on them?

    CART-style ensembles provide Feature Importance scores and Partial Dependence Plots (PDP), unlike Best Subset, which returns an explicit support

    Caveat: interpretability is lost once the inputs are transformed (e.g., after PCA or Blending the original variables lose their identity)

    See ESLII (Hastie+ 2009)

  • 29/36  Randomization instead of optimization

    Learning the representation is not mandatory: random features often work surprisingly well

    Randomized neural networks: fix random hidden layers, train only the final FC layer

    Extreme Learning Machine, ELM (Huang 2006): random hidden layer, closed-form output weights

    Reservoir Computing, RC (e.g. Schrauwen+ 2007): random recurrent reservoir, trained readout

    Randomized Trees: Extremely Randomized Trees (Geurts+ 2006), VR-Trees (Liu 2008): randomized split thresholds/features instead of CART-style optimized splits

    Random Projections
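A from-scratch ELM sketch of the "random hidden layer + trained readout" recipe above (architecture, regularization, and the toy regression target are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=200, reg=1e-3):
    """Random (untrained) tanh hidden layer; ridge-solve the output weights."""
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                   # random nonlinear features
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Nonlinear 1-D regression target.
X = np.linspace(-3, 3, 300).reshape(-1, 1)
y = np.sin(2 * X[:, 0]) + 0.05 * rng.normal(size=300)

W, b, beta = elm_fit(X, y)
rmse = float(np.sqrt(np.mean((y - elm_predict(X, W, b, beta)) ** 2)))
print("train RMSE:", round(rmse, 3))
```

Only the linear readout is fit, so "training" is a single regularized least-squares solve; the random layer just has to be diverse enough to span the target.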

  • 30/36  Stacking

    The Netflix Prize popularized Stacking/Blending

    Key point: the meta-features must be out-of-fold CV predictions

    Split the data into folds (1, 2, ..., 10); each base model predicts each fold while trained on the remaining folds (Out-of-Sample Estimate)

    A meta-model is then trained on these predictions

    Stacked Generalization (Wolpert 1992; Breiman 1996)
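The out-of-fold construction above can be sketched with `cross_val_predict` (base learners, fold count, and dataset are my choices; a proper evaluation would also hold out a final test set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Meta-features = OUT-OF-FOLD predictions: each base model's prediction for a
# sample always comes from folds that did not train on that sample.
base = [RandomForestClassifier(n_estimators=100, random_state=0),
        KNeighborsClassifier()]
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1] for m in base
])

stacker = LogisticRegression().fit(meta_X, y)    # simple linear stacker on top
acc = stacker.score(meta_X, y)
print("stacker accuracy on out-of-fold meta-features:", round(acc, 3))
```

Using in-sample base predictions instead would leak the training labels into the meta-features, which is exactly the failure mode the out-of-fold scheme prevents.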

  • 31/36  Linear stacking

    A linear meta-model over the base predictions (Linear Stacker)

    Nonlinear base learners + a linear stacked combiner = a flexible yet stable ensemble (aka Kantorovich)

  • 32/36  NIPS 2003 Feature Selection Challenge (Guyon+ NIPS'04)

    Winning entry: PCA or feature selection + Bayesian Neural Networks fit by MCMC, with ARD priors or Dirichlet Diffusion Tree priors (Neal & Zhang 2006); two hidden layers

    Guyon's follow-up analysis five years on (Guyon+ 2007)

    ESLII (Hastie+ 2009) also discusses the challenge (NN, RF)

    Many entries used kernel methods

  • 33/36  KDD Cup 2015

    https://speakerdeck.com/smly/techniques-tricks-for-data-mining-competitions

    Churn Prediction: predict dropout from a MOOC (XuetangX); prize $20,000

    Techniques (Tricks) for Data Mining Competitions (@smly)

    821 engineered features (some possibly leaky?)

    Linear Stacker on top of 3-level Stacking (Stackers at levels 1, 2): GBM and NN + LR stacks, with KRR and ET (Extra Trees) at level 2

  • 34/36  Validation practices from QSAR

    AD (Applicability Domain) = the region of feature space where the model's predictions can be trusted (interpolation vs extrapolation)

    C. Rücker+, J. Chem. Inf. Model., 2007, 47 (6), pp 2345-2357

    Y-Scrambling Test / Y-Randomization: refit the model after randomly permuting y; if performance barely drops, the original fit was chance correlation
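A Y-scrambling sketch of the test above (model, data, and the number of permutations are my choices): a genuinely predictive model collapses to chance-level scores once the targets are shuffled:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n)

# Cross-validated score on the real targets.
real = cross_val_score(Ridge(), X, y, cv=5).mean()

# Y-scrambling: repeat the fit on randomly permuted targets.
scrambled = np.mean([
    cross_val_score(Ridge(), X, rng.permutation(y), cv=5).mean()
    for _ in range(10)
])
print(f"real CV R^2 = {real:.2f}, scrambled CV R^2 = {scrambled:.2f}")
```

If the scrambled scores stayed close to the real one, the apparent fit would be attributable to chance correlation rather than signal, which is exactly what the QSAR community uses this test to detect.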

  • 35/36  Summary

    When a single model underfits: tree ensembles (RF, XGBoost, RGF, ET, DJ, ...) and Blending

    Interactions: explicit terms or FM; diverse base learners (random projections, NNs) combined by Stacking

    Selection: screen first (SIS, t-statistics); transform when helpful (t-SNE, PLS, PCA, etc.)

    Validate with Cross Validation, AD, and Y-Scrambling

    Exact subset search (Best Subset) rarely scales; prefer regularized and randomized alternatives

    Stabilize by aggregation (Boosting, Bagging, Stacking) and randomization (Stability Selection, Bagging/Feature Bagging, ELM, ExtraTrees, etc.)

  • 36/36  Acknowledgements: JST