EDiT: Interpreting Ensemble Models via Compact Soft Decision Trees
ICDM 2019
Jaemin Yoo
Seoul National University
Seoul, South Korea

Lee Sael
Ajou University
Suwon, South Korea
Outline
- Introduction
- Proposed Method
- Experiments
- Conclusion
Black Box Models
- Most ML models are black boxes
  - Learned structures are random and complex
  - Their decisions are not explainable
Image from https://www.investopedia.com/terms/b/blackbox.asp
Interpretable ML
- Research on interpreting a model's decisions
  - Important when each decision is irreversible
- Two types of interpretable models:
  - Linear models
  - Decision trees
- However, their accuracy is often poor
Ensemble Models
- Ensemble models
  - Combine the predictions of weak models
  - Produce robust and accurate predictions
- However, they have low interpretability
  - Decisions are made by hundreds of learners
Image from https://dsc-spidal.github.io/harp/docs/examples/rf/
Problem Definition
- Given a trained ensemble model 𝑀
- Train an interpretable classifier 𝑆 such that:
  - 𝑆 achieves similar accuracy to 𝑀
  - 𝑆 contains fewer parameters than 𝑀 (a toy sketch of this setup follows)
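As a rough, hypothetical illustration of this setup (not the paper's code), the snippet below trains a scikit-learn random forest as the teacher 𝑀 and a shallow decision tree as a stand-in for the student 𝑆, then compares test accuracy against tree-node counts as a crude proxy for parameter count; the toy data, model choices, and the `n_nodes` helper are all assumptions.

```python
# Hypothetical sketch of the problem setup: a trained ensemble M (teacher)
# versus an interpretable student S with far fewer parameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

M = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
S = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

def n_nodes(model):
    # Tree-node count as a crude proxy for the number of parameters.
    trees = getattr(model, "estimators_", [model])
    return sum(t.tree_.node_count for t in trees)

print("teacher M: acc=%.3f, nodes=%d" % (M.score(X_te, y_te), n_nodes(M)))
print("student S: acc=%.3f, nodes=%d" % (S.score(X_te, y_te), n_nodes(S)))
```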
Outline
- Introduction
- Proposed Method
- Experiments
- Conclusion
Proposed Method
- Ensemble to Distilled Tree (EDiT)
  - Given an ensemble model
  - Trains a compact soft decision tree
- Interpretable & more efficient than SDTs
- EDiT is based on three main ideas:
  - Idea 1: Knowledge distillation
  - Idea 2: Weight sparsification
  - Idea 3: Tree pruning
Preliminary: SDTs
- SDTs are interpretable tree-based models (see the sketch below)
  - Each internal node is a linear classifier
  - Each leaf node learns a probability distribution
Image from “Rule-Extraction from Soft Decision Trees” (L. Huang, M. Hsieh, and M. Rajati, BDAI 2019)
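To make the structure concrete, here is a minimal PyTorch sketch of a depth-1 soft decision tree in the usual SDT formulation (a linear gate at the internal node, a learned class distribution at each leaf); the `TinySDT` class and its names are illustrative assumptions, not EDiT's actual code.

```python
# Minimal sketch of a depth-1 soft decision tree: one internal node that is
# a linear classifier gating left/right, and two leaves that each hold a
# learned class distribution.
import torch
import torch.nn as nn

class TinySDT(nn.Module):
    def __init__(self, n_features, n_classes):
        super().__init__()
        self.gate = nn.Linear(n_features, 1)                   # internal node
        self.leaves = nn.Parameter(torch.zeros(2, n_classes))  # leaf logits

    def forward(self, x):
        p_right = torch.sigmoid(self.gate(x))          # probability of going right
        leaf_dist = torch.softmax(self.leaves, dim=1)  # per-leaf class distributions
        # Expected class distribution over both root-to-leaf paths.
        return (1 - p_right) * leaf_dist[0] + p_right * leaf_dist[1]

model = TinySDT(n_features=20, n_classes=2)
probs = model(torch.randn(4, 20))   # (4, 2) class probabilities
```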
Idea 1: Distillation
- Knowledge distillation
  - Transfers the knowledge of a teacher to a student
- Replace the labels 𝐲 in the training data 𝒟 as

  𝐲ᵢ ← (𝑀(𝐱ᵢ) + 𝐲ᵢ) / 2   for each (𝐱ᵢ, 𝐲ᵢ) in 𝒟

  - 𝐱ᵢ is the feature vector that corresponds to 𝐲ᵢ
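A minimal sketch of this label-replacement step, assuming the teacher exposes `predict_proba` as scikit-learn ensembles do; `distill_labels` is a hypothetical helper name, not from the paper.

```python
# Sketch of the label-replacement step: average the teacher's predicted
# distribution M(x_i) with the original (one-hot) label y_i.
import numpy as np

def distill_labels(teacher, X, y_onehot):
    """Soft targets y_i <- (M(x_i) + y_i) / 2 for every training example."""
    teacher_probs = teacher.predict_proba(X)   # M(x_i), shape (n, n_classes)
    return (teacher_probs + y_onehot) / 2.0

# Example with the forest M from the earlier sketch:
# y_soft = distill_labels(M, X_tr, np.eye(2)[y_tr])
```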
Idea 2: Sparse Weights (1)
- Weight sparsification
  - Improves efficiency by making the weights sparse
- We propose three different approaches (all three are sketched below):
  - 1) L1 regularization: adds an L1 regularizer to the loss function
  - 2) Weight masking: randomly deactivates some of the weights
  - 3) Weight pruning: prunes weights whose learned values are small
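The sketch below illustrates the three options on a single linear layer, which stands in for an SDT's internal-node weights; the keep ratio, threshold, and 𝜆 value are arbitrary assumptions, not the paper's settings.

```python
# Sketch of the three sparsification options on one linear layer.
import torch
import torch.nn as nn

layer = nn.Linear(20, 1)

# 1) L1 regularization: add lambda * ||W||_1 to the training loss.
lam = 1e-3
l1_penalty = lam * layer.weight.abs().sum()

# 2) Weight masking: randomly deactivate weights with a fixed binary mask.
mask = (torch.rand_like(layer.weight) > 0.8).float()   # keep ~20% of weights
masked_weight = layer.weight * mask

# 3) Weight pruning: zero out weights whose learned magnitudes are small.
threshold = layer.weight.abs().flatten().quantile(0.8)  # prune smallest 80%
with torch.no_grad():
    layer.weight[layer.weight.abs() < threshold] = 0.0
```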
Idea 2: Sparse Weights (2)
- Weight masking: 2 steps
  - Random masking → Training
- Weight pruning: 3 steps
  - Pre-training → Pruning → Fine-tuning
- (both pipelines are sketched below)
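As a rough sketch of the two pipelines (assumptions: a toy linear model, and a single gradient step standing in for each full training phase):

```python
# Sketch of the two sparsification pipelines on a toy linear model.
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_step(model, X, y):                 # stand-in for a training phase
    loss = F.binary_cross_entropy_with_logits(model(X).squeeze(1), y)
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= 0.1 * p.grad
            p.grad = None

X, y = torch.randn(64, 20), torch.randint(0, 2, (64,)).float()

# Weight masking: (1) random masking, then (2) training with the mask fixed.
model = nn.Linear(20, 1)
mask = (torch.rand_like(model.weight) > 0.8).float()
with torch.no_grad():
    model.weight *= mask
train_step(model, X, y)
with torch.no_grad():
    model.weight *= mask        # keep masked weights at zero after the update

# Weight pruning: (1) pre-training, (2) pruning, then (3) fine-tuning.
model = nn.Linear(20, 1)
train_step(model, X, y)                                  # pre-training
with torch.no_grad():
    thr = model.weight.abs().flatten().quantile(0.8)
    model.weight[model.weight.abs() < thr] = 0.0         # pruning
train_step(model, X, y)                                  # fine-tuning
```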
Idea 3: Tree Pruning
- Tree pruning (sketched below)
  - Removes nodes with small arrival probabilities
  - Enables a larger tree depth to be used
- Tree pruning vs. weight pruning:
  - Weight pruning removes redundant weights
  - Tree pruning removes redundant tree nodes
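A small sketch of the idea, under the assumption of a full binary tree whose nodes are indexed level by level: each node's arrival probability is the product of the routing probabilities along its path from the root, and subtrees falling below a threshold are dropped. The function name and toy values are illustrative, not the paper's code.

```python
# Sketch of tree pruning by arrival probability on a full binary tree
# (root = node 0, children of node n are 2n+1 and 2n+2).
def arrival_probabilities(p_right, depth):
    """p_right[n] = average probability of routing right at internal node n."""
    arrival = {0: 1.0}                    # every example reaches the root
    for node in range(2 ** depth - 1):    # internal nodes of the full tree
        arrival[2 * node + 1] = arrival[node] * (1 - p_right[node])  # left
        arrival[2 * node + 2] = arrival[node] * p_right[node]        # right
    return arrival

p_right = {n: 0.9 if n % 2 == 0 else 0.1 for n in range(7)}  # toy routing probs
arrival = arrival_probabilities(p_right, depth=3)
pruned = {n for n, p in arrival.items() if p < 0.05}  # subtrees to remove
```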
Summary
- Result of applying our three ideas to an SDT:
  - Sparse weights from weight sparsification
  - Narrow tree structure from tree pruning
Outline
- Introduction
- Proposed Method
- Experiments
- Conclusion
Accuracy & Complexity
- Does EDiT outperform the baselines?
- EDiT shows the best balance in all cases
  - High accuracy with only a few parameters
Sparsification Methods
- Which is the best sparsification method?
- Weight pruning generally works best
  - L1 regularization fails even with a large 𝜆
Outline
- Introduction
- Proposed Method
- Experiments
- Conclusion
Conclusion
- Ensemble to Distilled Tree (EDiT)
  - Our approach to interpreting ensemble models
- Idea 1: Knowledge distillation
- Idea 2: Weight sparsification
- Idea 3: Tree pruning
- EDiT gives the most efficient predictions
  - Accuracy: DT ≪ RF ≈ SDT ≈ EDiT
  - Parameters: DT ≈ EDiT ≪ SDT < RF
Thank you!
GitHub: https://github.com/leesael/EDiT