LECTURE 19: TREE-BASED METHODS PT. 2 November 15, 2017 SDS 293: Machine Learning

jcrouser/SDS293/lectures/19-trees-pt2.pdf


Page 1:

LECTURE 19:
TREE-BASED METHODS PT. 2
November 15, 2017
SDS 293: Machine Learning

Page 2:

Outline

✓ Basic mechanics of tree-based methods
✓ Classification example
✓ Choosing good splits
✓ Pruning

• How to avoid over-fitting:
- Bootstrap aggregation (“bagging”)
- Random forests
- Boosting

• Lab

Page 3:

Flashback

[Figure: two decision trees from the classification example in a previous lecture, shown side by side with their splits in different orders]

Different?

Page 4:

Flashback

Problem: high variance

Page 5:

Discussion

• Question: what can we do to combat high variance?
• Answer: bootstrap and aggregate!

Page 6:

Bagging

• Big idea: use regular old bootstrapping to generate a bunch of sample training sets, build trees for each one, and average their resulting predictions

(Image credit: DeviantArt)
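The bagging recipe — bootstrap the training set, fit one tree per sample, average the predictions — can be sketched by hand. This is an illustrative Python/scikit-learn sketch on synthetic data, not the course's lab code (the lab uses R's randomForest):

```python
# Minimal hand-rolled bagging sketch (synthetic data, illustrative only)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

B = 50                      # number of bootstrapped training sets
trees = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
# Bagged prediction = average of the B individual tree predictions
y_bagged = np.mean([t.predict(X_test) for t in trees], axis=0)
print(y_bagged.shape)
```

Each deep tree overfits its own bootstrap sample; averaging the B noisy predictions is what reduces the variance.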

Page 7:

Estimating test error

• Fun fact: there is a straightforward way to estimate the test error of a bagged model, without needing to use a test set or cross-validate!

• Key: trees are repeatedly fit to bootstrapped subsets
- Each bagged tree only trains on ~2/3 of the observations
- The remaining ~1/3 are called the out-of-bag (OOB) observations

• Question: how does this help?
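One concrete answer, sketched with scikit-learn rather than the lab's R packages: a bagged model can score each observation using only the trees for which it was out-of-bag. `BaggingRegressor(oob_score=True)` reports this internal estimate (here an R² on synthetic data) with no held-out test set:

```python
# OOB error estimate from a bagged ensemble (synthetic data, illustrative)
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.2, size=300)

bagged = BaggingRegressor(n_estimators=200, oob_score=True,
                          random_state=1).fit(X, y)
print(bagged.oob_score_)             # R^2 computed from OOB predictions only
print(bagged.oob_prediction_.shape)  # one OOB prediction per observation
```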

Page 8:

OOB error

[Diagram: for a single observation, the bagged trees split into those trained on it and those for which it was out-of-bag]

• Repeat for every observation: average to get MSE or classification error

• *With enough trees, this is essentially equivalent to LOOCV error

Page 9:

Measuring predictor importance

• With just one tree, it was easy to pick out the most important predictor (which one was it?)

Page 10:

Measuring predictor importance

• With lots of trees, we can’t just “read from the top”

• Instead, we can look at average reduction in RSS or Gini due to splits on a given predictor (we’ll see this in lab)
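This "average reduction in RSS or Gini" is what scikit-learn exposes as `feature_importances_` (a sketch on synthetic data where only the first 3 of 10 predictors are informative; the lab itself uses R):

```python
# Predictor importance = mean impurity decrease, averaged over the forest
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first 3 of 10 predictors carry signal
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in Gini impurity per predictor, normalized to sum to 1
print(forest.feature_importances_.round(3))
```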

Page 11:

Just one problem…

• One issue with bagging is that it sometimes gives us trees that are pretty highly correlated (why?)

• Averaging highly correlated values doesn’t reduce variance as much as averaging uncorrelated quantities

If we have one very strong predictor in the data set, most or all of the trees will use this predictor in the top split

Page 12:

Random forests

• Strange idea: what if each time we go to make a split, we randomly limit the choice to some subset of predictors?
- Only consider m out of p predictors each time we decide on a split

- Roughly (p-m)/p splits won’t even consider that bully predictor, so other predictors will have a chance

- Note: when m = p, this is just bagging!
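In scikit-learn the choice of m is the `max_features` argument, so the m = p case literally recovers bagging. A hedged sketch on synthetic data with one dominant ("bully") predictor, comparing m = p against m ≈ p/3:

```python
# Bagging (m = p) vs. random forest (m < p), compared via OOB score
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 10))
# One "bully" predictor dominates the response
y = 5 * X[:, 0] + 0.5 * X[:, 1:].sum(axis=1) + rng.normal(size=400)

p = X.shape[1]
bagging = RandomForestRegressor(n_estimators=100, max_features=p,       # m = p
                                oob_score=True, random_state=2).fit(X, y)
forest = RandomForestRegressor(n_estimators=100, max_features=p // 3,   # m ≈ p/3
                               oob_score=True, random_state=2).fit(X, y)
print(bagging.oob_score_, forest.oob_score_)
```

With m < p, roughly (p−m)/p of the splits never see the dominant predictor, which decorrelates the trees.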

Page 13:

Boosting

• Previous methods: generate a bunch of training sets, fit a tree on each one independently, and aggregate results

• Boosting works in a similar way, except that each tree is grown using information from previous trees

Page 14:

Boosting

• Big idea: fit each new tree using the residuals from the previous tree as the response

• A shrinkage parameter 𝜆 slows the process even more, allowing different-shaped trees to try to deal w/ residuals

• By fitting small trees to the residuals (i.e. variance we haven’t yet explained), we slowly* improve the model in areas where it makes mistakes

* in general, statistical learning approaches that learn slowly tend to perform well
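This shrinkage idea can be sketched with scikit-learn's `GradientBoostingRegressor`, where `learning_rate` plays the role of 𝜆 and `max_depth` keeps the trees small (synthetic data; the lab itself uses gbm in R):

```python
# Boosting small trees with a small shrinkage parameter (learning slowly)
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

boost = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01,
                                  max_depth=2, random_state=3).fit(X, y)
# train_score_ records the training loss at each boosting iteration;
# with a small learning_rate it decreases gradually rather than all at once
print(boost.train_score_[0], boost.train_score_[-1])
```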

Page 15:

Boosting: algorithm

1. Initialize f̂(x) = 0 and rᵢ = yᵢ for all i in the training set
2. For each b = 1, 2, …, B:
   a) Fit a tree f̂ᵇ with d splits (d + 1 terminal nodes) to the training data (X, r)
   b) Update f̂ by adding in a shrunken version of the new tree: f̂(x) ← f̂(x) + 𝜆f̂ᵇ(x)
   c) Update the residuals: rᵢ ← rᵢ − 𝜆f̂ᵇ(xᵢ)
3. Output the boosted model: f̂(x) = Σ_{b=1}^{B} 𝜆f̂ᵇ(x)
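The three steps above can be written out directly in Python (an illustrative sketch on synthetic data: `max_depth=1` stumps stand in for trees with d = 1 split, and B and 𝜆 are arbitrary choices):

```python
# Boosting by hand: fit each new tree to the current residuals
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

B, lam = 200, 0.1
f_hat = np.zeros(len(y))        # 1. f_hat(x) = 0 ...
r = y.copy()                    #    ... and r_i = y_i
trees = []
for _ in range(B):              # 2. for b = 1, ..., B:
    tree = DecisionTreeRegressor(max_depth=1).fit(X, r)  # a) fit tree to (X, r)
    trees.append(tree)
    update = lam * tree.predict(X)
    f_hat += update             # b) f_hat <- f_hat + lam * f_b
    r -= update                 # c) r_i <- r_i - lam * f_b(x_i)

def boosted_predict(X_new):
    # 3. boosted model: sum of the shrunken trees
    return lam * np.sum([t.predict(X_new) for t in trees], axis=0)

print(np.var(y), np.var(r))     # residual variance shrinks as trees are added
```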

Page 16:

Takeaways

• Tree-based methods partition the predictors into a number of simple regions, and use the average value of each region to make predictions

• While easy to interpret, trees typically won’t outperform other methods we’ve seen in terms of prediction accuracy

• Bagging, random forests, and boosting all try to fix this by growing multiple trees and using “consensus prediction”

• In lab, we will see that combining a large number of trees can result in dramatic improvements in prediction accuracy (at the expense of some loss in interpretability)

Page 17:

Lab: bagging, random forests, & boosting

• To do today’s lab in R: tree, randomForest, gbm
• To do today’s lab in python: graphviz

• Instructions and code:[course website]/labs/lab14-r.html

[course website]/labs/lab14-py.html

• Full version can be found beginning on p. 324 of ISLR

Page 18:

Coming up

• A7 due tonight by 11:59pm

• Next week: final project workshop
• After break: support vector machines