
Page 1:

DS 4400

Alina Oprea, Associate Professor, CCIS, Northeastern University

December 2, 2019

Machine Learning and Data Mining I

Page 2:

Logistics

• Final exam
  – Wed, Dec 4, 2:50-5pm, in class
  – Office hours: after class today, Tuesday 1-2pm
  – No office hours on Wed

• Final project
  – Presentations: Mon, Dec 9, 1-5pm, ISEC 655
  – 8 minutes per team
  – Final report due on Gradescope on Tue, Dec. 10
  – No late days for the final report! Please submit on time

Page 3:

Final Exam Review

Page 4:

What we covered

Linear classification
• Perceptron
• Logistic regression
• LDA
• Linear SVM

• Metrics
• Cross-validation
• Regularization
• Feature selection
• Gradient Descent
• Density Estimation

Linear Regression

Non-linear classification
• kNN
• Decision trees
• Kernel SVM
• Naïve Bayes

Linear algebra
Probability and statistics

Deep learning
• Feed-forward Neural Nets
• Convolutional Neural Nets
• Architectures
• Forward/back propagation

Ensembles
• Bagging
• Random forests
• Boosting
• AdaBoost

Page 5:

Terminology


• Hypothesis space $H = \{f : X \to Y\}$
• Training data $D = \{(x_i, y_i)\} \subseteq X \times Y$
• Features: $x_i \in X$
• Labels / response variables $y_i \in Y$
  – Classification: discrete $y_i \in \{0, 1\}$
  – Regression: $y_i \in \mathbb{R}$
• Loss function: $L(f, D)$
  – Measures how well $f$ fits the training data
• Training algorithm: find hypothesis $\hat{f} : X \to Y$
  – $\hat{f} = \operatorname{argmin}_{f \in H} L(f, D)$
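To make the terminology concrete, here is a minimal sketch of empirical risk minimization (the toy data and the threshold hypothesis class are my choices for illustration, not from the slides): a hypothesis space, a 0-1 loss, and the argmin over $H$.

```python
import numpy as np

# Toy training data D = {(x_i, y_i)}: 1-D features, binary labels
X = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([0, 0, 1, 1])

# Hypothesis space H: threshold classifiers f_t(x) = 1[x >= t]
thresholds = np.linspace(0, 1, 101)

def loss(t, X, y):
    """0-1 loss L(f_t, D): fraction of training points f_t misclassifies."""
    preds = (X >= t).astype(int)
    return np.mean(preds != y)

# Training algorithm: f_hat = argmin over H of L(f, D)
t_hat = min(thresholds, key=lambda t: loss(t, X, y))
print(f"learned threshold: {t_hat:.2f}")  # any t in (0.4, 0.6] achieves zero loss
```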

Page 6:

Supervised Learning: Classification

[Figure: classification pipeline]
Training: labeled data $(x_i, y_i)$ with $y_i \in \{0, 1\}$ → data pre-processing (normalization) → feature extraction (feature selection) → learning model $f(x)$
Testing: new unlabeled data $x'$ → learning model $f(x)$ → predictions $\hat{y} = f(x') \in \{0, 1\}$ (Positive / Negative)

Page 7:

Supervised Learning: Regression

[Figure: regression pipeline]
Training: labeled data $(x_i, y_i)$ with $y_i \in \mathbb{R}$ → data pre-processing (normalization) → feature extraction (feature selection) → learning model $f(x)$
Testing: new unlabeled data $x'$ → learning model $f(x)$ → predictions: response variable $\hat{y} = f(x') \in \mathbb{R}$

Page 8:

Methods for Feature Selection

• Wrappers
  – Select the subset of features that gives the best prediction accuracy (using cross-validation)
  – Model-specific

• Filters
  – Compute some statistical metric (correlation coefficient, mutual information)
  – Select features whose statistic is higher than a threshold (see the sketch below)

• Embedded methods
  – Feature selection done as part of training
  – Example: regularization (Lasso, L1 regularization)
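A minimal sketch of the filter approach (synthetic data; the 0.05 cutoff is an arbitrary illustration, with mutual information playing the role of the statistical metric):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)  # one statistic per feature
threshold = 0.05                                    # arbitrary cutoff
selected = np.where(scores > threshold)[0]
print("selected features:", selected)
```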

Page 9:

Decision Tree Learning


Page 10:

Learning Decision Trees


The ID3 algorithm uses Information Gain; Information Gain reduces the uncertainty about Y.
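A concrete sketch of these quantities (toy labels, a single hypothetical numeric feature; the helper names are mine):

```python
import numpy as np

def entropy(labels):
    """H(Y) = -sum_c p(c) log2 p(c) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, mask):
    """IG = H(Y) - H(Y | split), for a boolean split mask."""
    left, right = y[mask], y[~mask]
    h_cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - h_cond

y = np.array([0, 0, 1, 1, 1, 0])              # toy labels
x = np.array([1.0, 2.0, 7.0, 8.0, 9.0, 1.5])  # one feature
print(information_gain(y, x <= 3.0))          # gain of the split "x <= 3"
```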

Page 11:

Regression Trees

Predict average response of all training data at each leaf

[Figure: a split on feature $X_j$ at threshold $s$ sends $X_j \le s$ to region $R_1(j, s)$ and $X_j \ge s$ to region $R_2(j, s)$]
Choose $(j, s)$ to minimize the sum of residuals on both branches
Predict $c_1$ in $R_1$ and $c_2$ in $R_2$
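A minimal sketch of that split search on one feature (toy data; a full regression tree would scan every feature $j$ and recurse into each region):

```python
import numpy as np

def best_split(x, y):
    """Scan candidate thresholds s; pick the split minimizing the summed
    squared residuals of the two branches, where each branch predicts
    its mean response (c1, c2)."""
    best = (None, np.inf)
    for s in np.unique(x)[:-1]:
        left, right = y[x <= s], y[x > s]
        rss = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if rss < best[1]:
            best = (s, rss)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))  # splits at x = 3, separating the two response levels
```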

Page 12:

Decision Trees Topics

• Entropy, conditional entropy
• Information gain
• How to train a decision tree
  – Recursive algorithm
  – Impurity metrics (Gini, information gain)
• How to evaluate a decision tree
• Strategies to prevent overfitting
  – Pruning

Page 13:

Ensembles

Majority votes

Page 14:

Bagging

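In code, the bagging recipe looks roughly like this (a sketch using sklearn decision trees as the base learner; the ensemble size of 25 and the dataset are arbitrary): train each model on a bootstrap sample, then combine by majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(25):                        # 25 bootstrap rounds
    idx = rng.integers(0, len(X), len(X))  # sample n points with replacement
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

votes = np.stack([m.predict(X) for m in models])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote
print("training accuracy:", (y_hat == y).mean())
```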

Page 15:

Random Forests


Trees are de-correlated by choosing a random subset of features at each split
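As a quick illustration (sklearn's RandomForestClassifier on made-up data; with 16 features, max_features="sqrt" makes each split consider a random subset of 4, which is what de-correlates the trees):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=0)

# oob_score=True evaluates each tree on its out-of-bag samples
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy:", rf.oob_score_)
```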

Page 16:

Overview of AdaBoost

• Better classifiers will get higher weights
• Mis-classified examples get higher weights
• Correct examples get lower weights

[Figure: start from uniform example weights and train $h_1(x)$; re-weight the examples and train $h_2(x)$, then $h_3(x)$, then $h_4(x)$]
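One boosting round in code (a sketch with a decision stump as the weak learner; the toy data and the small guard against zero error are mine):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_round(X, y, w):
    """One round: fit a weak learner on weighted data, score it by its
    weighted error, and re-weight the examples."""
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    err = w[pred != y].sum() / w.sum()                  # weighted error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # better h => larger alpha
    w = w * np.exp(np.where(pred == y, -alpha, alpha))  # up-weight mistakes
    return h, alpha, w / w.sum()

X = np.arange(6, dtype=float).reshape(-1, 1)  # toy 1-D data
y = np.array([0, 0, 1, 0, 1, 1])
w = np.ones(len(X)) / len(X)                  # uniform initial weights
h1, alpha1, w = adaboost_round(X, y, w)
print(alpha1, w)                              # mistakes now carry more weight
```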

Page 17:

AdaBoost


Page 18:

Ensemble Topics

• Bagging
  – Parallel training
  – Bootstrap samples
  – Out-of-bag (OOB) error
  – Random forest

• Boosting
  – Sequential training
  – Training with weights
  – AdaBoost

Page 19:

Neural Network Architectures

Feed-Forward Networks
• Neurons from each layer connect to neurons from the next layer

Convolutional Networks
• Include convolution layers for feature reduction
• Learn hierarchical representations

Recurrent Networks
• Keep hidden state
• Have cycles in the computational graph

Page 20:

Feed-Forward Neural Network

[Figure: a feed-forward network with no cycles. Layer 0 holds the training example $x = (x_1, x_2, x_3)$; Layer 1 computes hidden activations $a_1^{[1]}, a_2^{[1]}, a_3^{[1]}, a_4^{[1]}$; Layer 2 computes the output $a_1^{[2]}$. Each layer has weights $W^{[\ell]}$ and biases $b^{[\ell]}$, so the parameters are $\theta = (b^{[1]}, W^{[1]}, b^{[2]}, W^{[2]})$]
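A sketch of forward propagation through this network (the shapes follow the figure: 3 inputs, 4 hidden units, 1 output; the sigmoid activation is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # layer 1: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # layer 2: 4 units -> 1 output
# parameter count: (4*3 + 4) + (1*4 + 1) = 21 weights and biases

x = np.array([0.5, -1.0, 2.0])  # training example (x1, x2, x3)

a1 = sigmoid(W1 @ x + b1)       # a^[1]: linear operation, then activation
a2 = sigmoid(W2 @ a1 + b2)      # a^[2]: the network's output
print(a2)
```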

Page 21:

[Figure: sigmoid and softmax activation functions]

Page 22:

Softmax classifier

• Predict the class with highest probability
• Generalization of sigmoid/logistic regression to multi-class (see the sketch below)
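A small sketch of the softmax computation (the max subtraction is a standard numerical-stability trick, not something the slide specifies):

```python
import numpy as np

def softmax(z):
    """Map a vector of class scores to a probability distribution."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # class scores (logits)
p = softmax(z)
print(p, "-> predict class", p.argmax())  # class with highest probability
```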

Page 23:

Feed-Forward Neural Networks Topics

• Forward propagation
  – Linear operations and activations
• Activation functions
  – Examples, non-linearity
• Design networks for simple operations
• Estimate number of parameters
  – Count both weights and biases
• Activations for binary / multi-class classification
• Regularization (dropout and weight decay)

Page 24:

Representing Boolean Functions

Logistic unit
[Figure: a logistic unit whose weights are left as ? ? ? to be filled in]

Page 25:

Representing Boolean Functions

[Figure: a logistic unit computing NOT($x_1$ AND $x_2$)]

Page 26:

XOR with 1 Hidden layer

$h_1 = x_1$ OR $x_2$
$h_2 =$ NOT($x_1$ AND $x_2$)
$y = h_1$ AND $h_2$

$x_1$ XOR $x_2$ = ($x_1$ OR $x_2$) AND (NOT($x_1$ AND $x_2$))
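A sketch of this construction with concrete weights (the values ±20, 30, -10, -30 follow the standard logistic-unit encodings of OR, NAND, and AND; they are my assumption, not read off the slide):

```python
import numpy as np

def unit(w, b, x):
    """Logistic unit: saturates near 1 or 0 for large |w.x + b|."""
    return 1 / (1 + np.exp(-(w @ x + b)))

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2])
        h1 = unit(np.array([20, 20]), -10, x)    # h1 = x1 OR x2
        h2 = unit(np.array([-20, -20]), 30, x)   # h2 = NOT(x1 AND x2)
        y = unit(np.array([20, 20]), -30, np.array([h1, h2]))  # y = h1 AND h2
        print(x1, "XOR", x2, "=", round(y))
```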

Page 27:

Convolutional Nets


Page 28:

Convolutional Nets

Topics
• Input and output size (see the sketch below)
• Number of parameters
• Convolution operation (stride, pad)
• Max pooling
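For the size and parameter counts, the usual formulas are $\lfloor (n + 2p - f)/s \rfloor + 1$ for the output size of an $n \times n$ input with filter size $f$, pad $p$, stride $s$, and $(f \cdot f \cdot c_{in} + 1) \cdot c_{out}$ parameters per conv layer (weights plus one bias per filter). A quick check with made-up layer sizes:

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * pad - f) // stride + 1

def conv_num_params(f, c_in, c_out):
    """Each of the c_out filters has f*f*c_in weights plus one bias."""
    return (f * f * c_in + 1) * c_out

# e.g. a 32x32x3 input through a 5x5 conv with 8 filters, stride 1, no pad
print(conv_output_size(32, 5))  # 28
print(conv_num_params(5, 3, 8)) # 608
```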

Page 29:

Training NN with Backpropagation

Given training set $(x_1, y_1), \ldots, (x_m, y_m)$
Initialize all parameters $W^{[\ell]}, b^{[\ell]}$ randomly, for all layers $\ell$
Loop (each pass over the data is one epoch):
  Update the weights via a gradient step:
  • $W_{ij}^{[\ell]} = W_{ij}^{[\ell]} - \alpha \frac{\partial L}{\partial W_{ij}^{[\ell]}}$
  • Similarly for $b_i^{[\ell]}$
Until the weights converge or the maximum number of epochs is reached
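A sketch of this loop for the two-layer network from page 20, assuming a sigmoid output with cross-entropy loss and a single made-up training example (both assumptions, not specified on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # random initialization
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
alpha = 0.5

x = np.array([0.5, -1.0, 2.0]); y = 1.0        # one made-up example

for epoch in range(100):
    a1 = sigmoid(W1 @ x + b1)                  # forward pass
    a2 = sigmoid(W2 @ a1 + b2)
    dz2 = a2 - y                               # dL/dz2 for cross-entropy
    dW2, db2 = np.outer(dz2, a1), dz2
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)         # chain rule through sigmoid
    dW1, db1 = np.outer(dz1, x), dz1
    W2 -= alpha * dW2; b2 -= alpha * db2       # gradient step on all params
    W1 -= alpha * dW1; b1 -= alpha * db1

print(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)) # prediction approaches y
```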

Page 30:

Stochastic Gradient Descent

• Initialization
  – For all layers $\ell$: set $W^{[\ell]}, b^{[\ell]}$ at random

• Backpropagation
  – Fix learning rate $\alpha$
  – For all layers $\ell$ (starting backwards)
    • For each training example $(x_i, y_i)$:
      – $W^{[\ell]} = W^{[\ell]} - \alpha \frac{\partial L(\hat{y}_i, y_i)}{\partial W^{[\ell]}}$
      – $b^{[\ell]} = b^{[\ell]} - \alpha \frac{\partial L(\hat{y}_i, y_i)}{\partial b^{[\ell]}}$

SGD is the incremental version of gradient descent.

Page 31:

Mini-batch Gradient Descent

• Initialization
  – For all layers $\ell$: set $W^{[\ell]}, b^{[\ell]}$ at random

• Backpropagation
  – Fix learning rate $\alpha$
  – For all layers $\ell$ (starting backwards)
    • For all batches $b$ of size $B$ with training examples $(x_{i_b}, y_{i_b})$:
      – $W^{[\ell]} = W^{[\ell]} - \alpha \sum_{i=1}^{B} \frac{\partial L(\hat{y}_{i_b}, y_{i_b})}{\partial W^{[\ell]}}$
      – $b^{[\ell]} = b^{[\ell]} - \alpha \sum_{i=1}^{B} \frac{\partial L(\hat{y}_{i_b}, y_{i_b})}{\partial b^{[\ell]}}$
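A sketch contrasting the two update schedules on a stand-in model (a linear least-squares fit rather than a neural net, so the gradient is available in closed form; all data made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w, alpha, B = np.zeros(3), 0.01, 10

for epoch in range(50):
    idx = rng.permutation(len(X))               # shuffle each epoch
    for start in range(0, len(X), B):           # B = 1 would be plain SGD
        batch = idx[start:start + B]
        y_hat = X[batch] @ w
        grad = X[batch].T @ (y_hat - y[batch])  # sum of per-example gradients
        w -= alpha * grad                       # gradient step per mini-batch
print(w)  # close to [1.0, -2.0, 0.5]
```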

Page 32:

Linear SVM - Max Margin

• Support vectors are “closest” to the hyperplane
• If the support vectors change, the classifier changes

Page 33:

SVM with Kernels

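An illustrative contrast between linear and kernel SVMs (sklearn on concentric-circles data; the dataset and default hyperparameters are my choices, not the course's):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # implicit feature map via the kernel

print("linear SVM accuracy:", linear.score(X, y))  # poor: not separable
print("RBF SVM accuracy:   ", rbf.score(X, y))     # near 1.0
# rbf.support_vectors_ holds the points the solution depends on
```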

Page 34:

SVM Topics

• Linear SVM
  – Maximum margin
  – Error budget
  – Solution depends only on support vectors

• Kernel SVM
  – Examples of kernels

Page 35:

Comparing classifiers

Algorithm           | Interpretable | Model size | Predictive accuracy | Training time | Testing time
--------------------|---------------|------------|---------------------|---------------|-------------
Logistic regression | High          | Small      | Lower               | Low           | Low
kNN                 | Medium        | Large      | Lower               | No training   | High
LDA                 | Medium        | Small      | Lower               | Low           | Low
Decision trees      | High          | Medium     | Lower               | Medium        | Low
Ensembles           | Low           | Large      | High                | High          | High
Naïve Bayes         | Medium        | Small      | Lower               | Medium        | Low
SVM                 | Medium        | Small      | High                | High          | Low
Neural Networks     | Low           | Large      | High                | High          | Low

Page 36:
