
DS 4400

Alina Oprea, Associate Professor, CCIS, Northeastern University

December 2, 2019

Machine Learning and Data Mining I

Logistics
• Final exam
  – Wed, Dec 4, 2:50-5pm, in class
  – Office hours: after class today, Tuesday 1-2pm
  – No office hours on Wed

• Final project
  – Presentations: Mon, Dec 9, 1-5pm, ISEC 655
  – 8 minutes per team
  – Final report due on Gradescope on Tue, Dec. 10
  – No late days for the final report! Please submit on time

2

Final Exam Review

What we covered

• Linear classification: Perceptron, Logistic regression, LDA, Linear SVM
• Linear regression
• Non-linear classification: kNN, Decision trees, Kernel SVM, Naïve Bayes
• Deep learning: Feed-forward Neural Nets, Convolutional Neural Nets, Architectures, Forward/back propagation
• Ensembles: Bagging, Random forests, Boosting, AdaBoost
• Metrics, Cross-validation, Regularization, Feature selection, Gradient Descent, Density Estimation
• Linear algebra, Probability and statistics

4

Terminology

5

• Hypothesis space H = {f : X → Y}
• Training data D = {(x_i, y_i)} ⊆ X × Y
• Features: x_i ∈ X
• Labels / response variables: y_i ∈ Y
  – Classification: discrete, y_i ∈ {0, 1}
  – Regression: y_i ∈ ℝ
• Loss function L(f, D)
  – Measures how well f fits the training data
• Training algorithm: find hypothesis f̂ : X → Y
  – f̂ = argmin_{f ∈ H} L(f, D)
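To make the argmin concrete, here is a minimal sketch (my own example, not from the slides) where H is the set of one-dimensional threshold classifiers, L is the 0-1 loss on D, and f̂ is found by brute force:

```python
import numpy as np

# Toy training data D = {(x_i, y_i)}: 1-D features, binary labels
X = np.array([0.5, 1.2, 2.3, 3.1, 3.8, 4.5])
y = np.array([0,   0,   0,   1,   1,   1  ])

def zero_one_loss(f, X, y):
    """L(f, D): fraction of training examples that f misclassifies."""
    return np.mean(f(X) != y)

# Hypothesis space H: threshold classifiers f_t(x) = 1[x >= t]
thresholds = np.linspace(X.min(), X.max(), 100)
hypotheses = [lambda x, t=t: (x >= t).astype(int) for t in thresholds]

# Training algorithm: f_hat = argmin_{f in H} L(f, D)
f_hat = min(hypotheses, key=lambda f: zero_one_loss(f, X, y))
print(zero_one_loss(f_hat, X, y))  # 0.0 on this separable toy set
```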

Supervised Learning: Classification

[Pipeline figure] Training: labeled data (x_i, y_i), y_i ∈ {0, 1}, goes through data pre-processing (normalization, feature selection) and feature extraction to train a learning model f(x). Testing: new unlabeled data x′ is passed through the learned model f to produce predictions y′ = f(x′) ∈ {0, 1} (positive / negative).

6

Supervised Learning: Regression

[Pipeline figure] Training: labeled data (x_i, y_i), y_i ∈ ℝ, goes through data pre-processing (normalization, feature selection) and feature extraction to train a learning model f(x). Testing: new unlabeled data x′ is passed through the learned model f to produce the predicted response y′ = f(x′) ∈ ℝ.

7

Methods for Feature Selection

• Wrappers
  – Select the subset of features that gives the best prediction accuracy (using cross-validation)
  – Model-specific
• Filters
  – Compute a statistical metric per feature (e.g., correlation coefficient, mutual information)
  – Select features whose statistic is higher than a threshold
• Embedded methods
  – Feature selection is done as part of training
  – Example: regularization (Lasso, L1 regularization); see the sketch below

8
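A minimal sketch of the filter and embedded approaches from the list above, assuming scikit-learn is available; the synthetic dataset, k=5, and C=0.1 are illustrative choices, not from the slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Filter: score each feature by mutual information with y, keep the top k
filt = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print("filter keeps features:", np.where(filt.get_support())[0])

# Embedded: L1 (Lasso-style) regularization zeroes out weak features during training
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("embedded keeps features:", np.where(clf.coef_[0] != 0)[0])
```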

Decision Tree Learning

9

Learning Decision Trees

10

The ID3 algorithm uses Information Gain to choose splits; Information Gain measures how much a split reduces uncertainty about Y.
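A small sketch of entropy and information gain for a binary split, written from the standard definitions (the helper names are mine, not the slides').

```python
import numpy as np

def entropy(y):
    """H(Y) = -sum_c p(c) log2 p(c), estimated from label counts."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, split_mask):
    """IG = H(Y) - H(Y | split): reduction in uncertainty about Y."""
    n = len(y)
    left, right = y[split_mask], y[~split_mask]
    h_cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - h_cond

y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
x = np.array([1, 2, 3, 6, 7, 8, 9, 2])
print(information_gain(y, x <= 3))  # 1.0 bit: this split fully explains Y
```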

Regression Trees

Predict the average response of the training data at each leaf.

[Split figure] A split on feature X_j at threshold s creates regions R_1(j, s) = {X : X_j ≤ s} and R_2(j, s) = {X : X_j ≥ s}; the leaf in R_1 predicts c_1 and the leaf in R_2 predicts c_2 (the region means). The split (j, s) is chosen to minimize the sum of squared residuals over both branches.

11
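A sketch of that split search: for each feature j and threshold s, compute the region means c_1, c_2 and the total squared residual, and keep the (j, s) with the minimum. Exhaustively trying every observed value as a threshold is my implementation choice, not something stated on the slide.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, rss) minimizing the sum of squared residuals over both branches."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # Each leaf predicts the mean response of its region (c_1, c_2)
            rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if rss < best[2]:
                best = (j, s, rss)
    return best

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(X, y))  # splits feature 0 at s = 3.0
```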

Decision trees topics

• Entropy, conditional entropy
• Information gain
• How to train a decision tree
  – Recursive algorithm
  – Impurity metrics (Gini, information gain)
• How to evaluate a decision tree
• Strategies to prevent overfitting
  – Pruning

12

Ensembles

13

[Figure: the ensemble's prediction is the majority vote of its classifiers]

Bagging

14

Random Forests

15

Trees are de-correlated by choosing a random subset of features at each split (bagging with bootstrap samples and a majority vote is sketched below).
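A minimal bagging sketch, assuming scikit-learn decision trees as the base learners; the 25 trees and the dataset are illustrative. Random forests additionally restrict each split to a random feature subset (e.g., RandomForestClassifier(max_features="sqrt") in scikit-learn).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# Bagging: train each tree on a bootstrap sample (drawn with replacement)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote over the ensemble's predictions
votes = np.stack([t.predict(X) for t in trees])       # shape (25, n_samples)
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the vote:", np.mean(y_hat == y))
```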

Overview of AdaBoost

16

• Better classifiers will get higher weights
• Mis-classified examples get higher weights
• Correct examples get lower weights

[Figure: boosting rounds produce weak classifiers h_1(x), h_2(x), h_3(x), h_4(x), starting from uniform example weights]
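A sketch of the reweighting described above, using labels in {-1, +1} and decision stumps as the weak learners; the stump choice and the dataset are assumptions, while the error, classifier-weight, and example-weight formulas are the standard AdaBoost ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=200, n_features=5, random_state=0)
y = 2 * y01 - 1                      # AdaBoost works with labels in {-1, +1}
w = np.full(len(y), 1 / len(y))      # start from uniform example weights

stumps, alphas = [], []
for t in range(10):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    eps = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)  # weighted error
    alpha = 0.5 * np.log((1 - eps) / eps)   # better classifiers get higher weight
    w = w * np.exp(-alpha * y * pred)       # up-weight mistakes, down-weight correct
    w = w / w.sum()
    stumps.append(h)
    alphas.append(alpha)

# Final classifier: sign of the weighted vote of the weak classifiers
F = np.sign(sum(a * h.predict(X) for a, h in zip(alphas, stumps)))
print("training accuracy:", np.mean(F == y))
```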

AdaBoost

17

Ensemble Topics

• Bagging
  – Parallel training
  – Bootstrap samples
  – Out-of-bag (OOB) error
  – Random forest
• Boosting
  – Sequential training
  – Training with weights
  – AdaBoost

18

Neural Network Architectures

19

Feed-Forward Networks
• Neurons from each layer connect to neurons in the next layer

Convolutional Networks
• Include convolution layers for feature reduction
• Learn hierarchical representations

Recurrent Networks
• Keep hidden state
• Have cycles in the computational graph

Feed-Forward Neural Network

20

[Network figure] A training example x = (x_1, x_2, x_3) enters Layer 0; Layer 1 computes hidden activations a_1^[1], a_2^[1], a_3^[1], ... using parameters (W^[1], b^[1]); Layer 2 computes the output a_1^[2] using (W^[2], b^[2]). There are no cycles. The full parameter set is θ = (b^[1], W^[1], b^[2], W^[2]).
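A minimal forward-propagation sketch for a two-layer network with θ = (b^[1], W^[1], b^[2], W^[2]); the layer sizes (3 inputs, 4 hidden units, 1 output) and the random parameter values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Parameters theta = (b1, W1, b2, W2) for a 3 -> 4 -> 1 network
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

x = np.array([[0.5], [1.0], [-0.5]])   # training example x = (x1, x2, x3)

# Forward propagation: linear step + activation at each layer, no cycles
a1 = sigmoid(W1 @ x + b1)              # hidden activations a^[1]
a2 = sigmoid(W2 @ a1 + b2)             # output a^[2] = y_hat
print(a2)
```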

21

[Figure: the sigmoid and softmax activation functions]

Softmax classifier

22

• Predict the class with the highest probability
• Generalization of sigmoid / logistic regression to multi-class problems
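A small softmax sketch illustrating both bullets: convert class scores to probabilities, then predict the argmax. The shift by the maximum score is a standard numerical-stability trick, not something from the slides.

```python
import numpy as np

def softmax(z):
    """Map a vector of class scores to probabilities that sum to 1."""
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # scores for 3 classes
p = softmax(scores)
print(p, "predicted class:", np.argmax(p))
# With 2 classes, softmax reduces to the sigmoid / logistic model
```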

Feed-Forward Neural Networks Topics

• Forward propagation
  – Linear operations and activations
• Activation functions
  – Examples, non-linearity
• Design networks for simple operations
• Estimate the number of parameters
  – Count both weights and biases (a counting sketch follows below)
• Activations for binary / multi-class classification
• Regularization (dropout and weight decay)

23
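A counting sketch for the parameter-estimation bullet above: a fully connected layer with n_in inputs and n_out outputs has n_in · n_out weights plus n_out biases. The example layer sizes are mine.

```python
def count_parameters(layer_sizes):
    """Total weights + biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # W^[l] has n_in*n_out entries, b^[l] has n_out
    return total

print(count_parameters([3, 4, 1]))  # (3*4 + 4) + (4*1 + 1) = 21
```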

Representing Boolean Functions

24

[Figure: a logistic unit whose weights (? ? ?) are to be chosen to implement a Boolean function]

Representing Boolean Functions

25

[Figure: a logistic unit computing NOT (x_1 AND x_2)]

XOR with 1 Hidden layer

26

h_1 = x_1 OR x_2
h_2 = NOT (x_1 AND x_2)
y = h_1 AND h_2

x_1 XOR x_2 = (x_1 OR x_2) AND (NOT (x_1 AND x_2))
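A sketch of this XOR construction with sigmoid units; the specific weights (±20, with biases -10, +30, -30) are the classic textbook choice for OR, NAND, and AND, an assumption since the slide's figure is not recoverable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_net(x1, x2):
    h1 = sigmoid(-10 + 20 * x1 + 20 * x2)   # h1 ~ x1 OR x2
    h2 = sigmoid( 30 - 20 * x1 - 20 * x2)   # h2 ~ NOT (x1 AND x2)
    y  = sigmoid(-30 + 20 * h1 + 20 * h2)   # y  ~ h1 AND h2
    return round(y)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # 0, 1, 1, 0: XOR
```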

Convolutional Nets

27

Convolutional Nets

28

Topics
• Input and output size (see the sizing sketch below)
• Number of parameters
• Convolution operation (stride, pad)
• Max pooling
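A small sizing sketch for these topics, using the standard formula output = floor((n + 2·pad − f) / stride) + 1 for both convolution and pooling; the example numbers are illustrative.

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a convolution (or pooling) layer."""
    return (n + 2 * pad - f) // stride + 1

def conv_parameters(f, channels_in, channels_out):
    """f x f filters over all input channels, plus one bias per output channel."""
    return f * f * channels_in * channels_out + channels_out

print(conv_output_size(32, f=5, stride=1, pad=0))          # 28
print(conv_output_size(28, f=2, stride=2))                 # 14 (2x2 max pooling)
print(conv_parameters(5, channels_in=3, channels_out=16))  # 1216
```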

Training NN with Backpropagation

29

Given a training set (x_1, y_1), ..., (x_n, y_n)
Initialize all parameters W^[ℓ], b^[ℓ] randomly, for all layers ℓ
Loop (each pass over the training data is one EPOCH):
  Run forward propagation and then backpropagation to compute gradients
  Update weights via a gradient step:
    W_ij^[ℓ] = W_ij^[ℓ] − α ∂J/∂W_ij^[ℓ], where J is the training loss averaged over the n examples
    Similarly for b_i^[ℓ]
Until weights converge or the maximum number of epochs is reached
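A minimal sketch of this training loop for a one-hidden-layer network with sigmoid activations and cross-entropy loss; the layer sizes, learning rate, target function, and epoch count are my illustrative choices, and the gradient formulas are the standard ones for this architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))                    # 2 features x n = 200 examples
y = (X[0] * X[1] > 0).astype(float)[None, :]     # a non-linear target

W1, b1 = rng.normal(size=(8, 2)) * 0.5, np.zeros((8, 1))
W2, b2 = rng.normal(size=(1, 8)) * 0.5, np.zeros((1, 1))
alpha, n = 0.5, X.shape[1]
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):                        # one full-batch pass = one epoch
    # Forward propagation
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)                   # y_hat
    # Backpropagation: gradients of the average cross-entropy loss
    dZ2 = A2 - y
    dW2 = dZ2 @ A1.T / n
    db2 = dZ2.mean(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
    dW1 = dZ1 @ X.T / n
    db1 = dZ1.mean(axis=1, keepdims=True)
    # Gradient step: W = W - alpha * dJ/dW, and similarly for b
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2

print("training accuracy:", np.mean((A2 > 0.5) == y))
```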

Stochastic Gradient Descent

• Initialization
  – For all layers ℓ: set W^[ℓ], b^[ℓ] at random
• Backpropagation
  – Fix learning rate α
  – For all layers ℓ (starting backwards):
    • For each training example (x_i, y_i):
      – W^[ℓ] = W^[ℓ] − α ∂L(ŷ_i, y_i)/∂W^[ℓ]
      – b^[ℓ] = b^[ℓ] − α ∂L(ŷ_i, y_i)/∂b^[ℓ]

30

Incremental version of GD

Mini-batch Gradient Descent

• Initialization
  – For all layers ℓ: set W^[ℓ], b^[ℓ] at random
• Backpropagation
  – Fix learning rate α
  – For all layers ℓ (starting backwards):
    • For each batch b of size B with training examples (x_ib, y_ib):
      – W^[ℓ] = W^[ℓ] − α Σ_{i=1}^{B} ∂L(ŷ_ib, y_ib)/∂W^[ℓ]
      – b^[ℓ] = b^[ℓ] − α Σ_{i=1}^{B} ∂L(ŷ_ib, y_ib)/∂b^[ℓ]

31
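A sketch of the mini-batch update pattern on a simple linear model, where the gradient is easy to write down; B = 16 and the synthetic data are illustrative choices, and setting B = 1 recovers the per-example (stochastic) version from the previous slide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=500)

w, alpha, B = np.zeros(3), 0.05, 16
for epoch in range(20):
    order = rng.permutation(len(X))                    # shuffle once per epoch
    for start in range(0, len(X), B):
        idx = order[start:start + B]                   # one mini-batch of size <= B
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)     # per-example gradients averaged over the batch
        w = w - alpha * grad                           # gradient step on the batch

print(w)   # close to w_true
```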

Linear SVM - Max Margin

32

• Support vectors are the training points "closest" to the hyperplane
• If the support vectors change, the classifier changes

SVM with Kernels

33

SVM Topics

• Linear SVM
  – Maximum margin
  – Error budget
  – Solution depends only on the support vectors
• Kernel SVM
  – Examples of kernels (a small sketch follows below)

34
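A small kernel-SVM sketch, assuming scikit-learn; it fits linear and RBF-kernel SVMs on a dataset that is not linearly separable and reports the support vectors, which are the only points the solution depends on.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)   # max-margin linear SVM
rbf = SVC(kernel="rbf", C=1.0).fit(X, y)         # kernel SVM handles the circular boundary

print("linear training accuracy:", linear.score(X, y))
print("rbf training accuracy:   ", rbf.score(X, y))
print("support vectors per class:", rbf.n_support_)  # the classifier depends only on these
```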

Comparing classifiers

Algorithm           | Interpretable | Model size | Predictive accuracy | Training time | Testing time
--------------------|---------------|------------|---------------------|---------------|-------------
Logistic regression | High          | Small      | Lower               | Low           | Low
kNN                 | Medium        | Large      | Lower               | No training   | High
LDA                 | Medium        | Small      | Lower               | Low           | Low
Decision trees      | High          | Medium     | Lower               | Medium        | Low
Ensembles           | Low           | Large      | High                | High          | High
Naïve Bayes         | Medium        | Small      | Lower               | Medium        | Low
SVM                 | Medium        | Small      | High                | High          | Low
Neural Networks     | Low           | Large      | High                | High          | Low

35
