

Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
September 27, 2012

Today:
•  Artificial neural networks
•  Backpropagation
•  Cognitive modeling
•  Deep belief networks

Reading:
•  Mitchell: Chapter 4
•  Bishop: Chapter 5

Optional:
•  Le: ICML 2012 paper (on website)

Logistic Regression is a linear classifier

[slide: Aarti Singh]


Artificial Neural Networks to learn f: X → Y

•  f might be a non-linear function
•  X is a (vector of) continuous and/or discrete variables
•  Y is a (vector of) continuous and/or discrete variables

•  Represent f by a network of logistic units
•  Each unit is a logistic function (see the formula after this list)

•  MLE: train weights of all units to minimize sum of squared errors of predicted network outputs

•  MAP: train to minimize sum of squared errors plus a penalty on (squared) weight magnitudes
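Concretely, each unit computes a logistic (sigmoid) function of a weighted sum of its inputs (standard definition; w_0 denotes the unit's bias weight):

o(x) = \sigma\Bigl(w_0 + \sum_i w_i x_i\Bigr), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}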


ALVINN [Pomerleau 1993]


M(C)LE Training for Neural Networks

•  Consider the regression problem f: X → Y, for scalar Y:

   y = f(x) + ε,   where f(x) is deterministic and the noise ε is iid N(0, σ_ε)

•  Let's maximize the conditional data likelihood with respect to the weights of the learned neural network


MAP Training for Neural Networks

•  Consider the regression problem f: X → Y, for scalar Y:

   y = f(x) + ε,   where f(x) is deterministic and the noise ε is N(0, σ_ε)

•  Gaussian prior over weights: P(W) = N(0, σI), so ln P(W) ↔ c ∑_i w_i²
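Adding this log prior to the conditional log likelihood turns the MLE objective into regularized (weight-decay) least squares, where c absorbs the variance constants:

W_{MAP} = \arg\max_W \Bigl[\, \ln P(W) + \sum_d \ln P(y_d \mid x_d, W) \,\Bigr]
        = \arg\min_W \Bigl[\, \sum_d \bigl(y_d - \hat{f}(x_d; W)\bigr)^2 + c \sum_i w_i^2 \,\Bigr]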

Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i
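In this notation, the standard gradient descent rule for training a single sigmoid unit o_d = σ(w · x_d) on the squared error E = \tfrac{1}{2}\sum_d (t_d - o_d)^2 is (η is an assumed learning rate; x_{i,d} is the i-th component of input x_d):

\frac{\partial E}{\partial w_i} = -\sum_d (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d},
\qquad
\Delta w_i = \eta \sum_d (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}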


Notation (for the MLE, i.e. squared-error, objective): x_d = input, t_d = target output, o_d = observed unit output, w_ij = weight from unit i to unit j
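To make the backpropagation training procedure concrete, here is a minimal sketch in Python for a one-hidden-layer network of sigmoid units trained by stochastic gradient descent on the squared-error (MLE) objective. The interface, learning rate eta, and initialization scale are assumptions rather than anything specified on the slides, and bias weights are omitted for brevity.

________________________________________________
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_sgd(X, T, n_hidden=3, eta=0.1, n_iters=1000, seed=0):
    # X: (num examples, num inputs), T: (num examples, num outputs)
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    # small random initial weights (all-zero initialization would keep hidden units identical)
    W_ih = rng.normal(0.0, 0.1, (n_in, n_hidden))   # w_ij: input i -> hidden j
    W_ho = rng.normal(0.0, 0.1, (n_hidden, n_out))  # w_jk: hidden j -> output k
    for _ in range(n_iters):
        for x_d, t_d in zip(X, T):
            # forward pass: hidden and output unit activations
            h = sigmoid(x_d @ W_ih)
            o_d = sigmoid(h @ W_ho)
            # backward pass: error terms for sigmoid units under squared error
            delta_o = (t_d - o_d) * o_d * (1 - o_d)
            delta_h = h * (1 - h) * (W_ho @ delta_o)
            # gradient descent weight updates
            W_ho += eta * np.outer(h, delta_o)
            W_ih += eta * np.outer(x_d, delta_h)
    return W_ih, W_ho
________________________________________________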


Dealing with Overfitting

Our learning algorithm involves a parameter n = number of gradient descent iterations.

How do we choose n to optimize future error, i.e. the n that minimizes the error rate of the neural net over future data? (Note: a similar issue arises for logistic regression, decision trees, …)


•  Separate the available data into a training set and a validation set
•  Use the training set to perform gradient descent
•  n ← number of iterations that optimizes validation set error

→ gives an unbiased estimate of the optimal n (but a biased estimate of the true error)
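A minimal sketch of this procedure in Python; train_one_epoch and validation_error are a hypothetical model interface, not something defined in the slides:

________________________________________________
def choose_n_by_validation(model, train_data, valid_data, max_iters=500):
    # return the number of gradient descent iterations n that minimizes
    # error on the held-out validation set
    best_n, best_err = 0, float("inf")
    for n in range(1, max_iters + 1):
        model.train_one_epoch(train_data)          # hypothetical: one more gradient descent pass
        err = model.validation_error(valid_data)   # hypothetical: error rate on the validation set
        if err < best_err:
            best_n, best_err = n, err
    return best_n
________________________________________________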


K-Fold Cross Validation

Idea: train multiple times, leaving out a disjoint subset of data each time for test. Average the test set accuracies.
________________________________________________
Partition data into K disjoint subsets
For k = 1 to K
    testData = kth subset
    h ← classifier trained* on all data except testData
    accuracy(k) = accuracy of h on testData
end
FinalAccuracy = mean of the K recorded test set accuracies

* might withhold some of this data to choose the number of gradient descent steps
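A sketch of the same procedure in Python, assuming numpy arrays X, y and generic train_fn / accuracy_fn callables (none of which come from the slides):

________________________________________________
import numpy as np

def k_fold_cv(X, y, train_fn, accuracy_fn, K=10, seed=0):
    # partition the data into K disjoint subsets, train K times
    # (each time testing on a different held-out fold), then average
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    accuracies = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        h = train_fn(X[train_idx], y[train_idx])                      # hypothetical trainer
        accuracies.append(accuracy_fn(h, X[test_idx], y[test_idx]))   # hypothetical scorer
    return float(np.mean(accuracies))
________________________________________________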

Leave-One-Out Cross Validation

This is just K-fold cross validation leaving out one example each iteration.
________________________________________________
Partition data into K disjoint subsets, each containing one example
For k = 1 to K
    testData = kth subset
    h ← classifier trained* on all data except testData
    accuracy(k) = accuracy of h on testData
end
FinalAccuracy = mean of the K recorded test set accuracies

* might withhold some of this data to choose the number of gradient descent steps
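In terms of the k_fold_cv sketch above, leave-one-out is simply the special case of one example per fold:

________________________________________________
# leave-one-out cross validation: K equals the number of examples
loo_accuracy = k_fold_cv(X, y, train_fn, accuracy_fn, K=len(X))
________________________________________________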


[slide: Aarti Singh]


[figure: learned weight visualization, showing w0 and the network output units left, strt, right, up]

Deep Belief Networks

•  Problem: training networks with many hidden layers doesn't work very well
   –  local minima, very slow training if initialized with zero weights

•  Deep belief networks
   –  autoencoder networks to learn low-dimensional encodings
   –  but more layers, to learn better encodings

[Hinton & Salakhutdinov, 2006]
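One concrete reading of "autoencoder network": train the network to reproduce its own input through a narrow hidden layer, then use the hidden-layer activations as the low-dimensional encoding. The sketch below reuses the backprop_sgd and sigmoid functions from the backpropagation example above; it is an illustration only, since Hinton & Salakhutdinov actually pretrain the layers as stacked Restricted Boltzmann Machines (next slides).

________________________________________________
# autoencoder sketch: the targets are the inputs themselves, so the
# n_hidden-unit hidden layer becomes a low-dimensional code for X
W_ih, W_ho = backprop_sgd(X, T=X, n_hidden=30)
codes = sigmoid(X @ W_ih)   # 30-dimensional encoding of each input row
________________________________________________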


Deep Belief Networks [Hinton & Salakhutdinov, 2006]

[figure: original images versus reconstructions from a 2000-1000-500-30 DBN and from 2000-300 linear PCA]

Encoding of digit images in two dimensions:

[figure: 784-2 linear encoding (PCA) versus 784-1000-500-250-2 DBN]

[Hinton & Salakhutdinov, 2006]


Deep Belief Networks: Training [Hinton & Salakhutdinov, 2006]

Restricted Boltzmann Machine

•  Bipartite graph, logistic activation
•  Inference: fill in any nodes, estimate the other nodes
•  Consider v_i, h_j to be boolean variables

[figure: bipartite graph with visible units v_1, v_2, …, v_n and hidden units h_1, h_2, h_3]
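Concretely, "logistic activation" over boolean units means each side of the bipartite graph is predicted from the other through a sigmoid (a standard RBM property; w_ij is the weight between v_i and h_j, and a_i, b_j are assumed bias terms):

P(h_j = 1 \mid \mathbf{v}) = \sigma\Bigl(b_j + \sum_i w_{ij} v_i\Bigr),
\qquad
P(v_i = 1 \mid \mathbf{h}) = \sigma\Bigl(a_i + \sum_j w_{ij} h_j\Bigr),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}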


Very Large Scale Use of DBNs

Data: 10 million 200x200 unlabeled images, sampled from YouTube
Training: 1,000 machines (16,000 cores) for 1 week
Learned network: 3 multi-stage layers, 1.15 billion parameters
Achieves 15.8% accuracy classifying 1 of 20k categories in ImageNet data

[Quoc Le, et al., ICML, 2012]

Images that most excite the feature:

Image synthesized to most excite the feature:

Semantic Memory Model Based on ANNs [McClelland & Rogers, Nature 2003]

No hierarchy given.

Train with assertions, e.g., Can(Canary,Fly)


Humans act as though they have a hierarchical memory organization

1.  Victims of Semantic Dementia progressively lose knowledge of objects. But they lose specific details first and general properties later, suggesting a hierarchical memory organization.

[figure: example hierarchy: Thing → {Living, NonLiving}; Living → {Animal, Plant}; Animal → {Bird, Fish}; Bird → Canary]

2. Children appear to learn general categories and properties first, following the same hierarchy, top down*.

* some debate remains on this.

Question: What learning mechanism could produce this emergent hierarchy?

Memory deterioration follows semantic hierarchy [McClelland & Rogers, Nature 2003]


ANN Also Models Progressive Deterioration [McClelland & Rogers, Nature 2003]

[figure: average effect of noise in inputs to hidden layers]


What you should know: Artificial Neural Networks

•  Highly non-linear regression/classification
•  Vector-valued inputs and outputs
•  Potentially millions of parameters to estimate
•  Hidden layers learn intermediate representations
•  Actively used to model distributed computation in the brain
•  Gradient descent, local minima problems
•  Overfitting and how to deal with it
•  Many extensions
