Noriko Tomuro
CSC 578 Neural Networks and Deep Learning
9. Hopfield Networks, Boltzmann Machines
Unsupervised Neural Networks
1. Hopfield Networks
   1. Concepts
2. Boltzmann Machines
   1. Concepts
   2. Restricted Boltzmann Machines
   3. Deep Boltzmann Machines
1 Hopfield Network
• A Hopfield network is a form of recurrent artificial neural network. It is also one of the oldest neural networks.
• There are many variations. The one presented here is a discrete network which takes bipolar inputs (1 or -1).
• A Hopfield network stores patterns, then recovers the stored patterns from partial or corrupted versions of them. In this sense it acts as an associative memory.
• Hopfield networks have also been applied to combinatorial optimization problems, e.g. the Traveling Salesman Problem.
Overview of Hopfield Network
• Weights between units are bidirectional (thus a "feedback" or "recurrent" network). The network is fully connected, but with no self-loop weights, i.e., wii = 0.
• Each unit/node represents a neuron. The activation of a neuron is a thresholded weighted sum of the other units' values (shown here for the two-state network; similar to a thresholded perceptron).
• If xi is 1, a unit is called "active", and if -1, it is called "inactive".
• Every neuron functions as both input and output unit.
[Figure: a three-unit Hopfield network with units x1, x2, x3 and symmetric weights W21 = W12, W31 = W13, W23 = W32.]
$$x_i = \begin{cases} 1 & \text{if } \sum_j w_{ij}\, x_j \ge 0, \\ -1 & \text{otherwise.} \end{cases}$$
• A state of a network is defined by the activation of its nodes, <x1,..,xn>.
• Given a set of (current) weights, values of nodes are updated asynchronously (parallel relaxation).
  1. Pick a node randomly, and compute the new activation for that node (a node fires if it becomes 1).
  2. Repeat the procedure until no node changes value.
• Then the network settles into one of the stable states.
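The relaxation procedure above can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own code: the function names, the example weight matrix and the `max_sweeps` safeguard are assumptions made for illustration. To detect "no node changes value", the sketch visits every node once per sweep in a random order instead of drawing single nodes indefinitely.

```python
import random

def activation(weights, state, i):
    """Thresholded sum for unit i: 1 if sum_j w_ij * x_j >= 0, else -1."""
    total = sum(weights[i][j] * state[j] for j in range(len(state)) if j != i)
    return 1 if total >= 0 else -1

def relax(weights, state, rng=None, max_sweeps=100):
    """Asynchronously update units in random order until no unit changes."""
    rng = rng or random.Random(0)
    state = list(state)
    for _ in range(max_sweeps):
        order = list(range(len(state)))
        rng.shuffle(order)               # random visiting order for this sweep
        changed = False
        for i in order:
            new_value = activation(weights, state, i)
            if new_value != state[i]:
                state[i] = new_value     # update takes effect immediately
                changed = True
        if not changed:                  # a full sweep with no change: stable
            break
    return state

# Hypothetical symmetric weights with a zero diagonal (not taken from the slides).
W = [[0, 1, -1],
     [1, 0, -1],
     [-1, -1, 0]]
stable = relax(W, [1, -1, -1])
print(stable)
```

Which stable state is reached can depend on the order in which nodes are visited, so the sketch fixes the random seed to keep runs repeatable.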
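[Figure: worked example of asynchronous updates on a three-unit network (x1, x2, x3) with edge weights -2, 1 and 1. The unit values shown at each step are summarized below.]
(0) Pattern <-1, -1, -1> presented.
(1) After activating x2: x2 becomes 1; the other units stay at -1.
(2) After activating x1: no change.
(3) After activating x3: x3 becomes 1.
(4), (5), (6) After activating x1 through x3 again, no more state change occurs.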
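The steps of this example can be replayed in code. The assignment of the three weights to edges is an assumption here: the sketch reads the figure as w12 = -2, w13 = 1 and w23 = 1, which is the reading that reproduces the sequence of states listed in the example.

```python
# Assumed reading of the figure: w12 = -2, w13 = 1, w23 = 1
# (symmetric weights, no self-loops).
W = [[0, -2, 1],
     [-2, 0, 1],
     [1, 1, 0]]

def activate(state, i):
    """Return 1 if the weighted sum of the other units is >= 0, else -1."""
    total = sum(W[i][j] * state[j] for j in range(3) if j != i)
    return 1 if total >= 0 else -1

state = [-1, -1, -1]                 # step (0): pattern presented
trace = [tuple(state)]
# Units are 0-based here: x2, x1, x3, then x1 through x3 again.
for unit in (1, 0, 2, 0, 1, 2):
    state[unit] = activate(state, unit)
    trace.append(tuple(state))

for step, s in enumerate(trace):
    print(step, s)
```

Under this reading the state stops changing after step (3), at <-1, 1, 1>.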
• It has been proven that a Hopfield network with symmetric weights and asynchronous updates will always converge to a stable state.
• Depending on the weights and input values, there may be several states to which the network converges.
• The change of the network state is essentially a search through the possible state space.
• Closeness to a stable state is measured by the notion of energy:
$$E = -\left( \sum_{i<j} w_{ij}\, s_i\, s_j + \sum_i \theta_i\, s_i \right)$$
  – $w_{ij}$ is the connection weight between unit j and unit i.
  – $s_i$ is the state of unit i, $s_i \in \{0,1\}$ (or $\{-1, 1\}$ for the bipolar network presented here).
  – $\theta_i$ is the bias of unit i ($-\theta_i$ is the activation threshold for the unit).
• When a node is activated, the change in the energy is always <= 0.
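As an illustration of the claim that the energy never increases, the sketch below evaluates the energy (pairwise weight term plus bias term, negated, with each θ treated as a bias) along a short sequence of bipolar states. The weights and the state sequence are assumptions: they follow one reading of the three-unit example (w12 = -2, w13 = 1, w23 = 1) with all biases set to zero.

```python
def energy(weights, state, theta=None):
    """E = -(sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i)."""
    n = len(state)
    theta = theta or [0.0] * n           # assumption: zero biases by default
    pair_term = sum(weights[i][j] * state[i] * state[j]
                    for i in range(n) for j in range(i + 1, n))
    bias_term = sum(theta[i] * state[i] for i in range(n))
    return -(pair_term + bias_term)

# Assumed weights for the three-unit example.
W = [[0, -2, 1],
     [-2, 0, 1],
     [1, 1, 0]]

# States visited when x2 and then x3 are activated, starting from <-1, -1, -1>.
states = [(-1, -1, -1), (-1, 1, -1), (-1, 1, 1)]
energies = [energy(W, s) for s in states]
print(energies)
```

The last update leaves the energy unchanged because the weighted sum for x3 is exactly 0, which is why the bound is <= 0 and not < 0.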
• So, searching for a stable state is a minimization problem and, as with Gradient Descent, the search moves downhill in energy to find a minimum.
• However, there is the danger of getting stuck in a local minimum: the network converges to the closest local minimum.
• Capacity limitation of Hopfield networks: it has been shown that Hopfield networks can memorize only a limited number of patterns.
• A newer result showed that the ratio of recallable vectors to nodes is about 0.138 (approximately 138 vectors can be recalled from storage for every 1000 nodes) (Hertz et al., 1991).
• A Hopfield network with N nodes can store M patterns, where
  – M = 0.15N (for binary network), or
  – M = N / (2 log2 N) (for bipolar network)
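To get a feel for these limits, the following sketch evaluates the three estimates for a given number of nodes. The function name and the dictionary keys are illustrative only, and the bipolar formula is read as N divided by (2 log2 N).

```python
import math

def capacity_estimates(n_nodes):
    """Rough numbers of storable patterns under the estimates quoted above."""
    return {
        "0.138 N": 0.138 * n_nodes,
        "0.15 N (binary)": 0.15 * n_nodes,
        "N / (2 log2 N) (bipolar)": n_nodes / (2 * math.log2(n_nodes)),
    }

estimates = capacity_estimates(1000)
for name, value in estimates.items():
    print(f"{name}: {value:.1f}")
```

For 1000 nodes the linear estimates give roughly 138 to 150 patterns, while the logarithmic bound gives about 50.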
• Training for Hopfield networks:
  Values in the input patterns are bipolar, e.g. <1, -1, 1>.
  Weights are updated incrementally.
  We basically want the stored patterns to be the stable states.

0. Initialize network weights.
1. Do until no change in the weights occurs:
2.   Initialize the delta_w's to 0.0.
3.   For each pattern d in the training set, do:
4.     Present the pattern to the network.
5.     For each node xi,
6.       if xi's activation is different from input xi,
7.         update the weights connected to xi.

And the weight update for node xi is
$$\Delta w_{ij} \leftarrow \eta \cdot x_i \cdot x_j$$
where eta ($\eta$) is the learning rate, and xi, xj are the values in the input pattern d.
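The training loop can be sketched as follows. Several details are assumptions not fixed by the slide: weights start at zero, each update is applied immediately instead of being accumulated in delta_w, the update is mirrored so that the weights stay symmetric, and a hypothetical `max_epochs` guard stops the loop if the patterns cannot all be made stable.

```python
def activation(weights, pattern, i):
    """Thresholded sum for node i: 1 if sum_j w_ij * x_j >= 0, else -1."""
    total = sum(weights[i][j] * pattern[j] for j in range(len(pattern)) if j != i)
    return 1 if total >= 0 else -1

def train(patterns, eta=1.0, max_epochs=100):
    """Adjust weights until every stored pattern is a stable state."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]          # step 0 (assumed zeros)
    for _ in range(max_epochs):                      # step 1
        changed = False
        for pattern in patterns:                     # steps 3-4
            for i in range(n):                       # step 5
                if activation(weights, pattern, i) != pattern[i]:   # step 6
                    for j in range(n):               # step 7
                        if j != i:
                            delta = eta * pattern[i] * pattern[j]
                            weights[i][j] += delta
                            weights[j][i] += delta   # keep w_ij = w_ji
                    changed = True
        if not changed:
            return weights, True                     # all patterns are stable
    return weights, False                            # gave up after max_epochs

patterns = [[1, -1, 1, -1], [1, 1, -1, -1]]
weights, converged = train(patterns)
print(converged)
for row in weights:
    print(row)
```

After training, presenting either stored pattern leaves every node unchanged, which is what makes the pattern a stable state.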
Modern Hopfield networks:
• Instead of using the net to store memories, we use it to construct interpretations of sensory input. The input is represented by the visible units, the interpretation is represented by the states of the hidden units, and the badness of the interpretation is represented by the energy.
[Video by Geoff Hinton, 2012]
Summary: Hopfield networks
• suffer from spurious local minima that form on the energy hypersurface.
• require the input patterns to be uncorrelated.
• are limited in capacity of patterns that can be stored.
• are usually fully connected and not stacked.
2 Boltzmann Machines
• A Boltzmann machine (also called stochastic Hopfield network with hidden units) is a type of stochastic recurrent neural network (and Markov random field).
• Its units produce binary results. Unlike Hopfield nets, Boltzmann machine units are stochastic. [Wikipedia]
A graphical representation of an example Boltzmann machine. Each undirected edge represents dependency. In this example there are 3 hidden units and 4 visible units.
• The global energy in a Boltzmann machine is identical to that of a Hopfield network:
$$E = -\left( \sum_{i<j} w_{ij}\, s_i\, s_j + \sum_i \theta_i\, s_i \right)$$
where
  – $w_{ij}$ is the connection weight between unit j and unit i.
  – $s_i$ is the state of unit i, $s_i \in \{0,1\}$.
  – $\theta_i$ is the bias of unit i ($-\theta_i$ is the activation threshold for the unit).
• The goal of the Boltzmann machine learning algorithm is to maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set.
• For the connection between the assigned probabilities and the energy (and temperature) consult this [Wikipedia] page.
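As a rough illustration of how probabilities relate to energy and temperature, the sketch below uses the standard Boltzmann distribution, in which a state's probability is proportional to exp(-E / T). It enumerates all states of a tiny network, so it only works for a handful of units. The weights, biases and function names are hypothetical, and the energy treats each θ as a bias.

```python
import itertools
import math

def energy(weights, theta, state):
    """E = -(sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i), theta as a bias."""
    n = len(state)
    pair_term = sum(weights[i][j] * state[i] * state[j]
                    for i in range(n) for j in range(i + 1, n))
    return -(pair_term + sum(theta[i] * state[i] for i in range(n)))

def boltzmann_distribution(weights, theta, temperature=1.0):
    """P(s) = exp(-E(s) / T) / Z over all binary states (tiny networks only)."""
    states = list(itertools.product([0, 1], repeat=len(theta)))
    unnormalized = [math.exp(-energy(weights, theta, s) / temperature)
                    for s in states]
    z = sum(unnormalized)                      # partition function
    return {s: u / z for s, u in zip(states, unnormalized)}

def p_unit_on(weights, theta, state, i, temperature=1.0):
    """Probability that unit i turns on, given the current other units."""
    gap = sum(weights[i][j] * state[j]
              for j in range(len(state)) if j != i) + theta[i]
    return 1.0 / (1.0 + math.exp(-gap / temperature))

# Hypothetical three-unit machine (values chosen only for illustration).
W = [[0, 2, -1],
     [2, 0, 1],
     [-1, 1, 0]]
theta = [0.5, -0.5, 0.25]
dist = boltzmann_distribution(W, theta)
print(max(dist, key=dist.get))       # the lowest-energy state is the most probable
print(p_unit_on(W, theta, (1, 0, 1), 1))
```

Raising the temperature flattens the distribution, while lowering it concentrates probability on the low-energy states.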
• Training of Boltzmann machines usually uses KL-divergence, or log likelihood. The loss function G (for binary vectors) is
$$G = \sum_v P^{+}(v) \ln \frac{P^{+}(v)}{P^{-}(v)}$$
where
  – $P^{+}(v)$ is the distribution over the training set V
  – $P^{-}(v)$ is the distribution over the visible (i.e., not hidden) units in the network
We want to minimize G: it is a KL divergence, which is zero only when the network's distribution matches the training distribution. Minimizing G is equivalent to maximizing the log likelihood of the training set.
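A small numeric sketch of G may help. It assumes both distributions are given explicitly as dictionaries over visible vectors, which is feasible only for toy examples. The function name and the example distributions are made up for illustration.

```python
import math

def kl_divergence(p_data, p_model):
    """G = sum_v P+(v) * ln(P+(v) / P-(v)); terms with P+(v) = 0 contribute 0."""
    return sum(p * math.log(p / p_model[v])
               for v, p in p_data.items() if p > 0)

# Toy training distribution over two visible units: only (0,0) and (1,1) occur.
p_plus = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}
# A model that has learned nothing yet: uniform over all four vectors.
p_minus = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

g = kl_divergence(p_plus, p_minus)
print(g)
print(kl_divergence(p_plus, p_plus))
```

G is positive for the uniform model and drops to zero when the model distribution equals the training distribution.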
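• It was discovered that original Boltzmann machines stop learning correctly when the machine is scaled up to anything larger than a trivial machine.
• An architecture called the "restricted Boltzmann machine" or "RBM" addresses this. It was introduced by Paul Smolensky in 1986 (under the name Harmonium) and became widely used after Geoffrey Hinton and collaborators developed fast learning algorithms for it in the mid-2000s.
• An RBM does not allow intralayer connections: hidden units are not connected to each other, and neither are visible units. This type of architecture was shown to make inference and learning easier.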
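One way to see why the restriction helps: with no hidden-to-hidden connections, the hidden units are independent of each other once the visible units are fixed, so each hidden probability can be computed on its own with a sigmoid. The sketch below illustrates this under assumptions. The layer sizes (4 visible, 3 hidden), the weights, the biases and the function names are all hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probabilities(visible, weights, hidden_bias):
    """P(h_j = 1 | v) for each hidden unit j; weights[i][j] links v_i to h_j."""
    return [
        sigmoid(hidden_bias[j]
                + sum(visible[i] * weights[i][j] for i in range(len(visible))))
        for j in range(len(hidden_bias))
    ]

def sample_hidden(visible, weights, hidden_bias, rng):
    """Sample every hidden unit independently, given the visible vector."""
    probs = hidden_probabilities(visible, weights, hidden_bias)
    return [1 if rng.random() < p else 0 for p in probs]

# Hypothetical RBM with 4 visible and 3 hidden units.
weights = [[0.5, -0.2, 0.0],
           [0.1, 0.3, -0.4],
           [-0.3, 0.2, 0.6],
           [0.0, -0.1, 0.2]]
hidden_bias = [0.0, 0.1, -0.1]
v = [1, 0, 1, 1]

probs = hidden_probabilities(v, weights, hidden_bias)
print(probs)
print(sample_hidden(v, weights, hidden_bias, random.Random(0)))
```

In a general Boltzmann machine the hidden units influence one another, so no such one-pass computation is available and sampling has to be run until the network settles.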
Restricted Boltzmann Machines
• RBM Learning:
Deep Restricted Boltzmann Machines
RBM vs. Autoencoder
• An autoencoder is a simple 3-layer neural network where output units are directly connected back to input units. The task of training is to minimize the reconstruction error, i.e. find the most efficient compact representation (encoding) for input data.
• An RBM shares a similar idea, but uses stochastic units with a particular (usually binary or Gaussian) distribution. The task of training is to find out how visible random variables are actually connected/related to hidden random variables.
https://www.quora.com/What-is-the-difference-between-autoencoders-and-a-restricted-Boltzmann-machine
RBM vs. GAN
• Generative adversarial networks (GANs) are a class of artificial intelligence algorithms used in unsupervised machine learning, implemented by a system of two neural networks contesting with each other in a zero-sum game framework.
https://en.wikipedia.org/wiki/Generative_adversarial_network
https://stats.stackexchange.com/questions/338328/restricted-boltzmann-machines-vs-gan