A FPGA Implementation of Large Restricted Boltzmann Machines
by
Charles Lo
Supervisor: Paul Chow
April 2010
Abstract
A FPGA Implementation of Large Restricted Boltzmann Machines
Charles Lo
Engineering Science
2010
Restricted Boltzmann Machines (RBMs) are a type of Artificial Neural Network and the fundamental building blocks of Deep Belief Networks (DBNs) [1]. DBNs have been successfully applied to a number of machine learning problems [2, 3, 1]. However, the $O(n^2)$ complexity of training a RBM presents a serious impediment to their use in large applications. Attempts have been made to accelerate the process using custom FPGA hardware [4, 5], but no implementation has been demonstrated to run RBMs of the 1000-2000 nodes necessary for real world applications. This thesis builds upon a virtualized FPGA architecture presented by Ly et al. [4] with the goal of investigating its scalability towards large RBMs. The virtualized architecture time-multiplexes the hardware resources of a single FPGA to implement large virtual RBMs. To maintain the performance gain of the custom hardware in the presence of context switches, a number of approaches were used. The architecture was ported to a faster, more modern FPGA, the data representation was reduced from 32 bits to 16 bits to increase communication throughput, and coarse grain parallelism was provided by extending the architecture to four FPGAs. A sequential benchmark written in C was used to test the performance of the architecture. The analysis shows a strong dependence of performance on the communication overhead between the supervising microprocessor and the hardware cores. Although very little speedup is possible with the implementation presented, this thesis provides a direction for further improvements to the architecture.
Acknowledgements
I would like to express my gratitude to Professor Paul Chow for giving me the opportunity
to work on this project as well as for his guidance over the course of this thesis. I would
also like to thank Daniel Ly for helping to define the direction of this thesis and always
being available to answer my questions. Finally, I am grateful to Chris Madill, Arun
Patel, Manuel Saldana, Geng Liu and Chu Pang for their assistance during the past
year.
Contents
1 Introduction
2 Background
  2.1 Restricted Boltzmann Machine Operation
    2.1.1 Structure
    2.1.2 Alternating Gibbs Sampling
    2.1.3 Energy
    2.1.4 Learning Rules
    2.1.5 Batch Learning
    2.1.6 Summary
    2.1.7 Deep Belief Networks
  2.2 Methods for accelerating Restricted Boltzmann Machines
3 Virtualized FPGA Architecture
  3.1 Partitioning
  3.2 Computational Cores
    3.2.1 Restricted Boltzmann Machine Core
    3.2.2 Energy Accumulator Core and Node Select Core
  3.3 Message Passing Interface
4 Large Restricted Boltzmann Machine Architecture
  4.1 Investigation of Data Bit Widths
  4.2 Memory and Communication Considerations
    4.2.1 Data Storage
    4.2.2 Communication Overhead
  4.3 Extension to Four FPGAs
5 Results and Analysis
  5.1 Test Methods
    5.1.1 Test Setup
    5.1.2 Test Metric
  5.2 Results
    5.2.1 Batch Size vs. Speedup
    5.2.2 Intrinsic RBM Size
    5.2.3 Summary
6 Conclusion
  6.1 Conclusions
  6.2 Future Work
    6.2.1 Weight Matrix Caching
    6.2.2 Distributed Energy Accumulator Core Structure
Bibliography
A Outline of MicroBlaze Operation
List of Tables
5.1 Summary of Performance Measurements
List of Figures
2.1 Structure of a 3x3 Restricted Boltzmann Machine
2.2 Sigmoid Function
2.3 Layout of a Deep Belief Network
3.1 Weight Distribution in BRAM
3.2 Structure of the Virtualized Restricted Boltzmann Machine Architecture
4.1 MicroBlaze PLB Connectivity
4.2 Weight distribution of eight partitions among four FPGAs
4.3 Overall Layout in Four FPGA System. All MPI Ranks are interconnected.
5.1 Mini-batch size vs. speedup
5.2 Mini-batch size vs. speedup without node calculation
5.3 Mini-batch size vs. speedup for a virtual 512x512 network with intrinsic RBM sizes of 64 and 128
Chapter 1
Introduction
The paradigm of machine learning deals with methods that allow a computer to extract
complex patterns underlying data. The applications of such methods are extensive,
including visual pattern recognition, speech recognition and video game artificial
intelligence. One popular method of machine learning is the use of artificial neural
networks (ANNs). Such networks roughly model the structure of the biological neural
networks in the brain, in that they consist of many simple neurons operating in parallel,
connected together through weighted relationships. The activation of each neuron, which
depends on the weights and states of the neurons connected to it, determines the reaction
of the network to some input. By controlling the values of the weights, the network can
be trained to recognize certain patterns or features of a dataset.
Many different types of ANNs exist with different network topologies, activation
functions and learning algorithms. A particularly popular architecture is the Restricted
Boltzmann Machine (RBM), a stochastic, generative model that has proven to perform well
in problems such as face recognition [6]. Recently, it has been shown that when several
RBMs are stacked together to form a Deep Belief Network (DBN), an efficient learning
algorithm exists to train the entire network [1]. DBNs have the benefit of being able
to learn more complex features and have been applied to the problems of generating facial
expressions [2], semantic hashing of documents [3] and recognition of handwritten digits
[1]. Although the learning algorithm is relatively efficient, training the large networks
required for the real-world applications above can still take several days or weeks on
a general purpose desktop computer [6]. The parallel nature of the RBM architecture
makes it well suited to hardware implementation, and several groups have created
FPGA and GPU based RBM solutions providing much needed speed-up [7, 8, 5]. In
particular, Ly et al.'s FPGA architecture has produced a 145x speed-up relative to a
desktop PC [7]. However, it has only implemented relatively small RBM networks of
256x256 neurons, whereas real world applications require much larger networks. For
example, the DBN used to recognize handwritten digits [1] contained a RBM of size 2000x510.
The goal of my thesis is to scale up Ly et al.'s FPGA architecture [4] so that it is
capable of handling the thousands of neurons necessary in real world DBN applications
while maintaining maximum performance. The main bottleneck limiting the current
implementation size of the FPGA architecture is the size of the weight matrix. This data
structure is necessarily large since each node must be connected through a weight to all
of the nodes in the next layer due to the bipartite graph organization of the RBM. To
allow for larger networks, my project will first involve adapting the FPGA architecture
to a larger, faster FPGA platform. This allows for the possibility of higher clock speeds
as well as better interconnects between FPGAs and thus greater performance. In
addition, I will investigate the effect of decreasing the bit width of the weights. This
could allow more weights to be stored on-chip and an increase in communication bandwidth
between computational cores, provided that the network is trainable at lower precision.
Finally, by time-multiplexing the resources of four FPGAs, I hope to accelerate the
training performance of arbitrary size networks.
Chapter 2
Background
2.1 Restricted Boltzmann Machine Operation
Artificial Neural Networks (ANNs) are models of biological neural networks that rely on
the interactions between simple units called neurons or nodes to perform computations.
By modifying connections between nodes, ANNs may be taught to model patterns in a
set of training data [9]. A Restricted Boltzmann Machine (RBM) [10, 11] is a type of
ANN that has recently become popular due to its role as a building block in Deep Belief
Networks (DBNs). An interesting property of RBMs is that they are taught to reproduce the
training data they are given. The internal model they create allows them to generate new
data statistically similar to the training set, and thus a RBM is said to be a generative
neural network.
2.1.1 Structure
The RBM consists of two layers of nodes: a visible layer representing the input to the
network and a hidden layer. Each node is connected to all of the nodes of the opposite
layer through a weighted connection. The real valued weights on the connections are the
learning parameters of the RBM and it is through their adjustment that the network
may be trained. We will denote the weight connecting visible node $i$ to hidden node $j$ as
$w_{i,j}$. The restriction that nodes of the same type are not interconnected allows for the
development of a fast learning algorithm and is one of the properties that separates the
RBM from general Boltzmann Machines. Generally, the node states are binary valued.
However, some applications benefit from having real valued visible nodes to represent
data such as greyscale images.
Figure 2.1: Structure of a 3x3 Restricted Boltzmann Machine (visible nodes v0, v1, v2; hidden nodes h0, h1, h2).
With this topology in mind, we can write the elements of a RBM in matrix notation:
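$$W = \begin{bmatrix} w_{0,0} & \cdots & w_{0,J-1} \\ \vdots & \ddots & \vdots \\ w_{I-1,0} & \cdots & w_{I-1,J-1} \end{bmatrix} \tag{2.1}$$
$$V = [v_0 \cdots v_{I-1}] \tag{2.2}$$
$$H = [h_0 \cdots h_{J-1}] \tag{2.3}$$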
The RBM is a stochastic neural network in that its node states are determined through
a probabilistic function rather than a deterministic one. The function used in a RBM is
the sigmoid or logistic function [10] (Fig. 2.2). Thus, the probability of activating a node
may be calculated, given that the opposite layer is determined, as the logistic function
of its weighted inputs.
Figure 2.2: Sigmoid Function (horizontal axis: Node Energy; vertical axis: Node Activation Probability).
$$P(v_i = 1) = \frac{1}{1 + \exp\left(-\sum_{j=0}^{J-1} h_j w_{i,j}\right)} \tag{2.4}$$
$$P(h_j = 1) = \frac{1}{1 + \exp\left(-\sum_{i=0}^{I-1} v_i w_{i,j}\right)} \tag{2.5}$$
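As an illustration of Eqns. 2.4 and 2.5, the following minimal Python sketch computes the activation probabilities of one layer from the states of the other and then draws binary node states from them. The function names and the list-of-lists weight layout are assumptions made for this example and are not taken from the thesis.

```python
import math
import random


def logistic(x):
    """Logistic (sigmoid) function used in Eqns. 2.4 and 2.5."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probabilities(v, w):
    """P(h_j = 1) for every hidden node j given visible states v (Eqn. 2.5).

    w is an I x J list of lists; w[i][j] connects visible node i to hidden node j.
    """
    return [logistic(sum(v[i] * w[i][j] for i in range(len(v))))
            for j in range(len(w[0]))]


def visible_probabilities(h, w):
    """P(v_i = 1) for every visible node i given hidden states h (Eqn. 2.4)."""
    return [logistic(sum(h[j] * w[i][j] for j in range(len(h))))
            for i in range(len(w))]


def sample_states(probabilities, rng):
    """Turn each node on when a uniform draw falls below its probability."""
    return [1 if rng.random() < p else 0 for p in probabilities]
```

For example, `sample_states(hidden_probabilities([1, 0, 0], w), random.Random(0))` would draw one hidden layer for a 3-node visible layer, given some weight matrix `w`.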
2.1.2 Alternating Gibbs Sampling
We can now describe the main operating mode of a RBM, called Alternating Gibbs Sampling
(AGS). Looking at Eqns. 2.4 and 2.5, the node states for one layer can be determined
as long as the other layer is fixed. In the first AGS phase, the hidden layer is
generated based on test data on the visible nodes; the second AGS phase reconstructs
the test data by clamping the hidden nodes and stochastically finding the states of the
visible nodes. This process can continue to higher order AGS phases as each layer is
clamped or determined in turn. To keep track of node states we will denote the AGS phase as
a superscript; for example, $V^3$ would represent the visible layer node states at the third
AGS phase.
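The alternation described above can be sketched in Python as follows. This is only an illustrative sketch: the helper names are hypothetical, and it assumes binary node states with odd phases determining the hidden layer and even phases determining the visible layer.

```python
import math
import random


def layer_update(states, w, from_hidden, rng):
    """Clamp one layer and stochastically determine the opposite layer."""
    if from_hidden:   # hidden clamped: weighted inputs of the visible nodes
        inputs = [sum(states[j] * w[i][j] for j in range(len(states)))
                  for i in range(len(w))]
    else:             # visible clamped: weighted inputs of the hidden nodes
        inputs = [sum(states[i] * w[i][j] for i in range(len(states)))
                  for j in range(len(w[0]))]
    return [1 if rng.random() < 1.0 / (1.0 + math.exp(-x)) else 0
            for x in inputs]


def alternating_gibbs_sampling(v, w, phases, rng):
    """Run the given number of AGS phases starting from visible states v.

    Returns the (visible, hidden) states held after the last phase.
    """
    h = None
    for phase in range(1, phases + 1):
        if phase % 2 == 1:
            h = layer_update(v, w, from_hidden=False, rng=rng)
        else:
            v = layer_update(h, w, from_hidden=True, rng=rng)
    return v, h
```

With three phases, the returned visible states are the reconstruction from the second phase and the hidden states are those generated in the third.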
2.1.3 Energy
The Restricted Boltzmann Machine draws inspiration from the Boltzmann Distribution
of statistical mechanics which describes the probability distribution for a set of states in
a system [12]. The state of a RBM is defined by its visible and hidden layers. Thus,
given a certain configuration of visible and hidden nodes and fixing the weights, we can
define an energy:
$$E(V,H) = -\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} v_i h_j w_{i,j} \tag{2.6}$$
From the Boltzmann Distribution:
$$P(V,H) \propto \exp(-E(V,H)) \tag{2.7}$$
The goal of learning in an RBM is to model the training set. This can be accomplished
by modifying the weights such that we obtain a Boltzmann distribution where the
probability of obtaining configurations with the training vectors is maximized. Looking at
the previous equation, we see that to maximize the probability of a training vector $P(V)$
we need to minimize the energy associated with its configurations.
In addition, from the energy equation, we can see that for each visible or hidden node,
there is an associated local energy.
$$E(V,H) = -\sum_{i=0}^{I-1} v_i E_i = -\sum_{j=0}^{J-1} h_j E_j \tag{2.8}$$
$$E_i = \sum_{j=0}^{J-1} h_j w_{i,j} \tag{2.9}$$
$$E_j = \sum_{i=0}^{I-1} v_i w_{i,j} \tag{2.10}$$
The local energy for a given node is in fact the weighted sum of the states of the nodes
from the opposite layer. Therefore, we can rewrite Eqns. 2.4 and 2.5 as:
$$P(v_i = 1) = \frac{1}{1 + e^{-E_i}} \tag{2.11}$$
$$P(h_j = 1) = \frac{1}{1 + e^{-E_j}} \tag{2.12}$$
We can write these local energies more succinctly as members of vectors:
$$E_V = [E_0 \cdots E_{I-1}] = H \cdot W^T \tag{2.13}$$
$$E_H = [E_0 \cdots E_{J-1}] = V \cdot W \tag{2.14}$$
Thus the node state vectors $V$ and $H$ are functions of the energy vectors $E_V$ and $E_H$
respectively.
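A small worked example may help to check Eqns. 2.6 and 2.8 to 2.10 against each other. The sketch below uses an arbitrary 3x2 weight matrix chosen only for illustration; the function names are hypothetical.

```python
def local_energies(v, h, w):
    """Return (E_V, E_H): the local energies of the visible and hidden nodes.

    E_V[i] = sum_j h[j] * w[i][j]   (Eqn. 2.9, i.e. H . W^T)
    E_H[j] = sum_i v[i] * w[i][j]   (Eqn. 2.10, i.e. V . W)
    """
    e_visible = [sum(h[j] * w[i][j] for j in range(len(h))) for i in range(len(v))]
    e_hidden = [sum(v[i] * w[i][j] for i in range(len(v))) for j in range(len(h))]
    return e_visible, e_hidden


def total_energy(v, h, w):
    """Energy of a full configuration (Eqn. 2.6)."""
    return -sum(v[i] * h[j] * w[i][j]
                for i in range(len(v)) for j in range(len(h)))


# Arbitrary example values, not taken from the thesis.
v = [1, 0, 1]
h = [1, 1]
w = [[1.0, -2.0],
     [0.5, 0.5],
     [3.0, -1.0]]

e_v, e_h = local_energies(v, h, w)
energy = total_energy(v, h, w)
# Both decompositions of Eqn. 2.8 give the same total energy.
from_visible = -sum(vi * ei for vi, ei in zip(v, e_v))
from_hidden = -sum(hj * ej for hj, ej in zip(h, e_h))
```

Here `e_v` is `[-1.0, 1.0, 2.0]`, `e_h` is `[4.0, -3.0]`, and all three ways of computing the total energy give `-1.0`.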
2.1.4 Learning Rules
Given this concept of state energy, we can find the learning rule for a RBM by
differentiating the log probability of obtaining a particular visible layer configuration:
$$\frac{\partial \log P(V)}{\partial w_{i,j}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty \tag{2.15}$$
where $\langle a \rangle^n$ denotes the expected value of $a$ at the $n$th AGS phase. From this
equation we can see that to increase the probability of training vectors, we can apply the
following weight update rule:
$$\Delta w_{i,j} = \epsilon \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty \right) \tag{2.16}$$
where $\epsilon$ is the learning rate. This learning rate must be carefully controlled: a rate
that is too large produces weight updates that cannot settle into the energy minima, while a
rate that is too small takes too long to reach them. One solution is to dynamically decrease
the learning rate during training using the process of simulated annealing [12].
Clearly, getting a sample from the infinite AGS phase is not feasible in computation time.
However, it has been shown that we can estimate the infinite phase with a finite one;
this is called contrastive divergence (CD) learning [13]. Using CD, we no longer perform
gradient descent in weight space, but it has been shown to work well even with as few as
three AGS phases.
2.1.5 Batch Learning
To have weight updates which represent the entire set of training data, it would be
best to calculate the average weight update for the entire training set before committing
the change. This type of weight update is called Batch Learning. For large sets batch
learning would result in long computation times between weight updates. To address
this problem, we can reduce the batch size and create mini-batches to increase update
rate at the expense of update precision. At the limit of one training vector per weight
update, we are performing on-line learning.
For a batch size of $L$, the learning rule becomes:
$$\Delta w_{i,j} = \frac{\epsilon}{L} \sum_{l=0}^{L-1} \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty \right) \tag{2.17}$$
2.1.6 Summary
The following procedure describes the RBM training operation with three AGS phases
and a mini-batch size of $L$.
1. Apply a training vector to the visible layer. This becomes $V^1$ in the next step.
2. AGS Phase 1: Compute the local energy $E_H^1 = V^1 W$ and apply the logistic function
to find the node states $H^1 = f(E_H^1)$.
3. Increment the weight update: $\Delta w_{i,j} = \Delta w_{i,j} + (V^1)^T H^1$
4. AGS Phase 2: Compute the local energy $E_V^2 = H^2 W^T$, where the clamped hidden layer
carries over so that $H^2 = H^1$, and apply the logistic function to find the node states
$V^2 = f(E_V^2)$.
5. AGS Phase 3: Compute the local energy $E_H^3 = V^3 W$, where the clamped visible layer
carries over so that $V^3 = V^2$, and apply the logistic function to find the node states
$H^3 = f(E_H^3)$.
6. Decrement the weight update: $\Delta w_{i,j} = \Delta w_{i,j} - (V^3)^T H^3$
7. Repeat steps 1-6 for each training vector in the mini-batch.
8. Commit the weight update: $w_{i,j} = w_{i,j} + \frac{\epsilon}{L} \Delta w_{i,j}$
9. Repeat steps 1-8 for each mini-batch in the training set.
It should be noted that the energy computation in each AGS phase is of complexity
$O(n^2)$ and the weight update computation is also $O(n^2)$. This makes training
RBMs with thousands of nodes a very time consuming process.
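The procedure above can be sketched as a sequential Python routine for one mini-batch. This is a hedged illustration rather than the benchmark used in the thesis: the function names are hypothetical, node states are assumed binary, and the decrement is assumed to use the node states held after the third AGS phase, which is what the contrastive divergence rule of Eqn. 2.16 calls for. The two nested loops over all node pairs make the $O(n^2)$ cost visible.

```python
import math
import random


def vec_mat(vec, w, transpose=False):
    """Return vec . W, or vec . W^T when transpose is True."""
    if transpose:
        return [sum(vec[j] * w[i][j] for j in range(len(vec)))
                for i in range(len(w))]
    return [sum(vec[i] * w[i][j] for i in range(len(vec)))
            for j in range(len(w[0]))]


def select_states(energies, rng):
    """Apply the logistic function to each local energy and sample a binary state."""
    return [1 if rng.random() < 1.0 / (1.0 + math.exp(-e)) else 0
            for e in energies]


def train_mini_batch(w, batch, learning_rate, rng):
    """Update the weight matrix w in place from one mini-batch of training vectors."""
    num_visible, num_hidden = len(w), len(w[0])
    delta = [[0.0] * num_hidden for _ in range(num_visible)]
    for v1 in batch:                                            # steps 1 and 7
        h1 = select_states(vec_mat(v1, w), rng)                 # AGS phase 1
        v2 = select_states(vec_mat(h1, w, transpose=True), rng) # AGS phase 2
        h3 = select_states(vec_mat(v2, w), rng)                 # AGS phase 3
        for i in range(num_visible):
            for j in range(num_hidden):
                # Increment with the data statistics, decrement with the
                # statistics after the final phase (steps 3 and 6).
                delta[i][j] += v1[i] * h1[j] - v2[i] * h3[j]
    scale = learning_rate / len(batch)
    for i in range(num_visible):                                # step 8
        for j in range(num_hidden):
            w[i][j] += scale * delta[i][j]
    return w
```

Calling `train_mini_batch` once per mini-batch over the training set corresponds to step 9.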
2.1.7 Deep Belief Networks
The Restricted Boltzmann Machine is by itself a powerful tool for extracting features from
test data. However, it becomes even more useful as a part of a Deep Belief Network (DBN).
In effect, DBNs consist of multiple RBMs stacked upon each other; the hidden nodes for
one layer become the visible nodes for the next, as in Fig. 2.3. The additional layers of
hidden nodes are used to model patterns within the patterns generated by earlier layers.
Thus, the DBN is able to model more complex features in data. What is interesting
about these deep networks is that they can be greedily trained layer by layer using the
same efficient algorithm presented above [1].
Figure 2.3: Layout of a Deep Belief Network (RBM 1 and RBM 2 stacked, each drawn with visible nodes v0 to v2 and hidden nodes h0 to h2).
To get better classification or generative properties, additional training can be
performed using wake-sleep or backpropagation algorithms.
2.2 Methods for accelerating Restricted Boltzmann Machines
A number of computationally intensive operations need to be performed during RBM
training. In calculating the local energies, a vector-matrix multiplication must be
performed as well as a matrix transposition during even AGS phases. In addition, to evaluate
the node states, the non-linear logistic function must be evaluated.
These operations can be slow on sequential general purpose processors. However,
there have been a number of attempts to accelerate the process. Of particular interest are
three published implementations: one design using the inherent parallelism in Graphics
Processing Units (GPUs) [8] and two custom hardware designs implemented on Field
Programmable Gate Arrays (FPGAs) [5, 4].
Modern Graphics Processing Units (GPUs) offer several layers of parallelism much
greater than standard multi-core CPUs, allowing them to operate on large batches of data
at once. In addition, optimized linear algebra packages are available for them. Raina et
al. [8] used an NVIDIA GTX 280 GPU with 1GB of RAM and the CUDA Application
Layer to accelerate RBM operations and build deep belief networks. On a single RBM
of 4096x11008 size, they achieved a speed-up of 72.6x over a software implementation
using the optimized matrix operation library Goto BLAS running on a 3.16GHz Dual-Core
processor. To minimize data transfer of weights in DBNs, they developed the idea
of "overlapping patches". By representing the visible layer as a 2D surface and tiling
patches across it, they were able to create local connections between hidden layers where
the patches overlapped. Using this method, they were able to build 4-layer DBNs with
96 million parameters. However, the amount of overlapping areas decreases as the order
of overlap increases, so this method is inherently limited to DBNs of decreasing size for
higher layers. In addition, the layers are not fully connected with the overlapping patches
method, so this implementation is limited to applications in a subset of DBN problems.
Kim et al. [5] developed a hardware implementation of an RBM on an Altera Stratix
III EP3SL340 FPGA. In this design the authors decided to use 16-bit fixed point words
to represent the weights, energies and visible node states. The main computational cores
of this design were partitioned into groups of adders and multipliers to perform the
vector-matrix operation of local energy calculation. To perform the energy calculation
for all of the nodes of a given layer in parallel, all of the row or column elements of the
weight matrix must be available at the same time and thus must be stored on separately
addressable memory elements. To avoid this problem, the authors stored each column
of the weight matrix in separate memory blocks such that a single row was available at
a time. This allowed the visible energies to be calculated simply using a multiplier and
tree adder. Then, by using an accumulator structure to calculate the hidden energies,
they did not have to modify their memory structure. To compute the logistic function, a
Piecewise Linear Approximation of a Nonlinear function (PLAN) was implemented. When
benchmarked against a software implementation running on a 2.4GHz Intel Core 2
system, they achieved a speed-up of 25x over single precision MATLAB code and 30x over
double precision. The maximum network size achieved was 512x512.
The final FPGA implementation, by Ly et al. [4], was developed on a Berkeley Emulation
Engine 2 hardware platform consisting of five interconnected Virtex-II Pro XC2VP70
FPGAs. In this design, a set of tree adders was used to calculate the visible and hidden
energies. The problem of weight addressing was alleviated by storing diagonal sections
of the matrix in different memory blocks. In this way, the same set of memory blocks
could be used to access a row or column of the weight matrix. The logistic function was
performed using a Piecewise Linear Interpolator. One significant difference from the
FPGA design by Kim et al. is that the weights and energies are represented as 32-bit
fixed point numbers rather than 16-bit ones. In addition, the visible nodes can only
be binary valued, whereas they are real valued in Kim et al.'s design. Three different
designs were presented: one on a single FPGA running a 128x128 RBM, one using coarse
grain parallelism across four FPGAs to run a 256x256 RBM and one time-multiplexing
the resources of a single FPGA to realize a 256x256 network. The speed-ups obtained
were 61x, 145x and 32x respectively over an optimized C implementation running on a
2.8GHz Pentium 4 processor.
The works described here did not use a common benchmark, so it is difficult to compare
performance directly. The GPU implementation has a clear advantage in network size, but
the limitations of its overlapping patches technique make it unusable for large, general
DBNs. Of the two FPGA applications, the one by Ly et al. has a clear performance
advantage, especially considering that it is implemented on older FPGA hardware. Notably,
no designs have been published implementing real world DBN applications.
Chapter 3
Virtualized FPGA Architecture
The work in this thesis is built on top of the Virtualized FPGA RBM architecture
designed by Ly et al. [4]. In this chapter, some important aspects of the architecture will
be discussed.
Custom FPGA hardware cores are able to perform the computations involved in
RBM training very quickly, but a FPGA has a finite amount of resources. Therefore,
the size of the network a single FPGA can work on is limited. One way to increase the
workable network size is by simply adding more FPGAs. However, as the size of the
application grows, this method quickly becomes cost and power prohibitive. A better
approach is to time-multiplex the hardware to handle problems of almost arbitrary size.
The tradeoff in this approach is that a context switch is required to work on different
portions of the network. The virtualized RBM architecture that this thesis is based on
uses the time-multiplexing approach to work on networks whose size would not normally
fit on a single FPGA.
3.1 Partitioning
To use a virtualized system for performing Restricted Boltzmann Machine operations,
the computations must first be partitioned into independent work units. By partitioning
the visible and hidden vectors into $A$ and $B$ parts respectively, the weight matrix can be
broken into a group of block matrices.
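$$W = \begin{bmatrix} W_{0,0} & \cdots & W_{0,B-1} \\ \vdots & \ddots & \vdots \\ W_{A-1,0} & \cdots & W_{A-1,B-1} \end{bmatrix} \tag{3.1}$$
$$V = [V_0 \cdots V_{A-1}] \tag{3.2}$$
$$H = [H_0 \cdots H_{B-1}] \tag{3.3}$$
The energy calculation then becomes:
$$E_H = V \cdot W = \begin{bmatrix} E_{H_0} \\ \vdots \\ E_{H_{B-1}} \end{bmatrix} = \begin{bmatrix} V_0 \cdot W_{0,0} + \cdots + V_{A-1} \cdot W_{A-1,0} \\ \vdots \\ V_0 \cdot W_{0,B-1} + \cdots + V_{A-1} \cdot W_{A-1,B-1} \end{bmatrix} \tag{3.4}$$
$$E_V = H \cdot W^T = \begin{bmatrix} E_{V_0} \\ \vdots \\ E_{V_{A-1}} \end{bmatrix} = \begin{bmatrix} H_0 \cdot W_{0,0}^T + \cdots + H_{B-1} \cdot W_{0,B-1}^T \\ \vdots \\ H_0 \cdot W_{A-1,0}^T + \cdots + H_{B-1} \cdot W_{A-1,B-1}^T \end{bmatrix} \tag{3.5}$$
In this configuration, the energies $E_{H_j}$ and $E_{V_i}$ required to calculate a block of node
states are now divided into a number of partial energy blocks involving the calculation
$V_i W_{i,j}$ or $H_j W_{i,j}^T$. These partial energy calculations can be done independently and later
recombined to resolve the node states. Also, by partitioning the weights in this way, the
weight update calculations may be performed on each weight matrix block $W_{i,j}$ independently
as well.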
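The partitioned energy calculation can be checked numerically with a short Python sketch. The code below is an illustration under stated assumptions: the function names are hypothetical, the blocks are taken to be square with side `n`, and the inner accumulation only mimics the summing of partial energies rather than modelling any hardware core.

```python
def vec_mat(vec, mat):
    """Plain vector-matrix product vec . mat."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]


def partitioned_hidden_energies(v, w, n):
    """Compute E_H = V . W from n x n blocks of W, one partial energy at a time."""
    a_parts = len(w) // n        # number of visible partitions (A)
    b_parts = len(w[0]) // n     # number of hidden partitions (B)
    energies = []
    for b in range(b_parts):
        total = [0] * n          # running sum of partial energies for block b
        for a in range(a_parts):
            v_block = v[a * n:(a + 1) * n]
            w_block = [row[b * n:(b + 1) * n] for row in w[a * n:(a + 1) * n]]
            partial = vec_mat(v_block, w_block)   # V_a . W_{a,b}
            total = [t + p for t, p in zip(total, partial)]
        energies.extend(total)
    return energies


# Arbitrary 4x4 example split into 2x2 blocks (A = B = 2).
w = [[4 * i + j for j in range(4)] for i in range(4)]
v = [1, 0, 1, 1]
blocked = partitioned_hidden_energies(v, w, 2)
direct = vec_mat(v, w)
```

Both `blocked` and `direct` evaluate to `[20, 23, 26, 29]`, showing that the partial energies recombine into the full energy vector.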
3.2 Computational Cores
The hardware architecture consists of three major cores. The Restricted Boltzmann
Machine Core (RBMC) performs the primary vector-matrix energy calculation as well
as the weight update. The Energy Accumulator Core (EAC) is used to sum the partial
energies described in the previous section. Finally, the Node Select Core (NSC) evaluates
the node states using the sigmoid function.
3.2.1 Restricted Boltzmann Machine Core
The RBMC performs the $O(n^2)$ energy calculation (Eqns. 2.13, 2.14) and weight update
(Eqn. 2.16) steps in $O(n)$ time. To achieve this speed, a number of restrictions are
applied to the RBM network.
• The sizes of the visible and hidden layers must be the same to allow for reuse of
the same computational logic for both layers.
• Node states must be binary values. This condition allows the use of AND gates in
place of multipliers for operations involving the node states.
• Weights and energies use a 32-bit fixed point representation. The choice of fixed
point over floating point simplifies the arithmetic logic for energy and weight update
calculations.
• The layer size must be a power of two. This limitation allows the use of a binary
tree adder when calculating the energy.
The size restrictions on the node layers do not limit the space of problems the architecture
is capable of handling since unused nodes may be added to reach the next power of two.
However, maximum effective performance will not be attained unless the sizes of the
layers in the application match the previous descriptions. Also, a 32-bit fixed point
representation provides a very large range of supported values and would likely not run
into overflow or underflow issues unless the position of the radix point was chosen poorly.
The choice of binary valued node states, on the other hand, does limit the number of real
world applications, but the simplicity of the logic required allows for very fast
computation in the set of problems the architecture can handle.
Figure 3.1: Weight Distribution in BRAM (the elements $w_{0,0}$ to $w_{2,2}$ of a 3x3 weight matrix $W$ assigned to BRAM 0, BRAM 1 and BRAM 2).
With these restrictions in place, the node energies may be calculated by accessing an
entire row or column of the weight matrix, performing a logical AND with the node states
and feeding the result into a binary tree adder. Due to the pipelined nature of the tree
adder, one energy can be produced every clock cycle, thus reducing the computational
complexity to $O(n)$. Likewise, an entire row or column of weight updates may be generated
in parallel by performing a logical AND between the visible and hidden node states and
using the outcome to decide whether or not the learning rate should be applied to the
weight updates.
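The AND-and-add structure described above can be mimicked in software. The following Python sketch is only an analogy: the function names are hypothetical, integers stand in for fixed point weights, and the level-by-level loop stands in for the stages of a pipelined tree adder.

```python
def tree_add(values):
    """Sum a power-of-two-length list pairwise, level by level."""
    count = len(values)
    if count == 0 or (count & (count - 1)) != 0:
        raise ValueError("length must be a power of two")
    while len(values) > 1:
        # Each pass corresponds to one level of the binary adder tree.
        values = [values[k] + values[k + 1] for k in range(0, len(values), 2)]
    return values[0]


def node_energy(states, weights):
    """Gate each weight with a binary node state, then reduce with the tree."""
    gated = [wt if s else 0 for s, wt in zip(states, weights)]
    return tree_add(gated)
```

For instance, `node_energy([1, 0, 1, 1], [5, -3, 2, 7])` returns `14`: the second weight is masked out and the remaining three are summed in two tree levels.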
Clearly, an important characteristic of the RBMC is its ability to access a full column
or row of the weight matrix in parallel. To facilitate this, for a RBM of size n x n, n
physical dual ported Block RAMs (BRAMs) are instantiated, each containing a diagonal
of the weight matrix. An example for a 3x3 RBM is shown in Fig. 3.1. Notice that for
every row and column, each weight is stored on a different BRAM.
The parallel storage of weights turns out to be the limiting factor in terms of size of
RBM synthesizable on a single FPGA. For weight storage, n BRAMs are required and
an additional n BRAMs are required to store the weight updates.
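One way to see why diagonal storage gives conflict-free row and column access is the small Python sketch below. The exact mapping used in the architecture is not given here, so the scheme in `bram_location` (wrapped diagonal index as the BRAM number, row index as the address) is an assumption chosen for illustration, as are the function names.

```python
def bram_location(i, j, n):
    """Map weight w[i][j] of an n x n matrix to a (bram, address) pair.

    Assumed scheme: the wrapped diagonal (j - i) mod n selects the BRAM and
    the row index i is used as the address inside it.
    """
    return (j - i) % n, i


def is_conflict_free(n):
    """Check that every row and every column touches n distinct BRAMs."""
    for k in range(n):
        row_brams = {bram_location(k, j, n)[0] for j in range(n)}
        col_brams = {bram_location(i, k, n)[0] for i in range(n)}
        if len(row_brams) != n or len(col_brams) != n:
            return False
    return True
```

Under this assumed mapping, `is_conflict_free(3)` holds for the 3x3 case, and the same check passes for larger sizes such as 128.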
3.2.2 Energy Accumulator Core and Node Select Core
The EAC and NSC work in tandem to find the node states given a set of partial
energies. First, as a stream of partial energies arrives at the EAC, they are summed and
stored in a BRAM First In First Out (FIFO) memory structure. Once the total energies
have been computed, the EAC sends them to the NSC, which performs node state selection
using an approximated sigmoid function and a uniform random number generator. The
sigmoid function is calculated using a look up table (LUT) whose output is sent through
a pipelined piecewise linear interpolator (PLI) in order to get a better estimate. Once
the node states have been determined, they are sent back to the EAC and from there
back to the source of the partial energies.
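The node selection step can be illustrated with a floating point Python sketch of a look up table followed by linear interpolation. The table range, step size and function names below are assumptions made for this example; the hardware operates on fixed point values and its table parameters are not given here.

```python
import math
import random

# Hypothetical table parameters, chosen only for illustration.
LUT_MIN, LUT_MAX, LUT_STEP = -8.0, 8.0, 1.0
LUT = [1.0 / (1.0 + math.exp(-(LUT_MIN + k * LUT_STEP)))
       for k in range(int((LUT_MAX - LUT_MIN) / LUT_STEP) + 1)]


def approx_sigmoid(energy):
    """Look up the two nearest table entries and interpolate linearly between them."""
    if energy <= LUT_MIN:
        return LUT[0]
    if energy >= LUT_MAX:
        return LUT[-1]
    position = (energy - LUT_MIN) / LUT_STEP
    index = int(position)
    fraction = position - index
    return LUT[index] + fraction * (LUT[index + 1] - LUT[index])


def select_node_state(energy, rng):
    """Turn the node on when a uniform random draw falls below its probability."""
    return 1 if rng.random() < approx_sigmoid(energy) else 0
```

With a step of 1.0, the interpolated value stays within roughly 0.02 of the exact logistic function over the table range; a finer table would reduce this error at the cost of more storage.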
3.3 Message Passing Interface
All three cores performing different parts of the RBM calculations must be connected
to each other as well as to a supervising microprocessor. To provide a simple,
high bandwidth communication channel, TMD-MPI [14] was used to connect the cores.
TMD-MPI implements a subset of the Message Passing Interface (MPI) standard for
embedded systems. This communication layer offers a level of abstraction away from the
implementation of the computational cores. Data is sent point to point through a
network as packets called messages. Each message has a defined source, destination, tag
and word count, where words are 32-bit pieces of data. At initialization, each device on
the MPI network is given a specific address called a rank, which is used to route packets
through the network. When data is received, it is stored in a message queue. Once a
hardware core begins to read the message, a new word of the data is available each clock
cycle. This operation allows the cores to operate asynchronously and yet still have high
bandwidth communication between each other.
The microprocessor has its own Message Passing Engine (MPE) which supports direct
memory access (DMA) and burst access to memory. These features allow a minimal
overhead from the processor since only four 32-bit words must be sent to the MPE
before it may begin streaming data. The MPI connectivity of the full system is shown in
Fig. 3.2. The circles represent the MPI hardware and show the ranks of each computing
element. The platform used in [4] was a Berkeley Emulation Engine 2 (BEE2) [15] with
five Virtex-II Pro XC2VP70 FPGAs in a communication mesh and a hard PowerPC
(PPC) processor.
Figure 3.2: Structure of the Virtualized Restricted Boltzmann Machine Architecture (PPC, EAC, RBMC and NSC, with MPI ranks R0 to R3).
Chapter 4
Large Restricted Boltzmann Machine Architecture
The goal of this thesis was to investigate the scalability of the FPGA architecture
discussed in the previous chapter and adapt it to handle the size of Restricted Boltzmann
Machines required in real world Deep Belief Network applications. In this chapter, the
design modifications to the existing architecture will be discussed.
One of the first steps taken to increase the performance of the virtualized FPGA
architecture was to move the design to a more modern FPGA. The BEECube Berkeley
Emulation Engine 3 (BEE3) [16] hardware platform was chosen as a logical upgrade from
the Virtex-II based BEE2. The BEE3 contains four Xilinx Virtex-5 5VLX155T FPGAs
connected in a ring, each with access to up to 16GB of DDR2 external RAM. The only major
design change during this transition was a switch from a PowerPC processor managing the
hardware cores to soft MicroBlaze processors.
4.1 Investigation of Data Bit Widths
The existing Virtualized Restricted Boltzmann Machine architecture represented weight
and energy values as 32-bit fixed-point numbers. This bit width was a convenient design
choice since the MPI hardware operated with 32-bit data widths and the configurable
dual ported BRAMs supported up to 36-bit data width. However, significant performance
improvements may be realized with a reduction in bit width. For example, given the
32-bit width of the MPI channel, a bit width of 16 bits would double the number of
weights or energies transferred per clock cycle since two of them could be packed into each
MPI word. A width of 8 bits would allow four times the number of weights or energies
to be transferred. Since the weights must be transferred during every context switch, this
gain in throughput becomes significant in the Virtualized RBM architecture. With data
packing, the operation of the EAC and NSC may also be parallelized to add multiple
energies and calculate multiple node states per clock cycle. Finally, a RBM of size n
requires 2n physical dual ported BRAMs to store the weight and weight update matrices
on the RBMC. If the BRAMs on the FPGA can be split into narrower but more
plentiful dual ported BRAMs, the size of RBM that can be synthesized on a single FPGA
may also be increased. This would significantly increase the performance of the RBMC.
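The packing described above can be sketched in C as follows. This is a minimal illustration, not the thesis hardware: the function names are hypothetical, and placing the first value in the upper half of the word is an assumption, since the text does not state the ordering.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit MPI word.
 * Assumption: `first` occupies the upper 16 bits. */
uint32_t pack16x2(int16_t first, int16_t second)
{
    return ((uint32_t)(uint16_t)first << 16) | (uint32_t)(uint16_t)second;
}

/* Recover the two 16-bit values from a packed word. The conversion back to
 * int16_t relies on the usual two's complement behaviour. */
void unpack16x2(uint32_t word, int16_t *first, int16_t *second)
{
    *first  = (int16_t)(uint16_t)(word >> 16);
    *second = (int16_t)(uint16_t)(word & 0xFFFFu);
}

/* Number of 32-bit words needed to carry `count` values of width `bits`. */
uint32_t words_needed(uint32_t count, uint32_t bits)
{
    assert(bits == 8u || bits == 16u || bits == 32u);
    uint32_t per_word = 32u / bits;
    return (count + per_word - 1u) / per_word;
}
```

For a 128x128 weight matrix, `words_needed(128 * 128, 32)` gives 16384 words, while 16-bit and 8-bit values need 8192 and 4096 words respectively, which is the doubling and quadrupling of throughput described in the text.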
The drawback of using fewer bits to represent data is the reduction in the range
of possible values. Depending on the RBM application, there exists the possibility of
overflow or underflow. This could lead to problems finding a set of values in weight
space to accurately represent the given training set. To roughly estimate the effect of
using different data widths, a simple experiment was carried out. A RBM was trained in
software with three different signed fixed-point representations: 32-bit with 8 magnitude
bits, 16-bit with 8 magnitude bits and 8-bit with 4 magnitude bits. The network of size
1024x512 was trained for 100 epochs to recognize an image of the number 0.
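To make the overflow and underflow risk concrete, here is a small C sketch of signed fixed-point quantization with saturation. It is not the code used for the experiment. How the "magnitude bits" above divide into integer and fractional bits is not spelled out, so the splits used in the usage note (8 fractional bits for the 16-bit format, 4 for the 8-bit format) are assumptions, as are the function names.

```c
#include <assert.h>
#include <stdint.h>

/* Quantize x to a signed fixed-point value with `total_bits` bits, of which
 * `frac_bits` are fractional. Values outside the representable range
 * saturate instead of wrapping; rounding is to nearest. */
int32_t to_fixed(double x, int total_bits, int frac_bits)
{
    assert(total_bits >= 2 && total_bits <= 32);
    assert(frac_bits >= 0 && frac_bits < total_bits);
    const double scale = (double)((int64_t)1 << frac_bits);
    const int64_t max_raw = ((int64_t)1 << (total_bits - 1)) - 1;
    const int64_t min_raw = -((int64_t)1 << (total_bits - 1));
    const double scaled = x * scale;
    if (scaled >= (double)max_raw) return (int32_t)max_raw;  /* overflow */
    if (scaled <= (double)min_raw) return (int32_t)min_raw;  /* overflow */
    return (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Convert a raw fixed-point value back to a real number. */
double from_fixed(int32_t raw, int frac_bits)
{
    return (double)raw / (double)((int64_t)1 << frac_bits);
}
```

Under these assumed splits, a small weight such as 0.03 becomes raw value 8 in the 16-bit format but rounds to 0 in the 8-bit format (underflow), and a value of 1000.0 saturates at 32767 in the 16-bit format (overflow).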
As a comparison metric, the trained networks were fed back the training image and
AGS was run for 1025 phases. If the weights had been set properly, the network would
be able to reproduce the image faithfully. Fig 4.1 shows the average number of errors
found in a bitwise comparison of the original image versus the reconstructed one over ten
attempts. The reconstruction with 16-bit weights produced a result similar to the case
with 32-bit weights while the network with 8-bit weights failed to reproduce the image
at all. Previous studies [17, 5] on data width reinforce these results and suggest that
16 bits is adequate for many neural network applications. Thus, this project uses 16-bit
representations for weight and energy values.
[Figure: Bit Width vs. Average Reconstruction Error. Horizontal axis: Weight and Energy
Bit Width; vertical axis: Average Reconstruction Error.]
Due to the choice of 16-bit widths, two energies are packed in each word transmitted
over MPI, and the EAC and NSC were modified to perform the energy summation and node
state calculation in parallel for both incoming energies. The Virtex-5 family Block
RAMs are 36Kbit dual ported modules configurable in a number of width and depth
settings. Each BRAM may also be configured as two independent 18Kbit dual ported
modules. In addition, both the 36Kbit and 18Kbit BRAMs may be configured in simple
dual-port mode in which there is a single dedicated read port and a single dedicated
write port. In this configuration, the 36Kbit BRAM width is doubled to 72 bits and the
18Kbit BRAM width is doubled to 36 bits [18]. The 5VLX155T FPGA has 212 36Kbit
BRAMs and thus 424 18Kbit BRAMs available. Therefore the maximum RBM which
can be implemented on the FPGA is 128x128 using 256 18Kbit BRAMs [19]. The RBMC
only requires a single read port and single write port for each weight storage BRAM, so
an 18Kbit BRAM in simple dual-port mode is wide enough for either representation and
the maximum RBMC is 128x128 with both 32-bit and 16-bit data widths.
However, the improvement in communication performance means that the reduction
in bit width is still beneficial to overall performance.
Figure 4.1: MicroBlaze PLB Connectivity (blocks μB, MPMC, PLB, MPE, BRAM and DRAM
on FPGA 0)
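The BRAM budget behind the 128x128 limit can be checked with a short C sketch. It assumes, as the sizes used in this thesis suggest but the text does not state, that the intrinsic size n is restricted to powers of two; the function name is hypothetical.

```c
#include <stdint.h>

/* Largest power-of-two intrinsic RBM size n such that the 2n BRAMs needed for
 * the weight and weight update matrices fit in `brams_available`.
 * Returns 0 if not even n = 1 fits. */
uint32_t max_intrinsic_size(uint32_t brams_available)
{
    uint32_t n = 0;
    for (uint32_t candidate = 1; candidate <= brams_available / 2u; candidate *= 2u) {
        n = candidate;
    }
    return n;
}
```

With the 424 18Kbit BRAMs of the 5VLX155T this returns 128 (256 BRAMs used); the next power of two, 256, would need 512 BRAMs and does not fit.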
4.2 Memory and Communication Considerations
4.2.1 Data Storage
A weight matrix for a RBM of size 128x128 with 16-bit weight representation requires
32 KBytes of memory to store. As the RBM network size increases, the weight matrix
grows quadratically and quickly exceeds the storage resources on a FPGA. In
addition, when training real world applications the number of training vectors can become
very large. For example, the MNIST database with 60 000 images was used while training
a network to recognize handwritten numbers [1]. To store all of this data, off-chip
DDR2 RAM was used. Fig 4.1 shows the local connections to the MicroBlaze processor.
Data is streamed through the processor local bus (PLB) from the DDR2 RAM to the
multi-ported memory controller (MPMC) and finally to the PLB MPE core, after which
it is distributed to the appropriate computational core through the MPI network.
4.2.2 Communication Overhead
Since much of the data must be located external to the FPGA, additional latency is
introduced during transfers between the MicroBlaze and the computational cores. The
MPE core is designed with direct memory access (DMA) and is able to perform burst
writes and reads to and from the external memory through the PLB. In principle this is
fast, especially for large blocks of data such as the weight matrices. The only overhead
from the MicroBlaze processor is the transmission of four words to set up the MPE
core. However, when transferring smaller batches of data this overhead becomes very
significant since the MicroBlaze is slow compared to the hardware cores. In particular,
node states for a 128x128 system are only four 32-bit words long themselves and a set
of energies is only 64 words long. Therefore, operations that rely heavily on these small
transfers, such as the node state calculation, suffer a significant performance penalty.
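The message sizes quoted above follow from the word counts used in Appendix A (n/32 words of node states, n/2 words of packed energies, n*n/2 words of packed weights). The C sketch below computes them and a crude payload fraction against the four setup words. Counting words is only a proxy: it ignores the fact that the setup words come from the slow MicroBlaze, so the real penalty for small messages is larger. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MPE_SETUP_WORDS 4u  /* words the MicroBlaze sends to start a transfer */

/* One bit per node, 32 nodes per word. */
uint32_t node_state_words(uint32_t n) { assert(n % 32u == 0u); return n / 32u; }

/* Two 16-bit energies per word. */
uint32_t energy_words(uint32_t n) { assert(n % 2u == 0u); return n / 2u; }

/* Two 16-bit weights per word. */
uint32_t weight_words(uint32_t n) { assert(n % 2u == 0u); return n * n / 2u; }

/* Fraction of the words in one transfer that are payload rather than setup. */
double payload_fraction(uint32_t payload_words)
{
    return (double)payload_words / (double)(payload_words + MPE_SETUP_WORDS);
}
```

For n = 128 the node states are 4 words (payload fraction 0.5), a set of energies is 64 words (about 0.94), and the weight matrix is 8192 words (above 0.999).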
In addition to the overhead from the MicroBlaze, another significant performance
reduction comes from the context switch operation itself. Although the transmission of
weight matrices is relatively fast given DMA, bursting and the two packed 16-bit elements
in each MPI word, the communication time represents a significant period where the
RBMC is idle. One simple way to address this problem is by increasing the mini-batch
size. This allows the weight matrix to remain on the RBMC for a longer period of time
and as batch size increases, theoretically the computation time of the RBMC would
eventually become the limiting factor. One key drawback to using larger batch sizes is
the need to store more partial energies before node state calculations may occur. In this
particular implementation, since the energies are stored in large external DDR2 RAM, this
does not have a significant effect.
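The benefit of a larger mini-batch can be seen by counting the words moved per work unit, using the message sizes from the outline in Appendix A (generate phase, one RBMC). This is a rough count that ignores the initialization message and all MicroBlaze setup time; B is the mini-batch size and W_unit is a symbol introduced here for the word count.

```latex
W_{\text{unit}} = \underbrace{\frac{n^2}{2}}_{\text{weights}}
  + B\left(\underbrace{\frac{n}{32}}_{\text{node states}}
  + \underbrace{\frac{n}{2}}_{\text{energies}}\right),
\qquad
\frac{W_{\text{unit}}}{B} = \frac{n^2}{2B} + \frac{17n}{32}
```

For n = 128 the per-vector count falls from 8260 words at B = 1 to about 166 words at B = 84, at which point the fixed 68 words of node states and energies account for a large share of the traffic.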
Figure 4.2: Weight distribution of eight partitions among four FPGAs (the 8x8 block
weight matrix W, with blocks W0,0 to W7,7, divided among FPGA 0 to FPGA 3)
4.3 Extension to Four FPGAs
Additional performance was obtained by using the four FPGAs available on the BEE3
platform to provide coarse grain parallelism. Since communication links between FPGAs
are not as fast as on-chip links, minimizing the inter-FPGA communication was essential
to maintaining performance. In addition, it was important to ensure that the work load
was shared evenly among the FPGAs. To satisfy these two conditions, the weight matrices,
which represent the largest data transfer, were assigned to fixed FPGAs, where they are
streamed locally in and out of DDR2 RAM. The partitioning of the matrices shown in
Fig. 4.2 is similar to the weight distribution within the RBMC from Fig. 3.1. The one
difference is that there may be fewer FPGAs than weight matrices and thus multiple
sets of calculations may be required to get all the partial energies for a set of nodes. This
structure allows all of the FPGAs to work together computing either a set of visible or
hidden nodes at once.
The overall system layout is shown in Fig. 4.3. In this configuration, all of the partial
energies from each FPGA must be sent to a single location to be added together and
the node states must then be distributed from that single source back to the rest of the
FPGAs. Given the operation of the EAC, this was the simplest method of connectivity.
However, it results in a communication bottleneck that becomes more significant as
FPGAs are added. In addition, a key limitation in this implementation is network size. A
bug exists in the EAC in which it ceases operation when more than ten partial energies
are delivered to it. Due to time constraints, this bug was not addressed for this thesis.
Therefore, the maximum network size implementable by this system is 8n where n is the
size of RBM synthesized on a single FPGA.
Figure 4.3: Overall Layout in Four FPGA System. All MPI Ranks are interconnected.
(FPGA 0 holds ranks R0 to R3: μB, EAC, RBMC and NSC. FPGA 1 holds R4 and R5,
FPGA 2 holds R6 and R7 and FPGA 3 holds R8 and R9, each a μB and an RBMC. Every
FPGA has local RAM.)
An outline of the code run on each FPGA is provided in Appendix A.
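One reading of the 8n limit that is consistent with the outline in Appendix A is sketched below in C: each FPGA delivers one partial energy per work unit to the EAC, so a limit of ten partial energies allows at most two work units on each of four FPGAs, or eight partitions in total. This derivation is an interpretation rather than something the text states, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest virtual RBM size when the EAC accepts at most `eac_limit` partial
 * energies per node state calculation and the work is split evenly over
 * `fpgas` devices of intrinsic size n. */
uint32_t max_virtual_size(uint32_t n, uint32_t fpgas, uint32_t eac_limit)
{
    assert(fpgas > 0u);
    uint32_t work_units = eac_limit / fpgas;  /* whole work units per FPGA */
    return n * fpgas * work_units;
}
```

With four FPGAs and a limit of ten, `max_virtual_size(128, 4, 10)` gives 1024, that is 8n, which matches the largest virtual network tested in Chapter 5.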
Chapter 5
Results and Analysis
5.1 Test Methods
5.1.1 Test Setup
The design was tested on the BEE3 platform with all computational cores and MicroBlaze
soft processors running at 100MHz. In addition, an external 2GB DDR2-667 RDIMM
module running at 200MHz was connected through a multi-ported memory controller
(MPMC) to the processor local bus (PLB) of the MicroBlaze processor. In hardware,
two different network sizes were synthesized: 64x64 and 128x128. Virtual network sizes
of 1024x1024 and 512x512 were tested on the 128x128 system while only 512x512 was
tested on the 64x64 system. The fmax of the 128x128 system reported by Xilinx Synthesis
Tool (XST) was 145.624MHz. However, due to time constraints, system clock frequencies
greater than 100MHz were not explored. In testing the effect of various batch sizes on
performance, batches of 1, 8, 16, 32, 64 and 84 were run.
A sequential implementation written in C was used as a basis for comparison for
relative speedup. The equivalent software versions of the hardware components were
written such that the output of the software benchmark matched the output of the
hardware implementation. The C implementation was compiled using gcc version 4.4.1
with optimization level 2. The benchmark was run on an Intel Core 2 Duo E8400 at 3GHz
on a 32-bit version of Ubuntu Linux running kernel 2.6.31-15.
To record the computation time, the function gettimeofday() was used in the software
implementation. The results of 25 runs were averaged to get the final computation time.
For the hardware implementation, the function MPI_TIME() was used and the results
were averaged over 10 runs.
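The software timing procedure can be sketched as follows. This is an illustration of averaging gettimeofday() measurements over several runs, not the benchmark harness used in the thesis; `mean_runtime_seconds` and `dummy_workload` are hypothetical names, and the workload is a stand-in for one run of the C benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Average wall-clock time of `work` over `runs` calls, in seconds. */
double mean_runtime_seconds(void (*work)(void), int runs)
{
    assert(work != NULL && runs > 0);
    double total = 0.0;
    for (int i = 0; i < runs; ++i) {
        struct timeval start, end;
        gettimeofday(&start, NULL);
        work();
        gettimeofday(&end, NULL);
        /* Combine the seconds and microseconds fields into one difference. */
        total += (double)(end.tv_sec - start.tv_sec)
               + (double)(end.tv_usec - start.tv_usec) * 1e-6;
    }
    return total / (double)runs;
}

/* Hypothetical stand-in for one benchmark run. */
static volatile unsigned long sink;
void dummy_workload(void)
{
    for (unsigned long i = 0; i < 100000UL; ++i) {
        sink += i;
    }
}
```

Calling `mean_runtime_seconds(dummy_workload, 25)` mirrors the 25-run average described above. gettimeofday() reports wall-clock time, so other activity on the machine is included in the measurement.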
5.1.2 Test Metric
Although relative speedup is an interesting measure, it is difficult to compare different
architectures without an absolute measure of performance. One popular method of
measuring neural network training performance is Connection Updates per Second (CUPS)
[20]. This is defined as the number of weight updates per second, or
$$\mathrm{CUPS} = \frac{n^2}{T} \tag{5.1}$$
where n is the size of the node layers and T is the amount of time for all of the weights to
be updated for one test vector.
The speedup over the sequential C implementation was taken to be the ratio of the
Connection Updates per Second of the hardware implementation to that of the software
implementation:
$$S = \frac{\mathrm{CUPS}_h}{\mathrm{CUPS}_s} \tag{5.2}$$
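Equations 5.1 and 5.2 translate directly into C. The sketch below also inverts Eq. 5.1 to recover the time per training vector from a reported CUPS figure; the function names are hypothetical.

```c
#include <assert.h>

/* Connection Updates per Second for an n x n RBM (Eq. 5.1). */
double cups(double n, double seconds_per_vector)
{
    assert(seconds_per_vector > 0.0);
    return (n * n) / seconds_per_vector;
}

/* Speedup of the hardware over the software implementation (Eq. 5.2). */
double speedup(double cups_hw, double cups_sw)
{
    assert(cups_sw > 0.0);
    return cups_hw / cups_sw;
}

/* Time for one full weight update, recovered from a CUPS figure. */
double update_time(double n, double cups_value)
{
    assert(cups_value > 0.0);
    return (n * n) / cups_value;
}
```

As an example, the 995.3 MCUPS reported in Section 5.2.3 for the 1024x1024 network corresponds to `update_time(1024.0, 995.3e6)`, roughly 1.05 ms per training vector.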
5.2 Results
5.2.1 Batch Size vs. Speedup
As previously mentioned, one of the major concerns with scaling a RBM architecture
is the O(n^2) growth in the number of context switches and thus data transfers. Fig.
5.1 shows the effect of increasing batch size on overall performance. Both network sizes
plotted were implemented on a system with 128x128 intrinsic core size. From Fig. 5.1,
we can see that as batch size increases the larger 1024x1024 system becomes increasingly
faster than the 512x512 network. We can infer from this result that at a batch size
of one, the weight transfers consume the vast majority of the computation time. As the
network size is increased, the RBMC and EAC/NSC operations begin to take up a greater
percentage of time and the O(n) benefits of those cores become apparent relative to the
O(n^2) software baseline.
Figure 5.1: Mini-batch size vs. speedup (horizontal axis: mini-batch size; vertical axis:
speedup; curves for the 1024x1024 and 512x512 networks)
If the RBM training time was limited purely by the RBMC or EAC/NSC computation
times, we would expect to see a continued increase in speedup with batch size. However,
the plots level off fairly quickly. One other operation that keeps the RBMC idle apart
from the weight transfer is the node state calculation. Particularly in this architecture,
the partial energies must be transferred from all of the FPGAs to one point and the node
states must then be streamed back. This operation must be done synchronously between
FPGAs; thus, some overhead in setting up the timings between FPGAs is required and
the computation must wait for the most delayed FPGA to be ready.
Figure 5.2: Mini-batch size vs. speedup without node calculation (horizontal axis:
mini-batch size; vertical axis: speedup; curves for the 1024x1024 and 512x512 networks)
As an artificial experiment, the same test was run but without the node calculation
stage. Here, the speedup increases noticeably beyond the last test, but still begins to taper
off before reaching very high performance. Since, from [4], the hardware cores themselves
are very fast, this is likely due once again to communication bottlenecks between the
RAM and the compute cores. From these results, we can see that the communication
overhead due to transfers between the MicroBlaze and the hardware cores limits the
overall system performance. Since the number of communications increases as O(n^2), this
has significant implications.
5.2.2 Intrinsic RBM Size
A second performance factor measured was the effect of changing the intrinsic network
size n. A reduction in n would degrade the performance benefit of the O(n) compute
cores and require more context switches for the same virtualized RBM size. However, it
would also reduce the overall transfer time of data between computations. Fig. 5.3
shows a comparison of running a virtual 512x512 RBM on both n = 64 and n = 128
hardware over varying batch sizes. From this plot we can see that increasing the intrinsic
size of RBM is very beneficial above small batch sizes.
Figure 5.3: Mini-batch size vs. speedup for a virtual 512x512 network with intrinsic
RBM sizes of 64 and 128
5.2.3 Summary
The absolute CUPS results are summarized in Table 5.1. It is interesting to note that
CUPS is approximately the same for batch size 1 and n = 128 regardless of virtualized
size. Since CUPS = n^2/T, this means that the time per training vector T grows as O(n^2).
Platform                    RBM Size     Batch Size   MCUPS
Virtualized FPGA n = 64     512x512      1            60.1
Virtualized FPGA n = 64     512x512      84           314.9
Virtualized FPGA n = 128    512x512      1            79.8
Virtualized FPGA n = 128    512x512      84           769.12
Virtualized FPGA n = 128    1024x1024    1            82.4
Virtualized FPGA n = 128    1024x1024    84           995.3
Table 5.1: Summary of Performance Measurements
Chapter 6
Conclusion
6.1 Conclusions
The purpose of this thesis was to investigate the scalability of the virtualized FPGA
architecture of Ly, et al. [4]. The primary impediment to using the architecture for large
networks was the communication overhead required in context switching. As the size of the
RBM grows linearly, the number of transfers increases as O(n^2). To maintain performance,
the architecture was ported to a faster, more modern FPGA platform, the BEE3. The
data representation was also changed from 32 bits to 16 bits in order to reduce the time
required to transfer a set of energies or weights. Finally, the system was implemented
on four FPGAs in order to provide some extra coarse grain parallelism.
When compared to a sequential O(n^2) C benchmark, the results showed several
different communication overhead problems. The speed at low batch sizes was limited by
the weight transfers during AGS phases. At higher batch sizes, the transfer of data to
the EAC became the bottleneck and, if that was removed, additional overhead reduced
performance before a good speedup could be observed. The design presented in this
thesis only achieves a small speedup over software at high batch sizes. However, the analysis
provided may be used as a basis for further improvements to the architecture.
6.2 Future Work
The architecture presented in this thesis still has a great deal of room for improvement.
Primarily, a reduction in the significant communication overheads present in the virtu-
alized system would allow the system to more fully utilize the computational cores.
6.2.1 Weight Matrix Caching
The transfer of weights during the context switches is a significant performance bottleneck
of this architecture. As shown in this work, the effect of the context switch can be
partially alleviated by using large batch sizes. However, during the transfer from external
DDR2 RAM, the RBMC is still inactive. To reduce the transfer latency, the next weight
matrix to be processed may be cached in a compact structure within the RBMC. Another
possibility is to use the leftover depth of the weight storage BRAMs to cache multiple
weight matrices. In the 128x128 16-bit architecture, only 128 out of 1K elements of the
18Kbit BRAMs are being used; by making use of the independent write port, additional
weight matrices may be loaded while the RBMC performs other calculations.
6.2.2 Distributed Energy Accumulator Core Structure
Another method of reducing communication overhead is to improve the operation of the
EAC. If more FPGAs are added to the system, the single EAC point in the current
architecture will become an increasingly severe bottleneck. In order for the performance
of the architecture to scale well when implemented on many FPGAs, the calculation of the
node states should be distributed in a tree or ring fashion. This would reduce the
communication bottleneck as well as improve the node selection time in the case of the tree
structure.
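The tree arrangement can be illustrated in software with a pairwise reduction over partial energy vectors. This C sketch only shows the order of additions and the number of levels; in hardware each level would be a set of adders operating in parallel, possibly on different FPGAs. The function name and the use of 32-bit accumulators are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum `count` partial energy vectors of length `len` using a binary tree.
 * energies[i] points to vector i; the total ends up in energies[0].
 * Returns the number of tree levels, i.e. sequential addition steps. */
unsigned tree_accumulate(int32_t **energies, size_t count, size_t len)
{
    assert(energies != NULL && count > 0);
    unsigned levels = 0;
    for (size_t stride = 1; stride < count; stride *= 2) {
        /* All additions within one level are independent of each other. */
        for (size_t i = 0; i + stride < count; i += 2 * stride) {
            for (size_t k = 0; k < len; ++k) {
                energies[i][k] += energies[i + stride][k];
            }
        }
        ++levels;
    }
    return levels;
}
```

Four partial energies are combined in two levels and eight in three, whereas a single accumulation point must receive every partial energy in turn.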
Bibliography
[1] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A Fast Learning Algorithm for Deep Belief
Nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[2] J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson, “Generating
Facial Expressions with Deep Belief Nets.”
[3] R. Salakhutdinov and G. Hinton, “Semantic Hashing,” Int. J. Approx. Reasoning,
vol. 50, no. 7, pp. 969–978, 2009.
[4] D. Ly, “A High Performance, Reconfigurable Architecture for Restricted Boltzmann
Machines,” Master’s thesis, University of Toronto, 2009.
[5] S. K. Kim, L. C. McAfee, P. L. McMahon, and K. Olukotun, “A Highly Scalable Restricted
Boltzmann Machine FPGA Implementation,” in International Conference on Field
Programmable Logic and Applications, 2009.
[6] Y. W. Teh and G. E. Hinton, “Rate-coded Restricted Boltzmann Machines for Face
Recognition,” in Advances in Neural Information Processing Systems. MIT
Press, 2001, pp. 908–914.
[7] D. L. Ly and P. Chow, “A High-Performance FPGA Architecture for Restricted
Boltzmann Machines,” in FPGA ’09: Proceeding of the ACM/SIGDA international
symposium on Field programmable gate arrays. New York, NY, USA: ACM, 2009,
pp. 73–82.
[8] R. Raina, A. Madhavan, and A. Y. Ng, “Large-scale Deep Unsupervised Learning
using Graphics Processors,” in ICML ’09: Proceedings of the 26th Annual Interna-
tional Conference on Machine Learning. New York, NY, USA: ACM, 2009, pp.
873–880.
[9] G. E. Hinton, “Connectionist Learning Procedures,” pp. 185–234, 1990.
[10] P. Smolensky, “Information Processing in Dynamical Systems: Foundations of Har-
mony Theory,” pp. 194–281, 1986.
[11] Y. Freund and D. Haussler, “Unsupervised Learning of Distributions on Binary
Vectors Using Two Layer Networks,” Santa Cruz, CA, USA, Tech. Rep., 1994.
[12] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A Learning Algorithm for Boltz-
mann Machines,” Cognitive Science, vol. 9, pp. 147–169, 1985.
[13] G. E. Hinton, “Training Products of Experts by Minimizing Contrastive Diver-
gence,” Neural Comput., vol. 14, no. 8, pp. 1771–1800, 2002.
[14] M. Saldana, A. Patel, C. Madill, D. Nunes, D. Wang, H. Styles, A. Putnam, R. Wit-
tig, and P. Chow, “MPI as an Abstraction for Software-Hardware Interaction for
HPRCs,” in Second International Workshop on High-Performance Reconfigurable
Computing Technology and Applications, 2008.
[15] C. Chang, J. Wawrzynek, and R. W. Brodersen, “BEE2: A High-End Reconfigurable
Computing System.”
[16] J. D. Davis, C. P. Thacker, and C. Chang, “BEE3: Revitalizing computer architec-
ture research,” Tech. Rep., 2009.
[17] J. L. Holt and T. E. Baker, “Back Propagation Simulations using Limited Precision
Calculations,” in International Joint Conference on Neural Networks. Volume II,
Seattle, WA, USA, 1991, pp. 121–126.
[18] Xilinx Inc., “Xilinx UG190 Virtex-5 User Guide,” April 2006, [Revised Nov. 5, 2009].
[19] Xilinx Inc., “Xilinx DS100 Virtex-5 Family Overview,” April 2006, [Revised Feb. 6,
2009].
[20] Y. Liao, “Neural Networks in Hardware: A Survey,” Tech. Rep., 2001.
Appendix A
Outline of MicroBlaze Operation
The following is a pseudocode outline of the C code which is run on each of the four
MicroBlaze processors in Fig. 4.3. A partition is the row or column of the block weight
matrix (Fig. 4.2). If the size of the network is larger than 4n, then the FPGAs must
perform multiple energy calculations before the node states for a partition can be found.
Each context switch required in a partition is a work unit.
Many details including the memory addressing scheme are removed from the following
code for clarity. The pseudocode is intended as a reference for the amount of communica-
tion transfers that occur and as a guideline for the order of operations in the computation
engines.
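As a rough guide to how much communication the outline implies, the C sketch below counts the context switches (weight matrix transfers) in one AGS phase from the PART and WORK definitions used in the pseudocode. The struct and function names are hypothetical, and the count assumes the number of partitions divides evenly among the FPGAs.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t partitions;         /* PART = SIZE / n */
    uint32_t work_units;         /* WORK = PART / number of FPGAs */
    uint32_t switches_per_fpga;  /* weight transfers per AGS phase on one FPGA */
    uint32_t switches_total;     /* weight transfers per AGS phase, all FPGAs */
} rbm_schedule;

/* Derive the loop bounds of the outline and the resulting transfer counts. */
rbm_schedule plan_schedule(uint32_t size, uint32_t n, uint32_t fpgas)
{
    assert(n > 0u && fpgas > 0u && size % n == 0u);
    rbm_schedule s;
    s.partitions = size / n;
    assert(s.partitions % fpgas == 0u);
    s.work_units = s.partitions / fpgas;
    s.switches_per_fpga = s.partitions * s.work_units;
    s.switches_total = s.switches_per_fpga * fpgas;
    return s;
}
```

For a 1024x1024 network on n = 128 hardware this gives 8 partitions, 2 work units and 64 weight transfers per phase in total, compared with 16 for a 512x512 network: doubling the network size quadruples the number of context switches.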
// MPI_Send and MPI_Recv format:
// MPI_Send(<Memory Address>, <Word Count>, <Destination>)
// MPI_Recv(<Memory Address>, <Word Count>, <Source>)
n = intrinsic RBM size
SIZE = Size of virtualized RBM
PART = SIZE / n
WORK = PART / 4
MPI_RANK = Rank of the current MicroBlaze
for (all epochs)
{
//Run for three AGS phases
for (ags = 0 to 3)
{
//Generate Phase
if ( (ags & 0x00000001) == 0)
{
for (p = 0 to PART)
{
for (w = 0 to WORK)
{
MPI_Send(RBMC_Initialization, BATCH*3+1, MPI_RANK+1);
//Send the weights to the RBMC
MPI_Send(weight, n*n/2, MPI_RANK+1);
for (b = 0 to BATCH)
{
//Send the visible nodes
MPI_Send(visible, n/32, MPI_RANK+1);
//Receive the Partial Energy
MPI_Recv(energy, n/2, MPI_RANK+1);
if (w == WORK-1)
{
//The primary MicroBlaze must initialize the EAC
//Before other MicroBlazes send their energy
#if MPI_RANK == 0
MPI_Send(EAC_Initialization, 2, 3);
//Send a message around the ring of FPGAs
//To ensure they are synchronized
MPI_Send(test, 1, 4);
MPI_Recv(test, 1, 8);
for (c = 0 to WORK)
{
MPI_Send(energy, n/2, 3);
}
//Since the rank 0 FPGA initialized the EAC
//It receives the node states
MPI_Recv(hidden, n/32, 3);
//Distribute the Node States
MPI_Send(hidden, n/32, 4);
#else
//Synchronize
#if MPI_RANK == 4
MPI_Recv(test, 1, 0);
MPI_Send(test, 1, 6);
#endif
#if MPI_RANK == 6
MPI_Recv(test, 1, 4);
MPI_Send(test, 1, 8);
#endif
#if MPI_RANK == 8
MPI_Recv(test, 1, 6);
MPI_Send(test, 1, 0);
#endif
//Send Energies
for (c = 0 to WORK)
{
MPI_Send(energy, n/2, 3);
}
//Receive the node states and pass them around the ring
#if MPI_RANK == 4
MPI_Recv(hidden, n/32, 0);
MPI_Send(hidden, n/32, 6);
#endif
#if MPI_RANK == 6
MPI_Recv(hidden, n/32, 4);
MPI_Send(hidden, n/32, 8);
#endif
#if MPI_RANK == 8
MPI_Recv(hidden, n/32, 6);
#endif
#endif //MPI_RANK == 0
}
}
}
}
}
//Reconstruct Phase
else
{
for (p = 0 to PART)
{
for (w = 0 to WORK)
{
MPI_Send(RBMC_Initialization, BATCH*3+1, MPI_RANK+1);
//Send the weights to the RBMC
MPI_Send(weight, n*n/2, MPI_RANK+1);
for (b = 0 to BATCH)
{
//Send the hidden nodes
MPI_Send(hidden, n/32, MPI_RANK+1);
//Receive the Partial Energy
MPI_Recv(energy, n/2, MPI_RANK+1);
if (w == WORK-1)
{
//The primary MicroBlaze must initialize the EAC
//Before other MicroBlazes send their energy
#if MPI_RANK == 0
MPI_Send(EAC_Initialization, 2, 3);
//Send a message around the ring of FPGAs
//To ensure they are synchronized
MPI_Send(test, 1, 4);
MPI_Recv(test, 1, 8);
for (c = 0 to WORK)
{
MPI_Send(energy, n/2, 3);
}
//Since the rank 0 FPGA initialized the EAC
//It receives the node states
MPI_Recv(visible, n/32, 3);
//Distribute the Node States
MPI_Send(visible, n/32, 4);
#else
//Synchronize
#if MPI_RANK == 4
MPI_Recv(test, 1, 0);
MPI_Send(test, 1, 6);
#endif
#if MPI_RANK == 6
MPI_Recv(test, 1, 4);
MPI_Send(test, 1, 8);
#endif
#if MPI_RANK == 8
MPI_Recv(test, 1, 6);
MPI_Send(test, 1, 0);
#endif
//Send Energies
for (c = 0 to WORK)
{
MPI_Send(energy, n/2, 3);
}
//Receive the node states and pass them around the ring
#if MPI_RANK == 4
MPI_Recv(visible, n/32, 0);
MPI_Send(visible, n/32, 6);
#endif
#if MPI_RANK == 6
MPI_Recv(visible, n/32, 4);
MPI_Send(visible, n/32, 8);
#endif
#if MPI_RANK == 8
MPI_Recv(visible, n/32, 6);
#endif
#endif //MPI_RANK == 0
}
}
}
}
}
}
for (p = 0 to PART)
{
for (w = 0 to WORK)
{
//Initialize the RBMC for weight updates
MPI_Send(RBMC_Initialization, BATCH*6+4, MPI_RANK+1);
//Send the weights to the RBMC
MPI_Send(weight, n*n/2, MPI_RANK+1);
//Send the learning rate
MPI_Send(learning_rate, 1, MPI_RANK+1);
for (b = 0 to BATCH)
{
//Send the node states for positive weight update
MPI_Send(visible, n/32, MPI_RANK+1);
MPI_Send(hidden, n/32, MPI_RANK+1);
//Send the node states for negative weight update
MPI_Send(visible, n/32, MPI_RANK+1);
MPI_Send(hidden, n/32, MPI_RANK+1);
}
//Receive the updated weights
MPI_Recv(weight, n*n/2, MPI_RANK+1);
}
}
}