
A FPGA Implementation of Large Restricted Boltzmann Machines

by

Charles Lo

Supervisor: Paul Chow

April 2010

Abstract

A FPGA Implementation of Large Restricted Boltzmann Machines

Charles Lo

Engineering Science

2010

Restricted Boltzmann Machines (RBMs) are a type of Artificial Neural Network and the fundamental building blocks of Deep Belief Networks (DBNs) [1]. DBNs have been successfully applied to a number of machine learning problems [2, 3, 1]. However, the $O(n^2)$ complexity of training an RBM presents a serious impediment to their use in large applications. Attempts have been made to accelerate the process using custom FPGA hardware [4, 5], but no implementation has been demonstrated to run RBMs of the 1000-2000 nodes necessary for real world applications. This thesis builds upon a virtualized FPGA architecture presented by Ly et al. [4] with the goal of investigating its scalability towards large RBMs. The virtualized architecture time-multiplexes the hardware resources of a single FPGA to implement large virtual RBMs. To maintain the performance gain of the custom hardware in the presence of context switches, a number of approaches were used. The architecture was ported to a faster, more modern FPGA, the data representation was reduced from 32 bits to 16 bits to increase communication throughput, and coarse grain parallelism was provided by extending the architecture to four FPGAs. A sequential benchmark written in C was used to test the performance of the architecture. The analysis shows a strong dependence of performance on the communication overhead between the supervising microprocessor and the hardware cores. Although very little speedup is possible with the implementation presented, this thesis provides a direction for further improvements to the architecture.


Acknowledgements

I would like to express my gratitude to Professor Paul Chow for giving me the opportunity

to work on this project as well as for his guidance over the course of this thesis. I would

also like to thank Daniel Ly for helping to define the direction of this thesis and always

being available to answer my questions. Finally, I am grateful to Chris Madill, Arun

Patel, Manuel Saldana, Geng Liu and Chu Pang for their assistance during the past

year.


Contents

1 Introduction

2 Background
    2.1 Restricted Boltzmann Machine Operation
        2.1.1 Structure
        2.1.2 Alternating Gibbs Sampling
        2.1.3 Energy
        2.1.4 Learning Rules
        2.1.5 Batch Learning
        2.1.6 Summary
        2.1.7 Deep Belief Networks
    2.2 Methods for accelerating Restricted Boltzmann Machines

3 Virtualized FPGA Architecture
    3.1 Partitioning
    3.2 Computational Cores
        3.2.1 Restricted Boltzmann Machine Core
        3.2.2 Energy Accumulator Core and Node Select Core
    3.3 Message Passing Interface

4 Large Restricted Boltzmann Machine Architecture
    4.1 Investigation of Data Bit Widths
    4.2 Memory and Communication Considerations
        4.2.1 Data Storage
        4.2.2 Communication Overhead
    4.3 Extension to Four FPGAs

5 Results and Analysis
    5.1 Test Methods
        5.1.1 Test Setup
        5.1.2 Test Metric
    5.2 Results
        5.2.1 Batch Size vs. Speedup
        5.2.2 Intrinsic RBM Size
        5.2.3 Summary

6 Conclusion
    6.1 Conclusions
    6.2 Future Work
        6.2.1 Weight Matrix Caching
        6.2.2 Distributed Energy Accumulator Core Structure

Bibliography

A Outline of MicroBlaze Operation

List of Tables

5.1 Summary of Performance Measurements

List of Figures

2.1 Structure of a 3x3 Restricted Boltzmann Machine
2.2 Sigmoid Function
2.3 Layout of a Deep Belief Network
3.1 Weight Distribution in BRAM
3.2 Structure of the Virtualized Restricted Boltzmann Machine Architecture
4.1 MicroBlaze PLB Connectivity
4.2 Weight distribution of eight partitions among four FPGAs
4.3 Overall Layout in Four FPGA System. All MPI Ranks are interconnected.
5.1 Mini-batch size vs. speedup
5.2 Mini-batch size vs. speedup without node calculation
5.3 Mini-batch size vs. speedup for a virtual 512x512 network with intrinsic RBM sizes of 64 and 128

Chapter 1

Introduction

The paradigm of machine learning deals with methods that allow a computer to extract complex patterns underlying data. The applications of such methods are extensive, including visual pattern recognition, speech recognition and video game artificial intelligence. One popular method of machine learning is the use of artificial neural networks (ANNs). Such networks roughly model the structure of the biological neural networks in the brain, in that they consist of many simple neurons operating in parallel, connected together through weighted relationships. The activation of the neurons, which depends on the weights and states of connected neurons, determines the reaction of the network to some input. By controlling the value of the weights, the network can be trained to recognize certain patterns or features of a dataset.

Many different types of ANNs exist with different network topologies, activation functions and learning algorithms. A particularly popular architecture is the Restricted Boltzmann Machine (RBM), a stochastic, generative model that has proven to perform well in problems such as face recognition [6]. Recently, it has been shown that when several RBMs are stacked together to form a Deep Belief Network (DBN), an efficient learning algorithm exists to train the entire network [1]. DBNs have the benefit of being able to learn more complex features and have been applied to problems of generating facial expressions [2], semantic hashing of documents [3] and recognition of handwritten digits [1]. Although the learning algorithm is relatively efficient, training the large networks required for the real-world applications above can still take several days or weeks on a general purpose desktop computer [6]. The parallel nature of the RBM architecture makes it well suited to hardware implementation, and several groups have created FPGA and GPU based RBM solutions providing much needed speed-up [7, 8, 5]. In particular, Ly et al.’s FPGA architecture has produced a 145x speed-up relative to a desktop PC [7]. However, it has only implemented relatively small RBM networks of 256x256 neurons, whereas real world applications require much larger networks. For example, the DBN used to recognize handwritten digits [1] contained an RBM of size 2000x510.

The goal of my thesis is to scale up Ly et al.’s FPGA architecture [4] so that it can handle the thousands of neurons necessary in real world DBN applications while maintaining maximum performance. The main bottleneck limiting the current implementation size of the FPGA architecture is the size of the weight matrix. This data structure is necessarily large since each node must be connected through a weight to all of the nodes in the next layer due to the bipartite graph organization of the RBM. To allow for larger networks, my project will first involve adapting the FPGA architecture to a larger, faster FPGA platform. This allows for the possibility of higher clock speeds as well as better interconnects between FPGAs and thus greater performance. In addition, I will investigate the effect of decreasing the bit width of the weights. This could allow more weights to be stored on-chip and an increase in communication bandwidth between computational cores, provided that the network is trainable at lower precision. Finally, by time-multiplexing the resources of four FPGAs, I hope to accelerate the training performance of arbitrary size networks.

Chapter 2

Background

2.1 Restricted Boltzmann Machine Operation

Artificial Neural Networks (ANNs) are models of biological neural networks that rely on the interactions between simple units called neurons or nodes to perform computations. By modifying connections between nodes, ANNs may be taught to model patterns in a set of training data [9]. A Restricted Boltzmann Machine (RBM) [10, 11] is a type of ANN that has recently become popular due to its role as a building block in Deep Belief Networks (DBNs). An interesting property of RBMs is that they are taught to reproduce the training data they are given. The internal model they create allows them to generate new data statistically similar to the training set, and thus an RBM is said to be a generative neural network.

2.1.1 Structure

The RBM consists of two layers of nodes: a visible layer representing the input to the network and a hidden layer. Each node is connected to all of the nodes of the opposite layer through a weighted connection. The real valued weights on the connections are the learning parameters of the RBM, and it is through their adjustment that the network may be trained. We will denote the weight connecting visible node $i$ to hidden node $j$ as $w_{i,j}$. The restriction that nodes of the same type are not interconnected allows for the development of a fast learning algorithm and is one of the properties that separates the RBM from general Boltzmann Machines. Generally, the node states are binary valued. However, some applications benefit from having real valued visible nodes to represent data such as greyscale images.

Figure 2.1: Structure of a 3x3 Restricted Boltzmann Machine.

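With this topology in mind, we can write the elements of an RBM in matrix notation:

$$ W = \begin{bmatrix} w_{0,0} & \cdots & w_{0,J-1} \\ \vdots & \ddots & \vdots \\ w_{I-1,0} & \cdots & w_{I-1,J-1} \end{bmatrix} \tag{2.1} $$

$$ V = [v_0 \cdots v_{I-1}] \tag{2.2} $$

$$ H = [h_0 \cdots h_{J-1}] \tag{2.3} $$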

The RBM is a stochastic neural network in that its node states are determined through a probabilistic function rather than a deterministic one. The function used in an RBM is the sigmoid or logistic function [10] (Fig. 2.2). Thus, the probability of activating a node may be calculated, given that the opposite layer is determined, as the logistic function of its weighted inputs.

Figure 2.2: Sigmoid Function (node activation probability versus node energy)

$$ P(v_i = 1) = \frac{1}{1 + \exp\left(-\sum_{j=0}^{J-1} h_j w_{i,j}\right)} \tag{2.4} $$

$$ P(h_j = 1) = \frac{1}{1 + \exp\left(-\sum_{i=0}^{I-1} v_i w_{i,j}\right)} \tag{2.5} $$
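As an illustration of Eqns. 2.4 and 2.5, the following minimal Python sketch evaluates the activation probabilities and draws a stochastic binary state from them. The weight values, the helper names (`p_hidden`, `p_visible`, `sample_state`) and the use of Python's `random` module are assumptions made for this example only; they are not taken from the thesis.

```python
import math
import random

def logistic(x):
    """Logistic (sigmoid) function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def p_hidden(v, W, j):
    """P(h_j = 1) as in Eq. 2.5: logistic of sum_i v_i * w_{i,j}."""
    return logistic(sum(v[i] * W[i][j] for i in range(len(v))))

def p_visible(h, W, i):
    """P(v_i = 1) as in Eq. 2.4: logistic of sum_j h_j * w_{i,j}."""
    return logistic(sum(h[j] * W[i][j] for j in range(len(h))))

def sample_state(p, rng):
    """Draw a binary node state that is 1 with probability p."""
    return 1 if rng.random() < p else 0

# Illustrative 3x3 weights: rows index visible nodes, columns hidden nodes.
W = [[1.0, -1.0, 0.0],
     [0.5,  0.5, 0.5],
     [1.0,  1.0, -2.0]]
v = [1, 0, 1]

rng = random.Random(0)  # seeded only to make the example repeatable
probs = [p_hidden(v, W, j) for j in range(3)]
h = [sample_state(p, rng) for p in probs]
```

With these weights the weighted inputs of the three hidden nodes are 2, 0 and -2, so the middle node is on with probability exactly one half and the outer two probabilities are complementary.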

2.1.2 Alternating Gibbs Sampling

We can now describe the main operating mode of an RBM, called Alternating Gibbs Sampling (AGS). Looking at Eqns. 2.4 and 2.5, the node states for one layer can be determined as long as the other layer is fixed. In the first AGS phase, the hidden layer is generated based on test data on the visible nodes; the second AGS phase reconstructs the test data by clamping the hidden nodes and stochastically finding the states of the visible nodes. This process can continue to higher order AGS phases as each layer is clamped or determined in turn. To keep track of node states we will denote the AGS phase as a superscript; for example, $V^3$ would represent the visible layer node states at the third AGS phase.
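The alternation between clamping one layer and sampling the other can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the function names are hypothetical, and the convention that a clamped layer keeps its previous states within a phase (so that every phase has both a visible and a hidden state vector) is an assumption.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, W, rng):
    """Odd AGS phase: clamp the visible layer, sample every hidden node."""
    I, J = len(W), len(W[0])
    return [1 if rng.random() < logistic(sum(v[i] * W[i][j] for i in range(I))) else 0
            for j in range(J)]

def sample_visible(h, W, rng):
    """Even AGS phase: clamp the hidden layer, sample every visible node."""
    I, J = len(W), len(W[0])
    return [1 if rng.random() < logistic(sum(h[j] * W[i][j] for j in range(J))) else 0
            for i in range(I)]

def alternating_gibbs(v1, W, phases, rng):
    """Return {phase: (V, H)} for phases 1..phases, starting from data v1."""
    states = {}
    v, h = list(v1), None
    for n in range(1, phases + 1):
        if n % 2 == 1:
            h = sample_hidden(v, W, rng)   # visible layer is clamped
        else:
            v = sample_visible(h, W, rng)  # hidden layer is clamped
        states[n] = (list(v), list(h))
    return states

# Illustrative weights and input vector (assumed values).
W = [[1.0, -1.0, 0.0],
     [0.5,  0.5, 0.5],
     [1.0,  1.0, -2.0]]
states = alternating_gibbs([1, 0, 1], W, phases=3, rng=random.Random(1))
```

In the returned dictionary, `states[3][0]` plays the role of $V^3$ under the assumed convention.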


2.1.3 Energy

The Restricted Boltzmann Machine draws inspiration from the Boltzmann Distribution of statistical mechanics, which describes the probability distribution for a set of states in a system [12]. The state of an RBM is defined by its visible and hidden layers. Thus, given a certain configuration of visible and hidden nodes and fixing the weights, we can define an energy:

$$ E(V,H) = -\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} v_i h_j w_{i,j} \tag{2.6} $$

From the Boltzmann Distribution:

$$ P(V,H) \propto \exp(-E(V,H)) \tag{2.7} $$

The goal of learning in an RBM is to model the training set. This can be accomplished by modifying the weights such that we obtain a Boltzmann distribution where the probability of obtaining configurations with the training vectors is maximized. Looking at the previous equation, we see that to maximize the probability of a training vector $P(V)$ we need to minimize the energy associated with its configurations.

In addition, from the energy equation, we can see that for each visible or hidden node, there is an associated local energy.

$$ E(V,H) = -\sum_{i=0}^{I-1} v_i E_i = -\sum_{j=0}^{J-1} h_j E_j \tag{2.8} $$

$$ E_i = \sum_{j=0}^{J-1} h_j w_{i,j} \tag{2.9} $$

$$ E_j = \sum_{i=0}^{I-1} v_i w_{i,j} \tag{2.10} $$

The local energy for a given node is in fact the weighted sum of the states of the nodes from the opposite layer. Therefore, we can rewrite Eqns. 2.4 and 2.5 as:

$$ P(v_i = 1) = \frac{1}{1 + e^{-E_i}} \tag{2.11} $$

$$ P(h_j = 1) = \frac{1}{1 + e^{-E_j}} \tag{2.12} $$

We can write these local energies more succinctly as members of vectors:

$$ E_V = [E_0 \cdots E_{I-1}] = H \cdot W^T \tag{2.13} $$

$$ E_H = [E_0 \cdots E_{J-1}] = V \cdot W \tag{2.14} $$

Thus the node state vectors $V$ and $H$ are functions of the energy vectors $E_V$ and $E_H$ respectively.
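The relationship between the global energy of Eq. 2.6 and the local energies of Eqns. 2.8 to 2.14 can be checked numerically. The sketch below uses small, arbitrarily chosen weights and node states (assumptions for illustration only) and plain Python lists in place of a linear algebra library.

```python
def local_energies_visible(h, W):
    """E_V = H . W^T (Eq. 2.13): one local energy per visible node."""
    return [sum(h[j] * W[i][j] for j in range(len(h))) for i in range(len(W))]

def local_energies_hidden(v, W):
    """E_H = V . W (Eq. 2.14): one local energy per hidden node."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def total_energy(v, h, W):
    """E(V, H) from Eq. 2.6."""
    return -sum(v[i] * h[j] * W[i][j]
                for i in range(len(v)) for j in range(len(h)))

# Illustrative RBM with 3 visible and 2 hidden nodes.
W = [[1.0, -1.0],
     [0.5,  2.0],
     [-3.0, 0.25]]
v = [1, 1, 0]
h = [0, 1]

E_V = local_energies_visible(h, W)   # [-1.0, 2.0, 0.25]
E_H = local_energies_hidden(v, W)    # [1.5, 1.0]
E = total_energy(v, h, W)            # -1.0

# Both forms of Eq. 2.8 reproduce the global energy.
E_from_visible = -sum(vi * e for vi, e in zip(v, E_V))
E_from_hidden = -sum(hj * e for hj, e in zip(h, E_H))
```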

2.1.4 Learning Rules

Given this concept of state energy, we can find the learning rule for an RBM by differentiating the log probability of obtaining a particular visible layer configuration:

$$ \frac{\partial \log P(V)}{\partial w_{i,j}} = \langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{\infty} \tag{2.15} $$

where $\langle a \rangle^{n}$ denotes the expected value of $a$ at the $n$th AGS phase. From this equation we can see that to increase the probability of training vectors, we can apply the following weight update rule:

$$ \Delta w_{i,j} = \epsilon \left( \langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{\infty} \right) \tag{2.16} $$

where $\epsilon$ is the learning rate. This learning rate must be carefully controlled: weight updates that are too large overshoot the energy minima and never settle into them, while rates that are too small take too long to reach them. One solution is to dynamically decrease the learning rate during training using the process of simulated annealing [12].

Clearly, obtaining a sample from the infinite AGS phase is not computationally feasible. However, it has been shown that we can estimate the infinite phase with a finite one; this is called contrastive divergence (CD) learning [13]. Using CD, we no longer perform exact gradient descent in weight space, but it has been shown to work well even with as few as three AGS phases.

2.1.5 Batch Learning

To have weight updates which represent the entire set of training data, it would be best to calculate the average weight update for the entire training set before committing the change. This type of weight update is called batch learning. For large sets, batch learning would result in long computation times between weight updates. To address this problem, we can reduce the batch size and create mini-batches to increase the update rate at the expense of update precision. At the limit of one training vector per weight update, we are performing on-line learning.

For a batch size of $L$, the learning rule becomes:

$$ \Delta w_{i,j} = \frac{\epsilon}{L} \sum_{l=0}^{L-1} \left( \langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{\infty} \right) \tag{2.17} $$


2.1.6 Summary

The following procedure describes the RBM training operation with three AGS phases and a mini-batch size of $L$. A layer that is clamped during a phase carries its states over from the previous phase, so $H^2 = H^1$ and $V^3 = V^2$.

1. Apply a training vector to the visible layer. This becomes $V^1$ in the next step.

2. AGS Phase 1: Compute the local energy $E_H^1 = V^1 W$ and apply the logistic function to find the node states $H^1 = f(E_H^1)$.

3. Increment the weight update: $\Delta W = \Delta W + (V^1)^T H^1$.

4. AGS Phase 2: Compute the local energy $E_V^2 = H^2 W^T$ and apply the logistic function to find the node states $V^2 = f(E_V^2)$.

5. AGS Phase 3: Compute the local energy $E_H^3 = V^3 W$ and apply the logistic function to find the node states $H^3 = f(E_H^3)$.

6. Decrement the weight update: $\Delta W = \Delta W - (V^3)^T H^3$.

7. Repeat steps 1-6 for each training vector in the mini-batch.

8. Commit the weight update: $W = W + \frac{\epsilon}{L} \Delta W$.

9. Repeat steps 1-8 for each mini-batch in the training set.

It should be noted that the computation of energy in each AGS phase is of complexity $O(n^2)$, and the weight update computation is also $O(n^2)$. This makes training RBMs with thousands of nodes a very time consuming process.
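The procedure above can be sketched in software as follows. This is a minimal, unoptimized Python illustration and not the C benchmark used in this thesis; the function names are hypothetical, and it assumes that the decrement in step 6 uses the reconstructed visible states and the hidden states of the third AGS phase.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_layer(energies, rng):
    """Stochastically select binary node states from local energies."""
    return [1 if rng.random() < logistic(e) else 0 for e in energies]

def energies_hidden(v, W):
    """E_H = V . W"""
    return [sum(v[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

def energies_visible(h, W):
    """E_V = H . W^T"""
    return [sum(h[j] * W[i][j] for j in range(len(W[0]))) for i in range(len(W))]

def train_minibatch(W, batch, eps, rng):
    """Steps 1-8 for one mini-batch; W is updated in place and returned."""
    I, J = len(W), len(W[0])
    dW = [[0.0] * J for _ in range(I)]
    for v1 in batch:                                      # steps 1 and 7
        h1 = sample_layer(energies_hidden(v1, W), rng)    # step 2 (phase 1)
        v2 = sample_layer(energies_visible(h1, W), rng)   # step 4 (phase 2)
        h3 = sample_layer(energies_hidden(v2, W), rng)    # step 5 (phase 3)
        for i in range(I):
            for j in range(J):
                # step 3 (increment) and step 6 (decrement), O(n^2) work
                dW[i][j] += v1[i] * h1[j] - v2[i] * h3[j]
    L = len(batch)
    for i in range(I):
        for j in range(J):
            W[i][j] += eps / L * dW[i][j]                 # step 8
    return W

# Step 9: loop over the mini-batches of a tiny, made-up training set.
rng = random.Random(0)
W = [[0.0, 0.0], [0.0, 0.0]]
training_set = [[[1, 1], [1, 0]], [[0, 1], [1, 1]]]
for batch in training_set:
    train_minibatch(W, batch, eps=0.1, rng=rng)
```

The doubly nested loops over `i` and `j` make the $O(n^2)$ cost of both the energy calculations and the weight update explicit.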

2.1.7 Deep Belief Networks

The Restricted Boltzmann Machine is powerful enough in itself to extract features from test data. However, it becomes even more useful as a part of a Deep Belief Network (DBN). In effect, DBNs consist of multiple RBMs stacked upon each other; the hidden nodes for one layer become the visible nodes for the next, as in Fig. 2.3. The additional layers of hidden nodes are used to model patterns within the patterns generated by earlier layers. Thus, the DBN is able to model more complex features in data. What is interesting about these deep networks is that they can be greedily trained layer by layer using the same efficient algorithm presented above [1].

Figure 2.3: Layout of a Deep Belief Network (RBM 2 stacked on RBM 1)

To get better classification or generative properties, additional training can be performed using wake-sleep or backpropagation algorithms.

2.2 Methods for accelerating Restricted Boltzmann Machines

A number of computationally intensive operations need to be performed during RBM training. In calculating the local energies, a vector-matrix multiplication must be performed, as well as a matrix transposition during even AGS phases. In addition, to evaluate the node states, the non-linear logistic function must be evaluated.

These operations can be slow on sequential general purpose processors. However, there have been a number of attempts to accelerate the process. Of particular interest are three published implementations: one design using the inherent parallelism in Graphics Processing Units (GPUs) [8] and two custom hardware designs implemented on Field Programmable Gate Arrays (FPGAs) [5, 4].

Modern Graphics Processing Units (GPUs) offer several layers of parallelism much greater than standard multi-core CPUs, allowing them to operate on large batches of data at once. In addition, optimized linear algebra packages are available for them. Raina et al. [8] used an NVIDIA GTX 280 GPU with 1GB of RAM and the CUDA Application Layer to accelerate RBM operations and build deep belief networks. On a single RBM of size 4096x11008, they achieved a speed-up of 72.6x over a software implementation using the optimized matrix operation library Goto BLAS running on a 3.16GHz dual-core processor. To minimize data transfer of weights in DBNs, they developed the idea of "overlapping patches". By representing the visible layer as a 2D surface and tiling patches across it, they were able to create local connections between hidden layers where the patches overlapped. Using this method, they were able to build 4-layer DBNs with 96 million parameters. However, the amount of overlapping area decreases as the order of overlap increases, so this method is inherently limited to DBNs of decreasing size for higher layers. In addition, the layers are not fully connected with the overlapping patches method, so this implementation is limited to applications in a subset of DBN problems.

Kim et al. [5] developed a hardware implementation of an RBM on an Altera Stratix III EP3SL340 FPGA. In this design the authors decided to use 16-bit fixed point words to represent the weights, energies and visible node states. The main computational cores of this design were partitioned into groups of adders and multipliers to perform the vector-matrix operation of local energy calculation. To perform the energy calculation for all of the nodes of a given layer in parallel, all of the row or column elements of the weight matrix must be available at the same time and thus must be stored on separately addressable memory elements. To avoid this problem, the authors stored each column of the weight matrix in separate memory blocks such that a single row was available at a time. This allowed the visible energies to be calculated simply using a multiplier and tree adder. Then, by using an accumulator structure to calculate the hidden energies, they did not have to modify their memory structure. To compute the logistic function, a Piecewise Linear Approximation of a Nonlinear function (PLAN) was implemented. When benchmarked against a software implementation running on a 2.4GHz Intel Core 2 system, they achieved a speed-up of 25x over single precision MATLAB code and 30x over double precision. The maximum network size achieved was 512x512.

The final FPGA implementation, by Ly et al. [4], was developed on a Berkeley Emulation Engine 2 hardware platform consisting of five interconnected Virtex-II Pro XC2VP70 FPGAs. In this design, a set of tree adders was used to calculate the visible and hidden energies. The problem of weight addressing was alleviated by storing diagonal sections of the matrix in different memory blocks. In this way, the same set of memory blocks could be used to access a row or column of the weight matrix. The logistic function was performed using a Piecewise Linear Interpolator. One significant difference from the FPGA design by Kim et al. is that the weights and energies are represented as 32-bit fixed point numbers rather than 16-bit ones. In addition, the visible nodes can only be binary valued, whereas they are real valued in Kim et al.’s design. Three different designs were presented: one on a single FPGA running a 128x128 RBM, one using coarse grain parallelism across four FPGAs to run a 256x256 RBM and one time-multiplexing the resources of a single FPGA to realize a 256x256 network. The speed-ups obtained were 61x, 145x and 32x respectively over an optimized C implementation running on a 2.8GHz Pentium 4 processor.

The works described here did not use a common benchmark, so it is difficult to compare performance directly. The GPU implementation has a clear advantage in network size, but the limitations of its overlapping patches technique make it unusable for large, general DBNs. Of the two FPGA applications, the one by Ly et al. has a clear performance advantage, especially considering that it is implemented on older FPGA hardware. Notably, no designs have been published implementing real world DBN applications.

Chapter 3

Virtualized FPGA Architecture

The work in this thesis is built on top of the Virtualized FPGA RBM architecture designed by Ly et al. [4]. In this chapter, some important aspects of the architecture will be discussed.

Custom FPGA hardware cores are able to perform the computations involved in RBM training very quickly, but an FPGA has a finite amount of resources. Therefore, the size of the network a single FPGA can work on is limited. One way to increase the workable network size is by simply adding more FPGAs. However, as the size of the application grows, this method quickly becomes cost and power prohibitive. A better approach is to time-multiplex the hardware to handle problems of almost arbitrary size. The tradeoff in this approach is that a context switch is required to work on different portions of the network. The virtualized RBM architecture that this thesis is based on uses the time-multiplexing approach to work on networks whose size would not normally fit on a single FPGA.

3.1 Partitioning

To use a virtualized system for performing Restricted Boltzmann Machine operations, the computations must first be partitioned into independent work units. By partitioning the visible and hidden vectors into $A$ and $B$ parts respectively, the weight matrix can be broken into a group of block matrices.

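$$ W = \begin{bmatrix} W_{0,0} & \cdots & W_{0,B-1} \\ \vdots & \ddots & \vdots \\ W_{A-1,0} & \cdots & W_{A-1,B-1} \end{bmatrix} \tag{3.1} $$

$$ V = [V_0 \cdots V_{A-1}] \tag{3.2} $$

$$ H = [H_0 \cdots H_{B-1}] \tag{3.3} $$

The energy calculation then becomes:

$$ E_H = V \cdot W = \begin{bmatrix} E_{H_0} \\ \vdots \\ E_{H_{B-1}} \end{bmatrix} = \begin{bmatrix} V_0 \cdot W_{0,0} + \cdots + V_{A-1} \cdot W_{A-1,0} \\ \vdots \\ V_0 \cdot W_{0,B-1} + \cdots + V_{A-1} \cdot W_{A-1,B-1} \end{bmatrix} \tag{3.4} $$

$$ E_V = H \cdot W^T = \begin{bmatrix} E_{V_0} \\ \vdots \\ E_{V_{A-1}} \end{bmatrix} = \begin{bmatrix} H_0 \cdot W_{0,0}^T + \cdots + H_{B-1} \cdot W_{0,B-1}^T \\ \vdots \\ H_0 \cdot W_{A-1,0}^T + \cdots + H_{B-1} \cdot W_{A-1,B-1}^T \end{bmatrix} \tag{3.5} $$

In this configuration, the energies $E_{H_j}$ and $E_{V_i}$ required to calculate a block of node states are now divided into a number of partial energy blocks involving the calculation $V_i W_{i,j}$ or $H_j W_{i,j}^T$. These partial energy calculations can be done independently and later recombined to resolve the node states. Also, by partitioning the weights in this way, the weight update calculations may be performed on each weight matrix $W_{i,j}$ independently as well.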
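To make the recombination of partial energies concrete, the following Python sketch splits a small weight matrix into $A \times B$ blocks, computes each partial energy $V_a W_{a,b}$ independently and accumulates them per block of hidden nodes. The helper names and the integer test values are assumptions chosen for illustration; integer weights are used so that the comparison with the unpartitioned product is exact.

```python
def vec_mat(v, W):
    """Row vector times matrix."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def block(W, a, b, rows, cols):
    """Sub-matrix W_{a,b} of size rows x cols."""
    return [row[b * cols:(b + 1) * cols] for row in W[a * rows:(a + 1) * rows]]

def partitioned_hidden_energy(v, W, A, B):
    """E_H assembled from the partial energies V_a . W_{a,b} of Eq. 3.4."""
    rows, cols = len(W) // A, len(W[0]) // B
    e_h = []
    for b in range(B):
        acc = [0] * cols
        for a in range(A):
            v_a = v[a * rows:(a + 1) * rows]
            partial = vec_mat(v_a, block(W, a, b, rows, cols))
            acc = [x + y for x, y in zip(acc, partial)]  # sum partial energies
        e_h.extend(acc)
    return e_h

# 4x4 virtual RBM split into 2x2 blocks of 2x2 weights each.
W = [[4 * i + j for j in range(4)] for i in range(4)]
v = [1, 0, 1, 1]
full = vec_mat(v, W)
parts = partitioned_hidden_energy(v, W, A=2, B=2)
```

Each call to `vec_mat` on a block corresponds to one independent work unit; only the accumulation step needs the results of several units.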


3.2 Computational Cores

The hardware architecture consists of three major cores. The Restricted Boltzmann

Machine Core (RBMC) performs the primary vector-matrix energy calculation as well

as the weight update. The Energy Accumulator Core (EAC) is used to sum the partial

energies described in the previous section. Finally, the Node Select Core (NSC) evaluates

the node states using the sigmoid function.

3.2.1 Restricted Boltzmann Machine Core

The RBMC performs the $O(n^2)$ energy calculation (Eqns. 2.13, 2.14) and weight update (Eqn. 2.16) steps in $O(n)$ time. To achieve this speed, a number of restrictions are applied to the RBM network.

• The sizes of the visible and hidden layers must be the same to allow for reuse of the same computational logic for both layers.

• Node states must be binary values. This condition allows the use of AND gates in place of multipliers for operations involving the node states.

• Weights and energies use a 32-bit fixed point representation. The choice of fixed point over floating point simplifies the arithmetic logic for energy and weight update calculations.

• The layer size must be a power of two. This limitation allows the use of a binary tree adder when calculating the energy.

The size restrictions on the node layers do not limit the space of problems the architecture is capable of handling, since unused nodes may be added to reach the next power of two. However, maximum effective performance will not be attained unless the sizes of the layers in the application match the previous descriptions. Also, a 32-bit fixed point representation provides a very large range of supported values and would likely not run into overflow or underflow issues unless the radix was chosen poorly. The choice of binary valued node states, on the other hand, does limit the number of real world applications, but the simplicity of the logic required allows for very fast computation in the set of problems the architecture can handle.

Figure 3.1: Weight Distribution in BRAM

With these restrictions in place, the node energies may be calculated by accessing an entire row or column of the weight matrix, performing a logical AND with the node states and feeding the result into a binary tree adder. Due to the pipelined nature of the tree adder, one energy can be produced every clock cycle, thus reducing the computational complexity to $O(n)$. Likewise, an entire row or column of weight updates may be generated in parallel by performing a logical AND between the visible and hidden node states and using the outcome to decide whether or not the learning rate should be applied to the weight updates.
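A software model of this datapath is sketched below. It is only a behavioural illustration under stated assumptions: the function names are hypothetical, integers stand in for fixed point words, and pipelining is not modelled. The gating step mirrors the AND gates and the reduction mirrors a binary tree adder with $\log_2 n$ levels.

```python
def and_mask(weights, states):
    """Gate each weight with a binary node state: w if state == 1, else 0."""
    return [w if s else 0 for w, s in zip(weights, states)]

def tree_add(values):
    """Sum a power-of-two-length list pairwise; return (sum, adder levels)."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        level = [level[k] + level[k + 1] for k in range(0, len(level), 2)]
        depth += 1
    return level[0], depth

def hidden_energy(W, v, j):
    """Local energy of hidden node j: column j of W gated by visible states."""
    column = [W[i][j] for i in range(len(W))]
    return tree_add(and_mask(column, v))

# Illustrative 4x4 integer weights and binary visible states.
W = [[3, -1, 2, 0],
     [1,  4, -2, 5],
     [-6, 2, 7, 1],
     [2,  0, 1, -3]]
v = [1, 0, 1, 1]
energy, levels = hidden_energy(W, v, 0)   # 3 + 0 - 6 + 2 = -1, two levels
```

The power-of-two restriction on the layer size shows up here as the requirement that every level of the tree has an even number of inputs.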

Clearly, an important characteristic of the RBMC is its ability to access a full column or row of the weight matrix in parallel. To facilitate this, for an RBM of size n x n, n physical dual ported Block RAMs (BRAMs) are instantiated, each containing a diagonal of the weight matrix. An example for a 3x3 RBM is shown in Fig. 3.1. Notice that for every row and column, each weight is stored on a different BRAM.
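The property that every row and every column touches each BRAM exactly once can be checked with a small model. The exact index mapping is not given in the text, so the one used below, where weight $w_{i,j}$ lives in BRAM $(j - i) \bmod n$ at address $i$, is a hypothetical choice that merely satisfies the diagonal layout described above.

```python
def bram_location(i, j, n):
    """Hypothetical mapping of w_{i,j} to (BRAM index, address).

    The wrapped diagonal (j - i) mod n selects the BRAM and the row index
    is used as the address inside it.
    """
    return (j - i) % n, i

def brams_for_row(i, n):
    return [bram_location(i, j, n)[0] for j in range(n)]

def brams_for_column(j, n):
    return [bram_location(i, j, n)[0] for i in range(n)]

n = 3
row_brams = brams_for_row(1, n)      # [2, 0, 1]
col_brams = brams_for_column(2, n)   # [2, 1, 0]
```

Because each row and each column maps to a permutation of the BRAM indices, all n weights of a row or column can be read in the same cycle without two reads competing for one memory.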

The parallel storage of weights turns out to be the limiting factor on the size of RBM that can be synthesized on a single FPGA. For weight storage, n BRAMs are required, and an additional n BRAMs are required to store the weight updates.


3.2.2 Energy Accumulator Core and Node Select Core

The EAC and NSC work in tandem to find the node states given a set of partial energies. First, as a stream of partial energies arrives at the EAC, they are summed and stored in a BRAM First In First Out (FIFO) memory structure. Once the total energies have been computed, the EAC sends them to the NSC, which performs node state selection using an approximated sigmoid function and a uniform random number generator. The sigmoid function is calculated using a look up table (LUT) whose output is sent through a pipelined piecewise linear interpolator (PLI) in order to get a better estimate. Once the node states have been determined, they are sent back to the EAC and from there back to the source of the partial energies.
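The combination of a look up table and piecewise linear interpolation can be modelled in a few lines. The table range, the number of segments and the function names below are assumptions for illustration and do not reflect the parameters of the actual NSC; floating point is used in place of fixed point arithmetic.

```python
import math

# Assumed table parameters: 32 uniform segments covering [-8, 8].
X_MIN, X_MAX, SEGMENTS = -8.0, 8.0, 32
STEP = (X_MAX - X_MIN) / SEGMENTS
LUT = [1.0 / (1.0 + math.exp(-(X_MIN + k * STEP))) for k in range(SEGMENTS + 1)]

def sigmoid_pli(x):
    """Approximate the sigmoid by interpolating between LUT entries."""
    if x <= X_MIN:
        return 0.0               # saturate outside the table range
    if x >= X_MAX:
        return 1.0
    pos = (x - X_MIN) / STEP
    k = min(int(pos), SEGMENTS - 1)   # index of the segment containing x
    frac = pos - k                    # position inside the segment
    return LUT[k] + frac * (LUT[k + 1] - LUT[k])

def select_node(energy, u):
    """Node is on when the uniform sample u in [0, 1) is below P(node = 1)."""
    return 1 if u < sigmoid_pli(energy) else 0

# Largest deviation from the exact sigmoid on a fine grid.
max_err = max(abs(sigmoid_pli(x / 100.0) - 1.0 / (1.0 + math.exp(-x / 100.0)))
              for x in range(-1000, 1001))
```

Interpolating between table entries keeps the table small while still giving a much closer estimate than using the nearest LUT value directly.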

3.3 Message Passing Interface

All three cores performing different parts of the RBM calculations must be connected to each other as well as to a supervising microprocessor. In order to provide a simple, high bandwidth communication channel, TMD-MPI [14] was used to connect the cores. TMD-MPI implements a subset of the Message Passing Interface (MPI) standard for embedded systems. This communication layer offers a level of abstraction away from the implementation of the computational cores. Data is sent point to point through a network as packets called messages. Each message has a defined source, destination, tag and word count, where words are 32-bit pieces of data. At initialization, each device on the MPI network is given a specific address called a rank, which is used to route packets through the network. When data is received it is stored in a message queue. Once a hardware core begins to read the message, a new word of the data is available each clock cycle. This operation allows the cores to operate asynchronously and yet still have high bandwidth communication between each other.

The microprocessor has its own Message Passing Engine (MPE) which supports direct memory access (DMA) and burst access to memory. These features allow minimal overhead from the processor, since only four 32-bit words must be sent to the MPE before it may begin streaming data. The MPI connectivity of the full system is shown in Fig. 3.2. The circles represent the MPI hardware and show the ranks of each computing element. The platform used in [4] was a Berkeley Emulation Engine 2 (BEE2) [15] with five Virtex-II Pro XC2VP70 FPGAs in a communication mesh and a hard PowerPC (PPC) processor.

Figure 3.2: Structure of the Virtualized Restricted Boltzmann Machine Architecture (PPC, EAC, RBMC and NSC, connected through MPI ranks R0 to R3)

Chapter 4

Large Restricted Boltzmann Machine Architecture

The goal of this thesis was to investigate the scalability of the FPGA architecture

discussed in the previous chapter and adapt it to handle the size of Restricted Boltzmann

Machines required in real world Deep Belief Network applications. In this chapter, the

design modifications to the existing architecture will be discussed. One of the first steps

taken to increase the performance of the virtualized FPGA architecture was to move the

design to a more modern FPGA. The BEECube Berkeley Emulation Engine 3 (BEE3)

[16] hardware platform was chosen as a logical upgrade from the Virtex II based BEE2.

The BEE3 contains four Xilinx Virtex-5 5VLX155T FPGAs connected in a ring, each

with access to up to 16GB of DDR2 external RAM. The only major design change during

this transition was a switch from a PowerPC processor managing the hardware cores to

soft MicroBlaze processors.

4.1 Investigation of Data Bit Widths

The existing Virtualized Restricted Boltzmann Machine architecture represented weight

and energy values as 32-bit fixed-point numbers. This bit width was a convenient design

choice since the MPI hardware operated with 32-bit data widths and the configurable

dual ported BRAMs supported up to 36-bit data width. However, significant performance

improvements may be realized with a reduction in bit width. For example, given the 32-bit

width of the MPI channel, a bit width of 16 bits would double the number of

weights or energies transferred per clock cycle since two of them could be packed into each

MPI word. A width of 8 bits would allow four times the number of weights or energies

to be transferred. Since the weights must be transferred during every context switch, this

gain in throughput becomes significant in the Virtualized RBM architecture. With data

packing, the operation of the EAC and NSC may also be parallelized to add multiple

energies and calculate multiple node states per clock cycle. Finally, a RBM of size n

requires 2n physical dual ported BRAMs to store the weight and weight update matrices

on the RBMC. If the BRAMs on the FPGA can be split into narrower but more

plentiful dual ported BRAMs, the size of RBM that can be synthesized on a single FPGA

may also be increased. This would significantly increase the performance of the RBMC.
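The packing idea in this paragraph can be sketched in C as follows. The function names and the choice to place the first value in the upper half of the word are assumptions for illustration; the text only states that two 16-bit values (or four 8-bit values) fit in one 32-bit MPI word.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a 16-bit pattern as a signed value without relying on
   implementation-defined narrowing conversions. */
static int16_t to_i16(uint16_t u) {
    return (int16_t)(u < 0x8000u ? (int32_t)u : (int32_t)u - 0x10000);
}

/* Pack two signed 16-bit values into one 32-bit MPI word.
   Placing the first value in the upper half is an assumed convention. */
uint32_t pack2x16(int16_t first, int16_t second) {
    return ((uint32_t)(uint16_t)first << 16) | (uint32_t)(uint16_t)second;
}

/* Recover the two signed 16-bit values from a packed word. */
void unpack2x16(uint32_t word, int16_t *first, int16_t *second) {
    *first  = to_i16((uint16_t)(word >> 16));
    *second = to_i16((uint16_t)(word & 0xFFFFu));
}

/* Number of 32-bit words needed to carry `count` values of width `bits`. */
uint32_t words_needed(uint32_t count, uint32_t bits) {
    assert(bits == 8u || bits == 16u || bits == 32u);
    uint32_t per_word = 32u / bits;
    return (count + per_word - 1u) / per_word;
}
```

For 128 energies this gives 128 words at 32 bits, 64 words at 16 bits and 32 words at 8 bits, which is the factor of two and four described above.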

The drawback of using fewer bits to represent data is the reduction in the range

of possible values. Depending on the RBM application, there exists the possibility of

overflow or underflow. This could lead to problems finding a set of values in weight

space to accurately represent the given training set. To roughly estimate the effect of

using different data widths, a simple experiment was carried out. A RBM was trained in

software with three different signed fixed-point representations: 32-bit with 8 magnitude

bits, 16-bit with 8 magnitude bits and 8-bit with 4 magnitude bits. The network of size

1024x512 was trained for 100 epochs to recognize an image of the number 0.
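The overflow risk described above can be illustrated with a small fixed-point conversion sketch. The bit split is an assumption: one sign bit, `mag` integer (magnitude) bits and the remaining bits as fraction. The saturating behaviour and the names `to_fixed` and `from_fixed` are also hypothetical; the text does not describe how the software experiment handled overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Quantize x to a signed fixed-point code with `total` bits: one sign bit,
   `mag` integer bits and (total - 1 - mag) fraction bits (assumed split).
   Values outside the representable range saturate. */
int32_t to_fixed(double x, int total, int mag) {
    assert(total >= 2 && total <= 32 && mag >= 0 && mag < total);
    int frac = total - 1 - mag;
    double scaled = x * (double)((int64_t)1 << frac);
    double max = (double)(((int64_t)1 << (total - 1)) - 1);
    double min = -(double)((int64_t)1 << (total - 1));
    if (scaled > max) scaled = max;
    if (scaled < min) scaled = min;
    /* Round half away from zero, then truncate toward zero. */
    return (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Convert a fixed-point code back to a real value. */
double from_fixed(int32_t code, int total, int mag) {
    int frac = total - 1 - mag;
    return (double)code / (double)((int64_t)1 << frac);
}
```

Under this assumed split, the 8-bit format with 4 magnitude bits has only 3 fraction bits and saturates just below 16, which suggests why it loses so much more information than the 16-bit format.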

[Figure: Bit Width vs. Average Reconstruction Error. x-axis: Weight and Energy Bit Width; y-axis: Average Reconstruction Error.]

As a comparison metric, the trained networks were fed back the training image and

AGS was run for 1025 phases. If the weights had been set properly, the network would

be able to reproduce the image faithfully. Fig. 4.1 shows the average number of errors

found in a bitwise comparison of the original image versus the reconstructed one over ten

attempts. The reconstruction with 16-bit weights produced a result similar to the case

with 32-bit weights while the network with 8-bit weights failed to reproduce the image

at all. Previous studies [17, 5] on data width reinforce these results and suggest that

16-bits is adequate for many neural network applications. Thus, this project uses 16-bit

representations for weight and energy values.

Due to the choice of 16-bit widths, two energies are packed in each word transmitted

over MPI, and the EAC and NSC were modified to perform the energy summation and node

state calculation in parallel for both incoming energies. The Virtex-5 family Block

RAMs are 36Kbit dual ported modules configurable in a number of width and depth

settings. Each BRAM may also be configured as two independent 18Kbit dual ported

modules. In addition, both the 36Kbit and 18Kbit BRAMs may be configured in simple

dual-port mode in which there is a single dedicated read port and a single dedicated

write port. In this configuration, the 36Kbit BRAM width is doubled to 72 bits and the

18Kbit BRAM width is doubled to 36 bits [18]. The 5VLX155T FPGA has 212 36Kbit

BRAMs and thus 424 18Kbit BRAMs available. Therefore the maximum RBM which

can be implemented on the FPGA is 128x128 using 256 18Kbit BRAMs [19]. The RBMC

only requires a single read port and single write port for each weight storage BRAM; thus

the maximum RBMC is 128x128 with both 32-bit data widths and 16-bit data widths.


[Figure 4.1: MicroBlaze PLB Connectivity. Blocks on FPGA 0: μB, MPMC, PLB, MPE, BRAM, DRAM.]

However, the improvement in communication performance from the reduction

in bit widths is still beneficial to overall performance.
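The BRAM budget argument in this section can be checked with a short calculation. The sketch uses the figures from the text (2n BRAMs for an RBM of size n, 424 18Kbit BRAMs available); restricting n to powers of two and the name `max_rbm_size` are assumptions made for illustration.

```c
/* Largest power-of-two RBM size n such that the 2*n weight and weight-update
   BRAMs fit in the available budget. Returns 0 if not even n = 1 fits.
   Restricting n to powers of two is an assumption of this sketch. */
unsigned max_rbm_size(unsigned brams_available) {
    if (brams_available < 2u) return 0u;
    unsigned n = 1u;
    while (2u * (2u * n) <= brams_available) n *= 2u;
    return n;
}
```

With 424 BRAMs the result is 128 (256 BRAMs used), matching the limit stated above; with only the 212 36Kbit BRAMs it would be 64.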

4.2 Memory and Communication Considerations

4.2.1 Data Storage

A weight matrix for a RBM of size 128x128 with 16-bit weight representation requires

32 KBytes of memory to store. As we increase the RBM network size, the weight matrix

becomes quadratically larger and quickly exceeds the storage resources on a FPGA. In

addition, when training real world applications the number of training vectors can become

very large. For example, the MNIST database with 60 000 images was used while training

a network to recognize handwritten numbers [1]. To store all of this data, off-chip

DDR2 RAM was used. Fig 4.1 shows the local connections to the MicroBlaze processor.

Data is streamed through the processor local bus (PLB) from the DDR2 RAM to the

multi ported memory controller (MPMC) and finally to the PLB MPE core after which

it is distributed to the appropriate computational core through the MPI network.
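The quadratic growth in weight storage mentioned above is easy to quantify. The helper name `weight_matrix_bytes` is hypothetical; the 128x128, 16-bit case reproduces the 32 KByte figure from the text.

```c
#include <stdint.h>

/* Bytes needed to store a visible x hidden weight matrix with `bits` per weight. */
uint64_t weight_matrix_bytes(uint64_t visible, uint64_t hidden, unsigned bits) {
    return visible * hidden * bits / 8u;
}
```

A 1024x1024 network at 16 bits needs 2 MBytes, 64 times the 128x128 case, which is why the full weight matrix is kept in external RAM.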


4.2.2 Communication Overhead

Since much of the data must be located external to the FPGA, additional latency is

introduced during transfers between the MicroBlaze and the computational cores. The

MPE core is designed with direct memory access (DMA) and is able to perform burst

writes and reads to and from the external memory through the PLB. In principle this is

fast, especially for large blocks of data such as the weight matrices. The only overhead

from the MicroBlaze processor is the transmission of four words to set up the MPE

core. However, when transferring smaller batches of data this overhead becomes very

significant since the MicroBlaze is slow compared to the hardware cores. In particular,

node states for a 128x128 system are only four 32-bit words long themselves and a set

of energies is only 64 words long. Therefore operations heavily involving these elements

such as the node state calculation are subject to a significant performance hit.
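The message sizes quoted above follow from the packing scheme: one bit per node state, two 16-bit energies per word and two 16-bit weights per word. The sketch below computes them along with the share of a transfer taken by the four MPE setup words. The function names are hypothetical, and the word-count ratio is only a crude proxy, since the text attributes the real cost to the slow MicroBlaze rather than to the word count itself.

```c
#define MPE_SETUP_WORDS 4u  /* words sent to the MPE before streaming (from the text) */

/* 32-bit words per message for an intrinsic RBM size n. */
unsigned node_state_words(unsigned n) { return n / 32u; }    /* one bit per node */
unsigned energy_words(unsigned n)     { return n / 2u; }     /* two 16-bit energies per word */
unsigned weight_words(unsigned n)     { return n * n / 2u; } /* two 16-bit weights per word */

/* Percentage of all words in a transfer that are MPE setup words (rounded down). */
unsigned setup_share_percent(unsigned payload_words) {
    return 100u * MPE_SETUP_WORDS / (MPE_SETUP_WORDS + payload_words);
}
```

For n = 128 the setup words make up half of a node state transfer, about 5% of an energy transfer and a negligible part of a weight transfer.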

In addition to the overhead from the MicroBlaze, another significant performance

reduction comes from the context switch operation itself. Although the transmission of

weight matrices is relatively fast given DMA, bursting and the two packed 16-bit elements

in each MPI word, the communication time represents a significant period where the

RBMC is idle. One simple way to address this problem is by increasing the mini-batch

size. This allows the weight matrix to remain on the RBMC for a longer period of time

and as batch size increases, theoretically the computation time of the RBMC would

eventually become the limiting factor. One key drawback to using larger batch sizes is

the need to store more partial energies before node state calculations may occur. In this

particular implementation, since the energies are stored on large external DDR2 RAM, this

does not have a significant effect.
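The effect of mini-batch size on RBMC idle time can be pictured with a very simple model, which is an assumption of this sketch and not a measurement from the thesis: each context switch costs a fixed weight transfer time, followed by one compute interval per training vector in the batch.

```c
/* Fraction of a context switch spent transferring weights (RBMC idle),
   under the assumed model: total = t_weights + batch * t_vector. */
double weight_transfer_share(double t_weights, double t_vector, unsigned batch) {
    return t_weights / (t_weights + (double)batch * t_vector);
}
```

In this model the idle share falls toward zero as the batch grows, so the RBMC computation time eventually dominates, as argued above.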


[Figure 4.2: Weight distribution of eight partitions among four FPGAs. The weight matrix W is drawn as an 8x8 grid of blocks W0,0 to W7,7 assigned to FPGA 0, FPGA 1, FPGA 2 and FPGA 3.]

4.3 Extension to Four FPGAs

Additional performance was obtained by using the four FPGAs available on the BEE3

platform to provide coarse grain parallelism. Since communication links between FPGAs

are not as fast as on-chip links, minimizing the inter-FPGA communication was essential

to maintaining performance. In addition, it was important to ensure that the work load

was shared evenly among the FPGAs. Given these two conditions, the weight matrices,

representing the largest data transfer, were assigned to FPGAs where they are only

streamed locally in and out of DDR2 RAM. The partitioning of the matrices shown in

Fig. 4.2 is similar to the weight distribution within the RBMC from Fig. 3.1. The one

difference is that there may be fewer FPGAs than weight matrices and thus multiple

sets of calculations may be required to get all the partial energies for a set of nodes. This

structure allows all of the FPGAs to work together computing either a set of visible or

hidden nodes at once.

The overall system layout is shown in Fig. 4.3. In this configuration, all of the partial

energies from each FPGA must be sent to a single location to be added together and

the node states must then be distributed from that single source back to the rest of the

FPGAs. Given the operation of the EAC, this was the simplest method of connectivity.

However, it results in a communication bottleneck that becomes more significant as FPGAs

are added. In addition, a key limitation in this implementation is network size. A

bug exists in the EAC in which it ceases operation when more than ten partial energies

are delivered to it. Due to time constraints, this bug was not addressed for this thesis.

Therefore, the maximum network size implementable by this system is 8n where n is the

size of RBM synthesized on a single FPGA.

[Figure 4.3: Overall Layout in Four FPGA System. All MPI Ranks are interconnected. Elements shown: μB, EAC, RBMC, NSC and RAM; MPI ranks R0 to R9; FPGA 0 to FPGA 3.]

An outline of the code run on each FPGA is provided in Appendix A.
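The 8n limit can be related to the work split used in Appendix A, where PART = SIZE / n and WORK = PART / 4. The sketch below assumes that each of the four MicroBlazes delivers WORK partial energies to the EAC per node state calculation, which is a reading of the appendix pseudocode; the names `plan_t` and `plan_network` are hypothetical.

```c
#define NUM_FPGAS 4u
#define EAC_MAX_ENERGIES 10u  /* the EAC stops working beyond ten partial energies */

typedef struct {
    unsigned part;        /* partitions per row or column of the block weight matrix */
    unsigned work;        /* work units (context switches) per FPGA per partition */
    unsigned eac_inputs;  /* partial energies delivered to the EAC (assumed 4 * work) */
    int supported;        /* nonzero if the EAC limit is respected */
} plan_t;

/* Work split for a virtual RBM of `size` nodes on hardware of intrinsic size n. */
plan_t plan_network(unsigned size, unsigned n) {
    plan_t p;
    p.part = size / n;
    p.work = p.part / NUM_FPGAS;
    p.eac_inputs = NUM_FPGAS * p.work;
    p.supported = (p.eac_inputs <= EAC_MAX_ENERGIES);
    return p;
}
```

Under this assumption, a size of 8n gives eight partial energies and is supported, while 12n would already deliver twelve and exceed the limit.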

Chapter 5

Results and Analysis

5.1 Test Methods

5.1.1 Test Setup

The design was tested on the BEE3 platform with all computational cores and MicroBlaze

soft processors running at 100MHz. In addition, an external 2GB DDR2-667 RDIMM

module running at 200MHz was connected through a multi ported memory controller

(MPMC) to the processor local bus (PLB) of the MicroBlaze processor. In hardware,

two different network sizes were synthesized: 64x64 and 128x128. Virtual network sizes

of 1024x1024 and 512x512 were tested on the 128x128 system while only 512x512 was

tested on the 64x64 system. The fmax of the 128x128 system reported by Xilinx Synthesis

Tool (XST) was 145.624MHz. However, due to time constraints, system clock frequencies

greater than 100MHz were not explored. In testing the effect of various batch sizes on

performance, batches of 1, 8, 16, 32, 64 and 84 were run.

A sequential implementation written in C was used as a basis for comparison for

relative speedup. The equivalent software versions of the hardware components were

written such that the output of the software benchmark matched the output of the

hardware implementation. The C implementation was compiled using gcc version 4.4.1

with optimization level 2. The benchmark was run on an Intel Core 2 Duo E8400 at 3GHz

on a 32-bit version of Ubuntu Linux running kernel 2.6.31-15.

To record the computation time, the function gettimeofday() was used on the software

implementation. The results of 25 runs were averaged to get the final computation time.

For the hardware implementation, the function MPI_TIME() was used and the results

were averaged over 10 runs.
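A minimal sketch of the software timing procedure is given here, assuming a POSIX system. Only the use of gettimeofday() and the averaging over runs come from the text; the helper names and the callback interface are hypothetical.

```c
#include <stddef.h>
#include <sys/time.h>

/* Elapsed seconds between two gettimeofday() samples. */
double elapsed_seconds(struct timeval start, struct timeval end) {
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_usec - start.tv_usec) / 1e6;
}

/* Average wall-clock time of `runs` calls to `work`. */
double average_runtime(void (*work)(void), int runs) {
    double total = 0.0;
    for (int i = 0; i < runs; i++) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        work();
        gettimeofday(&t1, NULL);
        total += elapsed_seconds(t0, t1);
    }
    return total / runs;
}
```

Calling `average_runtime(train_once, 25)` with a hypothetical `train_once` function would mirror the 25-run average used for the software benchmark.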

5.1.2 Test Metric

Although relative speedup is an interesting measure, it is difficult to compare different

architectures without an absolute measure of performance. One popular method of

measuring neural network training performance is Connection Updates per Second (CUPS)

[20]. This is defined as the number of weight updates per second or

\[ \mathrm{CUPS} = \frac{n^2}{T} \]

(5.1)

where n is the size of node layers and T is the amount of time for all of the weights to

be updated for one test vector.

The speedup over the sequential C implementation was taken to be the ratio of

Connection Updates per Second of the hardware implementation and that of the software

implementation.

\[ S = \frac{\mathrm{CUPS}_h}{\mathrm{CUPS}_s} \]

(5.2)
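The two metrics defined in Equations 5.1 and 5.2 translate directly into code. The function names are hypothetical.

```c
/* Connection updates per second for an n x n RBM whose weights are all
   updated once in t_seconds (Equation 5.1). */
double cups(double n, double t_seconds) {
    return n * n / t_seconds;
}

/* Speedup of the hardware over the software benchmark (Equation 5.2). */
double speedup(double cups_hw, double cups_sw) {
    return cups_hw / cups_sw;
}
```

For example, a 1024x1024 network that updates all weights in one second runs at 1,048,576 CUPS, or roughly 1 MCUPS.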

5.2 Results

5.2.1 Batch Size vs. Speedup

As previously mentioned, one of the major concerns with scaling a RBM architecture

is the O(n^2) growth in the number of context switches and thus data transfers. Fig. 5.1

shows the effect of increasing batch size on overall performance. Both network sizes

plotted were implemented on a system with 128x128 intrinsic core size. From Fig. 5.1,

we can see that as batch size increases the larger 1024x1024 system becomes increasingly

faster than the 512x512 network. We can infer from this result that at a batch size

of one, the weight transfers consume a vast majority of the computation time. As the

network size is increased, the RBMC and EAC/NSC operations begin to take up a greater

percentage of time and the O(n) benefits of those cores become apparent relative to the

O(n^2) software baseline.

[Figure 5.1: Mini-batch size vs. speedup. Plot title: Batch Size vs. Speedup; x-axis: Mini-batch size; y-axis: Speedup; series: 1024x1024, 512x512.]

If the RBM training time was limited purely by the RBMC or EAC/NSC computation

times, we would expect to see a continued increase in speedup with batch size. However,

the plots level off fairly quickly. One other operation that keeps the RBMC idle apart

from the weight transfer is the node state calculation. Particularly in this architecture,

the partial energies must be transferred from all of the FPGAs to one point and the node

states must then be streamed back. This operation must be done synchronously between

FPGAs; thus, some overhead in setting up the timings between FPGAs is required and

the computation must wait for the most delayed FPGA to be ready.


[Figure 5.2: Mini-batch size vs. speedup without node calculation. Plot title: Batch Size vs. Speedup Without Node Selection; x-axis: Mini-batch size; y-axis: Speedup; series: 1024x1024, 512x512.]

As an artificial experiment, the same test was run but without the node calculation

stage. Here, the speedup increases noticeably beyond the last test, but still begins to taper

off before reaching very high performance. Since the hardware cores themselves

are very fast [4], this is likely due once again to communication bottlenecks between the

RAM and the compute cores. From these results, we can see that the communication

overhead due to transfers between the MicroBlaze and the hardware cores limits the

overall system performance. Since the number of communications increases as O(n^2), this

has significant implications.

5.2.2 Intrinsic RBM Size

A second performance factor measured was the effect of changing the intrinsic network

size n. A reduction in n would degrade the performance benefit of the O(n) compute

cores and require more context switches for the same virtualized RBM size. However, it

would also reduce the overall transfer time of data between computations. Fig. 5.3

shows a comparison of running a virtual 512x512 RBM on both n = 64 and n = 128

hardware over varying batch sizes. From this plot we can see that increasing the intrinsic

size of RBM is very beneficial above small batch sizes.

[Figure 5.3: Mini-batch size vs. speedup for a virtual 512x512 network with intrinsic RBM sizes of 64 and 128. Plot title: 512x512 Batch Size vs. Speedup for n; x-axis: Mini-batch size; y-axis: Speedup; series: n = 128, n = 64.]

5.2.3 Summary

The absolute CUPS results are summarized in Table 5.1. It is interesting to note that

CUPS is approximately the same for batch size 1 and n = 128 regardless of virtualized

size. Since CUPS = n^2/T, this means the training time T grows as O(n^2) with the virtualized size.

Platform                   RBM Size    Batch Size   MCUPS
Virtualized FPGA n = 64    512x512     1            60.1
Virtualized FPGA n = 64    512x512     84           314.9
Virtualized FPGA n = 128   512x512     1            79.8
Virtualized FPGA n = 128   512x512     84           769.12
Virtualized FPGA n = 128   1024x1024   1            82.4
Virtualized FPGA n = 128   1024x1024   84           995.3

Table 5.1: Summary of Performance Measurements
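The benefit of a larger batch can be read off Table 5.1 as the ratio of MCUPS at batch size 84 to MCUPS at batch size 1. The sketch below uses only the table values; the type and function names are hypothetical.

```c
/* One configuration from Table 5.1: intrinsic size, virtual size and the
   measured MCUPS at batch sizes 1 and 84. */
typedef struct {
    unsigned n;
    unsigned size;
    double mcups_b1;
    double mcups_b84;
} result_t;

const result_t RESULTS[3] = {
    {  64,  512, 60.1, 314.9  },
    { 128,  512, 79.8, 769.12 },
    { 128, 1024, 82.4, 995.3  },
};

/* Throughput gain from raising the batch size from 1 to 84. */
double batch_gain(const result_t *r) {
    return r->mcups_b84 / r->mcups_b1;
}
```

The gains are roughly 5.2x, 9.6x and 12.1x for the three configurations, so the larger intrinsic and virtual sizes benefit most from batching.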

Chapter 6

Conclusion

6.1 Conclusions

The purpose of this thesis was to investigate the scalability of the virtualized

FPGA architecture of Ly, et al. [4]. The primary impediment to using the architecture for large networks

was the communication overhead required in context switching. As the size of the RBM

grows linearly, the number of transfers increases as O(n^2). To maintain performance,

the architecture was ported to a faster, more modern FPGA platform, the BEE3. The

data representation was also changed from 32 bits to 16 bits in order to reduce the time

required to transfer a set of energies or weights. Finally, the system was implemented

on four FPGAs in order to provide some extra coarse grain parallelism.

When compared to a sequential O(n^2) C benchmark, the results showed several

different communication overhead problems. The speed at low batch sizes was limited by

the weight transfers during AGS phases. At higher batch sizes, the transfer of data to

the EAC became the bottleneck and, if that was removed, additional overhead reduced

performance before a good speedup could be observed. The design presented in this

thesis only achieves a small speedup over software at high batch sizes. However, the analysis

provided may be used as a basis for further improvements to the architecture.

6.2 Future Work

The architecture presented in this thesis still has a great deal of room for improvement.

Primarily, a reduction in the significant communication overheads present in the

virtualized system would allow the system to more fully utilize the computational cores.

6.2.1 Weight Matrix Caching

The transfer of weights during the context switches is a significant performance bottleneck

of this architecture. As shown in this work, the effect of the context switch can be

partially alleviated by using large batch sizes. However, during the transfer from external

DDR2 RAM, the RBMC is still inactive. To reduce the transfer latency, the next weight

matrix to be processed may be cached in a compact structure within the RBMC. Another

possibility is to use the leftover depth of the weight storage BRAMs to cache multiple

weight matrices. In the 128x128 16-bit architecture, only 128 out of 1K elements of the

18Kbit BRAMs are being used; by making use of the independent write port, additional

weight matrices may be loaded while the RBMC performs other calculations.
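The leftover BRAM depth suggests a simple slot scheme for caching weight matrices. The sketch below is one hypothetical way to address such slots and is not part of the implemented design; only the 1K depth and the 128 entries in use come from the text.

```c
/* Number of weight matrices whose columns fit in one BRAM of `depth`
   elements when each matrix uses n elements per BRAM. */
unsigned cache_slots(unsigned depth, unsigned n) {
    return depth / n;
}

/* Base address of a cached matrix within the BRAM (hypothetical layout). */
unsigned slot_base(unsigned slot, unsigned n) {
    return slot * n;
}

/* Next slot to load while the current one is in use, wrapping around. */
unsigned next_slot(unsigned slot, unsigned slots) {
    return (slot + 1u) % slots;
}
```

With a depth of 1K and n = 128, eight matrices could be held at once, so the write port could fill the next slot while the RBMC reads the current one.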

6.2.2 Distributed Energy Accumulator Core Structure

Another method of reducing communication overhead is to improve the operation of the

EAC. If more FPGAs are added to the system, the single EAC point in the current

architecture will become increasingly bottlenecked. In order for the performance of the

architecture to scale well when implemented on many FPGAs, the calculation of the

node states should be distributed in a tree or ring fashion. This would reduce the

communication bottleneck as well as improve the node selection time in the case of the tree

structure.
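The tree arrangement proposed above can be sketched as a pairwise reduction over one partial energy per FPGA. This is an illustrative software model with a hypothetical function name, not a description of any implemented hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Pairwise (tree) reduction of one partial energy per FPGA, in place.
   The total ends up in e[0]; the return value is the number of stages. */
unsigned tree_reduce(int32_t *e, size_t count) {
    unsigned stages = 0;
    for (size_t stride = 1; stride < count; stride *= 2) {
        for (size_t i = 0; i + stride < count; i += 2 * stride)
            e[i] += e[i + stride];
        stages++;
    }
    return stages;
}
```

The additions within one stage are independent, so four FPGAs need two stages and eight need three, compared with a single accumulator that must receive every partial energy in turn.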

Bibliography

[1] G. E. Hinton and S. Osindero, “A Fast Learning Algorithm for Deep Belief Nets,”

Neural Computation, vol. 18, p. 2006, 2006.

[2] J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson, “Generating

Facial Expressions with Deep Belief Nets.”

[3] R. Salakhutdinov and G. Hinton, “Semantic Hashing,” Int. J. Approx. Reasoning,

vol. 50, no. 7, pp. 969–978, 2009.

[4] D. Ly, “A High Performance, Reconfigurable Architecture for Restricted Boltzmann

Machines,” Master’s thesis, University of Toronto, 2009.

[5] S. Kim, MacAfee, P. L. McMahon, and K. Olukoton, “A Highly Scalable Restricted

Boltzmann Machine FPGA Implementation,” in International Conference on Field

Programmable Logic and Applications, 2009.

[6] Y. W. Teh and G. E. Hinton, “Rate-coded Restricted Boltzmann Machines for Face

Recognition,” in Advances in Neural Information Processing Systems. MIT

Press, 2001, pp. 908–914.

[7] D. L. Ly and P. Chow, “A High-Performance FPGA Architecture for Restricted

Boltzmann Machines,” in FPGA ’09: Proceeding of the ACM/SIGDA international

symposium on Field programmable gate arrays. New York, NY, USA: ACM, 2009,

pp. 73–82.

[8] R. Raina, A. Madhavan, and A. Y. Ng, “Large-scale Deep Unsupervised Learning

using Graphics Processors,” in ICML ’09: Proceedings of the 26th Annual International

Conference on Machine Learning. New York, NY, USA: ACM, 2009, pp.

873–880.

[9] G. E. Hinton, “Connectionist Learning Procedures,” pp. 185–234, 1990.

[10] P. Smolensky, “Information Processing in Dynamical Systems: Foundations of Harmony

Theory,” pp. 194–281, 1986.

[11] Y. Freund and D. Haussler, “Unsupervised Learning of Distributions on Binary

Vectors Using Two Layer Networks,” Santa Cruz, CA, USA, Tech. Rep., 1994.

[12] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A Learning Algorithm for Boltzmann

Machines,” Cognitive Science, vol. 9, pp. 147–169, 1985.

[13] G. E. Hinton, “Training Products of Experts by Minimizing Contrastive Divergence,”

Neural Comput., vol. 14, no. 8, pp. 1771–1800, 2002.

[14] M. Saldana, A. Patel, C. Madill, D. Nunes, D. Wang, H. Styles, A. Putnam, R. Wittig,

and P. Chow, “MPI as an Abstraction for Software-Hardware Interaction for

HPRCs,” in Second International Workshop on High-Performance Reconfigurable

Computing Technology and Applications, 2008.

[15] C. Chang, J. Wawrzynek, and R. W. Brodersen, “114 Configurable Computing:

Fabrics and Systems BEE2: A High-End Reconfigurable Computing System.”

[16] J. D. Davis, C. P. Thacker, and C. Chang, “BEE3: Revitalizing computer architecture

research,” Tech. Rep., 2009.

[17] J. L. Holt and T. E. Baker, “Back Propagation Simulations using Limited Precision

Calculations,” in International Joint Conference on Neural Networks. Volume II,

Seattle, WA, USA, 1991, pp. 121–126.

[18] Xilinx Inc., “Xilinx UG190 Virtex-5 User Guide,” April 2006, [Revised Nov. 5, 2009].

[19] Xilinx Inc., “Xilinx DS100 Virtex-5 Family Overview,” April 2006, [Revised Feb. 6,

2009].

[20] Y. Liao, “Neural Networks in Hardware: A Survey,” Tech. Rep., 2001.

Appendix A

Outline of MicroBlaze Operation

The following is a pseudocode outline of the C code which is run on each of the four

MicroBlaze processors in Fig. 4.3. A partition is the row or column of the block weight

matrix (Fig. 4.2). If the size of the network is larger than 4n, then the FPGAs must

perform multiple energy calculations before the node states for a partition can be found.

Each context switch required in a partition is a work unit.

Many details including the memory addressing scheme are removed from the following

code for clarity. The pseudocode is intended as a reference for the amount of communication

transfers that occur and as a guideline for the order of operations in the computation

engines.

// MPI_Send and MPI_Recv format:

// MPI_Send(<Memory Address>, <Word Count>, <Destination>)

// MPI_Recv(<Memory Address>, <Word Count>, <Source>)

n = intrinsic RBM size

SIZE = Size of virtualized RBM

PART = SIZE / n

WORK = PART / 4

MPI_RANK = Rank of the current MicroBlaze

for (all epochs)

{

//Run for three AGS phases

for (ags = 0 to 3)

{

//Generate Phase

if ( (ags & 0x00000001) == 0)

{

for (p = 0 to PART)

{

for (w = 0 to WORK)

{

MPI_Send(RBMC_Initialization, BATCH*3+1, MPI_RANK+1);

//Send the weights to the RBMC

MPI_Send(weight, n*n/2, MPI_RANK+1);

for (b = 0 to BATCH)

{

//Send the visible nodes

MPI_Send(visible, n/32, MPI_RANK+1);

//Receive the Partial Energy

MPI_Recv(energy, n/2, MPI_RANK+1);

if (w == WORK-1)

{

//The primary MicroBlaze must initialize the EAC

//Before other MicroBlazes send their energy

#if MPI_RANK == 0

MPI_Send(EAC_Initialization, 2, 3);

//Send a message around the ring of FPGAs

//To ensure they are synchronized


MPI_Send(test, 1, 4);

MPI_Recv(test, 1, 8);

for (c = 0 to WORK)

{

MPI_Send(energy, n/2, 3);

}

//Since the rank 0 FPGA initialized the EAC

//It receives the node states

MPI_Recv(hidden, n/32, 3);

//Distribute the Node States

MPI_Send(hidden, n/32, 4);

#else

//Synchronize

#if MPI_RANK == 4



#endif

#if MPI_RANK == 6



#endif

#if MPI_RANK == 8



#endif

//Send Energies

for (c = 0 to WORK)

{


}

//Receive Weights

#if MPI_RANK == 4



#endif

#if MPI_RANK == 6




#endif

#if MPI_RANK == 8


#endif

}

}

}

}

}

//Reconstruct Phase

else

{

for (p = 0 to PART)

{

for (w = 0 to WORK)

{





{

//Send the hidden nodes

MPI_Send(hidden, n/32, MPI_RANK+1);

//Receive the Partial Energy

MPI_Recv(energy, n/2, MPI_RANK+1);

if (w == WORK-1)

{

//The primary MicroBlaze must initialize the EAC

//Before other MicroBlazes send their energy

#if MPI_RANK == 0

MPI_Send(EAC_Initialization, 2, 3);

//Send a message around the ring of FPGAs

//To ensure they are synchronized



for (c = 0 to WORK)


{


}

//Since the rank 0 FPGA initialized the EAC

//It receives the node states

MPI_Recv(visible, n/32, 3);

//Distribute the Node States

MPI_Send(visible, n/32, 4);

#else

//Synchronize

#if MPI_RANK == 4



#endif

#if MPI_RANK == 6



#endif

#if MPI_RANK == 8



#endif

//Send Energies

for (c = 0 to WORK)

{


}

//Receive Weights

#if MPI_RANK == 4



#endif

#if MPI_RANK == 6



#endif

#if MPI_RANK == 8



#endif

}

}

}

}

}

}

for (p = 0 to PART)

{

for (w = 0 to WORK)

{

//Initialize the RBMC for weight updates




//Send the learning rate

MPI_Send(learning_rate, 1, MPI_RANK+1);


{

//Send the node states for positive weight update



//Send the node states for negative weight update



}

//Receive the updated weights

MPI_Recv(weight, n*n/2, MPI_RANK+1);

}

}

}
