A Universal FPGA-based Floating-point Matrix
Processor for Mobile Systems ∗
Wenqiang Wang, Kaiyuan Guo, Mengyuan Gu, Yuchun Ma, Yu Wang
Electronic Engineering Department, TNLIST, Tsinghua University, Beijing, China
Abstract—FPGA-based acceleration of matrix operations is a promising solution in mobile systems. However, most related work focuses on a certain operation instead of a complete system. In this paper, we explore the possibility of integrating multiple matrix accelerators with a master processor and propose a universal floating-point matrix processor. The processor supports multiple matrix-matrix operations (Level 3 BLAS) and the matrix size is unlimited. The key component of the processor is a shared matrix cache which enables on-chip communication between different accelerators. This structure reduces the external memory bandwidth requirement and improves the overall performance. Considering the performance of the whole system, an asynchronous instruction execution mechanism is further proposed in the hardware-software interface so as to reduce the workload of the master processor. We demonstrate the system on a DE3 development board and achieve a computing performance of about 19 GFLOPS. Experiments show the proposed processor achieves higher performance and energy efficiency than some state-of-the-art embedded processors, including the ARM Cortex A9 and the NIOS II/f soft-core processor. The performance of the processor is even comparable to some desktop processors.
I. INTRODUCTION
In recent years, mobile systems, such as unmanned
aerial vehicles and mobile robots, have been developing rapidly. The
applications on these systems demand low-power,
high-performance matrix computing. For
example, the Kalman filter [1], which is a widely-used method
in robot localization, is composed of a set of matrix operations.
To realize high performance matrix computing, traditional
CPUs and GPUs have been utilized by some software libraries
such as MKL [2] and cuBLAS [3]. However, high power
consumption and complex peripherals restrict their application
in mobile systems. Dedicated ASIC chips achieve the best
energy efficiency, but offer little flexibility.
FPGA achieves a good compromise between energy effi-
ciency and flexibility with reconfigurable logic. Thus FPGA-
based matrix computing becomes a promising solution in
mobile applications. Vector processor [4][5][6] is a widely
researched technology for high performance computing in
FPGA. It can be used for parallel matrix computing by
partitioning the matrix into small vectors. However, the energy
efficiency of vector processors is lower than that of dedicated ma-
* This work was supported by Huawei Innovation Research Program, IBM Research China University Relationship Program, 973 project 2013CB329000, National Science and Technology Major Project (2011ZX03003-003-01), National Natural Science Foundation of China (No. 61373026), and Tsinghua University Initiative Scientific Research Program.
trix accelerators due to complex scheduling and non-coalesced
memory access.
Thus a dedicated accelerator is a better choice when tar-
geting energy-efficient matrix operations. There has been a
lot of work [7][8][9] on the accelerator design for a certain
matrix operation, but very few of them have considered the
integration. Designing a dedicated hardware connection and
dataflow controller for each application can maximize the
computing capacity of these accelerators, but the flexibility
is quite low. In this paper, we propose a universal matrix
computing system by integrating a master processor with
multiple accelerators. The accelerators are connected to a
common memory space and controlled by the master processor
with software programs. Thus the system can adapt to different
applications conveniently by modifying the software programs.
However, the universal structure also brings some chal-
lenges regarding performance. The first challenge is
how to integrate multiple accelerators efficiently. Traditional
accelerators are often designed to load matrices from and write
results back to the external memory. Thus a straightforward
method is to integrate them through multiple ports of an MMU
(Memory Management Unit). We find the bottleneck of this
structure is the external memory bandwidth instead of comput-
ing resources in some applications. To deal with this problem,
we further propose a shared-matrix-cache structure to enable
high-bandwidth on-chip communication. This shared matrix
cache is designed to support various data access patterns,
thus different accelerators can be assigned to the cache and
communicate through the shared cache.
Another challenge is how to reduce the workload of the
master processor. In the proposed structure, the execution
time of each instruction is unpredictable. This brings a new
problem: the master processor stalls while checking hardware
states before sending a new instruction. In this paper, an asyn-
chronous instruction execution mechanism is further proposed
to solve this problem. With a hardware instruction dispenser,
the master processor is freed from continuous status check and
can process other tasks in parallel.
The major contributions of this paper are as follows:
1) We propose an energy efficient and universal matrix pro-
cessor by integrating a master processor with multiple matrix
accelerators. A hierarchical matrix computing technology is
proposed to support unlimited matrix size.
2) A shared-matrix-cache structure is proposed to improve
the computing performance by supporting on-chip communi-
cation between accelerators.
3) To reduce the workload of the master processor, an
asynchronous instruction execution mechanism is proposed to
free the master processor from continuous status checks.
4) We build a demo system on a DE3 development board and
compare it to some state-of-the-art processors including the Intel
i7, ARM Cortex A9 and NIOS II/f processors. The results show
that the proposed system is more energy-efficient and powerful
than the mobile processors. The computing capacity of the
proposed processor is even comparable to the Intel i7 CPU.
The rest of this paper is organized as follows: Section
II introduces the related work and existing problems. In
Section III, the overall structure of the system is introduced.
Section IV describes the detailed hardware implementation of
the processor. In Section V, we show the hardware-software
interface of the system. Section VI describes the performance
of the proposed system and compares it with some state-of-
the-art processors. Section VII concludes the paper.
II. RELATED WORK
Vector processor is a widely researched technology for high
performance computing in FPGA. Most vector processors are
based on a co-processor structure and accelerate 1D vector
operations with an array of general-purpose ALUs [6][5][4].
Vector processors can be used for matrix operations by par-
titioning a matrix into a group of small vectors. However,
the memory access efficiency may decrease when processing
the 2D matrix operations because the required data access
pattern is along two dimensions. Taking matrix multiplication
as an example, the operand matrices should be partitioned
into multiple vectors. Some of these vectors are along the
row direction, while some are along the column direction.
It is quite difficult to make sure these vectors are all stored
consecutively when facing a series of unpredictable matrix
operations. In [10], the vector processor is used for kernel
recursive least squares (KRLS) algorithm in which all the
matrices are symmetric, thus 1D parallel data access is enough.
In [11], a 2D array is proposed to deal with the 2D matrix
operations. However, this structure is actually similar to
the accelerator-based idea: a 1D accelerator and a 2D accelera-
tor. Besides, when processing matrix operations with vector
processors, the workload of the master processor may be
heavy because a single matrix operation translates into multiple
instructions. Some works [12][13] have tried to solve this
problem with a hardware address controller, but flexible 2D parallel
data access is not realized in their designs.
Another category of work that can be used for FPGA-based
matrix operations is dedicated matrix accelerators. Unlike
vector processors, the accelerators try to solve each basic
matrix operation with a dedicated hardware design. Matrix
multiplication is a widely researched [7][8][9][14] matrix
operation. Some special linear algebra problems are also
considered, such as sparse matrix factorization [15], sparse
matrix-vector multiplication [16] and linear solver [17]. With
dedicated matrix accelerators, the 2D parallel data access can
be supported by special cache design. Besides, the accelerators
can process matrix-matrix operations (Level 3 BLAS) directly.
Thus we select accelerators as the basic computing units.
Although the accelerators for a certain matrix operation
have been widely researched, the integration of different
accelerators is seldom considered. In [18], several different
linear algebra operations are discussed, but each of them is
still designed separately. This motivates us to explore how
to build an efficient hardware computing platform which can
support multiple matrix accelerators.
III. SYSTEM DESCRIPTION
A. Motivation Example
Fig. 1. Straightforward implementation of a multi-accelerator system: each accelerator has a dedicated cache and reaches the external memory through an arbitrator and the MMU.
As discussed in Section I, the integration method of acceler-
ators will influence the system performance. Fig. 1 shows the
straightforward integration through MMU. All the accelerators
load the matrices from the external memory, execute opera-
tions, and write the results back to the external memory. For
each accelerator, a dedicated cache is used to support parallel
data access. However, this structure is limited in several aspects:
• No direct communication between the accelerators. The
accelerators must communicate through the external
memory. Thus the external memory bandwidth may be-
come the bottleneck.
• Complicated accelerator design. The cache design is the
key part in many accelerators. It takes a lot of effort to
design an efficient cache with minimal memory utilization.
To illustrate the performance impact, consider a chain of
unary operations: B = op1(A); C = op2(B); D = op3(C). A is the
input matrix and D is the result matrix, while B and C are
temporary matrices. In the straightforward structure, B and C
must be written to and then read back from the external memory.
The minimum execution time t_min is shown in (1).

t_min = (size(A) + 2*size(B) + 2*size(C) + size(D)) / BW    (1)
The function size() represents the size of the matrix and
BW represents the bandwidth of the external memory. Even
with more computing units, the execution time will not be
smaller than t_min because the external memory bandwidth
has become the bottleneck.
Fig. 2. The proposed shared cache structure for a multi-accelerator system: all accelerators access a shared cache through an arbitrator, and the cache connects to the external memory through the MMU.
To solve the problem above, a shared-matrix-cache structure
is proposed in this work to allow on-chip communication, as
shown in Fig. 2. With a shared cache space, the temporary
matrices B and C are only stored in the cache, thus the
minimum execution time t_min is reduced to:

t_min = (size(A) + size(D)) / BW    (2)

For matrices of equal size, the external memory traffic drops
from six matrix transfers to two, so the system performance
can now be improved by adding more computing resources.
Besides, the shared matrix cache structure also simplifies the
accelerator design.
B. Overall Architecture

The overall architecture of the proposed matrix computing
system is shown in Fig. 3. The whole system is composed
of a master processor, multiple accelerators and shared matrix
caches. Each accelerator is designed to accelerate a certain
matrix operation. The master processor is a NIOS II soft-core
processor in the current system. It sends instructions to the
accelerators and receives feedback. The shared matrix caches
are designed to enable high-bandwidth on-chip communication
between the accelerators. To allow parallelism between differ-
ent accelerators, we implement multiple shared matrix caches
in the system. The arbitrator handles the data distribution
between the accelerators and caches.
We propose sophisticated design in both the hardware and
software to realize the system. Firstly, the shared matrix cache
is a key component of the system. We encapsulate a group of
block RAMs as a 2D matrix storage space to support various
data access patterns. Secondly, the hardware-software interface
is specially designed to support unlimited matrix size and the
asynchronous instruction execution mechanism. Finally, we
encapsulate the low-level software programs and provide an
easy-to-use programming interface.
The proposed structure forms a universal matrix computing
system. To adapt to different applications, we only have
to modify the software programs in the master processor.
Besides, the processor supports convenient accelerator de-
sign/insertion/deletion. In different applications, the required
matrix functions may be different. With the help of recon-
figurable logic, we can insert or delete matrix accelerators
according to the application requirement conveniently. In the
proposed processor, the accelerator design is also greatly
simplified without considering the cache design.
Fig. 3. Overall architecture of the system: the NIOS II CPU issues instructions through the dispenser to the DMA engines and accelerators, which share multiple matrix caches via the arbitrator; the MMU connects to the DDR2 memory, and the DE3 development board links to a PC.
IV. HARDWARE IMPLEMENTATION
A. Shared Matrix Cache
The shared matrix cache is a key component in the hardware
structure. The challenge of the shared cache design is to
support various 2D parallel data access patterns. We propose
a sophisticated hardware structure to realize this, as shown in
Fig. 4.
Fig. 4. The hardware structure of the shared matrix cache: a group of block RAMs accessed through two ports (Port A and Port B), each with an address translator and a data adapter for the read/write data and address buses.
Firstly, multiple block RAMs are encapsulated as a matrix s-
torage space, which supports two key features: 2D-coordinate-
based data access and window-based parallel data access.
The low-level storage format of 2D matrix is hidden, thus
it provides a universal storage space for all the accelerators.
Besides, a 2D window in the matrix can be accessed in
parallel. This window can cover most data access patterns of
the accelerators, such as column vector access or row vector
access. The key technology of the matrix storage space is
the projection from the 2D matrix space to the multiple 1D
block RAM storage space. As shown in Fig. 5, the matrix is
partitioned into a set of windows. Each window is stored at the
same address of the block RAMs. Assuming the window size
is H_win * W_win and the maximum matrix width is W_mat, the
correspondence between the 2D matrix coordinate and the block
RAM address is shown in (3).

ID = (y % H_win) * W_win + (x % W_win)
ADDR = ⌊y / H_win⌋ * ⌊W_mat / W_win⌋ + ⌊x / W_win⌋    (3)
Fig. 5. The layout of the matrix storage in the cache. The 2D matrix is divided into multiple windows W(0,0), W(0,1), ...; each window is stored at the same address of the block RAMs (BRAM 0~63 in the demo system), so a whole window can be accessed in parallel.
The term ID indicates which block RAM the data are
stored in, and ADDR gives the address within that block
RAM. This projection is realized by the address translator
module in the cache. To support different accelerators directly,
the window-based access is further extended to four common
data access patterns: pixel access, column vector access, row
vector access and window access. The details of the four types
of data access pattern are shown in Fig. 6. This extension is
realized by the data adapter in the cache.
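To make the projection concrete, the following minimal C++ sketch implements (3); the constant and type names (H_WIN, W_WIN, W_MAT, BramLocation) are ours for illustration and are not part of the hardware description.

#include <cstdio>

// Window geometry of the shared matrix cache (8*8 window in the demo
// system) and the maximum matrix width; names are illustrative only.
constexpr int H_WIN = 8;
constexpr int W_WIN = 8;
constexpr int W_MAT = 256;

struct BramLocation {
    int id;    // which block RAM the element lives in
    int addr;  // address inside that block RAM
};

// Projection (3): map a 2D matrix coordinate (x = column, y = row) to a
// (block RAM, address) pair. All elements of one H_WIN*W_WIN window share
// the same addr, so a whole window can be read in one cycle.
BramLocation translate(int x, int y) {
    BramLocation loc;
    loc.id   = (y % H_WIN) * W_WIN + (x % W_WIN);
    loc.addr = (y / H_WIN) * (W_MAT / W_WIN) + (x / W_WIN);
    return loc;
}

int main() {
    // Example: element (x=10, y=3) lands in BRAM 26 at address 1 (window (0,1)).
    BramLocation loc = translate(10, 3);
    std::printf("BRAM id = %d, address = %d\n", loc.id, loc.addr);
    return 0;
}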
Fig. 6. Flexible data access patterns supported by the matrix cache: pixel, column vector, row vector, and window read/write, addressed by (rowaddr, coladdr).
In the processor, multiple shared caches are implemented
to support accelerator parallelism. An arbitrator handles the
data distribution between the accelerators and the requested data
access ports. With the proposed structure, the block RAMs
are encapsulated as a universal matrix storage space and can
support multiple 2D parallel data access patterns. Thus the
accelerators can be integrated conveniently.
B. Matrix Accelerators
In the universal matrix operation processor, it is convenient
to insert or delete an accelerator. In the current system, we
have designed several basic operations. The structure of these
accelerators will be discussed in this section.
1) Initialization: In the proposed system, a matrix in the
cache can be initialized in three ways:
• DMA. The matrix content can be transferred between the
matrix cache and the external DDR2 memory or the NIOS
II processor;
• From the matrix cache. The matrix can be transferred
inside of the matrix cache;
• Direct initialization. The matrix can be directly initialized
as some basic matrices such as the diagonal matrix.
Two DMA modules are built in the system to handle the
data transmission between the shared matrix cache and the
external DDR2 memory or the NIOS II processor. Besides, a
dedicated matrix initialization accelerator is implemented in
the system to accelerate the cache-to-cache initialization or
direct initialization. The structure of the accelerator is shown
in Fig. 7. The destination matrix can be copied, transposed,
initialized as a diagonal matrix, or initialized as the same
value. A window read port and a window write port are
used in the accelerator, thus the initialization speed is greatly
improved.
Fig. 7. Structure of the matrix initialization accelerator: an address generator driven by the instruction produces rdaddr/wraddr, and a MUX selects among the InitData, InitDiag, InitAll, and Transpose data paths between the 8*8*32-bit window input and output ports. The window size of the shared matrix cache is set as 8*8 here.
2) Array Operations: In array operations, the matrix is
regarded as a long vector and processed in sequence. In the
current system, we have implemented matrix addition, matrix
subtraction, dot multiplication (.∗), and dot division (./). We
use .* and ./ here to distinguish the array operations from 2D
matrix multiplication and division.
The corresponding accelerator is very simple to build on the
universal platform. Two read ports are used to read the operand
matrices in the matrix cache. The outputs of these two ports are
sent to the addition/subtraction/multiplication/division array to
compute the results. The final results are sent back to the
matrix cache through a write port. In the current system, the
three ports are all based on row vector access. This means the
parallelism of the array operations is Wwin (the window width
of the shared cache).
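As a behavioral software model of this dataflow (our own illustration, not the RTL), the sketch below processes one row vector of W_WIN elements per "cycle"; the function name and the flat std::vector storage are assumptions made for illustration.

#include <functional>
#include <vector>

// Behavioral model of the array-operation accelerator: per "cycle" it reads
// one row vector of W_WIN elements from each operand matrix, applies the
// element-wise operation on all lanes in parallel, and writes the result
// row vector back through the write port.
constexpr int W_WIN = 8;   // window width of the shared cache = parallelism

void array_op(const std::vector<float>& a, const std::vector<float>& b,
              std::vector<float>& out, int rows, int cols,
              const std::function<float(float, float)>& op) {
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; c += W_WIN)              // one row vector per cycle
            for (int k = 0; k < W_WIN && c + k < cols; ++k) {
                int idx = r * cols + c + k;                // W_WIN lanes work in parallel
                out[idx] = op(a[idx], b[idx]);
            }
}

// Example: model the dot-multiplication instruction (.*) on a 128*256 matrix:
// array_op(A, B, C, 128, 256, [](float x, float y) { return x * y; });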
3) Multiplication: Matrix multiplication can be realized
by combining a set of array operations. However, matrix
multiplication is often a key operation which may influence the
performance of the whole application. Thus a dedicated matrix
multiplication accelerator is implemented in the processor.
The hardware design of matrix multiplication has been
researched in a lot of work[7][8][9][14]. The dataflow and
cache design are the key parts in these accelerators. In the
proposed processor, the design is greatly simplified because
the cache is ready. We simply select a window-multiply-
column-vector method, as shown in Fig. 8. With the shared
matrix cache, a window in the left matrix and a column vector
in the right matrix can be read out in one clock cycle. These
two parts are multiplied as a column vector. The window is
scanned along the row direction and the column vector is
scanned along the column direction. Along with the scan,
the result column vectors are accumulated by an array of
accumulators. Finally, when the scan of a whole row/column
is over, the result of these accumulators corresponds to a
column vector in the result matrix. The design of the matrix
multiplication is quite simple, but the hardware computing
units are fully utilized.
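The following C++ sketch mirrors this dataflow in software as a reference model (our own illustration with an 8*8 window; the hardware performs one window-vector product per clock cycle).

#include <array>
#include <vector>

constexpr int WIN = 8;  // window size of the matrix cache (8*8 here)

// Behavioral model of the window-multiply-column-vector scheme.
// A, B, C are n*n blocks stored row-major; n is a multiple of WIN.
void block_multiply(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int n) {
    for (int jc = 0; jc < n; ++jc) {                  // one result column at a time
        for (int iw = 0; iw < n; iw += WIN) {          // window rows of the left matrix
            std::array<float, WIN> acc{};              // accumulator array
            for (int kw = 0; kw < n; kw += WIN) {      // window scans along the row,
                                                       // column vector scans down the column
                for (int r = 0; r < WIN; ++r) {        // 8*8 multiply-accumulates per "cycle"
                    float sum = 0.0f;
                    for (int c = 0; c < WIN; ++c)
                        sum += A[(iw + r) * n + (kw + c)] * B[(kw + c) * n + jc];
                    acc[r] += sum;
                }
            }
            for (int r = 0; r < WIN; ++r)              // write one result column vector
                C[(iw + r) * n + jc] = acc[r];
        }
    }
}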
Fig. 8. Structure of the matrix multiplication accelerator: an 8*8 window of the left matrix (Input Port A) is multiplied with an 8*1 column vector of the right matrix (Input Port B); an accumulator array builds the 8*1 result column, which is written back through Output Port C. The window size of the matrix cache is set as 8*8 here.
V. HARDWARE-SOFTWARE INTERFACE
A. Hierarchical Matrix Computing
In the proposed processor, the accelerators can only operate
the matrices stored in the cache. However, in many appli-
cations, the matrices are larger than the cache capacity and
are stored in the external memory. To solve this problem,
the master processor and the hardware accelerators must work
together. A hierarchical processing mechanism is implemented
in the proposed system: the accelerators handle the operations
inside of the cache, while the master processor handles the
higher-level operations.
In the master processor, a big matrix is partitioned into
multiple small blocks which can be stored in the cache. The
master processor sends instructions to load these blocks into
the cache, execute operations, and then write the result blocks
back. Fig. 9 shows the matrix multiplication as an example.
The two input operands, matA and matB, are partitioned
into multiple blocks: blockA(1:M, 1:N) and blockB(1:N, 1:P). M, N, and P represent the numbers of blocks, which
are related to the matrix size. The final result matC is also
partitioned into multiple blocks: blockC(1:M, 1:P). With
basic block matrix operations, the computing procedure is
shown in (4).

blockC(i, j) = Σ_{0 ≤ k < N} blockA(i, k) * blockB(k, j)    (4)
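A minimal sketch of the resulting block-level loop on the master processor is given below; the helper names (load_block, mult_block, accumulate_block, store_block) are hypothetical stand-ins for the DMA transfers and accelerator instructions rather than the actual interface, and for clarity the sketch omits the ping-pong overlap described next.

// Hypothetical block handle and helpers standing in for the DMA transfers
// and accelerator instructions (the real interface issues port-access
// instructions through the dispenser, Section V-B). Bodies are stubs.
struct Block { /* one cache-sized tile, e.g. 128*256 elements */ };

static Block load_block(const float*, int, int) { return Block{}; } // memory -> cache
static void  mult_block(const Block&, const Block&, Block&) {}      // c = a * b
static void  accumulate_block(Block&, const Block&) {}              // acc += partial
static void  store_block(const Block&, float*, int, int) {}         // cache -> memory

// Block-level matrix multiplication following Eq. (4):
//   blockC(i,j) = sum_{0 <= k < N} blockA(i,k) * blockB(k,j)
void block_matmul(const float* matA, const float* matB, float* matC,
                  int M, int N, int P) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < P; ++j) {
            Block acc;                                   // result tile kept in the cache
            for (int k = 0; k < N; ++k) {
                Block a = load_block(matA, i, k);
                Block b = load_block(matB, k, j);
                Block partial;
                mult_block(a, b, partial);
                accumulate_block(acc, partial);          // running sum over k
            }
            store_block(acc, matC, i, j);
        }
}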
To improve the processing speed, the matrix cache space is
divided into six regions for a ping-pong mechanism. As shown
in Fig. 9, the ping buffers pingA, pingB, and pingC are used
for block multiplication, while the pong buffers are used for
accumulation and data transfer. When the matrix multiplication
accelerator operates on the ping buffers, the accumulation and
data transfer are executed in parallel:
• Load blockC(i, j) from the external memory into pongB.
• Add the previous result block, which is stored in pongC, to blockC(i, j).
• Store the new blockC(i, j) to the external memory.
• Load the next operand blocks, blockA(i, k) and blockB(k, j), into the pong buffers.
With this mechanism, the processed matrix size becomes
unlimited without modifying the cache size. However, it also
brings a big problem: the external bandwidth optimization dis-
cussed in Section III-A becomes difficult because the adjacent
instructions always process different blocks. We still take the
example in Section III-A: B = op1(A); C = op2(B); D = op3(C). When the matrices are partitioned into multiple
blocks, the straightforward execution sequence is shown in the first
column of Fig. 10. We can see the blocks of B and C must
be stored to the external memory because the cache content
will be overlapped by the following blocks.
To solve this problem, we have implemented a preliminary
schedule method in the software. We buffer the adjacent array
operations, and send a set of instructions to each block. After
schedule, the new execution sequence is shown in the second
column of Fig. 10. We no longer need to store B and C in
the external memory. In fact, this is a complicated scheduling
problem; we aim at an out-of-order scheduling strategy that
maximizes the cache hit ratio in a future version.

Fig. 9. Block-level schedule of matrix multiplication: the ping buffers (pingA, pingB, pingC) in the matrix cache feed the multiplication accelerator, while the pong buffers (pongA, pongB, pongC) are used for loading operand blocks and accumulating/storing blockC; the two buffer sets are swapped after each step.
Fig. 10. Bandwidth optimization on block-level instructions. Straightforward: three separate loops over the blocks (for i=1:N blockBi=op1(blockAi); then for i=1:N blockCi=op2(blockBi); then for i=1:N blockDi=op3(blockCi)). After schedule: a single loop over i=1:N executing the three operations back-to-back on each block.
B. Asynchronous Instruction Execution
To process large matrices, the master processor must
send multiple instructions for the block-level schedule. In a synchronous
system, these instructions will be executed immediately after
being sent. Thus when the accelerators are busy, the master
processor should keep waiting. It is hard to arrange any other
task for the master processor during this period. As shown in
Fig. 11, a matrix operation in the software will cost the same
time as the hardware accelerators, while most of the time is
used for waiting.
In mobile systems, the master processor often has limited
performance. Besides, there may be other tasks for the
master processor besides the matrix operations. To reduce the
master processor's workload, a hardware-based asynchronous
instruction execution mechanism is proposed in our system.
The instructions are buffered in the external memory, thus
the execution of the instructions is not synchronized with
the master processor. An instruction dispenser is implemented
to check the conflicts between instructions. Whenever a new
instruction can be executed, it is read out from the external
memory and sent to the accelerators. In the software program,
the master processor just needs to send multiple instructions
without considering whether the accelerators are ready. While
the accelerators execute these instructions, the master proces-
sor is free and can do other tasks such as computing the
next block address. The procedure is shown in Fig. 11.
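From the software side, the pattern can be pictured roughly as below; the instruction encoding, buffer, and synchronization names are hypothetical and only the fire-and-forget structure is the point.

#include <cstdint>
#include <vector>

// Hypothetical encoded instruction word and instruction buffer; the real
// buffer lives in the external memory and is drained by the hardware dispenser.
using Instruction = uint64_t;
static std::vector<Instruction> g_ins_buffer;

// Post an instruction without waiting for the accelerators (asynchronous).
void post_instruction(Instruction ins) {
    g_ins_buffer.push_back(ins);   // stand-in for writing the external buffer
}

// The master processor only blocks when it actually needs the results.
void wait_all_done() {
    // stand-in for polling a completion counter exposed by the dispenser
}

void example() {
    post_instruction(/*Ins1*/ 0x1);
    post_instruction(/*Ins2*/ 0x2);
    post_instruction(/*Ins3*/ 0x3);   // all three sent back-to-back
    // ... compute the next block addresses or run other tasks here ...
    wait_all_done();                  // synchronize only when required
}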
Fig. 11. Synchronous and asynchronous instruction execution: with synchronous execution the master processor waits for each instruction to finish before sending the next, while with asynchronous execution it sends Ins1~3 back-to-back and executes other tasks while the accelerators work through them.
C. Programming Interface
In the current system, we take a NIOS II/f soft-core pro-
cessor as the master processor. Altera provides sufficient
support for the NIOS II processor, and we utilize its
software build tools to program the system. At the lowest
level, the instructions in the platform are port access
commands. We encapsulate the low-level schedule
programs and provide a C++ class Mat2D as the high-level
interface. Table I shows the basic functions of the class.
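For example, the array-operation chain C = (A .* B) ./ (A + B) used in Section VI could be written roughly as follows; the Mat2D constructor arguments are an assumption for illustration, since Table I lists only the member functions.

// Sketch of a user program on the NIOS II master processor using the Mat2D
// interface from Table I. The constructor signature Mat2D(rows, cols) and the
// header name are assumed; only the member functions below come from Table I.
#include "Mat2D.h"   // hypothetical header providing the Mat2D class

void example() {
    Mat2D A(512, 512), B(512, 512);
    Mat2D T1(512, 512), T2(512, 512), C(512, 512);

    A.Init(1.5f, false);       // fill A with a constant value
    B.Init(1.0f, true);        // initialize B as a diagonal matrix

    T1.isdotmult(A, B);        // T1 = A .* B   (array multiplication)
    T2.isadd(A, B);            // T2 = A + B    (matrix addition)
    C.isdotdiv(T1, T2);        // C  = T1 ./ T2 (array division)

    Mat2D D(512, 512);
    D.ismult(A, B);            // D = A * B     (matrix multiplication)
}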
VI. EXPERIMENTAL RESULTS
We build a prototype on a DE3 development board [19]. The
FPGA chip on the board is an Altera EP3SL340, with 135,200
ALMs, 576 DSP 18-bit elements and about 16 Mbit of block
RAM. A 1 GB DDR2 memory running at 533 MHz is used
as the main memory. We implement three sets of shared matrix
caches in the demo system to support parallel data access of
ternary operations. For each cache, the window size is 8*8 and
the length of the block RAM is 512. Thus the total storage
capacity is 64*512 = 32768 elements. The 2D cache size can
be configured by the software. It is set as 128*256 in the
current system.
A. Resource Utilization
In the demo system, the window size of the matrix cache
is set as 8*8. This means the matrix multiplication accelerator
can execute 64 multiplications and 64 additions in one cycle.
With a clock frequency of 150 MHz, the theoretical process-
ing capacity of the processor is 150 MHz * 64 FLOP * 2 = 19.2 GFLOPS. Table II shows the resource utilization of the
main modules and the whole system.
From Table II, we can see the parallelism is mainly restrict-
ed by the DSP elements. To explore the potential performance
TABLE I
BASIC FUNCTIONS OF THE PROGRAMMING INTERFACE.

Function               | Declaration                                  | Explanation
Initialization         | void Mat2D::Init(float val, bool isdiag);    | Initialize (*this) to a certain value
Matrix copy            | void Mat2D::copyfrom(Mat2D &A);              | Copy from A
Matrix addition        | void Mat2D::isadd(Mat2D &A, Mat2D &B);       | (*this) = A + B
Matrix subtraction     | void Mat2D::issub(Mat2D &A, Mat2D &B);       | (*this) = A - B
Array multiplication   | void Mat2D::isdotmult(Mat2D &A, Mat2D &B);   | (*this) = A .* B
Array division         | void Mat2D::isdotdiv(Mat2D &A, Mat2D &B);    | (*this) = A ./ B
Matrix multiplication  | void Mat2D::ismult(Mat2D &A, Mat2D &B);      | (*this) = A * B
TABLE II
RESOURCE UTILIZATION OF THE DEMO SYSTEM.

             |      Stratix III (64 multiplication units)        |  Stratix V (256 multiplication units)
Modules      | ALMs         | DSP 18-bit | M9Ks      | M144Ks    | ALMs          | DSP 27-bit | M20Ks
Matrix Cache | 27,925       | 6          | 384       | 0         | 25,234        | 6          | 384
Accelerators | 31,715       | 416        | 9         | 0         | 69,496        | 304        | 18
MMU          | 8,113        | 0          | 44        | 2         | 10,016        | 0          | 63
NIOS II/f    | 3,983        | 4          | 31        | 16        | 3,279         | 2          | 156
Demo system  | 71,930 (53%) | 426 (74%)  | 468 (45%) | 18 (38%)  | 108,025 (63%) | 312 (20%)  | 621 (31%)
TABLE III
PROCESSING TIME OF MATRIX MULTIPLICATION ON DIFFERENT PLATFORMS, MEASURED IN SECONDS.

Platform                |   256  |   512  |  1024  |  2048  |  4096  |  8192  | Average Performance (GFLOPS)
NIOS II/f               | 40.73  | 325.8  | 2606   | -      | -      | -      | 0.0008
VEGAS [4]               | -      | -      | 0.72   | -      | 43.77  | -      | 3.061 (GOPS)^1
Proposed on Stratix III | 0.0018 | 0.0141 | 0.1121 | 0.8965 | 7.171  | 57.37  | 19.06
Intel i7^2              | 0.0004 | 0.0027 | 0.0203 | 0.1327 | 0.9812 | 7.565  | 117.3

^1 The result of VEGAS is based on integer multiplication.
^2 The result of Intel i7 is based on the Windows 7 64-bit operating system, Matlab R2013a.
of the processor, we migrate the design to a Stratix V FP-
GA 5SGSMD5K2 and validate the resource utilization with
compilation tools. With more DSP resources, we update the
matrix multiplication module to compute the multiplication
of an 8*8 window with an 8*4 window in each cycle. Thus
the parallelism is improved to 256 multiplications and 256
additions in one cycle. The theoretical processing capacity is
improved to 150 MHz * 256 FLOP * 2 = 76.8 GFLOPS.
The new resource utilization is also listed in Table II.
B. Performance Evaluation
In this section, the performance of the proposed processor
is evaluated with matrix operations. All the experimental
matrices are bigger than the cache, thus they are stored in the
external DDR2 memory and processed with the hierarchical
matrix computing mechanism discussed in Section V-A.
Matrix multiplication is selected to evaluate the peak com-
puting capacity of the processor. The NIOS II/f processor
and an Intel i7 3770K CPU running at 3.4 GHz are used
for comparison. As shown in Table III, the processing time
is proportional to the cube of matrix size. The proposed
processor shows better performance than the ARM processor
and NIOS II/f soft-core processor. Compared to VEGAS in
[4], the proposed processor shows about 6x speed up with
double resource utilization. Besides, the proposed processor
is based on floating-point, which costs more resources than
the integer computing units. Thus we believe the proposed
processor achieves better resource utilization efficiency.
The fully optimized program on the Intel i7 CPU shows
better performance than the proposed processor. However,
the power consumption of the Intel i7 is also much higher. Table IV
compares the energy efficiency of different platforms. We can
see that the proposed processor outperforms the other three
processors. Besides, the Intel i7 3770K CPU is based on the
22nm VLSI technology, while the demo system uses a 65nm
Altera Stratix III FPGA. To make a fair comparison, we list
the estimated system performance on a newer 28nm Stratix V
FPGA in the last row. We can see the computing capacity
is even comparable with the desktop CPU and the energy
efficiency is much better.
We further evaluate the performance improvement brought
by the two key technologies: shared matrix cache and asyn-
chronous instruction execution. Compared to matrix multipli-
cation, the array operations are more easily limited by the
external memory bandwidth because their computational com-
plexity is lower. Thus we select a set of array operations
in the experiment: C = (A .* B) ./ (A + B) (.* and ./ are the array operations discussed in Section IV-B2). Table
V shows the elapsed time under three conditions: without
shared-cache-based communication, with shared-cache-based
communication, and the software execution time in the master
processor. From the results, we can see the shared cache
TABLE IV
PERFORMANCE COMPARISON AMONG DIFFERENT PLATFORMS.

Platform                | Performance (GFLOPS) | Power (W)  | Energy Efficiency (GFLOPS/W)
NIOS II/f               | 0.0008               | 0.528^1    | 0.0016
ARM Cortex A9^2         | 3.0                  | 7.5/board  | 0.4/board
Intel i7 3770K          | 117.3                | 77^3       | 1.52
Proposed on Stratix III | 19.1                 | 5.81^1     | 3.28
Proposed on Stratix V   | 76.8                 | 4.59^1     | 16.7

^1 The power consumption of the FPGA is estimated by the power estimator [20] provided by Altera.
^2 The result is from [21]. The board power is shown here, thus the energy efficiency of the processor could be a little higher than the listed value.
^3 The power consumption of the Intel i7 is estimated with the thermal design power (TDP). The actual power consumption can be larger than this under full workload.
structure improves the performance when the external band-
width is the bottleneck. Besides, the asynchronous instruction
execution mechanism greatly reduces the workload of the
master processor. Thus the master processor is able to handle
other tasks when the accelerators are busy.
TABLE V
ELAPSED TIME OF ARRAY OPERATIONS, MEASURED IN MILLISECONDS.

Setting              |  256 |  512 |  1024 |  2048 |   4096
Without shared cache | 0.72 | 3.04 | 12.23 | 48.25 | 191.83
With shared cache    | 0.35 | 1.36 |  5.41 | 21.69 |  87.07
Master processor     | 0.08 | 0.30 |  1.17 |  4.87 |  18.67
C. Discussion
From the above experiments, we can see the proposed
matrix processor achieves both high processing speed and high
energy efficiency. In fact, the performance of FPGA-based
matrix operation can still be improved with the update of
devices. In current FPGAs, the floating-point computing units
cost lots of LUTs because there are only integer DSP units on
the chip. This restricts the number of floating-point computing
units and decreases the energy efficiency.
In recent years, floating-point computing has attracted the
attention of FPGA companies. For example, the next gener-
ation Altera Stratix 10 FPGA will have hard-core floating-
point DSP elements [22]. This will increase the floating-
point computing capacity of FPGAs and reduce the energy consumption. We
believe the proposed processor will show a better performance
with the improvement of FPGA technology.
VII. CONCLUSION AND FUTURE WORK
In this paper, we propose an energy efficient and univer-
sal matrix processor by integrating a master processor with
multiple matrix accelerators. A novel shared matrix cache
structure is proposed to improve the performance and flex-
ibility of the overall system. Furthermore, we design a so-
phisticated hardware-software interface to provide an easy-to-
use programming interface and reduce the master processor’s
workload. The proposed processor achieves better performance
than some state-of-the-art processors. In the future, we will
add more accelerators and seek possible mobile applications
to further prove the efficiency of the proposed processor.
REFERENCES
[1] R. E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
[2] http://software.intel.com/en-us/intel-mkl.
[3] https://developer.nvidia.com/cublas.
[4] C. H. Chou, A. Severance, A. D. Brant, Z. Liu, S. Sant, and G. G. Lemieux, "VEGAS: Soft vector processor with scratchpad memory," in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 15–24, ACM, 2011.
[5] A. Severance and G. Lemieux, "VENICE: A compact vector processor for FPGA applications," in Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pp. 245–245, IEEE, 2012.
[6] P. Yiannacouras, J. G. Steffan, and J. Rose, "VESPA: Portable, scalable, and flexible FPGA-based vector processors," in Proceedings of the 2008 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp. 61–70, ACM, 2008.
[7] Z. Jovanovic and V. Milutinovic, "FPGA accelerator for floating-point matrix multiplication," IET Computers & Digital Techniques, vol. 6, no. 4, pp. 249–256, 2012.
[8] T.-C. Lee, M. White, and M. Gubody, "Matrix multiplication on FPGA-based platform," in Proceedings of the World Congress on Engineering and Computer Science, vol. 1, 2013.
[9] S. Kestur, J. D. Davis, and E. S. Chung, "Towards a universal FPGA matrix-vector multiplication architecture," in Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pp. 9–16, IEEE, 2012.
[10] Y. Pang, S. Wang, Y. Peng, N. J. Fraser, and P. H. Leong, "A low latency kernel recursive least squares processor using FPGA technology," in Field-Programmable Technology (FPT), 2013 International Conference on, pp. 144–151, IEEE, 2013.
[11] S. Venkataramani, V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, "Quality programmable vector processors for approximate computing," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1–12, ACM, 2013.
[12] L. G. Bleris, P. D. Vouzis, M. G. Arnold, and M. V. Kothare, "A co-processor FPGA platform for the implementation of real-time model predictive control," in American Control Conference, 2006, pp. 6 pp., IEEE, 2006.
[13] A. Severance and G. Lemieux, "Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor," in Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2013 International Conference on, pp. 1–10, Sept. 2013.
[14] K. K. Matam, H. Le, and V. K. Prasanna, "Energy efficient architecture for matrix multiplication on FPGAs," in Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on, pp. 1–4, IEEE, 2013.
[15] W. Wu, Y. Shan, X. Chen, Y. Wang, and H. Yang, "FPGA accelerated parallel sparse matrix factorization for circuit simulations," in Reconfigurable Computing: Architectures, Tools and Applications, pp. 302–315, Springer, 2011.
[16] Y. Shan, T. Wu, Y. Wang, B. Wang, Z. Wang, N. Xu, and H. Yang, "FPGA and GPU implementation of large scale SpMV," in Application Specific Processors (SASP), 2010 IEEE 8th Symposium on, pp. 64–70, June 2010.
[17] J. Sun, G. D. Peterson, and O. O. Storaasli, "High-performance mixed-precision linear solver for FPGAs," Computers, IEEE Transactions on, vol. 57, no. 12, pp. 1614–1623, 2008.
[18] L. Zhuo and V. Prasanna, "High-performance designs for linear algebra operations on reconfigurable hardware," Computers, IEEE Transactions on, vol. 57, pp. 1057–1071, Aug. 2008.
[19] http://www.altera.com/education/univ/materials/boards/de3/unv-de3-board.html.
[20] http://www.altera.com/support/devices/estimator/st3-estimator/st3-power-estimator.html.
[21] http://www.ll.mit.edu/HPEC/agendas/proc11/Day1/Posters/A-3 Keville.pdf.
[22] http://www.altera.com/devices/fpga/stratix-fpgas/stratix10/stx10-index.jsp.