A Universal FPGA-based Floating-point Matrix
Processor for Mobile Systems ∗
Wenqiang Wang, Kaiyuan Guo, Mengyuan Gu, Yuchun Ma, Yu Wang
Electronic Engineering Department, TNLIST, Tsinghua University, Beijing, China
Abstract—FPGA-based acceleration of matrix operations is a promising solution in mobile systems. However, most related work focuses on a certain operation instead of a complete system. In this paper, we explore the possibility of integrating multiple matrix accelerators with a master processor and propose a universal floating-point matrix processor. The processor supports multiple matrix-matrix operations (Level 3 BLAS) and the matrix size is unlimited. The key component of the processor is a shared matrix cache which enables on-chip communication between different accelerators. This structure reduces the external memory bandwidth requirement and improves the overall performance. Considering the performance of the whole system, an asynchronous instruction execution mechanism is further proposed in the hardware-software interface so as to reduce the workload of the master processor. We demonstrate the system on a DE3 development board and achieve a computing performance of about 19 GFLOPS. Experiments show the proposed processor achieves higher performance and energy efficiency than some state-of-the-art embedded processors, including the ARM Cortex A9 and the NIOS II/f soft-core processor. The performance of the processor is even comparable to some desktop processors.
I. INTRODUCTION
In recent years, mobile systems, such as unmanned
aerial vehicles and mobile robots, have been developing rapidly. The
applications on these systems demand low-power,
high-performance matrix computing. For
example, the Kalman filter [1], which is a widely-used method
in robot localization, is composed of a set of matrix operations.
To realize high performance matrix computing, traditional
CPUs and GPUs have been utilized by some software libraries
such as MKL [2] and cuBLAS [3]. However, high power
consumption and complex peripherals restrict their application
in mobile systems. Dedicated ASIC chips achieve the best
energy efficiency, but offer little flexibility.
FPGA achieves a good compromise between energy effi-
ciency and flexibility with reconfigurable logic. Thus FPGA-
based matrix computing becomes a promising solution in
mobile applications. Vector processor [4][5][6] is a widely
researched technology for high performance computing in
FPGA. It can be used for parallel matrix computing by
partitioning the matrix into small vectors. However, the energy
efficiency of vector processors is lower than that of dedicated ma-
* This work was supported by Huawei Innovation Research Program, IBM Research China University Relationship Program, 973 project 2013CB329000, National Science and Technology Major Project (2011ZX03003-003-01), National Natural Science Foundation of China (No. 61373026), and Tsinghua University Initiative Scientific Research Program.
trix accelerators due to complex scheduling and non-coalesced
memory access.
Thus a dedicated accelerator is a better choice when tar-
geting energy-efficient matrix operations. There has been a
lot of work [7][8][9] on the accelerator design for a certain
matrix operation, but very few of them have considered the
integration. Designing a dedicated hardware connection and
dataflow controller for each application can maximize the
computing capacity of these accelerators, but the flexibility
is quite low. In this paper, we propose a universal matrix
computing system by integrating a master processor with
multiple accelerators. The accelerators are connected to a
common memory space and controlled by the master processor
with software programs. Thus the system can adapt to different
applications conveniently by modifying the software programs.
However, the universal structure also brings some chal-
lenges regarding performance. The first challenge is
how to integrate multiple accelerators efficiently. Traditional
accelerators are often designed to load matrices from and write
results back to the external memory. Thus a straightforward
method is to integrate them through multiple ports of an MMU
(Memory Management Unit). We find the bottleneck of this
structure is the external memory bandwidth instead of comput-
ing resources in some applications. To deal with this problem,
we further propose a shared-matrix-cache structure to enable
high-bandwidth on-chip communication. This shared matrix
cache is designed to support various data access patterns,
thus different accelerators can be assigned to the cache and
communicate through the shared cache.
Another challenge is how to reduce the workload of the
master processor. In the proposed structure, the execution
time of each instruction is unpredictable. This brings a new
problem: the master processor stalls while checking hardware
states before sending a new instruction. In this paper, an asyn-
chronous instruction execution mechanism is further proposed
to solve this problem. With a hardware instruction dispenser,
the master processor is freed from continuous status check and
can process other tasks in parallel.
The major contributions of this paper are as follows:
1) We propose an energy efficient and universal matrix pro-
cessor by integrating a master processor with multiple matrix
accelerators. A hierarchical matrix computing technology is
proposed to support unlimited matrix size.
2) A shared-matrix-cache structure is proposed to improve
the computing performance by supporting on-chip communi-
cation between accelerators.
3) To reduce the workload of the master processor, an
asynchronous instruction execution mechanism is proposed to
free the master processor from continuous status checks.
4) We build a demo system on a DE3 development board and
compare it to some state-of-the-art processors including the Intel
i7, ARM Cortex A9 and NIOS II/f processors. The results show
that the proposed system is more energy-efficient and powerful
than the mobile processors. The computing capacity of the
proposed processor is even comparable to the Intel i7 CPU.
The rest of this paper is organized as follows: Section
II introduces the related work and existing problems. In
Section III, the overall structure of the system is introduced.
Section IV describes the detailed hardware implementation of
the processor. In Section V, we show the hardware-software
interface of the system. Section VI describes the performance
of the proposed system and compares it with some state-of-
the-art processors. Section VII concludes the paper.
II. RELATED WORK
Vector processor is a widely researched technology for high
performance computing in FPGA. Most vector processors are
based on a co-processor structure and accelerate 1D vector
operations with an array of general-purpose ALUs [6][5][4].
Vector processors can be used for matrix operations by par-
titioning a matrix into a group of small vectors. However,
the memory access efficiency may decrease when processing
the 2D matrix operations because the required data access
pattern is along two dimensions. Taking matrix multiplication
as an example, the operand matrices should be partitioned
into multiple vectors. Some of these vectors are along the
row direction, while some are along the column direction.
It is quite difficult to make sure these vectors are all stored
consecutively when facing a series of unpredictable matrix
operations. In [10], the vector processor is used for kernel
recursive least squares (KRLS) algorithm in which all the
matrices are symmetric, thus 1D parallel data access is enough.
In [11], a 2D array is proposed to deal with the 2D matrix
operations. However, this structure is actually similar to
the accelerator-based idea: a 1D accelerator and a 2D accelera-
tor. Besides, when processing matrix operations with vector
processors, the workload of the master processor may be
heavy because a single matrix operation translates into multiple
instructions. Some works [12][13] have tried to solve this
problem with a hardware address controller, but flexible 2D parallel
data access is not realized in their designs.
Another category of work that can be used for FPGA-based
matrix operations is dedicated matrix accelerators. Unlike
vector processors, the accelerators try to solve each basic
matrix operation with a dedicated hardware design. Matrix
multiplication is a widely researched [7][8][9][14] matrix
operation. Some special linear algebra problems are also
considered, such as sparse matrix factorization [15], sparse
matrix-vector multiplication [16] and linear solver [17]. With
dedicated matrix accelerators, the 2D parallel data access can
be supported by special cache design. Besides, the accelerators
can process matrix-matrix operations (Level 3 BLAS) directly.
Thus we select accelerators as the basic computing units.
Although the accelerators for a certain matrix operation
have been widely researched, the integration of different
accelerators is seldom considered. In [18], several different
linear algebra operations are discussed, but each of them is
still designed separately. This motivates us to explore how
to build an efficient hardware computing platform which can
support multiple matrix accelerators.
III. SYSTEM DESCRIPTION
A. Motivation Example
Fig. 1. Straightforward implementation of a multi-accelerator system: each accelerator has a dedicated cache and reaches the external memory through an arbitrator and the MMU.
As discussed in Section I, the integration method of acceler-
ators will influence the system performance. Fig. 1 shows the
straightforward integration through MMU. All the accelerators
load the matrices from the external memory, execute opera-
tions, and write the results back to the external memory. For
each accelerator, a dedicated cache is used to support parallel
data access. However, this structure is limited in several aspects:
• No direct communication between the accelerators. The
accelerators must communicate through the external
memory. Thus the external memory bandwidth may be-
come the bottleneck.
• Complicated accelerator design. The cache design is the
key part in many accelerators. It takes a lot of effort to
design an efficient cache with minimal memory utilization.
To illustrate the performance impact, consider a chain of
unary operations: B = op1(A); C = op2(B); D = op3(C). A is the
input matrix and D is the result matrix, while B and C are
temporary matrices. In the straightforward structure, B and C
must be written to and then read back from the external memory.
The minimum execution time t_min is shown in (1).

t_min = (size(A) + 2*size(B) + 2*size(C) + size(D)) / BW    (1)
The function size() represents the size of the matrix and
BW represents the bandwidth of the external memory. Even
with more computing units, the execution time will not be
smaller than t_min because the external memory bandwidth
has become the bottleneck.
Fig. 2. The proposed shared cache structure for a multi-accelerator system: all accelerators access a shared cache through an arbitrator, and the cache connects to the external memory through the MMU.
To solve the problem above, a shared-matrix-cache structure
is proposed in this work to allow on-chip communication, as
shown in Fig. 2. With a shared cache space, the temporary
matrices B and C are only stored in the cache, thus the
minimum execution time t_min is reduced to:

t_min = (size(A) + size(D)) / BW    (2)

For matrices of equal size, the external memory traffic drops
from six matrix transfers to two, so the system performance
can now be improved by adding more computing resources.
Besides, the shared matrix cache structure also simplifies the
accelerator design.
B. Overall Architecture

The overall architecture of the proposed matrix computing
system is shown in Fig. 3. The whole system is composed
of a master processor, multiple accelerators and shared matrix
caches. Each accelerator is designed to accelerate a certain
matrix operation. The master processor is a NIOS II soft-core
processor in the current system. It sends instructions to the
accelerators and receives feedback. The shared matrix caches
are designed to enable high-bandwidth on-chip communication
between the accelerators. To allow parallelism between differ-
ent accelerators, we implement multiple shared matrix caches
in the system. The arbitrator handles the data distribution
between the accelerators and caches.
We propose sophisticated design in both the hardware and
software to realize the system. Firstly, the shared matrix cache
is a key component of the system. We encapsulate a group of
block RAMs as a 2D matrix storage space to support various
data access patterns. Secondly, the hardware-software interface
is specially designed to support unlimited matrix size and the
asynchronous instruction execution mechanism. Finally, we
encapsulate the low-level software programs and provide an
easy-to-use programming interface.
The proposed structure forms a universal matrix computing
system. To adapt to different applications, we only have
to modify the software programs in the master processor.
Besides, the processor supports convenient accelerator de-
sign/insertion/deletion. In different applications, the required
matrix functions may be different. With the help of recon-
figurable logic, we can insert or delete matrix accelerators
according to the application requirement conveniently. In the
proposed processor, the accelerator design is also greatly
simplified without considering the cache design.
Fig. 3. Overall architecture of the system: the NIOS II CPU issues instructions through the dispenser to the DMA engines and accelerators, which share multiple matrix caches via the arbitrator; the MMU connects to the DDR2 memory, and the DE3 development board links to a PC.
IV. HARDWARE IMPLEMENTATION
A. Shared Matrix Cache
The shared matrix cache is a key component in the hardware
structure. The challenge of the shared cache design is to
support various 2D parallel data access patterns. We propose
a sophisticated hardware structure to realize this, as shown in
Fig. 4.
Fig. 4. The hardware structure of the shared matrix cache: a group of block RAMs accessed through two ports (Port A and Port B), each with an address translator and a data adapter for the read/write data and address buses.
Firstly, multiple block RAMs are encapsulated as a matrix s-
torage space, which supports two key features: 2D-coordinate-
based data access and window-based parallel data access.
The low-level storage format of 2D matrix is hidden, thus
it provides a universal storage space for all the accelerators.
Besides, a 2D window in the matrix can be accessed in
parallel. This window can cover most data access patterns of
the accelerators, such as column vector access or row vector
access. The key technology of the matrix storage space is
the projection from the 2D matrix space to the multiple 1D
block RAM storage space. As shown in Fig. 5, the matrix is
partitioned into a set of windows. Each window is stored at the
same address of the block RAMs. Assuming the window size
is H_win * W_win and the maximum matrix width is W_mat, the
correspondence between the 2D matrix coordinate and the block
RAM address is shown in (3).

ID = (y % H_win) * W_win + (x % W_win)
ADDR = ⌊y / H_win⌋ * ⌊W_mat / W_win⌋ + ⌊x / W_win⌋    (3)
Fig. 5. The layout of the matrix storage in the cache. The 2D matrix is divided into multiple windows W(0,0), W(0,1), ...; each window is stored at the same address of the block RAMs (BRAM 0~63 in the demo system), so a whole window can be accessed in parallel.
The term ID indicates which block RAM the data are
stored in, and ADDR gives the address within that block
RAM. This projection is realized by the address translator
module in the cache. To support different accelerators directly,
the window-based access is further extended to four common
data access patterns: pixel access, column vector access, row
vector access and window access. The details of the four types
of data access pattern are shown in Fig. 6. This extension is
realized by the data adapter in the cache.
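To make the projection concrete, the following minimal C++ sketch implements (3); the constant and type names (H_WIN, W_WIN, W_MAT, BramLocation) are ours for illustration and are not part of the hardware description.

#include <cstdio>

// Window geometry of the shared matrix cache (8*8 window in the demo
// system) and the maximum matrix width; names are illustrative only.
constexpr int H_WIN = 8;
constexpr int W_WIN = 8;
constexpr int W_MAT = 256;

struct BramLocation {
    int id;    // which block RAM the element lives in
    int addr;  // address inside that block RAM
};

// Projection (3): map a 2D matrix coordinate (x = column, y = row) to a
// (block RAM, address) pair. All elements of one H_WIN*W_WIN window share
// the same addr, so a whole window can be read in one cycle.
BramLocation translate(int x, int y) {
    BramLocation loc;
    loc.id   = (y % H_WIN) * W_WIN + (x % W_WIN);
    loc.addr = (y / H_WIN) * (W_MAT / W_WIN) + (x / W_WIN);
    return loc;
}

int main() {
    // Example: element (x=10, y=3) lands in BRAM 26 at address 1 (window (0,1)).
    BramLocation loc = translate(10, 3);
    std::printf("BRAM id = %d, address = %d\n", loc.id, loc.addr);
    return 0;
}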
Fig. 6. Flexible data access patterns supported by the matrix cache: pixel, column vector, row vector, and window read/write, addressed by (rowaddr, coladdr).
In the processor, multiple shared caches are implemented
to support accelerator parallelism. An arbitrator handles the
data distribution between the accelerators and the requested data
access ports. With the proposed structure, the block RAMs
are encapsulated as a universal matrix storage space and can
support multiple 2D parallel data access patterns. Thus the
accelerators can be integrated conveniently.
B. Matrix Accelerators
In the universal matrix operation processor, it is convenient
to insert or delete an accelerator. In the current system, we
have designed several basic operations. The structure of these
accelerators will be discussed in this section.
1) Initialization: In the proposed system, a matrix in the
cache can be initialized in three ways:
• DMA. The matrix content can be transferred between the
matrix cache and the external DDR2 memory or the NIOS
II processor;
• From the matrix cache. The matrix can be transferred
inside of the matrix cache;
• Direct initialization. The matrix can be directly initialized
as some basic matrices such as the diagonal matrix.
Two DMA modules are built in the system to handle the
data transmission between the shared matrix cache and the
external DDR2 memory or the NIOS II processor. Besides, a
dedicated matrix initialization accelerator is implemented in
the system to accelerate the cache-to-cache initialization or
direct initialization. The structure of the accelerator is shown
in Fig. 7. The destination matrix can be copied, transposed,
initialized as a diagonal matrix, or initialized as the same
value. A window read port and a window write port are
used in the accelerator, thus the initialization speed is greatly
improved.
Fig. 7. Structure of the matrix initialization accelerator: an address generator driven by the instruction produces rdaddr/wraddr, and a MUX selects among the InitData, InitDiag, InitAll, and Transpose data paths between the 8*8*32-bit window input and output ports. The window size of the shared matrix cache is set as 8*8 here.
2) Array Operations: In array operations, the matrix is
regarded as a long vector and processed in sequence. In the
current system, we have implemented matrix addition, matrix
subtraction, dot multiplication (.∗), and dot division (./). We
use .* and ./ here to distinguish the array operations from 2D
matrix multiplication and division.
The corresponding accelerator is very simple to build on the
universal platform. Two read ports are used to read the operand
matrices in the matrix cache. The outputs of these two ports are
sent to the addition/subtraction/multiplication/division array to
compute the results. The final results are sent back to the
matrix cache through a write port. In the current system, the
three ports are all based on row vector access. This means the
parallelism of the array operations is Wwin (the window width
of the shared cache).
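As a behavioral software model of this dataflow (our own illustration, not the RTL), the sketch below processes one row vector of W_WIN elements per "cycle"; the function name and the flat std::vector storage are assumptions made for illustration.

#include <functional>
#include <vector>

// Behavioral model of the array-operation accelerator: per "cycle" it reads
// one row vector of W_WIN elements from each operand matrix, applies the
// element-wise operation on all lanes in parallel, and writes the result
// row vector back through the write port.
constexpr int W_WIN = 8;   // window width of the shared cache = parallelism

void array_op(const std::vector<float>& a, const std::vector<float>& b,
              std::vector<float>& out, int rows, int cols,
              const std::function<float(float, float)>& op) {
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; c += W_WIN)              // one row vector per cycle
            for (int k = 0; k < W_WIN && c + k < cols; ++k) {
                int idx = r * cols + c + k;                // W_WIN lanes work in parallel
                out[idx] = op(a[idx], b[idx]);
            }
}

// Example: model the dot-multiplication instruction (.*) on a 128*256 matrix:
// array_op(A, B, C, 128, 256, [](float x, float y) { return x * y; });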
3) Multiplication: Matrix multiplication can be realized
by combining a set of array operations. However, matrix
multiplication is often a key operation which may influence the
performance of the whole application. Thus a dedicated matrix
multiplication accelerator is implemented in the processor.
The hardware design of matrix multiplication has been
researched in a lot of work[7][8][9][14]. The dataflow and
cache design are the key parts in these accelerators. In the
proposed processor, the design is greatly simplified because
the cache is ready. We simply select a window-multiply-
column-vector method, as shown in Fig. 8. With the shared
matrix cache, a window in the left matrix and a column vector
in the right matrix can be read out in one clock cycle. These
two parts are multiplied as a column vector. The window is
scanned along the row direction and the column vector is
scanned along the column direction. Along with the scan,
the result column vectors are accumulated by an array of
accumulators. Finally, when the scan of a whole row/column
is over, the result of these accumulators corresponds to a
column vector in the result matrix. The design of the matrix
multiplication is quite simple, but the hardware computing
units are fully utilized.
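The following C++ sketch mirrors this dataflow in software as a reference model (our own illustration with an 8*8 window; the hardware performs one window-vector product per clock cycle).

#include <array>
#include <vector>

constexpr int WIN = 8;  // window size of the matrix cache (8*8 here)

// Behavioral model of the window-multiply-column-vector scheme.
// A, B, C are n*n blocks stored row-major; n is a multiple of WIN.
void block_multiply(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int n) {
    for (int jc = 0; jc < n; ++jc) {                  // one result column at a time
        for (int iw = 0; iw < n; iw += WIN) {          // window rows of the left matrix
            std::array<float, WIN> acc{};              // accumulator array
            for (int kw = 0; kw < n; kw += WIN) {      // window scans along the row,
                                                       // column vector scans down the column
                for (int r = 0; r < WIN; ++r) {        // 8*8 multiply-accumulates per "cycle"
                    float sum = 0.0f;
                    for (int c = 0; c < WIN; ++c)
                        sum += A[(iw + r) * n + (kw + c)] * B[(kw + c) * n + jc];
                    acc[r] += sum;
                }
            }
            for (int r = 0; r < WIN; ++r)              // write one result column vector
                C[(iw + r) * n + jc] = acc[r];
        }
    }
}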
Fig. 8. Structure of the matrix multiplication accelerator: an 8*8 window of the left matrix (Input Port A) is multiplied with an 8*1 column vector of the right matrix (Input Port B); an accumulator array builds the 8*1 result column, which is written back through Output Port C. The window size of the matrix cache is set as 8*8 here.
V. HARDWARE-SOFTWARE INTERFACE
A. Hierarchical Matrix Computing
In the proposed processor, the accelerators can only operate
the matrices stored in the cache. However, in many appli-
cations, the matrices are larger than the cache capacity and
are stored in the external memory. To solve this problem,
the master processor and the hardware accelerators must work
together. A hierarchical processing mechanism is implemented
in the proposed system: the accelerators handle the operations
inside of the cache, while the master processor handles the
higher-level operations.
In the master processor, a big matrix is partitioned into
multiple small blocks which can be stored in the cache. The
master processor sends instructions to load these blocks into
the cache, execute operations, and then write the result blocks
back. Fig. 9 shows the matrix multiplication as an example.
The two input operands, matA and matB, are partitioned
into multiple blocks: blockA(1:M, 1:N) and blockB(1:N, 1:P). M, N, and P represent the numbers of blocks, which
are related to the matrix size. The final result matC is also
partitioned into multiple blocks: blockC(1:M, 1:P). With
basic block matrix operations, the computing procedure is
shown in (4).

blockC(i, j) = Σ_{0 ≤ k < N} blockA(i, k) * blockB(k, j)    (4)
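A minimal sketch of the resulting block-level loop on the master processor is given below; the helper names (load_block, mult_block, accumulate_block, store_block) are hypothetical stand-ins for the DMA transfers and accelerator instructions rather than the actual interface, and for clarity the sketch omits the ping-pong overlap described next.

// Hypothetical block handle and helpers standing in for the DMA transfers
// and accelerator instructions (the real interface issues port-access
// instructions through the dispenser, Section V-B). Bodies are stubs.
struct Block { /* one cache-sized tile, e.g. 128*256 elements */ };

static Block load_block(const float*, int, int) { return Block{}; } // memory -> cache
static void  mult_block(const Block&, const Block&, Block&) {}      // c = a * b
static void  accumulate_block(Block&, const Block&) {}              // acc += partial
static void  store_block(const Block&, float*, int, int) {}         // cache -> memory

// Block-level matrix multiplication following Eq. (4):
//   blockC(i,j) = sum_{0 <= k < N} blockA(i,k) * blockB(k,j)
void block_matmul(const float* matA, const float* matB, float* matC,
                  int M, int N, int P) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < P; ++j) {
            Block acc;                                   // result tile kept in the cache
            for (int k = 0; k < N; ++k) {
                Block a = load_block(matA, i, k);
                Block b = load_block(matB, k, j);
                Block partial;
                mult_block(a, b, partial);
                accumulate_block(acc, partial);          // running sum over k
            }
            store_block(acc, matC, i, j);
        }
}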
To improve the processing speed, the matrix cache space is
divided into six regions for a ping-pong mechanism. As shown
in Fig. 9, the ping buffers pingA, pingB, and pingC are used
for block multiplication, while the pong buffers are used for
accumulation and data transfer. When the matrix multiplication
accelerator operates on the ping buffers, the accumulation and
data transfer are executed in parallel:
• Load blockC(i, j) from the external memory into pongB.
• Add the previous result block, which is stored in pongC, to blockC(i, j).
• Store the new blockC(i, j) to the external memory.
• Load the next operand blocks, blockA(i, k) and blockB(k, j), into the pong buffers.
With this mechanism, the processed matrix size becomes
unlimited without modifying the cache size. However, it also
brings a big problem: the external bandwidth optimization dis-
cussed in Section III-A becomes difficult because the adjacent
instructions always process different blocks. We still take the
example in Section III-A: B = op1(A); C = op2(B); D = op3(C). When the matrices are partitioned into multiple
blocks, the straightforward execution sequence is shown in the first
column of Fig. 10. We can see the blocks of B and C must
be stored to the external memory because the cache content
will be overlapped by the following blocks.
To solve this problem, we have implemented a preliminary
schedule method in the software. We buffer the adjacent array
operations, and send a set of instructions to each block. After
schedule, the new execution sequence is shown in the second
column of Fig. 10. We no longer need to store B and C in
the external memory. In fact, this is a complicated scheduling
problem; we aim at an out-of-order scheduling strategy that
maximizes the cache hit ratio in a future version.

Fig. 9. Block-level schedule of matrix multiplication: the ping buffers (pingA, pingB, pingC) in the matrix cache feed the multiplication accelerator, while the pong buffers (pongA, pongB, pongC) are used for loading operand blocks and accumulating/storing blockC; the two buffer sets are swapped after each step.
Fig. 10. Bandwidth optimization on block-level instructions. Straightforward: three separate loops over the blocks (for i=1:N blockBi=op1(blockAi); then for i=1:N blockCi=op2(blockBi); then for i=1:N blockDi=op3(blockCi)). After schedule: a single loop over i=1:N executing the three operations back-to-back on each block.
B. Asynchronous Instruction Execution
To process large matrices, the master processor must
send multiple instructions for the block-level schedule. In a synchronous
system, these instructions will be executed immediately after
being sent. Thus when the accelerators are busy, the master
processor should keep waiting. It is hard to arrange any other
task for the master processor during this period. As shown in
Fig. 11, a matrix operation in the software will cost the same
time as the hardware accelerators, while most of the time is
used for waiting.
In mobile systems, the master processor often has limited
performance. Besides, there may be other tasks for the
master processor besides the matrix operations. To reduce the
master processor's workload, a hardware-based asynchronous
instruction execution mechanism is proposed in our system.
The instructions are buffered in the external memory, thus
the execution of the instructions is not synchronized with
the master processor. An instruction dispenser is implemented
to check the conflicts between instructions. Whenever a new
instruction can be executed, it is read out from the external
memory and sent to the accelerators. In the software program,
the master processor just needs to send multiple instructions
without considering whether the accelerators are ready. While
the accelerators execute these instructions, the master proces-
sor is free and can do other tasks such as computing the
next block address. The procedure is shown in Fig. 11.
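From the software side, the pattern can be pictured roughly as below; the instruction encoding, buffer, and synchronization names are hypothetical and only the fire-and-forget structure is the point.

#include <cstdint>
#include <vector>

// Hypothetical encoded instruction word and instruction buffer; the real
// buffer lives in the external memory and is drained by the hardware dispenser.
using Instruction = uint64_t;
static std::vector<Instruction> g_ins_buffer;

// Post an instruction without waiting for the accelerators (asynchronous).
void post_instruction(Instruction ins) {
    g_ins_buffer.push_back(ins);   // stand-in for writing the external buffer
}

// The master processor only blocks when it actually needs the results.
void wait_all_done() {
    // stand-in for polling a completion counter exposed by the dispenser
}

void example() {
    post_instruction(/*Ins1*/ 0x1);
    post_instruction(/*Ins2*/ 0x2);
    post_instruction(/*Ins3*/ 0x3);   // all three sent back-to-back
    // ... compute the next block addresses or run other tasks here ...
    wait_all_done();                  // synchronize only when required
}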
Fig. 11. Synchronous and asynchronous instruction execution: with synchronous execution the master processor waits for each instruction to finish before sending the next, while with asynchronous execution it sends Ins1~3 back-to-back and executes other tasks while the accelerators work through them.
C. Programming Interface
In the current system, we take a NIOS II/f soft-core pro-
cessor as the master processor. Altera provides sufficient
support for the NIOS II processor, and we utilize its
software build tools to program the system. At the lowest
level, the instructions in the platform are port access
commands. We encapsulate the low-level schedule
programs and provide a C++ class Mat2D as the high-level
interface. Table I shows the basic functions of the class.
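For example, the array-operation chain C = (A .* B) ./ (A + B) used in Section VI could be written roughly as follows; the Mat2D constructor arguments are an assumption for illustration, since Table I lists only the member functions.

// Sketch of a user program on the NIOS II master processor using the Mat2D
// interface from Table I. The constructor signature Mat2D(rows, cols) and the
// header name are assumed; only the member functions below come from Table I.
#include "Mat2D.h"   // hypothetical header providing the Mat2D class

void example() {
    Mat2D A(512, 512), B(512, 512);
    Mat2D T1(512, 512), T2(512, 512), C(512, 512);

    A.Init(1.5f, false);       // fill A with a constant value
    B.Init(1.0f, true);        // initialize B as a diagonal matrix

    T1.isdotmult(A, B);        // T1 = A .* B   (array multiplication)
    T2.isadd(A, B);            // T2 = A + B    (matrix addition)
    C.isdotdiv(T1, T2);        // C  = T1 ./ T2 (array division)

    Mat2D D(512, 512);
    D.ismult(A, B);            // D = A * B     (matrix multiplication)
}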
VI. EXPERIMENTAL RESULTS
We build a prototype on a DE3 development board [19]. The
FPGA chip on the board is an Altera EP3SL340, with 135,200
ALMs, 576 DSP 18-bit elements and about 16 Mbit of block
RAM. A 1 GB DDR2 memory running at 533 MHz is used
as the main memory. We implement three sets of shared matrix
caches in the demo system to support parallel data access of
ternary operations. For each cache, the window size is 8*8 and
the length of the block RAM is 512. Thus the total storage
capacity is 64*512 = 32768 elements. The 2D cache size can
be configured by the software. It is set as 128*256 in the
current system.
A. Resource Utilization
In the demo system, the window size of the matrix cache
is set as 8*8. This means the matrix multiplication accelerator
can execute 64 multiplications and 64 additions in one cycle.
With a clock frequency of 150 MHz, the theoretical process-
ing capacity of the processor is 150 MHz * 64 FLOP * 2 = 19.2 GFLOPS. Table II shows the resource utilization of the
main modules and the whole system.
From Table II, we can see the parallelism is mainly restrict-
ed by the DSP elements. To explore the potential performance
TABLE I
BASIC FUNCTIONS OF THE PROGRAMMING INTERFACE.

Function               | Declaration                                  | Explanation
Initialization         | void Mat2D::Init(float val, bool isdiag);    | Initialize (*this) to a certain value
Matrix copy            | void Mat2D::copyfrom(Mat2D &A);              | Copy from A
Matrix addition        | void Mat2D::isadd(Mat2D &A, Mat2D &B);       | (*this) = A + B
Matrix subtraction     | void Mat2D::issub(Mat2D &A, Mat2D &B);       | (*this) = A - B
Array multiplication   | void Mat2D::isdotmult(Mat2D &A, Mat2D &B);   | (*this) = A .* B
Array division         | void Mat2D::isdotdiv(Mat2D &A, Mat2D &B);    | (*this) = A ./ B
Matrix multiplication  | void Mat2D::ismult(Mat2D &A, Mat2D &B);      | (*this) = A * B
TABLE II
RESOURCE UTILIZATION OF THE DEMO SYSTEM.

             |      Stratix III (64 multiplication units)        |  Stratix V (256 multiplication units)
Modules      | ALMs         | DSP 18-bit | M9Ks      | M144Ks    | ALMs          | DSP 27-bit | M20Ks
Matrix Cache | 27,925       | 6          | 384       | 0         | 25,234        | 6          | 384
Accelerators | 31,715       | 416        | 9         | 0         | 69,496        | 304        | 18
MMU          | 8,113        | 0          | 44        | 2         | 10,016        | 0          | 63
NIOS II/f    | 3,983        | 4          | 31        | 16        | 3,279         | 2          | 156
Demo system  | 71,930 (53%) | 426 (74%)  | 468 (45%) | 18 (38%)  | 108,025 (63%) | 312 (20%)  | 621 (31%)
TABLE III
PROCESSING TIME OF MATRIX MULTIPLICATION ON DIFFERENT PLATFORMS, MEASURED IN SECONDS.

Platform                |   256  |   512  |  1024  |  2048  |  4096  |  8192  | Average Performance (GFLOPS)
NIOS II/f               | 40.73  | 325.8  | 2606   | -      | -      | -      | 0.0008
VEGAS [4]               | -      | -      | 0.72   | -      | 43.77  | -      | 3.061 (GOPS)^1
Proposed on Stratix III | 0.0018 | 0.0141 | 0.1121 | 0.8965 | 7.171  | 57.37  | 19.06
Intel i7^2              | 0.0004 | 0.0027 | 0.0203 | 0.1327 | 0.9812 | 7.565  | 117.3

^1 The result of VEGAS is based on integer multiplication.
^2 The result of Intel i7 is based on the Windows 7 64-bit operating system, Matlab R2013a.
of the processor, we migrate the design to a Stratix V FP-
GA 5SGSMD5K2 and validate the resource utilization with
compilation tools. With more DSP resources, we update the
matrix multiplication module to compute the multiplication
of an 8*8 window with an 8*4 window in each cycle. Thus
the parallelism is improved to 256 multiplications and 256
additions in one cycle. The theoretical processing capacity is
improved to 150 MHz * 256 FLOP * 2 = 76.8 GFLOPS.
The new resource utilization is also listed in Table II.
B. Performance Evaluation
In this section, the performance of the proposed processor
is evaluated with matrix operations. All the experimental
matrices are bigger than the cache, thus they are stored in the
external DDR2 memory and processed with the hierarchical
matrix computing mechanism discussed in Section V-A.
Matrix multiplication is selected to evaluate the peak com-
puting capacity of the processor. The NIOS II/f processor
and an Intel i7 3770K CPU running at 3.4 GHz are used
for comparison. As shown in Table III, the processing time
is proportional to the cube of matrix size. The proposed
processor shows better performance than the ARM processor
and NIOS II/f soft-core processor. Compared to VEGAS in
[4], the proposed processor shows about 6x speed up with
double resource utilization. Besides, the proposed processor
is based on floating-point, which costs more resources than
the integer computing units. Thus we believe the proposed
processor achieves better resource utilization efficiency.
The fully optimized program on the Intel i7 CPU shows
better performance than the proposed processor. However,
the power consumption of the Intel i7 is also much higher. Table IV
compares the energy efficiency of different platforms. We can
see that the proposed processor outperforms the other three
processors. Besides, the Intel i7 3770K CPU is based on the
22nm VLSI technology, while the demo system uses a 65nm
Altera Stratix III FPGA. To make a fair comparison, we list
the estimated system performance on a newer 28nm Stratix V
FPGA in the last row. We can see the computing capacity
is even comparable with the desktop CPU and the energy
efficiency is much better.
We further evaluate the performance improvement brought
by the two key technologies: shared matrix cache and asyn-
chronous instruction execution. Compared to matrix multipli-
cation, the array operations are more easily limited by the
external memory bandwidth because their computational com-
plexity is lower. Thus we select a set of array operations
in the experiment: C = (A .* B) ./ (A + B) (.* and ./ are the array operations discussed in Section IV-B2). Table
V shows the elapsed time under three conditions: without
shared-cache-based communication, with shared-cache-based
communication, and the software execution time in the master
processor. From the results, we can see the shared cache
TABLE IV
PERFORMANCE COMPARISON AMONG DIFFERENT PLATFORMS.

Platform                | Performance (GFLOPS) | Power (W)  | Energy Efficiency (GFLOPS/W)
NIOS II/f               | 0.0008               | 0.528^1    | 0.0016
ARM Cortex A9^2         | 3.0                  | 7.5/board  | 0.4/board
Intel i7 3770K          | 117.3                | 77^3       | 1.52
Proposed on Stratix III | 19.1                 | 5.81^1     | 3.28
Proposed on Stratix V   | 76.8                 | 4.59^1     | 16.7

^1 The power consumption of the FPGA is estimated by the power estimator [20] provided by Altera.
^2 The result is from [21]. The board power is shown here, thus the energy efficiency of the processor could be a little higher than the listed value.
^3 The power consumption of the Intel i7 is estimated with the thermal design power (TDP). The actual power consumption can be larger than this under full workload.
structure improves the performance when the external band-
width is the bottleneck. Besides, the asynchronous instruction
execution mechanism greatly reduces the workload of the
master processor. Thus the master processor is able to handle
other tasks when the accelerators are busy.
TABLE V
ELAPSED TIME OF ARRAY OPERATIONS, MEASURED IN MILLISECONDS.

Setting              |  256 |  512 |  1024 |  2048 |   4096
Without shared cache | 0.72 | 3.04 | 12.23 | 48.25 | 191.83
With shared cache    | 0.35 | 1.36 |  5.41 | 21.69 |  87.07
Master processor     | 0.08 | 0.30 |  1.17 |  4.87 |  18.67
C. Discussion
From the above experiments, we can see the proposed
matrix processor achieves both high processing speed and high
energy efficiency. In fact, the performance of FPGA-based
matrix operation can still be improved with the update of
devices. In current FPGAs, the floating-point computing units
cost lots of LUTs because there are only integer DSP units on
the chip. This restricts the number of floating-point computing
units and decreases the energy efficiency.
In recent years, floating-point computing has attracted the
attention of FPGA companies. For example, the next gener-
ation Altera Stratix 10 FPGA will have hard-core floating-
point DSP elements [22]. This will increase the floating-
point computing capacity of FPGAs and reduce the energy consumption. We
believe the proposed processor will show a better performance
with the improvement of FPGA technology.
VII. CONCLUSION AND FUTURE WORK
In this paper, we propose an energy efficient and univer-
sal matrix processor by integrating a master processor with
multiple matrix accelerators. A novel shared matrix cache
structure is proposed to improve the performance and flex-
ibility of the overall system. Furthermore, we design a so-
phisticated hardware-software interface to provide an easy-to-
use programming interface and reduce the master processor’s
workload. The proposed processor achieves better performance
than some state-of-the-art processors. In the future, we will
add more accelerators and seek possible mobile applications
to further prove the efficiency of the proposed processor.
REFERENCES
[1] R. E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
[2] http://software.intel.com/en-us/intel-mkl.
[3] https://developer.nvidia.com/cublas.
[4] C. H. Chou, A. Severance, A. D. Brant, Z. Liu, S. Sant, and G. G. Lemieux, "VEGAS: Soft vector processor with scratchpad memory," in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 15–24, ACM, 2011.
[5] A. Severance and G. Lemieux, "VENICE: A compact vector processor for FPGA applications," in Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pp. 245–245, IEEE, 2012.
[6] P. Yiannacouras, J. G. Steffan, and J. Rose, "VESPA: Portable, scalable, and flexible FPGA-based vector processors," in Proceedings of the 2008 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp. 61–70, ACM, 2008.
[7] Z. Jovanovic and V. Milutinovic, "FPGA accelerator for floating-point matrix multiplication," IET Computers & Digital Techniques, vol. 6, no. 4, pp. 249–256, 2012.
[8] T.-C. Lee, M. White, and M. Gubody, "Matrix multiplication on FPGA-based platform," in Proceedings of the World Congress on Engineering and Computer Science, vol. 1, 2013.
[9] S. Kestur, J. D. Davis, and E. S. Chung, "Towards a universal FPGA matrix-vector multiplication architecture," in Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pp. 9–16, IEEE, 2012.
[10] Y. Pang, S. Wang, Y. Peng, N. J. Fraser, and P. H. Leong, "A low latency kernel recursive least squares processor using FPGA technology," in Field-Programmable Technology (FPT), 2013 International Conference on, pp. 144–151, IEEE, 2013.
[11] S. Venkataramani, V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, "Quality programmable vector processors for approximate computing," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1–12, ACM, 2013.
[12] L. G. Bleris, P. D. Vouzis, M. G. Arnold, and M. V. Kothare, "A co-processor FPGA platform for the implementation of real-time model predictive control," in American Control Conference, 2006, pp. 6 pp., IEEE, 2006.
[13] A. Severance and G. Lemieux, "Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor," in Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2013 International Conference on, pp. 1–10, Sept. 2013.
[14] K. K. Matam, H. Le, and V. K. Prasanna, "Energy efficient architecture for matrix multiplication on FPGAs," in Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on, pp. 1–4, IEEE, 2013.
[15] W. Wu, Y. Shan, X. Chen, Y. Wang, and H. Yang, "FPGA accelerated parallel sparse matrix factorization for circuit simulations," in Reconfigurable Computing: Architectures, Tools and Applications, pp. 302–315, Springer, 2011.
[16] Y. Shan, T. Wu, Y. Wang, B. Wang, Z. Wang, N. Xu, and H. Yang, "FPGA and GPU implementation of large scale SpMV," in Application Specific Processors (SASP), 2010 IEEE 8th Symposium on, pp. 64–70, June 2010.
[17] J. Sun, G. D. Peterson, and O. O. Storaasli, "High-performance mixed-precision linear solver for FPGAs," Computers, IEEE Transactions on, vol. 57, no. 12, pp. 1614–1623, 2008.
[18] L. Zhuo and V. Prasanna, "High-performance designs for linear algebra operations on reconfigurable hardware," Computers, IEEE Transactions on, vol. 57, pp. 1057–1071, Aug. 2008.
[19] http://www.altera.com/education/univ/materials/boards/de3/unv-de3-board.html.
[20] http://www.altera.com/support/devices/estimator/st3-estimator/st3-power-estimator.html.
[21] http://www.ll.mit.edu/HPEC/agendas/proc11/Day1/Posters/A-3 Keville.pdf.
[22] http://www.altera.com/devices/fpga/stratix-fpgas/stratix10/stx10-index.jsp.