
Accelerating Proximal Policy Optimization on CPU-FPGA Heterogeneous Platforms

Yuan Meng, Sanmukh Kuppannagari, Viktor Prasanna
School of Electrical and Computer Engineering, University of Southern California

Emails: {ymeng643, kuppanna, prasanna}@usc.edu

Abstract—Reinforcement Learning (RL) is a technique that enables an agent to learn to behave optimally by repeatedly interacting with the environment and receiving the rewards. RL is widely used in domains such as robotics, game playing and finance. Proximal Policy Optimization (PPO) is the state-of-the-art policy optimization algorithm which achieves superior overall performance on various RL benchmarks. PPO iteratively optimizes its policy - a function which chooses optimal actions - with each iteration consisting of two computationally intensive phases: Inference phase - agents infer actions to interact with the environment and collect data, and Training phase - agents train the policy using the collected data. In this work, we develop the first high-throughput PPO accelerator on a CPU-FPGA heterogeneous platform, targeting both phases of the algorithm for acceleration. We implement a systolic-array based architecture coupled with a novel memory-blocked data layout that enables streaming data access in both forward and backward propagations to achieve high-throughput performance. Additionally, we develop a novel systolic array compute sharing technique to mitigate the potential load imbalance in the training of two networks. We develop an accurate performance model of our design, based on which we perform design space exploration to obtain optimal design points. Our design is evaluated on widely used robotics benchmarks, achieving 2.1×–30.5× and 2×–27.5× improvements in throughput against state-of-the-art CPU and CPU-GPU implementations, respectively.

I. INTRODUCTION

Reinforcement Learning (RL) is an area of Machine Learning that constitutes a wide range of algorithms which span the Observe, Orient, Decide and Act phases of autonomous agents [1]. An autonomous agent interacts with the environment by sensing, orients itself by updating its internal state of the world model, decides an action to maximize its rewards (based on policy) and executes the action to collect the rewards [1]. RL has found widespread success in domains such as self-driving cars [2], robotics in space exploration [3], surveillance [4], wearable healthcare devices [5], etc.

An important class of RL algorithms consists of policy optimization [6] methods, where the policy is modeled as a function of the probability distribution of actions conditional on states. The agent learns the policy by pushing up the probabilities of actions that lead to higher cumulative rewards. Proximal Policy Optimization (PPO) [7] is considered the state-of-the-art policy optimization method; it introduces a clipped objective that discourages the new policy from stepping far away from the old policy while taking the biggest improvement step using current data. PPO achieves better sample complexity (number of samples required for convergence) and reliable performance [7].

Heterogeneous architectures consisting of CPUs and FPGAs have become popular in accelerating memory and compute intensive algorithms [8], [9], [10], [11]. Such platforms are integrated with abundant high-bandwidth memory and interconnections such as CCIX [12] and OpenCAPI [13], which provide coherent shared-memory access between the processor and accelerators [14], [15]. The PPO algorithm consists of various computational kernels, some of which (e.g. neural network propagation) can benefit from the fine-grained parallelism of FPGAs, whereas others (e.g. Advantage or Loss calculation) are better suited for CPUs.

While PPO requires fewer iterations compared with other policy optimization methods, each iteration is computationally intensive as it involves multiple epochs of neural network training [7]. Additionally, hardware acceleration of PPO is challenging as PPO has different data access patterns in forward and backward propagations, requiring novel memory layout techniques to enable streaming data access. Moreover, the different numbers of iterations needed in training the value and policy networks may lead to an imbalanced workload in hardware.

In this work, we develop the first high-throughput PPO accelerator targeting CPU-FPGA heterogeneous platforms by accelerating both the Inference and Training phases. Specifically, the major contributions of the paper are:

• We identify the key operations in the PPO algorithm and map them onto FPGA or CPU based on their computation characteristics and communication overheads.

• We design a systolic-array based FPGA accelerator enabling parallel inferences and high-throughput training of neural networks representing value and policy functions.

• We develop a novel data layout to store neural network parameters that supports concurrent conflict-free on-chip memory accesses in both forward and backward propagations without the need to perform matrix transpose.

• We design a novel systolic array compute sharing based load-balancing technique that shares the compute units between the value network and policy network to deal with potential load imbalances in training.

• We develop an accurate performance model of the proposed design incorporating the impacts of design parameters and algorithmic parameters, based on which we perform design space exploration to obtain optimal designs for implementation.


• Experiments on a heterogeneous CPU-FPGA platform achieve up to 30.5× and 27.5× improvement in throughput compared with state-of-the-art CPU-only and CPU-GPU implementations, respectively.

II. BACKGROUND

A. Proximal Policy Optimization (PPO)

We consider the standard RL setting defined as a Partially Observable Markov Decision Process {S, A, R, P, γ, s0} that defines a stochastic sequence of states st ∈ S, actions at ∈ A, rewards rt ∈ R, and initial states s0 [16]. Starting from the initial state s0, we draw observations at the current state st ∈ S, actions at from the policy π(at|st), rewards rt from the reward function R(rt|st, at), and the next state st+1 from the state transition function P(st+1|st, at). The objective is to find a policy π that maximizes the expected Rewards-to-go $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$ at every time stamp t, where T is when the agent encounters the terminal state.

Typical implementations of policy optimization algorithms model the policy using a neural network, known as the policy network, parameterized by θ. The policy network represents a stochastic policy πθ which returns a conditional probability distribution $\pi_\theta(a|s) = P_\theta(a_t = a \mid s_t = s)$.

An advantage function A is defined to indicate the advantage of a specific action at over the average [17]: $A_{\pi_\theta, t} = R_t - V_\phi(s_t)$, where $V_\phi(s_t)$ is a learnt value function approximated by a neural network, known as the value network, trained with the objective $L^{MSE} = E_{\pi_{\theta_k}}\left[ V_\phi(s_t) - R_t \right]^2$.
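For illustration, a minimal NumPy sketch (ours, not the accelerator code or the baseline implementation) of how the Rewards-to-go Rt and the advantage estimates At can be computed from a collected trajectory:

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """R_t = sum_{i=t}^{T} gamma^(i-t) * r_i, computed backwards in O(T)."""
    R = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

def advantages(rewards, values, gamma=0.99):
    """A_t = R_t - V_phi(s_t), using the value-network predictions V_phi(s_t)."""
    return rewards_to_go(rewards, gamma) - values

# Example: a toy 5-step trajectory
r = np.array([1.0, 0.5, 0.0, 1.0, 2.0])
v = np.array([2.0, 1.8, 1.5, 1.2, 1.0])   # value-network outputs V_phi(s_t)
print(rewards_to_go(r), advantages(r, v))
```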

One feature of PPO is that it prevents large changes to the action distribution in a single update. To accomplish this, a clipped objective function for the policy is defined [7]:

$$
g(\epsilon, A_t) = \begin{cases} (1+\epsilon) A_t & A_t \geq 0 \\ (1-\epsilon) A_t & A_t < 0 \end{cases}
$$

$$
L^{CLIP} = E_{\pi_{\theta_k}}\left[ \min\left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}\, A_t,\; g(\epsilon, A_t) \right) \right] \quad (1)
$$

where $\pi_{\theta_k}$ is the old policy before the update and $\epsilon$ is a parameter valued between 0 and 1 that determines the update size. The function $g(\epsilon, A)$ modifies the objective by removing the incentive for moving the probability ratio $\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}$ outside of the interval $[1-\epsilon, 1+\epsilon]$ [18], [7].

The complete PPO algorithm that uses fixed-length trajectory segments is shown in Algorithm 1. In each iteration, PPO alternates between the Inference phase (lines 3-7): sampling data through interaction with the environment, where a trajectory is defined as a series of st, πθt(st), at, rt; and the Training phase (lines 8-9): optimizing the objective functions for the policy and value networks. The Inference phase generates datasets for training by interacting with the environment using the current neural network parameters while the Training phase updates the parameters. In the following sections, we use K to denote the number of training epochs and M to denote the mini-batch size.
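For reference, the clipped surrogate objective of Equation 1 can be expressed in a few lines of NumPy; this is an illustrative sketch of the math only (array names are ours), not the CPU-side implementation used in this work:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """L^CLIP = E[min(ratio * A_t, g(eps, A_t))] with ratio = pi_theta / pi_theta_k."""
    ratio = np.exp(logp_new - logp_old)          # pi_theta(a|s) / pi_theta_k(a|s)
    g = np.where(adv >= 0, (1 + eps) * adv, (1 - eps) * adv)
    return np.mean(np.minimum(ratio * adv, g))   # maximized by SGA during training
```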

B. Reinforcement Learning in Robotic Systems

In robotics, Multi-Layer Perceptrons (MLP) are generally used to represent the policy of the agent(s). These MLPs are usually small. For instance, a typical MLP structure used in the largest OpenAI MuJoCo environment [19] with standard training parameters [7] requires a total of 5.4 Mbits of memory to store all parameters and intermediate results. Physical robots require a large number of training episodes, which can damage the robot if directed by unsafe policies [9]. Training using simulation environments can serve as an alternative before a robot is deployed in the field. This is usually done by connecting the robot with a CPU where the simulation environment is installed, and multiple environments can be simulated in parallel threads for parallel agents grouped to perform the neural network computation on a batch [20], [21]. Even with simulation environments, training an RL agent is still very time consuming [22], [23]; therefore research effort is needed to accelerate the RL algorithms themselves.
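As a rough illustration of why such networks fit on-chip, a back-of-envelope sketch (ours; it counts only the weight matrices, whereas the 5.4 Mbit figure above also includes intermediate results and a possibly different network) for the benchmark networks used in Section VI:

```python
def mlp_weight_bits(layer_sizes, bits_per_value=32):
    """Bits needed to store all fully connected weight matrices of an MLP (biases ignored)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:])) * bits_per_value

# Network sizes used for the benchmarks in Section VI
for name, sizes in [("Hopper", [11, 64, 64, 3]), ("Humanoid", [376, 128, 64, 17])]:
    print(f"{name}: {mlp_weight_bits(sizes) / 1e6:.2f} Mbits of weights at 32-bit precision")
```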

C. Challenges and Limitations in PPO acceleration

Fig. 1. PPO on CPU or GPU based platform

Figure 1 illustrates how a PPO iteration can be implemented on a multi-core CPU or CPU-GPU coupled platform [24]: the [inference - environment interaction] loop is repeated T times by N parallel agents, followed by multiple epochs of mini-batched training executed to update the network parameters that reside in CPU cache or GPU memory. Figure 2 summarizes the total computational time for inferences, environmental interactions and mini-batch training in one iteration of PPO on two typical robotic baselines (Hopper, Humanoid) [21] on CPU and CPU-GPU coupled platforms. On such platforms the performance of PPO is limited because of (1) repeated synchronization overheads and low operational intensity due to small batched computations, and (2) the limited size of low-latency L1 cache (∼32 KB), which forces high-latency accesses to off-chip memory. On an FPGA-based platform, the mini-batch training computations can be pipelined effectively such that these overheads are non-existent, and the abundant customizable on-chip memory enables extremely low-latency access to data, allowing it to achieve superior performance on PPO over CPU and even GPU-based platforms. However, several challenges still exist in accelerating PPO: (1) different data access patterns within the Training phase can lead to memory access conflicts and stall the pipeline, and (2) training operations for the value and policy networks may differ, leading to load imbalance in hardware. We address both issues by proposing novel optimizations to achieve high-throughput performance of PPO acceleration.

Fig. 2. Latency breakdown, [N,T,M,K] = [32, 512, 512, 10]

Algorithm 1 PPO-CLIP Algorithm
1: Input: initial policy parameters θ, initial value function parameters φ; Output: learnt policy πθ
2: for iteration k = 0, 1, 2, ... do
3:   for Agent = 0, 1, 2, ..., N do
4:     Collect a trajectory τi ∈ D by running the old policy πθk in the environment for T time stamps
5:     Approximate the state value function Vφ(st), t = 0, ..., T
6:     Compute Rewards-to-go Rt and Advantage estimates At, t = 0, ..., T
7:   end for
8:   Optimize the Value objective LMSE w.r.t. φ using mini-batch SGD, with K1 epochs and mini-batch size M1 < N × T
9:   Optimize the Policy objective LCLIP w.r.t. θ using mini-batch SGA, with K2 epochs and mini-batch size M2 < N × T
10: end for

III. RELATED WORKS

Most of the existing works on accelerating RL on FPGAs focus on Q-Learning algorithms such as Deep Q-Learning [25], [3], [26] and table-based Q-Learning [27], [28]. Unlike policy optimization methods, Q-Learning algorithms learn the quality function of state-action pairs.

A few recent works have focused on accelerating policy optimization methods. An FPGA implementation of the Asynchronous Advantage Actor-Critic (A3C) algorithm is presented in [23], which uses Convolutional Neural Networks (CNN) for policy representation targeting the application of Atari 2600 games. In contrast, we use Multi-Layer Perceptron (MLP) based neural networks to target robotics applications. In [29] and [22], a hardware architecture is developed to accelerate Trust Region Policy Optimization (TRPO). TRPO uses second-order optimization to prevent divergence of the updated policy, unlike PPO which uses first-order optimization. Moreover, it does not exploit the advantage of parallel agents or batched training. In [9], a CPU-FPGA architecture is proposed for Deep Deterministic Policy Gradient (DDPG), which combines Deep Q-Learning with policy optimization methods. To the best of our knowledge, our work is the first to propose an FPGA-based framework targeting PPO acceleration.

IV. ARCHITECTURE DETAILS

A. Overview

1) CPU-FPGA Task Scheduling: PPO executes iteratively until a user-specified convergence criterion is met. In each iteration, the task scheduling between the Host CPU and the FPGA is shown in Figure 3. We partition the tasks such that the FPGA executes the computationally intensive kernels which can benefit from fine-grained parallelism. In PPO, the neural network propagation operations satisfy these requirements.

The computation of At, Rt and the objective functions are not off-loaded to the FPGA because the amount of computation is relatively small, leading to under-utilization of resource-consuming division functions, and if put on the FPGA they will also introduce unnecessary communication.

Fig. 3. CPU-FPGA Tasks Schedule

The communication between the Host CPU and the FPGA is achieved through a PCIe interface. An inference request is initiated by N parallel agents on the CPU sending N (different) states st to the FPGA, and after each forward propagation (FW) executed in the Inference phase, the N results of the neural networks are sent to the Host CPU for the next-step environmental interaction. At the end of the trajectories, the CPU computes Rt, At for all N × T time stamps, and initiates a training request by sending M states sampled from the N × T collected data to the FPGA. After execution of FW on the FPGA, the M FW results are sent to the CPU to calculate the loss objectives ∇L, which are then sent to the FPGA to execute backward propagations (BW) and weight updates (WU). The {FW-LOSS-BW-WU} process is repeated to cover all N × T samples for K epochs, after which one iteration of the algorithm is complete and the next iteration begins.
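The schedule above can be summarized by the following host-side sketch (ours; DummyEnv and DummyFPGA are placeholder stand-ins for the MuJoCo simulator and the PCIe-attached accelerator, and the loss gradient is a placeholder):

```python
import numpy as np

class DummyEnv:
    """Stand-in for a MuJoCo simulator (11-dimensional state, as in Hopper)."""
    def reset(self):        return np.zeros(11)
    def step(self, action): return np.random.randn(11), float(np.random.rand())

class DummyFPGA:
    """Stand-in for the accelerator: FW returns actions and values; BW/WU are no-ops here."""
    def forward(self, states):             return np.random.randn(len(states), 3), np.random.randn(len(states))
    def backward_and_update(self, grad_L): pass

def ppo_iteration(envs, fpga, T, M, K, gamma=0.99):
    N = len(envs)
    states = np.stack([e.reset() for e in envs])
    all_states, all_values, all_rewards = [], [], []
    # Inference phase: T rounds of FW on the FPGA plus environment interaction (ENV) on the CPU
    for _ in range(T):
        actions, values = fpga.forward(states)              # FW, results sent back to the host
        steps = [e.step(a) for e, a in zip(envs, actions)]  # ENV on the CPU
        all_states.append(states)
        all_values.append(values)
        all_rewards.append([r for (_, r) in steps])
        states = np.stack([s for (s, _) in steps])
    # CPU: Rewards-to-go and advantages for all N x T collected samples
    rewards = np.array(all_rewards)                         # shape T x N
    R, running = np.zeros_like(rewards), np.zeros(N)
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        R[t] = running
    A = R - np.array(all_values)
    samples = np.concatenate(all_states)                    # (N*T) x state_dim
    # Training phase: K epochs of {FW - LOSS - BW - WU} over mini-batches of size M
    for _ in range(K):
        for i in range(0, len(samples), M):
            out, _ = fpga.forward(samples[i:i+M])           # FW on the FPGA
            grad_L = out - out.mean(axis=0)                 # placeholder for the CPU-side dLOSS
            fpga.backward_and_update(grad_L)                # BW + WU on the FPGA

ppo_iteration([DummyEnv() for _ in range(4)], DummyFPGA(), T=8, M=16, K=2)
```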

2) FPGA Design Overview: Figure 4 shows a high-level overview of the FPGA accelerator. We deploy two Compute Units (CU), one for the value network and one for the policy network. The same CUs are used for both Inference and Training. We store the entire MLP's weight matrices and its intermediate results on-chip since the MLPs and mini-batches used for robotics applications in PPO are small in size. The black and blue colored arrows show the data flows for FW and BW, respectively. The process of WU is entirely local to the FPGA since all the network parameters are stored on-chip.

Fig. 4. Overview

B. Compute Units on FPGA

Figure 5 shows the detailed architecture of a single CU. It is composed of a 2D systolic array to perform (1) FW, (2) BW and (3) gradient computation and WU; an activation module; an activation derivative module; and buffers to store the network weights (Weights buffer) and intermediate results (zl buffer for intermediate matrix products, al and δl buffers for FW activation results and BW errors, respectively). The systolic array is composed of Psys × Psys Processing Elements (PE), each performing a MACC (Multiply-Accumulate) operation. The activation and activation derivative modules are arrays of length Psys of activation and activation derivative units. The WU module is an array of adders that subtract the gradients computed by the systolic array from the corresponding weights. The Mult. Module consists of Psys multipliers to calculate the Hadamard products of BW intermediate results and activation results from FW.

In the Inference phase, only FWs are performed and only the last layer results are collected and sent to the CPU. Since N agents perform FW concurrently in each time stamp, to fully utilize the resources of the 2D systolic array, we 'vectorize' the N input state vectors into a matrix. According to the feed-forward rule:

$$
\mathrm{FW}: \quad z^l = a^{l-1}_{(N \times L_{l-1})} * W^l_{(L_{l-1} \times L_l)}, \qquad a^l = \sigma(z^l)_{(N \times L_l)} \quad (2)
$$

where $L_l$ is the layer size of layer $1 \leq l \leq n$. In each layer of the MLP, a pipelined matrix-matrix multiplication is performed by the systolic array. The outputs of each layer, except the final layer, are buffered and sent as input to the PEs to perform the computation of the next layer. The output of the final layer is sent to the CPU.

In the Training phase, we use mini-batch Stochastic Gradient Descent/Ascent (SGD/SGA) [30]. Each epoch is composed of M independent FW and BW, to which we apply a similar vectorization technique as in the Inference phase: form the M input vectors into a matrix and perform the FW, send the vectorized result matrix to the Host CPU to calculate the LOSS as well as $\delta^{LastLayer} = \nabla_a LOSS \odot \sigma'(z^{LastLayer})$, which has dimension $M \times L_{LastLayer}$, and perform BW followed by WU:

$$
\mathrm{BW}: \quad \delta^l = \left( \delta^{l+1}_{(M \times L_{l+1})} \right) * \left( W^{l+1} \right)^T_{(L_{l+1} \times L_l)} \odot \sigma'\!\left( z^l_{(M \times L_l)} \right) \quad (3)
$$

$$
\mathrm{WU}: \quad W^l = W^l_{(L_{l-1} \times L_l)} - \lambda \, \left(a^{l-1}\right)^T_{(L_{l-1} \times M)} * \delta^l_{(M \times L_l)} \quad (4)
$$

$\odot$: Hadamard (element-wise) product; $\lambda$: learning rate; $T$: matrix transpose; $*$: matrix multiplication.
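A compact NumPy sketch of Equations 2-4 for one hidden layer, using ReLU as the activation σ (an illustration of the math the CU implements, not a hardware description; the sizes follow the Humanoid network):

```python
import numpy as np

relu  = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0.0).astype(z.dtype)   # sigma'(z)

M, L0, L1 = 512, 376, 128                     # mini-batch size and layer sizes
a0 = np.random.randn(M, L0)                   # a^{l-1}: layer inputs
W1 = np.random.randn(L0, L1) * 0.01

# FW (Eq. 2): z^l = a^{l-1} W^l, a^l = sigma(z^l)
z1 = a0 @ W1
a1 = relu(z1)

# BW (Eq. 3): delta^l = (delta^{l+1} (W^{l+1})^T) ⊙ sigma'(z^l); delta2 is the error from layer l+1
L2 = 64
W2 = np.random.randn(L1, L2) * 0.01
delta2 = np.random.randn(M, L2)
delta1 = (delta2 @ W2.T) * drelu(z1)

# WU (Eq. 4): W^l <- W^l - lambda * (a^{l-1})^T delta^l
lam = 1e-3
W1 -= lam * (a0.T @ delta1)
```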

Fig. 5. CU architecture

The WU can be performed only after all the error calculations are finished. In the WU process, all products between every pair of neurons in adjacent layers are aggregated, which is essentially another round of Matrix-Matrix Multiplication, so we feed these data into the systolic array, with the data flow path shown in red color in Figure 5.

C. Parallelism & Data Layout

1) Systolic-Array based Parallelism: Both FW and BW consist of multiple propagation steps across all the MLP layers. A propagation step across one layer may take several passes of the systolic array since the sizes of the matrices are usually larger than the systolic array dimension Psys. We partition the input matrices into Psys rows and Psys columns respectively, and feed them into the PEs in a cyclic manner: in each pass, the corresponding Psys rows in each colored row-block of the inputs/intermediate-results matrix A are fed into one side of the systolic array while Psys differently colored columns of the weight matrix W are fed into the other side. Figure 6 (right) shows this process with an example 4 × 4 systolic array. To ensure concurrent accesses by all PEs, the rows/columns read in each pass need to be stored in Psys individual memory blocks.
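For concreteness, the pass schedule can be emulated in a few lines (our sketch): a matrix product is decomposed into ⌈n1/Psys⌉ × ⌈n3/Psys⌉ passes, each consuming Psys rows of the input matrix and Psys columns of the weight matrix:

```python
import numpy as np

def tiled_matmul(A, W, P_sys):
    """Emulates the pass schedule of a P_sys x P_sys systolic array: one output tile per pass."""
    n1, _ = A.shape
    _, n3 = W.shape
    out = np.zeros((n1, n3))
    n_passes = 0
    for i in range(0, n1, P_sys):          # P_sys rows of A per pass
        for j in range(0, n3, P_sys):      # P_sys columns of W per pass
            out[i:i+P_sys, j:j+P_sys] = A[i:i+P_sys, :] @ W[:, j:j+P_sys]
            n_passes += 1
    return out, n_passes

A = np.random.randn(32, 64)                # e.g. N x L_{l-1}
W = np.random.randn(64, 64)                # L_{l-1} x L_l
out, passes = tiled_matmul(A, W, P_sys=16)
assert np.allclose(out, A @ W)             # passes = ceil(32/16) * ceil(64/16) = 8
```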

2) Conflict-Free Data Layout: In the Training phase, the weight matrices that need to be read in the BW passes are the transpose of those in the FW passes. If we store the matrix entirely in either row- or column-major order in Psys BRAM blocks as Figure 6 (left) displays, we cannot ensure concurrent and consecutive accesses to the weight matrix by all PEs in both FW and BW passes, because multiple PEs requesting data from the same BRAM block in one of the FW or BW processes will result in bank conflicts, costing a delay proportional to $\frac{\sum \text{(all layer sizes)}}{P_{sys}}$ cycles to read these rows into consecutive streams before feeding them into the systolic array.

Fig. 6. Systolic Array Detailed Data Flow

Fig. 7. Weights Data Layout

To address this issue, we design a new data layout to store the on-chip parameters and intermediate results of the neural networks in a memory-blocked fashion, as illustrated in Figure 7. We partition the input matrix into Psys × Psys blocks along both dimensions (row and column), and each block is stored in an individual memory block (e.g. BRAM, LUT). In each pass of each layer of forward propagation, data from the weight matrix is fed into the systolic array in row-major order; Psys data among different columns are accessed in parallel, and since these data are stored in different memory blocks, continuous parallel accesses can be achieved. The same holds for each pass of each layer in backward propagation (data read in column-major order): Psys data can be accessed in one clock cycle along different rows as they are stored in separate memory blocks, and data can still be streamed in continuously without BRAM conflicts. Furthermore, we can store the corresponding blocks of different weight matrices belonging to different layers in one memory block, such that Psys × Psys memory blocks store all the weight matrices in the MLP instead of just one, as shown in Figure 8.

With this data layout, we can also stream the data throughout the FW or BW process across the layers of the entire MLP, meaning that any time we switch to the next pass or layer, the data access pipelines can be "stacked" (accessed consecutively back-to-back); thus the data access process only needs to pay the initialization latency of Psys clock cycles once.

Fig. 8. Weight Matrix Storage

As shown in Figure 5, the WU process also needs to take the transpose of the al matrices, which are the intermediate results of each layer in the FW process of training; therefore both the weight buffer and the al layers buffer are designed to support the blocked memory data layout.
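The property the layout provides can be modeled as follows (our own sketch; we assume a cyclic assignment of element (i, j) to bank (i mod Psys, j mod Psys), which is one way to realize the Psys × Psys blocking of Figure 7): any Psys elements read along a row (FW) or along a column (BW) fall into Psys distinct banks, so both access patterns are conflict-free.

```python
P = 4                                           # P_sys

# Assumed cyclic blocking: element (i, j) of a weight matrix lives in bank (i % P, j % P),
# so the P x P banks each hold an (L_{l-1}/P) x (L_l/P) sub-block (cf. Fig. 7).
bank_of = lambda i, j: (i % P, j % P)

def banks_fw_cycle(row, col0):
    """FW: one cycle streams P consecutive elements of one row (row-major access)."""
    return {bank_of(row, col0 + k) for k in range(P)}

def banks_bw_cycle(row0, col):
    """BW: one cycle streams P consecutive elements of one column (transposed access)."""
    return {bank_of(row0 + k, col) for k in range(P)}

# Both patterns touch P distinct banks per cycle, i.e. they are conflict-free in FW and BW.
assert len(banks_fw_cycle(3, 0)) == P and len(banks_bw_cycle(0, 5)) == P
```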

D. Load Balancing Module

In this section, we focus on addressing the problem of possible load imbalance between the CUs, which is caused by different numbers of epochs (and mini-batch sizes) for training the value and policy networks. This occurs due to the following reasons: (1) the early stopping technique applied in the Training phase to terminate policy training when the approximate KL divergence between the new and old policy becomes large [24], and (2) different convergence rates between the value and policy networks. This may lead to the undesired situation where, when the training for one network is complete or stopped, its CU is completely idle, waiting for the training of the other network to complete. To address this issue, we develop an Inter-Load Balancing Module which off-loads the training of one network to both CUs in such a situation, as shown in Figure 9. Both CUs have access to the buffers storing the parameters of both networks via multiplexers. When both networks are training, each CU accesses its own corresponding weight buffer.

Fig. 9. Load Balance Module

When one network finishes training, the CU is switched to access the weights corresponding to the other network, and the two CUs access the same weight buffer. The mini-batch inputs from the CPU are evenly divided into two sub-mini-batches to feed the two CUs concurrently. During the FW and BW processes, the PEs in the systolic arrays of the two CUs simply read from the selected buffers. In the WU process the mechanism is more involved: as the WU modules of the two CUs update the same memory address of the weight buffer in each clock cycle, the outputs of the two WUs are first summed up with an adder array of length Psys before writing into the weight buffer. To avoid reading and adding the original weight twice, we add another layer of selection before the weight buffer selection so that when the two WUs are working on the same network, only one WU reads the network weight and the other reads 0.
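The shared weight update can be checked with a small NumPy sketch (ours): each CU updates with the gradients of its sub-mini-batch, one CU reads the stored weight while the other reads zero, and the adder array sums the two outputs before a single write-back; the result matches a conventional update over the full mini-batch.

```python
import numpy as np

M, L_in, L_out, lam = 512, 64, 64, 1e-3
W      = np.random.randn(L_in, L_out)
a_prev = np.random.randn(M, L_in)            # a^{l-1} for the full mini-batch
delta  = np.random.randn(M, L_out)           # delta^l for the full mini-batch

# Split the mini-batch into two sub-mini-batches, one per CU
a0, a1 = a_prev[:M//2], a_prev[M//2:]
d0, d1 = delta[:M//2],  delta[M//2:]

wu_cu0 = W                - lam * (a0.T @ d0)   # CU0 reads the real weights
wu_cu1 = np.zeros_like(W) - lam * (a1.T @ d1)   # CU1 reads 0 to avoid double-counting W
W_shared = wu_cu0 + wu_cu1                      # adder array output, written back once

W_single = W - lam * (a_prev.T @ delta)         # reference: one CU on the full mini-batch
assert np.allclose(W_shared, W_single)
```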

V. PERFORMANCE MODEL & DESIGN SPACE EXPLORATION

We develop an accurate performance model to estimate the impact of parameters such as the dimension of the systolic arrays and the data width, and of constraints such as the available BRAMs/DSPs, accuracy, etc., on the performance of the design, based on which we perform design space exploration (DSE) to identify optimal designs. The algorithmic hyper-parameters N, M, T, K respectively represent the number of parallel agents, the mini-batch size, the trajectory length and the number of training epochs, in accordance with the descriptions in Section II-A.

A. Performance Objective Function

The number of clock cycles needed for one execution of the matrix-matrix product $A(n_1 \times n_2) \times B(n_2 \times n_3)$ is:

$$
Y_{mm}(n_1, n_2, n_3) = \left\lceil \frac{n_1}{P_{sys}} \right\rceil \times \left\lceil \frac{n_3}{P_{sys}} \right\rceil \times n_2 + P_{sys} - 1 \quad (5)
$$

Therefore the clock cycles needed for a complete FW, BW and WU across all the layers are

$$
Y_{FW} = \sum_{i=1}^{n-1} Y_{mm}(D, L_i, L_{i+1}), \qquad
Y_{BW} = \sum_{i=2}^{n} Y_{mm}(M, L_i, L_{i-1}), \qquad
Y_{WU} = \sum_{i=1}^{n-1} Y_{mm}(L_i, M, L_{i+1}) \quad (6)
$$

where $L_i$ denotes the $i$-th layer size, and D can be N or M, depending on whether it is the Inference or Training phase. The overall throughput in IPS (number of interactions in the Inference phase divided by the entire iteration time, as defined in Section VI-D1) is calculated as:

$$
U = \frac{N \times T}{T_{inf} + T_{env} \times T + T_{train}}, \quad \text{where} \quad (7)
$$

$$
T_{inf} = Y_{FW} \times T + T_{comm1} + T_{R,A} \quad \text{and} \quad (8)
$$

$$
T_{train} = T_{minibatch} \times K \times \frac{N \times T}{M} + T_{comm2}, \quad \text{where} \quad
T_{minibatch} = Y_{FW} + Y_{BW} + T_{LOSS} + Y_{WU} \quad (9)
$$

In the performance objective function, $T_{env}$ denotes the one-step environmental interaction time; $T_{LOSS}$ and $T_{R,A}$ denote the time needed for the computations of the ∇L objectives and of $R_{0..T}$, $A_{0..T}$ on the CPU; $T_{comm1}$ and $T_{comm2}$ denote the total communication costs between the FPGA and the Host CPU in the Inference and Training phases, respectively.

In our optimized FPGA-based design, the total environmental interaction time usually takes more than 49% of the entire iteration. Therefore, when fixing the other hyper-parameters constant and increasing N, the throughput should increase, as evident in Equation 7, since the denominator is partly bottlenecked by $T_{env}$. When increasing T or M, the throughput should not change, as the factor of T cancels in both the numerator and denominator of U, and the factor of M cancels within $T_{train}$. These characteristics are confirmed in our evaluations.
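The performance model of Equations 5-9 translates directly into Python; the sketch below is ours, converts cycles to seconds with the clock frequency (an addition of ours), and uses made-up placeholder values for the measured CPU-side and communication terms (Tenv, TLOSS, TR,A, Tcomm1, Tcomm2):

```python
from math import ceil

def Y_mm(n1, n2, n3, P):
    """Eq. 5: cycles for an (n1 x n2) x (n2 x n3) matrix product on a P x P systolic array."""
    return ceil(n1 / P) * ceil(n3 / P) * n2 + P - 1

def iteration_ips(L, P, N, M, T, K, f_clk,
                  T_env, T_loss, T_ra, T_comm1, T_comm2):
    """Eqs. 6-9: throughput U (IPS) for an MLP with layer sizes L."""
    n = len(L)
    Y_FW_inf   = sum(Y_mm(N, L[i], L[i+1], P) for i in range(n - 1))   # D = N in Inference
    Y_FW_train = sum(Y_mm(M, L[i], L[i+1], P) for i in range(n - 1))   # D = M in Training
    Y_BW       = sum(Y_mm(M, L[i], L[i-1], P) for i in range(1, n))
    Y_WU       = sum(Y_mm(L[i], M, L[i+1], P) for i in range(n - 1))

    T_inf   = Y_FW_inf / f_clk * T + T_comm1 + T_ra                    # Eq. 8
    T_mb    = (Y_FW_train + Y_BW + Y_WU) / f_clk + T_loss              # T_minibatch
    T_train = T_mb * K * (N * T) / M + T_comm2                         # Eq. 9
    return (N * T) / (T_inf + T_env * T + T_train)                     # Eq. 7

# Example: Hopper-sized network at 285 MHz (the timing terms below are illustrative placeholders)
print(iteration_ips([11, 64, 64, 3], P=16, N=32, M=512, T=512, K=10, f_clk=285e6,
                    T_env=1e-3, T_loss=1e-4, T_ra=1e-2, T_comm1=1e-2, T_comm2=1e-2))
```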

B. DSE

The optimal mapping parameters Psys - the systolic array dimension - and wdatatype - the data representation in number of bits - can be obtained by solving the following optimization problem:

$$
\underset{P_{sys},\, w_{datatype}}{\text{maximize}} \; U \quad \text{subject to:} \; C_{BRAM},\, C_{DSP},\, C_{Data} \quad (10)
$$

where $C_{BRAM}$ and $C_{DSP}$ represent the on-chip memory and DSP constraints:

$$
2\left(\sum_{i=1}^{n-1} L_i L_{i+1} + 3M \sum_{i=1}^{n} L_i\right) \times w_{datatype} < BRAM_{FPGA}, \qquad
2 \times P_{sys}^2 \times DSP_{PE} + DSP_{others} < DSP_{FPGA} \quad (11)
$$

$C_{Data}$ is a constraint used to ensure the accuracy of the neural network function model, for which we set a lower bound on the data width such that the accuracy is within a reasonable threshold required by the application: $w_{datatype} > w$.

The DSE problem described above can be solved using standard Python-based optimization libraries [31].
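Because the design space is small (a few tens of Psys candidates times a few data widths), an exhaustive sweep is also sufficient; the sketch below (ours) reuses the iteration_ips model above and uses illustrative placeholder resource numbers rather than the exact Alveo U200 figures:

```python
def dse(L, N, M, T, K, f_clk, timing, bram_bits, dsp_total, dsp_per_pe, dsp_others, w_min):
    """Exhaustive DSE over (P_sys, w_datatype) under the BRAM, DSP and accuracy constraints (Eq. 10-11)."""
    n = len(L)
    best = None
    for P in range(4, 33):
        for w in (8, 16, 32):
            if w < w_min:                                           # C_Data: accuracy lower bound
                continue
            bram_need = 2 * (sum(L[i] * L[i+1] for i in range(n-1)) + 3 * M * sum(L)) * w
            dsp_need  = 2 * P * P * dsp_per_pe + dsp_others
            if bram_need >= bram_bits or dsp_need >= dsp_total:     # C_BRAM, C_DSP
                continue
            ips = iteration_ips(L, P, N, M, T, K, f_clk, **timing)  # model defined above
            if best is None or ips > best[0]:
                best = (ips, P, w)
    return best

timing = dict(T_env=1e-3, T_loss=1e-4, T_ra=1e-2, T_comm1=1e-2, T_comm2=1e-2)
print(dse([376, 128, 64, 17], N=32, M=512, T=512, K=10, f_clk=285e6, timing=timing,
          bram_bits=300e6, dsp_total=6840, dsp_per_pe=5, dsp_others=100, w_min=8))
```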

VI. EXPERIMENTS & EVALUATIONS

A. Benchmark Experiments and Platforms Overview

We use two MuJoCo benchmarks from OpenAI Gym [21] - which are widely used to evaluate RL algorithms [32], [33], [29] - Hopper and Humanoid. We apply PPO to learn the policy to control a robot to hop or run. We use the same structure for the value and policy networks except for the output layer, with relu() activation. Hopper-v2 has 4 rigid body parts and 3 actuated joints, an 11-dimensional state space and a 3-dimensional action space. We use a neural network of size 11 (input) - 64 - 64 - 3 (output). Humanoid-v2 is a simulated robot with the largest state and action space in the MuJoCo environment. We use a neural network of size 376 (input) - 128 - 64 - 17 (output).

We compare our CPU-FPGA design against two baselines: a CPU-only design and a CPU-GPU design. We use an Intel Xeon 5120 at 2.20 GHz as the (Host) CPU for all three designs. A TITAN Xp GPU [34] at 1404 MHz is used as the GPU while a Xilinx Alveo U200 [35] is used as the FPGA. We use the OpenAI implementation of PPO [24] for the baselines, which uses MPI [36] for parallel agents on the CPU and TensorFlow [37], [20] for the neural networks, allowing training/inference on either the CPU (for the CPU-only baseline) or the GPU (for the CPU-GPU baseline). The training is performed using mini-batch SGD/SGA [30].

B. Design Space Exploration

We first perform DSE as discussed in Section V. The results for Humanoid under the example hyper-parameters [N, M, T, K] = [32, 512, 512, 10] and an example 8-bit (data accuracy) lower bound on w are plotted in Figure 10. The same procedure is performed for Hopper and a similar result is obtained. The solid lines represent throughputs with different combinations of architectural parameters on the target hardware platform, and the area enclosed by the dashed lines indicates the achievable performance bounded by the limited number of on-chip BRAMs, DSPs, and the application-specific data accuracy requirement. The overall throughput (IPS) is calculated from Equation 7, and based on the constraints we infer that the optimal design points for both MuJoCo benchmarks are (Psys, wdatatype) = (21, wlowerbound) for the hyper-parameters used in the evaluations. Overall, the best performance achievable from our design is bounded by the available DSP resources. For the following performance evaluation, we constrain the Psys value to be the highest power of 2 to ensure minimum zero-padding; the floating-point data width is chosen to be 32, which is within the memory bound and yields a reasonable approximation accuracy for both problems.

Fig. 10. DSE with Humanoid (Psys vs. data width; contours show IPS, with the OPT range bounded by the data accuracy, BRAM and DSP constraints)

C. Resource Utilization

We implement the design consisting of two CUs, buffers (with the data layout optimization) and the load balancing module on the FPGA using Vivado 2018.3 for synthesis. Our implementation sustains a clock frequency of 285 MHz. Table I summarizes the resource usage breakdown on the FPGA.

The BRAM utilization for both benchmarks increases with the mini-batch size M, as the zl, al and δl buffers need to store all intermediate results. Humanoid has larger memory utilization than Hopper due to its larger-sized neural networks. The utilization of the remaining resources does not vary with different hyper-parameters.

TABLE I
FPGA RESOURCE UTILIZATION

Resource   exp.   M=64         M=128        M=256         M=512
BRAM       1      696 (32%)    752 (35%)    810 (38%)     926 (43%)
BRAM       2      638 (30%)    926 (43%)    1466 (68%)    2160 (100%)
DSP        both   3744 (55%)   (independent of M)
Logic      both   501k (42%)   (independent of M)
FlipFlop   both   708k (30%)   (independent of M)

* exp. 1: Hopper; exp. 2: Humanoid.

D. Performance

1) Throughput: In accordance with existing works on accelerating RL, we use the metric Number of Inferences Processed per Second (IPS) to measure throughput performance. IPS is the ratio of the total number of samples collected in the Inference phase to the total time taken in executing the Inference phase (neural network inferences plus environment interactions) and the Training phase (mini-batch training of the neural networks using the collected samples).

Figure 11 shows the IPS (in log scale for better visualization) for the CPU-FPGA, CPU-only and CPU-GPU implementations. The labels (a×, b×) above the CPU-FPGA bars indicate a× improvement over the CPU-only and b× improvement over the CPU-GPU implementation in IPS, respectively.

We observe similar trends in IPS for both benchmarks for the same set of experiments, as evident from Figures 11a vs 11d, Figures 11b vs 11e, and Figures 11c vs 11f. All three sets exhibit higher IPS on Hopper, and our design achieves higher improvements against the baselines on Hopper than on Humanoid. This is due to the following reasons: (a) the Humanoid environment is more complex than Hopper, leading to (3×) longer interaction times; as the environmental interactions for each agent are sequential, this leaves smaller room for speed-up after acceleration; (b) a larger network is required for Humanoid than for Hopper, therefore the FPGA design takes a larger number of systolic array passes for training and inference, resulting in lower speed-up.

In Figures 11a and 11d, we fix [M, T, K] = [512, 512, 10] and vary N. As expected from our performance model, higher IPS is observed when increasing the number of parallel agents N due to the increased parallelism. The highest performance improvement is observed at N=32, implying that our design scales better than the baselines with increasing N.

In Figures 11b and 11e, we fix [N, T, K] = [16, 512, 10] and vary M. We observe that on both the CPU and CPU-GPU platforms, as the mini-batch size drops, the IPS drops too. This is due to an increase in the frequency of synchronizations between threads for WU operations. Our fine-grained pipelined design, where such synchronization overheads are non-existent, does not suffer from this slowdown.

In Figures 11c and 11f, we fix [N, M, K] = [32, 512, 10] and vary T. Increasing T leads to increases in both the number of inferences processed per iteration and the total iteration time, therefore we observe that performance remains roughly the same on each platform with different values of T. Overall, we observe significant improvements on the CPU-FPGA platform over the baselines.

Fig. 11. Throughput comparison of our design (CPU-FPGA) and the baselines (CPU, CPU-GPU) varying N, M, T. Panels: (a) Hopper: varying N; (b) Hopper: varying M; (c) Hopper: varying T; (d) Humanoid: varying N; (e) Humanoid: varying M; (f) Humanoid: varying T.

2) Data Layout Optimization: To evaluate the effectiveness of the data layout optimization, we consider a baseline (no optimization) design where the weight matrices and al activations are stored in Psys BRAM blocks such that data can be streamed into the systolic array in FW passes while extra cycles are required in the BW and WU passes. We measure the BW/WU latencies of a single mini-batch in clock cycles and the overall IPS for this case on both robotics benchmarks for the set of hyper-parameters [N, M, T, K] = [32, 512, 512, 10], and compare with our optimized design using the blocked data layout, as shown in Figure 12. Our results show a 23.5% increase in overall throughput (IPS) for Hopper and a 21.2% increase for Humanoid using the memory-blocked data layout.

Fig. 12. Data layout optimization ((a) Hopper; (b) Humanoid; bars show mini-batch BW/WU latency in clock cycles and the resulting IPS for the baseline and the blocked data layout)

3) Inter-Load Balance Optimization: We compare the implementation with the load balance optimization against a baseline design without it, where we apply the early stopping trick used in the OpenAI Spinning Up code [24]; load imbalance occurs in the early iterations of PPO as the KL divergence is prone to be larger (resulting in earlier stopping of policy training) at the beginning. We run the algorithm for the first 6 iterations and measure the training latencies in every iteration as well as the running average throughput, again with hyper-parameters [N, M, T, K] = [32, 512, 512, 10]. The results in Figure 13 show 9.3% to 5.8% and 28.3% to 22.5% increases in overall running average throughput (IPS) over the first 6 iterations of PPO for Hopper and Humanoid, respectively.

Fig. 13. Load balance optimization ((a) Hopper; (b) Humanoid; per-iteration training latency for the baseline and the load-balanced (LB) design, and the running-average % increase in IPS)

VII. CONCLUSION

In this work, we accelerated PPO targeting a CPU-FPGA heterogeneous platform. We developed novel optimizations and achieved significant speedup over CPU and CPU-GPU heterogeneous platform based designs. The optimizations developed have broader applicability. For example, our memory-blocked data layout optimization can be used in the acceleration of SGD-based training of Deep Learning models in general, and our systolic array compute sharing technique can be used to mitigate load imbalances in other RL algorithms leveraging multiple neural networks.

In the future, we will generalize our design to support other policy optimization algorithms. We will also explore acceleration of PPO with other underlying function approximators such as CNNs, LSTMs, etc. Such networks require novel optimizations when the network parameters cannot fit on-chip.

ACKNOWLEDGMENT

This work was supported by the U.S. National Science Foundation under grant OAC-1911229 and the Intel Strategic Research Alliance.


REFERENCES

[1] R. Sutton, M. Bowling, D. Schuurmans, and V. Bulitko, "Reinforcement learning and artificial intelligence," 2006.
[2] S. Shalev-Shwartz, S. Shammah, and A. Shashua, "Safe, multi-agent, reinforcement learning for autonomous driving," arXiv preprint arXiv:1610.03295, 2016.
[3] P. R. Gankidi, "FPGA accelerator architecture for Q-learning and its applications in space exploration rovers," Ph.D. dissertation, Arizona State University, 2016.
[4] P. Remagnino, A. Shihab, and G. A. Jones, "Distributed intelligence for multi-camera visual surveillance," Pattern Recognition, vol. 37, no. 4, pp. 675–689, 2004.
[5] P. Hamet and J. Tremblay, "Artificial intelligence in medicine," Metabolism, vol. 69, pp. S36–S40, 2017.
[6] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.
[7] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[8] H. Zeng and V. Prasanna, "GraphACT: Accelerating GCN training on CPU-FPGA heterogeneous platforms," arXiv preprint arXiv:2001.02498, 2019.
[9] C. Guo, W. Luk, S. Q. S. Loh, A. Warren, and J. Levine, "Customisable control policy learning for robotics," in 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), vol. 2160. IEEE, 2019, pp. 91–98.
[10] D. J. Moss, S. Krishnan, E. Nurvitadhi, P. Ratuszniak, C. Johnson, J. Sim, A. Mishra, D. Marr, S. Subhaschandra, and P. H. Leong, "A customizable matrix multiplication framework for the Intel HARPv2 Xeon+FPGA platform: A deep learning case study," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2018, pp. 107–116.
[11] A. Kroh and O. Diessel, "A short-transfer model for tightly-coupled CPU-FPGA platforms," in 2018 International Conference on Field-Programmable Technology (FPT). IEEE, 2018, pp. 366–369.
[12] CCIX. [Online]. Available: https://www.ccixconsortium.com/
[13] OpenCAPI. [Online]. Available: https://opencapi.org/
[14] S. Zhou and V. K. Prasanna, "Accelerating graph analytics on CPU-FPGA heterogeneous platform," in 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 2017, pp. 137–144.
[15] A. Kroh and O. Diessel, "Efficient fine-grained processor-logic interactions on the cache-coherent Zynq platform," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 11, no. 4, pp. 1–22, 2019.
[16] W. S. Lovejoy, "A survey of algorithmic methods for partially observed Markov decision processes," Annals of Operations Research, vol. 28, no. 1, pp. 47–65, 1991.
[17] S. Kakade and J. Langford, "Approximately optimal approximate reinforcement learning," in ICML, vol. 2, 2002, pp. 267–274.
[18] Proof: Simplified PPO clip objective. [Online]. Available: https://drive.google.com/file/d/1PDzn9RPvaXjJFZkGeapMHbHGiWWW20Ey/view
[19] MuJoCo environments. [Online]. Available: https://gym.openai.com/envs/#mujoco
[20] D. Hafner, J. Davidson, and V. Vanhoucke, "TensorFlow agents: Efficient batched reinforcement learning in TensorFlow," arXiv preprint arXiv:1709.02878, 2017.
[21] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv preprint arXiv:1606.01540, 2016.
[22] S. Shao, J. Tsai, M. Mysior, W. Luk, T. Chau, A. Warren, and B. Jeppesen, "Towards hardware accelerated reinforcement learning for application-specific robotic control," in 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2018, pp. 1–8.
[23] H. Cho, P. Oh, J. Park, W. Jung, and J. Lee, "FA3C: FPGA-accelerated deep reinforcement learning," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2019, pp. 499–513.
[24] OpenAI Spinning Up implementation of PPO. [Online]. Available: https://spinningup.openai.com/en/latest/algorithms/ppo.html, https://github.com/openai/spinningup/tree/master/spinup/algos/ppo
[25] J. Su, J. Liu, D. B. Thomas, and P. Y. Cheung, "Neural network based reinforcement learning acceleration on FPGA platforms," ACM SIGARCH Computer Architecture News, vol. 44, no. 4, pp. 68–73, 2017.
[26] P. R. Gankidi and J. Thangavelautham, "FPGA architecture for deep learning and its application to planetary robotics," in 2017 IEEE Aerospace Conference. IEEE, 2017, pp. 1–9.
[27] L. M. Da Silva, M. F. Torquato, and M. A. Fernandes, "Parallel implementation of reinforcement learning Q-learning technique for FPGA," IEEE Access, vol. 7, pp. 2782–2798, 2018.
[28] R. Rajat, Y. Meng, S. Kuppannagari, A. Srivastava, V. Prasanna, and R. Kannan, "QTAccel: A generic FPGA based design for Q-table based reinforcement learning accelerators," in The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020, pp. 323–323.
[29] S. Shao and W. Luk, "Customised Pearlmutter propagation: A hardware architecture for trust region policy optimisation," in 2017 27th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2017, pp. 1–6.
[30] S. Ruder, "An overview of gradient descent optimization algorithms," arXiv preprint arXiv:1609.04747, 2016.
[31] SciPy Python-based library for optimization problems. [Online]. Available: https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
[32] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control," in International Conference on Machine Learning, 2016, pp. 1329–1338.
[33] E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba, "Benchmarking model-based reinforcement learning," arXiv preprint arXiv:1907.02057, 2019.
[34] NVIDIA TITAN Xp. [Online]. Available: https://www.nvidia.com/en-us/titan/titan-xp/
[35] Xilinx Alveo U200 data center accelerator card. [Online]. Available: https://www.xilinx.com/products/boards-and-kits/alveo/u200.html
[36] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1999, vol. 1.
[37] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.