Scalable Heterogeneous Accelerating Structure for the HEVC Temporal Prediction

João Luís Henriques Afonso, Instituto Superior Técnico, University of Lisbon, Portugal

[email protected]

Abstract—Despite offering significant gains in encoding efficiency, the state-of-the-art High Efficiency Video Coding (HEVC) standard is characterized by a significant computational complexity, making it hard to implement in real-time on today's General Purpose Processors (GPPs). This is mainly due to the Motion Estimation (ME) part of its temporal prediction module - a very demanding and repetitive operation that takes a significant portion of the processing time. To circumvent this constraint, a scalable heterogeneous accelerating structure is presented, consisting of a set of Processing Elements (PEs) able to process several blocks of the current frame (the CTUs) in parallel. These PEs are composed of highly efficient accelerators, dedicated to the most demanding ME parts: the integer search and the sub-pixel refinement operations. Moreover, the conceived PEs are fully programmable, thus supporting the implementation of most state-of-the-art ME algorithms and search strategies. Furthermore, a highly efficient distributed memory mechanism was also devised, in order to satisfy the demanding transfers of pixel data from the main encoder memory to each PE without constraining the scalability of the system. To test and evaluate the proposed scalable acceleration system, an aggregate of 16 PEs was prototyped in a Virtex-7 FPGA, connected to a high-end desktop GPP via PCIe. The obtained results showed that the distributed memory mechanism and the devised accelerators are able to process the input video stream in real-time (30 fps), up to the resolution of 2160p (4K-UHD).

I. INTRODUCTION

In 2013, a new video coding standard was proposed, named HEVC [1], which aims to reduce the size of the encoded video bitstream by around 50% in comparison to the previous standard, H.264/AVC, while preserving the same perceptual visual quality. However, this comes with a much higher computational cost for encoding and decoding.

Therefore, the work presented in this dissertation aims to accelerate the most complex step of an HEVC encoder: the interframe prediction. This part of the encoder may take up to 60% of the encoding time [2], due to the exhaustive nature of the ME procedure, which requires a significant number of comparisons between the blocks of the current frame and the reference frames. Added to the fact that ultra-high definition formats have a larger area, with many more blocks, this results in a very exhaustive ME procedure. Fortunately, since these comparisons are very repetitive, they are a great candidate for acceleration with dedicated parallel architectures. Hence, the work developed for this dissertation aimed to accelerate the interframe prediction by providing the means to massively parallelize the ME procedure. It consists of a distributed scheme, containing a scalable number of Acceleration Units (AUs), each one able to process different Coding Tree Units (CTUs) and respective sub-blocks.

Inside each of these AUs, a set of dedicated accelerators is used to accelerate the most demanding parts of the ME procedure, while a programmable controller is used to control the steps of the ME.

In addition, a fast and efficient distributed memory mechanism was developed, able to provide the pixel data to a scalable number of AUs and accelerators simultaneously.

The proposed distributed prediction scheme was implemented in a Virtex-7 FPGA, which allowed 16 PEs to be implemented in parallel. These PEs correspond to a materialization of the AUs. Each one contains two specialized accelerators for the ME, also proposed in this dissertation: one for the integer search, the other for the sub-pixel refinement. In addition, a local GPP is used to control the various steps executed by the accelerators. By manipulating the program on these GPPs, multiple ME algorithms and strategies can be implemented.

II. TEMPORALLY PREDICTED VIDEO ENCODING

In the HEVC encoder, each picture is split into block-shaped regions of various sizes, the CTUs. For most pictures, interpicture temporal prediction is typically employed for encoding, due to its better efficiency. This process involves selecting, for each block, the reference picture and Motion Vector (MV) that best match the block being encoded (see Fig. 1). This operation, denoted Motion Estimation (ME), allows the encoder to compress the pictures based on how the different partitions move from frame to frame, thus reducing the amount of information required to encode them. According to [2], two of the steps used by the ME module of the HEVC inter-prediction, the Sum of Absolute Differences (SAD) and the sub-pixel interpolation, take 40% and 20% of the total encoding time, respectively.

A. Motion Estimation

The Motion Estimation (ME) analyses the displacement of each image partition in the neighbouring frames and encodes this information using MVs (see Fig. 1). When the frame partitioning and the corresponding MVs are very precise, the residuals become very small, thus reducing the total amount of information in the bitstream. However, a high precision analysis usually requires more computations.


Fig. 1. Motion vector that describes the displacement of a certain block.

Fig. 2. Three step search algorithm (example converging to MV(-3,3)).

Fig. 3. Diamond search algorithm (example converging to MV(-3,-4)).

Thus, a balance between the attained bitrate reduction and the computational power required to process it needs to be considered.

To determine the best MV for a certain block, a block-matching ME algorithm is usually employed. It involves comparing each small partition of the current frame to the corresponding block and its adjacent neighbours in one or more nearby reference frames, inside a search window. The position with the lowest cost is then chosen for the MV. For faster encoders, the cost is usually the SAD between the luma samples of both frames.
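As a concrete illustration of this cost, the following C sketch evaluates the SAD of one candidate position. It is a minimal software model of the comparison described above (function and parameter names are ours), not the hardware pipeline presented later.

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD between an n-by-n current-frame block and the candidate position
 * displaced by (dx, dy) in the reference frame. Both frames are assumed
 * to share the same line stride; bounds checking is omitted. */
uint32_t sad(const uint8_t *cur, const uint8_t *ref, int stride,
             int n, int dx, int dy)
{
    uint32_t cost = 0;
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            cost += abs((int)cur[y * stride + x] -
                        (int)ref[(y + dy) * stride + (x + dx)]);
    return cost;
}
```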

B. Integer Search Algorithms

The optimal ME result is obtained with the full-search block-matching algorithm, which exhaustively compares the current block with all the possible positions inside the reference frame search window. This algorithm guarantees that the best match is always found but, due to the very high resolutions of today's video formats and the complexity of the standards, it has become impractical.

Therefore, most application domains have adopted alternatives based on fast block-matching algorithms. These algorithms only evaluate a small number of positions in each step, and use the best result as the starting point of the next iteration. After a certain number of iterations, a minimum of the cost function is reached.

Two of the first fast search algorithms to be proposed were the Three Step Search (TSS) [3] and the Diamond Search (DS) [4] (see figures 2 and 3). More recently, other algorithms have been proposed, such as the Enhanced Predictive Zonal Search (EPZS) [5] and the Test Zone Search (TZS) [6], which achieve better results for high resolution formats at the cost of a higher algorithmic complexity. A sketch of the TSS is given below.
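This sketch of the TSS reuses the sad() helper above and follows the textbook form of the algorithm [3]: starting from the origin with a step of 4 pixels, the 8 neighbours of the current best position are tested and the step is halved after each iteration, for a total of 25 SAD evaluations. Variable names are ours.

```c
#include <stdint.h>

/* Three Step Search: find the motion vector (best_dx, best_dy) of an
 * n-by-n block, evaluating 9 + 8 + 8 = 25 positions in total. */
void tss(const uint8_t *cur, const uint8_t *ref, int stride, int n,
         int *best_dx, int *best_dy)
{
    int cx = 0, cy = 0;                          /* current best position */
    uint32_t best = sad(cur, ref, stride, n, 0, 0);
    for (int step = 4; step >= 1; step /= 2) {
        int bx = cx, by = cy;
        for (int dy = -step; dy <= step; dy += step)
            for (int dx = -step; dx <= step; dx += step) {
                if (dx == 0 && dy == 0)
                    continue;                    /* centre already tested */
                uint32_t c = sad(cur, ref, stride, n, cx + dx, cy + dy);
                if (c < best) { best = c; bx = cx + dx; by = cy + dy; }
            }
        cx = bx;                                 /* move to the best match */
        cy = by;
    }
    *best_dx = cx;
    *best_dy = cy;
}
```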

C. Sub-Pixel Refinement

Sub-pixel MV refinement is employed after the integer block-matching ME algorithm, to obtain a better estimation of the MV. It usually takes the two steps illustrated in Fig. 4. In the first step, the block is compared to eight half-pixel precision locations around the best integer-precision MV.

After this, a quarter-pixel precision comparison takes place, using the best half-pixel location as a starting point.
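In control-flow terms, the two steps can be sketched as follows; cost_at() stands for an interpolate-and-compare evaluation at quarter-pixel coordinates and is a placeholder, not an API of the proposed architecture.

```c
#include <stdint.h>

uint32_t cost_at(int mvx4, int mvy4);  /* placeholder: interpolate + SAD
                                          at quarter-pel MV (mvx4, mvy4) */

/* Two-step refinement of an integer-precision MV held in quarter-pel
 * units: first the 8 half-pel neighbours (step 2), then the 8 quarter-pel
 * neighbours (step 1) around the best half-pel result. */
void subpel_refine(int *mvx4, int *mvy4)
{
    static const int off[8][2] = { {-1,-1}, {0,-1}, {1,-1}, {-1,0},
                                   { 1, 0}, {-1,1}, {0, 1}, { 1,1} };
    for (int step = 2; step >= 1; step--) {
        int bx = *mvx4, by = *mvy4;
        uint32_t best = cost_at(bx, by);
        for (int i = 0; i < 8; i++) {
            int x = *mvx4 + off[i][0] * step;
            int y = *mvy4 + off[i][1] * step;
            uint32_t c = cost_at(x, y);
            if (c < best) { best = c; bx = x; by = y; }
        }
        *mvx4 = bx;
        *mvy4 = by;
    }
}
```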

MV(0.25;-1.25)

Integer pixel precision

Half pixel precision

Quarter pixel precision

Fig. 4. Sub-pixel refinement step.

The most demanding part of this sub-pixel refinement is the interpolation of the sub-pixels, which HEVC requires to be done with an 8-tap or a 7-tap FIR filter [1].
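For reference, the HEVC luma interpolation uses the filter coefficients below (taken from the standard [1]); the small routine shows one horizontal half-sample interpolation, with the normalization and clipping stages left out for brevity.

```c
#include <stdint.h>

/* HEVC luma filters: 8 taps for the half-sample position and 7 taps for
 * the quarter-sample positions [1]. */
static const int hevc_half[8]    = { -1, 4, -11, 40, 40, -11, 4, -1 };
static const int hevc_quarter[7] = { -1, 4, -10, 58, 17,  -5, 1 };

/* Horizontal half-pel sample; p points 3 integer samples to the left of
 * the interpolated position. The result is kept at intermediate
 * precision; shifting and clipping happen in a later stage. */
int interp_half_h(const uint8_t *p)
{
    int acc = 0;
    for (int i = 0; i < 8; i++)
        acc += hevc_half[i] * p[i];
    return acc;
}
```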

D. Block Organization for Temporal Prediction

To improve the block partitioning efficiency and achieve a better bit rate reduction, the HEVC standard introduced the CTU structure (see Fig. 6). Each CTU consists of one luma and two chroma components, each one represented by a Coding Tree Block (CTB), and the associated syntax (Fig. 5). Unlike the macroblock structure from previous standards, the CTUs can have a size of up to 64×64.

Each CTU and CTB can be further partitioned into Coding Units (CUs) and Coding Blocks (CBs), respectively, following a quad-tree structure (see figures 6 and 7).

In addition, each CU is associated with a group of Prediction Units (PUs), which can be generated by partitioning the CU according to 8 possible prediction modes (see Fig. 8). Usually, the luma Prediction Blocks (PBs) are the blocks that are compared to the reference frames, using the considered motion search algorithm, in order to find the best MV. When encoding a video sequence, the encoder can choose which CU sizes and PU modes to analyse.

To further improve the encoding efficiency, some HEVC features impose dependencies, by using the MV information from neighbouring PUs to improve the encoding efficiency of each PU. If these dependencies are to be respected, each PU can only be processed after the left, top-left, top-middle and top-right neighbouring PUs have been processed. Fortunately, HEVC specifies two new parallelization approaches to accelerate the encoding procedure, while still allowing information to be inherited from neighbouring blocks.

The first one consists of the usage of Tiles. This option allows the frame to be partitioned into a number of rectangular regions, usually with the same size, where intra and inter-prediction can be independently performed. Another feature introduced in HEVC is Wavefront Parallel Processing (WPP). With WPP, each frame is divided into horizontal rows of CTUs. Each of these rows can be encoded in parallel, provided that the row above is at least two CTUs ahead, as sketched below.
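A minimal sketch of the WPP dependency check (the bookkeeping names are ours): a CTU can start once the row above has advanced at least two CTUs further.

```c
#include <stdbool.h>

/* cols_done[r] holds how many CTUs of row r are already encoded. */
bool wpp_ready(const int *cols_done, int row, int col)
{
    if (row == 0)
        return true;                      /* the first row has no dependency */
    return cols_done[row - 1] >= col + 2; /* row above must be 2 CTUs ahead  */
}
```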

As will be described in the following sections, the architecture presented in this thesis aims to be compatible with several parallelization methods.


Fig. 5. CTU/CU/PU components (luma CTB/CB/PB, Cb and Cr chroma blocks, and the associated syntax).

Fig. 6. Example of CTU partitioning into CUs.

Fig. 7. CTU and CUs quad-tree organization (depth 0: 64×64; depth 1: 32×32; depth 2: 16×16; depth 3: 8×8).

Fig. 8. Modes for CU/CB partitioning into PUs/PBs (2N×2N, N×2N, 2N×N, N×N, 2N×nD, 2N×nU, nR×2N, nL×2N).

It is important to note that, in order to support a greater level of parallelism, these dependencies can be disabled, albeit at the cost of a greater bitrate.

III. STATE OF THE ART

In [7], an improvement of the WPP scheme is presented, called Overlapped Wavefront (OWF). Two other approaches, the Global Parallel Method (GPM) and the Local Parallel Method (LPM), described in [8], aim to further increase the parallelism of HEVC, thus making it more suitable for implementation on many-core architectures.

In [9], [10], an architecture design for the HEVC ME is presented, which processes a large number of pixels in parallel, albeit consuming a significant number of resources. In [11], several configurations of an ME architecture are analysed, which uses several different processing modules, each specialized in a different PU size. These two contributions consider both the integer and the fractional search, but are not adapted for implementation in multi-core scalable architectures.

Other contributions only focus on the integer search or on the interpolation. Several configurations of a simple SAD pipeline structure are studied in [12], regarding performance versus energy consumption. In [13], an architecture specialized in the HEVC SAD task, for blocks of various sizes, is presented. Other contributions, such as [14]–[16], focus on the interpolation procedure, which is considerably more demanding for HEVC than for previous standards, thus requiring new specialized accelerating hardware. These architectures require a significant amount of memory to store the interpolated pixels. Contrary to these, the sub-pixel refinement accelerator herein presented is able to compare the pixels right after they are interpolated, thus requiring much less memory and allowing larger areas to be interpolated.

IV. PROPOSED TEMPORAL PREDICTION PARALLELIZATION MODEL

A new model for the parallelization of the HEVC temporal prediction is herein presented. This system aims to benefit from various levels of parallelism, while being compatible with different rules - ranging from less to more strict - regarding the dependencies between the various blocks, and also aims to be compatible with several ME algorithms and strategies. The proposed model is depicted in Fig. 9, at its highest abstraction level. It contains multiple Acceleration Units (AUs), which are optimized to handle the interframe prediction of a large number of CTUs in parallel. From the encoder's point of view, the CTUs are distributed amongst the AUs as they become available to process new CTUs.

Then, after being processed, the motion data of the various sub-blocks of each CTU is returned to the encoder, to be used by the other modules.

Fig. 9. Overview of the encoder interaction with the accelerator.

The pixel data communication channel, between the main encoder and the accelerator, is the most demanding in terms of required bandwidth, since it needs to transfer all the CTU and search window pixels to the AUs. It may require transfer rates in the order of GB/s for real-time encoding of high-definition formats at 30 fps, as shown in Table I (for one reference frame only).

TABLE I
BANDWIDTH REQUIREMENTS WITHOUT ANY DATA RE-USAGE (MB/s)

                     Search window size
Resolution     ±32        ±64        ±96
720p           141        281        478
1080p          299        598        1016
1440p          539        1078       1833
2160p          1195       2391       4064
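As a sanity check on how these figures can be derived (our assumption about the table's construction: 8-bit luma samples and a single reference frame), consider 2160p with a ±64 search window. Each 64×64 CTU requires a (64+2·64)×(64+2·64) = 192×192 = 36864-pixel search window plus 64×64 = 4096 current-frame pixels, i.e., 40960 bytes. With 60×34 = 2040 CTUs per frame at 30 fps, this amounts to 40960 × 2040 × 30 ≈ 2.51 GB/s, which matches the 2391 MB/s in the table when 1 MB = 2^20 bytes.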

Therefore, a dedicated and rather efficient communication infrastructure is required to transfer these pixels to the AUs. The pixel data is then stored in a set of dedicated memories inside each acceleration unit (the pixel memories), in order to be locally accessed by the accelerators when processing its CUs and PUs.

In order to avoid stalling the AUs while waiting for new pixel data, an implicit double-buffering scheme is employed, where one CTU is being processed by the AU while the next one is being (or waiting to be) transferred (see Fig. 10).
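A sketch of this double-buffering control, with hypothetical DMA helpers (dma_prefetch_ctu(), dma_wait()) standing in for the actual transfer mechanism:

```c
void dma_prefetch_ctu(int ctu, int slot);  /* hypothetical: start a transfer  */
void dma_wait(int slot);                   /* hypothetical: block until done  */
void run_motion_estimation(int slot);      /* run the AU on the given slot    */

enum { SLOT_A = 0, SLOT_B = 1 };

/* Process n_ctus CTUs, always prefetching the next one into the slot not
 * currently in use, so the accelerators rarely wait for pixel data. */
void process_ctus(int n_ctus)
{
    int slot = SLOT_A;
    dma_prefetch_ctu(0, slot);                  /* start the first transfer   */
    for (int i = 0; i < n_ctus; i++) {
        dma_wait(slot);                         /* pixels of CTU i arrived    */
        if (i + 1 < n_ctus)
            dma_prefetch_ctu(i + 1, slot ^ 1);  /* fetch next into other slot */
        run_motion_estimation(slot);            /* accelerators use this slot */
        slot ^= 1;                              /* swap the two slots         */
    }
}
```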

Fig. 10. Simultaneous processing and data prefetching of CTUs.


Fig. 11. Acceleration Unit sub-modules (controller, pixel memory and accelerators).

Fig. 12. AU local Pixel Memory, with its reference and current frame pixel modules.

Fig. 11 presents an illustration of the several modules that compose the AU. These can be divided into three groups:

Accelerators - responsible for managing the most repetitive and demanding steps of the ME procedure (SAD, interpolation, sub-pixel refinement, or others). In order to decouple their operation from the controller, each accelerator has an input and an output FIFO, for the commands and the results, respectively.

Pixel memory - used to store the pixels of the CTU (and corresponding search window) being processed, as well as the CTU being prefetched. This memory has to supply the multiple accelerators with arrays of pixels, while being able to simultaneously receive new data from the Direct Memory Access (DMA) controller.

Controller - responsible for managing the ME of the various sub-blocks (PUs) of the CTU, by sending commands to the accelerators, interpreting the results and communicating with the main encoder.

A. Distributed Memory Organization

All the CTU and respective search window pixels are temporarily stored in a local pixel memory, so that the accelerators of the AU are able to access them with very low latency and high throughput (Fig. 11). This memory is able to supply pixel data to all the accelerators in every clock cycle.

These pixel memories are composed of two main modules which are accessed simultaneously: one for the reference frame pixels, the other for the current frame pixels (see Fig. 12). This is very advantageous, since most motion search operations involve comparing a set of current frame pixels to a set of reference frame pixels.

In addition, these memories are accessed by the accelerators using a 2D address space (see Figs. 13 and 14). This approach is better suited to the ME procedure, and helps hide the internal memory organization from the accelerators. Besides this, two more signals are used to select between the two CTU slots (one being processed, the other prefetched) and between several search windows.

Three levels of data reusage are adopted by the proposed parallel temporal prediction model:

1) From PU to PU. This is done by transferring the entire search window of the CTU to the pixel memory. This data can then be reused for each of the CTU sub-blocks, including the PUs, thus avoiding new data transfers (see Fig. 15).

Fig. 13. Current frame address space.

Fig. 14. Reference frame address space.

2) From having multiple accelerators simultaneously accessing the same pixel memory, which avoids having to duplicate the pixel data.

3) From CTU to CTU. If their search windows share a portion of the reference frame, this data does not have to be transferred twice (see Fig. 16). This provides a bandwidth reduction as high as 71% (between 256×256 sized search windows).

Fig. 15. Search window, centered on the CTU.

Fig. 16. Search area reuse for neighbour search windows.

A new method for efficiently managing the data reusage between neighbouring search windows is proposed in this work. It involves partitioning the reference frame into 32×32 partitions, aligned with the CTU boundaries (see Fig. 17). When a new search window has to be transferred to the pixel memory, the respective partitions are selected and sent one by one. The reusage is thus achieved by preserving in the local pixel memory all the partitions that will be used for the next CTU and only transferring the new ones, as sketched after Fig. 17. With this scheme, significant reusage rates can be achieved. Nevertheless, each partition occupies a physical memory slot, and a translation mechanism is needed to convert each 2D address to its corresponding physical address, as seen in Fig. 17.

Fig. 17. Partitioning of the reference frame into 32×32 partitions, mapped to physical memory slots.
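A sketch of this reuse policy, assuming a residency flag per 32×32 partition (the helpers below are illustrative; the real bookkeeping is done by the Translation Table described in Section V):

```c
#include <stdbool.h>

bool is_resident(int px, int py);        /* partition already in local memory? */
void dma_send_partition(int px, int py); /* hypothetical DMA transfer          */
void mark_resident(int px, int py);      /* update the bookkeeping             */

/* Transfer only the 32x32 partitions of the new search window, given by its
 * inclusive partition-index bounds, that are not already resident. */
void load_search_window(int px0, int px1, int py0, int py1)
{
    for (int py = py0; py <= py1; py++)
        for (int px = px0; px <= px1; px++)
            if (!is_resident(px, py)) {
                dma_send_partition(px, py);
                mark_resident(px, py);
            }
}
```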

The chosen criterion for distributing the CTUs amongst the AUs plays a major role in how much data can be reused. For example, if each row of CTUs is assigned to a single AU, the pixel data can be reused along the whole row.


V. PROPOSED TEMPORAL PREDICTION ARCHITECTURE

The architecture herein presented (see Fig. 18) aggregates a scalable number of PEs, which materialize the AUs. Internally, each PE contains: a GPP, responsible for implementing the search algorithm and the strategy control; one accelerator for the integer search; another accelerator for the sub-pixel refinement; and the pixel memories. It also includes some registers, shared with the rest of the encoder, as well as a clock cycle counter, which can be used for debugging. Besides the PEs, there are also other modules that can be used to exchange information between the PEs or with the main HEVC encoder.

The other parts of the HEVC encoder (not included in the architecture) could be implemented either in dedicated hardware or on a high-end host GPP. For this work, the latter option was considered. Therefore, for the communication between the proposed accelerating structure and the host GPP, a PCI Express (PCIe) interconnection was chosen (based on the framework presented in [17]). For this communication, four channels are considered:

Memory Mapped Control - used by the host GPP to access the memory mapped addresses of the shared memory, mutexes, shared registers (of each PE) and the instruction memory configuration module.

Interrupt Interface - used by the accelerator to notify the host of certain events.

Output Stream - buffered stream of data to be sent from the PEs to the host memory (with the ME results).

Input Stream - buffered stream of data to be sent from the host memory to the PEs (with the pixel data).

The software control layer depicted in Fig. 18 controls the addresses of the main memory where the input/output stream of data is going to be written to or read from.

A. Pixel Memory Architecture

Since the PEs of the architecture contain two accelerators, this implementation of the pixel memory needs to have two independent read ports. One additional write port is used to receive the new pixel data from the host main memory.

Fig. 19 illustrates the set of signals that are used to address each read port. This memory allows both accelerators to receive new pixel data in each cycle, provided that the read enable and valid signals are asserted on both sides. Two horizontal arrays of 16 pixels are obtained for each accelerator: one with current frame pixel data, the other with reference frame pixel data. This allows the accelerators to benefit from a fine-grained level of parallelism.

Internally, the 2D addresses are mapped to physical addresses of the current and reference pixel data (see Fig. 20). In order to simultaneously provide pixels to the two accelerators, multiple banks are used for the current and reference frame data.

This simultaneous access is guaranteed if two main prerequisites are respected. First, the accelerators should access the pixel memory with the pattern exemplified in Fig. 21. Second, both the current and reference frame pixels have to be split among the banks according to their Y address, as illustrated in Fig. 22.

If a conflict arises (when two accelerators try to access the same bank for reference or current pixels), one of the accelerators has to be put on hold. Fortunately, since the number of banks is large enough (see Fig. 22) and the accelerators access the memory with the described pattern (Fig. 21), it is guaranteed that within a maximum of 3 cycles they will both have access to the current and reference memory banks. After this, as long as the accelerators continue accessing the memory with the same pattern, they will have access to the pixels on every cycle. A sketch of the assumed bank mapping is given below.
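A minimal sketch of a bank mapping consistent with Fig. 22, interleaving groups of four pixel rows across the banks (the four-row granularity is our reading of the figure, not a stated design parameter):

```c
/* Bank selected by the Y address: 2 banks hold the current-frame CTU and
 * 4 banks hold the reference-frame partitions. Two accelerators following
 * the staggered pattern of Fig. 21 thus land in different banks. */
int cur_bank(int y) { return (y >> 2) & 1; }  /* 2 current-frame banks   */
int ref_bank(int y) { return (y >> 2) & 3; }  /* 4 reference-frame banks */
```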

The CTUs and the 32×32 partitions of the reference frames are placed in slots inside the pixel memory banks, as seen in Fig. 23. In addition, the X address of the reference frame may not be aligned with the partition borders, which may require an array to be read from two 32×32 partitions simultaneously. This is due to the nature of ME algorithms, which may require addresses of the search window to be accessed with any (X,Y) offset.

Meanwhile, internally, each physical bank can be implemented using a dual-port memory. One of the ports is only used for writing (by the Pixel DMA controller), while the other is only used for reading (by one of the two accelerators). This allows the pixels to be prefetched while the accelerators are accessing the memory.

In order to translate from the 2D virtual address space into the physical address space, a set of translation modules was developed. For the current frame, the 2D virtual address is directly translated into the physical address, through the concatenation of its values. For the reference frame, the address has to be translated twice, since the corresponding array of pixels may span two 32×32 partitions. To allow this, each bank of the reference memory is internally partitioned into two sub-banks (named columns), which are addressed independently. A Translation Table is used to store the index of the slot where each reference frame partition is stored, thus mapping the 2D address to the respective slot. The tables are configured by the pixel DMA controller when a new packet with the data of a 32×32 partition is received.
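A sketch of the reference-frame side of this translation (slot size, table layout and helper names are assumptions; the actual modules are hardware, and one pixel per physical word is assumed):

```c
#include <stdint.h>

#define PART       32                 /* 32x32 reference-frame partitions  */
#define SLOT_WORDS 0x400              /* physical words per partition slot */

extern uint16_t ttable[];             /* slot index per partition, written
                                         by the pixel DMA controller       */
int partition_index(int x, int y);    /* which partition covers (x, y)     */

/* Physical address of reference pixel (x, y): look the partition up in
 * the Translation Table, then add the offset inside the partition. */
uint32_t ref_phys_addr(int x, int y)
{
    uint32_t base = (uint32_t)ttable[partition_index(x, y)] * SLOT_WORDS;
    return base + (uint32_t)(y & (PART - 1)) * PART + (uint32_t)(x & (PART - 1));
}
```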

B. Integer Search Accelerator

Fig. 24 presents the block diagram of the proposed Integer Search Accelerator. The main purpose of this accelerator is to analyse the cost of a list of MVs, for every step of a given block-matching ME algorithm, and return the index of the MV with the lowest cost (calculated with the SAD metric). Both the input and the output streams are communicated through FIFOs, in order to decouple the GPP control from the accelerator.

The command interpreter module is responsible for receiving the commands from the FIFO, interpreting them, forwarding the corresponding parameters to the other modules of the accelerator and signalling them when a new SAD has to be started. There are two commands: one with the configuration of the PU to be processed, the other ordering the comparison of this PU to one location of a search window (sketched below).
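In C terms, the two commands could be modelled as follows (field names and widths are illustrative, not the actual wire format):

```c
#include <stdint.h>

/* Command 1: configure the PU to be processed. */
typedef struct {
    uint8_t width, height;     /* PU dimensions                        */
    int16_t pu_x, pu_y;        /* PU position inside the CTU           */
} cmd_config_pu;

/* Command 2: compare the configured PU to one search-window location. */
typedef struct {
    int16_t mv_x, mv_y;        /* candidate MV to evaluate             */
    uint8_t sw_sel;            /* which search window to read from     */
} cmd_compare;
```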


Fig. 18. Proposed temporal prediction architecture.

Fig. 19. Signals used in the interface with the pixel memory read ports (the two accelerators access it simultaneously).

Fig. 20. Pixel memory organization.

Fig. 21. Example of a read pattern, as followed by the accelerators.

The second module is responsible for generating the 2D addresses that index the pixel memory. In each cycle, one horizontal array of 16 pixels is read from the current frame and another from the reference frame; both are forwarded to the next module.

The SAD pipeline module is responsible for computing, in a pipelined fashion, the SAD of the two arrays of pixels received from the previous module.

Finally, the last module is responsible for accumulating the partial SADs, in order to obtain the full SAD value. After obtaining the SAD, it compares it to the previously calculated one, which refers to a different position in the frame, in order to determine which one has the lowest cost. Therefore, when requested by the GPP, the accelerator is able to return to the output FIFO a reference to the best position, along with its corresponding SAD cost.

C. Sub-pixel Refinement Accelerator

The block diagram of the proposed sub-pixel refinement accelerator is given in Fig. 25.

Given a PU and a certain location of the search window, the purpose of this accelerator is to refine it to quarter-pixel precision.

Fig. 22. Distribution of the pixels from the CTU and search window partitions among the memory banks, for the architecture with two accelerators.

Fig. 23. Translation of a 2D virtual address into the corresponding physical address.

Two operations are allowed: half-pixel refinement and quarter-pixel refinement (see Fig. 26). For each one, the accelerator compares the SAD of the current location with the SADs of the 8 sub-pixel positions around it, and returns the index of the best one, along with the corresponding SAD value.

The architecture of this accelerator is intended to be scalable. In order to demonstrate this scalability, two pixels of the PU are processed in parallel in each pipeline stage.

The command interpreter module has the task of receiving the commands from the GPP, interpreting them, forwarding the parameters to the other modules of the accelerator and signalling them when a new sub-pixel refinement is to be started, just like in the integer search accelerator.

The 2D address generator module is responsible for generating the 2D addresses used to index the pixel memory, and for forwarding the received arrays of pixel values to the next module.


Fig. 24. Integer accelerator high-level modules.

Fig. 25. Block diagram of the sub-pixel refinement accelerator.

Finally, the core of the sub-pixel refinement accelerator is illustrated in Fig. 27. This last pipelined module, with the help of a simple controller, is able to interpolate the sub-pixels (horizontally and vertically), compare those values to the current frame pixels, accumulate the results, and return a reference to the sub-pixel location with the lowest SAD cost.

D. GPP

The GPP module is used to implement the controller of the AU. It is responsible for managing the quad-tree partitioning of the CTU into CUs and PUs, as well as for controlling the execution of the ME of each of the PUs by the accelerators. Furthermore, it is also responsible for communicating with the host GPP and the other PEs, through the several shared modules described in this section, to synchronize the several operations of the ME procedure. For this work, this GPP was implemented by adopting the MB-Lite processor architecture [18].

E. Other Shared Modules

The Pixels DMA controller is responsible for receiving a stream of pixel data from the host GPP and forwarding it to the destination position in the pixel memories. It receives the stream of data in packets that contain the data of one CTU of the current frame or one 32×32 partition of the reference frame, along with a header which includes the destination address of the packet, its size, and the Translation Table configuration data.

The Results Aggregator module offers an individual interface to each PE, so that each can write its results without being constrained by the other PEs. These results are sent back to the host main memory through the stream interface.

The Program Memory Configuration module allows the host GPP to configure the internal program memory of each PE. This functionality is essential, so that several programs can be implemented and tested on the PEs' GPPs without having to reconfigure the hardware.

Fig. 26. Half-pixel (left) and quarter-pixel (right) refinement operations implemented.

Fig. 27. Sub-pixel refinement accelerator engine, with various pipeline stages.

Finally, the Interrupt Control module allows the PEs to issue interrupts that will be interpreted by the host GPP.

F. Motion Estimation Programming

The ME procedure is controlled by a program implemented on the MB-Lite GPP, which is responsible for managing the modules inside each PE, by sending commands to the accelerators and interpreting the results. Furthermore, this program is also responsible for managing the CTU partitioning into CUs and PUs, asking the host GPP for new CTUs and returning the ME results. Naturally, the programming of all these procedures should be easily adaptable to multiple motion search algorithms and strategies. Therefore, a balance had to be achieved between the offered performance and the flexibility to implement multiple strategies and algorithms.

Two important structures are used in this program (a sketch follows):

PUnit - used for coordinating the processing steps of one single PU. It contains the PU data, as well as a pointer to a function which implements a step of the ME procedure used to process the PU. By dynamically changing this pointer, multiple steps of the ME can be implemented.

PUnitManager - used for managing the CTU partitioning scheme and providing the PUs which are ready to be processed by the accelerators.
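A sketch of these two structures in C, following the description above (all fields beyond those named in the text are assumptions):

```c
#include <stdint.h>

struct PUnit;
typedef void (*me_step_fn)(struct PUnit *pu);  /* one step of the ME flow */

typedef struct PUnit {
    int x, y, width, height;   /* PU geometry inside the CTU              */
    int best_mv_x, best_mv_y;  /* best motion vector found so far         */
    uint32_t best_cost;        /* corresponding SAD cost                  */
    me_step_fn step;           /* re-pointed to advance to the next stage */
} PUnit;

typedef struct {
    PUnit pus[85];             /* the 85 2Nx2N PUs of one CTU quad-tree   */
    int n_ready;               /* PUs whose dependencies are satisfied    */
} PUnitManager;
```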


VI. PROTOTYPING AND EXPERIMENTAL EVALUATION

The complete system was prototyped on a Xilinx Virtex-7 FPGA (XC7VX485T), connected through an 8x PCIe Gen2 interface to a personal computer equipped with an Intel Core i7 3770K processor running at 3.5 GHz. In addition, the PCIe framework developed by [17] was used to connect the interfaces of the developed accelerator to the control program running on the Intel Core i7.

The total amount of resources used by the whole temporal prediction subsystem is presented in Table II. As can be seen, the most used resources are the BRAMs (due to the local pixel memories), followed by the LUTs.

The final architecture was synthesized, mapped, and placed & routed using multiple timing constraints for different clock domains. The adopted PCIe framework [17] has its own dedicated clock domain, which requires 250 MHz to work properly. For the hardware designed in this work, the place & route step was conducted with a clock constraint of 100 MHz. This was required because the adopted PCIe framework imposes several constraints regarding the exact placement of its modules. This, combined with the sheer amount of resources used by the 16 PEs of the accelerator, resulted in a more difficult routing optimization process, which did not meet the timing constraints for higher frequencies.

TABLE II
RESOURCE USAGE OF THE WHOLE TEMPORAL PREDICTION SUBSYSTEM.

                     Whole accelerator   PCIe framework   Total used
                     (16 PEs)            [17]             (16 PEs)
Registers            141201 (23.3%)      20886 (3.4%)     168614 (27.8%)
LUTs                 208419 (68.6%)      16859 (5.6%)     238449 (78.5%)
BRAM36               932 (90.5%)         37 (3.6%)        961 (93.3%)
Max. Freq.           157 MHz             248 MHz          140/248 MHz (*)
After Place&Route                                         100/250 MHz (*)

(*) Different clock domains: the PCIe framework requires 250 MHz; the accelerator runs at 100 MHz.

A. Distributed Memory Architecture Analysis

Fig. 28. Latency of each CTU and search window data transfer to the local pixel memory of each PE (with data reusage along each row), for 128×128, 192×192 and 256×256 search windows at 1440p and 2160p.

Fig. 28 illustrates the average latency of transferring the pixels of each CTU, and respective search window, from the host GPP to the accelerator, depending on the number of PEs.

Fig. 29. Maximum number of SAD operations allowed per PU (30 fps encoding, by 16 PEs at 100 MHz); the minimum required by the TSS algorithm is 25.

In the background of these plots, a comparison is presented with the amount of time that is available to process each CTU (at 30 fps), for the considered number of PEs and resolution. If the resulting latency is below the amount of time available to process the CTU, then the prefetching can be masked by the processing time, while supporting real-time encoding.

As can be seen, for 128×128 and 192×192 search windows, the distributed memory scheme was able to provide CTUs at a rate compatible with 2160p encoding at 30 fps. These are important results, which demonstrate that this distributed memory architecture is compatible with the most demanding requirements of current video coding systems.

B. Accelerators Performance Evaluation

The number of cycles required to process one SAD, by each integer search accelerator, is presented in Table III for the 2N×2N mode (although all the other PU sizes are supported). These values were obtained from simulation.

TABLE III
PERFORMANCE OF AN INDIVIDUAL INTEGER SEARCH ACCELERATOR.

PU size     Cycles to process 1 SAD
8×8         15
16×16       23
32×32       71
64×64       263

The analysis presented in Fig. 29 regards the performance of the integer search accelerators when all 16 PEs are working in parallel. In this analysis, the time available to process each CTU is limited by the constraints of a real-time encoding scenario (30 fps). Therefore, it gives the maximum number of SAD comparisons which can be executed, per PU, in order to meet these requirements (for PUs of the 2N×2N mode only). Early termination refers to the choice (by the controller) of not partitioning one CU of size N×N into its sub-CUs of size N/2×N/2, thus saving the time required to analyse those sub-CUs.

As can be seen in Fig. 29, the considered configuration allows several SADs to be computed for each PU. As expected, it is possible to calculate more SADs for lower resolution sizes, because their smaller number of CTUs results in more available time to process each CTU and its sub-blocks. The same is observed for higher early termination rates, which correspond to a smaller number of PUs, thus resulting in extra available time to process more SADs for each of the PUs. As illustrated, it is possible to encode 2160p in real-time with the TSS algorithm, provided that the number of PUs to analyse is reduced (for example, by having an average early termination rate of at least 30%). The cycle budget underlying this analysis can be estimated as sketched below.
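A back-of-the-envelope model of the per-CTU cycle budget, under our own assumptions rather than the exact methodology of the evaluation:

```c
/* Clock cycles each PE may spend on one CTU while sustaining real-time
 * encoding: the frame period, times the number of PEs, divided by the
 * number of CTUs per frame. */
long cycles_per_ctu(long f_clk_hz, int fps, long ctus_per_frame, int n_pes)
{
    return (long)((double)f_clk_hz / fps * n_pes / ctus_per_frame);
}

/* Example: 100 MHz, 30 fps, 2040 CTUs (2160p with 64x64 CTUs), 16 PEs
 * -> roughly 26 000 cycles per CTU, to be divided among its PUs. */
```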

A similar analysis was conducted for the sub-pixel refinement accelerator. The number of clock cycles required to execute one step of the sub-pixel refinement (half or quarter-precision search), for PUs of the 2N×2N mode, is presented in Table IV.

TABLE IV
PERFORMANCE OF EACH INDIVIDUAL SUB-PIXEL REFINEMENT ACCELERATOR.

PU size     Cycles to complete one refinement step
8×8         74
16×16       204
32×32       652
64×64       2318

Following the same criteria that were considered for the integer search accelerator analysis, the chart presented in Fig. 30 depicts the maximum number of sub-pixel refinement operations that can be executed, per PU, in order to support real-time encoding (30 fps) of various formats.

Fig. 30. Maximum number of sub-pixel refinement operations allowed per PU (30 fps encoding, by 16 PEs at 100 MHz); the minimum required for the sub-pixel refinement is 2.

As can be seen, at least 2 sub-pixel refinement operations are possible for each of the analysed configurations, which shows that these accelerators are able to execute the sub-pixel refinement step even when considering 2160p real-time encoding with all 2N×2N PUs being analysed.

C. Overall Performance Analysis

In this subsection, the implementation of two algorithms is analysed: full-search block matching and TSS. In both algorithms, only the 2N×2N PU mode was analysed. This means that, for each CTU, a total of 85 PUs need to be processed. In order to ease the implementation, no dependencies between neighbouring PUs were considered for these algorithms.

The full-search algorithm was tested with a range of ±16 pixels in both the X and Y axes. This means that each PU is exhaustively compared to 1024 positions of the reference frame.

Fig. 31. Analysis of the scalability of the full search ME algorithm performance.

Figure 31 shows the performance of this algorithm, in terms of CTUs processed per second, for different numbers of PEs working in parallel. As expected, the increase in performance is linear with the number of PEs. This is due to two factors. First, there are no dependencies between neighbouring CTUs. Second, even with 16 PEs, the time taken to process each CTU is significantly longer than any of the prefetching times measured in Fig. 28. This means that, for each new CTU, the PE does not have to wait for the new pixel data, and can start processing it immediately.

The TSS algorithm requires each PU to be compared to 8 positions of the search window in each of its three steps. This, added to the location at the origin, gives a total of 25 SADs for each PU.

Fig. 32. Analysis of the scalability of the three step search ME algorithm performance.

Figure 32 shows the attained performance of this algorithm. In this case, the increase in performance also varies linearly with the number of PEs, for the same reasons as in the full-search algorithm. With 16 PEs, this architecture is able to execute the TSS ME algorithm to encode the 720p resolution in real-time. The reason for not being able to process more CTUs is the controller implemented on the MB-Lite GPP. Higher resolutions might be achieved by increasing the number of PEs, increasing the frequency, reducing the number of analysed PUs, or implementing a less demanding ME algorithm. Alternatively, an implementation on an ASIC could be considered, instead of the adopted FPGA device, which can achieve speedups as high as 4×, according to [19]. This, together with a better GPP, would make the real-time encoding of the 2160p resolution attainable.

VII. CONCLUSIONS AND FUTURE WORK

In this thesis, a distributed prediction scheme for the interframe prediction step of the HEVC standard was presented, implemented and evaluated. Several aspects were considered, such as the support for the encoding of high definition formats in real-time, the compatibility with multiple ME algorithms and strategies, the scalability of the architecture and the support for dedicated accelerators.

The proposed scheme involved the usage of a scalable number of AUs working in parallel. These conceptual AUs were implemented with a module denoted as PE, able to execute the ME procedure of different CTUs. These PEs contain highly optimized accelerators, which were demonstrated to be compatible with real-time encoding of 2160p at 30 fps, even when considering that the prototyped implementation is limited to an operating frequency of 100 MHz. Furthermore, the ME processing steps inside each PE are controlled by a local GPP. A program for processing the ME was proposed, which can be easily adapted to implement different ME algorithms and strategies.

In addition, to efficiently provide the pixel data to the accelerators, an optimized distributed memory mechanism was also proposed. This mechanism employs data prefetching and data reuse techniques, which allow the architecture to greatly reduce the impact that the data transfers have on the overall performance. The conducted evaluation showed that this distributed pixel memory mechanism is able to transfer the pixel data from the main encoder (at the host computer) to the local pixel memories at a rate compatible with the encoding of the 2160p resolution (4K-UHD) at 30 fps.

Finally, in order to demonstrate the compatibility of the proposed architecture with different algorithms and search strategies, two ME algorithms were implemented and evaluated: full-search and TSS.

A. Future Work

In order to improve the overall performance, some further work can be considered.

First, the computational performance of the GPP that is embedded in each PE can be improved, in order to match the high performance of the accelerators and of the distributed pixel memory scheme. Second, the whole architecture may be prototyped in an ASIC technology, to provide a significant increase of the operating clock frequency, which would give the PEs and the accelerators more time to process the CTUs. Third, more complex ME algorithms may be implemented, in particular some of the recently proposed ones, such as the EPZS [5] or the TZS [6], which provide a very good compression quality. Finally, the proposed architecture for interframe prediction shall be incorporated within a complete HEVC encoder, in order to provide a final product which effectively accelerates its encoding procedure.

REFERENCES

[1] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.

[2] F. Bossen, B. Bross, K. Sühring, and D. Flynn, "HEVC complexity and implementation analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1685–1696, 2012.

[3] A. Barjatya, "Block matching algorithms for motion estimation," IEEE Transactions on Evolutionary Computation, vol. 8, no. 3, pp. 225–239, 2004.

[4] S. Zhu and K.-K. Ma, "A new diamond search algorithm for fast block-matching motion estimation," IEEE Transactions on Image Processing, vol. 9, no. 2, pp. 287–290, 2000.

[5] A. M. Tourapis, "Enhanced predictive zonal search for single and multiple frame motion estimation," in Electronic Imaging 2002. International Society for Optics and Photonics, 2002, pp. 1069–1079.

[6] P. Nalluri, L. N. Alves, and A. Navarro, "Improvements to TZ search motion estimation algorithm for multiview video coding," in 19th International Conference on Systems, Signals and Image Processing (IWSSIP). IEEE, 2012, pp. 388–391.

[7] C. C. Chi, M. Alvarez-Mesa, B. Juurlink, G. Clare, F. Henry, S. Pateux, and T. Schierl, "Parallel scalability and efficiency of HEVC parallelization approaches," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1827–1838, 2012.

[8] C. Yan, Y. Zhang, J. Xu, F. Dai, J. Zhang, Q. Dai, and F. Wu, "Efficient parallel framework for HEVC motion estimation on many-core processors," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 12, pp. 2077–2089, 2014.

[9] G. Pastuszak and M. Trochimiuk, "Algorithm and architecture design of the motion estimation for the H.265/HEVC 4K-UHD encoder," Journal of Real-Time Image Processing, pp. 1–13, 2015.

[10] G. Pastuszak and M. Trochimiuk, "Architecture design of the high-throughput compensator and interpolator for the H.265/HEVC encoder," Journal of Real-Time Image Processing, pp. 1–11, 2014.

[11] M. E. Sinangil, V. Sze, M. Zhou, and A. P. Chandrakasan, "Cost and coding efficient motion estimation design considerations for high efficiency video coding (HEVC) standard," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 1017–1028, 2013.

[12] F. Walter and S. Bampi, "Synthesis and comparison of low-power architectures for SAD calculation," in Proceedings of the 26th South Symposium on Microelectronics, 2011, pp. 45–48.

[13] P. Nalluri, L. N. Alves, and A. Navarro, "A novel SAD architecture for variable block size motion estimation in HEVC video coding," in 2013 International Symposium on System on Chip (SoC). IEEE, 2013, pp. 1–4.

[14] Z. Guo, D. Zhou, and S. Goto, "An optimized MC interpolation architecture for HEVC," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 1117–1120.

[15] E. Kalali and I. Hamzaoglu, "A low energy HEVC sub-pixel interpolation hardware," in IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 1218–1222.

[16] G. Pastuszak and M. Trochimiuk, "Architecture design and efficiency evaluation for the high-throughput interpolation in the HEVC encoder," in Euromicro Conference on Digital System Design (DSD). IEEE, 2013, pp. 423–428.

[17] J. Gong, T. Wang, J. Chen, H. Wu, F. Ye, S. Lu, and J. Cong, "An efficient and flexible host-FPGA PCIe communication library," in 24th International Conference on Field Programmable Logic and Applications (FPL), 2014, pp. 1–6.

[18] T. Kranenburg and R. van Leuken, "MB-LITE: A robust, light-weight soft-core implementation of the MicroBlaze architecture," in Proceedings of the Conference on Design, Automation and Test in Europe. European Design and Automation Association, 2010, pp. 997–1000.

[19] N. Neves, N. Sebastião, D. Matos, P. Tomás, P. Flores, and N. Roma, "Multicore SIMD ASIP for next generation sequencing and alignment biochip platforms," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 7, pp. 1287–1300, 2015.