Throughput Exploration and Optimization of a Consumer Camera Interface for a Reconfigurable Platform By: Floris Driessen ([email protected]) 11-12-2013

Introduction

• Video applications on embedded platforms

• Use of accelerators

  − Faster

  − Energy efficiency

• USB camera

Platform of Interest - ZYNQ

• Zedboard by Digilent

• Xilinx Zynq platform

− Dual core ARM Cortex A9

− Programmable logic

• 512 MB RAM

• USB connectivity

• HDMI output

• USB camera

Naïve implementation

• Software

1. Read camera frame

2. Copy frame to DMA region

3. Perform HW accelerated operation (Sobel)

4. Copy result from DMA region

5. Show result

• A separate DMA region is needed due to the lack of DMA drivers

[Figure: Zynq platform block diagram with ARM Core 0, ARM Core 1, USB, Linux RAM, DMA RAM, and programmable logic]

Bottleneck Study

• Performance limit

• Converting the format

  − Camera output to accelerator input

• Copying from/to DMA region

  − Mmap: not cached

• Frame capturing

Possible improvements

• Exploiting scratchpad

  − A frame would not fit

• DMA driver support

  − Not feasible within time frame of project

• Optimize the current implementation

  − Copying data

  − Converting format

  − Capturing camera frame

Format conversion

• Naïve implementation

  − Combined conversion and copy

  − Writing small chunks to mmapped memory (slow)

• Split conversion and copy

• OpenCV mixChannels

• NEON interleaving

  − ARM SIMD

  − Next slide

Implementation   Convert + copy [s]     Speed-up
Naïve            1.95                   1x
Split            0.28 + 0.04 = 0.32     6.1x
OpenCV           0.05 + 0.04 = 0.09     21.7x
NEON             0.04                   50.6x

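Memory layout (x = fill byte):

Address  RGB24  RGB32
0x00     R0     R0
0x01     G0     G0
0x02     B0     B0
0x03     R1     x
0x04     G1     R1
0x05     B1     G1
0x06     R2     B1
0x07     G2     x
..       ..     ..

NEON registers after the load:

d0: R7 R6 R5 R4 R3 R2 R1 R0
d1: G7 G6 G5 G4 G3 G2 G1 G0
d2: B7 B6 B5 B4 B3 B2 B1 B0
d3: x  x  x  x  x  x  x  x

vld3.8 {d0-d2} [#0]   (load with de-interleave)
vst4.8 {d0-d3} [#0]   (store with interleave)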

NEON RGB24 to RGB32 conversion example

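// Requires an ARM target with NEON (GCC-style inline assembly).
void __attribute__ ((noinline))
neonRGBtoRGBA_gas(const unsigned char* src, unsigned char* dst, int numPix)
{
    // Number of 8-pixel blocks; a remainder (numPix % 8) is not converted here.
    int blocks = numPix >> 3;
    if (blocks <= 0)
        return;

    asm volatile(
        // load alpha channel value
        "   vmov.u8 d3, #0xff\n"
        "1:\n"
        // load 8 RGB pixels with de-interleave
        "   vld3.8 {d0,d1,d2}, [%[src]]!\n"
        // preload next values
        "   pld [%[src], #40]\n"
        "   pld [%[src], #48]\n"
        "   pld [%[src], #56]\n"
        // decrement loop counter
        "   subs %[n], %[n], #1\n"
        // optional: swap R and B channels
        //" vswp d0, d2\n"
        // store as 4 x 8-bit interleaved values
        "   vst4.8 {d0-d3}, [%[dst]]!\n"
        // loop while blocks remain
        "   bgt 1b\n"
        : [src] "+r" (src), [dst] "+r" (dst), [n] "+r" (blocks)
        :
        : "d0", "d1", "d2", "d3", "cc", "memory"
    );
}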
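
A scalar reference version can help check the NEON routine and convert any pixels left over when the pixel count is not a multiple of 8. The sketch below is not from the slides: the names rgb24_to_rgb32_ref and neon_block_pixels are hypothetical, and the alpha value is passed in rather than fixed at 0xff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: expand packed 3-byte pixels to 4-byte
 * pixels, writing `alpha` into every fourth byte. Channel order is kept. */
void rgb24_to_rgb32_ref(const uint8_t *src, uint8_t *dst,
                        size_t num_pix, uint8_t alpha)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < num_pix; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = alpha;
    }
}

/* Hypothetical helper: number of pixels covered by whole 8-pixel blocks,
 * i.e. the part a vld3.8/vst4.8 loop would handle. */
size_t neon_block_pixels(size_t num_pix)
{
    return num_pix & ~(size_t)7;
}
```

One possible use is to run the NEON routine on the first neon_block_pixels(n) pixels and pass the remaining pixels, starting at byte offset 3 * neon_block_pixels(n) in the source and 4 * neon_block_pixels(n) in the destination, to the scalar function.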

Frame copy from/to DMA RAM

• OpenCV (as used in the naïve implementation)

• Manual copy (loop over virtual contiguous memory)

• Memcpy from C library

• NEON accelerated copy
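
A comparison of copy routines like the one listed above could be set up roughly as follows. This is a minimal sketch, not the measurement code behind the slides: manual_copy and time_copy_ms are hypothetical names, the manual loop is a plain byte loop, and timing uses the standard clock() function, which reports CPU time rather than wall-clock time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Same signature as memcpy, so both can be passed to the timing helper. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical manual copy: byte-wise loop over a contiguous buffer. */
void *manual_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; ++i)
        d[i] = s[i];
    return dst;
}

/* Average CPU time in milliseconds for one call of `fn`,
 * measured over `repeats` back-to-back copies. */
double time_copy_ms(copy_fn fn, void *dst, const void *src,
                    size_t n, int repeats)
{
    assert(repeats > 0);
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r)
        fn(dst, src, n);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / repeats;
}
```

To reproduce the three cases in the chart, the source and destination pointers would point into ordinary heap memory or into the mmapped DMA region, for example time_copy_ms(memcpy, dma_ptr, frame, frame_bytes, 10), where dma_ptr, frame, and frame_bytes are placeholders for buffers set up elsewhere.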

[Figure: bar chart of execution time [ms] (axis 0 to 70) for copies Linux RAM → Linux RAM, Linux RAM → DMA RAM, and DMA RAM → Linux RAM, comparing OpenCV, Manual, Memcpy, and Neon copy. Data labels in extraction order: 724, 999, 642, 9, 55, 36, 7, 44, 22, 16, 46, 16]

Camera capture

• OpenCV

  − Always BGR24

• Video4Linux

  − Different formats

  − Not a big improvement

[Figure: horizontal bar chart of frame delay [s] (axis 0.00 to 0.12) for V4L2 RGB24, V4L2 BGR24, V4L2 MJPEG, V4L2 YUYV, and OpenCV BGR24. Data labels in extraction order: 0.07, 0.11, 0.04, 0.06, 0.06]

Results

• Multiple configurations

• Combined the conversion and copy (NEON accelerated)

• 1: Split convert and copy

• 2: OpenCV mixChannels

• 3: Combined mixChannels to external

• 4: No convert back + V4L capture

• 5: NEON copy

• 6: Combined NEON convert and NEON copy

[Figure: stacked bar chart of execution time per frame [s] (axis 0.0 to 0.9) for application configurations 1 to 6, with segments Get frame, Convert and copy, Sobel calculation, and Copy back and convert. Data labels in extraction order: 0.73, 0.23, 0.78, 0.17, 0.15, 0.13]

Contributions

• Framework for combining USB camera with accelerators in programmable logic

• Multiple format conversion routines

• NEON

• NEON copying routines

• Video4Linux frame capture

[Figure: processing pipeline. Stage labels in extraction order: Capture frame; Convert format; Copy to DMA RAM; Execute accelerator; Copy result back; Process result; Convert format]

Conclusion and Future work

• Huge improvement: 32x (from 0.2 to 7.7 FPS)

• One ARM core is still unoccupied and available for processing data after the accelerator

• Make camera frame buffer available to DMA

• DMA buffer sharing

− Linux kernel 3.8

• Improve frame capture

• Takes more than half of the time

• Latency of ~4 frames

• Driver from manufacturer

• Consider other cameras
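
The frame-rate figures can be cross-checked with simple arithmetic. The sketch below uses hypothetical helper names and assumes that the 0.13 s label in the results chart is the per-frame time of the final configuration. Under that assumption, 1 / 0.13 s gives about 7.7 FPS, matching the conclusion. Dividing the rounded rates 7.7 and 0.2 gives about 38x rather than 32x, so the 32x figure presumably comes from unrounded measurements.

```c
/* Hypothetical helpers for converting between per-frame time and
 * frame rate, and for computing a speed-up from two frame rates. */
double fps_from_frame_time(double seconds_per_frame)
{
    return 1.0 / seconds_per_frame;
}

double speedup(double fps_before, double fps_after)
{
    return fps_after / fps_before;
}
```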
