Using NEON for Parallel Data Processing

Zynq-7000 Hardware Architecture

Speaker: Leon Qin

Title: Processor Specialist

Date: Oct, 2012

© Copyright 2012 Xilinx


Zynq-7000 Block Diagram

[Figure: Zynq-7000 block diagram. Processing System: two Cortex-A9 MPCore cores, each with a NEON/FPU engine and 32/32 KB I/D caches; Snoop Control Unit (SCU); 512 KB L2 cache; 256 KB on-chip memory; timer counters; General Interrupt Controller; DMA; configuration; ARM CoreSight multi-core and trace debug; static memory controller (Quad-SPI, NAND, NOR); dynamic memory controller (DDR3, DDR2, LPDDR2); I/O MUX (MIO) with 2x GigE with DMA, 2x USB with DMA, 2x SDIO with DMA, 2x SPI, 2x I2C, 2x CAN, 2x UART and GPIO; AMBA switches. Programmable Logic: system gates, DSP, RAM, XADC, PCIe, multi-standard I/Os (3.3 V and high-speed 1.8 V), multi-gigabit transceivers. Interfaces between the two: S_AXI_HP0 to S_AXI_HP3, S_AXI_ACP, M_AXI_GP0/1, S_AXI_GP0/1, EMIO.]

ARM Architecture Evolution

[Figure: key technology additions by architecture generation. Axis labels: V5, V6, V7 A&R, V7 M, with ARM9, ARM10 and ARM11 marked. Technologies shown: Jazelle, VFPv2, SIMD, Thumb-2, TrustZone, VFPv3, NEON Advanced SIMD, Thumb-EE, Thumb-2 only. Annotations: "Improved media and DSP" and "Execution environments: improved memory use".]

Why NEON?

General-purpose SIMD processing is useful for many applications

Supports the widest range of multimedia codecs used for internet applications

– Many soft codec standards: MPEG-4, H.264, On2 VP6/7/8, Real, AVS, …

– Ideal solution for normal-size "internet streaming" decode of various formats

Fewer cycles needed

– NEON gives a 60-150% performance boost on complex video codecs

– Simple DSP algorithms can show a larger performance boost (4x-8x)

– A balance of computation and memory access is required

– The processor can sleep sooner, giving an overall dynamic power saving

Why NEON?

NEON is a mature advanced SIMD technology.

– SIMD exists on many 32-bit architectures

• PowerPC has AltiVec, while x86 has MMX/SSE/AVX

– Can significantly accelerate parallelizable, repetitive operations on large data sets

Beneficial to many DSP or multimedia algorithms

– Clean, orthogonal vector architecture, applicable to a wide range of data-intensive computation

– Audio, video, and image processing codecs

– Not just for codecs: also applicable to 2D and 3D graphics, etc.

– Color-space conversion

– Physics simulations

– Error correction (such as Reed-Solomon codecs, CRCs), elliptic curve cryptography, etc.

NEON vs. DSP/FPGA offload

NEON advantages

– Easy programming and debug

– Fully coherent with the CPU, no cache maintenance operations

– Part of the ARM architecture: no hardware or software integration required

– Ecosystem support off the shelf, no porting required

DSP/FPGA advantages

– Runs in parallel with the CPU, few CPU cycles required

– More "realtime": no OS/cache variability

– Fixed function or limited codec support

– Potentially higher performance (e.g. 1080p Full HD video)

NEON Agenda

NEON Hardware overview

NEON Instruction set overview

NEON Software Support

NEON improves performance


What is NEON?

NEON is a wide SIMD data processing architecture

– Extension of the ARM instruction set

– 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)

NEON Instructions perform “Packed SIMD” processing

– Registers are considered as vectors of elements of the same data type

– Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit integers, and single-precision float

– Instructions perform the same operation in all lanes

[Figure: two source registers Dn and Dm, each split into elements; the same operation is applied lane by lane, and each result element is written to the matching lane of the destination register Dd.]

Example SIMD instruction – Vector ADD

Larger register size

Register split into equal-size elements of any type

Operation performed on the same element of each register

VADD.U16 D2, D1, D0

      0x1001   0x1234   0x7      0xAB      (source register)
  +   0xFF0    0x5678   0xFFF8   0xCD      (source register)
  =   0x1FF1   0x68AC   0xFFFF   0x178     (D2)

Lane boundaries at bits 63, 47, 31, 15 and 0.
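The lane-wise behaviour on this slide can be sketched in portable C. This is only a scalar model for illustration, not NEON code: the function name `vadd_u16_model` is hypothetical, and the order in which the slide's values are placed into the arrays is an assumption, since the figure does not survive well enough to show which lane is lane 0.

```c
#include <stdint.h>

/* Scalar model of VADD.U16 Dd, Dn, Dm (illustrative name, not an ARM API).
   A 64-bit D register is treated as four independent 16-bit lanes; each
   lane wraps modulo 2^16 and no carry crosses a lane boundary. */
void vadd_u16_model(uint16_t d[4], const uint16_t n[4], const uint16_t m[4])
{
    for (int lane = 0; lane < 4; lane++)
        d[lane] = (uint16_t)(n[lane] + m[lane]);
}
```

Feeding in the slide's two source rows reproduces its result row, which is a quick way to check that each column is added independently.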


Neon Data Types

NEON natively supports a set of common data types

– Integer and Fixed-Point: 8-bit, 16-bit, 32-bit and 64-bit

– 32-bit Single-precision Floating-point; 8 and 16-bit polynomial

Data types are represented using a bit-size and format letter

VADD.U16 D2, D1, D0

Not all data types available in all sizes

Size     Signed   Unsigned   Polynomial   Float   Integer   Untyped
8-bit    .S8      .U8        .P8                  .I8       .8
16-bit   .S16     .U16       .P16                 .I16      .16
32-bit   .S32     .U32                    .F32    .I32      .32
64-bit   .S64     .U64                            .I64      .64

8/16-bit: signed, unsigned integers; polynomials

32-bit: signed, unsigned integers; floats

64-bit: signed, unsigned integers


NEON Registers

NEON provides a 256-byte register file

– Distinct from the core registers

– Extension to the VFPv2 register file (VFPv3)

Two explicitly aliased views

– 32 x 64-bit registers (D0-D31)

– 16 x 128-bit registers (Q0-Q15)

Enables register trade-off

– Vector length

– Available registers

Also uses the summary flags in the VFP FPSCR

– Adds a QC integer saturation summary flag

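– No per-lane flags, so "carry" is handled using a wider result (16-bit + 16-bit -> 32-bit)

[Figure: the two aliased views of the register file. Q0 overlays D0 and D1, Q1 overlays D2 and D3, and so on up to Q15, which overlays D30 and D31.]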
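The aliasing between the two register views can be pictured with a small C model. This is a sketch for illustration only: the type `neon_regfile_model` and the helper `q_low_d_index` are hypothetical names, and the union simply mimics the fact that Qn shares storage with D(2n) and D(2n+1).

```c
#include <stdint.h>

/* Illustrative model of the 256-byte register file: the D and Q views
   are two ways of addressing the same storage. */
typedef union {
    uint64_t d[32];          /* D0..D31, 64 bits each */
    struct {
        uint64_t lo;         /* aliases D(2n)   */
        uint64_t hi;         /* aliases D(2n+1) */
    } q[16];                 /* Q0..Q15, 128 bits each */
} neon_regfile_model;

/* Index of the D register that holds the low half of Qn. */
unsigned q_low_d_index(unsigned qn)
{
    return 2u * qn;
}
```

The trade-off named on the slide follows directly: code that uses Q registers gets longer vectors but has only 16 of them, while code that uses D registers has 32 shorter ones.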


Vectors and Scalars

Registers hold one or more elements of the same data type

– Vn can be used to reference either a 64-bit Dn or 128-bit Qn register

– A register, data type combination describes a vector of elements

Some instructions can reference individual scalar elements

– Scalar elements are referenced using the array notation Vn[x]

Array ordering is always from the least significant bit

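[Figure: example vector layouts. A 64-bit Dn register (bits 63 to 0) and a 128-bit Qn register (bits 127 to 0) are shown divided into I64, S32, S8 and F32 elements. For Q0 holding four F32 elements, the scalars are labelled Q0[3], Q0[2], Q0[1], Q0[0], with Q0[0] at the least significant end.]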


Instruction Syntax

V{<mod>}<op>{<shape>}{<cond>}{.<dt>}{<dest>}, src1, src2

<mod> - Instruction Modifiers

Q indicates the operation uses saturating arithmetic (e.g. VQADD)

H indicates the operation halves the result (e.g. VHADD)

D indicates the operation doubles the result (e.g. VQDMULL)

R indicates the operation performs rounding (e.g. VRHADD)

<op> - Instruction Operation (e.g. ADD, MUL, MLA, MAX, SHR, SHL, MOV)

<shape> - Shape

L – The result is double the width of both operands

W – The result and first operand are double the width of the last operand

N – The result is half the width of both operands

<cond> - Conditional, used with IT instruction

<.dt> - Data type

<dest> - Destination, <src1> - Source operand 1, <src2> - Source operand 2

Instruction Modifiers and Shapes

Q Modifier

– When a saturating instruction actually saturates a result, it sets the "cumulative saturation" flag (QC bit) in the Floating-point Status and Control Register (FPSCR)

– The QC flag is sticky

• Use VMRS and VMSR instructions to read and to clear the flag

H Modifier

– Halves the result

– Can only be used on addition and subtraction instructions

• VHADD, VHSUB and VRHADD

D Modifier

– Only available on the saturating "long" and "high half" multiplies

– VQDMLAL, VQDMLSL, VQDMULH, VQDMULL, VQRDMULH

R Modifier

– Always "Round to Nearest", as defined in the IEEE 754 standard

– Available on instructions that include a right shift

• Including halving and "high half" instructions
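To make the Q, H and R modifiers concrete, here is a scalar C sketch of what a single lane computes for VQADD.S16, VHADD.U8 and VRHADD.U8. The function names are hypothetical and the `qc` pointer is only a stand-in for the sticky QC bit in the FPSCR; real code would read that flag with VMRS.

```c
#include <stdint.h>

/* One lane of VQADD.S16: add at wider precision, then clamp to the
   16-bit signed range. *qc models the sticky QC flag: it is set on
   saturation and never cleared here. */
int16_t vqadd_s16_lane(int16_t a, int16_t b, int *qc)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) { *qc = 1; return INT16_MAX; }
    if (sum < INT16_MIN) { *qc = 1; return INT16_MIN; }
    return (int16_t)sum;
}

/* One lane of VHADD.U8: (a + b) >> 1, with the sum formed at wider
   precision so the ninth bit is not lost; the result is truncated. */
uint8_t vhadd_u8_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b) >> 1);
}

/* One lane of VRHADD.U8: as VHADD, but 1 is added before the shift so
   the halved result is rounded instead of truncated. */
uint8_t vrhadd_u8_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1u) >> 1);
}
```

For inputs 3 and 4, the halving add yields 3 and the rounding halving add yields 4, which shows the only difference between the H and RH forms.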

Instruction Shapes

Long and Wide – Input elements are promoted before operation

Narrow – Result elements are demoted to half the width of the input elements

[Figure: register shapes. L shape: Dn, Dm -> Qd. W shape: Qn, Dm -> Qd. N shape: Qn, Qm -> Dd.]


Multiple 1-Element Structure Access

VLD1, VST1 provide standard array access

– An array of structures containing a single component is a basic array

– List can contain 1, 2, 3 or 4 consecutive registers

– Transfer multiple consecutive 8, 16, 32 or 64-bit elements

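VST1.16 {D3,D4}, [R1]

[Figure: D3 holds x0 to x3 and D4 holds x4 to x7. The store writes x0 to x7 as consecutive 16-bit elements at [R1], [R1]+2, ..., [R1]+14.]

VLD1.16 {D7}, [R4], R3

[Figure: the load reads x0 to x3 from [R4], [R4]+2, [R4]+4 and [R4]+6 into D7, then post-increments R4 by R3.]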

Addition: Basic

NEON supports various useful forms of basic addition

Normal Addition - VADD, VSUB

– Floating-point

– Integer (8-bit to 64-bit elements)

– 64-bit and 128-bit registers

VADD.I16 D0, D1, D2

VSUB.F32 Q7, Q1, Q4

VADD.I8 Q15, Q14, Q15

VSUB.I64 D0, D30, D5

Long Addition - VADDL, VSUBL

– Promotes both inputs before operation

– Signed/unsigned (8-bit to 32-bit source elements)

VADDL.U16 Q1, D7, D8

VSUBL.S32 Q8, D1, D5

Wide Addition - VADDW, VSUBW

– Promotes one input before operation

– Signed/unsigned (8-bit to 32-bit source elements)

VADDW.U8 Q1, Q7, D8

VSUBW.S16 Q8, Q1, D5
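The difference between the long and wide forms is easiest to see in a scalar model. The sketch below, with hypothetical function names, mimics VADDL.U8 (both inputs widened) and VADDW.U8 (only the second input widened); it is an illustration of the lane arithmetic, not NEON code.

```c
#include <stdint.h>

/* Model of VADDL.U8 Qd, Dn, Dm: both 8-bit inputs are promoted to
   16 bits before the add, so 200 + 100 gives 300 instead of wrapping. */
void vaddl_u8_model(uint16_t qd[8], const uint8_t dn[8], const uint8_t dm[8])
{
    for (int i = 0; i < 8; i++)
        qd[i] = (uint16_t)((uint16_t)dn[i] + (uint16_t)dm[i]);
}

/* Model of VADDW.U8 Qd, Qn, Dm: the first input is already 16 bits
   wide; only the 8-bit second input is promoted. */
void vaddw_u8_model(uint16_t qd[8], const uint16_t qn[8], const uint8_t dm[8])
{
    for (int i = 0; i < 8; i++)
        qd[i] = (uint16_t)(qn[i] + dm[i]);
}
```

A typical use of the wide form is accumulating many 8-bit samples into a 16-bit running total without a separate widening step.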

Example – adding all lanes

Input in Q0 (D0 and D1): u16 input values

VPADDL.U16 Q0, Q0

Now Q0 contains 4x u32 values (with 15 headroom bits)

VPADD.U32 D0, D0, D1

Reducing/folding operation needs 1 bit of headroom

VPADDL.U32 D0, D0

Result is u64 in D0
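The three-instruction reduction can be traced step by step with a scalar model. The function name `sum_u16x8_model` is hypothetical; each stage in the sketch corresponds to one of the instructions VPADDL.U16, VPADD.U32 and VPADDL.U32 from the slide.

```c
#include <stdint.h>

/* Scalar model of summing eight u16 lanes held in Q0 (D0 and D1). */
uint64_t sum_u16x8_model(const uint16_t q0[8])
{
    uint32_t s32[4];   /* Q0 after VPADDL.U16 Q0, Q0 */
    uint32_t p32[2];   /* D0 after VPADD.U32 D0, D0, D1 */

    /* VPADDL.U16: add adjacent pairs, widening each result to 32 bits. */
    for (int i = 0; i < 4; i++)
        s32[i] = (uint32_t)q0[2 * i] + q0[2 * i + 1];

    /* VPADD.U32: pair-add the two halves; the 15 spare bits in each
       32-bit value easily cover the 1 bit of headroom this needs. */
    p32[0] = s32[0] + s32[1];   /* pair from D0 */
    p32[1] = s32[2] + s32[3];   /* pair from D1 */

    /* VPADDL.U32: add the final pair, widening to a single u64. */
    return (uint64_t)p32[0] + p32[1];
}
```

With all eight inputs at 0xFFFF the total is 8 x 65535 = 524280, which fits comfortably and shows why the widening steps avoid overflow.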

Summing a vector

[Figure: tree of pairwise additions that folds the lanes of D0 and D1 step by step into a single sum in D0.]


Some NEON clever features


Data Movement: Table Lookup

Uses byte indexes to control byte look up in a table

– Table is a list of 1,2,3 or 4 adjacent registers

VTBL : out of range indexes generate 0 result

VTBX : out of range indexes leave destination unchanged

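VTBL.8 D0, {D1, D2}, D3

[Figure: {D1,D2} holds a 16-byte table with the bytes a to p. The index bytes in D3 are 3, 0, 8, 26, 13, 8, 4, 11, in the order drawn. The result in D0 is d, a, i, 0, n, i, e, l: index 26 is out of range for a 16-byte table, so VTBL writes 0 in that lane.]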
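The difference between VTBL and VTBX is only what happens to out-of-range indexes, which a scalar model makes explicit. The function names below are hypothetical, and `table_len` stands for 8, 16, 24 or 32 bytes depending on how many D registers are in the list.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of VTBL.8 Dd, {table}, Dm: each index byte selects a table
   byte; an out-of-range index produces 0. */
void vtbl8_model(uint8_t dd[8], const uint8_t *table, size_t table_len,
                 const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++)
        dd[i] = (idx[i] < table_len) ? table[idx[i]] : 0;
}

/* Model of VTBX.8 Dd, {table}, Dm: as VTBL, but an out-of-range index
   leaves the destination lane unchanged. */
void vtbx8_model(uint8_t dd[8], const uint8_t *table, size_t table_len,
                 const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++)
        if (idx[i] < table_len)
            dd[i] = table[idx[i]];
}
```

Because VTBX preserves unmatched lanes, tables larger than 32 bytes can be handled by chaining several lookups with the indexes offset each time.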


Element Load Store Instructions

All treat memory as an array of structures (AoS)

– SIMD registers are treated as structure of arrays (SoA)

– Enables interleaving/de-interleaving for efficient SIMD processing

– Transfer up to 256-bits in a single instruction

Three forms of Element Load Store instructions are provided

Forms distinguished by type of register list provided

– Multiple Structure Access e.g. {D0, D1}

– Single Structure Access e.g. {D0[2], D1[2]}

– Single Structure Load to all lanes e.g. {D0[], D1[]}

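[Figure: memory viewed as an array of 3-element structures: x0 y0 z0 | x1 y1 z1 | x2 y2 z2 | x3 ... Each x, y or z is one element; each group of three is one 3-element structure.]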


Multiple 2-Element Structure Access

VLD2, VST2 provide access to multiple 2-element structures

– List can contain 2 or 4 registers

– Transfer multiple consecutive 8, 16, or 32-bit 2-element structures

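VLD2.16 {D2,D3}, [R1]

[Figure: memory at [R1] holds x0, y0, x1, y1, x2, y2, x3, y3 as consecutive 16-bit elements (offsets +0 to +14). The load de-interleaves them: D2 = x0 to x3, D3 = y0 to y3.]

VLD2.16 {D0,D1,D2,D3}, [R3]!

[Figure: memory at [R3] holds x0, y0, ..., x7, y7 (offsets +0 to +30). The load de-interleaves them: D0 = x0 to x3, D1 = x4 to x7, D2 = y0 to y3, D3 = y4 to y7. The "!" writes the post-incremented address back to R3.]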
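The de-interleaving performed by the two-register form of VLD2.16 can be sketched as a scalar loop. The function name `vld2_16_model` is hypothetical; the sketch only models where each 16-bit element ends up, not alignment or writeback.

```c
#include <stdint.h>

/* Model of VLD2.16 {Dx, Dy}, [Rn]: memory holds four 2-element
   structures (x0 y0 x1 y1 x2 y2 x3 y3); the x components go to one
   register and the y components to the other. */
void vld2_16_model(uint16_t dx[4], uint16_t dy[4], const uint16_t *mem)
{
    for (int i = 0; i < 4; i++) {
        dx[i] = mem[2 * i];       /* first member of structure i  */
        dy[i] = mem[2 * i + 1];   /* second member of structure i */
    }
}
```

This is the array-of-structures to structure-of-arrays conversion described earlier: after the load, one vector instruction can process all four x values (for example, the left channel of stereo audio) at once.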


Multiple 3/4-Element Structure Access

VLD3/4, VST3/4 provide access to 3 or 4-element structures

– Lists contain 3/4 registers; optional space for building 128-bit vectors

– Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures

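VST3.16 {D3,D4,D5}, [R1]

[Figure: D3 holds x0 to x3, D4 holds y0 to y3 and D5 holds z0 to z3. The store interleaves them in memory as x0, y0, z0, x1, y1, z1, ..., x3, y3, z3 at [R1] to [R1]+22.]

VLD3.16 {D0,D2,D4}, [R1]!

[Figure: memory at [R1] holds x0, y0, z0, ..., x3, y3, z3 (offsets +0 to +22). The load de-interleaves the x, y and z components into the three registers of the list, which are spaced two apart so that 128-bit vectors can be built. The "!" writes the post-incremented address back to R1.]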


Logical

NEON supports bitwise logical operations

VAND, VBIC, VEOR, VORN, VORR

– Bitwise logical operation

– Independent of data type

– 64-bit and 128-bit registers

VBIT, VBIF, VBSL

– Bitwise multiplex operations

– Insert True, Insert False, Select

– 3 versions overwrite different registers

– 64-bit and 128-bit registers

– Used with masks to provide selection

VAND D0, D0, D1

VORR Q0, Q1, Q15

VEOR Q7, Q1, Q15

VORN D15, D14, D1

VBIC D0, D30, D2

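VBIT D1, D0, D2

[Figure: VBIT inserts each bit of D0 into D1 where the corresponding bit of the mask D2 is 1; the other bits of D1 are left unchanged.]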
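The three bitwise multiplex instructions differ only in which register acts as the mask and which one is overwritten. The sketch below models them on 64-bit values; the function names are hypothetical and each function returns the new value of the destination register Dd.

```c
#include <stdint.h>

/* VBSL Dd, Dn, Dm: Dd is the mask. Take Dn bits where the mask is 1,
   Dm bits where it is 0. */
uint64_t vbsl_model(uint64_t dd, uint64_t dn, uint64_t dm)
{
    return (dn & dd) | (dm & ~dd);
}

/* VBIT Dd, Dn, Dm: Dm is the mask. Insert Dn bits into Dd where the
   mask is 1 (insert if true). */
uint64_t vbit_model(uint64_t dd, uint64_t dn, uint64_t dm)
{
    return (dn & dm) | (dd & ~dm);
}

/* VBIF Dd, Dn, Dm: Dm is the mask. Insert Dn bits into Dd where the
   mask is 0 (insert if false). */
uint64_t vbif_model(uint64_t dd, uint64_t dn, uint64_t dm)
{
    return (dd & dm) | (dn & ~dm);
}
```

Combined with a comparison that produces all-ones or all-zeros lanes, these give a branch-free per-lane select.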


NEON instruction summary

A comprehensive set of data processing instructions

Forms a general-purpose SIMD instruction set suitable for compilers

NEON operations fall into the following categories

Addition / Subtraction ( Saturating, Halving, Rounding)

MIN, MAX, NEG, MOV, ABS, ABD, …

Multiplication (MUL, MLA, MLS, …)

Comparison and Selection

Logic (AND , ORR, EOR, BIC, ORN, …)

Bitfield

Reciprocal Estimate/Step, Reciprocal Square Root Estimate/Step

Miscellaneous (DUP, EXT, CLZ, CLS, TBL, REV, ZIP, TRN, …)

Many more…


NEON instruction reference

The official NEON instruction set reference is the "Advanced SIMD" section of the ARM Architecture Reference Manual, v7 A & R edition

Available to partners on www.arm.com


Further Reading

Documentation

– ARM “ARM” v7 A&R

– “NEON Support in the Realview compiler” white paper

– “NEON optimizations in Android” white paper

– Realview Compiler Guide (for intrinsics)

– ARM Cortex-A Programmer's Guide (from www.arm.com downloads)

– Software blogs on www.arm.com



How to use NEON

NEON-optimized Open-source Libraries

– OpenMAX DL (Development Layer): APIs contain a comprehensive set of audio, video and imaging functions that can be used for a wide range of accelerated codec functionality such as MPEG-4, H.264, MP3, AAC and JPEG.

– Broad open-source support for NEON

Vectorizing Compilers

– Exploit NEON SIMD automatically with existing C source code

NEON Intrinsics

– C function call interface to NEON operations

– Supports all data types and operations supported by NEON

Assembler Code

– For those who really want to optimize at the lowest level


OpenMAX DL v1.0 Library Summary

Audio Domain

MP3

AAC

Signal Processing Domain

FIR

IIR

FFT

Dot Product

Video Domain

– MPEG-4 simple profile

– H.264 baseline

Still Image Domain

– JPEG

Image Processing Domain

– Colorspace conversion

– De-blocking / de-ringing

– Rotation, scaling, compositing

Spec from: http://www.khronos.org/openmax/

Open-source implementations for ARM11 and NEON are available from:

http://www.arm.com/zh/community/multimedia/standards-apis.php

NOTE: OpenMax DL provides low level data processing functions, not the complete codecs

NEON in open source

Google – WebM – 11,000 lines of NEON assembler!

BlueZ – official Linux Bluetooth protocol stack

– NEON sbc audio encoder

Pixman (part of cairo 2D graphics library)

– Compositing/alpha blending

ffmpeg – libavcodec

– LGPL media player used in many Linux distros

– NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora

– NEON Audio: AAC, Vorbis, WMA

x264 – Google Summer Of Code 2009

– GPL H.264 encoder, e.g. for video conferencing

Android – NEON optimizations

– Skia library, S32A_D565_Opaque 5x faster using NEON

Eigen2 – C++ vector math / linear algebra template library

Theorarm – libtheora NEON version (optimized by Google)

libjpeg – optimized JPEG decode (IJG library)

FFTW – NEON-enabled FFT library

LLVM – code generation backend used by Android Renderscript


Automatic Vectorization: ARM Compiler

Automatic vectorization can generate code targeted for NEON from

ordinary C source code

– Less effort to produce efficient code

– Portable - no compiler-specific source code features need to be used

To enable automatic vectorization on armcc, use these options together:

--vectorize - enable vectorization

--cpu Cortex-A9 - provide a CPU option with NEON support

-O2 or -O3 - select high optimization level.

-Otime - optimize for speed over space

--fpmode fast - the precision of vectorized floating-point operations is the same

as VFP RunFast mode

--diag_warning=optimizations to obtain useful diagnostics from the

compiler on what it could or could not optimize/vectorize

Note:

– When you specify --vectorize, automatic vectorization is enabled only if you also specify -Otime and an optimization level of -O2 or -O3.


Automatic Vectorization: GNU tools

To enable automatic vectorization on GCC/g++, use these options

together:

-mcpu=cortex-a9 - specify a suitable ARMv7-A processor

-mfpu=neon - enable NEON support

-ftree-vectorize - enable the vectorizer (supports SIMD on many architectures)

-O3 - implies -ftree-vectorize

-mvectorize-with-neon-quad - by default, GCC 4.4 only vectorizes for doubleword registers

-mfloat-abi=softfp - can use "hard" for more efficient floating-point parameter passing, but then all code must be compiled with this option

Understand more with -ftree-vectorizer-verbose

– Takes an integer value specifying the level of detail to provide, where 1 enables additional printouts and higher values add even more information.

– Reports what vectorization the compiler is performing, or what it is unable to perform because of possible dependencies.

Automatic Vectorization - how it works

1. Analyze each loop:

– Are pointer accesses safe for vectorization?

– What data types are being used? How do they map onto NEON vector registers?

– How many loop iterations are there?

    void add_int(int * __restrict pa,
                 int * __restrict pb,
                 unsigned int n, int x)
    {
        unsigned int i;
        for(i = 0; i < (n & ~3); i++)
            pa[i] = pb[i] + x;
    }

2. Unroll the loop to the appropriate number of iterations, and perform other transformations like pointerization

    void add_int(int *pa, int *pb,
                 unsigned n, int x)
    {
        unsigned int i;
        for (i = ((n & ~3) >> 2); i; i--)
        {
            *(pa + 0) = *(pb + 0) + x;
            *(pa + 1) = *(pb + 1) + x;
            *(pa + 2) = *(pb + 2) + x;
            *(pa + 3) = *(pb + 3) + x;
            pa += 4; pb += 4;
        }
    }

3. Map each unrolled operation onto a NEON vector lane, and generate corresponding NEON instructions

ARM RVDS & gcc vectorising compiler

test.c

    int a[256], b[256], c[256];

    void foo(void) {
        int i;
        for (i=0; i<256; i++){
            a[i] = b[i] + c[i];
        }
    }

armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c

    |L1.16|
        VLD1.32 {d0,d1},[r0]!
        SUBS r3,r3,#1
        VLD1.32 {d2,d3},[r1]!
        VADD.I32 q0,q0,q1
        VST1.32 {d0,d1},[r2]!
        BNE |L1.16|

gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c

    .L2:
        add r1, r0, ip
        add r3, r0, lr
        add r2, r0, r4
        add r0, r0, #8
        cmp r0, #1024
        fldd d7, [r3, #0]
        fldd d6, [r2, #0]
        vadd.i32 d7, d7, d6
        fstd d7, [r1, #0]
        bne .L2

Tuning C/C++ Code for Vectorizing

The goal is to make the code simple, straightforward, and parallel, so that the compiler can easily convert it to NEON assembly

Loops can be modified for better vectorizing:

– Short, simple loops work best (even if it means multiple loops in your code)

– Avoid breaks, loop-carried dependencies and conditions inside loops

– Try to make the number of iterations a power of 2

– Try to make sure the number of iterations is known to the compiler

– Functions called inside a loop should be inlined

Pointer issues:

– Using arrays with indexing vectorizes better than using pointers

– Indirect addressing (multiple indexing or de-reference) doesn't vectorize

– Use the __restrict keyword to tell the compiler that pointers do not reference overlapping areas of memory

Use suitable data types

– For best performance, always use the smallest data type that can hold the required values
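A loop written along these guidelines might look like the sketch below. The function `scale_u8`, the constant `N_SAMPLES` and the fixed-point gain convention are all hypothetical, chosen only to show the pattern: array indexing, `__restrict` pointers, a compile-time trip count that is a power of 2, 8-bit data, and no branches or calls in the loop body. Whether a given compiler actually vectorizes it depends on the compiler version and options.

```c
#include <stdint.h>

/* Trip count known at compile time and a power of 2 (assumed value). */
#define N_SAMPLES 256

/* Scale each 8-bit sample by gain/256. The pointers are marked
   __restrict, promising the compiler that dst and src do not overlap. */
void scale_u8(uint8_t * __restrict dst, const uint8_t * __restrict src,
              uint8_t gain)
{
    for (int i = 0; i < N_SAMPLES; i++)
        dst[i] = (uint8_t)((src[i] * gain) >> 8);
}
```

Using `uint8_t` instead of `int` here means a 128-bit register can hold 16 samples rather than 4, which is the reason behind the "smallest data type" advice.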


NEON Intrinsics

Available in armcc, GCC/g++, and LLVM. The syntax is the same, so source code that uses intrinsics can be compiled by any of these compilers.

Advantage:

– Provides low-level access to NEON instructions while the compiler does the hard work, such as:

• Register allocation.

• Code scheduling, or re-ordering instructions. The C compilers can reorder code to ensure the minimum number of stalls according to a specific processor.

Disadvantage:

– The compiler output may not be exactly the code you want, so there is still some possibility of improvement when moving to NEON assembler code.


NEON Intrinsics - Example

Include intrinsics header file

#include <arm_neon.h>

Use special NEON data types which correspond to D and Q registers, e.g.

int8x8_t D-register containing 8x 8-bit elements

int16x4_t D-register containing 4x 16-bit elements

int32x4_t Q-register containing 4x 32-bit elements

Use special intrinsics versions of NEON instructions

vin1 = vld1q_s32(ptr);

vout = vaddq_s32(vin1, vin2);

vst1q_s32(ptr, vout);

Strongly typed!

– Use vreinterpret_s16_s32( ) to change the type
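Putting the pieces from this slide together, a complete function might look like the following sketch. The function name `add_s32` and the scalar fallback are assumptions added for illustration: the `#if` guard lets the same file build on a host without NEON, and the tail loop handles element counts that are not a multiple of four. The intrinsics used (`vld1q_s32`, `vaddq_s32`, `vst1q_s32`) are the ones shown above.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* dst[i] = a[i] + b[i] for n elements, four lanes at a time on NEON. */
void add_s32(int32_t *dst, const int32_t *a, const int32_t *b, unsigned n)
{
    unsigned i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);          /* load 4 lanes from a */
        int32x4_t vb = vld1q_s32(b + i);          /* load 4 lanes from b */
        vst1q_s32(dst + i, vaddq_s32(va, vb));    /* pointer first, then value */
    }
#endif
    /* Scalar tail (and full fallback when NEON is unavailable). The add
       is done in unsigned arithmetic so it wraps like the vector add. */
    for (; i < n; i++)
        dst[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
}
```

Note the argument order of the store intrinsic: the destination pointer comes first and the vector second, the reverse of what the assembler operand order might suggest.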


NEON Intrinsics - Example

NEON intrinsic

    #include <arm_neon.h>

    uint32x4_t double_elements(uint32x4_t input)
    {
        return(vaddq_u32(input, input));
    }

Command line with GCC

arm-none-linux-gnueabi-gcc -mfpu=neon intrinsic.c

Command line with RVCT

armcc --cpu=Cortex-A9 intrinsic.c


NEON Intrinsics : reference

For information about the intrinsic functions and vector data types, see the:

RealView Compilation Tools Compiler Reference Guide, available from

– http://infocenter.arm.com

GCC documentation, available from

– http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html


NEON Assembler Code

Advantage:

– Careful hand-scheduling gets the best out of NEON assembler code, especially in performance-critical applications

Disadvantage:

– You need to be aware of underlying hardware features such as pipelining, memory access behavior and scheduling hazards.

– Optimization is processor dependent.


NEON Assembler Code - Example

Gas

Command: arm-none-linux-gnueabi-as -mfpu=neon asm.s

        .text
        .arm
        .global double_elements
double_elements:
        vadd.i32 q0,q0,q0
        bx lr
        .end

RVCT

Command: armasm --cpu=Cortex-A9 asm.s

        AREA RO, CODE, READONLY
        ARM
        EXPORT double_elements
double_elements
        VADD.I32 Q0, Q0, Q0
        BX LR
        END


Enabling NEON before using

NEON data engine is disabled at reset

– Needs to be enabled in software before use

Enabling NEON requires 2 steps

1. Enable access to coprocessors 10 and 11 and allow NEON instructions

MRC p15, 0x0, r0, c1, c0, 2 ; Read CP15 CPACR

ORR r0, r0, #(0x0f << 20) ; Full access rights for CP10 and CP11

MCR p15, 0x0, r0, c1, c0, 2 ; Write CP15 CPACR

ISB

2. Enable NEON and VFP

MOV r0, #0x40000000 ; Set bit 30 (EN)

VMSR FPEXC, r0 ; Write r0 to the Floating-Point Exception Control register
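The two constants used in the sequence above are plain bit arithmetic, which the following sketch spells out. It only computes the register values and does not touch any hardware: the macro and function names are hypothetical, and the field positions (cp10 in CPACR bits [21:20], cp11 in bits [23:22], EN in FPEXC bit 30) are taken from the ARMv7-A register layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; 0b11 in a CPACR field means full access. */
#define CPACR_CP10_FULL (UINT32_C(3) << 20)  /* bits [21:20] */
#define CPACR_CP11_FULL (UINT32_C(3) << 22)  /* bits [23:22] */
#define FPEXC_EN        (UINT32_C(1) << 30)  /* enables NEON and VFP */

/* Value to write back to CPACR, given the value just read (the ORR step). */
static uint32_t cpacr_with_neon_access(uint32_t cpacr)
{
    uint32_t updated = cpacr | CPACR_CP10_FULL | CPACR_CP11_FULL;
    /* Both fields must now read 0b11, i.e. the (0x0f << 20) mask is fully set. */
    assert((updated & (UINT32_C(0x0f) << 20)) == (UINT32_C(0x0f) << 20));
    return updated;
}

/* Value written to FPEXC in step 2 (the MOV step). */
static uint32_t fpexc_enable_value(void)
{
    return FPEXC_EN;
}
```

The OR in step 1 preserves every other CPACR bit, which is why the sequence reads the register first instead of writing a constant.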


NEON Agenda

NEON Hardware overview

NEON Instruction set overview

NEON Software Support

NEON improves performance


NEON in Audio

FFT time | No NEON (v6 SIMD asm) | With NEON (v7 NEON asm)

Cortex-A8 500MHz, actual silicon | 15.2 us | 3.8 us (x 4.0 performance)

FFT: 256-point, 16-bit signed complex numbers

– FFT is a key component of AAC, Voice/pattern recognition etc.

– Hand optimized assembler in both cases

Extreme example: FFT in ffmpeg: 12x faster

– C code -> handwritten asm

– Scalar -> vector processing

– Single-precision floating point on Cortex-A8 (VFPlite -> NEON)


ARM RVDS vectorizing compiler

• RVDS 4.0 professional includes auto-vectorizing armcc

– armcc --vectorize --cpu=Cortex-A8 x.c

• Up to 4x performance increase for benchmarks, with no source code changes (no source code changes are permitted for benchmarking)

• Simple source code changes can yield significant improvements above this

– Use the C '__restrict' keyword to work around C pointer aliasing issues

– Make the loop count clearly a multiple of 2^n (e.g. use 4*n as loop end) to aid vectorization
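A loop written with both hints in mind might look like the sketch below. It is only an illustration: add_arrays and its parameter n_quads are hypothetical names, and whether a given compiler actually vectorizes the loop depends on its version and options.

```c
#include <assert.h>
#include <stddef.h>

/*
 * add_arrays (hypothetical): out[i] = a[i] + b[i] for 4 * n_quads elements.
 * __restrict promises the compiler that the three arrays do not overlap,
 * and the 4 * n_quads bound makes the trip count visibly a multiple of 4.
 */
static void add_arrays(float *__restrict out,
                       const float *__restrict a,
                       const float *__restrict b,
                       size_t n_quads)
{
    size_t i;
    assert(out && a && b);
    for (i = 0; i < 4 * n_quads; i++)
        out[i] = a[i] + b[i];
}
```

Without the no-overlap promise the compiler has to assume that a store to out[i] could change a later a[i] or b[i], which can prevent it from loading four elements at once.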

[Chart: ARM vs NEON (Vectorize) on Cortex-A8. Categories: Telecom, Consumer. Series: ARM, NEON. Data labels: 100%, 100%, 135%, 169%. Annotation: improved vectorization in latest RVDS 4.0]


FFmpeg/libav performance

git.ffmpeg.org

snapshot 21-Sep-09

YouTube HQ video decode

480x270, 30fps

Including AAC audio

Real silicon measurements

– OMAP3 Beagleboard

– ARM A9TC

NEON ~2x overall performance

[Chart: ffmpeg performance (relative to realtime), axis 0 to 3. Series: v7vfp, v7neon. Platforms: Cortex-A8 256KB L2 500MHz, Cortex-A9 512KB L2 400MHz]


AC3 decode MHz

Using ffmpeg (LGPL open-source codec) with extensive NEON optimizations

Optimized Dolby reference code is also available from Dolby and other vendors

ffmpeg from git.libav.org

Checkout from 14-Nov-2011

./configure --extra-cflags="-mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp"

Benchmarked on 500MHz Cortex-A9 (ARM Versatile Express)

Samples from: samples.mplayerhq.hu/A-codecs

[Chart: MHz required (lower is better), axis 0.0 to 60.0. Series: v7neon, v7fpu, v7fpu --disable-asm. Samples: blade_runner (48KHz 192Kbit/s stereo), Broadway-5.1-48KHz-448kbit.ac3. Data labels: 8.2, 16.5, 17.8, 22.4, 46.4, 50.7]


JPEG decode

android_external_jpeg optimized for v6 and v7neon

16Mpixel digital camera test image: takes only 1.7s to decode on 500MHz A9+NEON (improved from 4.5s unoptimized)

[Chart: cycles/pixel, lower is better. djpeg: 140; djpeg.v6.opt: 83; djpeg.v7neon.opt: 52]
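The cycles/pixel figures and the quoted decode times can be cross-checked with a one-line formula: time = cycles per pixel x pixels / clock rate. The sketch below assumes that 16 Mpixel means 16,000,000 pixels; decode_seconds is a hypothetical helper name, not something from the deck.

```c
#include <assert.h>

/* decode_seconds (hypothetical): wall time for an image at a given cost per pixel. */
static double decode_seconds(double cycles_per_pixel, double pixels, double clock_hz)
{
    assert(clock_hz > 0.0);
    return cycles_per_pixel * pixels / clock_hz;
}
```

Under that assumption, decode_seconds(52, 16e6, 500e6) gives about 1.66 s and decode_seconds(140, 16e6, 500e6) gives 4.48 s, in line with the 1.7 s and 4.5 s quoted for the NEON-optimized and unoptimized builds.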

Code from:

https://github.com/mansr

ARM Versatile Express 500MHz Cortex-A9

Ubuntu 11.04 Linux

djpeg -outfile /dev/null testfile.jpg

Also libjpeg-turbo via Linaro

http://libjpeg-turbo.virtualgl.org/


NEON Summary

NEON will become standard on general purpose apps & media-centric devices.

– NEON option across the Cortex-A roadmap

– NEON ideal for use by open OS systems with downloaded apps

Full enabling technology to support NEON

– Compilers, profilers, debuggers, libraries all available now

– Key differentiator: easy to program, popular with software engineers

Strong ARM NEON ecosystem

Complementary to DSP/FPGA


Zynq-7000

Hardware Architecture