7/31/2019 ARM Cortex Coding

1/24

Getting started on Cortex A8

Instruction Set


2/24

Instruction Sets

32-bit ARM instruction set

16-bit Thumb instruction set

32-bit Thumb-2 instruction set

A trade-off between the two above. Most 32-bit Thumb-2 instructions are unconditional, whereas most ARM instructions can be conditional.

Advanced SIMD (NEON) architecture

Enables the same operation to be performed on multiple data items in parallel.

Instructions operate on vectors held in 64-bit or 128-bit registers.

Other instruction sets

ThumbEE instruction set

Jazelle Extension


3/24

Register Set (ARM and Neon)

33 general-purpose 32-bit registers

In user mode only R0 to R15 are available

R14 -> Link register: holds the return address when a branch with link (BL) is executed

R15 -> Program counter

Seven 32-bit status registers

Status flags / processor mode

Neon Register Bank

View 1: 32 x 64-bit doubleword registers, D0-D31.

View 2: 16 x 128-bit quadword registers, Q0-Q15.

View 3: a combination of registers from both views, Q0-Q15 and D0-D31 (the views alias the same storage).
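
The aliasing between the D and Q views can be pictured with a C union. This is only a host-side model, not real register access. It assumes the standard NEON mapping in which Qn overlays D(2n) and D(2n+1); the names `neon_bank_t`, `q_low` and `q_high` are hypothetical.

```c
#include <stdint.h>

/* Host-side model of the NEON register bank (not real hardware access).
   The same 256 bytes can be read as 32 doublewords or as 16 quadwords. */
typedef union {
    uint64_t d[32];      /* View 1: D0-D31, 64 bits each */
    uint64_t q[16][2];   /* View 2: Q0-Q15, two 64-bit halves each */
} neon_bank_t;

/* Low half of Qn is D(2n). */
static uint64_t q_low(const neon_bank_t *bank, int n)
{
    return bank->q[n][0];
}

/* High half of Qn is D(2n+1). */
static uint64_t q_high(const neon_bank_t *bank, int n)
{
    return bank->q[n][1];
}
```

Writing D2 and D3 and then reading Q1 returns the same bits, which is why a routine must not treat, say, Q1 and D3 as independent registers.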


4/24

ARM Instruction set

All ARM instructions are 32 bits long

Branch instructions

Data processing instructions

Register load and store instructions

Multiple register load and store instructions

Status register access instructions (OOS)

Coprocessor instructions (OOS)


5/24

ARM Instruction set

Branch Instructions

branch backwards to form loops

branch forward in conditional structures

branch to subroutines

e.g.

B label1

BL label1 ; branch with link

BEQ {pc}+4


6/24

ARM Instruction set

Data processing instructions

Add or multiply two registers

Add register with constant

Bitwise operations

operate on 8 bit, 16 bit and 32 bit data

Long multiply instructions give a 64-bit result in two registers

e.g.

ADD r2, r1, r3

SUBS r8, r6, #240 ; sets the flags on the result

RSB r4, r4, #1280 ; subtracts contents of r4 from 1280

AND r9, r2, #0xFF00

ORREQ r2, r0, r5 ; executed only if the Z flag is set

MOVS r3, r2, LSR #3 ; r3 = r2 logically shifted right by 3, sets the flags
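
The long multiply bullet can be sketched in C. This is a host-side model under stated assumptions: it mirrors UMULL (unsigned long multiply) and SMLAL (signed multiply-accumulate long), and the function names `umull_model` and `smlal_model` are hypothetical.

```c
#include <stdint.h>

/* Model of UMULL RdLo, RdHi, Rm, Rs: unsigned 32x32 -> 64-bit product,
   returned as two 32-bit halves. */
static void umull_model(uint32_t rm, uint32_t rs,
                        uint32_t *rd_lo, uint32_t *rd_hi)
{
    uint64_t product = (uint64_t)rm * (uint64_t)rs;
    *rd_lo = (uint32_t)product;
    *rd_hi = (uint32_t)(product >> 32);
}

/* Model of SMLAL RdLo, RdHi, Rm, Rs: the signed product is added to the
   64-bit accumulator held in RdHi:RdLo. */
static void smlal_model(int32_t rm, int32_t rs,
                        uint32_t *rd_lo, uint32_t *rd_hi)
{
    uint64_t acc = ((uint64_t)*rd_hi << 32) | *rd_lo;
    acc += (uint64_t)((int64_t)rm * (int64_t)rs);   /* wraps modulo 2^64 */
    *rd_lo = (uint32_t)acc;
    *rd_hi = (uint32_t)(acc >> 32);
}
```

The two output pointers stand in for the register pair, so the 64-bit result never needs a single 64-bit register.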


7/24

ARM Instruction set

Register load and store instructions

Load or store a single register: 8, 16 or 32 bits

Load double words

Byte and halfword loads can be zero filled or sign extended

e.g.

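STMFD r13!, {r0-r5} ; push r0-r5 onto a full descending stack

LDMFD r13!, {r0-r5} ; pop r0-r5 back

PUSH {r5-r7,lr}

POP {r5-r7,pc} ; loading pc returns to the caller

LDR r3, [r0], #4 ; post-indexed: load from [r0], then r0 is incremented by 4

LDR r3, [r0], r4 ; post-indexed: load from [r0], then r0 is incremented by r4

LDR r3, [r0, #0x2C] ; load with offset, r0 is unchanged

LDR r3, [r0, r4, LSL #2] ; load from r0 + (r4 << 2), r0 is unchanged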
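
The three LDR addressing forms above map closely onto C pointer idioms. A minimal sketch, assuming `r0` points at an array of 32-bit words; the function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* LDR r3, [r0], #4 : load the word at r0, then advance r0 by 4 bytes. */
static uint32_t load_post_index(const uint32_t **r0)
{
    uint32_t r3 = **r0;
    *r0 += 1;                 /* one uint32_t element = 4 bytes */
    return r3;
}

/* LDR r3, [r0, #0x2C] : load from r0 + 44 bytes; r0 is unchanged. */
static uint32_t load_offset(const uint32_t *r0)
{
    return r0[0x2C / sizeof(uint32_t)];   /* element index 11 */
}

/* LDR r3, [r0, r4, LSL #2] : load from r0 + (r4 << 2), i.e. element r4. */
static uint32_t load_scaled(const uint32_t *r0, uint32_t r4)
{
    return r0[r4];
}
```

The post-indexed form is what a loop such as `x = *p++` typically compiles to, and the scaled-register form is plain array indexing.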


8/24

ARM Instruction set

Conditional Execution

Flags

N Set when the result of the operation was Negative.

Z Set when the result of the operation was Zero.

C Set when the operation resulted in a Carry.

V Set when the operation caused oVerflow.

Most of the ARM instructions can be conditional

E.g.

ADD r0, r1, r2 ; r0 = r1 + r2, don't update flags

ADDS r0, r1, r2 ; r0 = r1 + r2, and update flags

ADDSCS r0, r1, r2 ; If C flag set then r0 = r1 + r2, and update flags

CMP r0, r1 ; update flags based on r0-r1.

Why are conditional instructions required if branch instructions are available?
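
How the four flags come out of ADDS and CMP can be sketched in C. This is a host-side model only; the type `flags_t` and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { bool n, z, c, v; } flags_t;

/* Flags as ADDS r0, r1, r2 would set them. */
static flags_t adds_flags(uint32_t r1, uint32_t r2, uint32_t *r0)
{
    uint32_t result = r1 + r2;
    flags_t f;
    f.n = (result >> 31) != 0;
    f.z = (result == 0);
    f.c = (result < r1);                              /* unsigned carry out */
    f.v = ((~(r1 ^ r2) & (r1 ^ result)) >> 31) != 0;  /* signed overflow */
    *r0 = result;
    return f;
}

/* Flags as CMP r0, r1 would set them (r0 - r1, result discarded). */
static flags_t cmp_flags(uint32_t r0, uint32_t r1)
{
    uint32_t result = r0 - r1;
    flags_t f;
    f.n = (result >> 31) != 0;
    f.z = (result == 0);
    f.c = (r0 >= r1);                                 /* C set means no borrow */
    f.v = (((r0 ^ r1) & (r0 ^ result)) >> 31) != 0;
    return f;
}
```

A condition suffix such as EQ or CS simply tests these bits, so a short conditional sequence can replace a compare-and-branch around one or two instructions.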


9/24

ARM Instruction set

Suffix details


10/24

Neon Instruction set

Vector Duplicate

VDUP{cond}.size Qd, Dm[x]

cond is an optional condition code

size must be 8, 16, or 32

Qd specifies the destination register for a quadword operation

Dm[x] specifies the NEON scalar.

VADD.datatype {Qd}, Qn, Qm

VADD.datatype {Dd}, Dn, Dm

Datatype -> I8, I16, I32, I64, F32 for VADD and VSUB

Datatype -> S8, S16, S32, S64, U8, U16, U32, U64 for VQADD or VQSUB (depends on instruction, refer TRM)
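
What VDUP, VADD and VQADD do per lane can be modelled in portable C. This is a sketch under stated assumptions, not NEON intrinsics: a Q register is represented as four 32-bit lanes, and the names `q_u32_t`, `vdup32_model`, `vadd_i32_model` and `vqadd_u32_model` are hypothetical.

```c
#include <stdint.h>

/* One 128-bit Q register viewed as 4 x 32-bit lanes. */
typedef struct { uint32_t lane[4]; } q_u32_t;

/* Model of VDUP.32 Qd, Dm[x]: copy one scalar into every lane. */
static q_u32_t vdup32_model(uint32_t scalar)
{
    q_u32_t q;
    for (int i = 0; i < 4; i++) q.lane[i] = scalar;
    return q;
}

/* Model of VADD.I32 Qd, Qn, Qm: lane-wise add, wrapping modulo 2^32. */
static q_u32_t vadd_i32_model(q_u32_t a, q_u32_t b)
{
    q_u32_t q;
    for (int i = 0; i < 4; i++) q.lane[i] = a.lane[i] + b.lane[i];
    return q;
}

/* Model of VQADD.U32 Qd, Qn, Qm: lane-wise add, saturating at UINT32_MAX. */
static q_u32_t vqadd_u32_model(q_u32_t a, q_u32_t b)
{
    q_u32_t q;
    for (int i = 0; i < 4; i++) {
        uint32_t sum = a.lane[i] + b.lane[i];
        q.lane[i] = (sum < a.lane[i]) ? UINT32_MAX : sum;
    }
    return q;
}
```

The only difference between the two adds is what happens on overflow: VADD wraps, VQADD clamps to the largest representable value.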


11/24

Neon Instruction set (e.g.)


12/24

Effective Assembly coding

Branch prediction

Maximize usage of conditional instructions instead of branches

a 512-entry 2-way set associative Branch Target Buffer (BTB)

a 4096-entry Global History Buffer (GHB)

an 8-entry return stack

Pipeline model- Instruction cycle timing

fetch, decode, execute >> 13 stage

Load Store

MAC

ALU

Neon Pipeline >> 10 stage

Removing interlocks/stalls

Maximize usage of SIMD/Neon Instructions

Maximize Dual Issue
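
At the C level, "conditional instructions instead of branches" usually means writing short, side-effect-free selects. A minimal sketch; whether the compiler actually emits conditional instructions is an assumption to be checked in the generated assembly, and the function names are hypothetical.

```c
#include <stdint.h>

/* Clamp x to [lo, hi]. Each select is short and has no side effects, the
   kind of code a compiler may turn into CMP plus a conditional MOV rather
   than a branch (an assumption about the compiler). */
static int32_t clamp_i32(int32_t x, int32_t lo, int32_t hi)
{
    x = (x < lo) ? lo : x;
    x = (x > hi) ? hi : x;
    return x;
}

/* Absolute difference with no data-dependent control flow at all. */
static uint32_t abs_diff_u32(uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(a < b);  /* all ones if a < b */
    return ((a - b) ^ mask) - mask;                   /* negate when a < b */
}
```

Both functions do the same amount of work on every path, so their timing does not depend on how well the branch predictor guesses.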


13/24


How to read ARM instruction tables

ADDEQ R0, R1, R2, LSL #10


14/24


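Interlock e.g. (refer to the table on the next slide)

SMLAL R0, R1, R2, R3

ADD R7, R8, R0 >> four cycles wasted waiting for the SMLAL result in R0

Alternate approach: fill the stall with independent instructions

SMLAL R0, R1, R2, R3

MOV r4, #0x6

ADD r5, r4, r5

MOV r6, #0x6

LDR r5, [r6, #0x2C]

ADD R7, R8, R0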


15/24




16/24


Dual Issue

Two basic pipelines -> Pipeline 0 and Pipeline 1

LS pipeline, Multiply pipeline, ALU pipeline

Multiply pipeline always goes in Pipeline 0

The first instruction always issues in pipeline 0 and the second instruction, if present, issues in pipeline 1

Instructions with the same destination cannot be issued in the same cycle.

Refer to the next slide for more examples.


17/24

Dual issue (contd..)


18/24

General ARM optimization Techniques

Loop unrolling

Use fixed point arithmetic

Use shifts instead of multiplications and divisions

See if complex calculations can be avoided using table lookup

Minimize the number of arguments of a function

Avoid branches in low level functions
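
Three of the techniques above (fixed-point arithmetic, shifts, table lookup) can be sketched together in C. The Q15 format and all function and table names are illustrative choices, not taken from the slides.

```c
#include <stdint.h>

/* Q15 fixed point: value = raw / 32768. Multiply two Q15 numbers and shift
   the Q30 product back down to Q15. No saturation is applied, and the right
   shift of a negative product relies on an arithmetic shift, which is
   implementation-defined in C but usual on ARM compilers. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;   /* Q30 */
    return (int16_t)(product >> 15);
}

/* Shifts in place of multiply/divide by a power of two (unsigned values). */
static uint32_t times_8(uint32_t x)   { return x << 3; }
static uint32_t div_by_16(uint32_t x) { return x >> 4; }

/* Table lookup in place of a computation: number of set bits in a byte,
   using a 16-entry table indexed by each nibble. */
static const uint8_t nibble_bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

static unsigned popcount8(uint8_t x)
{
    return nibble_bits[x & 0x0F] + nibble_bits[x >> 4];
}
```

For example, 0.5 in Q15 is 16384, and `q15_mul(16384, 16384)` gives 8192, which is 0.25 in Q15.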


19/24

Assembly functions/files e.g.

The first four arguments go in r0, r1, r2, r3

e.g. of assembly function
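
Seen from the C side, the register assignment looks as follows. A minimal sketch assuming the standard ARM procedure call standard (AAPCS) and 32-bit integer arguments; `sum5`, `sum5_packed` and `sum_args_t` are hypothetical names.

```c
#include <stdint.h>

/* Under AAPCS the arguments of this function are passed as
     a -> r0, b -> r1, c -> r2, d -> r3, e -> on the stack
   and the 32-bit result is returned in r0. */
int32_t sum5(int32_t a, int32_t b, int32_t c, int32_t d, int32_t e)
{
    return a + b + c + d + e;
}

/* Packing related values into a struct passed by pointer keeps the call at
   one register argument (r0) and avoids spilling arguments to the stack. */
typedef struct { int32_t a, b, c, d, e; } sum_args_t;

int32_t sum5_packed(const sum_args_t *args)
{
    return args->a + args->b + args->c + args->d + args->e;
}
```

An assembly routine called from C with this prototype would read its first four inputs straight from r0-r3 and fetch the fifth from the stack.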


20/24

General/Neon optimization Techniques

Code Vectorization in C itself

Use word arrays instead of halfword or byte arrays

Cache friendly coding

Put code belonging to the same module in the same code section
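
Vectorization-friendly C often starts with unrolling a loop over a word array into independent partial results. A minimal sketch; the function name is hypothetical, and whether a given compiler maps it onto NEON instructions is an assumption to verify in the generated code.

```c
#include <stdint.h>
#include <stddef.h>

/* Sum of n 32-bit words, unrolled by four with independent accumulators.
   The four partial sums do not depend on each other, which mirrors the
   four 32-bit lanes of one 128-bit Q register. */
static uint32_t sum_words_unrolled(const uint32_t *src, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {    /* main loop: 4 elements per iteration */
        s0 += src[i];
        s1 += src[i + 1];
        s2 += src[i + 2];
        s3 += src[i + 3];
    }
    for (; i < n; i++) {            /* tail: remaining 0 to 3 elements */
        s0 += src[i];
    }
    return s0 + s1 + s2 + s3;
}
```

Using a word array here means each element already fills a full lane, so no widening or narrowing step is needed before the add.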


21/24

Code Vectorization


22/24

Code Vectorization


23/24

Code Vectorization


24/24

Code Warrior Demo/Hands on