7/31/2019 ARM Cortex Coding
Getting started on Cortex A8
Instruction Set
Instruction Sets
32-bit ARM instruction set
16-bit Thumb instruction set
Thumb-2 instruction set (16-bit Thumb extended with 32-bit instructions)
A trade-off between the two above. Unlike ARM instructions, most 32-bit Thumb-2 instructions are unconditional.
Advanced SIMD (NEON) architecture
Performs the same operation on multiple data items in parallel.
Instructions operate on vectors held in 64-bit or 128-bit registers.
Other instruction sets
ThumbEE instruction set
Jazelle Extension
Register Set (ARM and Neon)
33 general-purpose 32-bit registers
In user mode only R0 to R15 are available
R14 -> Link register: holds the return address when a branch with link (BL) is executed
R15 -> Program counter
Seven 32-bit status registers
Hold the status flags and the processor mode
Neon Register Bank
View 1: thirty-two 64-bit (doubleword) registers, D0-D31.
View 2: sixteen 128-bit (quadword) registers, Q0-Q15.
The two views can be mixed: a combination of 128-bit and 64-bit registers, Q0-Q15 and D0-D31.
ARM Instruction set
All ARM instructions are 32 bits long
Branch instructions
Data processing instructions
Register load and store instructions
Multiple register load and store instructions
Status register access instructions (OOS)
Coprocessor instructions (OOS)
ARM Instruction set
Branch Instructions
branch backwards to form loops
branch forward in conditional structures
branch to subroutines
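e.g.
B label1 ; unconditional branch to label1
BL label1 ; branch with link: the return address is saved in R14
BEQ {pc}+4 ; branch taken only if the Z flag is set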
ARM Instruction set
Data processing instructions
Add or multiply two registers
Add register with constant
Bitwise operations
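Operate on 8-bit, 16-bit and 32-bit data
Long multiply instructions give a 64-bit result in two registers
e.g.
ADD r2, r1, r3 ; r2 = r1 + r3
SUBS r8, r6, #240 ; r8 = r6 - 240, sets the flags on the result
RSB r4, r4, #1280 ; subtracts the contents of r4 from 1280
AND r9, r2, #0xFF00 ; r9 = r2 AND 0xFF00
ORREQ r2, r0, r5 ; r2 = r0 OR r5, executed only if the Z flag is set
MOVS r3, r2, LSR #3 ; r3 = r2 shifted right logically by 3, sets the flags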
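The long multiply and reverse subtract above can be modelled in C. This is a minimal sketch, not how the hardware does it. The names reg_pair, smull_model and rsb_model are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Hypothetical model of a signed long multiply (e.g. SMULL):
   the 64-bit product comes back as two 32-bit halves,
   the way it would sit in two registers (RdLo, RdHi). */
typedef struct {
    uint32_t lo;  /* low 32 bits of the product  */
    uint32_t hi;  /* high 32 bits of the product */
} reg_pair;

reg_pair smull_model(int32_t a, int32_t b)
{
    /* Widen both operands before multiplying so nothing is lost. */
    int64_t product = (int64_t)a * (int64_t)b;
    uint64_t bits = (uint64_t)product;
    reg_pair r;
    r.lo = (uint32_t)(bits & 0xFFFFFFFFu);
    r.hi = (uint32_t)(bits >> 32);
    return r;
}

/* Hypothetical model of RSB Rd, Rn, #imm: Rd = imm - Rn. */
uint32_t rsb_model(uint32_t rn, uint32_t imm)
{
    return imm - rn;
}
```

In plain C on a 32-bit target, writing the multiply as (int64_t)a * b is usually enough for a compiler to pick a long multiply, though that depends on the compiler and its options.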
ARM Instruction set
Register load and store instructions
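Load or store a single register: 8, 16 or 32 bits
Load doublewords
Byte and halfword loads can be zero filled or sign extended
e.g.
STMFD r13!, {r0-r5} ; push r0-r5 onto a full descending stack
LDMFD r13!, {r0-r5} ; pop r0-r5
PUSH {r5-r7,lr}
POP {r5-r7,pc} ; restores r5-r7 and returns by loading pc
LDR r3, [r0], #4 ; load from r0, then r0 is incremented by 4
LDR r3, [r0], r4 ; load from r0, then r0 is incremented by r4
LDR r3, [r0, #0x2C] ; load from r0 + 0x2C, r0 unchanged
LDR r3, [r0, r4, LSL #2] ; load from r0 + (r4 << 2), r0 unchanged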
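The addressing modes above map closely onto C pointer idioms. This is a rough sketch with hypothetical function names, assuming 32-bit word data. It models what each load computes, not the code a compiler would emit.

```c
#include <stdint.h>

/* Model of LDR r3, [r0], #4 (post-indexed):
   load from the address in r0, then advance r0 by 4 bytes. */
uint32_t ldr_post_index(const uint32_t **r0)
{
    uint32_t r3 = **r0;  /* load from the current address          */
    *r0 += 1;            /* one uint32_t element, i.e. 4 bytes     */
    return r3;
}

/* Model of LDR r3, [r0, r4, LSL #2]:
   load from r0 + (r4 << 2); r0 itself is left unchanged. */
uint32_t ldr_scaled_offset(const uint32_t *r0, uint32_t r4)
{
    return r0[r4];       /* element index r4 = byte offset r4 * 4  */
}

/* Zero-filled byte load (as with LDRB). */
uint32_t ldrb_model(const uint8_t *addr)
{
    return (uint32_t)*addr;
}

/* Sign-extended byte load (as with LDRSB). */
int32_t ldrsb_model(const uint8_t *addr)
{
    int32_t v = *addr;
    return (v & 0x80) ? v - 256 : v;  /* propagate bit 7 upwards */
}
```

The C idiom `*p++` over a word array is the usual source-level counterpart of the post-indexed load.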
ARM Instruction set
Conditional Execution
Flags
N Set when the result of the operation was Negative.
Z Set when the result of the operation was Zero.
C Set when the operation resulted in a Carry.
V Set when the operation caused oVerflow.
Most ARM instructions can be executed conditionally
e.g.
ADD r0, r1, r2 ; r0 = r1 + r2, don't update flags
ADDS r0, r1, r2 ; r0 = r1 + r2, and update flags
ADDSCS r0, r1, r2 ; if the C flag is set then r0 = r1 + r2, and update flags
CMP r0, r1 ; update flags based on r0 - r1
Why are conditional instructions needed if branch instructions are available?
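The flag definitions above can be made concrete with a small C model of ADDS. This is a sketch only. The flags struct and the names adds_model and add_if_carry are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the four condition flags. */
typedef struct {
    int n;  /* result negative (bit 31 set) */
    int z;  /* result zero                  */
    int c;  /* unsigned carry out           */
    int v;  /* signed overflow              */
} flags;

/* Model of ADDS r0, r1, r2: add and update N, Z, C, V. */
uint32_t adds_model(uint32_t a, uint32_t b, flags *f)
{
    uint32_t r = a + b;                 /* wraps modulo 2^32 */
    f->n = (int)((r >> 31) & 1u);
    f->z = (r == 0u);
    f->c = (r < a);                     /* wrapped, so a carry occurred */
    /* Overflow: operands share a sign and the result's sign differs. */
    f->v = (int)((~(a ^ b) & (a ^ r)) >> 31);
    return r;
}

/* Model of a conditional add with the CS condition:
   the add happens only when C is set, otherwise r0 keeps its value. */
uint32_t add_if_carry(uint32_t r0, uint32_t r1, uint32_t r2, const flags *f)
{
    return f->c ? r1 + r2 : r0;
}
```

Note that add_if_carry contains no branch at the instruction level when compiled to a conditional instruction. That is the point of conditional execution: a short if-body can run without disturbing the pipeline.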
ARM Instruction set
Suffix details
Neon Instruction set
Vector Duplicate
VDUP{cond}.size Qd, Dm[x]
cond is an optional condition code
size must be 8, 16, or 32
Qd specifies the destination register for a quadword operation
Dm[x] specifies the NEON scalar.
Vector Add
VADD.datatype {Qd}, Qn, Qm
VADD.datatype {Dd}, Dn, Dm
datatype -> I8, I16, I32 for VADD and VSUB
datatype -> S64, U64 for VQADD or VQSUB (depends on the instruction, refer to the TRM)
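A scalar C model can show what these vector instructions do lane by lane. This is a sketch under assumptions: 8-bit lanes are used for brevity, and the function names are hypothetical. Real code would use NEON instructions or intrinsics, not these loops.

```c
#include <stdint.h>

#define D_LANES 8   /* a 64-bit D register holds eight 8-bit lanes     */
#define Q_LANES 16  /* a 128-bit Q register holds sixteen 8-bit lanes  */

/* Model of VADD.I8 Dd, Dn, Dm: lane-wise add, each lane wraps modulo 256. */
void vadd_i8_model(uint8_t d[D_LANES], const uint8_t n[D_LANES],
                   const uint8_t m[D_LANES])
{
    for (int i = 0; i < D_LANES; i++)
        d[i] = (uint8_t)(n[i] + m[i]);
}

/* Model of VQADD.U8 Dd, Dn, Dm: lane-wise add, each lane saturates at 255. */
void vqadd_u8_model(uint8_t d[D_LANES], const uint8_t n[D_LANES],
                    const uint8_t m[D_LANES])
{
    for (int i = 0; i < D_LANES; i++) {
        unsigned sum = (unsigned)n[i] + (unsigned)m[i];
        d[i] = (sum > 255u) ? 255u : (uint8_t)sum;
    }
}

/* Model of VDUP.8 Qd, Dm[x]: copy scalar lane x of Dm into every lane of Qd. */
void vdup_8_q_model(uint8_t q[Q_LANES], const uint8_t m[D_LANES], int x)
{
    for (int i = 0; i < Q_LANES; i++)
        q[i] = m[x];
}
```

The difference between the first two functions is the difference between VADD and VQADD: the same lanes and inputs, but wrap-around in one case and saturation in the other.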
Neon Instruction set (e.g.)
Effective Assembly coding
Branch prediction
Maximize usage of conditional instructions instead of branches
The branch predictor comprises:
a 512-entry 2-way set associative Branch Target Buffer (BTB)
a 4096-entry Global History Buffer (GHB)
an 8-entry return stack
Pipeline model - instruction cycle timing
Fetch, decode, execute: 13-stage main pipeline
Load/store
MAC
ALU
Neon pipeline: 10 stages
Removing interlocks/stalls
Maximize usage of SIMD/Neon Instructions
Maximize Dual Issue
Effective Assembly coding
How to read ARM instruction tables
ADDEQ R0, R1, R2, LSL #10
Effective Assembly coding
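Interlock example (refer to the table on the next slide)
SMLAL R0, R1, R2, R3
ADD R7, R8, R0 ; needs R0 from the SMLAL: four cycles wasted
Alternate approach: fill the stall with independent instructions
SMLAL R0, R1, R2, R3
MOV r4, #0x6
ADD r5, r4, r5
MOV r6, #0x6
LDR r5, [r6, #0x2C]
ADD R7, R8, R0 ; R0 is ready by now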
Effective Assembly coding
[Table: instruction cycle timings; not recovered]
Effective Assembly coding
Dual Issue
Two issue pipelines -> Pipeline 0 and Pipeline 1
Execution pipelines: LS pipeline, Multiply pipeline, ALU pipeline
Multiply instructions always issue in pipeline 0
The first instruction always issues in pipeline 0 and the second instruction, if present, issues in pipeline 1
Instructions with the same destination cannot be issued in the same cycle.
Refer to the next slide for more examples.
Dual issue (contd..)
General ARM optimization Techniques
Loop unrolling
Use fixed point arithmetic
Use shifts instead of multiplications and divisions
See if complex calculations can be avoided using table lookup
Minimize the number of arguments of a function
Avoid branches in low level functions
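Three of the techniques above (fixed point arithmetic, shifts, loop unrolling) can be sketched in C. The function names are hypothetical, and the Q15 format is an assumption chosen for illustration. Whether each one actually helps depends on the compiler and should be measured.

```c
#include <stdint.h>

/* Fixed point: Q15 format, real value = raw / 32768.
   The product of two Q15 values is Q30, so shift right by 15.
   Assumes the compiler shifts negative values arithmetically;
   -1.0 * -1.0 is not representable and is not handled here. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)(product >> 15);
}

/* Shifts instead of multiply/divide by powers of two (unsigned values). */
uint32_t times8(uint32_t x) { return x << 3; }
uint32_t div4(uint32_t x)   { return x >> 2; }

/* Loop unrolled by four with separate accumulators.
   Assumes n is a multiple of 4. */
uint32_t sum_unrolled(const uint32_t *a, unsigned n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (unsigned i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

The separate accumulators in sum_unrolled remove the dependency between consecutive additions, which gives the two issue pipelines independent work.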
Assembly functions/files (e.g.)
The first four arguments are passed in r0, r1, r2 and r3
Example of an assembly function
General/Neon optimization Techniques
Code vectorization in C itself
Use word arrays instead of halfword or byte arrays
Cache friendly coding
Put code belonging to the same module in the same code section
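Vectorization in C itself mostly means writing loops a compiler can map onto SIMD lanes. A minimal sketch follows, assuming non-overlapping arrays and a length that is a multiple of 4. The function name add_arrays is hypothetical, and whether the compiler emits Neon code depends on the toolchain and its options.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise add written to be easy to vectorize:
   - restrict tells the compiler the arrays do not overlap
   - word (32-bit) elements, accessed contiguously
   - four elements per iteration, matching four 32-bit lanes
     of a 128-bit register
   - no branches inside the loop body
   Assumes n is a multiple of 4. */
void add_arrays(uint32_t *restrict dst,
                const uint32_t *restrict a,
                const uint32_t *restrict b,
                size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        dst[i]     = a[i]     + b[i];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }
}
```

Walking the arrays sequentially also keeps the accesses cache friendly, since each cache line fetched is used in full.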
Code Vectorization
CodeWarrior demo / hands-on