Slides created by: Professor Ian G. Harris
Efficient C Code
Your C program is not exactly what is executed Machine code is specific to each ucontroller
Complete understanding of code execution requires
1. Understanding the compiler
2. Understanding the computer architecture
C code Machine codeCompiler ucontroller
Slides created by: Professor Ian G. Harris
ARM Instruction Set
An instruction set is the set of all machine instructions supported by the architecture
Load-Store Architecture • Data processing occurs in registers• Load and store instructions move data between memory and
registers• [] indicate an addressEx. LDR r0, [r1] moves data into r0 from memory at
address in r1 STR r0, [r1] moves data from r0 into memory at
address in r1
Slides created by: Professor Ian G. Harris
Data Processing Instructions
Move InstructionsMOV r0, r1 moves the contents of r1 into r0MOV r0, #3 moves the number 3 into r0
Shift Instructions – inputs to operations can be shiftedMOV r0, r1, LSL #2 moves (r1 << 2) into r0MOV r0, r1, ASR #2 moves (r1 >> 2) into r0, sign extend
Arithmetic InstructionsADD r3, r4, r5 places (r4 + r5) in r3
Slides created by: Professor Ian G. Harris
Condition Flags
Current Program Status Register (CPSR) contains the status of comparison instructions and some arithmetic instructions
N – negative, Z – zero, C – unsigned carry, V – overflow, Q - saturation
Flags are set as a result of a comparison instruction or an arithmetic instruction with an 'S' suffix
Ex. CMP r0, r1 – sets status bits as a result of (r0 – r1)ADDS r0, r1, r2 – r0 = r1 + r2 and status bits setADD r0, r1, r2 – r0 = r1 + r2 but no status bits set
Slides created by: Professor Ian G. Harris
Conditional Execution
All ARM instructions can be executed conditionally based on the CPSR register
Appropriate condition suffix needs to be added to the instruction
NE – not equal, EQ – equal, CC – less than (unsigned), LT less than (signed)
Ex. CMP r0, r1ADDNE r3, r4, r5BCC test
ADDNE is executed if r0 not equal to r1BCC is executed if r0 is less than r1
Slides created by: Professor Ian G. Harris
Variable Types and Casting
int checksum_v1 (int *data) {char i;int sum=0;
for (i=0; i<64; i++) {sum += data[I];
}return sum;
}
Program computes the sum of the first 64 elts in the data arrayVariable i is declared as a char to save space
i always less than 8 bits long
May use less register space and/or stack space
i as a char does NOT save any spaceAll stack entries and registers are 32 bits long
Slides created by: Professor Ian G. Harris
Declaring Shorter Variables
int test (void) {char i=255;int j=255;
i++; // i = 0j++; // j = 256
}
Shorter variables may save space in the heap, but not the stack (data)Compiler needs to mimic the behavior of a short variable with a long variable
If i is a char, its value overflows after 255
i is contained in a 32 bit registerCompiler must make i’s 32 bit register overflow after 255
Slides created by: Professor Ian G. Harris
Assembly Code for Checksum
checksum_v1MOV r2, r0 ; r2 = dataMOV r0, #0 ; sum = 0MOV r1, #0 ; i = 0
checksum_v1_loopLDR r3, [r2, r1, LSL #2] ; r3 = data[I]ADD r1, r1, #1 ; r1 = i+1AND r1, r1, #0xff ; i = (char)r1CMP r1, #0x40 ; compare i, 64ADD r0, r3, r0 ; sum += r3BCC checksum_v1_loop ; if i<64 goto loopMOV pc, r14
•Argument, *data, passed in r0•Return address stored in r14•Stack avoided to reduce delay•LSL needed to increment by 4
•Highlighted instruction needed to mimic char•17% instruction overhead
Declaring i as an unsigned int would fix the problem
Slides created by: Professor Ian G. Harris
Shorter Variable Example 2
int checksum_v1 (short *data) {unsigned int i;short sum=0;
for (i=0; i<64; i++) {sum = (short) (sum + data[i]);
}return sum;
}
Data is an array of shorts, not intsType cast is needed because + only takes 32-bit args
Problems: 1. sum is a short, not int 2. Loading a halfword (16-bits) is limited
Slides created by: Professor Ian G. Harris
Assembly Code for Example 2
checksum_v1MOV r2, r0 ; r2 = dataMOV r0, #0 ; sum = 0MOV r1, #0 ; i = 0
checksum_v1_loopADD r3, r2, r1, LSL #1 ; r3 = &data[i]LDRH r3, [r3, #0] ; r3 = data[i]ADD r1, r1, #1 ; r1 = i+1CMP r1, #0x40 ; compare i, 64ADD r0, r3, r0 ; sum += r3MOV r0, r0, LSL #16MOV r0, r0, ASR #16 ; sum = (short)r0BCC checksum_v1_loop ; if i<64 goto loopMOV pc, r14 ; return sum
LDRH cannot take shifted operands, so the ADD is needed
Sum is signed, so ASR is needed to sign extend
Slides created by: Professor Ian G. Harris
Shorter Variable Example 3
int checksum_v1 (short *data) {unsigned int i;int sum=0;
for (i=0; i<64; i++) {sum += *(data++);
}return (short) sum;
}
sum is an intdata is incremented, i is not used as an array indexIncrementing data can be part of the LDR instruction
Slides created by: Professor Ian G. Harris
Assembly Code for Example 3
checksum_v1MOV r2, #0 ; sum = 0MOV r1, #0 ; i = 0
checksum_v1_loopLDRSH r3, [r0], #2 ; r3 = *(data++)ADD r1, r1, #1 ; r1 = i+1CMP r1, #0x40 ; compare i, 64ADD r2, r3, r2 ; sum += r3BCC checksum_v1_loop ; if i<64 goto loopMOV r0, r2, LSL #16MOV r0, r0, ASR #16 ; r0 = (short)sumMOV pc, r14 ; return sum
*data is incremented as part of LDRSH instructionCast to short occurs once, outside of the loop
Slides created by: Professor Ian G. Harris
Loops, Fixed Iterations
checksum_v1MOV r2, #0 ; sum = 0MOV r1, #0 ; i = 0
checksum_v1_loopLDRSH r3, [r0], #2 ; r3 = *(data++)ADD r1, r1, #1 ; r1 = i+1CMP r1, #0x40 ; compare i, 64ADD r2, r3, r2 ; sum += r3BCC checksum_v1_loop ; if i<64 goto loopMOV pc, r14 ; return sum
A lot of time is spent in loopsLoops are a common target for optimization
3 instructions implement loop: add, compare, branchReplace them with: subtract/compare, branchResult of the subtract can be used to set condition flags
Slides created by: Professor Ian G. Harris
Condensing a Loop
Current loop counts up from 0 to 64i is compared to 64 to check for loop terminationOptimized loop can count down from 64 to 0i does not need to be explicitly compared to 0
– Add the 'S' suffix to the subtract so is sets condition flags
Ex. SUBS r1, r1, #1BNE loop
BNE checks Zero flag in CPSRNo need for a compare instruction
Slides created by: Professor Ian G. Harris
Loops, Counting Down
checksumMOV r2, r0 ; r2 = dataMOV r0, #0 ; sum = 0MOV r1, #0x40 ; i = 64
checksum_loopLDR r3, [r2], #4 ; r3 = *(data++)SUBS r1, r1, #1 ; i-- and set flagsADD r0, r3, r0 ; sum += r3BCC checksum_loop ; if i!=0 goto loopMOV pc, r14 ; return sum
One comparison instruction removed from inside the loopPossible because ARM always compares to 0
Slides created by: Professor Ian G. Harris
Loop Unrolling
Loop overhead is the performance cost of implementing the loop– Ex. SUBS, BCC
For ARM, overhead is 4 clock cycles – SUBS = 1 clk, BCC = 3 clks
Overhead can be avoided by unrolling the loop– Repeating the loop body many times
Fixed iteration loops, unrolling can reduce overhead to 0Variable iteration loops, overhead is greatly reduced
Slides created by: Professor Ian G. Harris
Unrolling, Fixed Iterations
checksumMOV r2, r0 ; r2 = dataMOV r0, #0 ; sum = 0MOV r1, #0x40 ; i = 32
checksum_loopSUBS r1, r1, #1 ; i-- and set flagsLDR r3, [r2], #4 ; r3 = *(data++)ADD r0, r3, r0 ; sum += r3LDR r3, [r2], #4 ; r3 = *(data++)ADD r0, r3, r0 ; sum += r3BCC checksum_loop ; if i!=0 goto loopMOV pc, r14 ; return sum
Only 32 iterations needed, loop body duplicatedLoop overhead cut in half
Slides created by: Professor Ian G. Harris
Unrolling Side Effects
Advantages:– Reduces loop overhead, improves performance
Disadvantages:– Increases code size– Displaces lines from the instruction cache– Degraded cache performance may offset gains
Slides created by: Professor Ian G. Harris
Register Allocation
Compiler must choose registers to hold all data used- i, data[i], sum, etc.
If number of vars > number of registers, stack must be used- very slow
Try to keep number of local variables small- approximately 12 available registers in ARM- 16 total registers but some may be used (SP, PC, etc.)
Slides created by: Professor Ian G. Harris
Function Calls, Arguments
ARM passes the first 4 arguments through r0, r1, r2, and r3Stack is only used if 5 or more arguments are usedKeep number of arguments <= 4Arguments can be merged into structures which are passed by reference
typedef struct {float x;float y;float z;
} Point;
float distance (point *a, point *b) {float t1, t2;
t1 = (a->x – b->x)^2;t2 =(a->y – b->y)^2;return(sqrt(t1 + t2));
}
Pass two pointers rather than six floats
Slides created by: Professor Ian G. Harris
Preserving Registers
Caller must preserve registers that the callee might corruptRegisters are preserved by writing them to memory and reading them back later
Example:– Function foo() calls function bar()– Both foo() and bar() use r4 and r5– Before the call, foo() writes registers to memory (STR)– After the call, foo() reads memory back (LDR)
If foo() and bar() are in different .c files, compiler will preserve all corruptible registersIf foo() and bar() are in the same file, compiler will only save corrupted registers
Slides created by: Professor Ian G. Harris
Function Calls, Inlining
Code for a called function can be inserted into the code of the caller
int foo(int x) {int z;
z = bar(x);return(z*2);
}
int bar(int y) {return(y+3);
}
int foo_inline(int x){int z;
z = x + 3;return(z*2);
}
Machine code is inlined, not the C codeCode size is increased, works well for small functions