Slides created by: Professor Ian G. Harris

Efficient C Code

Your C program is not exactly what is executed
Machine code is specific to each microcontroller

Complete understanding of code execution requires

1. Understanding the compiler

2. Understanding the computer architecture

[Figure: C code → Compiler → Machine code → microcontroller]


ARM Instruction Set

An instruction set is the set of all machine instructions supported by the architecture

Load-Store Architecture
• Data processing occurs in registers
• Load and store instructions move data between memory and registers
• [] indicate an address

Ex.
    LDR r0, [r1]    ; moves data into r0 from memory at address in r1
    STR r0, [r1]    ; moves data from r0 into memory at address in r1
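As a rough illustration of the load-store model, the C sketch below increments a value held in memory. The function name `increment_in_memory` is hypothetical, and the instruction sequence in the comment is only what a compiler would typically produce, not guaranteed output.

```c
#include <stdint.h>

/* Hypothetical helper: adds 1 to the word that p points to.
 * On a load-store architecture the addition cannot operate on memory
 * directly, so a compiler typically emits something like:
 *     LDR r1, [r0]      ; load *p into a register
 *     ADD r1, r1, #1    ; do the arithmetic in the register
 *     STR r1, [r0]      ; store the result back to memory
 */
void increment_in_memory(uint32_t *p)
{
    *p = *p + 1;
}
```

A single C statement therefore costs three instructions here, two of which are memory accesses.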


Data Processing Instructions

Move Instructions
    MOV r0, r1            ; moves the contents of r1 into r0
    MOV r0, #3            ; moves the number 3 into r0

Shift Instructions: inputs to operations can be shifted
    MOV r0, r1, LSL #2    ; moves (r1 << 2) into r0
    MOV r0, r1, ASR #2    ; moves (r1 >> 2) into r0, sign extended

Arithmetic Instructions
    ADD r3, r4, r5        ; places (r4 + r5) in r3
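The two shifts above can be modelled in portable C roughly as follows. The names `lsl32` and `asr32` are made up for this sketch. `asr32` avoids applying `>>` to a negative value, because C leaves that result implementation-defined.

```c
#include <stdint.h>

/* LSL: logical shift left. MOV r0, r1, LSL #2 yields r1 * 4.
 * Assumes 0 <= n <= 31. */
uint32_t lsl32(uint32_t x, unsigned n)
{
    return x << n;
}

/* ASR: arithmetic shift right. The sign bit is copied into the
 * vacated upper bits, so negative values stay negative.
 * For negative x, ~x is non-negative and can be shifted safely.
 * Assumes 0 <= n <= 31. */
int32_t asr32(int32_t x, unsigned n)
{
    return (x < 0) ? ~(~x >> n) : (x >> n);
}
```

For example, `lsl32(3, 2)` gives 12 and `asr32(-8, 2)` gives -2.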


Condition Flags

Current Program Status Register (CPSR) contains the status of comparison instructions and some arithmetic instructions

N – negative, Z – zero, C – unsigned carry, V – overflow, Q - saturation

Flags are set as a result of a comparison instruction or an arithmetic instruction with an 'S' suffix

Ex.
    CMP r0, r1         ; sets status bits as a result of (r0 - r1)
    ADDS r0, r1, r2    ; r0 = r1 + r2 and status bits set
    ADD r0, r1, r2     ; r0 = r1 + r2 but no status bits set
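To make the flag definitions concrete, here is a small C model of what `CMP r0, r1` leaves in the N, Z, C and V flags. The `Flags` type and the `cmp_flags` function are hypothetical names for this sketch. The Q flag is omitted because it is not affected by comparisons.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of the four comparison flags. */
typedef struct {
    bool n;  /* result is negative (bit 31 set) */
    bool z;  /* result is zero */
    bool c;  /* no borrow occurred, i.e. r0 >= r1 as unsigned values */
    bool v;  /* signed overflow */
} Flags;

/* Flags produced by CMP r0, r1, which computes r0 - r1
 * and discards the result. */
Flags cmp_flags(uint32_t r0, uint32_t r1)
{
    uint32_t result = r0 - r1;
    Flags f;

    f.n = (result >> 31) != 0;
    f.z = (result == 0);
    f.c = (r0 >= r1);
    /* Overflow: the operands have different signs and the sign of
     * the result differs from the sign of r0. */
    f.v = (((r0 ^ r1) & (r0 ^ result)) >> 31) != 0;
    return f;
}
```

Note that for a subtraction the ARM carry flag means "no borrow", so C is set when r0 is greater than or equal to r1 as unsigned numbers.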


Conditional Execution

All ARM instructions can be executed conditionally based on the CPSR register

Appropriate condition suffix needs to be added to the instruction

NE - not equal, EQ - equal, CC - less than (unsigned), LT - less than (signed)

Ex.
    CMP r0, r1
    ADDNE r3, r4, r5
    BCC test

ADDNE is executed if r0 is not equal to r1
BCC is executed if r0 is less than r1 (unsigned)
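The four suffixes listed above can be expressed as tests on the flags. The sketch below assumes the flags were set by a preceding `CMP`. The `Cpsr` type and the `cond_*` function names are invented for illustration.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the CPSR condition flags. */
typedef struct { bool n, z, c, v; } Cpsr;

/* EQ: Z set, the compared values were equal. */
bool cond_eq(Cpsr f) { return f.z; }

/* NE: Z clear, the compared values differed. */
bool cond_ne(Cpsr f) { return !f.z; }

/* CC: C clear, unsigned "less than" after CMP. */
bool cond_cc(Cpsr f) { return !f.c; }

/* LT: N differs from V, signed "less than" after CMP. */
bool cond_lt(Cpsr f) { return f.n != f.v; }
```

After `CMP r0, r1` with r0 = 3 and r1 = 5, N is set and Z, C and V are clear, so NE, CC and LT all hold while EQ does not.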


Variable Types and Casting

int checksum_v1 (int *data) {
    char i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}

Program computes the sum of the first 64 elements in the data array
Variable i is declared as a char to save space
- i never exceeds 64, so it always fits in 8 bits
- The hope is that it uses less register space and/or stack space

i as a char does NOT save any space
- All stack entries and registers are 32 bits long


Declaring Shorter Variables

void test (void) {
    char i = 255;   // assumes char is unsigned, as on ARM compilers
    int j = 255;

    i++;            // i = 0
    j++;            // j = 256
}

Shorter variables may save space in the heap, but not the stack (data)
Compiler needs to mimic the behavior of a short variable with a long variable

If i is a char, its value overflows after 255

i is contained in a 32 bit register
Compiler must make i's 32 bit register overflow after 255
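One way to picture the extra work is to model the register as a 32-bit value and apply the wrap-around by hand. The function name `increment_as_char` is hypothetical, and the sketch assumes an 8-bit unsigned char.

```c
#include <stdint.h>

/* Hypothetical model of "i++" where i is an 8-bit unsigned char
 * that lives in a 32-bit register. */
uint32_t increment_as_char(uint32_t reg)
{
    reg = reg + 1;       /* the increment itself */
    reg = reg & 0xffu;   /* extra step: discard everything above bit 7 */
    return reg;
}
```

With this model `increment_as_char(255)` yields 0, matching the behaviour of the char variable, at the cost of one additional operation per increment.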


Assembly Code for Checksum

checksum_v1
    MOV r2, r0                  ; r2 = data
    MOV r0, #0                  ; sum = 0
    MOV r1, #0                  ; i = 0
checksum_v1_loop
    LDR r3, [r2, r1, LSL #2]    ; r3 = data[i]
    ADD r1, r1, #1              ; r1 = i+1
    AND r1, r1, #0xff           ; i = (char)r1
    CMP r1, #0x40               ; compare i, 64
    ADD r0, r3, r0              ; sum += r3
    BCC checksum_v1_loop        ; if i<64 goto loop
    MOV pc, r14                 ; return sum

• Argument, data, is passed in r0
• Return address is stored in r14
• Stack is avoided to reduce delay
• LSL #2 scales the index i by 4, the size of an int

• The AND instruction (highlighted on the original slide) is needed to mimic char
• 17% instruction overhead (1 of the 6 loop instructions)

Declaring i as an unsigned int would fix the problem
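The suggested fix might look like this in C. The name `checksum_v2` is an assumption for illustration and does not appear in the slides.

```c
/* Sketch of the fix: the loop counter is an unsigned int, which
 * matches the 32-bit register width, so the compiler has no reason
 * to insert a masking instruction after the increment. */
int checksum_v2(const int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}
```

The function computes the same result as checksum_v1; only the type of the counter changes.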


Shorter Variable Example 2

int checksum_v1 (short *data) {
    unsigned int i;
    short sum = 0;

    for (i = 0; i < 64; i++) {
        sum = (short) (sum + data[i]);
    }
    return sum;
}

data is an array of shorts, not ints
Type cast is needed because + promotes its operands to int, so the result is 32 bits wide

Problems:
1. sum is a short, not an int
2. Halfword (16-bit) loads are limited in the addressing modes they support


Assembly Code for Example 2

checksum_v1
    MOV r2, r0                  ; r2 = data
    MOV r0, #0                  ; sum = 0
    MOV r1, #0                  ; i = 0
checksum_v1_loop
    ADD r3, r2, r1, LSL #1      ; r3 = &data[i]
    LDRH r3, [r3, #0]           ; r3 = data[i]
    ADD r1, r1, #1              ; r1 = i+1
    CMP r1, #0x40               ; compare i, 64
    ADD r0, r3, r0              ; sum += r3
    MOV r0, r0, LSL #16
    MOV r0, r0, ASR #16         ; sum = (short)r0
    BCC checksum_v1_loop        ; if i<64 goto loop
    MOV pc, r14                 ; return sum

LDRH cannot take shifted operands, so the ADD is needed

Sum is signed, so ASR is needed to sign extend
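The LSL #16 / ASR #16 pair keeps the low 16 bits of the register and copies bit 15 into the upper half. A portable C sketch of the same computation is shown below. The name `sign_extend16` is hypothetical, and the XOR-and-subtract formulation is used only to avoid implementation-defined shifts and casts in C; it is not what the compiler emits.

```c
#include <stdint.h>

/* Hypothetical helper: returns the low 16 bits of r interpreted as a
 * signed 16-bit value, which is what the shift pair computes. */
int32_t sign_extend16(uint32_t r)
{
    uint32_t low = r & 0xffffu;             /* keep bits 0..15 */
    /* Flipping bit 15 and subtracting 0x8000 maps 0x8000..0xffff
     * onto -32768..-1 and leaves 0x0000..0x7fff unchanged. */
    return (int32_t)(low ^ 0x8000u) - 0x8000;
}
```

For instance, `sign_extend16(0xffff)` gives -1 and `sign_extend16(0x7fff)` gives 32767.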


Shorter Variable Example 3

int checksum_v1 (short *data) {
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += *(data++);
    }
    return (short) sum;
}

sum is an int
data is incremented, i is not used as an array index
Incrementing data can be part of the LDR instruction


Assembly Code for Example 3

checksum_v1
    MOV r2, #0                  ; sum = 0
    MOV r1, #0                  ; i = 0
checksum_v1_loop
    LDRSH r3, [r0], #2          ; r3 = *(data++)
    ADD r1, r1, #1              ; r1 = i+1
    CMP r1, #0x40               ; compare i, 64
    ADD r2, r3, r2              ; sum += r3
    BCC checksum_v1_loop        ; if i<64 goto loop
    MOV r0, r2, LSL #16
    MOV r0, r0, ASR #16         ; r0 = (short)sum
    MOV pc, r14                 ; return sum

data is incremented as part of the LDRSH instruction (post-indexed addressing)
Cast to short occurs once, outside of the loop


Loops, Fixed Iterations

checksum_v1
    MOV r2, #0                  ; sum = 0
    MOV r1, #0                  ; i = 0
checksum_v1_loop
    LDRSH r3, [r0], #2          ; r3 = *(data++)
    ADD r1, r1, #1              ; r1 = i+1
    CMP r1, #0x40               ; compare i, 64
    ADD r2, r3, r2              ; sum += r3
    BCC checksum_v1_loop        ; if i<64 goto loop
    MOV pc, r14                 ; return sum

A lot of time is spent in loops
Loops are a common target for optimization

3 instructions implement the loop: add, compare, branch
Replace them with: subtract/compare, branch
Result of the subtract can be used to set condition flags


Condensing a Loop

Current loop counts up from 0 to 64
- i is compared to 64 to check for loop termination
Optimized loop can count down from 64 to 0
- i does not need to be explicitly compared to 0
- Add the 'S' suffix to the subtract so it sets condition flags

Ex.
    SUBS r1, r1, #1
    BNE loop

BNE checks the Zero flag in the CPSR
No need for a compare instruction
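At the C level, the same idea is a loop that counts down and terminates on zero. The sketch below uses the hypothetical name `checksum_down`. Whether a particular compiler actually emits SUBS followed by BNE for it is an assumption that should be checked against the generated assembly.

```c
/* Counts down from 64 and stops when i reaches 0. The termination
 * test "i != 0" can reuse the flags set by the decrement, so no
 * separate compare against 64 is required. */
int checksum_down(const int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--) {
        sum += *(data++);
    }
    return sum;
}
```

The loop still executes exactly 64 times; only the direction of the counter and the form of the exit test change.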


Loops, Counting Down

checksum
    MOV r2, r0                  ; r2 = data
    MOV r0, #0                  ; sum = 0
    MOV r1, #0x40               ; i = 64
checksum_loop
    LDR r3, [r2], #4            ; r3 = *(data++)
    SUBS r1, r1, #1             ; i-- and set flags
    ADD r0, r3, r0              ; sum += r3
    BNE checksum_loop           ; if i!=0 goto loop
    MOV pc, r14                 ; return sum

One comparison instruction removed from inside the loop
Possible because SUBS sets the Zero flag when i reaches 0, so the comparison with 0 is free


Loop Unrolling

Loop overhead is the performance cost of implementing the loop
- Ex. SUBS, BNE

For ARM, overhead is 4 clock cycles
- SUBS = 1 clk, BNE = 3 clks

Overhead can be avoided by unrolling the loop
- Repeating the loop body many times

Fixed iteration loops: unrolling can reduce overhead to 0
Variable iteration loops: overhead is greatly reduced


Unrolling, Fixed Iterations

checksum
    MOV r2, r0                  ; r2 = data
    MOV r0, #0                  ; sum = 0
    MOV r1, #0x20               ; i = 32
checksum_loop
    SUBS r1, r1, #1             ; i-- and set flags
    LDR r3, [r2], #4            ; r3 = *(data++)
    ADD r0, r3, r0              ; sum += r3
    LDR r3, [r2], #4            ; r3 = *(data++)
    ADD r0, r3, r0              ; sum += r3
    BNE checksum_loop           ; if i!=0 goto loop
    MOV pc, r14                 ; return sum

Only 32 iterations needed, loop body duplicated
Loop overhead cut in half
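The same unrolling can be written by hand in C. This is a minimal sketch with the hypothetical name `checksum_unrolled`. It relies on the element count (64) being a multiple of the unroll factor (2).

```c
/* The loop body is duplicated, so 32 iterations process all 64
 * elements and the decrement-and-branch overhead is paid half as
 * often. */
int checksum_unrolled(const int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 32; i != 0; i--) {
        sum += *(data++);
        sum += *(data++);
    }
    return sum;
}
```

If the element count were not a multiple of the unroll factor, the leftover elements would need separate handling before or after the loop.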


Unrolling Side Effects

Advantages:
- Reduces loop overhead, improves performance

Disadvantages:
- Increases code size
- Displaces lines from the instruction cache
- Degraded cache performance may offset gains


Register Allocation

Compiler must choose registers to hold all data used
- i, data[i], sum, etc.

If number of vars > number of registers, stack must be used
- very slow

Try to keep number of local variables small
- approximately 12 available registers in ARM
- 16 total registers, but some are reserved (SP, PC, etc.)


Function Calls, Arguments

ARM passes the first 4 arguments through r0, r1, r2, and r3
Stack is only used if 5 or more arguments are used
Keep number of arguments <= 4
Arguments can be merged into structures which are passed by reference

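#include <math.h>

typedef struct {
    float x;
    float y;
    float z;
} Point;

float distance (Point *a, Point *b) {
    float t1, t2;

    /* ^ is XOR in C, so the squares are written as products */
    t1 = (a->x - b->x) * (a->x - b->x);
    t2 = (a->y - b->y) * (a->y - b->y);
    return sqrt(t1 + t2);
}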

Pass two pointers rather than six floats


Preserving Registers

Caller must preserve registers that the callee might corrupt
Registers are preserved by writing them to memory and reading them back later

Example:
- Function foo() calls function bar()
- Both foo() and bar() use r4 and r5
- Before the call, foo() writes registers to memory (STR)
- After the call, foo() reads memory back (LDR)

If foo() and bar() are in different .c files, compiler will preserve all corruptible registers
If foo() and bar() are in the same file, compiler will only save corrupted registers


Function Calls, Inlining

Code for a called function can be inserted into the code of the caller

int bar(int y) {
    return(y+3);
}

int foo(int x) {
    int z;

    z = bar(x);
    return(z*2);
}

/* foo() after bar() has been inlined */
int foo_inline(int x) {
    int z;

    z = x + 3;
    return(z*2);
}

Machine code is inlined, not the C code
Code size is increased, works well for small functions
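In C99 and later, a small function can be marked `static inline` to suggest this transformation to the compiler. The sketch below uses the made-up names `bar_inl` and `foo_hint`. The keyword is only a hint, so whether the call is actually inlined depends on the compiler and its optimization settings.

```c
/* Small helper marked static inline: the compiler may expand its
 * body at each call site instead of emitting a branch. */
static inline int bar_inl(int y)
{
    return y + 3;
}

/* Same computation as foo(); after inlining the machine code would
 * resemble that of foo_inline(). */
int foo_hint(int x)
{
    int z = bar_inl(x);
    return z * 2;
}
```

The source stays split into two functions, which keeps it readable, while the generated code can avoid the call overhead.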