25
Automatic Application Automatic Application Profiling Profiling Lecture 22

Automatic Application Profiling

Embed Size (px)

DESCRIPTION

Automatic Application Profiling. Lecture 22. Today – What parts of the code are slow?. Amdahl’s law How to get the processor to tell us what’s taking the most time – Statistical program counter sampling. Amdahl’s Law. Gene Amdahl - “Optimize the common case” - PowerPoint PPT Presentation

Citation preview

Page 1: Automatic Application Profiling

Automatic Application ProfilingAutomatic Application Profiling

Lecture 22

Page 2: Automatic Application Profiling

Today – What parts of the code are slow?Today – What parts of the code are slow?• Amdahl’s law

• How to get the processor to tell us what’s taking the most time – Statistical program counter sampling

Page 3: Automatic Application Profiling

Amdahl’s LawAmdahl’s Law

• Gene Amdahl - “Optimize the common case”• Double the speed of ¼ of the program:

• Quadruple the speed of ¼ of the program:

Enhanced

EnhancedNormal Speedup

FractionFraction

Speedup

1

14.1

2

25.075.0

1

Speedup

23.1

4

25.075.0

1

Speedup

Page 4: Automatic Application Profiling

Work on the Slow PartWork on the Slow Part

• Double the speed of ¾ of the program:

• Quadruple the speed of ¾ of the program:

Enhanced

EnhancedNormal Speedup

FractionFraction

Speedup

1

6.125.0

2

75.01

Speedup

29.225.0

4

75.01

Speedup

Page 5: Automatic Application Profiling

How Do We Find the Slow PartsHow Do We Find the Slow Parts• Option A: Measure the amount of time each region takes to execute

– Codeunsigned long times[NUM_FUNCTIONS];

void my_function(int i) { t_start = get_ticks(); /* function body */ times[THIS_FUNCTION_NUM] += get_ticks() - t_start;}

• Pros– Exact. Can get single cycle accuracy if needed.

• Cons– Tedious. Must add code to each function to be monitored.

– Need access to source code, which may be a problem for library functions.

Page 6: Automatic Application Profiling

Program Counter SamplingProgram Counter Sampling• Option B: Periodically examine the PC to see what’s running

– Result shows average fraction of time spent executing a region of code

• Supporting data structure: table of region information– Starting and ending addresses:

defined before sampling– Execution counts:updated

during sampling

• Sampling– Use a timer to interrupt application periodically– Within ISR

• Read PC off of stack• Examine table of region addresses to determine currently executing region N• Increment entry N of execution count table• Also increment a total number of ticks variable (to reveal out-of-range PC

instances)

– At end• Provide execution count table to user (via file, serial port, debugger, etc.)

typedef struct { char Name[PROFILE_NAME_SIZE]; unsigned long Start, End; unsigned Count;} PROFILE_T;

Page 7: Automatic Application Profiling

Configure Timer to Generate Periodic InterruptConfigure Timer to Generate Periodic Interrupt

• Call this function with desired sampling frequency in Hz

void Init_Profiling(unsigned samp_freq) { unsigned long divider; // set up timer A0 to interrupt at samp_freq ta0mr = 0x00; divider = ((unsigned long)MAIN_CLOCK)/samp_freq; if (divider > 0x0ffffl) ta0 = 0x0ffff; else ta0 = (unsigned) (divider & 0x0ffff); DISABLE_IRQ; ta0ic = 1; ENABLE_IRQ; ta0s = 1; }

Page 8: Automatic Application Profiling

Profiler Interrupt Service RoutineProfiler Interrupt Service Routine

• Don’t forget to register ISR in vector table!

#pragma INTERRUPT/B profile_intrvoid profile_intr(void) { unsigned char PC_H; unsigned int PC_ML; unsigned long PC; unsigned char i; /* Get PC from stack */ _asm("mov.w 2[FB], $$[FB]", PC_ML); _asm("mov.b 5[FB], $$[FB]", PC_H); PC = PC_H; PC <<= 16; PC += PC_ML; profile_ticks++; /* look up function in table and inc. counter */ for (i=0; i<NUM_PROFILE_REGIONS; i++) { if ((PC >= profiles[i].Start) && (PC <= profiles[i].End)) { profiles[i].Count++; return; } }}

Page 9: Automatic Application Profiling

Configure Profile Table with Region InformationConfigure Profile Table with Region Information

• Where do we find region addresses? Next page.

PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"main.c", UL 0x0f0100, UL 0x0f0299, 0}, {"profile.c", UL 0x0f029a, UL 0x0f037f, 0}, {"skp26.c", UL 0xf0380, UL 0xf0613, 0}, {"skp_lcd.c", UL 0x0f0614, UL 0x0f0917, 0}, {"library", UL 0x0f0918, UL 0x0f290d, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, };

PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"main", UL main, UL (main+0x0199), 0}, {"LCD_Erase", UL LCD_Erase_FB, UL (LCD_Erase_FB+0x0240), 0}, {"sin", UL sin, UL (sin+0x0138), 0}, {"LCD_Plot_in_FB", UL LCD_Plot_in_FB, UL (LCD_Plot_in_FB+0x02e3), 0}, {"LCD_Display_FB", UL LCD_Display_FB, UL (LCD_Display_FB+0x0302), 0}, {"DisplayDelay", UL DisplayDelay, UL (DisplayDelay+0x01ad), 0}, {"LCD_write", UL LCD_write, UL (LCD_write+0x017f), 0}, {"", UL 0, UL 0, 0} };

To profile modules (source files)

To profile functions

Page 10: Automatic Application Profiling

Finding the Region AddressesFinding the Region Addresses• Get module addresses from linker’s map file (in debug directory)

• Get function lengths (if needed) from .LST file– Second column is start address of each assembly instruction– Subtract function’s first address from its last address to find length

506 ;## # FUNCTION LCD_Erase_FB507 ;## # FRAME AUTO (y) size 2, offset -4508 ;## # FRAME AUTO (x) size 2, offset -2509 ;## # ARG Size(0) Auto Size(4) Context Size(5)510 511 .align512 ;## # C_SRC : void LCD_Erase_FB(void) {513 .glb _LCD_Erase_FB514 00212 _LCD_Erase_FB:515 00212 7CF204 enter #04H516 ;## # C_SRC : for (x=0; x<8; x++)517 00215 D90BFE mov.w #0000H,-2[FB] ; x P(etc.)539 ;## # C_SRC : }540 0023F 7DF2 exitd541 00241 E7:

program REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD

Page 11: Automatic Application Profiling

Overview of Profiling ApproachOverview of Profiling Approach• Start with the big picture (rough details) and use that to

determine where to look next• Profiling sequence

– module-level (file-level)

– function-level within the most common module

– basic block-level within the most common function

Page 12: Automatic Application Profiling

Detailed Steps to using profile.c/hDetailed Steps to using profile.c/h• Enable list file creation for each C source file

– HEW: Options -> Renesas M16C Standard Toolchain• C Tab -> Category: List. Check –dS and –dSL boxes

• Fill in array profiles with region addresses (e.g. names of functions), dummy lengths, and zero counts

• Compile– May need to add function prototypes if profiles table is declared before the functions

are

• Update profiles array with correct starting and ending addresses

• Recompile

• Run

• Examine profiles after running long enough

Page 13: Automatic Application Profiling

How Long is Enough?How Long is Enough?• Complex statistical question

– The statistician I asked said “it depends” and changed the subject• So, run it until the digits you care about stop changing• Example: Module-level profiling

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

10 100 1000 10000 100000

Number of Samples

Fra

cti

on

of

Pro

gra

m T

ime

mainprofileskp26skp_lcdlibrary

Page 14: Automatic Application Profiling

Where does the Lab 4 Skeleton Spend its Time?Where does the Lab 4 Skeleton Spend its Time?

We know there are delay loops executed every time the MCU writes to the LCD, but let’s verify how bad they are

• Start with modules

• Then look at functions in module

• Then look at basic blocks within function

DisplayString(LCD_LINE1," Lab #4 "); DisplayString(LCD_LINE2," Starter"); GRN_LED = LED_ON;

while (1) { for (f=6.0; f>0.0; f -= 0.4) { LCD_Erase_FB(); for (i=0; i<DISP_WIDTH_PIXELS; i++)

LCD_Plot_in_FB((unsigned char)i, (unsigned char) (3.5*(sin(i/f)+1.0)), 1);

LCD_Display_FB(LCD_LINE1); } }

Page 15: Automatic Application Profiling

Step One: Profile ModulesStep One: Profile Modules

• Define a profile region per module, and one for all the library functions

# SECTION ATR TYPE START LENGTH ALIGN MODULENAMEprogram REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD REL CODE 0F161C 0000A7 2 _F8LE REL CODE 0F16C4 000069 2 _F8LTOR REL CODE 0F172E 0002DE 2 _F8MUL REL CODE 0F1A0C 0000BA 2 _F8TOF4 REL CODE 0F1AC6 000025 2 _F8TOI4U REL CODE 0F1AEC 000192 2 _FTOL REL CODE 0F1C7E 00004D 2 _I4DIVU REL CODE 0F1CCC 000022 2 _I4TOF4 REL CODE 0F1CEE 0000FD 2 _LTOF REL CODE 0F1DEC 000138 2 SIN REL CODE 0F1F24 00035D 2 TAN REL CODE 0F2282 000058 2 _F4LTOR REL CODE 0F22DA 000060 2 _F4RTOL REL CODE 0F233A 0002E4 2 _F8DIV REL CODE 0F261E 00007C 2 _F8EQ REL CODE 0F269A 0000A7 2 _F8LT REL CODE 0F2742 00007C 2 _F8NE REL CODE 0F27BE 000066 2 _F8RTOL REL CODE 0F2824 00002E 2 _F8SUB REL CODE 0F2852 000025 2 _F8TOI4 REL CODE 0F2878 000073 2 _I4MOD REL CODE 0F28EC 000022 2 _I4TOF8

Page 16: Automatic Application Profiling

Step One: Profile ModulesStep One: Profile Modules

PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"main.c", UL 0x0f0100, UL 0x0f0299, 0}, {"profile.c", UL 0x0f029a, UL 0x0f037f, 0}, {"skp26.c", UL 0x0f0380, UL 0x0f0613, 0}, {"skp_lcd.c", UL 0x0f0614, UL 0x0f0917, 0}, {"library", UL 0x0f0918, UL 0x0f290d, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, };

Page 17: Automatic Application Profiling

Step 1 ResultsStep 1 Results

• Surprise! The LCD functions aren’t taking up most of the processor’s time! The library functions are instead

Execution Time per Module

main

profile

skp26

skp_lcd

library

other

Count Timemain 70 0.21%profile 0 0.00%skp26 0 0.00%skp_lcd 9642 29.11%library 23415 70.68%other 1 0.00%

Page 18: Automatic Application Profiling

Step Two: Profile LibraryStep Two: Profile Library

• We only have eight entries in our table, so let’s split up the library into eight regions of about three functions each

# SECTION ATR TYPE START LENGTH ALIGN MODULENAMEprogram REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD REL CODE 0F161C 0000A7 2 _F8LE REL CODE 0F16C4 000069 2 _F8LTOR REL CODE 0F172E 0002DE 2 _F8MUL REL CODE 0F1A0C 0000BA 2 _F8TOF4 REL CODE 0F1AC6 000025 2 _F8TOI4U REL CODE 0F1AEC 000192 2 _FTOL REL CODE 0F1C7E 00004D 2 _I4DIVU REL CODE 0F1CCC 000022 2 _I4TOF4 REL CODE 0F1CEE 0000FD 2 _LTOF REL CODE 0F1DEC 000138 2 SIN REL CODE 0F1F24 00035D 2 TAN REL CODE 0F2282 000058 2 _F4LTOR REL CODE 0F22DA 000060 2 _F4RTOL REL CODE 0F233A 0002E4 2 _F8DIV REL CODE 0F261E 00007C 2 _F8EQ REL CODE 0F269A 0000A7 2 _F8LT REL CODE 0F2742 00007C 2 _F8NE REL CODE 0F27BE 000066 2 _F8RTOL REL CODE 0F2824 00002E 2 _F8SUB REL CODE 0F2852 000025 2 _F8TOI4 REL CODE 0F2878 000073 2 _I4MOD REL CODE 0F28EC 000022 2 _I4TOF8

lib 1

lib 2

lib 3

lib 4

lib 5

lib 6

lib 7

lib 8

Page 19: Automatic Application Profiling

Step Two: Profile LibraryStep Two: Profile Library

PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"lib 1", UL 0x0f0918, UL 0x0f161b, 0}, {"lib 2", UL 0x0f161c, UL 0x0f1a0b, 0}, {"lib 3", UL 0x0f1a0c, UL 0x0f1c7d, 0}, {"lib 4", UL 0x0f1c7e, UL 0x0f1deb, 0}, {"lib 5", UL 0x0f1dec, UL 0x0f22d9, 0}, {"lib 6", UL 0x0f22da, UL 0x0f2699, 0}, {"lib 7", UL 0x0f269a, UL 0x0f2823, 0}, {"lib 8", UL 0x0f2824, UL 0x0f290d, 0}};

Page 20: Automatic Application Profiling

Step Two ResultsStep Two Results

• Functions in group lib 6 are taking the most time, followed by lib 4 and lib 1

Execution Time per Library Function Group

lib 1

lib 2

lib 3

lib 4

lib 5

lib 6

lib 7

lib 8

other

Count Timelib 1 3079 10.22%lib 2 1796 5.96%lib 3 865 2.87%lib 4 3092 10.26%lib 5 687 2.28%lib 6 11044 36.65%lib 7 131 0.43%lib 8 444 1.47%other 8999 29.86%

Page 21: Automatic Application Profiling

Step Three: Profile Top Library FunctionsStep Three: Profile Top Library Functions

• Examine the nine functions in these three groups, grouping two functions together

# SECTION ATR TYPE START LENGTH ALIGN MODULENAMEprogram REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD REL CODE 0F161C 0000A7 2 _F8LE REL CODE 0F16C4 000069 2 _F8LTOR REL CODE 0F172E 0002DE 2 _F8MUL REL CODE 0F1A0C 0000BA 2 _F8TOF4 REL CODE 0F1AC6 000025 2 _F8TOI4U REL CODE 0F1AEC 000192 2 _FTOL REL CODE 0F1C7E 00004D 2 _I4DIVU REL CODE 0F1CCC 000022 2 _I4TOF4 REL CODE 0F1CEE 0000FD 2 _LTOF REL CODE 0F1DEC 000138 2 SIN REL CODE 0F1F24 00035D 2 TAN REL CODE 0F2282 000058 2 _F4LTOR REL CODE 0F22DA 000060 2 _F4RTOL REL CODE 0F233A 0002E4 2 _F8DIV REL CODE 0F261E 00007C 2 _F8EQ REL CODE 0F269A 0000A7 2 _F8LT REL CODE 0F2742 00007C 2 _F8NE REL CODE 0F27BE 000066 2 _F8RTOL REL CODE 0F2824 00002E 2 _F8SUB REL CODE 0F2852 000025 2 _F8TOI4 REL CODE 0F2878 000073 2 _I4MOD REL CODE 0F28EC 000022 2 _I4TOF8

Page 22: Automatic Application Profiling

Step Three: Profile Top Library FunctionsStep Three: Profile Top Library Functions

PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"_F4DIV", UL 0x0f0918, UL 0x0f0b27, 0}, {"_F4TOF8+_F8ADD", UL 0x0f0b28, UL 0x0f161b, 0}, {"_I4DIVU", UL 0x0f1c7e, UL 0x0f1ccb, 0}, {"_I4TOF4", UL 0x0f1ccc, UL 0x0f1ced, 0}, {"_LTOF", UL 0x0f1cee, UL 0x0f1deb, 0}, {"_F4RTOL", UL 0x0f22da, UL 0x0f2339, 0}, {"_F8DIV", UL 0x0f233a, UL 0x0f261d, 0}, {"_F8EQ", UL 0x0f261e, UL 0x0f2699, 0}};

Page 23: Automatic Application Profiling

Step Three ResultsStep Three Results

• Most time spent in double precision floating point divide• 3.5*(sin(i/f)+1.0) is culprit. Avoid floating point when possible

Execution Time per Library Function

_F4DIV_F4TOF8+_F8ADD_I4DIVU_I4TOF4_LTOF_F4RTOL_F8DIV_F8EQother

Count Time_F4DIV 482 1.83%_F4TOF8+_F8ADD2413 9.14%_I4DIVU 0 0.00%_I4TOF4 18 0.07%_LTOF 2748 10.41%_F4RTOL 36 0.14%_F8DIV 9418 35.67%_F8EQ 0 0.00%other 11291 42.76%

26406 57.24%

Page 24: Automatic Application Profiling

Disadvantages of SamplingDisadvantages of Sampling• Sampling is inexact - not guaranteed to get everything that runs

– Code which disables interrupts (e.g. ISRs, OS code) is not measured

– Rarely executed code may be missed

– Takes time for numbers to settle down

– Profile changes based on mode of program

• If manually creating table, user needs to update address table with each code change

Page 25: Automatic Application Profiling

Implementing InstrumentationImplementing Instrumentation• Tedious to do manually for large programs, so automate

• Have compiler instrument code for you– gcc and other compilers support profiling using a command line switch

– They provide a tool to process the output file to determine how much time each function takes

• Can also modify the binary (after compilation): Atom from DEC’s Western Research Lab, Etch, EEL– Tool processes binary files to run your instrumentation procedures for each

procedure, basic block, or instruction

• For the M16C, what would be best?– Create a program which reads the map file and creates a C file declaring our

profiles array with correct region names and addresses• Probably easiest to use a scripting language: sed, awk, perl, (f)lex

• Probably not enough memory to instrument all functions, so must be selective