Statistical Profiling: Hardware, OS, and Analysis Tools

Profiling Tutorial 2-2, 10/4/98

Joint Work

DIGITAL Continuous Profiling Infrastructure (DCPI)

Project Members at

Systems Research Center: Lance Berc, Sanjay Ghemawat, Monika Henzinger, Shun-Tak Leung, Dick Sites (now at Adobe), Mitch Lichtenberg, Mark Vandevoorde, Carl Waldspurger, Bill Weihl

Western Research Lab: Jennifer Anderson, Jeffrey Dean

Other Collaborators

Cambridge Research Lab: Jamey Hicks

Alpha Engineering: George Chrysos, Scot Hildebrandt, Rick Kessler, Ed McLellan, Gerard Vernes, Jonathan White


Outline

Statistical sampling
– What is it?
– Why use it?

Data collection
– Hardware issues
– OS issues

Data analysis
– In-order processors
– Out-of-order processors


Statistical Profiling

Based on periodic sampling

Hardware generates periodic interrupts

OS handles the interrupts and stores data
– Program Counter (PC) and any extra info

Analysis Tools convert data
– for users
– for compilers

Examples: DCPI, Morph, SGI Speedshop, Unix’s prof(), VTune


Sampling vs. Instrumentation

Much lower overhead than instrumentation
– DCPI: program 1%-3% slower
– Pixie: program 2-3 times slower

Applicable to large workloads
– 100,000 TPS on Alpha
– AltaVista

Easier to apply to whole systems (kernel, device drivers, shared libraries, ...)
– Instrumenting kernels is very tricky
– No source code needed


Information from Profiles

DCPI estimates

Where CPU cycles went, broken down by
– image, procedure, instruction

How often code was executed
– basic blocks and CFG edges

Where peak performance was lost and why


Example: Getting the Big Picture

Total samples for event type cycles = 6095201

 cycles       %    cum%  load file
2257103  37.03%  37.03%  /usr/shlib/X11/lib_dec_ffb_ev5.so
1658462  27.21%  64.24%  /vmunix
 928318  15.23%  79.47%  /usr/shlib/X11/libmi.so
 650299  10.67%  90.14%  /usr/shlib/X11/libos.so

 cycles       %    cum%  procedure              load file
2064143  33.87%  33.87%  ffb8ZeroPolyArc        /usr/shlib/X11/lib_dec_ffb_ev5.so
 517464   8.49%  42.35%  ReadRequestFromClient  /usr/shlib/X11/libos.so
 305072   5.01%  47.36%  miCreateETandAET       /usr/shlib/X11/libmi.so
 271158   4.45%  51.81%  miZeroArcSetup         /usr/shlib/X11/libmi.so
 245450   4.03%  55.84%  bcopy                  /vmunix
 209835   3.44%  59.28%  Dispatch               /usr/shlib/X11/libdix.so
 186413   3.06%  62.34%  ffb8FillPolygon        /usr/shlib/X11/lib_dec_ffb_ev5.so
 170723   2.80%  65.14%  in_checksum            /vmunix
 161326   2.65%  67.78%  miInsertEdgeInET       /usr/shlib/X11/libmi.so
 133768   2.19%  69.98%  miX1Y1X2Y2InRegion     /usr/shlib/X11/libmi.so
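The % and cum% columns in the per-image listing follow directly from the raw sample counts. A minimal sketch of that arithmetic, using the four load-file counts and the total from the listing; the function name `summarize` is only illustrative and is not part of DCPI:

```python
# Sample counts per load file, copied from the listing above.
TOTAL_SAMPLES = 6095201
counts = [
    ("/usr/shlib/X11/lib_dec_ffb_ev5.so", 2257103),
    ("/vmunix", 1658462),
    ("/usr/shlib/X11/libmi.so", 928318),
    ("/usr/shlib/X11/libos.so", 650299),
]


def summarize(counts, total):
    """Return (cycles, percent, cumulative percent, name) rows, largest first."""
    rows = []
    running = 0
    for name, cycles in sorted(counts, key=lambda item: item[1], reverse=True):
        running += cycles
        rows.append((cycles, 100.0 * cycles / total, 100.0 * running / total, name))
    return rows


rows = summarize(counts, TOTAL_SAMPLES)
for cycles, pct, cum, name in rows:
    print(f"{cycles:>8} {pct:6.2f}% {cum:6.2f}%  {name}")
```

The percentages are taken against all samples for the event, not just the rows shown, which is why the cumulative column stops short of 100%.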


Example: Using the Microscope

Address  Instruction       Samples  Culprits  CPI          Stall reasons
...
9618     addq s0,t6,t6         643             1.0 cycles
961c     ldl  t4,0(t6)        2111  9618       3.5 cycles  b, D
9620     xor  t4,t12,t5      14152  961c      21.0 cycles  a, d, i
9624     beq  0x963c             0             0.0 cycles
...

a = data dep on 1st operand
b = data dep on 2nd operand
d = d-cache miss
D = DTLB miss
i = i-cache miss

Where peak performance is lost and why


Example: Summarizing Stalls

I-cache (not ITB)     0.0% to  0.3%
ITB/I-cache miss      0.0% to  0.0%
D-cache miss         27.9% to 27.9%
DTB miss              9.2% to 18.3%
Write buffer          0.0% to  6.3%
Synchronization       0.0% to  0.0%
Branch mispredict     0.0% to  2.6%
IMUL busy             0.0% to  0.0%
FDIV busy             0.0% to  0.0%
Other                 0.0% to  0.0%
Unexplained stall     2.3% to  2.3%
Unexplained gain     -4.3% to -4.3%
-------------------------------------------------------------
Subtotal dynamic                        44.1%

Slotting              1.8%
Ra dependency         2.0%
Rb dependency         1.0%
Rc dependency         0.0%
FU dependency         0.0%
-------------------------------------------------------------
Subtotal static                          4.8%
-------------------------------------------------------------
Total stall                             48.9%
Execution                               51.2%
Net sampling error                      -0.1%
-------------------------------------------------------------
Total tallied                          100.0%
(35171, 93.1% of all samples)


Example: Sorting Stalls

    %   cum%  cycles   cnt   cpi  blame   PC    file:line
10.0%  10.0%  109885  4998  22.0  dcache  957c  comp.c:484
 9.9%  19.8%  108776  5513  19.7  dcache  9530  comp.c:477
 7.8%  27.6%   85668  3836  22.3  dcache  959c  comp.c:488


Instruction-level Information Matters

DCPI anecdotes

TPC-D: 10% speedup

Duplicate filtering for AltaVista: part of 19X

Compress program: 22%

Compiler improvements: 20% in several SPEC benchmarks


Typical Hardware Support

Timers
– Clock interrupt after N units of time

Performance Counters
– Interrupt after N cycles, issues, loads, L1 Dcache misses, branch mispredicts, uops retired, ...
– Alpha 21064, 21164; PPro, PII; …
– Easy to measure total cycles, issues, CPI, etc.

Only extra information is restart PC


Problem: Inaccurate Attribution

Experiment
– count data loads
– loop: single load + hundreds of nops

In-Order Processor
– Alpha 21164
– skew
– large peak

Out-of-Order Processor
– Intel Pentium Pro
– skew
– smear

[Figure: Histogram of Restart PCs. Horizontal axis ticks run from 0 to 200, vertical axis ticks from 0 to 24. Annotations: “load” and “782”.]


Ramification of Misattribution

No skew or smear
– Instruction-level analysis is easy!

Skew is a constant number of cycles
– Instruction-level analysis is possible
– Adjust sampling period by amount of skew
– Infer execution counts, CPI, stalls, and stall explanations from cycles samples and program

Smear
– Instruction-level analysis seems hopeless
– Examples: PII, StrongARM
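The difference between skew and smear can be pictured with a toy simulation of the load-plus-nops experiment. This is only a sketch under stated assumptions: the loop size, the skew of 6 instructions, and the 10 to 40 instruction delay range are hypothetical values, not measurements from the 21164 or the Pentium Pro, and shifting attributions back by the skew is just one simple way to picture the correction.

```python
import random
from collections import Counter

LOOP_LEN = 200   # one load followed by nops (hypothetical loop size)
LOAD_INDEX = 0   # position of the load inside the loop
SKEW = 6         # hypothetical constant delay, in instructions


def restart_pcs(delay_fn, events=10_000):
    """Histogram of restart PCs when every counted event is caused by the load."""
    hist = Counter()
    for _ in range(events):
        hist[(LOAD_INDEX + delay_fn()) % LOOP_LEN] += 1
    return hist


def unskew(hist, skew):
    """Shift every restart PC back by a known, constant skew."""
    return Counter({(pc - skew) % LOOP_LEN: n for pc, n in hist.items()})


rng = random.Random(0)
skewed = restart_pcs(lambda: SKEW)                   # constant delay: one sharp peak
smeared = restart_pcs(lambda: rng.randint(10, 40))   # variable delay: spread out

corrected = unskew(skewed, SKEW)
print("skewed peak at", max(skewed, key=skewed.get))
print("corrected peak at", max(corrected, key=corrected.get))
print("distinct restart PCs with smear:", len(smeared))
```

With a constant delay every sample lands on one wrong instruction and can be moved back; with a variable delay no single shift recovers the load.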


Desired Hardware Support

Sample fetched instructions

Save PC of sampled instruction
– E.g., interrupt handler reads Internal Processor Register
– Makes skew and smear irrelevant

Gather more information


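ProfileMe: Instruction-Centric Profiling

[Figure: pipeline diagram with stages fetch, map, issue, exec, retire, alongside icache, branch predict, dcache, and arith units. A fetch counter overflow triggers random selection of a fetched instruction, which receives the ProfileMe tag. While the tagged instruction is in flight, the hardware captures pc, addr, retired?, miss?, mp?, history, and stage latencies into internal processor registers. When the instruction is done, an interrupt is raised.]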


Instruction-Level Statistics

PC + Retire Status → execution frequency

PC + Cache Miss Flag → cache miss rates

PC + Branch Mispredict → mispredict rates

PC + Event Flag → event rates

PC + Branch Direction → edge frequencies

PC + Branch History → path execution rates

PC + Latency → instruction stalls
– “100-cycle dcache miss” vs. “dcache miss”
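Each pairing above amounts to grouping sample records by PC and averaging one captured field. A minimal sketch, assuming a hypothetical `Sample` record with a few of the captured fields; the field names and the single `latency` value are illustrative and do not describe the actual register layout:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """One ProfileMe-style sample (hypothetical field names)."""
    pc: int
    retired: bool
    dcache_miss: bool
    latency: int  # cycles; a single latency stands in for the stage latencies


def per_pc_stats(samples):
    """Aggregate samples by PC into rates and a mean latency."""
    acc = defaultdict(lambda: {"n": 0, "retired": 0, "miss": 0, "latency": 0})
    for s in samples:
        a = acc[s.pc]
        a["n"] += 1
        a["retired"] += s.retired
        a["miss"] += s.dcache_miss
        a["latency"] += s.latency
    return {
        pc: {
            "samples": a["n"],
            "retire_rate": a["retired"] / a["n"],
            "miss_rate": a["miss"] / a["n"],
            "mean_latency": a["latency"] / a["n"],
        }
        for pc, a in acc.items()
    }


samples = [
    Sample(0x961C, retired=True, dcache_miss=True, latency=100),
    Sample(0x961C, retired=True, dcache_miss=False, latency=2),
    Sample(0x961C, retired=False, dcache_miss=False, latency=2),
    Sample(0x9620, retired=True, dcache_miss=False, latency=1),
]
stats = per_pc_stats(samples)
print(stats[0x961C])
```

Keeping the latency next to the miss flag is what lets an analysis distinguish a “100-cycle dcache miss” from a plain “dcache miss”.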


Kernel Device Driver

Challenge: with one sample every 64K cycles, a 1% overhead budget leaves only 655 cycles/sample

Aggregate samples in hash table
– (PID, PC, event) → count

Minimize cache misses
– ~100 cycles to memory
– Pack data structures into cache lines

Eliminate expensive synchronization operations
– Interprocessor interrupts for synchronization with daemon
– Replicate main data structures on each processor
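The budget arithmetic and the aggregation idea can be sketched as follows. This is a user-level illustration, not the driver's code: `PerCpuSampleTable`, `record`, and `drain` are hypothetical names, and a Python `Counter` stands in for the cache-line-packed hash table.

```python
from collections import Counter

SAMPLING_PERIOD = 64 * 1024   # cycles between samples, from the slide
OVERHEAD_BUDGET = 0.01        # 1% target overhead
cycles_per_sample = int(SAMPLING_PERIOD * OVERHEAD_BUDGET)
print("cycle budget per sample:", cycles_per_sample)


class PerCpuSampleTable:
    """One table per processor, so recording a sample needs no shared lock."""

    def __init__(self, num_cpus):
        self.tables = [Counter() for _ in range(num_cpus)]

    def record(self, cpu, pid, pc, event):
        # Repeated (PID, PC, event) samples collapse into a single count.
        self.tables[cpu][(pid, pc, event)] += 1

    def drain(self):
        """Merge and clear all per-CPU tables (what a daemon might read)."""
        merged = Counter()
        for table in self.tables:
            merged.update(table)
            table.clear()
        return merged


table = PerCpuSampleTable(num_cpus=2)
table.record(0, 42, 0x9620, "cycles")
table.record(0, 42, 0x9620, "cycles")
table.record(1, 42, 0x9620, "cycles")
table.record(1, 7, 0x961C, "cycles")
merged = table.drain()
print(merged)
```

Aggregating in place keeps the data volume proportional to the number of distinct (PID, PC, event) triples rather than to the number of interrupts.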


Moving Samples to Disk

User-Space Daemon
– Extracts raw samples from driver
– Associates samples with compiled code
– Updates disk-based profiles for compiled code

Mapping <PID, PC> samples to compiled code
– Dynamic loader hook for dynamically loaded code
– Exec hook for statically linked code
– Other hooks for initializing mapping at daemon start-up

Profiles
– text header + compact binary samples
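The <PID, PC> mapping step can be sketched as a per-process table of loaded address ranges that the hooks keep up to date. The class `AddressMap` and its methods are hypothetical names for illustration, and the addresses in the usage lines are made up.

```python
import bisect


class AddressMap:
    """Per-process map from address ranges to images (illustrative only)."""

    def __init__(self):
        self.regions = {}  # pid -> sorted list of (start, end, image)

    def on_load(self, pid, start, end, image):
        """What a loader or exec hook might report: image mapped at [start, end)."""
        bisect.insort(self.regions.setdefault(pid, []), (start, end, image))

    def lookup(self, pid, pc):
        """Return (image, offset within image) for a sample, or None if unknown."""
        regions = self.regions.get(pid, [])
        # Index of the last region whose start address is <= pc.
        i = bisect.bisect_right(regions, (pc, float("inf"), "")) - 1
        if i >= 0:
            start, end, image = regions[i]
            if start <= pc < end:
                return image, pc - start
        return None


amap = AddressMap()
amap.on_load(42, 0x4000, 0x5000, "/usr/shlib/X11/libos.so")
amap.on_load(42, 0x1000, 0x2000, "/usr/shlib/X11/libmi.so")
print(amap.lookup(42, 0x1010))
print(amap.lookup(42, 0x3000))
```

Storing samples as (image, offset) rather than (PID, PC) is what allows one on-disk profile per image to accumulate counts across processes and runs.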


Performance of Data Collection (DCPI)

Time
– 1-3% total overhead for most workloads
– Often less than variation from run to run

Space
– 512 KB kernel memory per processor
– 2-10 MB resident for daemon
– 10 MB disk after one month of profiling on heavily used timeshared 4-processor machine

Non-intrusive enough to be run for many hours on production systems, e.g.


Data Analysis

[Figure: compiled code and samples feed ANALYSIS, which produces frequency, cycles per instruction, and stall explanations.]

Cycle samples are proportional to total time at head of issue queue (at least on in-order Alphas)

Frequency indicates frequent paths

CPI indicates stalls
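The proportionality between cycle samples and time at the head of the issue queue can be checked on a toy loop. A minimal sketch, assuming three instructions that each run once per iteration and spend a hypothetical 1, 3, and 21 cycles at the head; the fixed period of 7 is chosen only because it is coprime with the 25-cycle loop, which makes the result exact.

```python
from collections import Counter

CYCLES_AT_HEAD = [1, 3, 21]          # hypothetical cycles per instruction
LOOP_CYCLES = sum(CYCLES_AT_HEAD)    # 25 cycles per loop iteration
PERIOD = 7                           # sampling period in cycles


def instruction_at(cycle):
    """Index of the instruction at the head of the issue queue in this cycle."""
    offset = cycle % LOOP_CYCLES
    for index, cost in enumerate(CYCLES_AT_HEAD):
        if offset < cost:
            return index
        offset -= cost


def collect(num_samples):
    """Take one sample every PERIOD cycles and count samples per instruction."""
    return Counter(instruction_at(k * PERIOD) for k in range(num_samples))


samples = collect(25 * 100)
for index, cost in enumerate(CYCLES_AT_HEAD):
    print(f"instruction {index}: {cost:2d} cycles at head, {samples[index]} samples")
```

All three instructions have the same frequency here, so the sample counts differ only because of CPI; separating the two factors is the job of the estimation step.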


Estimating Frequency from Samples

[Figure: 1,000,000 cycles could come from 1,000,000 executions at 1 CPI, or from 10,000 executions at 100 CPI.]

Problem
– given cycle samples, compute frequency and CPI

Approach
– Let F = Frequency / Sampling Period
– E(Cycle Samples) = F × CPI
– So … F = E(Cycle Samples) / CPI


Estimating Frequency (cont.)

F = E(Cycle Samples) / CPI

Idea
– If no dynamic stall, then know CPI, so can estimate F
– So … assume some instructions have no dynamic stalls

Consider a group of instructions with the same frequency (e.g., basic block)

Identify instructions without dynamic stalls; then average their sample counts for better accuracy

Key insight:
– Instructions without stalls have smaller sample counts


Estimating Frequency (Example)

Address  Instruction           Samples  MinCPI  Samples/MinCPI
9600     subl   s6, a1, s6         792       1     792
9604     lda    a3, 16411(s6)      611       1     611
9608     cmovlt s6, a3, s6         649       1     649
960c     bis    zero, zero, s3       0       0
9610     sll    s6, 0x5, t6       1389       2     695
9614     addl   zero, t6, t6       616       1     616
9618     addq   s0, t6, t6         643       1     643
961c     ldl    t4, 0(t6)         2111       1    2111
9620     xor    t4, t12, t5      13152       2    6576
9624     beq    t5, 963c             0       0

Estimate 630 (Actual 615)

Compute MinCPI from Code

Compute Samples/MinCPI

Select Data to Average

Does badly when:
– Few issue points
– All issue points stall
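The three steps (compute MinCPI, compute Samples/MinCPI, select data to average) can be sketched on the example's numbers. The selection rule below is an assumption made for illustration: it keeps ratios within 10% of the smallest one. The slides do not state the rule the tool actually uses; with this hypothetical cutoff the sketch happens to land near the slide's estimate of 630.

```python
# (address, samples, MinCPI) copied from the example table.
BLOCK = [
    (0x9600, 792, 1), (0x9604, 611, 1), (0x9608, 649, 1), (0x960C, 0, 0),
    (0x9610, 1389, 2), (0x9614, 616, 1), (0x9618, 643, 1), (0x961C, 2111, 1),
    (0x9620, 13152, 2), (0x9624, 0, 0),
]
TOLERANCE = 1.10  # hypothetical cutoff: keep ratios within 10% of the minimum


def estimate_frequency(block, tolerance=TOLERANCE):
    """Average Samples/MinCPI over instructions that appear to have no dynamic stall."""
    ratios = [samples / min_cpi for _, samples, min_cpi in block if min_cpi > 0]
    smallest = min(ratios)
    selected = [r for r in ratios if r <= smallest * tolerance]
    return sum(selected) / len(selected)


def estimate_cpi(block, frequency):
    """CPI per instruction: cycle samples divided by the block's frequency estimate."""
    return {addr: samples / frequency for addr, samples, _ in block}


freq = estimate_frequency(BLOCK)
cpi = estimate_cpi(BLOCK, freq)
print(f"estimated F = {freq:.2f}")
print(f"CPI at 961c = {cpi[0x961C]:.1f}, CPI at 9620 = {cpi[0x9620]:.1f}")
```

Instructions whose ratio is far above the minimum (here the ldl and the xor) are the ones treated as stalled, and their excess over MinCPI is what the stall analysis then has to explain.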


Frequency Estimate Accuracy

Compare frequency estimates for blocks to measured values obtained with pixie-like tool

Edge frequencies a bit less accurate


Explaining Stalls

Static stalls
– Schedule instructions in each basic block optimistically using a detailed pipeline model for the processor

Dynamic stalls
– Start with all possible explanations: I-cache miss, D-cache miss, DTB miss, branch mispredict, ...
– Rule out unlikely explanations
– List the remaining possibilities


Ruling Out D-cache Misses

Is the previous occurrence of an operand register the destination of a load instruction?

Search backward across basic block boundaries

Prune by block and edge execution frequencies

[Figure: control-flow graph fragment with the instructions ldq t0,0(s1); subq t0,t1,t2; addq t3,t0,t4; and, after the label “OR”, subq t0,t1,t2.]
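The backward search for the previous occurrence of an operand register can be sketched for a single basic block. This is a simplification under stated assumptions: the `Insn` record and the function name are hypothetical, the search does not cross block boundaries or prune by frequency, and it rests on the reading that an earlier use of the register would already have absorbed any miss stall.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    """Simplified instruction record (hypothetical; not a real decoder)."""
    opcode: str
    dest: str
    sources: tuple
    is_load: bool = False


def previous_occurrence_is_load(block, index, reg):
    """Scan backward from block[index] for the previous occurrence of reg.

    Returns True if that occurrence is the destination of a load, False if it
    is anything else, and None if reg does not occur earlier in this block
    (a fuller analysis would continue into predecessor blocks).
    """
    for prev in reversed(block[:index]):
        if prev.dest == reg or reg in prev.sources:
            return prev.is_load and prev.dest == reg
    return None


ldq = Insn("ldq", dest="t0", sources=("s1",), is_load=True)
subq = Insn("subq", dest="t2", sources=("t0", "t1"))
addq = Insn("addq", dest="t4", sources=("t3", "t0"))

# Load feeds addq directly: a D-cache miss stays on the list of explanations.
print(previous_occurrence_is_load([ldq, addq], 1, "t0"))
# subq uses t0 first: a D-cache miss is ruled out for a stall at addq.
print(previous_occurrence_is_load([ldq, subq, addq], 2, "t0"))
```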


Out-of-Order Processors

In-Order Processors
– Periodic interrupt lands on “current” instruction, e.g., next instruction to issue
– Peak performance = no wasted issue slots
– Any stall implies loss in performance

Out-of-Order Processors
– Many instructions in-flight: no “current” instruction
– Some stalls masked by concurrent execution

Instructions issue around stalled instruction

Example: does this stall matter?

load r1, …
add …, r1, …
average latency: 15.0 cycles
… other instructions …


Issue: Need to Measure Concurrency

Interesting concurrency metrics
– Retired instructions per cycle
– Issue slots wasted while an instruction is in flight
– Pipeline stage utilization

How to measure concurrency?

Special-purpose hardware
– Some metrics difficult to measure, e.g. need retire/abort status

Sample potentially-concurrent instructions
– Aggregate info from pairs of samples
– Statistically estimate metrics


Paired Sampling

Sample two instructions
– May be in-flight simultaneously
– Replicate ProfileMe hardware, add intra-pair distance

Nested sampling
– Sample window around first profiled instruction
– Randomly select second profiled instruction
– Statistically estimate frequency for F(first, second)

[Figure: timeline with a window from -W to +W around the first profiled instruction, showing one second instruction that overlaps it in time and one that does not.]
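One way to picture the statistical estimate is to let F(first, second) be the predicate "the two instructions were in flight at the same time" and average it over all pairs that share a first PC. A minimal sketch, assuming a hypothetical record layout in which each sampled instruction carries a (fetch cycle, retire cycle) interval; the slides do not specify the actual record format.

```python
from collections import defaultdict


def overlap_fraction(pairs):
    """Estimate, per first PC, how often the paired instruction overlapped it.

    pairs: iterable of (first_pc, first_interval, second_interval), where each
    interval is a (fetch_cycle, retire_cycle) tuple (hypothetical layout).
    """
    seen = defaultdict(int)
    overlapped = defaultdict(int)
    for pc, (fetch1, retire1), (fetch2, retire2) in pairs:
        seen[pc] += 1
        # Two intervals overlap when each one starts before the other ends.
        if fetch1 < retire2 and fetch2 < retire1:
            overlapped[pc] += 1
    return {pc: overlapped[pc] / seen[pc] for pc in seen}


pairs = [
    (0x961C, (0, 100), (10, 20)),     # second runs entirely under the load
    (0x961C, (0, 100), (150, 160)),   # second starts after the load retires
    (0x9620, (0, 5), (3, 9)),         # partial overlap
]
fractions = overlap_fraction(pairs)
print(fractions)
```

A high overlap fraction for a long-latency instruction suggests its stall is at least partly masked by concurrent work.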


Explaining Lost Performance

An open question

Some in-order analysis applicable
– E.g., D-cache miss & branch mispredict analysis

Pipe stage latencies from counters would help a lot


Summary & Conclusion

Statistical profiling can be
– Inexpensive
– Effective

Instruction-level analysis matters

Performance counters
– Implementation details make a big difference

Out-of-order processors require better counters
