Statistical Profiling: Hardware, OS, and Analysis Tools
Profiling Tutorial 2-2, 10/4/98
Joint Work
DIGITAL Continuous Profiling Infrastructure (DCPI)
Project Members at
  – Systems Research Center: Lance Berc, Sanjay Ghemawat, Monika Henzinger, Shun-Tak Leung, Dick Sites (now at Adobe), Mitch Lichtenberg, Mark Vandevoorde, Carl Waldspurger, Bill Weihl
  – Western Research Lab: Jennifer Anderson, Jeffrey Dean
Other Collaborators
  – Cambridge Research Lab: Jamey Hicks
  – Alpha Engineering: George Chrysos, Scot Hildebrandt, Rick Kessler, Ed McLellan, Gerard Vernes, Jonathan White
Outline
Statistical sampling
  – What is it?
  – Why use it?
Data collection
  – Hardware issues
  – OS issues
Data analysis
  – In-order processors
  – Out-of-order processors
Statistical Profiling
Based on periodic sampling
Hardware generates periodic interrupts
OS handles the interrupts and stores data
  – Program Counter (PC) and any extra info
Analysis Tools convert data
  – for users
  – for compilers
Examples: DCPI, Morph, SGI Speedshop, Unix’s prof(), VTune
Sampling vs. Instrumentation
Much lower overhead than instrumentation
  – DCPI: program 1%-3% slower
  – Pixie: program 2-3 times slower
Applicable to large workloads
  – 100,000 TPS on Alpha
  – AltaVista
Easier to apply to whole systems (kernel, device drivers, shared libraries, ...)
  – Instrumenting kernels is very tricky
  – No source code needed
Information from Profiles
DCPI estimates
Where CPU cycles went, broken down by
  – image, procedure, instruction
How often code was executed
  – basic blocks and CFG edges
Where peak performance was lost and why
Example: Getting the Big Picture
Total samples for event type cycles = 6095201

 cycles       %    cum%  load file
2257103  37.03%  37.03%  /usr/shlib/X11/lib_dec_ffb_ev5.so
1658462  27.21%  64.24%  /vmunix
 928318  15.23%  79.47%  /usr/shlib/X11/libmi.so
 650299  10.67%  90.14%  /usr/shlib/X11/libos.so

 cycles       %    cum%  procedure              load file
2064143  33.87%  33.87%  ffb8ZeroPolyArc        /usr/shlib/X11/lib_dec_ffb_ev5.so
 517464   8.49%  42.35%  ReadRequestFromClient  /usr/shlib/X11/libos.so
 305072   5.01%  47.36%  miCreateETandAET       /usr/shlib/X11/libmi.so
 271158   4.45%  51.81%  miZeroArcSetup         /usr/shlib/X11/libmi.so
 245450   4.03%  55.84%  bcopy                  /vmunix
 209835   3.44%  59.28%  Dispatch               /usr/shlib/X11/libdix.so
 186413   3.06%  62.34%  ffb8FillPolygon        /usr/shlib/X11/lib_dec_ffb_ev5.so
 170723   2.80%  65.14%  in_checksum            /vmunix
 161326   2.65%  67.78%  miInsertEdgeInET       /usr/shlib/X11/libmi.so
 133768   2.19%  69.98%  miX1Y1X2Y2InRegion     /usr/shlib/X11/libmi.so
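The % and cum% columns in the per-image listing follow directly from the raw cycle sample counts. The sketch below shows one way to derive them; `profile_rows` is a hypothetical helper name chosen for illustration and is not part of DCPI, and only the counts and file names come from the slide.

```python
def profile_rows(counts, total):
    """Return (count, pct, cum_pct, name) rows, largest count first.

    counts: dict mapping a name (image or procedure) to its sample count.
    total:  total number of samples for the event type.
    """
    rows = []
    cum = 0.0
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        pct = 100.0 * count / total
        cum += pct
        rows.append((count, pct, cum, name))
    return rows


# Counts copied from the per-image listing on the slide.
TOTAL_CYCLE_SAMPLES = 6095201
IMAGE_SAMPLES = {
    "/usr/shlib/X11/lib_dec_ffb_ev5.so": 2257103,
    "/vmunix": 1658462,
    "/usr/shlib/X11/libmi.so": 928318,
    "/usr/shlib/X11/libos.so": 650299,
}

for count, pct, cum, name in profile_rows(IMAGE_SAMPLES, TOTAL_CYCLE_SAMPLES):
    print(f"{count:>8} {pct:6.2f}% {cum:6.2f}%  {name}")
```

Percentages are taken against the total for the event type, not against the rows shown, which is why the cumulative column stops short of 100%.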
Example: Using the Microscope
Address  Instruction      Samples  Culprits  CPI
9618     addq s0,t6,t6        643            1.0 cycles
         b D
961c     ldl t4,0(t6)        2111  9618      3.5 cycles
         a d i
9620     xor t4,t12,t5      14152  961c      21.0 cycles
9624     beq 0x963c             0            0.0 cycles

Stall annotations:
  b = data dep on 2nd operand
  D = DTLB miss
  a = data dep on 1st operand
  d = d-cache miss
  i = i-cache miss
Where peak performance is lost and why
Example: Summarizing Stalls
I-cache (not ITB)     0.0% to  0.3%
ITB/I-cache miss      0.0% to  0.0%
D-cache miss         27.9% to 27.9%
DTB miss              9.2% to 18.3%
Write buffer          0.0% to  6.3%
Synchronization       0.0% to  0.0%
Branch mispredict     0.0% to  2.6%
IMUL busy             0.0% to  0.0%
FDIV busy             0.0% to  0.0%
Other                 0.0% to  0.0%
Unexplained stall     2.3% to  2.3%
Unexplained gain     -4.3% to -4.3%
-------------------------------------------------------------
Subtotal dynamic                     44.1%

Slotting              1.8%
Ra dependency         2.0%
Rb dependency         1.0%
Rc dependency         0.0%
FU dependency         0.0%
-------------------------------------------------------------
Subtotal static                       4.8%
-------------------------------------------------------------
Total stall                          48.9%
Execution                            51.2%
Net sampling error                   -0.1%
-------------------------------------------------------------
Total tallied                       100.0% (35171, 93.1% of all samples)
Example: Sorting Stalls
    %   cum%  cycles   cnt   cpi  blame   PC    file:line
10.0%  10.0%  109885  4998  22.0  dcache  957c  comp.c:484
 9.9%  19.8%  108776  5513  19.7  dcache  9530  comp.c:477
 7.8%  27.6%   85668  3836  22.3  dcache  959c  comp.c:488
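The cpi column in the listing is consistent with cycles divided by cnt, and the rows are ordered by cycles. The following is a small sketch of that ranking under those assumptions; `rank_stalls` and the tuple layout are hypothetical and not taken from the DCPI tools.

```python
def rank_stalls(entries):
    """Sort stall entries by cycles (largest first) and attach a CPI.

    entries: iterable of (cycles, count, blame, location) tuples.
    Entries with a zero execution count are skipped, since their CPI
    is undefined.
    """
    ranked = sorted((e for e in entries if e[1] > 0),
                    key=lambda e: e[0], reverse=True)
    return [(cycles, count, cycles / count, blame, where)
            for cycles, count, blame, where in ranked]


# Values copied from the slide, deliberately listed out of order.
STALLS = [
    (108776, 5513, "dcache", "comp.c:477"),
    (109885, 4998, "dcache", "comp.c:484"),
    (85668, 3836, "dcache", "comp.c:488"),
]

for cycles, count, cpi, blame, where in rank_stalls(STALLS):
    print(f"{cycles:>7} {count:>5} {cpi:5.1f}  {blame:<7} {where}")
```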
Instruction-level Information Matters
DCPI anecdotes
TPC-D: 10% speedup
Duplicate filtering for AltaVista: part of a 19x speedup
Compress program: 22% speedup
Compiler improvements: 20% speedup in several SPEC benchmarks
Outline
Statistical sampling
  – What is it?
  – Why use it?
Data collection
  – Hardware issues
  – OS issues
Data analysis
  – In-order processors
  – Out-of-order processors
Typical Hardware Support
Timers
  – Clock interrupt after N units of time
Performance Counters
  – Interrupt after N cycles, issues, loads, L1 Dcache misses, branch mispredicts, uops retired, ...
  – Alpha 21064, 21164; PPro, PII; …
  – Easy to measure total cycles, issues, CPI, etc.
Only extra information is restart PC
Problem: Inaccurate Attribution
Experiment
  – count data loads
  – loop: single load + hundreds of nops
In-Order Processor
  – Alpha 21164
  – skew
  – large peak
Out-of-Order Processor
  – Intel Pentium Pro
  – skew
  – smear
[Figure: Histogram of Restart PCs. Horizontal axis 0 to 200, vertical axis 0 to 24; annotations in the plot: "load" and "782".]
Ramification of Misattribution
No skew or smear
  – Instruction-level analysis is easy!
Skew is a constant number of cycles
  – Instruction-level analysis is possible
  – Adjust sampling period by amount of skew
  – Infer execution counts, CPI, stalls, and stall explanations from cycles samples and program
Smear
  – Instruction-level analysis seems hopeless
  – Examples: PII, StrongARM
Desired Hardware Support
Sample fetched instructions
Save PC of sampled instruction
  – E.g., interrupt handler reads Internal Processor Register
  – Makes skew and smear irrelevant
Gather more information
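ProfileMe: Instruction-Centric Profiling
[Diagram: pipeline stages fetch, map, issue, exec, retire, shown with icache, branch predict, dcache, and arith units. When the fetch counter overflows, a randomly selected instruction receives a ProfileMe tag. For a tagged instruction the hardware captures pc, addr, retired?, miss?, mp?, history, and stage latencies into internal processor registers, and raises an interrupt when the instruction is done.]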
Instruction-Level Statistics
PC + Retire Status      ->  execution frequency
PC + Cache Miss Flag    ->  cache miss rates
PC + Branch Mispredict  ->  mispredict rates
PC + Event Flag         ->  event rates
PC + Branch Direction   ->  edge frequencies
PC + Branch History     ->  path execution rates
PC + Latency            ->  instruction stalls
                            (“100-cycle dcache miss” vs. “dcache miss”)
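The pairings above amount to grouping samples by PC and turning the captured flags into per-instruction rates. The sketch below illustrates that aggregation; the `Sample` record and its field names are assumptions made for this example and do not describe the actual ProfileMe register layout.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One sampled instruction (hypothetical record layout)."""
    pc: int
    retired: bool
    dcache_miss: bool
    mispredicted: bool
    latency: int  # cycles


def per_pc_stats(samples):
    """Aggregate samples by PC into rates and a mean latency."""
    acc = defaultdict(lambda: {"n": 0, "retired": 0, "dmiss": 0,
                               "mispredict": 0, "latency": 0})
    for s in samples:
        a = acc[s.pc]
        a["n"] += 1
        a["retired"] += s.retired        # bools count as 0 or 1
        a["dmiss"] += s.dcache_miss
        a["mispredict"] += s.mispredicted
        a["latency"] += s.latency
    return {
        pc: {
            "samples": a["n"],
            "retire_rate": a["retired"] / a["n"],
            "dcache_miss_rate": a["dmiss"] / a["n"],
            "mispredict_rate": a["mispredict"] / a["n"],
            "mean_latency": a["latency"] / a["n"],
        }
        for pc, a in acc.items()
    }
```

Because every sample carries its own PC, these rates are attributed to the sampled instruction itself rather than to whatever instruction the interrupt happened to land on.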
Kernel Device Driver
Challenge: a 1% overhead budget with a 64K-cycle sampling period leaves only 655 cycles/sample
Aggregate samples in hash table
  – (PID, PC, event) -> count
Minimize cache misses
  – ~100 cycles to memory
  – Pack data structures into cache lines
Eliminate expensive synchronization operations
  – Interprocessor interrupts for synchronization with daemon
  – Replicate main data structures on each processor
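The cycle budget and the aggregation scheme can be illustrated in a few lines. This is only a model of the idea in Python, not driver code: `PerCpuSampleTable` and its methods are hypothetical names, and a real driver would use fixed-size, cache-line-packed tables in kernel memory rather than a dictionary.

```python
from collections import Counter

SAMPLING_PERIOD_CYCLES = 64 * 1024   # 64K cycles between samples
OVERHEAD_BUDGET = 0.01               # 1% of execution time
CYCLE_BUDGET_PER_SAMPLE = SAMPLING_PERIOD_CYCLES * OVERHEAD_BUDGET  # about 655


class PerCpuSampleTable:
    """One (PID, PC, event) -> count table per processor.

    Replicating the table per CPU means the interrupt path only ever
    touches its own processor's table, so no cross-CPU lock is needed.
    """

    def __init__(self, num_cpus):
        self.tables = [Counter() for _ in range(num_cpus)]

    def record(self, cpu, pid, pc, event):
        # Repeated samples of the same key cost one increment.
        self.tables[cpu][(pid, pc, event)] += 1

    def drain(self, cpu):
        """Hand the CPU's aggregated counts to the daemon and start afresh."""
        table, self.tables[cpu] = self.tables[cpu], Counter()
        return table
```

Aggregating in place keeps the volume of data handed to user space proportional to the number of distinct keys rather than to the number of interrupts.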
Moving Samples to Disk
User-Space Daemon
  – Extracts raw samples from driver
  – Associates samples with compiled code
  – Updates disk-based profiles for compiled code
Mapping <PID, PC> samples to compiled code
  – Dynamic loader hook for dynamically loaded code
  – Exec hook for statically linked code
  – Other hooks for initializing mapping at daemon start-up
Profiles
  – text header + compact binary samples
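Mapping a <PID, PC> sample to compiled code comes down to a range lookup in a per-process table that the loader and exec hooks keep up to date. The sketch below assumes such a table exists; `LoadMap`, its methods, and the addresses in the usage note are hypothetical and are not the daemon's real data structures.

```python
import bisect


class LoadMap:
    """Per-process list of loaded images, searchable by address."""

    def __init__(self):
        self._regions = {}  # pid -> sorted list of (start, end, image)

    def add_mapping(self, pid, start, end, image):
        """Record that image occupies [start, end) in process pid.

        In the scheme on the slide this would be driven by the loader
        and exec hooks.
        """
        bisect.insort(self._regions.setdefault(pid, []), (start, end, image))

    def lookup(self, pid, pc):
        """Return (image, offset within image) for a sample, or None."""
        regions = self._regions.get(pid, [])
        # Find the last region whose start address is <= pc.
        i = bisect.bisect_right(regions, (pc, float("inf"), "")) - 1
        if i >= 0:
            start, end, image = regions[i]
            if start <= pc < end:
                return image, pc - start
        return None
```

For example, after `m = LoadMap()` and `m.add_mapping(42, 0x1000, 0x2000, "/usr/shlib/X11/libmi.so")`, the call `m.lookup(42, 0x1800)` yields the image name and the offset `0x800`, which is the form a per-image profile on disk would be keyed by.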
Performance of Data Collection (DCPI)
Time
  – 1-3% total overhead for most workloads
  – Often less than variation from run to run
Space
  – 512 KB kernel memory per processor
  – 2-10 MB resident for daemon
  – 10 MB disk after one month of profiling on heavily used timeshared 4-processor machine
Non-intrusive enough to be run for many hours on production systems, e.g.
Outline
Statistical sampling
  – What is it?
  – Why use it?
Data collection
  – Hardware issues
  – OS issues
Data analysis
  – In-order processors
  – Out-of-order processors
Data Analysis
[Diagram: compiled code and samples feed ANALYSIS, which produces frequency, cycles per instruction, and stall explanations.]
Cycle samples are proportional to total time at head of issue queue (at least on in-order Alphas)
Frequency indicates frequent paths
CPI indicates stalls
Estimating Frequency from Samples
Problem
  – Given cycle samples, compute frequency and CPI
  – Ambiguity: 1,000,000 cycles could mean a frequency of 1,000,000 at 1 CPI, or a frequency of 10,000 at 100 CPI
Approach
  – Let F = Frequency / Sampling Period
  – E(Cycle Samples) = F × CPI
  – So F = E(Cycle Samples) / CPI
Estimating Frequency (cont.)
F = E(Cycle Samples) / CPI
Idea
  – If an instruction has no dynamic stall, its CPI is known from the code, so F can be estimated
  – So assume some instructions have no dynamic stalls
Consider a group of instructions with the same frequency (e.g., basic block)
Identify instructions without dynamic stalls; then average their sample counts for better accuracy
Key insight:
  – Instructions without stalls have smaller sample counts
Estimating Frequency (Example)

Address  Instruction         Samples  MinCPI  Samples/MinCPI
9600     subl s6, a1, s6         792       1             792
9604     lda a3, 16411(s6)       611       1             611
9608     cmovlt s6, a3, s6       649       1             649
960c     bis zero, zero, s3        0       0
9610     sll s6, 0x5, t6        1389       2             695
9614     addl zero, t6, t6       616       1             616
9618     addq s0, t6, t6         643       1             643
961c     ldl t4, 0(t6)          2111       1            2111
9620     xor t4, t12, t5       13152       2            6576
9624     beq t5, 963c              0       0

Estimate 630 (Actual 615)
Compute MinCPI from Code
Compute Samples/MinCPI
Select Data to Average
Does badly when:
  – Few issue points
  – All issue points stall
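The three steps (compute MinCPI, compute Samples/MinCPI, select data to average) can be sketched as follows. The slide does not say how the data to average are selected, so the rule used here, keeping ratios within 10% of the smallest one, is an assumption made purely for illustration; it is not presented as DCPI's actual heuristic, although on this block it picks 611, 649, 616, and 643 and lands on the slide's estimate.

```python
def estimate_frequency(issue_points, tolerance=0.10):
    """Estimate F for a group of instructions with equal frequency.

    issue_points: list of (samples, min_cpi) pairs, one per instruction.
    Instructions with min_cpi == 0 do not occupy an issue point and
    are ignored. The tolerance-based selection rule is an assumption.
    """
    ratios = [samples / min_cpi
              for samples, min_cpi in issue_points if min_cpi > 0]
    if not ratios:
        raise ValueError("no issue points to estimate from")
    lowest = min(ratios)
    # Instructions without dynamic stalls have the smallest ratios,
    # so average only the ratios close to the minimum.
    selected = [r for r in ratios if r <= lowest * (1 + tolerance)]
    return sum(selected) / len(selected)


def estimate_cpi(samples, frequency):
    """CPI of one instruction, from F = E(Cycle Samples) / CPI."""
    return samples / frequency


# (Samples, MinCPI) pairs copied from the example table.
BLOCK = [(792, 1), (611, 1), (649, 1), (0, 0), (1389, 2),
         (616, 1), (643, 1), (2111, 1), (13152, 2), (0, 0)]

F = estimate_frequency(BLOCK)
print(round(F))  # the slide reports an estimate of 630
```

The failure modes listed on the slide show up directly in this sketch: with few issue points there is little to average, and if every issue point stalls, even the smallest ratio overestimates F.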
Frequency Estimate Accuracy
Compare frequency estimates for blocks to measured values obtained with a pixie-like tool
Edge frequencies a bit less accurate
Explaining Stalls
Static stalls
  – Schedule instructions in each basic block optimistically using a detailed pipeline model for the processor
Dynamic stalls
  – Start with all possible explanations: I-cache miss, D-cache miss, DTB miss, branch mispredict, ...
  – Rule out unlikely explanations
  – List the remaining possibilities
Ruling Out D-cache Misses
Is the previous occurrence of an operand register the destination of a load instruction?
Search backward across basic block boundaries
Prune by block and edge execution frequencies
Example:
  ldq t0,0(s1)
  subq t0,t1,t2
OR
  addq t3,t0,t4
  subq t0,t1,t2
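The test for keeping or ruling out a D-cache miss can be sketched for a single straight-line path. This is a simplification: the search across basic block boundaries and the pruning by block and edge frequencies are left out, and the `Insn` record and `dcache_miss_possible` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

LOAD_OPCODES = {"ldl", "ldq"}  # the Alpha loads that appear in these slides


@dataclass(frozen=True)
class Insn:
    opcode: str
    dest: Optional[str]
    sources: Tuple[str, ...]


def dcache_miss_possible(preceding, stalled):
    """Can a D-cache miss explain a stall at `stalled`?

    preceding: instructions executed before `stalled` on one path, in
    program order. For each operand register, find its previous
    occurrence; a miss remains possible only if that occurrence is the
    destination of a load.
    """
    for reg in stalled.sources:
        for insn in reversed(preceding):
            if reg == insn.dest or reg in insn.sources:
                if insn.opcode in LOAD_OPCODES and insn.dest == reg:
                    return True
                break  # previous occurrence is not a load destination
    return False


ldq = Insn("ldq", "t0", ("s1",))          # ldq  t0,0(s1)
addq = Insn("addq", "t4", ("t3", "t0"))   # addq t3,t0,t4
subq = Insn("subq", "t2", ("t0", "t1"))   # subq t0,t1,t2

print(dcache_miss_possible([ldq], subq))        # load feeds subq directly
print(dcache_miss_possible([ldq, addq], subq))  # addq already consumed t0
```

In the second case the earlier use of t0 by addq would already have waited for the load on an in-order machine, so a miss on that load cannot be what stalls subq.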
Out-of-Order Processors
In-Order Processors
  – Periodic interrupt lands on “current” instruction, e.g., next instruction to issue
  – Peak performance = no wasted issue slots
  – Any stall implies loss in performance
Out-of-Order Processors
  – Many instructions in-flight: no “current” instruction
  – Some stalls masked by concurrent execution
  – Instructions issue around stalled instruction
Example: does this stall matter?
  load r1,…
  add …,r1,…
  average latency: 15.0 cycles
  … other instructions …
Issue: Need to Measure Concurrency
Interesting concurrency metrics
  – Retired instructions per cycle
  – Issue slots wasted while an instruction is in flight
  – Pipeline stage utilization
How to measure concurrency?
Special-purpose hardware
  – Some metrics difficult to measure, e.g. need retire/abort status
Sample potentially-concurrent instructions
  – Aggregate info from pairs of samples
  – Statistically estimate metrics
Paired Sampling
Sample two instructions
  – May be in-flight simultaneously
  – Replicate ProfileMe hardware, add intra-pair distance
Nested sampling
  – Sample window around first profiled instruction
  – Randomly select second profiled instruction
  – Statistically estimate frequency for F(first, second)
[Figure: time line with a sampling window from -W to +W around the first profiled instruction; the second instruction is marked as either "overlap" or "no overlap" with the first.]
Explaining Lost Performance
An open question
Some in-order analysis applicable
  – E.g., D-cache miss & branch mispredict analysis
Pipe stage latencies from counters would help a lot
Summary & Conclusion
Statistical profiling can be
  – Inexpensive
  – Effective
Instruction-level analysis matters
Performance counters
  – Implementation details make a big difference
Out-of-order processors require better counters