Floating Point Analysis Using Dyninst

Mike Lam, University of Maryland, College Park
Jeff Hollingsworth, Advisor
Background

• Floating point represents real numbers as ±significand × 2^exponent
  o Sign bit
  o Exponent
  o Significand ("mantissa" or "fraction")
• Finite precision
  o Single precision: 24 bits (~7 decimal digits)
  o Double precision: 53 bits (~16 decimal digits)

[Figure: IEEE single format: sign bit, 8-bit exponent, 23-bit significand. IEEE double format: sign bit, 11-bit exponent, 52-bit significand.]
Motivation

• Finite precision causes round-off error
  o Compromises certain calculations
  o Hard to detect and diagnose
• Increasingly important as HPC scales
  o Computation on streaming processors is faster in single precision
  o Data movement in double precision is a bottleneck
  o Need to balance speed (singles) and accuracy (doubles)
Our Goal

Automated analysis techniques to inform developers about floating-point behavior and make recommendations regarding the use of floating-point arithmetic.
Framework

CRAFT: Configurable Runtime Analysis for Floating-point Tuning

• Static binary instrumentation
  o Read configuration settings
  o Replace floating-point instructions with new code
  o Rewrite modified binary
• Dynamic analysis
  o Run modified program on representative data set
  o Produce results and recommendations
Previous Work

• Cancellation detection
  o Reports loss of precision due to subtraction
  o Paper appeared in WHIST '11
• Range tracking
  o Reports min/max values
• Replacement
  o Implements mixed-precision configurations
  o Paper to appear in ICS '13
Mixed Precision

• Use double precision where necessary
• Use single precision everywhere else
• Can be difficult to implement

Mixed-precision linear solver algorithm (red text in the original slide indicates steps performed in double precision; all other steps are single precision):

1:  LU ← PA
2:  solve Ly = Pb
3:  solve Ux_0 = y
4:  for k = 1, 2, ... do
5:    r_k ← b − A·x_(k−1)
6:    solve Ly = P·r_k
7:    solve U·z_k = y
8:    x_k ← x_(k−1) + z_k
9:    check for convergence
10: end for
Configuration

• In-place replacement
  o Narrowed focus: doubles → singles
  o In-place downcast conversion
  o Flag in the high bits to indicate replacement

[Figure: a 64-bit double slot is reused in place. In a replaced double, the high 32 bits hold the flag 0x7FF4DEAD (a non-signalling NaN pattern) and the low 32 bits hold the downcast single.]
Implementation

Example: original code

  gvec[i,j] = gvec[i,j] * lvec[3] + gvar

  1  movsd 0x601e38(%rax,%rbx,8), %xmm0
  2  mulsd -0x78(%rsp), %xmm0
  3  addsd -0x4f02(%rip), %xmm0
  4  movsd %xmm0, 0x601e38(%rax,%rbx,8)

Example: rewritten code

  gvec[i,j] = gvec[i,j] * lvec[3] + gvar

  1  movsd 0x601e38(%rax,%rbx,8), %xmm0
       check/replace -0x78(%rsp) and %xmm0
  2  mulss -0x78(%rsp), %xmm0
       check/replace -0x4f02(%rip) and %xmm0
  3  addss -0x20dd43(%rip), %xmm0
  4  movsd %xmm0, 0x601e38(%rax,%rbx,8)
Block Editing (PatchAPI)

[Figure: the basic block containing the original instruction is split; inserted snippets perform initialization, the check/replace logic, double → single conversion, and cleanup around the original instruction.]
Automated Search

• Manual mixed-precision analysis
  o Hard to use without intuition regarding potential replacements
• Automatic mixed-precision analysis
  o Try lots of configurations (empirical auto-tuning)
  o Test with user-defined verification routine and data set
  o Exploit program control structure: replace larger structures (modules, functions) first
  o If coarse-grained replacements fail, try finer-grained subcomponent replacements
System Overview

[Figure: CRAFT system overview]
NAS Results

Benchmark      Candidates   Configurations   Instructions Replaced
(name.CLASS)                Tested           % Static    % Dynamic
bt.W              6,647        3,854           76.2         85.7
bt.A              6,682        3,832           75.9         81.6
cg.W                940          270           93.7          6.4
cg.A                934          229           94.7          5.3
ep.W                397          112           93.7         30.7
ep.A                397          113           93.1         23.9
ft.W                422           72           84.4          0.3
ft.A                422           73           93.6          0.2
lu.W              5,957        3,769           73.7         65.5
lu.A              5,929        2,814           80.4         69.4
mg.W              1,351          458           84.4         28.0
mg.A              1,351          456           84.1         24.4
sp.W              4,772        5,729           36.9         45.8
sp.A              4,821        5,044           51.9         43.0
AMGmk Results

• Algebraic MultiGrid microkernel
  o Multigrid method is highly adaptive
  o Good candidate for replacement
• Automatic search
  o Complete conversion (100% replacement)
• Manually-rewritten version
  o Speedup: 175 sec to 95 sec (1.8X)
  o Conventional x86_64 hardware
SuperLU Results

• Package for LU decomposition and linear solves
  o Reports final error residual
  o Both single- and double-precision versions
• Verified manual conversion via automatic search
  o Used error from the provided single-precision version as the threshold
  o Final configuration matched the single-precision profile (99.9% replacement)

Threshold   Instructions Replaced    Final Error
            % Static   % Dynamic
1.0e-03       99.1        99.9       1.59e-04
1.0e-04       94.1        87.3       4.42e-05
7.5e-05       91.3        52.5       4.40e-05
5.0e-05       87.9        45.2       3.00e-05
2.5e-05       80.3        26.6       1.69e-05
1.0e-05       75.4         1.6       7.15e-07
1.0e-06       72.6         1.6       4.7e-07
Retrospective

• Twofold original motivation
  o Faster computation (raw FLOPs)
  o Decreased storage footprint and memory bandwidth
  o Domains vary in sensitivity to these parameters
• Computation-centric analysis
  o Less insight for memory-constrained domains
  o Sometimes difficult to translate instruction-level recommendations into source-level transformations
• Data-centric analysis
  o Focus on data motion, which is closer to source-level structures
Current Project

• Memory-based replacement
  o Perform all computation in double precision
  o Save storage space by storing single-precision values in some cases
• Implementation
  o Register-based computation remains double precision
  o Replace movement instructions (movsd)
  o Memory to register: check and upcast
  o Register to memory: downcast if configured
  o Search for replaceable writes instead of replaceable computations
Preliminary Results

Benchmark      Candidates   Writes Replaced
(name.CLASS)                % Static   % Dynamic
cg.W                284       95.4        77.5
ep.W                226       96.0        28.4
ft.W                452       94.2        45.0
lu.W              1,782       68.3        81.3
mg.W                558       96.2        86.4
sp.W              1,607       80.7        84.7

All benchmarks were single-core versions compiled by the Intel Fortran compiler with optimization enabled. Tests were performed on an Intel workstation with 48 GB of RAM running 64-bit Linux.
Future Work

• Case studies
• Search convergence study
Conclusion

Automated binary instrumentation techniques can be used to implement mixed-precision configurations for floating-point code, and memory-based replacement provides actionable results.
Thank you!

sf.net/p/crafthpc