Floating Point Analysis Using Dyninst

Mike Lam, University of Maryland, College Park
Jeff Hollingsworth, Advisor
Background

• Floating point represents real numbers as ±significand × 2^exponent
  o Sign bit
  o Exponent
  o Significand ("mantissa" or "fraction")
• Finite precision
  o Single precision: 24 bits (~7 decimal digits)
  o Double precision: 53 bits (~16 decimal digits)

[Figure: IEEE single format: sign bit, 8-bit exponent, 23-bit significand. IEEE double format: sign bit, 11-bit exponent, 52-bit significand.]
Motivation

• Finite precision causes round-off error
  o Compromises certain calculations
  o Hard to detect and diagnose
• Increasingly important as HPC scales
  o Computation on streaming processors is faster in single precision
  o Data movement in double precision is a bottleneck
  o Need to balance speed (singles) and accuracy (doubles)
Our Goal

Automated analysis techniques to inform developers about floating-point behavior and make recommendations regarding the use of floating-point arithmetic.
Framework

CRAFT: Configurable Runtime Analysis for Floating-point Tuning

• Static binary instrumentation
  o Read configuration settings
  o Replace floating-point instructions with new code
  o Rewrite modified binary
• Dynamic analysis
  o Run modified program on representative data set
  o Produce results and recommendations
Previous Work

• Cancellation detection
  o Reports loss of precision due to subtraction
  o Paper appeared in WHIST '11
• Range tracking
  o Reports min/max values
• Replacement
  o Implements mixed-precision configurations
  o Paper to appear in ICS '13
Mixed Precision

• Use double precision where necessary
• Use single precision everywhere else
• Can be difficult to implement

Mixed-precision linear solver algorithm (red text in the original slide indicates steps performed in double precision; all other steps are single precision):

1:  LU ← PA
2:  solve Ly = Pb
3:  solve Ux_0 = y
4:  for k = 1, 2, ... do
5:    r_k ← b − A·x_(k−1)
6:    solve Ly = P·r_k
7:    solve U·z_k = y
8:    x_k ← x_(k−1) + z_k
9:    check for convergence
10: end for
Configuration

• In-place replacement
  o Narrowed focus: doubles → singles
  o In-place downcast conversion
  o Flag in the high bits to indicate replacement

[Figure: a 64-bit double slot is reused in place. In a replaced double, the high 32 bits hold the flag 0x7FF4DEAD (a non-signalling NaN pattern) and the low 32 bits hold the downcast single.]
Implementation

Example: original code

  gvec[i,j] = gvec[i,j] * lvec[3] + gvar

  1  movsd 0x601e38(%rax,%rbx,8), %xmm0
  2  mulsd -0x78(%rsp), %xmm0
  3  addsd -0x4f02(%rip), %xmm0
  4  movsd %xmm0, 0x601e38(%rax,%rbx,8)

Example: rewritten code

  gvec[i,j] = gvec[i,j] * lvec[3] + gvar

  1  movsd 0x601e38(%rax,%rbx,8), %xmm0
       check/replace -0x78(%rsp) and %xmm0
  2  mulss -0x78(%rsp), %xmm0
       check/replace -0x4f02(%rip) and %xmm0
  3  addss -0x20dd43(%rip), %xmm0
  4  movsd %xmm0, 0x601e38(%rax,%rbx,8)
Block Editing (PatchAPI)

[Figure: the basic block containing the original instruction is split; inserted snippets perform initialization, the check/replace logic, double → single conversion, and cleanup around the original instruction.]
Automated Search

• Manual mixed-precision analysis
  o Hard to use without intuition regarding potential replacements
• Automatic mixed-precision analysis
  o Try lots of configurations (empirical auto-tuning)
  o Test with user-defined verification routine and data set
  o Exploit program control structure: replace larger structures (modules, functions) first
  o If coarse-grained replacements fail, try finer-grained subcomponent replacements
System Overview

[Figure: CRAFT system overview]
NAS Results

Benchmark      Candidates   Configurations   Instructions Replaced
(name.CLASS)                Tested           % Static    % Dynamic
bt.W              6,647        3,854           76.2         85.7
bt.A              6,682        3,832           75.9         81.6
cg.W                940          270           93.7          6.4
cg.A                934          229           94.7          5.3
ep.W                397          112           93.7         30.7
ep.A                397          113           93.1         23.9
ft.W                422           72           84.4          0.3
ft.A                422           73           93.6          0.2
lu.W              5,957        3,769           73.7         65.5
lu.A              5,929        2,814           80.4         69.4
mg.W              1,351          458           84.4         28.0
mg.A              1,351          456           84.1         24.4
sp.W              4,772        5,729           36.9         45.8
sp.A              4,821        5,044           51.9         43.0
AMGmk Results

• Algebraic MultiGrid microkernel
  o Multigrid method is highly adaptive
  o Good candidate for replacement
• Automatic search
  o Complete conversion (100% replacement)
• Manually-rewritten version
  o Speedup: 175 sec to 95 sec (1.8X)
  o Conventional x86_64 hardware
SuperLU Results

• Package for LU decomposition and linear solves
  o Reports final error residual
  o Both single- and double-precision versions
• Verified manual conversion via automatic search
  o Used error from the provided single-precision version as the threshold
  o Final configuration matched the single-precision profile (99.9% replacement)

Threshold   Instructions Replaced    Final Error
            % Static   % Dynamic
1.0e-03       99.1        99.9       1.59e-04
1.0e-04       94.1        87.3       4.42e-05
7.5e-05       91.3        52.5       4.40e-05
5.0e-05       87.9        45.2       3.00e-05
2.5e-05       80.3        26.6       1.69e-05
1.0e-05       75.4         1.6       7.15e-07
1.0e-06       72.6         1.6       4.7e-07
Retrospective

• Twofold original motivation
  o Faster computation (raw FLOPs)
  o Decreased storage footprint and memory bandwidth
  o Domains vary in sensitivity to these parameters
• Computation-centric analysis
  o Less insight for memory-constrained domains
  o Sometimes difficult to translate instruction-level recommendations into source-level transformations
• Data-centric analysis
  o Focus on data motion, which is closer to source-level structures
Current Project

• Memory-based replacement
  o Perform all computation in double precision
  o Save storage space by storing single-precision values in some cases
• Implementation
  o Register-based computation remains double precision
  o Replace movement instructions (movsd)
  o Memory to register: check and upcast
  o Register to memory: downcast if configured
  o Search for replaceable writes instead of replaceable computations
Preliminary Results

Benchmark      Candidates   Writes Replaced
(name.CLASS)                % Static   % Dynamic
cg.W                284       95.4        77.5
ep.W                226       96.0        28.4
ft.W                452       94.2        45.0
lu.W              1,782       68.3        81.3
mg.W                558       96.2        86.4
sp.W              1,607       80.7        84.7

All benchmarks were single-core versions compiled by the Intel Fortran compiler with optimization enabled. Tests were performed on an Intel workstation with 48 GB of RAM running 64-bit Linux.
Future Work

• Case studies
• Search convergence study
Conclusion

Automated binary instrumentation techniques can be used to implement mixed-precision configurations for floating-point code, and memory-based replacement provides actionable results.
Thank you!

sf.net/p/crafthpc