
Profiling to Discover Specialization Opportunities

CS 294-6

11/3/98

Eylon Caspi

Specialization

• if input values / characteristics are known before a computation:
– fold computation into compiler

– reduce runtime workload

• on RC:
– specialization --> reduce area --> more parallel resources

• specialize common case(s), detect+trap uncommon cases

• when and how to specialize?
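The folding idea can be sketched in software terms (a hypothetical Python illustration, not from the talk): when one input is known before the computation runs, a specializer can bake it in once and prune work the general version would repeat on every call.

```python
# Hypothetical sketch: specializing a dot product when the coefficient
# vector is known ahead of time.

def make_dot(coeffs):
    """Fold the known coefficients into the generated function.
    Zero coefficients are pruned entirely, so the runtime workload
    shrinks to only the surviving terms."""
    terms = [(i, c) for i, c in enumerate(coeffs) if c != 0]
    def dot(xs):
        return sum(c * xs[i] for i, c in terms)
    return dot

# Specialize once, then reuse on many inputs.
dot4 = make_dot([0, 2, 0, 5])   # only two live terms remain
print(dot4([1, 1, 1, 1]))       # 2*1 + 5*1 = 7
```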

Binding Times

• binding times:
– compile time

– execution / invocation time

– run-time, lexical (on block entry/exit)

– run-time, dynamic (arbitrary points, epochal changes)

• how does specializer know binding times?
– compiler analysis

– annotation-based (user-written / profile-driven)

– dynamically-discovered

Exploitation models

• algorithmic transformation (partial evaluation)

• value-range / data-width specialization

• template / place-holder

Specialization in different programming models

• logic-level (HDL, schematic)
– bit-level constant folding + propagation

• HLL
– lexical info (block + procedure binding times)

– word-level

• generators
– application-specific (algorithmic)

Programming RC with HLL

• HLL vs. RC:
– code block <==> LUTs (replicated?)

– variables/temps <==> registers (distributed?)

– function call <==> inlined HW / context switch

• easy on programmer
– familiar

– word-model hides bit-level details

• hard on compiler

What to specialize in HLL?

• unchanging / slowly-changing variables
– partially evaluate ALU ops

– prune unused (uncommon) control-flow paths

• narrower data-paths– e.g. DSP specifies 32-bit ints for 16-bit op + overflow

• replicate code, apply different specializations
– choose based on call-chain, parameters, data values
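The narrow-datapath point can be illustrated with a hedged sketch (names like `add_full_width` are invented for illustration): run the common case on a 16-bit path and fall back to the full-width path when an overflow check fails.

```python
# Hypothetical sketch: specialize a 32-bit op to a 16-bit datapath,
# trapping to the general case when the narrow assumption fails.

INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1

def add_specialized(a, b):
    """Common case: 16-bit add. Uncommon case: detected by an
    overflow check and handled by the full-width path."""
    s = a + b
    if INT16_MIN <= s <= INT16_MAX:
        return s                      # narrow datapath suffices
    return add_full_width(a, b)       # trap: uncommon case

def add_full_width(a, b):
    return a + b                      # stand-in for the wide path

print(add_specialized(30000, 5))      # stays in the narrow path: 30005
print(add_specialized(30000, 30000))  # overflow -> trapped: 60000
```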

Why is RC on HLL difficult?

• HLL cannot express HW naturally
– inherently sequential

• hard to expose parallelism, hard to partition

– irregular control flow (recursion, multiple exit paths)

– timothyc - garpcc/gama:

<http://www.cs.berkeley.edu/Research/Projects/brass/mapper.html>

• specialization
– word-model too coarse (8, 16, 32, 64 bits)

– pointers complicate static analysis

– no runtime constants in some languages

Profiling-Guided Specialization

• Dynamically identify constant & slowly-changing values (bits)

• how to use:
– annotate source, user can verify

– no runtime guarantees on characteristics discovered

• specialize common case, detect+trap uncommon case
– e.g. check overflow on narrow-width op

– e.g. compare specialized constant input to real input (on each cycle)

Constancy models: Storage-based

• identify constant or slowly-changing bits of program variables + heap
– bit-read == input to computation

• binding-time associated with HLL block scopes

• HW benefit not immediately clear (unless expand all temp vars)

• pointer aliasing makes it difficult, less useful

• difficult to detect epochal changes

Constancy models: Instruction-based

• identify constant or slowly-changing inputs to instructions

• HW benefit easy to quantify

• binding-time not obvious

• instruction space too large --> filter instructions of interest

Bit Constancy Lattice

• binding time for bits of variables (storage-based)

– CBD …… constant between definitions (SCBD: + signed)

– CSSI …… constant in some scope invocations (SCSSI: + signed)

– CESI …… constant in each scope invocation (SCESI: + signed)

– CASI …… constant across scope invocations (SCASI: + signed)

– CAPI …… constant across program invocations

– const …… declared const
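A rough sketch of how a storage-based profiler might assign some of these classes from per-scope-invocation value traces (a loose approximation of the lattice, not the talk's actual algorithm; the label choices here are an assumption).

```python
# Hypothetical sketch: classify each bit of a variable against part of
# the bit-constancy lattice, from per-invocation value traces.

def classify_bits(traces, width):
    """traces[k] lists the values the variable took during scope
    invocation k; returns one lattice-class label per bit position."""
    labels = []
    for bit in range(width):
        per_inv = [{(v >> bit) & 1 for v in t} for t in traces if t]
        if all(len(s) == 1 for s in per_inv):
            if len({next(iter(s)) for s in per_inv}) == 1:
                labels.append("CASI")  # constant across scope invocations
            else:
                labels.append("CESI")  # constant in each invocation only
        elif any(len(s) == 1 for s in per_inv):
            labels.append("CSSI")      # constant in some invocations
        else:
            labels.append("CBD")       # at best constant between defs
    return labels

print(classify_bits([[5, 5], [5, 7]], 3))  # -> ['CASI', 'CSSI', 'CASI']
```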

Experiments

• Applications:
– UCLA MediaBench:
adpcm, epic, g721, gsm, jpeg, mesa, mpeg2 (not shown today: ghostscript, pegwit, pgp, rasta)

– gzip, versatility, SPECint95 (parts)

• Compiler optimize --> instrument for profiling --> run

• analyze variable usage, ignore heap

– heap-reads typically 0-10% of all bit-reads

– 90-10 rule (variables): ~90% of bit-reads are in 1-20% of bits

Bit Classification - Variables

(MediaBench, averaged per program)

const 4%, SCASI 28%, CASI 38%, SCESI 2%, CESI 6%, SCSSI 0.5%, CSSI 1%, SCBD 4%, CBD 16%

Bit Classification

• does not reveal area savings
– variables not all in HW

• huge variation
– SCASI, CASI, CBD stddev ~25%

Bit-Read Classification - Variables

(MediaBench, averaged per program)

const 0.3%, SCASI 40%, CASI 11%, SCESI 2%, CESI 5%, SCSSI 7%, CSSI 13%, SCBD 7%, CBD 15%

Bit-Reads Classification

• regular across programs
– SCASI, CASI, CBD stddev ~11%

• nearly no activity in variables declared const

• ~65% in constant + signed bits

– trivially exploited

Constant Bit-Ranges

• data paths are too wide

• 55% of all bit-reads are to sign-bits

• most CASI reads clustered in bit-ranges (10% of 11%)

• CASI+SCASI reads (~50%) are positioned:
– 39% high-order, 8% whole-word constant, 2% low-order, 1% elsewhere

Conclusion - Storage Profiling

• lots of constant bits in variables

• not clear how to exploit
– need to see intermediate operations too:

• (i) expand all temps as variables

• (ii) instruction-based profiling

• does not distinguish radically different uses of same code
– could replicate procs for each call path, then profile

• does not distinguish epochs

Multiplier Study

• common, important op for multimedia

• area savings amplified: A = O(N^2) (input width N)

• instruction-based profiling, restricted to 1 instruction

• Questions:

– which multipliers to specialize?

– how?

– how often?

Multiplier Area Model

• NxM multiplier array
– partial products, carry-save reduction, carry-propagate final add

– A = App + Acs + Arc

• both inputs variable:
– generate PPs 2x2 --> 4

• 1 input constant (N bits):
– fully specialized: <= N/2 adds

– template-based: generate PPs 4x1 --> 1

• 2 inputs constant:
– fold (zero cost)
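These cases can be captured in a toy area model (unit cell costs below are stand-in assumptions, not figures from the talk):

```python
# Hypothetical sketch of the NxM array-multiplier area model,
# A = App + Acs + Arc, with invented unit cell costs.

def area_both_variable(n, m):
    app = n * m      # partial-product generation (AND array)
    acs = n * m      # carry-save reduction cells
    arc = n + m      # final carry-propagate adder
    return app + acs + arc

def area_one_constant(n, m):
    # fully specialized against an N-bit constant: <= N/2 adds survive
    surviving_rows = n // 2
    return surviving_rows * m + (n + m)

def area_both_constant():
    return 0         # product folds to a constant at zero cost

print(area_both_variable(16, 16))  # 544 with these unit costs
print(area_one_constant(16, 16))   # 160: roughly 3.4x smaller
```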

Bit Constancy model

• associate source-code multiplier <==> HW multiplier

• examine inputs on each mult. invocation to identify constant input bits

• probabilistic constants
– specialize around slowly-changing bits as if constant, re-specialize as necessary

• input model: bit-ranges [S|V], [C|V|C]
– e.g. [C1|V1] * [C2|V2] = C1C2 + (C1V2 + C2V1) + V1V2

– a “slowly-changing” bit range has a high (uses/change) ratio
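A minimal sketch of recovering a [C|V] bit-range split from profiled input samples (function name and interface are invented):

```python
# Hypothetical sketch: split an observed multiplier input into the
# [C|V] form -- a constant high-order bit range over a variable
# low-order range -- from profiled samples.

def split_const_var(samples, width):
    """Return (const_high_bits, var_width): the high-order bits that
    never changed across the samples, and the width of the varying
    low-order range."""
    changed = 0
    for v in samples:
        changed |= v ^ samples[0]     # bits that ever differed
    var_width = changed.bit_length()  # lowest bits that vary
    const_high = samples[0] >> var_width
    return const_high, var_width

# e.g. coefficients that only vary in their low 4 bits:
print(split_const_var([0x35, 0x3A, 0x33], 8))  # -> (3, 4)
```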

Savings / Cost of Specialization

• which multipliers should be specialized? how to decompose into bit-ranges?
– determines how often to specialize, i.e. specialization cost

• assume constant (parameterized) cost per specialization
– specialization = generate context + load/switch context

• minimize total multiplier cost:
– (specialized instance cost) * uses + (specialization cost), summed over all multiplier instances

Evaluating Benefit of Specialization

• multiplier “goodness” = (saved area) * uses / specializations
– i.e. savings per specialization, scaled by #uses

– goodness depends on bit-range decomposition

• specialize multipliers in order of decreasing goodness

• generate curves:

– total cost vs. #specialized multipliers

– total cost vs. #specializations (has a minimum parameterized by specialization cost)

– minimized total cost vs. specialization cost
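The goodness ordering and the total-cost curve can be sketched as follows (all interfaces hypothetical; the instance tuples in the test are made-up data, not measurements):

```python
# Hypothetical sketch: rank multiplier instances by
# goodness = saved_area * uses / specializations, then accumulate the
# total-cost curve as instances are specialized in that order.

def goodness(saved_area, uses, specializations):
    return saved_area * uses / specializations

def cost_curve(instances, spec_cost):
    """instances: list of (full_area, spec_area, uses, n_spec).
    Greedily specialize in decreasing-goodness order; return the
    total-cost curve, where index k = k instances specialized."""
    order = sorted(instances,
                   key=lambda t: goodness(t[0] - t[1], t[2], t[3]),
                   reverse=True)
    total = sum(fa * u for fa, _, u, _ in order)  # nothing specialized
    curve = [total]
    for fa, sa, u, ns in order:
        # swap the full-area term for the specialized one, pay spec cost
        total += (sa - fa) * u + ns * spec_cost
        curve.append(total)
    return curve
```

The curve's minimum (possibly before all instances are specialized) gives the profitable stopping point for a given specialization cost.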

Flame Session