Specialization
• if input values / characteristics are known before a computation:
– fold computation into the compiler
– reduce runtime workload
• on RC:
– specialization --> reduced area --> more parallel resources
• specialize common case(s), detect+trap uncommon cases
• when and how to specialize?
Binding Times
• binding times:
– compile time
– execution / invocation time
– run-time, lexical (on block entry/exit)
– run-time, dynamic (arbitrary points, epochal changes)
• how does the specializer know binding times?
– compiler analysis
– annotation-based (user-written / profile-driven)
– dynamically-discovered
Exploitation models
• algorithmic transformation (partial evaluation)
• value-range / data-width specialization
• template / place-holder
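Partial evaluation can be illustrated in software terms. A minimal Python sketch (function names are hypothetical, not from the slides' toolchain): bind the statically known input early, leaving a residual function over the remaining dynamic input.

```python
def make_power(n):
    """Specialize x**n for an exponent n known ahead of time.

    n is the static (early-bound) input; the returned residual
    function depends only on the dynamic input x.  Hypothetical
    sketch of partial evaluation, not a real specializer.
    """
    def power(x):
        result = 1
        for _ in range(n):  # loop bound is static: could be fully unrolled
            result *= x
        return result
    return power

square = make_power(2)  # residual program specialized for n = 2
```

In hardware the same move folds the static input into the circuit itself, shrinking area rather than just saving instructions.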
Specialization in different programming models
• logic-level (HDL, schematic)
– bit-level constant folding + propagation
• HLL
– lexical info (block + procedure binding times)
– word-level
• generators
– application-specific (algorithmic)
Programming RC with HLL
• HLL vs. RC:
– code block <==> LUTs (replicated?)
– variables/temps <==> registers (distributed?)
– function call <==> inlined HW / context switch
• easy on programmer
– familiar
– word-model hides bit-level details
• hard on compiler
What to specialize in HLL?
• unchanging / slowly-changing variables
– partially-evaluate ALU ops
– prune unused (uncommon) control-flow paths
• narrower data-paths
– e.g. DSP code specifies 32-bit ints for a 16-bit op + overflow
• replicate code, apply different specializations
– choose based on call-chain, parameters, data values
Why is RC on HLL difficult?
• HLL cannot express HW naturally
– inherently sequential
• hard to expose parallelism, hard to partition
– irregular control flow (recursion, multiple exit paths)
– timothyc - garpcc/gama:
<http://www.cs.berkeley.edu/Research/Projects/brass/mapper.html>
• specialization
– word-model too coarse (8, 16, 32, 64 bits)
– pointers complicate static analysis
– no runtime constants in some languages
Profiling-Guided Specialization
• Dynamically identify constant & slowly-changing values (bits)
• how to use:
– annotate source, user can verify
– no runtime guarantees on characteristics discovered
• specialize common case, detect+trap uncommon case
– e.g. check overflow on narrow-width op
– e.g. compare specialized constant input to real input (on each cycle)
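The common-case/trap pattern can be sketched in software. A hypothetical Python sketch (names invented here): the guard re-checks the profiled assumption on every call and falls back to the general case when it breaks.

```python
def make_specialized_add(const_b, width=16):
    """Specialize a + b for a profiled constant operand b and a
    narrow data-path (hypothetical sketch).  The guard detects
    the uncommon case: wrong operand or narrow-width overflow,
    and 'traps' to the general full-width add.
    """
    limit = 1 << width
    def add(a, b):
        if b != const_b:      # uncommon case: profiled assumption broken
            return a + b      # trap to general implementation
        r = a + const_b       # common case: specialized path
        if r >= limit:        # overflow on narrow-width op
            return a + b      # trap
        return r
    return add
```

On RC the trap would invalidate the specialized circuit and trigger re-specialization, rather than branching in software.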
Constancy models: Storage-based
• identify constant or slowly-changing bits of program variables + heap
– bit-read == input to computation
• binding-time associated with HLL block scopes
• HW benefit not immediately clear (unless all temp vars are expanded)
• pointer aliasing makes it difficult, less useful
• difficult to detect epochal changes
Constancy models: Instruction-based
• identify constant or slowly-changing inputs to instructions
• HW benefit easy to quantify
• binding-time not obvious
• instruction space too large --> filter instructions of interest
Bit Constancy Lattice
• binding time for bits of variables (storage-based)
CBD …… Constant between definitions          SCBD …… + signed
CSSI …… Constant in some scope invocations   SCSSI …… + signed
CESI …… Constant in each scope invocation    SCESI …… + signed
CASI …… Constant across scope invocations    SCASI …… + signed
CAPI …… Constant across program invocations
const …… declared const
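The unsigned spine of this lattice can be encoded as an ordered list, strongest class first; observing more program behavior can only weaken a bit's classification. A hypothetical sketch (signed S* variants omitted for brevity):

```python
# Unsigned spine of the bit-constancy lattice, strongest guarantee first.
ORDER = ["const", "CAPI", "CASI", "CESI", "CSSI", "CBD"]

def weaken(current, observed):
    """Meet of two classifications: a profiled bit can only move
    toward a weaker constancy guarantee as more runs are observed."""
    return max(current, observed, key=ORDER.index)
```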
Experiments
• Applications:– UCLA MediaBench:
adpcm, epic, g721, gsm, jpeg, mesa, mpeg2 (not shown today: ghostscript, pegwit, pgp, rasta)
– gzip, versatility, SPECint95 (parts)
• compiler optimize --> instrument for profiling --> run
• analyze variable usage, ignore heap
– heap-reads typically 0-10% of all bit-reads
– 90-10 rule (variables): ~90% of bit reads in 1-20% of bits
Bit Classification - Variables
(MediaBench, averaged per program)
(pie chart) SCASI 28%, CASI 38%, CBD 16%, CSSI 1%, SCBD 4%, SCSSI 0.5%, CESI 6%, SCESI 2%, const 4%
Bit Classification
• does not reveal area savings
– variables not all in HW
• huge variation
– SCASI, CASI, CBD stddev ~25%
Bit-Read Classification - Variables
(MediaBench, averaged per program)
(pie chart) SCASI 40%, CASI 11%, SCESI 2%, CESI 5%, SCSSI 7%, CSSI 13%, SCBD 7%, CBD 15%, const 0.3%
Bit-Reads Classification
• regular across programs
– SCASI, CASI, CBD stddev ~11%
• nearly no activity in variables declared const
• ~65% in constant + signed bits
– trivially exploited
Constant Bit-Ranges
• data paths are too wide
• 55% of all bit-reads are to sign-bits
• most CASI reads clustered in bit-ranges (10% of 11%)
• CASI+SCASI reads (50%) are positioned:
– 39% high-order, 8% whole-word constant, 2% low-order, 1% elsewhere
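Detecting a constant high-order bit-range from a profile trace is a simple fold over observed values. A hypothetical sketch (sign-extension is why high-order bits are so often all-0 or all-1):

```python
def constant_high_bits(values, width=32):
    """Count how many high-order bits stay constant across all
    observed values of a width-bit operand (hypothetical sketch).
    XOR against the first value marks every bit position that
    ever changed; the range above the top varying bit is constant.
    """
    mask = (1 << width) - 1
    first = values[0] & mask
    varying = 0
    for v in values[1:]:
        varying |= (v & mask) ^ first  # 1-bits = positions that changed
    return width - varying.bit_length()
```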
Conclusion - Storage Profiling
• lots of constant bits in variables
• not clear how to exploit
– need to see intermediate operations too
• (i) expand all temps as variables
• (ii) instruction-based profiling
• does not distinguish radically different uses of same code
– could replicate procs for each call path, then profile
• does not distinguish epochs
Multiplier Study
• common, important op for multimedia
• area savings amplified: A = O(N^2) (input width N)
• instruction-based profiling, restricted to 1 instruction
• Questions:
– which multipliers to specialize?
– how?
– how often?
Multiplier Area Model
• NxM multiplier array
– partial products, carry-save reduction, carry-propagate final
– A = App + Acs + Arc
• both inputs variable:
– generate PPs 2x2 --> 4
• 1 input constant (N):
– fully specialized: <= N/2 adds
– template-based: generate PPs 4x1 --> 1
• 2 inputs constant:
– fold (zero cost)
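The area model A = App + Acs + Arc can be sketched numerically. The coefficients below are illustrative assumptions (full-adder-equivalent units), not the slides' actual tool; only the structure, quadratic area for two variable inputs, at most N/2 adder rows for one constant input, zero for two, follows the slide.

```python
def mult_area(n, m, const_inputs=0):
    """Rough area model for an N x M array multiplier.
    A = A_pp + A_cs + A_rc; unit and coefficients are
    illustrative assumptions, not measured values.
    """
    if const_inputs == 2:
        return 0                 # fold: product is a compile-time constant
    if const_inputs == 1:
        return (n // 2) * m      # fully specialized: <= N/2 adder rows
    a_pp = n * m                 # one AND per partial-product bit
    a_cs = (n - 2) * m           # carry-save reduction rows
    a_rc = n + m                 # final carry-propagate adder
    return a_pp + a_cs + a_rc
```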
Bit Constancy model
• associate source-code multiplier <==> HW multiplier
• examine inputs on each mult. invocation to identify constant input bits
• probabilistic constants
– specialize around slowly-changing bits as if constant, re-specialize as necessary
• input model: bit-ranges [S|V], [C|V|C]
– e.g. [C1|V1] * [C2|V2] = C1C2 + (C1V2 + C2V1) + V1V2
– “slowly-changing” bit range has high: (uses/change)
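The [C1|V1] * [C2|V2] identity above is just the product of two split operands x = C + V. A hypothetical Python sketch checking the algebra (bit-range widths and names are assumptions):

```python
def split(x, var_bits):
    """Split x = C + V: constant high bit-range C, variable low
    bit-range V (assumes the high bits were profiled constant)."""
    v = x & ((1 << var_bits) - 1)
    return x - v, v

def specialized_mult(x, y, var_bits):
    """[C1|V1] * [C2|V2] = C1*C2 + (C1*V2 + C2*V1) + V1*V2.
    C1*C2 folds at specialization time; C1*V2 and C2*V1 are cheap
    constant-by-variable products; only V1*V2 needs a full
    variable multiplier, on the narrow ranges."""
    c1, v1 = split(x, var_bits)
    c2, v2 = split(y, var_bits)
    return c1 * c2 + (c1 * v2 + c2 * v1) + v1 * v2
```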
Savings / Cost of Specialization
• which multipliers should be specialized? how to decompose into bit ranges?
– determines how often to specialize, i.e. specialization cost
• assume constant (parameterized) cost per specialization
– specialization = generate context + load/switch context
• minimize total multiplier cost
– (specialized instance cost) * uses + (specialization cost), summed over all multiplier instances
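The objective above, written out as a hypothetical sketch (tuple layout and units are assumptions):

```python
def total_cost(instances, spec_cost):
    """Total multiplier cost to minimize:
    sum over instances of (specialized instance cost * uses)
    plus a constant spec_cost per specialization event.
    instances: list of (specialized_area, uses, n_specializations).
    """
    return sum(area * uses + spec_cost * n_spec
               for area, uses, n_spec in instances)
```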
Evaluating Benefit of Specialization
• multiplier “goodness” = (saved area) * uses / specializations
– i.e. savings per specialization, scaled by #uses
– goodness depends on bit-range decomposition
• specialize multipliers in order of decreasing goodness
• generate curves:
– total cost vs. #specialized multiplier
– total cost vs. #specializations (has minimum parameterized by specialization cost)
– minimized total cost vs. specialization cost
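The greedy ordering can be sketched directly from the goodness formula (candidate tuple layout is an assumption made here):

```python
def goodness(saved_area, uses, n_spec):
    """Area savings per specialization event, scaled by use count:
    (saved area) * uses / specializations."""
    return saved_area * uses / n_spec

def rank(candidates):
    """Order candidate multipliers for specialization, best first.
    candidates: list of (name, saved_area, uses, n_specializations)."""
    return sorted(candidates,
                  key=lambda c: goodness(*c[1:]),
                  reverse=True)
```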