Upload
kiri
View
158
Download
0
Tags:
Embed Size (px)
DESCRIPTION
HW/SW Synthesis. Outline. Synthesis CFSM Optimization Software synthesis Problem Task synthesis Performance analysis Task scheduling Compilation. POLIS Methodology. Graphical EFSM+ Esterel. Java. EC. Compilers. SW Synthesis. HW Synthesis. CFSMs. Partitioning. SW Estimation. - PowerPoint PPT Presentation
Citation preview
HW/SW SynthesisHW/SW Synthesis
2
OutlineOutline
SynthesisSynthesis
CFSM OptimizationCFSM Optimization
Software synthesisSoftware synthesis ProblemProblem
Task synthesisTask synthesis
Performance analysisPerformance analysis
Task schedulingTask scheduling
CompilationCompilation
3
Aptix Board Consists of
– micro of choice
– FPGA’s– FPIC’s
Aptix Board Consists of
– micro of choice
– FPGA’s– FPIC’s
POLIS Methodology
Graphical EFSM+
Esterel
Graphical EFSM+
EsterelJava ................
CFSMs
Partitioning
SW Synthesis
SW Code + RTOS
Logic Netlist
HW Synthesis
SW Estimation HW Estimation
Physical Prototyping
HW/SW Co-SimulationPerformance/trade-off
EvaluationFormal Verification
Compilers
ECEC
4
Hardware - Software ArchitectureHardware - Software Architecture
Hardware:Hardware: Currently:Currently:
Programmable processors (micro-controllers, DSPs)Programmable processors (micro-controllers, DSPs) ASICs (FPGAs)ASICs (FPGAs)
Software:Software: Set of concurrent Set of concurrent taskstasks
Customized Real-Time Operating SystemCustomized Real-Time Operating System
Interfaces:Interfaces: Hardware modulesHardware modules
Software procedures (polling, interrupt handlers, ...)Software procedures (polling, interrupt handlers, ...)
5
System PartitioningSystem Partitioning
CFSM1CFSM1
CFSM7CFSM7
CFSM6CFSM6
CFSM5CFSM5
CFSM4CFSM4
CFSM3CFSM3
CFSM2CFSM2
e2e2
e8e8
e3e3
e2e2
e1e1
e9e9
e3e3
e5e5
e7e7
e9e9
port5port5
port1port1
port2port2
port3port3
HW partition 1HW partition 1
HW partition2HW partition2
SW partition 3SW partition 3
SchedulerScheduler
port6port6
port7port7
6
Software SynthesisSoftware Synthesis
Two-level processTwo-level process ““Technology” (processor) independent:Technology” (processor) independent:
best decision/assignment sequence given CFSMbest decision/assignment sequence given CFSM
““Technology” (processor) dependent:Technology” (processor) dependent:
conversion into machine codeconversion into machine code instruction selection
instruction scheduling
register assignment
(currently left to compiler)
need need performance and cost analysisperformance and cost analysis Worst Case Execution Time
code and data size
7
Software SynthesisSoftware Synthesis
Technology-independent phase:Technology-independent phase: Construction of Control-Data Flow Graph from CFSMConstruction of Control-Data Flow Graph from CFSM
(based on BDD representation of Transition Function)(based on BDD representation of Transition Function)
Optimization of CDFG forOptimization of CDFG for execution speedexecution speed code sizecode size(based on BDD sifting algorithm)(based on BDD sifting algorithm)
Technology-dependent phase:Technology-dependent phase: Creation of (restricted) C codeCreation of (restricted) C code Cost and performance analysisCost and performance analysis CompilationCompilation
8
Software Implementation ProblemSoftware Implementation Problem
Input: Input: Set of tasks (specified by CFSMs)Set of tasks (specified by CFSMs) Set of timing constraints (e.g., input event rates and response constraints)Set of timing constraints (e.g., input event rates and response constraints)
Output:Output: Set of procedures that implement the tasks Set of procedures that implement the tasks Scheduler that satisfies the timing constraintsScheduler that satisfies the timing constraints
Minimizing:Minimizing: CPU cost CPU cost Memory sizeMemory size Power, etc.Power, etc.
9
Software ImplementationSoftware Implementation
How to do it ? How to do it ?
Traditional approach:Traditional approach: Hand-coding of proceduresHand-coding of procedures
Hand-estimation of timing input to scheduling algorithmsHand-estimation of timing input to scheduling algorithms
Long and error-proneLong and error-prone
Our approach: three-step Our approach: three-step automated automated procedure:procedure: Synthesize each task separatelySynthesize each task separately
Extract (estimated) timingExtract (estimated) timing
Schedule the tasksSchedule the tasks
Customized RT-OS (scheduler + drivers)Customized RT-OS (scheduler + drivers)
10
Software ImplementationSoftware Implementation
Current strategy: Current strategy: Iterate between synthesis, estimation and schedulingIterate between synthesis, estimation and scheduling
Designer chooses the scheduling algorithm Designer chooses the scheduling algorithm
Future work: Future work: Top-down propagation of timing constraintsTop-down propagation of timing constraints
Software synthesis under constraintsSoftware synthesis under constraints
Automated scheduling selection Automated scheduling selection (based on CPU utilization estimates)(based on CPU utilization estimates)
11
Software Synthesis ProcedureSoftware Synthesis Procedure
Specification, partitioning
S-graph synthesis
Timing estimation
Scheduling, validationnot
feasible feasible
Code generation
Compilation
Testing, validation
Production
pass
fail
12
Task Implementation Task Implementation
Goal: quick response time, within timing and size constraintsGoal: quick response time, within timing and size constraints
Problem statement:Problem statement: Given a CFSM transition function and constraintsGiven a CFSM transition function and constraints Find a procedure implementing the Find a procedure implementing the transition functiontransition function while meeting the while meeting the
constraintsconstraints
The procedure code is acyclic:The procedure code is acyclic: Powerful optimization and analysis techniquesPowerful optimization and analysis techniques Looping, state storage etc. are implemented outside Looping, state storage etc. are implemented outside
(in the OS)(in the OS)
13
SW Modeling IssuesSW Modeling Issues
The software model should be:The software model should be: Low-level enough to allow detailed optimization and estimationLow-level enough to allow detailed optimization and estimation High-level enough to avoid excessive detailsHigh-level enough to avoid excessive details
e.g. register allocation, instruction selectione.g. register allocation, instruction selection
Main types of “user-mode” instructions:Main types of “user-mode” instructions: Data movementData movement ALUALU Conditional/unconditional branchesConditional/unconditional branches Subroutine callsSubroutine calls
RTOS handles I/O, interrupts and so onRTOS handles I/O, interrupts and so on
14
SW Modeling IssuesSW Modeling Issues
Focus on control-dominated applicationsFocus on control-dominated applications Address only CFSM control structure optimizationAddress only CFSM control structure optimization
Data path left as “don’t touch”Data path left as “don’t touch”
Use Use Decision Diagrams Decision Diagrams (Bryant ‘86)(Bryant ‘86) Appropriate for control-dominated tasksAppropriate for control-dominated tasks
Well-developed set of optimization techniquesWell-developed set of optimization techniques
Augmented with arithmetic and Boolean operators, to perform data Augmented with arithmetic and Boolean operators, to perform data computationscomputations
15
ROBDDsROBDDs
• Reduced Ordered BDDs [Bryant 86] • A node represents a function given by the
Shannon decompositionf = x f x + x f x
• Variable appears once on any path from root to terminal
• Variables are ordered• No two vertices represent the same
function • Canonical
• Two functions are equal if and only if their BDDs are isomorphic Þ direct application in equivalence checking
f = xf = x11 + x + x22 x x33
xx11
xx22
xx33
11 00
ROBDD
xx11ff
xx11ff
16
ROBDDs and Combinational VerificationROBDDs and Combinational Verification
Given two circuits:Given two circuits:Build the ROBDDs of the outputs in terms of the primary inputsBuild the ROBDDs of the outputs in terms of the primary inputs
Two circuits are equivalent if and only if the ROBDDs are isomorphicTwo circuits are equivalent if and only if the ROBDDs are isomorphic
Complexity of verification depends on the Complexity of verification depends on the sizesize of ROBDDsof ROBDDsCompact in many casesCompact in many cases
17
ROBDDs and Memory ExplosionROBDDs and Memory Explosion
ROBDDs are not always compactROBDDs are not always compactSize of an ROBDD can be Size of an ROBDD can be exponentialexponential in number of variables in number of variables
Can happen for real life circuits alsoCan happen for real life circuits also e.g. Multiplierse.g. Multipliers
Commonly known as: Memory Explosion Problem of ROBDDs
18
Technique for Handling ROBDD Memory Technique for Handling ROBDD Memory ExplosionExplosion
ROBDDsROBDDsEnhancementsEnhancementsVariable Variable
OrderingOrdering
Free BDDsFree BDDsOFDDs, OFDDs, OKFDDsOKFDDs
PartitionedPartitionedROBDDsROBDDs
RelaxRelaxOrderingOrdering
NodeNode DecompDecomp.. PartitioningPartitioning
All the representations are canonical combinational equivalence checking
19
bb33
Handling Memory Explosion: Variable OrderingHandling Memory Explosion: Variable Ordering
BDD size very sensitive to variable orderingBDD size very sensitive to variable ordering
aa11bb11 + a + a22bb22 + a + a33bb33
aa11
bb11aa22
bb22
aa33
bb33
1100
Good Ordering: 8 nodesGood Ordering: 8 nodes1100
Bad Ordering: 16 nodesBad Ordering: 16 nodes
aa11
aa
22
aa22
aa33 aa33 aa33aa33
bb11bb11 bb11 bb11
bb22bb22
20
aa11
bb11
aa22
bb22
1100
Handling Memory Explosion: Variable OrderingHandling Memory Explosion: Variable Ordering
Good static as well as dynamic ordering techniques existGood static as well as dynamic ordering techniques exist Dynamic variable reorderingDynamic variable reordering [Rudell 93] [Rudell 93]
Change variable order automatically during computationsChange variable order automatically during computations Repeatedly swap a variable with adjacent variableRepeatedly swap a variable with adjacent variable Swapping can be done locallySwapping can be done locally Select the best locationSelect the best location
aa11bb11 + a + a22bb22
aa11
aa22
bb22
1100
aa22
bb11 bb11
21
SW Model: S-graphsSW Model: S-graphs
Acyclic extended decision diagram computing a transition functionAcyclic extended decision diagram computing a transition function
S-graph structure:S-graph structure: Directed acyclic graphDirected acyclic graph
Set of finite-valued variablesSet of finite-valued variables
TEST nodes evaluate an expression and branch accordinglyTEST nodes evaluate an expression and branch accordingly
ASSIGN nodes evaluate an expression and assign its result to a variableASSIGN nodes evaluate an expression and assign its result to a variable
Basic block + branch is a general CDFG modelBasic block + branch is a general CDFG model(but we constrain it to be (but we constrain it to be acyclicacyclic for optimization) for optimization)
22
An Example of S-graphAn Example of S-graph
a := a a := a + 1+ 1
a := 0a := 0
detect(c)detect(c)a<a<
bb
BEGINBEGIN
ENDEND
FF
TT
TT FF
– input event c– output event y– state int a– input int b– forever
if (detect(c))
if (a < b)
a := a + 1
emit(y)
else
a := 0
emit(y)
emit(y)emit(y)
23
S-graphs and FunctionsS-graphs and Functions
Execution of an s-graph computes a function from a set of input and Execution of an s-graph computes a function from a set of input and
state variables to a set of output and state variables:state variables to a set of output and state variables: Output variables are initially undefinedOutput variables are initially undefined Traverse the s-graph from BEGIN to ENDTraverse the s-graph from BEGIN to END
Well-formed s-graph: Well-formed s-graph: Every time a function depending on a variable is evaluated, that variable has a Every time a function depending on a variable is evaluated, that variable has a
defined valuedefined value
How do we derive an s-graph implementing a given function ?How do we derive an s-graph implementing a given function ?
24
S-graphs and FunctionsS-graphs and Functions
Problem statement:Problem statement: Given: a finite-valued multi-output function over a set of finite-valued variablesGiven: a finite-valued multi-output function over a set of finite-valued variables Find: an s-graph implementing itFind: an s-graph implementing it
Procedure based on Shannon expansionProcedure based on Shannon expansionf = x ff = x fxx + x’ f + x’ fx’x’
Result heavily depends on ordering of variables in expansionResult heavily depends on ordering of variables in expansion Inputs before outputs: TESTs dominate over ASSIGNsInputs before outputs: TESTs dominate over ASSIGNs Outputs before inputs: ASSIGNs dominate over TESTsOutputs before inputs: ASSIGNs dominate over TESTs
25
Example of S-graph ConstructionExample of S-graph Construction
x = a b + c
y = a b + daa
bbcc
dd
x := 1x := 1
y := 1y := 1
00 11
00 11
11
11
dd
00
x := 1x := 1
y := 0y := 0
x := 0x := 0
y := 1y := 1
x := 0x := 0
y := 0y := 0
0000 11
Order: a, b, c, d, x, y Order: a, b, c, d, x, y (inputs before (inputs before outputs)outputs)
26
Example of S-graph ConstructionExample of S-graph Construction
x = a b + cx = a b + c
y = a b + dy = a b + daa
bb
x := 1x := 1
y := 1y := 1
00 11
00 11
x := cx := c
y := dy := d
Order: a, b, x, y, c, d (interleaving inputs and outputs)
27
S-graph OptimizationS-graph Optimization
General trade-off: General trade-off: TEST-based is faster than ASSIGN-based (each variable is visited at most once)TEST-based is faster than ASSIGN-based (each variable is visited at most once)
ASSIGN-based is smaller than TEST-based (there is more potential for sharing)ASSIGN-based is smaller than TEST-based (there is more potential for sharing)
Implemented as Implemented as constrained siftingconstrained sifting of the Transition Function BDD of the Transition Function BDD
The procedure can be iterated over s-graph fragments:The procedure can be iterated over s-graph fragments: Local optimization, depending on fragment criticality (speed versus size)Local optimization, depending on fragment criticality (speed versus size)
Constraint-driven optimization (still to be explored)Constraint-driven optimization (still to be explored)
28
From S-graphs to InstructionsFrom S-graphs to Instructions
TEST nodes TEST nodes conditional branches conditional branches
ASSIGN nodes ASSIGN nodes ALU ops and data moves ALU ops and data moves
No loops in a No loops in a singlesingle CFSM transition CFSM transition (User loops handled at the RTOS level)(User loops handled at the RTOS level)
Data flow handling:Data flow handling: ““Don’t touch” them (except common sub-expression extraction)Don’t touch” them (except common sub-expression extraction) Map expression DAGs to C expressionsMap expression DAGs to C expressions C compiler allocates registers and select op-codesC compiler allocates registers and select op-codes
Need source-level debugging environment (with any of the Need source-level debugging environment (with any of the
chosen entry languages)chosen entry languages)
29
Software Synthesis ProcedureSoftware Synthesis Procedure
Specification, partitioningSpecification, partitioning
S-graph synthesisS-graph synthesis
Timing estimation
Scheduling, validationScheduling, validationnot not
feasiblefeasible feasiblefeasible
Code generation
Compilation
Testing, validation
Production
passpass
failfail
30
POLIS : S-graph Level EstimationPOLIS : S-graph Level Estimation
SW synthesis
CFSM
Sw code
S-graph synthesis and optimization
S-graph
Code generation
Timing / code size information
Estimation
31
Problems in Software Performance EstimationProblems in Software Performance Estimation
How to link behavior to assembly code?-> Model C code generated from S-graph and use a set of cost parameters
How to handle the variety of compilers and How to handle the variety of compilers and CPUs?CPUs?
32
Software ModelSoftware Model
func(E) event E; { static int st; Initialization of local variables; Structure of mixed if or switch statements and assign statements ; return; }
generated C codeT = Tpp + k Tinit + Tstruct S = Spp + k Sinit + Sstruct
Time T and Size S
Tpp, Spp
Tinit, Sinit
Tstruct, Sstruct
Time Size
33
Execution Time of a Path and the Code SizeExecution Time of a Path and the Code Size
PropertyProperty : Form of each statement is determined by type of : Form of each statement is determined by type of corresponding node.corresponding node.
TT struct struct = ƒ°pi Ct(= ƒ°pi Ct( pi: pi: takes value 1 if node i is on a path, otherwise 0.takes value 1 if node i is on a path, otherwise 0. Ct(n,v): Ct(n,v): execution time for node type n execution time for node type n and variable type v.and variable type v.
SS structstruct = ƒ°Cs(= ƒ°Cs( node_type_of node_type_of (i), (i), variable_type_of variable_type_of (i)) (i)) Cs(n,v): Cs(n,v): code size for node type n code size for node type n and variable type v.and variable type v.
path on S-graph
node_type_of node_type_of (i), (i), variable_type_of variable_type_of (i)) (i))
34
Cost ParametersCost Parameters
* Pre-calculated cost parameters for:* Pre-calculated cost parameters for:
(1) Ct(n,v), Cs(n,v): (1) Ct(n,v), Cs(n,v): Execution time and code size for node type n Execution time and code size for node type n and variable type v.and variable type v.
(2) T(2) Tpppp, S, Spppp: : Pre- and post- execution time and code size.Pre- and post- execution time and code size.
(3) T(3) Tinitinit, S, Sinitinit:: Execution time and code size for local variableExecution time and code size for local variable initialization.initialization.
35
Problems in Software Performance EstimationProblems in Software Performance Estimation
How to link behavior to assembly code?How to link behavior to assembly code?
How to handle the variety of compilers and CPUs?-> prepare cost parameters for each target
36
Extraction of Cost ParametersExtraction of Cost Parameters
set of benchmark programs
target C compiler
static analyzer execution & profilingor
parameter extractor
cost parameters
37
AlgorithmAlgorithm
Preprocess: extracting set of cost parameters. Weighting nodes and edges in given S-graph
with cost parameters. Traversing weighted S-graph. Finding maximum cost path and minimum cost
path using Depth-First Search on S-graph.
Accumulating 'size' costs on all nodes.
38
S-graph Level Estimation :AlgorithmS-graph Level Estimation :Algorithm
Cost C is a triple (min_time, max_time, code_size)Cost C is a triple (min_time, max_time, code_size)
Algorithm: Algorithm: SGtrace SGtrace (sg(sgii)) if (sgif (sgii == NULL) return (C(0, ,0)); == NULL) return (C(0, ,0)); if (sgif (sgii has been visited) has been visited) return ( pre-calculated Ci(*,*,0) associated with sgreturn ( pre-calculated Ci(*,*,0) associated with sgii ); ); CCii = initialize (max_time = 0, min_time = , code_size = 0); = initialize (max_time = 0, min_time = , code_size = 0); for each child sgfor each child sgjj of sg of sgii { { CCijij = = SGtrace SGtrace (sg(sgjj) + edge cost for edge e) + edge cost for edge eijij;; CCii.max_time = max(C.max_time = max(Cii.max_time, C.max_time, Cijij.max_time);.max_time); CCii.min_time = min(C.min_time = min(Cii.min_time, C.min_time, Cijij.min_time);.min_time); CCii.code_size += C.code_size += Cijij.code_size;.code_size; }} CCii += node cost for node sg += node cost for node sgii;; return (Creturn (Cii););
39
ExperimentsExperiments
* Proposed methods implemented and examined * Proposed methods implemented and examined in POLIS system.in POLIS system.
* Target CPU and compiler:* Target CPU and compiler: M68HC11 and Introl C compiler.M68HC11 and Introl C compiler.
* Difference D is defined as* Difference D is defined as
D = costestimated costmeasured-costmeasured
40
Experimental Results : S-graph LevelExperimental Results : S-graph Level
model estimated measured % differencemin 158 141 12.06
FRC max 469 496 -5.44size 654 690 5.22min 223 191 16.75
TIMER max 938 912 2.85size 1,573 1,436 9.54min 145 131 10.69
ODOMETER max 361 363 -0.55size 454 457 -0.66min 314 335 -6.27
SPEEDOMETER max 880 969 -9.18size 764 838 -8.83min 119 111 7.21
BELT max 322 323 -0.31size 511 520 -1.73min 197 171 15.20
FUEL max 533 586 -9.04size 637 647 -1.55min 262 221 18.55
CROSSDISP max 16,289 16,979 -4.06size 32,592 38,618 -15.60
41
Performance and Cost Estimation: SummaryPerformance and Cost Estimation: Summary
S-graph: low-level enough to allow accurate performance estimationS-graph: low-level enough to allow accurate performance estimation
Cost parameters assigned to each node, depending on:Cost parameters assigned to each node, depending on: System type (CPU, memory, bus, ...)System type (CPU, memory, bus, ...)
Node and expression typeNode and expression type
Cost parameters evaluated via simple benchmarksCost parameters evaluated via simple benchmarks Need timing and size measurements for each target systemNeed timing and size measurements for each target system
Currently implemented for MIPS, 68332 and 68HC11 processorsCurrently implemented for MIPS, 68332 and 68HC11 processors
42
Performance and Cost EstimationPerformance and Cost Estimation
4040
26264141 6363
14
18 9
Example: 68HC11 timing Example: 68HC11 timing
estimationestimation
Cost assigned to s-graph edgesCost assigned to s-graph edges
(Different for taken/not taken (Different for taken/not taken branches)branches)
Estimated time:Estimated time: Min: 26 cyclesMin: 26 cycles
Max: 126 cyclesMax: 126 cycles
Accuracy: within 20% of Accuracy: within 20% of
profilingprofiling
a := a a := a + 1+ 1
a := 0a := 0
detect(c)a<a<
bb
BEGINBEGIN
ENDEND
FF
TT
TT FF
emit(y)emit(y)
43
Open ProblemsOpen Problems
Better synthesis techniquesBetter synthesis techniques Add state variables to simplify s-graphAdd state variables to simplify s-graph
Performance-driven synthesis of critical pathsPerformance-driven synthesis of critical paths
Exact memory/speed trade-offExact memory/speed trade-off
Estimation of caching and pipelining effectsEstimation of caching and pipelining effects May have little impact on control-dominated systems May have little impact on control-dominated systems
(frequent branches and context switches)(frequent branches and context switches)
Relatively easy during co-simulationRelatively easy during co-simulation