ESLT The next generation of Design Automation Tools




<ul><li><p>Agenda</p><ul><li>Goal of ESL tools</li><li>History</li><li>Motivation for ESLT in these times</li><li>The USU-ESLT</li><li>On-going research</li><li>Conclusions</li></ul></li>
<li><p>Goal of ESL tools</p><ul><li>To automate the generation of SoC solutions from HLPLs (high-level programming languages such as C, C++, and Java)</li><li>To reduce the design time of digital circuits from months to weeks</li><li>Initial VHDL generation should complete in minutes</li><li>Functional verification/testing may take a few weeks</li></ul></li>
<li><p>Tools</p><ul><li>Cones (1988)</li><li>HardwareC (Stanford)</li><li>Transmogrifier C</li><li>SystemC</li><li>C2Verilog (1998)</li><li>Handel C</li><li>Bach C</li><li>SpecC</li><li>Trident C (LANL)</li><li>SPARK (UCI)</li><li>CASH (CMU)</li><li>Mitrion C</li><li>Impulse C (2004)</li><li>Catapult C (2006, Mentor Graphics)</li></ul><p>Challenges</p><ul><li>C is a sequential programming language</li><li>What does a pointer or dynamic memory allocation mean in hardware?</li><li>Recursion</li><li>Floating-point arithmetic</li><li>How is I/O represented?</li><li>How are hardware design parameters introduced?</li></ul><p>Solutions</p><ul><li>Support only a subset of C</li><li>User-specified parallelism</li><li>User-specified I/O</li><li>Extensive use of macros to guide circuit generation</li></ul></li>
<li><p>Two Primary Trends</p><ul><li>Panic in the microprocessor industry<ul><li>Next-generation chips from Intel, AMD, and Apple are all multi-core with integrated heterogeneous components</li><li>Hennessy/Patterson guidelines are no longer good enough</li><li>Renewed rigor in computer architecture research</li><li>Systems on a chip are too complicated for exploring architecture options at RTL</li></ul></li><li>Emergence of FPGAs as a viable computing entity<ul><li>Industry-accepted platforms for architecture prototyping and research</li><li>Exploring VLSI architecture options at RTL is extremely complicated</li></ul></li></ul></li>
<li><p>Our Approach: It's a workbench</p><ul><li>Restrict the ESL tool to a small set of algorithms that need acceleration beyond what microprocessors can provide</li><li>Take advantage of user expertise in describing a template for the architecture</li><li>Let the tool explore low-level architecture optimization</li><li>Take advantage of gcc optimizations</li><li>Ability to integrate third-party IP cores</li></ul></li>
<li><pre>
void anneal(int *current) {
    float temperature;
    int current_val, next_val;
    int next[MAX_EVENTS];

    temperature = 1.0e4f;  /* initial value, as assigned in BLOCK 0 of the dump below */
    current_val = RAND_MAX;
    while (temperature &gt; STOP_THRESHOLD) {
        copy(current, next);
        alter(next);
        next_val = evaluate(next);
        accept(&amp;current_val, next_val, current, next, temperature);
        temperature = adjustTemperature();
    }
}
</pre><pre>
;; Function anneal (anneal)

anneal (current)
{
  int next[10];
  int next_val;
  int current_val;
  float temperature;
  double D.3292;
  float D.3291;
  int D.3290;

  # BLOCK 0
  # PRED: ENTRY (fallthru)
  temperature = 1.0e+4;
  current_val = 2147483647;
  goto ();
  # SUCC: 2 (fallthru)

  # BLOCK 1
  # PRED: 2 (true)
:;
  copy (current, &amp;next);
  alter (&amp;next);
  D.3290 = evaluate (&amp;next);
  next_val = D.3290;
  accept (&amp;current_val, next_val, current, &amp;next, temperature);
  D.3291 = adjustTemperature ();
  temperature = D.3291;
  # SUCC: 2 (fallthru)

  # BLOCK 2
  # PRED: 0 (fallthru) 1 (fallthru)
:;
  D.3292 = (double) temperature;
  if (D.3292 &gt; 1.00000000000000004792173602385929598312941379845e-4) goto ; else goto ;
  # SUCC: 1 (true) 3 (false)

  # BLOCK 3
  # PRED: 2 (false)
:;
  return;
  # SUCC: EXIT
}
</pre><p>Problem: Given a circuit specification consisting of a set of components (adders, multipliers, etc.), estimate the FPGA resources (slices, BRAMs, DSP48s) used.</p><p>Solution: Create a fifth-order equation for each (component, resource type) pair, representing usage as a function of data width.</p><p>The equations are fitted to discrete measured values using Matlab's curve-fitting feature.</p><p>A fifth-order equation is necessary for adequate estimation: y = C<sub>5</sub>n<sup>5</sup> + C<sub>4</sub>n<sup>4</sup> + C<sub>3</sub>n<sup>3</sup> + C<sub>2</sub>n<sup>2</sup> + C<sub>1</sub>n + C<sub>0</sub>, where n is the data width.</p></li>
<li><p>List Scheduling</p><ul><li>Also known as Critical Path Scheduling</li><li>Given a set of resources, determines the time needed to complete a set of operations represented as a dependency graph</li><li>Assign a static priority to each node in the graph</li><li>Static priorities are assigned by measuring the distance between the node in question and a sink node</li><li>Schedule the nodes according to priority</li></ul></li>
<li><p>List Scheduling Example</p><p>Schedule DFG on one multiplier, one adder, and one divider</p><p>Multiplication and division take two cycles each (non-pipelined); addition takes
one</p><table><tr><th>Time Step</th><th>+ * /</th></tr><tr><td>0</td><td>a</td></tr><tr><td>1</td><td>b</td></tr><tr><td>2</td><td>c, e1</td></tr><tr><td>3</td><td>d, e2</td></tr><tr><td>4</td><td>f1</td></tr><tr><td>5</td><td>f2</td></tr><tr><td>6</td><td>g1</td></tr><tr><td>7</td><td>g2</td></tr></table><table><tr><th>Time Step</th><th>+ * /</th></tr><tr><td>0</td><td>a</td></tr><tr><td>1</td><td>c</td></tr><tr><td>2</td><td>b</td></tr><tr><td>3</td><td>d, e1</td></tr><tr><td>4</td><td>e2</td></tr><tr><td>5</td><td>f1</td></tr><tr><td>6</td><td>f2</td></tr><tr><td>7</td><td>g1</td></tr><tr><td>8</td><td>g2</td></tr></table></li>
<li><p>List Scheduling</p><ul><li>Heuristic method; does not guarantee an optimal schedule</li><li>Computational complexity of only O(Tn), where T is the number of time slots and n is the number of nodes to be scheduled</li></ul><p>Improvement Methods</p><ul><li>Modified Critical Path</li><li>Earliest Time First</li><li>Dynamic Critical Path</li><li>Critical Node Parent Trees</li><li>Cone-Based Clustering</li><li>Partial Critical Path scheduling</li></ul><p>All are O(n<sup>2</sup>) to O(n<sup>3</sup>): too complicated for use inside a simulated annealing loop</p></li>
<li><p>Solution: Ripple-List Scheduling</p><pre>
assign static priority to each node in graph
initialize time to 0
Loop while unscheduled nodes exist
    Loop until no nodes can be scheduled on time step
        update list of ready nodes
        schedule highest priority node possible
        adjust priority of remaining nodes
    EndLoop
    increment time
EndLoop
</pre></li>
<li><p>Ripple Factor (R<sub>f</sub>)</p><ul><li>The degree of a vertex is the number of edges (both incoming and outgoing in the case of a directed graph) incident to it</li><li>D<sub>G</sub> = the largest vertex degree in the entire graph</li><li>d = distance between two vertices</li></ul></li>
<li><p>Ripple Factor</p><ul><li>D<sub>G</sub> = 3</li><li>The priorities of nodes that are one step away get updated by a ripple factor of 1/3<sup>1</sup>, those that are two steps away get updated by 1/3<sup>2</sup>, etc.</li><li>Priorities are adjusted dynamically, but never jump to another priority band</li><li>Maximum ripple distance is applied to cut off updates and save computation (</li></ul></li>
<li><p>Balancing Latency across Pipeline Stages through ILP extraction</p><ul><li>Goal: Maximize pipelined architecture performance within specified resource constraints</li><li>A pipeline can only run as fast as the latency of the slowest stage</li><li>An efficient pipeline will balance the latency of each stage as much as possible</li><li>Some stages can be redesigned to support additional parallelism; others are fixed</li></ul></li>
<li><p>Algorithm for Pipelined Processor DSE</p><p>Generate minimal set of ALUs needed for each stage in the pipeline</p><p>Compute latency of all stages (generate the architecture)</p><p>Loop: mark
stage with worst latency</p><p>Reduce the latency of this stage through exploitation of parallelism until</p><ol><li>Worst latency can be passed to another stage</li><li>If 1 is not possible, reduce latency as much as possible</li></ol><p>Use intertwined simulated annealing (SA) and ripple-list scheduling (RLS) algorithms, or data-port width extension where applicable</p><p>End Loop when worst latency cannot be passed to another stage</p></li>
<li><p>Example</p><p>Generate minimal architecture for all stages</p><table><tr><th>Stage</th><th>Cycles</th><th>Slices</th></tr><tr><td>Copy</td><td>101</td><td>300</td></tr><tr><td>Alter</td><td>21</td><td>390</td></tr><tr><td>Evaluate</td><td>233</td><td>317</td></tr><tr><td>Accept</td><td>54</td><td>1408</td></tr></table><p>Mark stage with worst latency</p></li>
<li><p>Example</p><p>Reduce the latency of this stage through exploitation of parallelism until worst latency can be passed to another stage</p><table><tr><th>Stage</th><th>Cycles</th><th>Slices</th></tr><tr><td>Evaluate (before)</td><td>233</td><td>317</td></tr><tr><td>Evaluate (after)</td><td>95</td><td>777</td></tr></table><p>Allocation of additional resources</p></li>
<li><p>Example</p><p>New numbers for all stages</p><table><tr><th>Stage</th><th>Cycles</th><th>Slices</th></tr><tr><td>Copy</td><td>101</td><td>300</td></tr><tr><td>Alter</td><td>21</td><td>390</td></tr><tr><td>Evaluate</td><td>95</td><td>777</td></tr><tr><td>Accept</td><td>54</td><td>1408</td></tr></table><p>Mark stage with worst latency</p></li>
<li><p>Example</p><p>Reduce the latency of this stage through exploitation of parallelism until worst latency can be passed to another stage</p><table><tr><th>Stage</th><th>Cycles</th><th>Slices</th></tr><tr><td>Copy (before)</td><td>101</td><td>300</td></tr><tr><td>Copy (after)</td><td>51</td><td>600</td></tr></table><p>Widening of memory ports to allow for 2-word transfers</p></li>
<li><p>Example</p><p>New numbers for all stages</p><table><tr><th>Stage</th><th>Cycles</th><th>Slices</th></tr><tr><td>Copy</td><td>51</td><td>600</td></tr><tr><td>Alter</td><td>21</td><td>390</td></tr><tr><td>Evaluate</td><td>95</td><td>777</td></tr><tr><td>Accept</td><td>54</td><td>1408</td></tr></table><p>Repeat the process until FPGA resources are exhausted or no more parallelism can be extracted from the worst-performing stage</p></li>
<li><p>DSE Summary</p><ul><li>Stage performance can be improved by<ul><li>Allocating additional computational resources (adders, multipliers, etc.) to a stage</li><li>Widening memory ports to accelerate block data transfers</li></ul></li><li>Some stages cannot be improved<ul><li>If the task does not have any ILP</li></ul></li></ul></li>
<li><p>Performance Xilinx V4-SX35</p></li>
<li><p>The FLEX (flexible processor) can perform either DFG 1 or DFG 2 computations</p><p>Designed by taking the union of the DFG 1 and DFG 2 data flow graphs</p><p>The FLEX processor can
switch modes dynamically, depending on computational needs</p><p>Branch probabilities from gcov can guide the FLEX design: DFGs executed more frequently should be more optimized</p><p>Considerably superior to Partial Dynamic Reconfiguration using Xilinx EAPR</p><p>[Figure labels: 0.6, 0.4]</p></li>
<li><p>On-going Technology Enhancement (1): FLEX Processor: Code Profiling using gcov</p><pre>
function main called 4 returned 100% blocks executed 100%
        -:    5:{
        -:    5-block  0
call    0 returned 100%
        -:    5-block  1
branch  1 taken 86% (fallthrough)
branch  2 taken 14%
        -:    5-block  2
        -:    5-block  3
        -:    5-block  4
branch  3 taken 86% (fallthrough)
branch  4 taken 14%
        -:    5-block  5
        -:    5-block  6
branch  5 taken 75% (fallthrough)
branch  6 taken 25%
        -:    5-block  7
        -:    5-block  8
branch  7 taken 86% (fallthrough)
branch  8 taken 14%
        -:    5-block  9
        -:    5-block 10
        -:    5-block 11
</pre></li>
<li><ul><li>VHDL code can be compared with the architecture description</li><li>Third-party design automation software used for synthesis, placement, debugging, verification, etc.<ul><li>ChipScope Pro (Xilinx)</li></ul></li><li>Timing closure</li><li>Improved metadata</li><li>Stringent constraint imposition on DSE</li></ul></li>
<li><p>* It is because of D<sub>G</sub> that priority bands are never jumped.</p></li></ul>
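The basic list-scheduling procedure described in the slides (static priority from distance to a sink, then greedy scheduling per time step) can be sketched in C as follows. This is a minimal illustration, not the USU-ESLT implementation: the `node` type, `compute_priorities`, and `list_schedule` are hypothetical names, the graph is assumed to be topologically ordered, and the resource set is fixed at one adder, one multiplier, and one divider with the latencies from the example slide.

```c
#include <assert.h>

#define MAX_NODES 16

enum op { OP_ADD, OP_MUL, OP_DIV, OP_KINDS };

/* Latencies from the example slide: add = 1 cycle, mul/div = 2 cycles. */
static const int latency[OP_KINDS] = { 1, 2, 2 };

/* Hypothetical DFG node: operation kind plus up to two predecessors. */
typedef struct {
    enum op kind;
    int npred;
    int pred[2];
} node;

/* Static priority = longest path (in cycles) from the node to a sink,
 * including the node's own latency. Assumes predecessors always have
 * lower indices than their successors. */
static void compute_priorities(const node *g, int n, int *prio)
{
    for (int i = 0; i < n; ++i)
        prio[i] = latency[g[i].kind];
    for (int i = n - 1; i >= 0; --i)
        for (int k = 0; k < g[i].npred; ++k) {
            int p = g[i].pred[k];
            int cand = prio[i] + latency[g[p].kind];
            if (cand > prio[p])
                prio[p] = cand;
        }
}

/* Schedules the graph on one non-pipelined unit per operation kind.
 * start[i] receives the start cycle of node i; returns total cycles. */
int list_schedule(const node *g, int n, int *start)
{
    int prio[MAX_NODES], finish[MAX_NODES];
    int unit_free[OP_KINDS] = { 0, 0, 0 };  /* cycle at which each unit frees up */
    int done = 0, makespan = 0;

    assert(n > 0 && n <= MAX_NODES);
    compute_priorities(g, n, prio);
    for (int i = 0; i < n; ++i)
        start[i] = -1;

    for (int t = 0; done < n; ++t) {
        for (int u = 0; u < OP_KINDS; ++u) {
            if (unit_free[u] > t)
                continue;  /* unit is still busy */
            int best = -1;
            for (int i = 0; i < n; ++i) {
                if (start[i] >= 0 || g[i].kind != (enum op)u)
                    continue;
                int ready = 1;  /* ready when all predecessors have finished */
                for (int k = 0; k < g[i].npred; ++k) {
                    int p = g[i].pred[k];
                    if (start[p] < 0 || finish[p] > t)
                        ready = 0;
                }
                if (ready && (best < 0 || prio[i] > prio[best]))
                    best = i;
            }
            if (best >= 0) {
                start[best] = t;
                finish[best] = t + latency[u];
                unit_free[u] = finish[best];
                if (finish[best] > makespan)
                    makespan = finish[best];
                ++done;
            }
        }
    }
    return makespan;
}
```

Each time step scans the unscheduled nodes once per unit, which is in keeping with the low per-step cost the slides cite as the reason for using list scheduling inside an annealing loop. The ripple-based priority adjustment is not modeled here.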

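The pipeline-balancing loop from the DSE slides can be sketched in C as below. This is a simplified illustration under stated assumptions: each stage carries a precomputed list of (cycles, slices) design points ordered from smallest to fastest, whereas the slides describe generating them on the fly with SA/RLS or port widening. The names `design_point`, `stage`, and `balance_pipeline`, and the `slice_budget` parameter, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate implementation of a stage. */
typedef struct {
    int cycles;
    int slices;
} design_point;

/* A pipeline stage with its candidate implementations, ordered from
 * the minimal architecture to the most parallel one. */
typedef struct {
    const char *name;
    const design_point *points;
    size_t npoints;
    size_t chosen;  /* index of the currently selected design point */
} stage;

static int total_slices(const stage *s, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += s[i].points[s[i].chosen].slices;
    return sum;
}

/* Index of the stage with the worst (largest) latency. */
static size_t worst_stage(const stage *s, size_t n)
{
    size_t w = 0;
    for (size_t i = 1; i < n; ++i)
        if (s[i].points[s[i].chosen].cycles > s[w].points[s[w].chosen].cycles)
            w = i;
    return w;
}

/* Repeatedly speeds up the worst stage until it has no faster design
 * point or the slice budget would be exceeded. Returns the final
 * bottleneck latency in cycles. */
int balance_pipeline(stage *s, size_t n, int slice_budget)
{
    assert(s != NULL && n > 0);
    for (size_t i = 0; i < n; ++i)
        s[i].chosen = 0;  /* start from the minimal architecture */

    for (;;) {
        size_t w = worst_stage(s, n);
        if (s[w].chosen + 1 >= s[w].npoints)
            break;  /* no more parallelism to extract from this stage */
        const design_point *cur = &s[w].points[s[w].chosen];
        const design_point *next = &s[w].points[s[w].chosen + 1];
        if (total_slices(s, n) - cur->slices + next->slices > slice_budget)
            break;  /* resources exhausted */
        s[w].chosen++;
    }

    size_t w = worst_stage(s, n);
    return s[w].points[s[w].chosen].cycles;
}
```

Fed with the stage numbers from the example slides, the loop first upgrades Evaluate (233 to 95 cycles), then Copy (101 to 51 cycles), and stops with Evaluate as the bottleneck because no faster Evaluate design point is listed.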

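The fifth-order resource estimate from the slides can be evaluated with Horner's rule, as in the C sketch below. The `usage_model` type and `estimate_usage` function are hypothetical names, and no real coefficients are given in the slides; in practice they would come from the curve fit for each (component, resource type) pair.

```c
#include <assert.h>
#include <stddef.h>

#define POLY_ORDER 5

/* Fitted coefficients for one (component, resource type) pair.
 * c[i] multiplies n^i, so c[0] is C0 and c[5] is C5. */
typedef struct {
    double c[POLY_ORDER + 1];
} usage_model;

/* Evaluates y = C5*n^5 + C4*n^4 + C3*n^3 + C2*n^2 + C1*n + C0
 * for data width n using Horner's rule. */
double estimate_usage(const usage_model *m, unsigned width)
{
    assert(m != NULL);
    double y = 0.0;
    for (int i = POLY_ORDER; i >= 0; --i)
        y = y * (double)width + m->c[i];
    return y;
}
```

Horner's rule needs only five multiplications and five additions per estimate, which keeps the cost low when the estimator is called for every component of every candidate architecture.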