High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic Datapaths


Montek Singh and Steven Nowick, Columbia University, New York, USA. {montek,nowick}@cs.columbia.edu, http://www.cs.columbia.edu/~montek


  • High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic Datapaths
    Montek Singh and Steven Nowick
    Columbia University, New York, USA
    {montek,nowick}@cs.columbia.edu
    http://www.cs.columbia.edu/~montek
    Intl. Symp. Adv. Res. Asynchronous Circ. Syst. (ASYNC), April 2-6, 2000, Eilat, Israel.

  • Outline
    • Introduction
    • Background: Williams PS0 pipelines
    • New Pipeline Designs
        • Dual-Rail: LP3/1, LP2/2 and LP2/1
        • Single-Rail: LPSR2/1
    • Practical Issue: Handling slow environments
    • Results and Conclusions

  • Why Dynamic Logic?
    • Potentially:
        • Higher speed
        • Smaller area
    • Latch-free pipelines: the logic gate itself provides an implicit latch
        • lower latency
        • shorter cycle time
        • smaller area: very important in gate-level pipelining!
    • Our Focus: Dynamic logic pipelines

  • How Do We Achieve High Throughput?
    • Introduce novel pipeline protocols:
        • specifically target dynamic logic
        • reduce the impact of handshaking delays, giving shorter cycle times
    • Pipeline at very fine granularity:
        • gate-level: each stage is a single gate deep
        • highest throughputs possible
        • latch-free datapaths especially desirable
        • dynamic logic is a natural match

  • Prior Work: Asynchronous Pipelines
    • Sutherland (1989), Yun/Beerel/Arceo (1996)
        • very elegant 2-phase control, but expensive transition latches
    • Day/Woods (1995), Furber/Liu (1996)
        • 4-phase control: simpler latches, but complex controllers
    • Kol/Ginosar (1997)
        • double latches: greater concurrency, but area-expensive
    • Molnar et al. (1997-99)
        • Two designs: asp* and micropipeline; both very fast, but:
            • asp*: complex timing, cannot handle latch-free dynamic datapaths
            • micropipeline: area-expensive, cannot do logic processing at all!
    • Williams (1991), Martin (1997)
        • dynamic stages: no explicit latches, low latency
        • throughput still limited

  • Background

  • PS0 Pipelines (Williams 1986-91)
    • Basic Architecture: [figure]

  • PS0 Function Block
    • Each output is produced using a dynamic gate: [figure]

  • Dual-Rail Completion Detector
    • OR together the two rails of each bit
    • Combine the results using a C-element
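The detector described on this slide can be sketched behaviorally in Python. This is only an illustrative model, not the authors' circuit: the names `CElement`, `bit_valid` and `completion_detect` are hypothetical, and the C-element is modeled as a simple state-holding object.

```python
class CElement:
    """Behavioral Muller C-element: the output changes only when all inputs agree."""

    def __init__(self, initial=False):
        self.output = initial

    def update(self, inputs):
        inputs = list(inputs)
        if all(inputs):
            self.output = True
        elif not any(inputs):
            self.output = False
        # Otherwise the inputs disagree and the previous output is held.
        return self.output


def bit_valid(rail_true, rail_false):
    """A dual-rail bit carries data when either of its two rails is high."""
    return bool(rail_true or rail_false)


def completion_detect(c_element, bits):
    """bits is a list of (rail_true, rail_false) pairs, one pair per data bit."""
    return c_element.update(bit_valid(t, f) for t, f in bits)


# Usage: a 2-bit dual-rail word going from reset, to partly valid, to fully valid.
detector = CElement()
print(completion_detect(detector, [(0, 0), (0, 0)]))  # all bits reset
print(completion_detect(detector, [(1, 0), (0, 0)]))  # only one bit has arrived
print(completion_detect(detector, [(1, 0), (0, 1)]))  # every bit has arrived
```

Because the C-element holds its output while the per-bit signals disagree, done rises only after every bit is valid and falls only after every bit has been reset.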

  • PS0 Protocol
    • PRECHARGE N: when N+1 completes evaluation
    • EVALUATE N: when N+1 completes precharging
    • Evaluate → Precharge: 3 events
    • Precharge → Evaluate: another 3 events
    • Complete cycle: 6 events
    • Events labeled in the timing diagram (stages N, N+1, N+2): N evaluates; N+1 evaluates; N+2 evaluates; N+2 indicates done; N+1 precharges; N+1 indicates done
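The two PS0 rules above can be turned into a small timing recurrence. The following Python sketch is an assumption-laden illustration, not the authors' simulator: `simulate_ps0` and its parameters are hypothetical names, delays are in arbitrary integer units, the left environment is assumed to always have data ready, and the right environment is assumed to acknowledge and reset on its own.

```python
def simulate_ps0(num_stages, num_tokens, t_eval, t_prech, t_cd):
    """Return eval_done[n][k], the time at which stage n finishes evaluating token k."""
    eval_done = [[0] * num_tokens for _ in range(num_stages)]

    def done_high(n, k):
        # Stage n's detector reports "evaluation complete" for token k.
        # Index num_stages stands for the assumed right environment, modeled as
        # acknowledging one detector delay after the last stage has evaluated.
        stage = min(n, num_stages - 1)
        return eval_done[stage][k] + t_cd

    def done_low(n, k):
        # Stage n's detector reports "precharge complete" after token k.
        if k < 0:
            return 0  # the pipeline starts out fully precharged
        # PRECHARGE N: when N+1 completes evaluation.
        # The environment (n == num_stages) is assumed to reset on its own.
        trigger = done_high(n + 1, k) if n < num_stages else done_high(n, k)
        return trigger + t_prech + t_cd

    for k in range(num_tokens):
        for n in range(num_stages):
            # Assumed ideal left environment: stage 0 always has data waiting.
            data_ready = eval_done[n - 1][k] if n > 0 else 0
            # EVALUATE N: when N+1 completes precharging after the previous token.
            released = done_low(n + 1, k - 1)
            eval_done[n][k] = max(data_ready, released) + t_eval
    return eval_done


# Usage with made-up delays: evaluation 3, precharge 2, completion detection 2.
times = simulate_ps0(num_stages=10, num_tokens=8, t_eval=3, t_prech=2, t_cd=2)
print(times[0][5] - times[0][4])  # steady-state cycle time seen at stage 0
```

Under these assumptions, and with at least three stages, the steady-state cycle is the sum of the six listed events: three evaluations, two completion detections and one precharge.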

  • PS0 Performance

  • New Pipeline Designs

  • Overview of Approach
    • Our Goal: Shorter cycle time, without degrading latency
    • Our Approach: Use Lookahead Protocols (LP)
        • main idea: anticipate critical events based on richer observation
    • Two new protocol optimizations:
        • Early evaluation: give a stage a head start on evaluation by observing events further down the pipeline (a similar idea was proposed by Williams in PA0, but our designs exploit it much better)
        • Early done: a stage signals done when it is about to precharge/evaluate

  • Dual-Rail Design #1: LP3/1
    • Uses early evaluation:
        • each stage now has two control inputs
        • the new input comes from two stages ahead
        • evaluate N as soon as N+1 starts precharging

  • LP3/1 Protocol
    • PRECHARGE N: when N+1 completes evaluation
    • EVALUATE N: when N+2 completes evaluation
    • Events labeled in the timing diagram (stages N, N+1, N+2): N evaluates; N+1 evaluates; N+2 evaluates; N+2 indicates done

  • LP3/1: Comparison with PS0
    • PS0: 6 events in cycle
    • LP3/1: only 4 events in cycle!

  • LP3/1 Performance
    • Savings over PS0 (the saved path): 1 Precharge + 1 Completion Detection
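The stated saving can be checked by adding up the events on each cycle. The sketch below is illustrative only: the delay values are made up, the names are hypothetical, and the delay of the extra control gate inside an LP3/1 stage is ignored.

```python
# Events on one complete cycle, taken from the protocol slides:
#   PS0:   N evaluates, N+1 evaluates, N+2 evaluates, N+2 indicates done,
#          N+1 precharges, N+1 indicates done
#   LP3/1: N evaluates, N+1 evaluates, N+2 evaluates, N+2 indicates done
PS0_CYCLE = ["eval", "eval", "eval", "done", "precharge", "done"]
LP31_CYCLE = ["eval", "eval", "eval", "done"]

# Hypothetical delays in arbitrary units (not measured values).
DELAYS = {"eval": 3, "precharge": 2, "done": 2}


def cycle_time(events, delays):
    """Sum the delays of the events that make up one cycle."""
    return sum(delays[event] for event in events)


ps0 = cycle_time(PS0_CYCLE, DELAYS)
lp31 = cycle_time(LP31_CYCLE, DELAYS)
print(ps0, lp31, ps0 - lp31)
```

The difference between the two sums is exactly one precharge plus one completion detection, whatever delay values are plugged in.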

  • Inside a Stage: Merging Two Controls
    • Precharge when PC=1 (and Eval=0)
    • Evaluate early when Eval=1 (or PC=0)
    • A NAND gate combines the two control inputs: [figure]
    • Problem: early Eval=1 is non-persistent!
        • it may get de-asserted before the stage has completed evaluation!
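The merged control can be written as a truth function. In this sketch the signal polarities are assumptions (the slide's figure is not available): the NAND is fed PC and the inverse of Eval, and a True output is taken to mean "evaluate". The function names are hypothetical.

```python
def nand(a, b):
    """Two-input NAND gate."""
    return not (a and b)


def stage_control(pc, eval_early):
    """Return True when the stage should evaluate, False when it should precharge.

    Assumed wiring: NAND(PC, NOT Eval), so the stage precharges only
    when PC=1 and Eval=0, and evaluates when Eval=1 or PC=0.
    """
    return nand(pc, not eval_early)


# Usage: print the full truth table.
for pc in (False, True):
    for eval_early in (False, True):
        mode = "evaluate" if stage_control(pc, eval_early) else "precharge"
        print(f"PC={int(pc)} Eval={int(eval_early)} -> {mode}")
```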

  • LP3/1 Timing Constraints: Example
    • Problem: early Eval=1 is non-persistent!
    • Observation: PC=0 arrives soon after Eval=1, and is persistent, so use PC as a safe takeover for Eval!
    • Solution: no change!
    • Timing Constraint: PC=0 arrives before Eval=1 is de-asserted
        • simple one-sided timing requirement
        • other constraints as well, all easily satisfied in practice
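The one-sided constraint can be illustrated with a discrete-time check. This is a minimal sketch under stated assumptions: time advances in integer steps, Eval is high from `eval_rise` up to (but not including) `eval_fall`, PC is high until `pc_fall` and low afterwards, and all names are hypothetical.

```python
def evaluate_window_unbroken(eval_rise, eval_fall, pc_fall, horizon):
    """Return True if the stage stays in evaluate from eval_rise until horizon.

    The stage evaluates whenever Eval=1 or PC=0, so there must be no time
    step at which Eval has already fallen while PC is still high.
    """
    for t in range(eval_rise, horizon):
        eval_high = eval_rise <= t < eval_fall
        pc_high = t < pc_fall
        if pc_high and not eval_high:
            return False  # the stage would slip back into precharge
    return True


# Usage: PC=0 arriving before Eval is de-asserted is safe; arriving after is not.
print(evaluate_window_unbroken(eval_rise=0, eval_fall=5, pc_fall=3, horizon=10))
print(evaluate_window_unbroken(eval_rise=0, eval_fall=5, pc_fall=7, horizon=10))
```

Only an upper bound on the arrival of PC=0 is needed, which is why the requirement is one-sided.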

  • Dual-Rail Design #2: LP2/2
    • Uses early done:
        • completion detector is now placed before the functional block
        • stage indicates done when about to precharge/evaluate

  • LP2/2 Completion Detector
    • Modified completion detectors needed:
        • Done=1 when stage starts evaluating, and inputs valid
        • Done=0 when stage starts precharging
        • asymmetric C-element
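The asymmetric behavior of this detector can be sketched as a small state machine. The class below is a hypothetical behavioral model, not the authors' circuit: it assumes the stage's control is available as a boolean `evaluating` and that input validity has already been reduced to a single boolean.

```python
class EarlyDoneDetector:
    """Behavioral asymmetric C-element for an early-done completion detector."""

    def __init__(self):
        self.done = False

    def update(self, evaluating, inputs_valid):
        if evaluating and inputs_valid:
            # Rising edge needs both conditions: stage evaluating AND inputs valid.
            self.done = True
        elif not evaluating:
            # Falling edge needs only the control: the stage has started precharging.
            self.done = False
        # Evaluating with inputs not (or no longer) valid: hold the previous value.
        return self.done


# Usage: one evaluate/precharge round trip.
detector = EarlyDoneDetector()
print(detector.update(evaluating=True, inputs_valid=False))   # waiting for data
print(detector.update(evaluating=True, inputs_valid=True))    # done asserted early
print(detector.update(evaluating=True, inputs_valid=False))   # inputs reset, done held
print(detector.update(evaluating=False, inputs_valid=False))  # precharge clears done
```

The asymmetry is that done rises only when both conditions hold but falls on the control signal alone, without waiting for the inputs to reset.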

  • LP2/2 Protocol
    • Completion detection occurs in parallel with evaluation/precharge
    • Events labeled in the timing diagram (stages N, N+1, N+2): N evaluates; N+1 evaluates

  • LP2/2 Performance
    • LP2/2 savings over PS0: 1 Evaluation + 1 Precharge
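As a worked version of this saving, the cycle times can be written symbolically. The expression for PS0 is an assumption derived from its six-event cycle (three evaluations, two completion detections, one precharge); the symbols $t_{Eval}$, $t_{Prech}$ and $t_{CD}$ are notation introduced here, not taken from the slides.

```latex
\begin{align*}
T_{PS0}   &= 3\,t_{Eval} + 2\,t_{CD} + t_{Prech} \\
T_{LP2/2} &= T_{PS0} - t_{Eval} - t_{Prech} \\
          &= 2\,t_{Eval} + 2\,t_{CD}
\end{align*}
```

Under this assumption, an LP2/2 cycle contains no precharge delay at all, only two evaluations and two completion detections.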

  • Dual-Rail Design #3: LP2/1
    • Hybrid of LP3/1 and LP2/2. Combines:
        • early evaluation of LP3/1
        • early done of LP2/2


  • Single-Rail Design: LPSR2/1
    • Derivative of LP2/1, adapted to single-rail:
        • bundled-data: matched delays instead of completion detectors

  • Inside an LPSR2/1 Stage

  • LPSR2/1 Protocol
    • Event labeled in the timing diagram (stages N, N+1, N+2): N evaluates

  • Practical Issue: Handling Slow Environments
    • We inherit a timing assumption from Williams PS0:
        • Input (left) environment must precharge reasonably fast
    • Problem:
        • If the environment is stuck in precharge, all pipelines (incl. PS0) will malfunction!
    • Our Solution:
        • Add a special robust controller for the 1st stage
        • simply synchronizes the input environment and the pipeline
        • delays critical events until the environment has finished precharging
    • Modular solution overcomes a shortcoming of Williams PS0
    • No serious throughput overhead
        • the real bottleneck is the slow environment!

  • Results and Conclusions

  • Results
    • Designed/simulated FIFOs for each pipeline style
    • Experimental Setup:
        • design: 4-bit wide, 10-stage FIFO
        • technology: 0.6 µm HP CMOS
        • operating conditions: 3.3 V and 300 K

  • Comparison with Williams PS0
    • LP2/1: >2X faster than Williams PS0
    • LPSR2/1: 1.2 Giga items/sec

    Design     Style         Throughput (Mega items/sec)   Improvement (%)
    PS0        dual-rail     420                           -
    LP3/1      dual-rail     590                           40%
    LP2/2      dual-rail     760                           79%
    LP2/1      dual-rail     860                           102%
    LPSR2/1    single-rail   1208                          188%
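For readers who think in cycle times rather than throughput, the listed figures can be converted directly. The sketch below uses only the throughput values reported above; the function and variable names are hypothetical.

```python
# Throughputs as listed above, in Mega items/sec.
THROUGHPUT_MEGA_ITEMS_PER_SEC = {
    "PS0": 420,
    "LP3/1": 590,
    "LP2/2": 760,
    "LP2/1": 860,
    "LPSR2/1": 1208,
}


def cycle_time_ns(mega_items_per_sec):
    """One item every 1/throughput seconds; 1 / (1e6 items/sec) is 1000 ns."""
    return 1000.0 / mega_items_per_sec


def speedup(design, baseline="PS0"):
    """Throughput ratio of a design relative to the baseline."""
    table = THROUGHPUT_MEGA_ITEMS_PER_SEC
    return table[design] / table[baseline]


for name, rate in THROUGHPUT_MEGA_ITEMS_PER_SEC.items():
    print(f"{name:8s} {cycle_time_ns(rate):.2f} ns per item")

print(f"LP2/1 vs PS0: {speedup('LP2/1'):.2f}x")
```

With these figures, PS0 needs about 2.4 ns per item and LPSR2/1 about 0.83 ns, and the LP2/1 ratio is consistent with the ">2X faster" claim.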

  • Comparison: LPSR2/1 vs. Molnar FIFOs
    • LPSR2/1 FIFO: 1.2 Giga items/sec
        • Adding logic processing to FIFO: simply fold logic into the dynamic gate, with little overhead
    • Comparison with Molnar FIFOs:
        • asp* FIFO: 1.1 Giga items/sec
            • more complex timing assumptions, not easily formalized
            • requires explicit latches, separate from logic!
            • adding logic processing between stages incurs significant overhead
        • micropipeline: 1.7 Giga items/sec
            • two parallel FIFOs, each only 0.85 Giga items/sec
            • very expensive transition latches
            • cannot add logic processing to FIFO!

  • Practicality of Gate-Level Pipelining
    • When the datapath is wide:
        • Can often split it into narrow streams
        • Completion detection is then fairly low cost!
    • Use a localized completion detector for each stream:
        • need to examine only a few bits: small fan-in
        • send done to only a few gates: small fan-out

  • Conclusions
    • Introduced several new dynamic pipelines:
        • Use two novel protocols:
            • early evaluation
            • early done
        • Especially suitable for fine-grain (gate-level) pipelining
        • Very high throughputs obtained:
            • dual-rail: >2X improvement over Williams PS0
            • single-rail: 1.2 Giga items/second in 0.6 µm CMOS
        • Use easy-to-satisfy, one-sided timing constraints
        • Robustly handle arbitrary-speed environments
            • overcome a major shortcoming of Williams PS0 pipelines
    • Recent Improvement: Even faster single-rail pipeline (WVLSI00)