Computer Architecture - SNE/OS3 Homepage [OS3 .Computer Architecture VLIW ... Shekhar Borkar (2007)

  • Computer Architecture: VLIW - multi-cores - hardware multi-threading

    Lecture 12 - 24 September 2015

  • Summary of explicit concurrency (1)

    Power dissipation is one of the major issues in obtaining performance

    particularly true with constraints on embedded systems

    Concurrency can be used to reduce power and get performance

    reduce frequency which in turn allows a voltage reduction

    does not apply to pipelined concurrency where more concurrency increases power due to shared structures

  • Summary of explicit concurrency (2)

    VLIW is the simplest form of explicit concurrency; it reduces hardware complexity and hence power requirements

    drawbacks are code compatibility and intolerance to cache misses

    EPIC attempts to resolve these problems in VLIW

    this adds complexity and requires large caches to get good performance

    Multi-cores are scalable but need new programming models

    the big question is - can we design general purpose multi-cores?

    the Concurrent Systems course explores this research direction

  • OoO not scalable - what else?

    It is clear that out-of-order issue is not scalable

    register file and issue logic scale as $\mathrm{ILP}^3$ and $\mathrm{ILP}^2$ respectively

    A number of approaches have been followed to increase the utilisation of on-chip concurrency

    These include:

    VLIW processors

    Speculative VLIW processors - Intel's IA-64 - EPIC

    multi-core and multi-threaded processors

  • The power wall

  • Motivation

    Multi-cores typically exploit asynchronous concurrency

    This exposes a more complex machine model to software, so why go this route?

    Because of the power wall - cannot continue to increase execution frequency but still need to increase throughput

    Peter Hofstee (designer of Cell) recently said in a keynote: there are three orders of magnitude of power efficiency to be gained from conventional processors - steps are:

    few complex cores to multi-cores, then

    multi-cores to programmable logic

  • Technology scaling - theory

    This is not simple as it involves playing with many parameters such as voltage, frequency, capacitance etc.

    A technology scaling step aims to:

    reduce gate delay by 30%, increasing frequency by 43%

    double transistor density (line width reduced by a factor of 1.4)

    reduce transition energy by 65%, saving 50% power (at a 43% increase in frequency) by reducing Vdd
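
The arithmetic behind these targets can be checked with a short sketch. It assumes that frequency is the reciprocal of gate delay, that density goes as the square of the linear shrink, and that power is transition energy times frequency; the function name and parameter names are illustrative and not taken from the lecture.

```python
def scaling_step(delay_reduction=0.30, energy_reduction=0.65, linear_shrink=1.4):
    """Return the relative changes implied by one technology scaling step.

    Assumptions (not stated on the slide): frequency ~ 1 / gate delay,
    density ~ linear_shrink ** 2, power ~ transition energy * frequency.
    """
    frequency = 1.0 / (1.0 - delay_reduction)   # 1 / 0.7, about 1.43
    density = linear_shrink ** 2                # 1.4^2 = 1.96, about double
    energy = 1.0 - energy_reduction             # 0.35 of the previous energy
    power = energy * frequency                  # 0.35 / 0.7 = 0.5
    return {"frequency": frequency, "density": density,
            "energy": energy, "power": power}

step = scaling_step()
for name, factor in step.items():
    print(f"{name:9s} x{factor:.2f}")
```

Under these assumptions the 65% energy reduction and the 43% frequency increase combine to halve the power, which matches the 50% saving quoted on the slide.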

  • Power dissipation

    Power can be described by the following equation

    $$P = C V_t^2 F + t\, V_{dd}\, I_{short}\, F + V_{dd}\, I_{leak}$$

    1st term dominates (3rd will soon overtake it)

    $C$ is the capacitance being charged by gates, $V_t$ the voltage swing in charging, $F$ the average frequency of state change in all gates

    2nd term is power dissipated when the power and ground rails are shorted for a brief period during switching

    $I_{short}$ is the short-circuit current, $t$ the transition time

    3rd term is power dissipated in leakage across gates

    $I_{leak}$ is the leakage current, $V_{dd}$ the input voltage
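
As a rough illustration of how the three terms of the power equation compare, the sketch below evaluates each one separately. All numeric values are hypothetical and chosen only to keep the arithmetic easy to follow; they do not describe any real process.

```python
def cmos_power(c, v_swing, f, t_short, v_dd, i_short, i_leak):
    """Evaluate the three terms of the power equation (SI units)."""
    switching = c * v_swing ** 2 * f               # charging the gate capacitance
    short_circuit = t_short * v_dd * i_short * f   # rails briefly shorted per transition
    leakage = v_dd * i_leak                        # static, independent of frequency
    return switching, short_circuit, leakage

# Hypothetical values, for illustration only.
terms = cmos_power(c=1e-9, v_swing=1.0, f=1e9,
                   t_short=1e-11, v_dd=1.0, i_short=1.0, i_leak=0.1)
total = sum(terms)
for label, p in zip(("switching", "short-circuit", "leakage"), terms):
    print(f"{label:13s} {p:.3f} W ({100 * p / total:.0f}% of total)")
```

Only the first two terms scale with frequency, so lowering the clock alone leaves the leakage term untouched; reducing the supply voltage lowers all three.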

  • Scaling prediction

    Shekhar Borkar (2007), Thousand Core Chips: A Technology Perspective, DAC 2007, June 4-8, 2007, San Diego, California, USA.

  • Example HPC processors

    Source: http://en.wikipedia.org/wiki/List_of_CPU_power_dissipation

    Complex cores

    Simple cores

    Complex again

  • Power-related constraints

    Problems caused by power dissipation (heat):

    thermal runaway - the leakage currents of transistors increase with temperature

    electro-migration - diffusion of metal under current flow increases with temperature (fuse effect)

    electrical parameters shift - CMOS gates switch faster when cold (cf. to over-clock, get some liquid nitrogen!)

    silicon interconnect fatigue (expansion and contraction with temperature swings)

    package related failure

    Power dissipation constraints (e.g. ~100W per package) provide an upper limit on power usage, and thus component frequency and input voltage

  • Cooling costs

    From: S.H. Gunther, Managing the impact of increasing microprocessor power consumption, Intel Technology Journal Q1, 2001

  • Impact on performance

    High demand for portable devices, e.g. mobile phones, etc. - 95% of CMOS silicon production!

    Extensive use of multimedia features requires substantial performance (throughput) in these devices

    However, power usage is constrained by battery life!

    The energy from a battery will not grow significantly in the near future due to technological and safety reasons

    The main product feature is: hours of use and hours of standby

    There is a need for techniques to improve energy efficiency without penalising performance

  • Low-power strategies

    OS level - partitioning, power down while idle

    Software level - regularity, locality, concurrency

    Architecture level - pipelining, redundancy, data encoding

    ISA level - architectural design using explicit concurrency, conservative rather than speculative execution, memory hierarchy

    Circuit/logic level - logic styles, transistor sizing, energy recovery

    Choice of logic family, conditional clocking, adiabatic circuits, asynchronous design

    Technology level - Threshold reduction, multi-threshold devices

    S. Younis and T. Knight, Asymptotically zero energy computing using split-level charge recovery logic, Technical Report AITR-1500, MIT AI Laboratory, June 1994.

    Michael P. Frank, Physical Limits of Computing. Lecture #24 Adiabatic CMOS, Spring 2002.

  • Reducing power - system level

    The main technique for minimising power dissipated for a given chip architecture and technology implementation is to adjust the frequency and/or voltage supply

    Reducing voltage increases gate delay and reduces operating frequency

    $F_{max} = K (V_{dd} - V_t)^2$

    Most microprocessor systems reduce frequency when processor is idle

    Many signal processors reduce voltage and frequency depending on system load

    ARM are investigating algorithm-controlled dynamic voltage scaling - e.g. MPEG encoding based on image flow complexity

    The more complex the image flow (artifacts moving in the image) the more computation is required hence the algorithm can predict the performance - frequency - voltage required
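
A minimal sketch of load-dependent voltage and frequency scaling, using the slide's relation between maximum frequency and supply voltage. The threshold voltage, the constant K, the 2 GHz / 1.2 V operating point and the helper names are all hypothetical; dynamic power is taken to scale as f times the square of Vdd.

```python
import math

def f_max(v_dd, k, v_t):
    """Maximum operating frequency for a supply voltage: K * (Vdd - Vt)^2."""
    return k * (v_dd - v_t) ** 2

def min_vdd_for(frequency, k, v_t):
    """Invert the relation: lowest supply voltage that still sustains `frequency`."""
    return v_t + math.sqrt(frequency / k)

# Hypothetical device: threshold 0.3 V, K chosen so that 1.2 V gives 2 GHz.
V_T = 0.3
F_FULL, V_FULL = 2.0e9, 1.2
K = F_FULL / (V_FULL - V_T) ** 2

for load in (1.0, 0.5, 0.25):
    f = load * F_FULL
    v = min_vdd_for(f, K, V_T)
    # Dynamic power ~ f * V^2, normalised to the full-speed operating point.
    rel_power = (f * v ** 2) / (F_FULL * V_FULL ** 2)
    print(f"load {load:4.2f}: f = {f / 1e9:.2f} GHz, "
          f"Vdd = {v:.3f} V, dynamic power x{rel_power:.2f}")
```

Because the voltage can drop along with the frequency, the dynamic power falls faster than the load does, which is the reason for scaling both rather than the clock alone.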

  • Reducing power - architecture level

    Avoid speculation and issue instructions conservatively

    any mispredicted speculation requires power dissipation for no tangible results

    Organise and power up memory by banks independently

    sequential access to memory

    Avoid unnecessary operations

    e.g. reading registers where data is bypassed

    Reduce number of swings on data busses

    e.g. use Gray codes to exploit locality

    only send bits that change - form of data compression
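
The effect of Gray coding on bus activity can be sketched by counting bit flips for a sequential access pattern. The helper names and the 16-address pattern are illustrative assumptions, not part of the lecture.

```python
def to_gray(n):
    """Convert a binary integer to its reflected Gray code."""
    return n ^ (n >> 1)

def bus_transitions(values):
    """Count the bit flips on a bus that sends `values` one after another."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))

addresses = list(range(16))   # a sequential access pattern
binary_flips = bus_transitions(addresses)
gray_flips = bus_transitions([to_gray(a) for a in addresses])
print(binary_flips, gray_flips)   # 26 15
```

With Gray coding, consecutive addresses differ in exactly one bit, so the 15 steps cost 15 flips instead of 26 with plain binary encoding.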

  • Reducing power - circuit level

    Clock gating

    avoid transitions in logic where no activity is taking place - can also save on clock distribution to those areas

    Double clocking with half rate clocks

    use both edges of clock pulse to reduce power dissipation in clock distribution - does not reduce logic transitions

    40% of power dissipation in the Alpha 21164 was in clock distribution (at 300 MHz)

    Asynchronous logic design

    only dissipate power on gate transitions that are required

  • Multi-core processors and hw multithreading

    Explicit concurrency in hardware, explicit concurrency in low-level software

  • Pollack's rule

    This is an empirical trend observed by Pollack

    Sequential performance is roughly proportional to the square root of the core's complexity

    i.e. linear with line width not as the square

    double the logic to get roughly a 40% increase in performance (about 1.4x)

    Multi-core however can deliver close to linear increase for loosely coupled work loads if overheads of concurrency can be minimised
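
A small sketch comparing the two ways of spending a fixed logic budget under Pollack's rule. The `efficiency` parameter is a hypothetical stand-in for concurrency overheads, and the budget is expressed in units of one baseline core.

```python
import math

def single_core_perf(complexity):
    """Pollack's rule: sequential performance ~ sqrt(core complexity)."""
    return math.sqrt(complexity)

def multi_core_perf(total_complexity, cores, efficiency=1.0):
    """Throughput of `cores` equal cores sharing a fixed logic budget.

    `efficiency` (hypothetical) models concurrency overheads:
    1.0 means perfectly scalable, loosely coupled work.
    """
    per_core = single_core_perf(total_complexity / cores)
    return cores * per_core * efficiency

budget = 4.0                        # logic budget in baseline-core units
print(single_core_perf(budget))     # one core with 4x the logic: 2.0
print(multi_core_perf(budget, 4))   # four baseline cores: 4.0
```

With the same budget, one large core doubles sequential performance, while four baseline cores can approach a fourfold throughput gain if the workload scales and overheads stay small.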

  • Less power for same performance

    Use frequency-voltage scaling for dynamic power, i.e. reduce f and reduce V

    Dynamic power proportional to f and $V^2$; single-core throughput proportional to f (execution time proportional to 1/f)

    2 processors running at f/2 can give the same performance as one running at f for scalable computation

    The ratio of dynamic power dissipated is

    1 core at f proportional to: $f V_1^2$, 2 cores at f/2 proportional to: $2 (f/2) V_2^2 = f V_2^2$

    i.e. the power is reduced by a factor of $(V_2/V_1)^2$

    Hence spread work across cores through concurrency and reduce power dissipated by the square of the voltage ratio - multiple cores also help spread dissipation across the chip
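
The comparison can be checked numerically with a short sketch. The two supply voltages are hypothetical operating points picked for illustration; the only model used is dynamic power proportional to f times the square of V per core.

```python
def dynamic_power(cores, f, v):
    """Relative dynamic power: each core contributes f * V^2."""
    return cores * f * v ** 2

# Hypothetical operating points: 1.2 V at full frequency, 0.9 V at half frequency.
f, v1, v2 = 1.0, 1.2, 0.9
one_core = dynamic_power(1, f, v1)
two_cores = dynamic_power(2, f / 2, v2)
ratio = two_cores / one_core
print(f"power ratio = {ratio:.4f}, (V2/V1)^2 = {(v2 / v1) ** 2:.4f}")
```

Both printed numbers agree, since the factor of two from the extra core cancels the halved frequency and only the voltage ratio remains.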

  • Multi-cores - main players

    Sun (now Oracle) was the forerunner in this field with its Niagara chips

    Intel have moved to multi-core without significantly changing their architecture

    i7 is a quad-core with a 14-stage speculative pipeline, Poulson IA-64 has 8 cores

    80-core Tera-Scale research chip - each core has 256KB local memory and each cores