Low-power Architecture By: Jonathan Herbst Scott Duntley

Why low power?

• Has become necessary with new-age demands:
  o Increasing design complexity
  o Demands of and for portable equipment
    - Communication
    - Media
    - Mobile computers
  o Most embedded systems run on batteries
    - Objective: extend battery life as long as possible without sacrificing too much performance
  o Lower running costs $$
• Go green!

Low power architecture

• Memory techniques
  o Associativity
  o Low-power refresh
  o Drowsy cache
• Bus techniques
  o Bus inversion
• ISA
• Branch prediction
• Parallel processing vs. superpipelining
• Clock gating/scaling
• Voltage scaling
• Cortex A8

Memory - Associativity

• Direct-mapped cache - Least power -> no block searching
• Conventional set associative
  o On a block read -> both the tag and data arrays are read
  o Data is written to the bus -> only used if the tags match
  o As associativity increases, power consumption increases
• Alternative: Phased set associative
  o Tag and data are broken into sub-arrays
  o Only the tag array is read and compared
  o Data sub-array is read/written to a buffer upon a cache hit, and then to the bus
  o Advantage: Less power consumption by avoiding unnecessary data reads
  o Disadvantage: Takes 2 clock cycles rather than one
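
The difference between the two lookup schemes can be sketched by counting array reads per access. This is a minimal illustrative model, not a power simulator: the function name `array_reads` and its return fields are hypothetical, and it assumes every way's tag is compared on each lookup.

```python
def array_reads(ways: int, hit: bool, phased: bool) -> dict:
    """Count tag- and data-array reads for one lookup in a set-associative cache."""
    tag_reads = ways  # assumption: the tag of every way is read and compared
    if phased:
        # Phase 1 reads only the tags; phase 2 reads a single data way,
        # and only when the tag comparison reports a hit.
        data_reads = 1 if hit else 0
        hit_cycles = 2
    else:
        # Conventional: all data ways are read in parallel with the tags,
        # and the ways that do not match are discarded.
        data_reads = ways
        hit_cycles = 1
    return {"tag_reads": tag_reads, "data_reads": data_reads, "hit_cycles": hit_cycles}


conventional = array_reads(ways=4, hit=True, phased=False)
phased = array_reads(ways=4, hit=True, phased=True)
```

For a 4-way cache, the conventional lookup reads four data ways per hit while the phased lookup reads one, at the cost of the extra cycle.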

Memory - Phased set associative

[Figure: Phased Set Associative Cache]

Memory - Associativity - Benchmark

Cache Type                      Miss Rate   Average Power Increase from Direct-Mapped
Direct-Mapped                   0.046       -
4-way Set Associative           0.035       85.6%
4-way Phased Set Associative    0.035       68.5%

Cache power analysis

Power Management

Static
• Power Domains
• Voltage Domains

Dynamic
• Clock Scaling/Gating
• Voltage Scaling
• Wait-For-Interrupt

Memory - Drowsy cache

• Modern processors -> Growing cache size
  o Contributes a sizable fraction of a chip's power consumption
  o As transistor sizes decrease -> large amount of power lost to leakage
• Idea: Put the cold cache lines into a state-preserving low-power state to reduce leakage current
  o Low-power state = 25% of full-power energy
• Disadvantage: Slight performance loss due to the "wake-up" time required to access a drowsy cache line
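
The leakage saving can be estimated with a short sketch. It uses the 25% figure from the slide; the function name `relative_leakage` is hypothetical, and the model assumes every line leaks equally and ignores wake-up energy.

```python
DROWSY_FRACTION = 0.25  # from the slide: low-power state = 25% of full-power energy


def relative_leakage(total_lines: int, drowsy_lines: int) -> float:
    """Leakage energy relative to a cache with every line awake (1.0 = no saving)."""
    if not 0 <= drowsy_lines <= total_lines:
        raise ValueError("drowsy_lines must be between 0 and total_lines")
    awake_lines = total_lines - drowsy_lines
    # Awake lines leak at full power, drowsy lines at the reduced fraction.
    return (awake_lines + DROWSY_FRACTION * drowsy_lines) / total_lines


mostly_cold = relative_leakage(total_lines=1000, drowsy_lines=800)
```

Under these assumptions, a cache with 80% of its lines drowsy leaks about 40% of what a fully awake cache would.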

Drowsy cache - Benchmark

[Figure: Drowsy cache benchmark]

Buses - Bus inversion

Dynamic power: $P = \alpha f C V^2$

Where:
• α (alpha) = switching factor
• f = clock frequency
• C = capacitance
• V = voltage

Want to:
• Minimize switching factor
• Bus lines are normally of high capacitance
  o Large amount of power consumption due to switching

Idea:
• If the number of bits on an N-bit line that need to switch is > N/2
  o Invert the entire line, and then switch the necessary bits back
  o Advantage: Less power consumed
  o Disadvantage: More hardware needed (an extra invert signal plus encode/decode logic)
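
The inversion rule can be sketched as an encoder/decoder pair. This is a minimal sketch, assuming one extra invert line accompanies the bus; the function names are hypothetical, and the toggling of the invert line itself is not counted.

```python
def bus_invert_encode(prev_bus: int, data: int, width: int) -> tuple:
    """Return (value to drive on the bus, invert flag) for the next transfer."""
    mask = (1 << width) - 1
    # Number of lines that would toggle if the data were sent unmodified.
    toggles = bin((prev_bus ^ data) & mask).count("1")
    if 2 * toggles > width:  # more than N/2 lines would switch
        return (~data) & mask, 1  # send the inverted word and flag it
    return data & mask, 0


def bus_invert_decode(bus: int, invert: int, width: int) -> int:
    """Recover the original data word at the receiver."""
    mask = (1 << width) - 1
    return (~bus) & mask if invert else bus & mask


bus, invert = bus_invert_encode(prev_bus=0b00000000, data=0b11111110, width=8)
recovered = bus_invert_decode(bus, invert, width=8)
```

In this example, sending the word unmodified would toggle seven of the eight lines; sending it inverted toggles only one data line.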

[Figure: Bus Inversion]

Parallel Processing and Pipelining

Parallel Computations
• Multiple cores
• Multiple-issue pipelines
• Linear power increase

Pipelining
• Faster clock
• Superlinear power increase (a faster clock usually needs a higher supply voltage)
• Longer branch misprediction penalties
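
The contrast can be illustrated with the standard CMOS dynamic power relation, P = α · f · C · V². The sketch below is a rough comparison with normalized, made-up values; it assumes that a second core doubles the switched capacitance and, as a simplification, that doubling the clock requires doubling the supply voltage.

```python
def dynamic_power(alpha: float, freq: float, cap: float, volt: float) -> float:
    """Dynamic power P = alpha * f * C * V^2 (all quantities normalized)."""
    return alpha * freq * cap * volt ** 2


base = dynamic_power(alpha=0.5, freq=1.0, cap=1.0, volt=1.0)

# Two cores at the same clock: assumed to double the switched capacitance.
two_cores = dynamic_power(alpha=0.5, freq=1.0, cap=2.0, volt=1.0)

# One core at twice the clock: assumes the voltage must scale with frequency.
fast_clock = dynamic_power(alpha=0.5, freq=2.0, cap=1.0, volt=2.0)

parallel_ratio = two_cores / base
clock_ratio = fast_clock / base
```

Under these assumptions the second core costs 2x the power, while the doubled clock costs 8x for the same nominal throughput gain.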

Low power & ISA

• Single Instruction, Multiple Data (SIMD)
  o Reduce number of instruction fetches/decodes -> Reduce power
• RISC vs. CISC
  o Application-specific (ASP) embedded - CISC
    - More specific hardware helps reduce overhead from general hardware -> less power
  o General embedded - RISC
    - Less specific operations needed
    - Reduced complexity helps with power consumption
  o The line is blurring - less and less need for ASPs since general-purpose processors (GPPs) are rapidly becoming more powerful and lower-power

Branch prediction techniques

• Accurately predict branches without too much complexity
  o Static branch prediction
    - Simple, done at compile time by ISA
    - Examination of program behavior
    - Choose backward branches taken, forward branches not taken
  o Dynamic branch prediction
    - More complex, more hardware
    - Occurs during run-time
    - Higher power consumption but much more accurate
    - Branch Target Buffer (BTB)
    - Pattern History Table (PHT)
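
The two approaches can be compared on a single loop branch. This is a minimal sketch: `predict_static` implements the backward-taken, forward-not-taken rule, and `PatternHistoryTable` is a hypothetical, simplified PHT of 2-bit saturating counters indexed by the low bits of the branch address (the table size and initial counter state are assumptions).

```python
def predict_static(branch_pc: int, target_pc: int) -> bool:
    """Static rule: predict taken only for backward branches."""
    return target_pc < branch_pc


class PatternHistoryTable:
    """Table of 2-bit saturating counters (0-1 predict not taken, 2-3 predict taken)."""

    def __init__(self, entries: int = 16):
        self.counters = [1] * entries  # assumption: start at weakly not taken

    def _index(self, pc: int) -> int:
        return pc % len(self.counters)

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        step = 1 if taken else -1
        self.counters[i] = max(0, min(3, self.counters[i] + step))


# A loop-closing branch at 0x40 jumping back to 0x20: taken 9 times, then exits.
outcomes = [True] * 9 + [False]

static_hits = sum(predict_static(0x40, 0x20) == taken for taken in outcomes)

pht = PatternHistoryTable()
dynamic_hits = 0
for taken in outcomes:
    dynamic_hits += pht.predict(0x40) == taken
    pht.update(0x40, taken)
```

On a lone loop branch like this one the static rule is already hard to beat; the extra hardware of a dynamic predictor pays off on branches whose behavior cannot be inferred from direction alone.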

Cortex A8 Die

Cortex A8 Architecture

Architecture Overview

• < 300 mW to 1 W Power Consumption

• 600 MHz at 1.08 V or 1 GHz at 0.9 V configurations (up to 1.5 GHz, but at a significant power increase)

• 13-stage, 2-issue superscalar pipeline

• Static scheduling scoreboard

• Integrated NEON multimedia pipeline

• Static and dynamic power management

Static Scheduling Scoreboard

• Static instruction scheduling
• In-order issue, in-order retire
• Dynamic voltage and clock scaling

Pending Queue:
• Takes better advantage of 2-issue pipeline

Replay Queue:
• Holds issue information only
• Avoids long cache miss stalls

Instruction Set Architecture

• RISC Architecture

• 2-issue instructions

• Multicycle instructions

• SIMD Instructions for NEON

• Shift included instructions

• 32-bit instructions compressed to 16-bit for a 30% code size reduction

Branch Prediction

• 95% accuracy
• 10-bit Global History Register (GHR)
• 4096-entry (256x16) Global History Buffer (GHB) with 2-bit saturating counters
  o Column indexed by first 8 bits of GHR
  o Row indexed by last two bits of GHR XORed with low 4 bits of PC
• 512-entry Branch Target Buffer (BTB)
  o Indexed by address
  o Stores branch address and branch type
• 1 stall cycle on branch taken
• 13-cycle penalty on misprediction
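
The GHB indexing described above can be sketched as follows. This is an interpretation of the slide, not ARM's documented implementation: it assumes the "first 8 bits" of the GHR are the most significant ones, that new outcomes shift in at the least significant end, and that "low 4 bits of PC" means `pc & 0xF`. The function names are hypothetical.

```python
GHR_BITS = 10
GHR_MASK = (1 << GHR_BITS) - 1


def ghb_index(ghr: int, pc: int) -> tuple:
    """Return (row, column) into the 16-row x 256-column counter table."""
    ghr &= GHR_MASK
    column = ghr >> 2                # assumed: upper 8 bits of the 10-bit GHR
    row = (ghr & 0b11) ^ (pc & 0xF)  # last 2 GHR bits XORed with low 4 PC bits
    return row, column


def update_ghr(ghr: int, taken: bool) -> int:
    """Shift the newest branch outcome into the global history register."""
    return ((ghr << 1) | int(taken)) & GHR_MASK


row, column = ghb_index(ghr=0b1010101011, pc=0x8004)
```

The XOR with PC bits lets branches that share the same recent history still land on different counters.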

Memory

L1 Cache
• 32 or 64 KB
• Separate instruction and data cache
• 1 cycle latency
• 4-way set associative
• Hash Virtual Address Buffer

Data Cache
• 3-entry 64-bit integer store buffer
• 8-entry 128-bit NEON store buffer

L2 Cache
• Up to 1 MB
• 8 cycle latency
• 8-way set associative
• Nonblocking NEON loads

Static Power Management

[Figure: Power Domains]

[Figure: Voltage Domains]

Dynamic Power Management

• Wait-For-Interrupt Architecture

• Clock gating

• Voltage scaling

Future of ARM?

• ARM chips currently offered at $10-20 apiece
  o Intel Atom -> $35+
• ARM currently controls about 90% of the mobile phone processor market -> Low price/power
  o Intel still needs more R&D to be able to compete with ARM power specs
• Why not for laptops/netbooks?
  o Regular Windows cannot run on it (Linux/Android)
    - Windows Mobile/CE (Embedded Compact)
  o Excludes main part of consumer PC market
  o Mainstream version release of Windows -> Supports ARM
    - ARM could easily move into market
• Increasing parallelism
• Increased performance-to-power ratio

Future/Theoretical: DRAM Refresh

• Two ideas, but not necessarily implemented yet:
  o Intelligent Refresh
    - Idea: A cell that has been written to or read recently does not need to be refreshed
    - Most effective power reduction during periods of heavy use
    - Drawback: Large amount of overhead needed to keep track of which cells have been accessed recently
  o OS-Controlled Refresh
    - Idea: Not necessary to refresh unused memory, so disable it
    - The OS knows what memory has been used
    - Instead of only swapping out pages when memory is full, swap out unused memory -> No refresh
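
The intelligent-refresh idea can be sketched as a filter over per-row timestamps. This is a minimal sketch under stated assumptions: tracking is per row rather than per cell, `last_access` holds the time of each row's most recent access or refresh, and the time units and retention value are made up. The list itself is the bookkeeping overhead named as the drawback.

```python
def rows_to_refresh(last_access: list, now: int, retention: int) -> list:
    """Return the rows whose last access or refresh is older than the retention window."""
    return [row for row, t in enumerate(last_access) if now - t >= retention]


# Rows 1 and 2 were accessed recently, so only rows 0 and 3 need a refresh.
due = rows_to_refresh(last_access=[0, 50, 90, 10], now=100, retention=64)
```

The more rows the workload touches within the retention window, the fewer refreshes remain, which matches the claim that the saving is largest during heavy use.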

Conclusion

• Basic idea - Reduce power
  o Trade-off -> lower performance and/or more complexity
• Recent architecture and design trends
  o Static power becoming as important as dynamic
    - Dynamic
    - Static
  o Reduce any of these, reduce overall power