20
Xiaochen Guo , Engin Ipek, and Tolga Soyata Rochester Computer Systems Architecture Laboratory RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, STT-MRAM BASED COMPUTING

RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Embed Size (px)

Citation preview

Page 1: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Xiaochen Guo, Engin Ipek, and Tolga Soyata

Rochester Computer Systems Architecture Laboratory

RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, STT-MRAM BASED COMPUTING

Page 2: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Multicore Scaling Limited by Power

6/21/12

2

  Traditional MOSFET scaling theory relies on reducing VDD in proportion to device dimensions

  VDD has scaled very slowly since 90nm   Multicore scaling severely challenged by power

P = Pdynamic + Pstatic = N (Ceff VDD2 f + Ileak VDD) Pdynamic = N (Ceff VDD2 f )

Ileak∝ e-Vth

2x

2x 1.4x

1.4x 1.4x

Page 3: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Our Approach: Resistive Computation

6/21/12

3

  Opportunity: spin-torque transfer magnetoresistive RAM (STT-MRAM)  Near-zero leakage power   Low-energy read operation

  Goal: selectively migrate on-chip storage and combinational logic to STT-MRAM to reduce power  On-chip storage

  Caches, TLBs, RF, queues

 Combinational logic   Lookup-table (LUT) based computing

Page 4: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

STT-MRAM

6/21/12

4

  Desirable properties  CMOS compatibility  Read speed as fast as SRAM  Density comparable to DRAM  Unlimited write endurance

  Key challenge: expensive writes   Long switching latency (6.7ns @ 32nm)  High switching energy (0.3pJ/bit @ 32nm)

+" -"Vwrite"

Value = 1"

-" +"Vwrite"

Value = 0"

-" +"Vread"

MTJ"

Access transistor"

Page 5: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Switching Time vs. Cell Size

  Faster switching with wider access transistors + Faster writes - Slower reads - Lower density - Higher read energy

6/21/12

5

RF, L1D$

L2$, L1I$, LUTs, TLBs, MC Queues

Page 6: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

RAM Arrays and Lookup Tables

Fundamental Building Blocks

Page 7: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

  Problem: low write throughput

  Existing solutions incur high overheads to sustain adequate write throughput in STT-MRAM arrays

STT-MRAM Arrays

6/21/12

7

Multiporting Banking

Page 8: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

STT-MRAM Arrays

  CMOS subbank buffers   Latch in addr/data and

release H-tree; complete write locally

 Allow forwarding from ongoing writes

  Facilitate local differential writes

  Reads access subbank via exclusive read port

6/21/12

8

Page 9: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

STT-MRAM LUTs [Suzuki09, Matsunaga08]

  Store truth tables of logic functions directly in STT-MRAM

  Benefits   Leakage confined to

peripheral circuitry   Low-power (low-swing)

lookups   Fast lookups using sense amp

  Logic functions with many minterms can utilize LUTs effectively

6/21/12

9

Page 10: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Case Study: 3-bit Adder

6/21/12

10

Page 11: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Pipeline Organization

Page 12: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Hybrid CMT Pipeline

6/21/12

12

Small arrays and simple logic in CMOS

Large arrays and complex logic in STT-MRAM

Page 13: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Front End

6/21/12

13

LUT-based carry-select adder to compute PC+4

LUT-based front-end thread selection logic

SRAM-based refill queue to avoid I$ conflicts

Predecode and back-end thread selection with MRAM-related stall conditions

Page 14: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Register File

6/21/12

14

Architectural registers of all threads aggregated in a unified STT-MRAM array to amortize subbank buffers

Registers of a single thread striped across subbanks to reduce subbank buffer conflicts

Page 15: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Floating-Point Unit

6/21/12

15

STT-MRAM FPU

CMOS FPU

Add, Sub, Mult

24 cycles 12 cycles

Div 64 cycles 64 cycles

Page 16: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Memory System

6/21/12

16

Use store buffers to avoid L1 D$ subbank conflicts

L1s optimized for fast writes using 30F2 cells

L2 and memory controllers optimized for density using 10F2

cells

Page 17: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Evaluation

Page 18: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Performance

6/21/12

18

Page 19: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Power

6/21/12

19

Page 20: RESISTIVE COMPUTATION: AVOIDING THE POWER WALL …isca2010.inria.fr/media/slides/isca_guo.pdf · RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, ... spin-torque transfer

Contributions and Findings

6/21/12

20

  New technique to reduce leakage and dynamic power in a deep-submicron microprocessor  Selectively migrate on-chip storage and combinational

logic from CMOS to STT-MRAM  Use subbank buffers to alleviate long write latency

  STT-MRAM is an attractive low-power solution beyond 32nm  Dramatically lower leakage power  Modest loss in performance