36
Sustainable Silicon: Energy‐Efficient VLSI Interconnect Research Patrick Chiang, Assistant Professor (and Students) Oregon State University VLSI Research Group [email protected] DOE, Aug 2011 h#p://eecs.oregonstate.edu/research/vlsi Collaborators: Department of Energy, Intel, Boeing, LSI, SRC, Air Force Research Laboratory, HP Labs, National Science Foundation

Sustainable Silicon: Energy‐Efficient VLSI Interconnect …science.energy.gov/~/media/ascr/ascac/pdf/meetings/aug11/Chaing.pdf · VLSI Interconnect ... IEEE Computer Architecture

  • Upload
    lykien

  • View
    223

  • Download
    3

Embed Size (px)

Citation preview

SustainableSilicon:Energy‐EfficientVLSIInterconnectResearch

PatrickChiang,AssistantProfessor(andStudents)OregonStateUniversityVLSIResearchGroup

[email protected]

DOE,Aug2011h#p://eecs.oregonstate.edu/research/vlsi

Collaborators: Department of Energy, Intel, Boeing, LSI, SRC, Air Force Research Laboratory, HP Labs, National Science Foundation

Whatdoesexascalemean?

IBMRoadrunner(12000PowerXCPUs6000AMDOpterons)

2018

Power 10‐20MW

Speed(petaflop=1015) 1000

Space ~2x

1000 ~1000x

? ????

IBMRoadrunner(12000PowerXCPUs6000AMDOpterons)

2009

Power 2.35MW

Speed(petaflop=1015FLOP/s)

1.04

Space 296Racks6000sq.R.

Memory 103Terabytes

Cost $125M

IBM: Roadrunner, Los Alamos (2009)

Notjustsuper‐computers

109 OPS < 1mW

SENSOR 1018 FLOPS

~1MW

EXASCALE 1012 FLOPS

~1W

EMBEDDED

Source: DARPA Exascale Study, 2008

4MajorChallenges

1)  MemoryandStorage

–  Capacity;Latency2)ConcurrencyandLocality

–  fCLK=1GHzParallelism

–  So[ware/hardware

4)EnergyandPower

–  Interconnect

Random Inputs

3)ResiliencyChallenge

–  LowVDD–  ProcessVariability

InterconnectEnergyWall

•  “TheEnergyandPowerChallengeisthemostpervasiveofthefour,andhasitsrootsintheinabilityofthegrouptoprojectanycombina_onofcurrentlymaturetechnologiesthatwilldeliversufficientlypowerfulsystemsinanyclassatthedesiredpowerlevels.“

•  “Akeyobserva_onofthestudyisthatitmaybeeasiertosolvethepowerproblemassociatedwithbasecomputa_onthanitwillbetoreducetheproblemoftranspor_ngdatafromonesitetoanother‐onthesamechip,betweencloselycoupledchipsinacommonpackage,orbetweendifferentracksonoppositesidesofalargemachineroom…”

Source: DARPA Exascale Study, Sep. 2008

•  Microprocessors:“at0.13umapproximately51%ofmicroprocessorpowerwasconsumedbyinterconnect,withaprojec_onthatwithoutchangesindesignphilosophy,inthenextfiveyearsupto80%ofmicroprocessorpowerwillbeconsumedbyinterconnect”,ITRS‐07

PowerDensity

Source: ITRS 2010

Notenoughpoweravailable

Source: Bill Dally, 2011

Needmorebandwidth

InterconnectEnergies

3.5% capacitance improvement / year

Gap widening > 100x

No solution < 1pJ/bit

Alterna_ves•  3DStacking

Source: G. Loh, 2008; Intel 2010

•  SiliconPhotonics

1.  Off-chip bandwidth: –  3D: dense through-silicon vias –  Photonics: wavelength division multiplexing

2.  Off‐chipenergy:

–  3D:makethingscloser

–  Photonics:uselow‐losslight

Electrical transceivers

Problem1:On‐ChipWires•  On‐chipwirepowerdoesnotscale

–  Dominatedbyinterconnectcapacitance(CVDD2)

ON-CHIP (Status Quo): 100 - 200fJ/bit/mm

[DOE, Exascale Workshop]

OUR GOAL: < 5fJ/bit/mm

Problem2:Off‐chipI/O

Source: Exascale Roadmap Meeting, Dec. 2009

OFF-CHIP: 1-10pJ/bit (1-10mW / Gbps)

CONVENTIONAL

OUR GOAL: < 0.1pJ/bit

Low‐PowerInterconnectsOverview

1.  On‐ChipI/O

–  Network‐on‐a‐chipwithreduced‐swinginterconnect–  Fundamentallimitstolow‐voltageswing

2.  Off‐ChipI/O–  Sub‐1mW/Gbpsoff‐chipI/O

Near‐ThresholdOpera_on(VDD~0.4V)

[S. Hanson, 2006] MIT, MICHIGAN, PURDUE, INTEL

Sync_um:near‐thresholdparallelprocessor

-  Throughput of SIMD; energy-efficiency of near-threshold operation

-  Eight parallel lanes, at near-threshold (VDD=0.5V)

-  Variation Tolerance: Razor-like detection/recovery per lane -  Lane weaving -  Decoupled instruction queues"

E. Krimer, R. Pawlowski, M. Erez, P. Chiang, "Synctium: a Near-Threshold Stream Processor for Energy-Constrained Parallel Applications", IEEE Computer Architecture Letters, 2010.

(1)Energy‐Efficient,On‐ChipLinks

•  OurGoal:low‐poweron‐chiplinks–  Analoglow‐voltageswing:

–  (3)XBARS–  (4)LINKTRAVERSAL

•  RouterPower:–  (1)Buffering:30%–  (2)Arbitra_on:10%–  (3)XBAR:30%–  (4)LINKS:30%

Intel, 80 Cores, ISSCC 2007

1

2

3

4

TokenFlowControlNoC(Li‐ShiuanPeh,MIT)

•  Conven_onalRouter:–  Eachhoprequires4cycles

•  ProposedTFCRouter:–  Firsthoprequires4cycles–  Followinghopsrequire2cycles

•  Tokensforadvancealloca_on–  Ifli#leconges_on,bufferingisskipped

•  NoCpowerdominatedbyXBARandLT–  TFCreducesbufferwrites

R R R R R R R R R R R R R R R R R R R G G R R R R R G R R G R R R R G R R G R R R R R G G R R R R R R R R R R R R R R R R R R R

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

XBAR

LINKS

Low‐Swing,Bitcell‐BasedCrossbar

0 20 40 60 80

100 120 140

0 0.25 0.5 0.75 1 1.25

Diff

. Mod

e C

ross

talk

(mV

)

Aggressor Distance from Center of Diff. Pair (um)

Shielded

Unshielded

MeasurementSummary

21 ICCD, 2010

NextGoal:1‐5fJ/mmConven_onalFullSwing

Schinkel(JSSCC‘09)

Stojanovic(ISSCC‘09)

OurTFCRouter

NewGoal

WireLength 1mm 2mm 10mm 1mm 1mm‐5mm

Supply 1.2V 1.2V ‐ 1.2V 0.5‐1.0V

TransceiverArea

21um2 TX:20um 2880um2 23um2 20‐30um2

SignalSwing 1.2V 120mV 200mV 250mV 50mV

Energy/Bit 305fJ 105fJ 356fJ 28‐60fJ 1‐5fJ/mm

•  Determine:Fundamentallimitstoenergy‐efficient,on‐chiplink

•  GOAL:5‐50ximprovementinon‐chiplinkenergy

–  Energyscalability–  Lowarea–  Robust

Energy‐ScalableOn‐ChipTxRx

Channel: 1-4mm

Digital Offset Cancellation Decision Feedback Equalization

DynamicOpera_on

On‐ChipLinks–MeasurementResults

InverterCross‐OverPoint

Off‐ChipI/OScalingTrends

•  I/Opowerefficiencyisafunc_onof:•  datarate

•  processtechnology

•  channellossequaliza_oncomplexity

•  Projectgoal:<1mW/Gbpsoverascalable5‐10GbpsdatarateSource: S. Palermo, Texas A&M

(2)Off‐ChipLinks:GlobalClockDistribu_onOp_miza_on(withIntel)

•  Clockdominatespowerinlinks

•  Clockdominatespowerinlinks

•  Isthereawaytoshareclockpower?

Clk/CDR is 54% of Total Power (10Gbs)

Clk/CDR is 71% of Total Power (15Gbs)

Intel (VLSI, 2007)

Conven_onalClockgenera_onanddistribu_on

1)   CLOCK DISTRIBUTION: 0.5W in the global distribution alone

2) DESKEW: Significant energy in multi-phase generation

Chip#1:Injec_on‐LockedReceiverArchitecture

31

ILRO:ExtensionofAdler’sEqua_onAdler’s doesn’t apply

to ring oscillators!

Chip#2:Near‐Threshold,0.15mW/Gbps,8GbpsSerialLinkReceiver

–  Sub‐harmonicInjec_on‐Locking– OperatesatVDD~0.6V

MeasuredDeskewRangeandBER

ComparisonTable

accepted, CICC-2011

Take‐AwayPoints•  Lowerenergysiliconispossible

– Aggressiveinterconnectcircuitsshow:•  Off‐Chip:5ximprovements•  On‐Chip:50ximprovements

•  Reliabilityatlow‐Vddisissue–  Explorein‐situadapta_ontoself‐healautonomously

•  MagicbulletsdoNOTexist–  Lowerenergy‐‐>lowerperformance

–  Dynamicallyadapttheen_resystem–  Requiresco‐designinterac_onbetweensoRware,architecture,andunderlyingsilicon

Outreach•  Publica_ons

–  K.Hu,T.Jiang,S.Palermo,andP.Chiang,"Low‐Power8Gb/sNear‐ThresholdSerialLinkReceiversUsingSuper‐HarmonicInjec_onLockingin65nmCMOS",Accepted,IEEECustomIntegratedCircuitsConference,2011.

–  R.Bai,J.Wang,L.Xia,F.Zhang,Z.Yang,W.Hu,P.Chiang,"SinusoidalClockSamplingforMul_‐GigahertzADCs",accepted,IEEETransac_onsonCircuitsandSystem‐I,2011.

–  LingliXia,JingguangWang,WillBeake,JacobPostman,andPatrickYinChiang,"Sub‐2ps,Sta_cPhaseErrorCalibra_onTechniqueIncorpora_ngMeasurementUncertaintyCancella_onforMul_‐GigahertzTime‐InterleavedT/HCircuits",accepted,IEEETransac_onsonCircuitsandSystem‐I,2011.

–  K.Hu,L.Wu,andP.Chiang,"ACompara_veStudyof20‐Gb/sNRZandDuobinarySignalingUsingSta_s_calAnalysis",accepted,IEEETransac_onsonVLSISystems,May2011.

–  T.Jiang,W.Liu,C.Zhong,F.Zhong,P.Chiang,"Single‐Channel,1.25‐GS/s,6‐bit,Loop‐UnrolledAsynchronousSAR‐ADCin40nm‐CMOS",IEEECustomIntegratedCircuitsConference,Sep.2010.

–  J.PostmanandP.Chiang,"Energy‐EfficientTransceiverCircuitsforShort‐RangeOn‐chipInterconnects",Accepted,IEEECustomIntegratedCircuitsConference,2011.

–  B.Goska,J.Postman,M.Erez,P.Chiang,"Hardware/so[wareco‐designforenergy‐efficientparallelcompu_ng,"accepted,DepartmentofEnergySciDACConference,July2011.

–  JosephCrop,RobertPawlowski,NarimanMoezzi‐Madani,JarrodJacksonandPatrickChiang,"DesignAutoma_onMethodologyforImprovingtheVariabilityofSynthesizedDigitalCircuitsOpera_ngintheSub/Near‐ThresholdRegime,"accepted,WorkshoponLow‐PowerSystemonChip(SoC),2ndGreenCompu_ngConference,2011.

–  T.Krishna,J.Postman,L.‐S.Peh,P.Chiang,"SWIFT:ASWing‐reducedInterconnectForaToken‐basedNetwork‐on‐Chipin90nmCMOS",Interna_onalConferenceonComputerDesign(ICCD),Amsterdam,Netherlands,October2010.

–  E.Krimer,R.Pawlowski,M.Erez,P.Chiang,"Sync_um:aNear‐ThresholdStreamProcessorforEnergy‐ConstrainedParallelApplica_ons",IEEEComputerArchitectureLemers,Jan/June2010.

•  Talks–  P.Chiang,"Hardware/so[wareco‐designforenergy‐efficientparallelcompu_ng,"accepted,DepartmentofEnergySciDACConference,July2011.–  P.Chiang,Carnegie‐Mellon,May2011.

–  P.Chiang,Princeton,May2011.–  P.Chiang,UC‐Davis,Feb2011.

–  P.Chiang,Stanford,Feb2011.–  P.Chiang,Intel,Mar2011.–  P.Chiang,UC‐SanDiego,Jan2011.

–  P.Chiang,Broadcom,Dec2010.–  P.Chiang,Illinois,Oct2010.

–  P.Chiang,Michigan,Oct2010.–  P.Chiang,USC,Oct2010.