Chip Architectures: Design Rationale By Joe Peric

Rationale

Performance: We don’t want slow computers
Cost: They have to be affordable
Money: How much cash to design and test?
Time: How much time do we have?
Target Market: Who will want it?
Competition: What do we do to beat the competition, and what are they doing?

Problems in Delivering Solutions

Clock frequency
Core efficiency
Processor complexity
Power and heat
Old mindsets in new times
Communicating with the rest of the computer
How do we build parallel computers?

Clock frequency

For a given core design, increasing clock frequency will always decrease latency and increase throughput.
It is the simplest way to increase processor performance
Essentially the “heartbeat” of a processor
How it is viewed by different parties
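
The effect of clock frequency on core-bound work can be sketched with the standard relation time = instructions x CPI / frequency. This relation is textbook material rather than something stated on the slide, and the instruction count, CPI, and clock values below are made-up numbers for illustration only. The sketch ignores memory and I/O stalls, which do not speed up with the core clock.

```python
def execution_time_s(instructions: int, cycles_per_instruction: float, clock_hz: float) -> float:
    """Core-bound execution time in seconds: instructions * CPI / f."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return instructions * cycles_per_instruction / clock_hz


# Hypothetical workload: one billion instructions at 1.5 cycles each.
base = execution_time_s(1_000_000_000, 1.5, 2.0e9)  # 2 GHz clock
fast = execution_time_s(1_000_000_000, 1.5, 3.0e9)  # 3 GHz clock, same design

print(f"2 GHz: {base:.2f} s, 3 GHz: {fast:.2f} s, speedup: {base / fast:.2f}x")
```

With the core design (and therefore CPI) held fixed, the speedup equals the ratio of the two clock frequencies.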

Core Efficiency

More core efficiency = less wastefulness
Increasing core efficiency leads to more complex cores
The increasing complexity could potentially make the cores less efficient
“Useful” execution per unit time is a decent measure of efficiency

Complexity

CPUs were originally CISC, and were complicated
Hardware was expensive
RISC was born to make hardware cheap by decreasing complexity
Today’s situation: CISC/RISC hybrid with astronomical complexity

Old CPU

And now…

Complexity

Complexity makes it harder to design, build, and test new chips.
Steeper learning curves
Larger CPUs
Slower, more expensive, longer time to reach market
Things have always been moving this way but may change in the near future

CISC

Complex instruction set computing
CPU can handle a large range of instructions
New instructions are added over time to new chip designs when compilers demand them
Chips got expensive
Many instructions weren’t used often, but were still implemented in the chips.
Chips got larger, and more complex

RISC

Reduced instruction set computing
CISC costs increased with complexity
RISC was to be the cheap hardware alternative
Idea took Amdahl’s law to heart
Increased compiler costs
RISC computers started to become more CISC-like over time
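
Amdahl's law, which the RISC slide refers to, can be written as a small function. The sketch below uses the standard formula; the two scenarios (a rarely used instruction versus the common case) and their percentages are invented for illustration and do not come from the slides.

```python
def amdahl_speedup(fraction_improved: float, improvement_factor: float) -> float:
    """Overall speedup when a fraction of run time is made improvement_factor times faster."""
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if improvement_factor <= 0:
        raise ValueError("improvement_factor must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / improvement_factor)


# Hypothetical: a complex instruction that accounts for 2% of run time, made 10x faster.
rare_case = amdahl_speedup(0.02, 10.0)
# Hypothetical: simple instructions that account for 80% of run time, made 1.5x faster.
common_case = amdahl_speedup(0.80, 1.5)

print(f"rare case: {rare_case:.3f}x, common case: {common_case:.3f}x")
```

Speeding up the common case pays off far more than speeding up a rare one, which is the argument for spending hardware on a small set of frequently used instructions.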

VLIW

Very long instruction word
Designed to remedy the problem of RISC mutating into CISC
Puts extreme emphasis on compiler development and scheduling of code
VLIW executes “words” made up of multiple RISC-like instructions
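
To make the idea of a "word" built from several RISC-like instructions concrete, here is a minimal sketch of the kind of packing a VLIW compiler has to do. It is a deliberately simple in-order packer, not the scheduler of any real compiler: the `Op` type, the register names, and the bundle width of 3 are all assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    """A hypothetical RISC-like operation: one destination register, some sources."""
    name: str
    dest: str
    srcs: Tuple[str, ...]


def pack_bundles(ops: List[Op], width: int = 3) -> List[List[Op]]:
    """Pack ops, in program order, into bundles of at most `width` independent ops."""
    bundles: List[List[Op]] = []
    current: List[Op] = []
    for op in ops:
        written = {o.dest for o in current}
        read = {s for o in current for s in o.srcs}
        # Conservative check: no read-after-write, write-after-write,
        # or write-after-read hazard inside one bundle.
        conflict = bool(written & set(op.srcs)) or op.dest in written or op.dest in read
        if current and (len(current) >= width or conflict):
            bundles.append(current)
            current = []
        current.append(op)
    if current:
        bundles.append(current)
    return bundles


program = [
    Op("ld_a", "r1", ("r10",)),
    Op("ld_b", "r2", ("r11",)),
    Op("add", "r3", ("r1", "r2")),   # needs both loads, so it starts a new bundle
    Op("mul", "r4", ("r10", "r11")),
    Op("sub", "r5", ("r3", "r4")),   # needs add and mul, so it starts a new bundle
]

bundle_names = [[op.name for op in bundle] for bundle in pack_bundles(program)]
print(bundle_names)
```

All of the dependency analysis happens before the program runs, which is why the approach leans so heavily on the compiler.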

VLIW

VLIW didn’t catch on much
Software development too long and slow
Instruction level parallelism is already achieved through current hardware
Doing ILP in software just makes software more expensive

VLIW Examples

Intel’s Itanium
New instruction set (IA-64)
Relied heavily on compiler optimization for an ISA that hadn’t existed before
Huge investment and startup costs
Itanium chips were/are expensive and provide little benefit over standard, well established solutions

VLIW Examples

Transmeta Crusoe
Had own, brand new native ISA
Had code-morphing software that read machine code and transformed it into the Crusoe ISA code in real-time
Code-morphing made this machine ISA independent, but very slow

VLIW Examples

Power and Heat

Not an issue ten years ago compared to today
CPUs draw much more power than ever
They produce more heat than ever
How to give CPU power
How to cool CPU
Power to cool CPU
Power to cool cooling

Power and Heat

Thermal density
http://bwrc.eecs.berkeley.edu/classes/icdesign/ee241_s05/Lectures/Lecture20_Thermal.pdf

Power and Heat

Cost to wire more pins to deliver more power
Most pins dedicated to this

Power and Heat

What is the problem with heat?
How to remove it
How to prevent it to begin with?
Clock frequency, manufacturing processes
Other “tricks” to decrease power usage and heat output
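
Why clock frequency and the manufacturing process matter for heat can be seen from the usual first-order estimate of CMOS switching power, P = a x C x V^2 x f. That formula is a standard approximation and is not taken from the slides; it ignores leakage, and the activity factor, capacitance, voltages, and clocks below are placeholder values chosen only to show the trend.

```python
def dynamic_power_w(activity: float, capacitance_f: float, voltage_v: float, clock_hz: float) -> float:
    """First-order switching power: activity * C * V^2 * f (leakage is ignored)."""
    return activity * capacitance_f * voltage_v ** 2 * clock_hz


# Hypothetical chip: same activity and switched capacitance in both cases.
baseline = dynamic_power_w(0.2, 1.0e-7, 1.4, 3.0e9)
# A lower clock usually allows a lower supply voltage as well.
scaled = dynamic_power_w(0.2, 1.0e-7, 1.2, 2.4e9)

ratio = scaled / baseline
print(f"baseline: {baseline:.1f} W, scaled: {scaled:.1f} W, ratio: {ratio:.2f}")
```

Because voltage enters squared, a 20% lower clock combined with a modest voltage drop removes roughly 40% of the switching power in this example.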

Engineer Mentality

Loving an idea
Market is the plaintiff, judge, jury and executioner. Many fail to realize this
Brand new ideas are ok but tend to fail
Old ideas were ok but are outdated and inapplicable
Slightly new but not radical ideas are best

Engineer Mentality

What has the market led us to?
x86 architecture
26 years of backwards compatibility in a slow, inefficient monster that stole ideas from most other design philosophies
Started in original IBM PC; now it’s everywhere
Radical new designs keep losing to this beast
Engineer mentality must disappointingly be limited by the real world and the market

Solutions that worked in the past

Increase clock frequency
Increase width
Make memory faster, more integrated, more available
OoO execution
Prediction
Deeper, wider execution pipelines

Problems with old solutions

Each of the past few years has seen diminishing returns from microarchitectural enhancements
ILP and clock increases add engineering complexity and provide less and less benefit each year
Heat and power
Cost
Time to market

The rest of the computer

Processor is useless if it is outside the socket
How does it communicate with other CPUs, memory, the rest of the computer?
Many different approaches by different companies as to how it is done

Xeon’s bus

Essentially same design as the original Pentium’s bus
Has increased in frequency and bus width
A few electrical improvements
Huge standardized technology; Intel is huge so it cannot change its base technologies so fast
Congested design

Xeon’s bus

Intel’s solution to multi-CPU connectivity

Hypertransport

Newer than Intel’s interconnectivity approach
So far more promising
Open for use by anyone in consortium
Tries to simplify interconnections, reduce cost, and spread bandwidth over the system to avoid concentrated congestion
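
The congestion argument can be illustrated with a toy bandwidth model. This is a rough sketch under simplifying assumptions: every CPU on a shared bus is taken to compete equally for it, each CPU in the point-to-point case is given its own memory link, and the bandwidth figures are invented rather than taken from any Xeon or Hypertransport datasheet.

```python
def shared_bus_per_cpu(bus_bandwidth_gbs: float, cpus: int) -> float:
    """Bandwidth each CPU gets when all CPUs contend equally for one shared bus."""
    return bus_bandwidth_gbs / cpus


def dedicated_links_total(link_bandwidth_gbs: float, cpus: int) -> float:
    """Aggregate bandwidth when every CPU brings its own memory link."""
    return link_bandwidth_gbs * cpus


# Hypothetical numbers: a 6.4 GB/s shared bus versus 6.4 GB/s per-CPU links.
for n in (1, 2, 4):
    print(
        f"{n} CPU(s): shared bus {shared_bus_per_cpu(6.4, n):.1f} GB/s each, "
        f"dedicated links {dedicated_links_total(6.4, n):.1f} GB/s in total"
    )
```

On the shared bus the per-CPU share shrinks as sockets are added, while with dedicated links the total grows with the number of CPUs.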

Hypertransport diagram

Power4/5 Interconnect

FBC: Fabric Bus Controller
FBC is like the intercommunication chipset that would be on Xeon servers, but it is in the actual CPU
FBC handles memory transactions and communication between other CPUs via their FBCs

Power4/5 Interconnect

Multiprocessing

Connect multiple computers together with their own CPUs
Beowulf clusters
To some extent the Internet
A LAN
Slow connections between CPUs

Multiprocessing

To remedy the problem of slow, time-consuming communication, processors are put on the same circuit board
Communication much faster between processors and applications
Can be expensive if you want many CPUs to be connected in this fashion

Multiprocessing

4 sockets on one motherboard:

Multiprocessing

If the communication is still too slow and takes too long, just stick the CPUs right next to each other with no obstruction in between
Multi-core solutions
Quick communication between threads
External comms. must accommodate 2 CPUs now
Twice the space for one CPU
If you buy two CPUs, you’re stuck with both

Multiprocessing

The current trend in computer architecture
Single cores have not been doubling in performance every 18 months; in the past 5 years it is closer to every 36 months (or more)
If you want to go faster, don’t wait 18 months to double your speed. Go parallel instead
Design one core, copy & paste
Multiple cores can perform load balancing to help keep cool
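
The gap between the two doubling periods mentioned above compounds quickly. The short calculation below uses only the 18-month and 36-month figures and the 5-year window from the slide; treating performance growth as a clean exponential is a simplifying assumption.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Performance multiple after `months`, assuming steady exponential doubling."""
    return 2.0 ** (months / doubling_period_months)


five_years = 60  # months
expected = growth_factor(five_years, 18)  # doubling every 18 months, about 10x
observed = growth_factor(five_years, 36)  # doubling every 36 months, about 3.2x

print(f"18-month doubling: {expected:.1f}x, 36-month doubling: {observed:.1f}x")
```

Over five years the slower pace leaves single-core performance at roughly a third of what the old trend would have delivered.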

Multiprocessing

Dual Core
High-Speed bus on CPU itself
Can talk to each other’s cache memory much more quickly

Multiprocessing

SMT: Simultaneous multithreading
Allows each CPU core to execute multiple tasks/threads simultaneously instead of sequentially
“Hyperthreading” is Intel’s implementation of SMT on their CPUs
Threads communicate not through a high-speed direct interconnect, but to each other directly

Multiprocessing

SMT increases CPU efficiency
One CPU pretending to be two CPUs is actually effective
Two threads on a single core not as fast as two threads on separate cores
Two threads on one core must fight for / share execution resources
SMT is actually real multitasking

Microprocessors

Intel’s Xeon
AMD’s Opteron
Sun’s Niagara
IBM’s Cell
Intel’s Terascale

Xeon

Intel’s workstation/server CPU
Originally started as Pentium Pro
Lucrative market
Always had weak comms. buses
Add plenty of on-chip memory to alleviate problem
Xeons are Pentiums given features to work as server chips

The Pentium Pro - Only x86 CPU made for servers, then moved to desktop

Xeon Pictures

The current Xeon - Fast CPUs, but on interconnect architecture similar to Pentium Pro

Die photo

Opteron

Introduced in 2003
Designed primarily as a server CPU
Can have up to 4 external communication ports; 3 for Hypertransport, 1 for memory
128 KB L1 and 1,024 KB L2 Cache
106 million transistors

Opteron’s comms.

Opteron die photo

Niagara

New design by Sun
Contains 8 cores on one CPU
Each core can execute 4 threads simultaneously
32 threads simultaneously on one CPU
Very simple cores
Focus on throughput
Very weak performance on single threaded code
High bandwidth within CPU and to memory
No SMP support, meant for small systems

Niagara’s target

Cell

9 cores; one general purpose, 8 “SPEs”
SPE: Synergistic Processing Element
Each SPE has powerful math units
Coined as “supercomputer on a chip”
Used in a few servers/supercomputers
Used in Playstation 3
EIB: One bus connecting 9 cores

Cell die photo

Terascale

Proof-of-Concept design
The chip itself is toy-like with respect to real power and ISA
Each core has a router on it
Allows seamless addition of cores to the CPU
Very cheap to design
Not very effective for performance

Terascale die photo

Terascale

Each core can communicate to immediate neighbour on each of 4 sides
Example chip had 80 cores
Cores not in use decrease power consumption
If area of CPU gets too hot, work done in that area of CPU is passed to other cores
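
The neighbour-to-neighbour layout described above can be sketched as a 2D mesh. The code below is illustrative only: the 8 x 10 grid (giving 80 cores), the row-major core numbering, and the "move work to the coolest neighbour" policy are assumptions for the example, not a description of how the real chip schedules work.

```python
from typing import List, Sequence


def neighbours(core: int, cols: int = 8, rows: int = 10) -> List[int]:
    """Cores directly left, right, above, and below `core` on a cols x rows mesh."""
    x, y = core % cols, core // cols
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [ny * cols + nx for nx, ny in candidates if 0 <= nx < cols and 0 <= ny < rows]


def hops(a: int, b: int, cols: int = 8) -> int:
    """Router-to-router hops between two cores when moving only along rows and columns."""
    ax, ay = a % cols, a // cols
    bx, by = b % cols, b // cols
    return abs(ax - bx) + abs(ay - by)


def coolest_neighbour(core: int, temps: Sequence[float], cols: int = 8, rows: int = 10) -> int:
    """Hypothetical thermal policy: hand work to the neighbouring core with the lowest temperature."""
    return min(neighbours(core, cols, rows), key=lambda n: temps[n])


temps = [50.0] * 80
temps[1] = 40.0  # made-up reading: core 1 is cooler than the rest

print(neighbours(0))                # corner core has only two neighbours
print(hops(0, 79))                  # opposite corners of the mesh
print(coolest_neighbour(0, temps))  # work on core 0 would move to core 1
```

Because every core only talks to its immediate neighbours, adding cores extends the grid without redesigning a central bus.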

The Future

Old methods of improving performance are no longer as fruitful as they used to be
Systems are developing more integration, as fewer chips are needed in a computer to perform the same functions
Parallelism at the instruction level seems to be fully exploited by compilers and hardware
CPUs are dealing with thermal density problems

The Future

Moore’s law is holding and now transistor budgets are becoming more relaxed
Cores have become ridiculously complicated
We are now seeing limits to sequential computing at the hardware level
Single core performance not promising looking into the future
Where does this lead us?

The Future

Parallel is promising for performance
One simple core can be copied many times
Most new designs have parallelism in mind already
Software is taking its sweet time to catch up
Programmers need software to help them parallelize their programs!
OSes need better scheduling and allocation algorithms!

The Future

Most current compute-intensive programs and algorithms can be parallelized
Uses: media processing is embarrassingly parallel and obtains near linear performance increase with more SMT and cores
Making programmers make parallel code could lead to better programs, or programs that scale with better hardware!
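
"Embarrassingly parallel" means each piece of work is independent of the others, so it can be handed to any free core. The sketch below shows that structure with a made-up `brighten` filter applied to tiny lists standing in for video frames. A thread pool is used only to keep the example short and portable; in CPython a process pool would usually be needed to see an actual speedup on CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

Frame = List[int]  # stand-in for one frame of 8-bit pixel values


def brighten(frame: Frame, amount: int = 10) -> Frame:
    """Hypothetical per-frame filter: depends only on its own frame, no shared state."""
    return [min(255, pixel + amount) for pixel in frame]


def process_frames(frames: List[Frame], workers: int = 4) -> List[Frame]:
    """Apply the filter to every frame; the frames can be processed in any order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(brighten, frames))  # map keeps the input order


frames = [[0, 250], [100, 200], [30, 40]]
result = process_frames(frames)
print(result)
```

Since no frame depends on another, the same code scales from one worker to many just by changing `workers`.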
