Rationale
Performance: We don’t want slow computers
Cost: They have to be affordable
Money: How much cash to design and test?
Time: How much time do we have?
Target Market: Who will want it?
Competition: What do we do to beat the competition, and what are they doing?

Problems in Delivering Solutions
Clock frequency
Core efficiency
Processor complexity
Power and heat
Old mindsets in new times
Communicating with the rest of the computer
How do we build parallel computers?

Clock frequency
All else being equal, increasing clock frequency decreases latency and increases throughput.
It is the simplest way to increase processor performance
Essentially the “heartbeat” of a processor
How it is viewed by different parties

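The link between clock frequency, latency, and throughput can be sketched in a few lines of Python. This is a minimal illustration, not a model of any real processor: the cycle counts and frequencies are made-up numbers, and `fixed_wait_s` is a hypothetical parameter standing in for time spent waiting on something outside the core (such as memory), which a faster core clock does not shorten.

```python
def execution_time_s(cycles: int, clock_hz: float, fixed_wait_s: float = 0.0) -> float:
    """Time for one task: core cycles scale with the clock, off-core waits do not."""
    return cycles / clock_hz + fixed_wait_s


def throughput_per_s(cycles_per_task: int, clock_hz: float) -> float:
    """Tasks completed per second when the core is the only bottleneck."""
    return clock_hz / cycles_per_task


if __name__ == "__main__":
    # Illustrative task: 2 million core cycles (an assumed figure).
    for ghz in (1.0, 2.0, 4.0):
        hz = ghz * 1e9
        t = execution_time_s(cycles=2_000_000, clock_hz=hz)
        print(f"{ghz:.1f} GHz: {t * 1e3:.3f} ms per task, "
              f"{throughput_per_s(2_000_000, hz):.0f} tasks/s")
```

Doubling `clock_hz` halves the core-bound part of the time, but any `fixed_wait_s` stays the same, which is why the gain from clock speed alone shrinks as off-core waits dominate.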
Core Efficiency
More core efficiency = less wastefulness
Increasing core efficiency leads to more complex cores
The increasing complexity could potentially make the cores less efficient
“Useful” execution per unit time is a decent measure of efficiency

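One way to put a number on "useful execution per unit time" is sketched below. The split into retired (useful) and squashed (wasted, for example mis-speculated) instructions is an assumption made for this illustration, and all names and counts are hypothetical.

```python
def useful_work_rate(retired: int, squashed: int, cycles: int, clock_hz: float) -> dict:
    """Summarise how much of a core's activity was useful.

    retired  - instructions whose results were actually kept
    squashed - instructions executed and then thrown away
    """
    executed = retired + squashed
    seconds = cycles / clock_hz
    return {
        "useful_per_second": retired / seconds,
        "useful_ipc": retired / cycles,
        "wasted_fraction": squashed / executed if executed else 0.0,
    }


if __name__ == "__main__":
    print(useful_work_rate(retired=800, squashed=200, cycles=1000, clock_hz=1e9))
```

Two cores with the same clock can then be compared on `useful_per_second` rather than on raw instructions executed.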
Complexity
CPUs were originally CISC, and were complicated
Hardware was expensive
RISC was born to make hardware cheap by decreasing complexity
Today’s situation: CISC/RISC hybrid with astronomical complexity

Complexity
Complexity makes it harder to design, build, and test new chips.
Steeper learning curves
Larger CPUs
Slower, more expensive, longer time to reach market
Things have always been moving this way but may change in the near future

CISC
Complex instruction set computing
CPU can handle large range of instructions
New instructions are added over time to new chip designs when compilers demand them
Chips got expensive
Many instructions weren’t used often, but implemented in the chips.
Chips got larger, and more complex

RISC
Reduced instruction set computing
CISC costs increased with complexity
RISC was to be cheap hardware alternative
Idea took Amdahl’s law to heart
Increased compiler costs
RISC computers started to become more CISC-like over time

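Amdahl's law, in its usual form, says that speeding up a part of the work only helps in proportion to how much of the total time that part takes. A minimal Python sketch, with illustrative fractions that are assumptions rather than measurements, shows why it argues for making the common instructions fast instead of adding hardware for rare ones:

```python
def amdahl_speedup(fraction_improved: float, factor: float) -> float:
    """Overall speedup when `fraction_improved` of the run time gets `factor` times faster."""
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / factor)


if __name__ == "__main__":
    # A rarely used complex instruction: 2% of run time, made 10x faster.
    print(f"rare case:   {amdahl_speedup(0.02, 10.0):.3f}x overall")
    # Common simple instructions: 80% of run time, made 2x faster.
    print(f"common case: {amdahl_speedup(0.80, 2.0):.3f}x overall")
```

The rare-case improvement barely moves the total, while the modest common-case improvement gives a much larger overall gain.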
VLIW
Very long instruction word
Designed to remedy the problem of RISC mutating into CISC
Puts extreme emphasis on compiler development and scheduling of code
VLIW executes “words” made up of multiple RISC-like instructions

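To show the kind of scheduling work VLIW pushes onto the compiler, here is a toy Python sketch that packs RISC-like instructions into fixed-width words. It is a simple greedy packer written for illustration, not the algorithm of any real VLIW compiler. It assumes every register is written only once, so only true data dependences matter; the `Instr` type and the sample program are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: tuple = ()


def pack_bundles(program, width):
    """Greedily pack instructions, in program order, into VLIW words.

    Each instruction goes into the earliest word that comes strictly after
    every word producing one of its sources and that still has a free slot.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    bundles = []       # each bundle is a list of Instr issued together
    produced_in = {}   # register name -> index of the word that writes it
    for ins in program:
        earliest = 0
        for src in ins.srcs:
            if src in produced_in:
                earliest = max(earliest, produced_in[src] + 1)
        slot = earliest
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(ins)
        produced_in[ins.dest] = slot
    return bundles


PROGRAM = [
    Instr("load a", "a"),
    Instr("load b", "b"),
    Instr("add c", "c", ("a", "b")),
    Instr("load d", "d"),
    Instr("mul e", "e", ("c", "d")),
]

if __name__ == "__main__":
    for i, bundle in enumerate(pack_bundles(PROGRAM, width=2)):
        print(i, [ins.name for ins in bundle])
```

With a width of 2 the five instructions fit in three words; with a width of 1 the same program needs five. All of that parallelism was found ahead of time in software, which is where the compiler cost comes from.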
VLIW
VLIW didn’t catch on much
Software development too long and slow
Instruction level parallelism is already achieved through current hardware
Doing ILP in software just makes software more expensive

VLIW Examples
Intel’s Itanium
New instruction set (IA-64)
Relied heavily on compiler optimization for an ISA that hadn’t existed before
Huge investment and startup costs
Itanium chips were/are expensive and provide little benefit over standard, well-established solutions

VLIW Examples
Transmeta Crusoe
Had own, brand new native ISA
Had code-morphing software that read machine code and transformed it into the Crusoe ISA code in real-time
Code-morphing made this machine ISA independent, but very slow

Power and Heat
Not an issue ten years ago compared to today
CPUs draw much more power than ever
They produce more heat than ever
How to give CPU power
How to cool CPU
Power to cool CPU
Power to cool cooling

Power and Heat
Thermal density
http://bwrc.eecs.berkeley.edu/classes/icdesign/ee241_s05/Lectures/Lecture20_Thermal.pdf

Power and Heat
Cost to wire more pins to deliver more power
Most pins dedicated to this

Power and Heat
What is the problem with heat?
How to remove it
How to prevent it to begin with?
Clock frequency, manufacturing processes
Other “tricks” to decrease power usage and heat output

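Why clock frequency and supply voltage are the obvious levers can be seen from the widely used first-order approximation for dynamic (switching) power in CMOS logic, P ≈ activity × C × V² × f. That formula is not on the slides; it is brought in here as a standard rule of thumb, and the capacitance, voltage, and frequency values below are illustrative assumptions, not figures for any real chip.

```python
def dynamic_power_w(capacitance_f: float, voltage_v: float,
                    clock_hz: float, activity: float = 1.0) -> float:
    """First-order dynamic power estimate: activity * C * V^2 * f (in watts)."""
    return activity * capacitance_f * voltage_v ** 2 * clock_hz


if __name__ == "__main__":
    base = dynamic_power_w(1e-9, 1.2, 3e9)
    half_clock = dynamic_power_w(1e-9, 1.2, 1.5e9)
    lower_volts = dynamic_power_w(1e-9, 1.08, 3e9)   # 10% lower voltage
    print(f"half the clock:    {half_clock / base:.2f} of baseline power")
    print(f"10% lower voltage: {lower_volts / base:.2f} of baseline power")
```

Power falls linearly with frequency but with the square of voltage, which is why power-saving tricks commonly lower both together.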
Engineer Mentality
Loving an idea
Market is the plaintiff, judge, jury and executioner. Many fail to realize this
Brand new ideas are ok but tend to fail
Old ideas were ok but are outdated and inapplicable
Slightly new but not radical ideas are best

Engineer Mentality
What has the market led us to?
x86 architecture
26 years of backwards compatibility in a slow, inefficient monster that stole ideas from most other design philosophies
Started in original IBM PC; now it’s everywhere
Radical new designs keep losing to this beast
Engineer mentality must disappointingly be limited by the real world and the market

Solutions that worked in the past
Increase clock frequency
Increase width
Make memory faster, more integrated, more available
OoO execution
Prediction
Deeper, wider execution pipelines

Problems with old solutions
Each of the past few years has seen diminishing returns from microarchitectural enhancements
ILP and clock increases add engineering complexity and provide less and less benefit each year
Heat and power
Cost
Time to market

The rest of the computer
Processor is useless if it is outside the socket
How does it communicate with other CPUs, memory, the rest of the computer?
Many different approaches by different companies as to how it is done

Xeon’s bus
Essentially same design as the original Pentium’s bus
Has increased in frequency and bus width
A few electrical improvements
Huge standardized technology; Intel is so large that it cannot change its base technologies quickly
Congested design

Xeon’s bus
Intel’s solution to multi-CPU connectivity

HyperTransport
Newer than Intel’s interconnectivity approach
So far more promising
Open for use by anyone in consortium
Tries to simplify interconnections, reduce cost, and spread bandwidth over the system to avoid concentrated congestion

Power4/5 Interconnect
FBC: Fabric Bus Controller
FBC is like the intercommunication chipset that would be on Xeon servers, but it is in the actual CPU
FBC handles memory transactions and communication with other CPUs via their FBCs

Multiprocessing
Connect multiple computers together with their own CPUs
Beowulf clusters
To some extent the Internet
A LAN
Slow connections between CPUs

Multiprocessing
To remedy the problem of slow, time-consuming communication, processors are put on the same circuit board
Communication much faster between processors and applications
Can be expensive if you want many CPUs to be connected in this fashion

Multiprocessing
If the communication is still too slow and takes too long, just stick the CPUs right next to each other with no obstruction in between
Multi-core solutions
Quick communication between threads
External comms. must accommodate 2 CPUs now
Twice the space for one CPU
If you buy two CPUs, you’re stuck with both

Multiprocessing
The current trend in computer architecture
Single cores have not been doubling in performance every 18 months; in the past 5 years it is closer to every 36 months (or more)
If you want to go faster, don’t wait 18 months to double your speed. Go parallel instead
Design one core, copy & paste
Multiple cores can perform load balancing to help keep cool

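The gap between an 18-month and a 36-month doubling period compounds quickly. A small Python sketch of the arithmetic, using only the periods quoted above and a 5-year window:

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Performance multiple after `months` if performance doubles every period."""
    return 2.0 ** (months / doubling_period_months)


if __name__ == "__main__":
    five_years = 60
    print(f"doubling every 18 months: {growth_factor(five_years, 18):.1f}x in 5 years")
    print(f"doubling every 36 months: {growth_factor(five_years, 36):.1f}x in 5 years")
```

Over five years that is roughly a 10x improvement at the old pace against roughly 3x at the slower one, which is the shortfall that going parallel is meant to make up.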
Multiprocessing
Dual Core
High-speed bus on CPU itself
Can talk to each other’s cache memory much more quickly

Multiprocessing
SMT: Simultaneous multithreading
Allows each CPU core to execute multiple tasks/threads simultaneously instead of sequentially
“Hyperthreading” is Intel’s implementation of SMT on their CPUs
Threads communicate not through a high-speed direct interconnect, but with each other directly inside the same core

Multiprocessing
SMT increases CPU efficiency
One CPU pretending to be two CPUs is actually effective
Two threads on a single core not as fast as two threads on separate cores
Two threads on one core must fight for / share execution resources
SMT is actually real multitasking

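The trade-off above (better use of the core, but contention between threads) can be illustrated with a toy Python model. This is a deliberately crude sketch, not how any real SMT core arbitrates. Each thread is a list of per-cycle demands for issue slots, where 0 means the thread is stalled. Both threads advance in the same cycle only if their combined demand fits the core's width; otherwise they take turns. The thread traces and the width are made-up values.

```python
def run_sequential(a, b):
    """Cycles needed to run thread a to completion, then thread b."""
    return len(a) + len(b)


def run_smt(a, b, width):
    """Cycles needed when both threads share one core of `width` issue slots."""
    if any(step > width for step in a + b):
        raise ValueError("a single step cannot need more slots than the core has")
    ia = ib = cycles = 0
    prefer_a = True  # alternate priority when the threads collide
    while ia < len(a) or ib < len(b):
        cycles += 1
        if ia >= len(a):            # only b is left
            ib += 1
        elif ib >= len(b):          # only a is left
            ia += 1
        elif a[ia] + b[ib] <= width:
            ia += 1                 # both fit in this cycle
            ib += 1
        elif prefer_a:
            ia += 1                 # contention: a wins this cycle
            prefer_a = False
        else:
            ib += 1                 # contention: b wins this cycle
            prefer_a = True
    return cycles


if __name__ == "__main__":
    thread_a = [2, 0, 0, 3, 1]      # zeros model stall cycles
    thread_b = [1, 2, 0, 2, 4]
    print("one after the other:", run_sequential(thread_a, thread_b), "cycles")
    print("sharing one core:   ", run_smt(thread_a, thread_b, width=4), "cycles")
```

In this example the pair finishes in 6 cycles instead of 10 because one thread's stalls are filled by the other, yet the shared run is still longer than the 5 cycles either thread would need with a core to itself.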
Microprocessors
Intel’s Xeon
AMD’s Opteron
Sun’s Niagara
IBM’s Cell
Intel’s Terascale

Xeon
Intel’s workstation/server CPU
Originally started as Pentium Pro
Lucrative market
Always had weak comms. busses
Add plenty of on-chip memory to alleviate problem
Xeons are Pentiums given features to work as server chips

Xeon Pictures
The Pentium Pro - Only x86 CPU made for servers, then moved to desktop
The current Xeon - Fast CPUs, but on an interconnect architecture similar to Pentium Pro

Opteron
Introduced in 2003
Designed primarily as a server CPU
Can have up to 4 external communication ports; 3 for HyperTransport, 1 for memory
128 KB L1 and 1,024 KB L2 Cache
106 million transistors

Niagara
New design by Sun
Contains 8 cores on one CPU
Each core can execute 4 threads simultaneously
32 threads simultaneously on one CPU
Very simple cores
Focus on throughput
Very weak performance on single threaded code
High bandwidth within CPU and to memory
No SMP support, meant for small systems

Cell
9 cores; one general purpose, 8 “SPEs”
SPE: Synergistic Processing Element
Each SPE has powerful math units
Dubbed a “supercomputer on a chip”
Used in a few servers/supercomputers
Used in PlayStation 3
EIB: One bus connecting 9 cores

Terascale
Proof-of-Concept design
The chip itself is toy-like with respect to real power and ISA
Each core has a router on it
Allows seamless addition of cores to the CPU
Very cheap for design
Not very effective for performance

Terascale
Each core can communicate with its immediate neighbour on each of 4 sides
Example chip had 80 cores
Cores not in use decrease power consumption
If an area of the CPU gets too hot, work done in that area is passed to other cores

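The neighbour-to-neighbour layout can be sketched as a grid in which each core is identified by its position. The Python below is an illustration only: the 8 by 10 arrangement of the 80 cores and the row-by-row numbering are assumptions made for the example, and real routing details are not modelled.

```python
def mesh_neighbours(core_id: int, rows: int = 8, cols: int = 10) -> list:
    """Cores directly above, below, left, and right of `core_id` in a rows x cols mesh."""
    if not 0 <= core_id < rows * cols:
        raise ValueError("core_id is outside the mesh")
    r, c = divmod(core_id, cols)
    found = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:   # edge cores have fewer neighbours
            found.append(nr * cols + nc)
    return found


def hops(src: int, dst: int, cols: int = 10) -> int:
    """Minimum number of neighbour-to-neighbour steps between two cores."""
    r1, c1 = divmod(src, cols)
    r2, c2 = divmod(dst, cols)
    return abs(r1 - r2) + abs(c1 - c2)


if __name__ == "__main__":
    print("corner core 0 talks to:", mesh_neighbours(0))
    print("inner core 11 talks to:", mesh_neighbours(11))
    print("hops from core 0 to core 79:", hops(0, 79))
```

Because every core only needs links to its four sides, adding cores means extending the grid rather than redesigning a shared bus, at the price of multi-hop trips between distant cores.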
The Future
Old methods of improving performance are no longer as fruitful as they used to be
Systems are developing more integration, as fewer chips are needed in a computer to perform the same functions
Parallelism at the instruction level seems to be fully exploited by compilers and hardware
CPUs are dealing with thermal density problems

The Future
Moore’s law is holding and now transistor budgets are becoming more relaxed
Cores have become ridiculously complicated
We are now seeing limits to sequential computing at the hardware level
Single core performance not promising looking into the future
Where does this lead us?

The Future
Parallel is promising for performance
One simple core can be copied many times
Most new designs have parallelism in mind already
Software is taking its sweet time to catch up
Programmers need software to help them parallelize their programs!
OSes need better scheduling and allocation algorithms!

The Future
Most current compute-intensive programs and algorithms can be parallelized
Uses: media processing is embarrassingly parallel and obtains near linear performance increase with more SMT and cores
Making programmers make parallel code could lead to better programs, or programs that scale with better hardware!

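"Embarrassingly parallel" means the work splits into pieces that never need to talk to each other. A minimal Python sketch of that structure brightens an image one row at a time, with each row handled independently. The image data and the `brighten_row` operation are hypothetical. Threads are used here only to keep the example short and portable; for CPU-bound pure-Python work, swapping in `ProcessPoolExecutor` from the same module is the usual way to occupy several cores.

```python
from concurrent.futures import ThreadPoolExecutor


def brighten_row(row, amount=40):
    """Process one row of 8-bit pixel values; needs nothing from any other row."""
    return [min(255, pixel + amount) for pixel in row]


def brighten_image(rows, workers=4):
    """Hand independent rows to a pool of workers; map() keeps the row order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(brighten_row, rows))


if __name__ == "__main__":
    image = [[0, 100, 250], [10, 20, 30]]
    print(brighten_image(image, workers=2))
```

Since no row depends on another, the same code works unchanged with 2 workers or 32, which is the property that lets such programs scale with better hardware.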