Computer Evolution and Performance
ENIAC - background
Electronic Numerical Integrator And Computer
Eckert and Mauchly
University of Pennsylvania
Trajectory tables for weapons
Started 1943
Finished 1946
  Too late for war effort
Used until 1955
ENIAC - details
Decimal (not binary)
20 accumulators of 10 digits
Programmed manually by switches
18,000 vacuum tubes
30 tons
15,000 square feet
140 kW power consumption
5,000 additions per second
von Neumann/Turing
Stored Program concept
Main memory storing programs and data
ALU operating on binary data
Control unit interpreting instructions from memory and executing
Input and output equipment operated by control unit
Princeton Institute for Advanced Studies
  IAS
Completed 1952
Structure of von Neumann machine

IAS - details
1000 x 40 bit words
  Binary number
  2 x 20 bit instructions
Set of registers (storage in CPU)
  Memory Buffer Register
  Memory Address Register
  Instruction Register
  Instruction Buffer Register
  Program Counter
  Accumulator
  Multiplier Quotient
Structure of IAS – detail

IAS - details (cont.)
MBR – contains a word to be stored in memory or sent to the I/O unit, or is used to receive a word from memory or from the I/O unit.
MAR – specifies the address in memory of the word to be written from or read into the MBR.
IR – contains the 8-bit opcode of the instruction being executed.
PC – contains the address of the next instruction.
AC and MQ – temporarily hold operands and results of ALU operations.
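The register roles above can be illustrated with a small sketch of how one 40-bit word might be split into its two 20-bit instructions during a fetch. The 8-bit opcode is taken from the slides. The 12-bit address field (the remaining bits of each instruction) and the helper names `decode`, `split_word` and `fetch` are assumptions made for this illustration, not a description of the actual IAS control logic.

```python
# Sketch of splitting one 40-bit IAS word into two 20-bit instructions.
# The 8-bit opcode comes from the slides; treating the remaining 12 bits of
# each instruction as an address field is an assumption of this sketch.

INSTR_BITS = 20
OPCODE_BITS = 8
ADDR_BITS = INSTR_BITS - OPCODE_BITS  # 12 bits
INSTR_MASK = (1 << INSTR_BITS) - 1
ADDR_MASK = (1 << ADDR_BITS) - 1


def decode(instr):
    """Split a 20-bit instruction into (opcode, address)."""
    return instr >> ADDR_BITS, instr & ADDR_MASK


def split_word(word):
    """Split a 40-bit word into its left and right decoded instructions."""
    if not 0 <= word < (1 << (2 * INSTR_BITS)):
        raise ValueError("word must fit in 40 bits")
    return decode(word >> INSTR_BITS), decode(word & INSTR_MASK)


def fetch(memory, pc):
    """Hypothetical fetch step: load a word and distribute it to registers."""
    mar = pc                 # MAR receives the address held in PC
    mbr = memory[mar]        # MBR receives the word read from memory
    opcode, address = decode(mbr >> INSTR_BITS)
    return {
        "IR": opcode,               # opcode of the left instruction
        "MAR": address,             # address field of the left instruction
        "IBR": mbr & INSTR_MASK,    # right instruction held for later
        "MBR": mbr,
        "PC": pc,
    }
```

In this sketch the right-hand instruction is parked in the IBR so that a second memory access is not needed to execute it.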
Commercial Computers
1947 - Eckert-Mauchly Computer Corporation
UNIVAC I (Universal Automatic Computer)
US Bureau of Census 1950 calculations
Became part of Sperry-Rand Corporation
Late 1950s - UNIVAC II
  Faster
  More memory
IBM
Punched-card processing equipment
1953 - the 701
  IBM’s first stored program computer
  Scientific calculations
1955 - the 702
  Business applications
Led to 700/7000 series
Transistors
Replaced vacuum tubes
Smaller
Cheaper
Less heat dissipation
Solid State device
Made from Silicon (Sand)
Invented 1947 at Bell Labs
William Shockley et al.
Transistor Based Computers
Second generation machines
NCR & RCA produced small transistor machines
IBM 7000
DEC - 1957
  Produced PDP-1
Microelectronics
Literally - “small electronics”
A computer is made up of gates, memory cells and interconnections
These can be manufactured on a semiconductor
  e.g. silicon wafer
Generations of Computer
Vacuum tube - 1946-1957
Transistor - 1958-1964
Small scale integration - 1965 on
  Up to 100 devices on a chip
Medium scale integration - to 1971
  100-3,000 devices on a chip
Large scale integration - 1971-1977
  3,000 - 100,000 devices on a chip
Very large scale integration - 1978-1991
  100,000 - 100,000,000 devices on a chip
Ultra large scale integration - 1991 onward
  Over 100,000,000 devices on a chip
Moore’s Law
Increased density of components on chip
Gordon Moore – co-founder of Intel
Number of transistors on a chip will double every year
Since the 1970s development has slowed a little
  Number of transistors doubles every 18 months
Cost of a chip has remained almost unchanged
Higher packing density means shorter electrical paths, giving higher performance
Smaller size gives increased flexibility
Reduced power and cooling requirements
Fewer interconnections increases reliability
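The doubling rule can be written as a one-line formula: count after t years = initial count x 2^(t / doubling period). The sketch below is a minimal illustration. The function name `projected_transistors` and the starting count of 1,000 in the usage note are hypothetical values chosen for the arithmetic, not data from the slides.

```python
def projected_transistors(initial_count, years, doubling_period_years=1.5):
    """Project a transistor count under a fixed doubling period.

    doubling_period_years=1.5 models the "every 18 months" form of the rule;
    pass 1.0 for the original "every year" form.
    """
    if doubling_period_years <= 0:
        raise ValueError("doubling period must be positive")
    return initial_count * 2 ** (years / doubling_period_years)
```

For example, a hypothetical chip with 1,000 transistors grows by a factor of 4 in three years at an 18-month doubling period, but by a factor of 8 if the count doubles every year.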
Growth in CPU Transistor Count

IBM 360
System 360 is the industry’s first planned family of computers. The models are compatible in the sense that a program written for one model should be capable of being executed by another model in the series (differing only in execution time).
The characteristics of a family are as follows:
Similar instruction set: In many cases, the exact same set of machine instructions is supported on all members of the family. Thus a program that executes on one machine will also execute on any other.
Similar operating system: The same basic operating system is available for all family members.
Increasing speed: The rate of instruction execution increases in going from lower to higher family members.
IBM 360 (cont.)
Increasing number of I/O ports: In going from lower to higher family members.
Increasing memory: In going from lower to higher family members.
Increasing cost: In going from lower to higher family members.
DEC PDP-8
1964
First minicomputer
Did not need air conditioned room
Small enough to sit on a lab bench
Cheap: $16,000, compared with $100k for the IBM 360
Embedded applications & OEM
Uses a bus structure - Omnibus

DEC - PDP-8 Bus Structure
Omnibus consists of 96 separate signal paths, used to carry control, address, and data signals. Because all system components share a common set of signal paths, their use must be controlled by the CPU.
Semiconductor Memory
In the 1950s and 1960s, most computer memory was constructed from tiny rings of ferromagnetic material.
In 1970, Fairchild produced the first relatively capacious semiconductor memory.
  Size of a single core
    i.e. 1 bit of magnetic core storage
  Holds 256 bits of memory
  Non-destructive read
  Much faster than core
Capacity approximately doubles each year
Intel
1971 – Intel developed the 4004
  First microprocessor
  Contains all CPU components on a single chip
  4 bit
Followed in 1972 by the 8008
  8 bit
  4004, 8008 designed for specific applications
1974 - 8080
  Intel’s first general purpose microprocessor
Speeding it up
Pipelining
On board cache
On board L1 & L2 cache
Branch prediction – the processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, are likely to be processed next.
Data flow analysis – the processor analyzes which instructions are dependent on each other’s results, or data, to create an optimized schedule of instructions.
Speculative execution – using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations. This enables the processor to keep its execution engines as busy as possible by executing instructions that are likely to be needed.
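The slides do not say how a processor makes its branch predictions. One common textbook scheme, shown here purely as an illustration, is a 2-bit saturating counter per branch. The class `TwoBitPredictor`, the helper `accuracy`, and the initial "weakly not taken" state are assumptions of this sketch, not a description of any particular Intel design.

```python
class TwoBitPredictor:
    """2-bit saturating counter per branch address (illustrative only).

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    """

    def __init__(self):
        self.counters = {}  # branch address -> state in 0..3

    def predict(self, branch_addr):
        # Unseen branches start in state 1 (weakly not taken), an assumption.
        return self.counters.get(branch_addr, 1) >= 2

    def update(self, branch_addr, taken):
        state = self.counters.get(branch_addr, 1)
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.counters[branch_addr] = state


def accuracy(predictor, outcomes, branch_addr=0):
    """Fraction of correct predictions over a sequence of branch outcomes."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) == taken:
            correct += 1
        predictor.update(branch_addr, taken)
    return correct / len(outcomes)
```

For a loop branch that is taken nine times and then falls through, this predictor is wrong only on the first iteration (while it warms up) and on the final exit.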
Performance Balance
Processor speed increased
Memory capacity increased
Memory speed lags behind processor speed

Logic and Memory Performance Gap
Solutions
Increase the number of bits that are retrieved at one time by making DRAMs “wider” rather than “deeper” and by using wide bus data paths.
Change the DRAM interface to make it more efficient by including a cache or other buffering scheme on the DRAM.
Reduce the frequency of memory access by incorporating increasingly complex and efficient cache structures between the processor and main memory. This includes the incorporation of one or more caches on the processor chip.
Increase the interconnect bandwidth between processors and memory by using higher-speed buses and by using a hierarchy of buses to buffer and structure data flow.
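The benefit of placing a cache between processor and main memory can be estimated with the usual average access time formula: hit time + miss rate x miss penalty. The sketch below is a minimal illustration. The function name and the figures in the usage note are hypothetical and are not taken from the slides.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time for a single cache level.

    hit_time and miss_penalty use the same unit (e.g. ns or cycles);
    miss_rate is a fraction between 0 and 1.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty
```

With an assumed 1 ns cache hit time, a 5% miss rate and a 100 ns penalty for going to main memory, the average comes to roughly 6 ns, far closer to the cache speed than to the DRAM speed.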
I/O Devices
Peripherals with intensive I/O demands
Large data throughput demands
Processors can handle this
The problem is moving the data between processor and peripherals
Solutions:
  Caching
  Buffering
  Higher-speed interconnection buses
  More elaborate bus structures
  Multiple-processor configurations
Typical I/O Device Data Rates

Key is Balance
Processor components
Main memory
I/O devices
Interconnection structures
Improvements in Chip Organization and Architecture
Increase hardware speed of processor
  Fundamentally due to shrinking logic gate size
    More gates, packed more tightly, increasing clock rate
    Propagation time for signals reduced
Increase size and speed of caches
  Dedicating part of processor chip
    Cache access times drop significantly
Change processor organization and architecture
  Increase effective speed of execution
  Parallelism
Problems with Clock Speed and Logic Density
Power
  Power density increases with density of logic and clock speed
  Dissipating heat
RC delay
  Speed at which electrons flow limited by resistance and capacitance of metal wires connecting them
  Delay increases as RC product increases
  Wire interconnects thinner, increasing resistance
  Wires closer together, increasing capacitance
Memory latency
  Memory speeds lag processor speeds
Solution:
  More emphasis on organizational and architectural approaches
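The RC delay point can be made concrete with a first-order model. It assumes a wire of rectangular cross-section with resistance R = resistivity x length / (width x thickness) and a delay proportional to the product R x C. The function names and the unit-less numbers in the usage note are illustrative assumptions, not figures from the slides.

```python
def wire_resistance(resistivity, length, width, thickness):
    """Resistance of a uniform wire with rectangular cross-section."""
    if width <= 0 or thickness <= 0:
        raise ValueError("width and thickness must be positive")
    return resistivity * length / (width * thickness)


def rc_delay(resistance, capacitance):
    """First-order delay estimate: proportional to the RC product."""
    return resistance * capacitance
```

Under this model, halving both the width and the thickness of an interconnect quadruples its resistance, so the delay grows by the same factor even before any increase in capacitance from closer wire spacing is counted.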
Intel Microprocessor Performance

Increased Cache Capacity
Typically two or three levels of cache between processor and main memory
Chip density increased
  More cache memory on chip
    Faster cache access
Pentium chip devoted about 10% of chip area to cache
Pentium 4 devotes about 50%
More Complex Execution Logic
Enable parallel execution of instructions
Pipeline works like assembly line
  Different stages of execution of different instructions at same time along pipeline
Superscalar allows multiple pipelines within single processor
  Instructions that do not depend on one another can be executed in parallel
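The assembly-line analogy can be quantified with an idealized cycle count. The sketch below assumes every stage takes exactly one clock cycle and that there are no stalls, hazards or branches. These are simplifying assumptions, and the function names are chosen only for this illustration.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction passes through all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline: fill it once, then complete one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def pipeline_speedup(n_instructions, n_stages):
    """Ratio of unpipelined to pipelined cycle counts."""
    return (unpipelined_cycles(n_instructions, n_stages)
            / pipelined_cycles(n_instructions, n_stages))
```

For 100 instructions on a 5-stage pipeline this gives 104 cycles instead of 500. As the instruction count grows, the speedup approaches the number of stages.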
Diminishing Returns
Internal organization of processors is complex
  Can get a great deal of parallelism
  Further significant increases likely to be relatively modest
Benefits from cache are reaching limit
Increasing clock rate runs into power dissipation problem
  Some fundamental physical limits are being reached
New Approach – Multiple Cores
Multiple processors on single chip
  Large shared cache
Within a processor, increase in performance proportional to square root of increase in complexity
If software can use multiple processors, doubling number of processors almost doubles performance
So, use two simpler processors on the chip rather than one more complex processor
With two processors, larger caches are justified
  Power consumption of memory logic less than processing logic
Example: IBM POWER4
  Two cores based on PowerPC
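The argument for two simpler cores follows from the square-root rule stated above. The sketch below compares spending a fixed complexity budget on one core versus several. The `parallel_efficiency` parameter, which models how well software uses the extra cores, and both function names are assumptions introduced for this illustration.

```python
import math


def single_core_performance(complexity):
    """Square-root rule: performance grows as sqrt(complexity)."""
    return math.sqrt(complexity)


def multicore_performance(n_cores, complexity_per_core, parallel_efficiency=1.0):
    """Aggregate performance of n identical cores.

    parallel_efficiency is a hypothetical factor in (0, 1]; 1.0 means the
    software keeps every core fully busy.
    """
    return n_cores * parallel_efficiency * single_core_performance(complexity_per_core)
```

With a budget of two complexity units, one complex core delivers about 1.41 times the baseline performance, while two baseline cores deliver up to 2 times when the software can use both, and still about 1.8 times at 90% parallel efficiency.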
POWER4 Chip Organization

Pentium Evolution (1)
8080
  first general purpose microprocessor
  8 bit data path
  Used in first personal computer – Altair
8086
  much more powerful
  16 bit
  instruction cache, prefetch few instructions
  8088 (8 bit external bus) used in first IBM PC
80286
  16 Mbyte memory addressable
  up from 1 Mbyte
80386
  32 bit
  Support for multitasking
Pentium Evolution (2)
80486
  sophisticated powerful cache and instruction pipelining
  built in maths co-processor
Pentium
  Superscalar
  Multiple instructions executed in parallel
Pentium Pro
  Increased superscalar organization
  Aggressive register renaming
  branch prediction
  data flow analysis
  speculative execution
Pentium Evolution (3)
Pentium II
  MMX technology
  graphics, video & audio processing
Pentium III
  Additional floating point instructions for 3D graphics
Pentium 4
  Note Arabic rather than Roman numerals
  Further floating point and multimedia enhancements
Itanium
  64 bit
  see chapter 15
Itanium 2
  Hardware enhancements to increase speed
PowerPC
1975, 801 minicomputer project (IBM) RISC
Berkeley RISC I processor
1986, IBM commercial RISC workstation product, RT PC
  Not commercial success
  Many rivals with comparable or better performance
1990, IBM RISC System/6000
  RISC-like superscalar machine
  POWER architecture
IBM alliance with Motorola (68000 microprocessors), and Apple (used 68000 in Macintosh)
Result is PowerPC architecture
  Derived from the POWER architecture
  Superscalar RISC
  Apple Macintosh
  Embedded chip applications
PowerPC Family (1)
601:
  Quickly to market. 32-bit machine
603:
  Low-end desktop and portable
  32-bit
  Comparable performance with 601
  Lower cost and more efficient implementation
604:
  Desktop and low-end servers
  32-bit machine
  Much more advanced superscalar design
  Greater performance
620:
  High-end servers
  64-bit architecture
PowerPC Family (2)
740/750:
  Also known as G3
  Two levels of cache on chip
G4:
  Increases parallelism and internal speed
G5:
  Improvements in parallelism and internal speed
  64-bit organization