
Computer Evolution and Performance

ENIAC - background

Electronic Numerical Integrator And Computer

- Eckert and Mauchly
- University of Pennsylvania
- Trajectory tables for weapons
- Started 1943
- Finished 1946
- Too late for war effort
- Used until 1955

ENIAC - details

- Decimal (not binary)
- 20 accumulators of 10 digits
- Programmed manually by switches
- 18,000 vacuum tubes
- 30 tons
- 15,000 square feet
- 140 kW power consumption
- 5,000 additions per second

von Neumann/Turing

- Stored-program concept
- Main memory storing programs and data
- ALU operating on binary data
- Control unit interpreting instructions from memory and executing them
- Input and output equipment operated by the control unit
- Princeton Institute for Advanced Study (IAS)
- IAS computer completed 1952

Structure of von Neumann Machine

IAS - details

- 1000 x 40-bit words, each holding either:
  - a binary number, or
  - 2 x 20-bit instructions
- Set of registers (storage in CPU):
  - Memory Buffer Register (MBR)
  - Memory Address Register (MAR)
  - Instruction Register (IR)
  - Instruction Buffer Register (IBR)
  - Program Counter (PC)
  - Accumulator (AC)
  - Multiplier Quotient (MQ)
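To make the word layout concrete, here is a minimal Python sketch of packing and unpacking an IAS word. It assumes each 20-bit instruction is the 8-bit opcode held by the IR followed by a 12-bit address field (12 bits can address 4,096 locations, more than the 1,000 words of memory). The helper names `encode`, `split_word` and `decode` are hypothetical, and the opcode values in the example are only illustrative.

```python
WORD_BITS = 40    # one IAS memory word
INSTR_BITS = 20   # two instructions per word
OPCODE_BITS = 8   # opcode field (the part loaded into the IR)
ADDR_BITS = 12    # remaining bits of each instruction: a memory address


def encode(op_left: int, addr_left: int, op_right: int, addr_right: int) -> int:
    """Pack two (opcode, address) instructions into one 40-bit word, left instruction first."""
    for op, addr in ((op_left, addr_left), (op_right, addr_right)):
        if not (0 <= op < (1 << OPCODE_BITS) and 0 <= addr < (1 << ADDR_BITS)):
            raise ValueError("opcode must fit in 8 bits and address in 12 bits")
    left = (op_left << ADDR_BITS) | addr_left
    right = (op_right << ADDR_BITS) | addr_right
    return (left << INSTR_BITS) | right


def split_word(word: int) -> tuple[int, int]:
    """Return the (left, right) 20-bit instructions packed into a 40-bit word."""
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("word must fit in 40 bits")
    mask = (1 << INSTR_BITS) - 1
    return (word >> INSTR_BITS) & mask, word & mask


def decode(instr: int) -> tuple[int, int]:
    """Return (opcode, address) for one 20-bit instruction."""
    return instr >> ADDR_BITS, instr & ((1 << ADDR_BITS) - 1)


# Illustrative values: opcodes 1 and 5 operating on addresses 500 and 501.
word = encode(0b00000001, 500, 0b00000101, 501)
print([decode(i) for i in split_word(word)])  # [(1, 500), (5, 501)]
```

The same 40 bits could instead be read as a single binary number; nothing in the word itself says which interpretation applies, which is why the control unit must only fetch instructions from locations that hold instructions.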

Structure of IAS – detail

IAS registers (cont.)

- MBR – contains a word to be stored in memory or sent to the I/O unit, or is used to receive a word from memory or from the I/O unit.
- MAR – specifies the address in memory of the word to be written from or read into the MBR.
- IR – contains the 8-bit opcode of the instruction being executed.
- IBR – temporarily holds the right-hand instruction from a word in memory.
- PC – contains the address of the next instruction pair to be fetched from memory.
- AC and MQ – temporarily hold operands and results of ALU operations.
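The registers above cooperate in a fetch cycle followed by an execute cycle. Below is a minimal Python sketch of that loop using the same register names, under several simplifying assumptions: only three instructions are modeled and their opcode values are illustrative; a hypothetical HALT (opcode 0) is added only to stop the loop; there are no jumps, so the IBR is never discarded; and AC is an unbounded Python integer rather than a 40-bit signed word. Program and data sit in the same memory, which is the stored-program concept.

```python
MASK20 = (1 << 20) - 1
MASK12 = (1 << 12) - 1

# Opcode values are illustrative assumptions; HALT is hypothetical, added only to end the run.
HALT, LOAD, ADD, STOR = 0b00000000, 0b00000001, 0b00000101, 0b00100001


def instr(op: int, addr: int) -> int:
    """Build a 20-bit instruction: 8-bit opcode, 12-bit address."""
    return (op << 12) | addr


def word(left: int, right: int) -> int:
    """Pack two 20-bit instructions into one 40-bit word."""
    return (left << 20) | right


def run(memory: list[int]) -> int:
    """Fetch-execute loop using IAS register names; returns AC when HALT is reached."""
    pc, ac = 0, 0
    ibr = None  # right-hand instruction waiting to be executed, if any
    while True:
        # Fetch cycle
        if ibr is not None:           # next instruction is already in the IBR
            ir, mar = ibr >> 12, ibr & MASK12
            ibr = None
            pc += 1                   # both halves of this word have now been used
        else:
            mar = pc
            mbr = memory[mar]         # read the whole instruction word
            ibr = mbr & MASK20        # keep the right-hand instruction for later
            left = mbr >> 20
            ir, mar = left >> 12, left & MASK12
        # Execute cycle
        if ir == HALT:
            return ac
        elif ir == LOAD:              # AC <- M(X)
            mbr = memory[mar]
            ac = mbr
        elif ir == ADD:               # AC <- AC + M(X)
            mbr = memory[mar]
            ac = ac + mbr
        elif ir == STOR:              # M(X) <- AC
            mbr = ac
            memory[mar] = mbr
        else:
            raise ValueError(f"unknown opcode {ir:#010b}")


# Program (words 0-1) and data (words 10-12) share one 1000-word memory.
memory = [0] * 1000
memory[0] = word(instr(LOAD, 10), instr(ADD, 11))
memory[1] = word(instr(STOR, 12), instr(HALT, 0))
memory[10], memory[11] = 7, 35
print(run(memory), memory[12])  # 42 42
```

Note how the PC advances only once per word: the left instruction is taken straight from the MBR, while the right one waits in the IBR and is executed without another memory read.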

Commercial Computers

1947 - Eckert-Mauchly Computer Corporation

- UNIVAC I (Universal Automatic Computer)
- US Bureau of the Census 1950 calculations
- Became part of Sperry-Rand Corporation
- Late 1950s - UNIVAC II
  - Faster
  - More memory

IBM

- Punched-card processing equipment
- 1953 - the 701
  - IBM’s first stored-program computer
  - Scientific calculations
- 1955 - the 702
  - Business applications
- Led to the 700/7000 series

Transistors

- Replaced vacuum tubes
- Smaller
- Cheaper
- Less heat dissipation
- Solid-state device
- Made from silicon (sand)
- Invented 1947 at Bell Labs
- William Shockley et al.

Transistor-Based Computers

- Second generation machines
- NCR & RCA produced small transistor machines
- IBM 7000
- DEC - 1957
  - Produced PDP-1

Microelectronics

- Literally “small electronics”
- A computer is made up of gates, memory cells and interconnections
- These can be manufactured on a semiconductor, e.g. a silicon wafer

Generations of Computer

- Vacuum tube - 1946-1957
- Transistor - 1958-1964
- Small scale integration - 1965 on
  - Up to 100 devices on a chip
- Medium scale integration - to 1971
  - 100-3,000 devices on a chip
- Large scale integration - 1971-1977
  - 3,000-100,000 devices on a chip
- Very large scale integration - 1978-1991
  - 100,000-100,000,000 devices on a chip
- Ultra large scale integration - 1991 on
  - Over 100,000,000 devices on a chip

Moore’s Law

- Increased density of components on chip
- Gordon Moore - co-founder of Intel
- Number of transistors on a chip will double every year
- Since the 1970s development has slowed a little
  - Number of transistors doubles every 18 months
- Cost of a chip has remained almost unchanged
- Higher packing density means shorter electrical paths, giving higher performance
- Smaller size gives increased flexibility
- Reduced power and cooling requirements
- Fewer interconnections increase reliability
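The compounding behind these doubling rates is easy to underestimate. A minimal sketch, assuming a clean exponential N(t) = N0 · 2^(t/T) with doubling period T (real chips follow this curve only approximately); the function name `growth_factor` is hypothetical.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """How many times the transistor count grows if it doubles every doubling_period_years."""
    return 2 ** (years / doubling_period_years)


# Moore's original yearly doubling vs. the later 18-month rate, over one decade.
for period in (1.0, 1.5):
    print(f"doubling every {period} years: x{growth_factor(10, period):,.0f} over a decade")
```

Over ten years, yearly doubling gives a factor of 1,024, while 18-month doubling gives roughly 100: the "slowed a little" change in period makes a tenfold difference per decade.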

Growth in CPU Transistor Count

IBM 360

System/360 was the industry’s first planned family of computers. The models are compatible in the sense that a program written for one model should be capable of being executed by another model in the series, differing only in the time it takes to execute.

The characteristics of a family are as follows:

- Similar instruction set: in many cases, the exact same set of machine instructions is supported on all members of the family. Thus a program that executes on one machine will also execute on any other.
- Similar operating system: the same basic operating system is available for all family members.
- Increasing speed: the rate of instruction execution increases in going from lower to higher family members.
- Increasing number of I/O ports: in going from lower to higher family members.
- Increasing memory size: in going from lower to higher family members.
- Increasing cost: in going from lower to higher family members.

DEC PDP-8

- 1964
- First minicomputer
- Did not need an air-conditioned room
- Small enough to sit on a lab bench
- Cheap: $16,000, vs. $100k for an IBM 360
- Embedded applications & OEM
- Used a bus structure: the Omnibus

DEC - PDP-8 Bus Structure

Omnibus consists of 96 separate signal paths, used to carry control, address, and data signals. Because all system components share a common set of signal paths, their use must be controlled by the CPU.

Semiconductor Memory

- In the 1950s and 1960s, most computer memory was constructed from tiny rings of ferromagnetic material.
- In 1970, Fairchild produced the first relatively capacious semiconductor memory:
  - Chip about the size of a single core, i.e. 1 bit of magnetic core storage
  - Holds 256 bits of memory
  - Non-destructive read
  - Much faster than core
- Capacity approximately doubles each year

Intel

- 1971 - Intel developed the 4004
  - First microprocessor
  - Contains all CPU components on a single chip
  - 4-bit
- Followed in 1972 by the 8008
  - 8-bit
  - 4004 and 8008 designed for specific applications
- 1974 - 8080
  - Intel’s first general-purpose microprocessor

Speeding it Up

- Pipelining
- On-board cache
- On-board L1 & L2 cache
- Branch prediction – the processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, are likely to be processed next.
- Data flow analysis – the processor analyzes which instructions are dependent on each other’s results, or data, to create an optimized schedule of instructions.
- Speculative execution – using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations. This enables the processor to keep its execution engines as busy as possible by executing instructions that are likely to be needed.
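The slides do not say how a branch predictor decides. One common textbook scheme, shown here only as an illustration and not as what any particular Intel processor implements, is a table of 2-bit saturating counters indexed by branch address: each outcome nudges the counter toward "taken" or "not taken", so a single surprise does not flip the prediction. The class name, the dictionary-based table and the "weakly not taken" starting state are assumptions of this sketch.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, initial_state: int = 1):  # 1 = weakly not taken (an assumption)
        self.initial_state = initial_state
        self.counters: dict[int, int] = {}  # branch address -> counter state

    def predict(self, branch_addr: int) -> bool:
        return self.counters.get(branch_addr, self.initial_state) >= 2

    def update(self, branch_addr: int, taken: bool) -> None:
        state = self.counters.get(branch_addr, self.initial_state)
        self.counters[branch_addr] = min(state + 1, 3) if taken else max(state - 1, 0)


def accuracy(trace: list[tuple[int, bool]]) -> float:
    """Fraction of correct predictions over a trace of (branch address, taken) outcomes."""
    predictor = TwoBitPredictor()
    correct = 0
    for addr, taken in trace:
        correct += predictor.predict(addr) == taken
        predictor.update(addr, taken)
    return correct / len(trace)


# A loop-closing branch (hypothetical address 0x40) taken 9 times, then falling through; run twice.
loop_trace = ([(0x40, True)] * 9 + [(0x40, False)]) * 2
print(accuracy(loop_trace))  # 0.85
```

The loop exit is mispredicted each time, but because the counter only drops to "weakly taken", the first branch of the next pass through the loop is still predicted correctly. A speculatively executing processor would act on each prediction before the branch resolves and discard the results on a miss.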

Performance Balance

- Processor speed increased
- Memory capacity increased
- Memory speed lags behind processor speed

Logic and Memory Performance Gap

Solutions

- Increase the number of bits that are retrieved at one time by making DRAMs “wider” rather than “deeper” and by using wide bus data paths.
- Change the DRAM interface to make it more efficient by including a cache or other buffering scheme on the DRAM.
- Reduce the frequency of memory access by incorporating increasingly complex and efficient cache structures between the processor and main memory, including one or more caches on the processor chip.
- Increase the interconnect bandwidth between processors and memory by using higher-speed buses and a hierarchy of buses to buffer and structure data flow.
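The cache-based solution works by cutting the average cost of a memory reference. A standard way to quantify this, not given on the slides, is the average memory access time: AMAT = hit time + miss rate × miss penalty. The function names and all timings below are hypothetical illustrative numbers, not measurements of any real system.

```python
def amat(hit_time_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    """Average memory access time for one cache level in front of a slower memory."""
    return hit_time_ns + miss_rate * miss_penalty_ns


def amat_two_level(l1_hit_ns: float, l1_miss_rate: float,
                   l2_hit_ns: float, l2_miss_rate: float, memory_ns: float) -> float:
    """L1 misses go to L2; L2 misses (local miss rate) go on to main memory."""
    return amat(l1_hit_ns, l1_miss_rate, amat(l2_hit_ns, l2_miss_rate, memory_ns))


# Hypothetical numbers: 100 ns DRAM, 1 ns L1 with 5% misses, 10 ns L2 with 20% local misses.
print(amat(1.0, 0.05, 100.0))                        # one level of cache
print(amat_two_level(1.0, 0.05, 10.0, 0.20, 100.0))  # two levels of cache
```

With these assumed figures the average access drops from 100 ns (no cache) to 6 ns with one cache level and 2.5 ns with two, which is why processors keep adding cache levels between the core and main memory.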

I/O Devices

- Peripherals with intensive I/O demands
- Large data throughput demands
- Processors can handle this
- The problem is moving the data
- Solutions:
  - Caching
  - Buffering
  - Higher-speed interconnection buses
  - More elaborate bus structures
  - Multiple-processor configurations

Typical I/O Device Data Rates

Key is Balance

- Processor components
- Main memory
- I/O devices
- Interconnection structures

Improvements in Chip Organization and Architecture

- Increase hardware speed of processor
  - Fundamentally due to shrinking logic gate size
  - More gates, packed more tightly, increasing clock rate
  - Propagation time for signals reduced
- Increase size and speed of caches
  - Dedicating part of processor chip
  - Cache access times drop significantly
- Change processor organization and architecture
  - Increase effective speed of execution
  - Parallelism

Problems with Clock Speed and Logic Density

- Power
  - Power density increases with density of logic and clock speed
  - Dissipating heat
- RC delay
  - Speed at which electrons flow is limited by resistance and capacitance of the metal wires connecting them
  - Delay increases as RC product increases
  - Wire interconnects thinner, increasing resistance
  - Wires closer together, increasing capacitance
- Memory latency
  - Memory speeds lag processor speeds
- Solution:
  - More emphasis on organizational and architectural approaches
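The power problem can be made concrete with the standard first-order model of CMOS switching (dynamic) power, P = α · C · V² · f, where α is the fraction of gates switching per cycle, C the switched capacitance, V the supply voltage and f the clock frequency. This formula is not on the slides, static leakage is ignored, and the chip values below are illustrative assumptions. Packing more gates into the same area raises the switched capacitance per unit area, and raising f raises P linearly, so power density grows with both logic density and clock speed.

```python
def dynamic_power(activity: float, capacitance_f: float, voltage_v: float, frequency_hz: float) -> float:
    """First-order CMOS switching power in watts: P = activity * C * V^2 * f (leakage ignored)."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Illustrative chip: 10% activity, 1 nF switched capacitance, 1.0 V supply, 2 GHz clock.
base = dynamic_power(0.1, 1e-9, 1.0, 2e9)
faster = dynamic_power(0.1, 1e-9, 1.0, 4e9)   # double the clock rate
denser = dynamic_power(0.1, 2e-9, 1.0, 2e9)   # twice the gates in the same die area
print(f"{base:.2f} W, {faster:.2f} W, {denser:.2f} W")
```

Either change doubles the heat that the same die area must dissipate; the V² term is why lowering the supply voltage has historically been the main lever for keeping power in check.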

Intel Microprocessor Performance

Increased Cache Capacity

- Typically two or three levels of cache between processor and main memory
- Chip density increased
  - More cache memory on chip
  - Faster cache access
- Pentium chip devoted about 10% of chip area to cache
- Pentium 4 devotes about 50%

More Complex Execution Logic

- Enable parallel execution of instructions
- Pipeline works like an assembly line
  - Different stages of execution of different instructions at the same time along the pipeline
- Superscalar allows multiple pipelines within a single processor
  - Instructions that do not depend on one another can be executed in parallel
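The rule that independent instructions can run in parallel can be sketched as a data-flow schedule: each instruction is placed one step after the latest instruction whose result it reads. This minimal Python sketch assumes unlimited execution units and assumes register renaming removes write-after-read and write-after-write conflicts, so only true (read-after-write) dependencies count. The `Instr` type, `dataflow_levels` and the register names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    dest: str               # register written
    srcs: tuple[str, ...]   # registers read


def dataflow_levels(program: list[Instr]) -> list[int]:
    """Earliest step for each instruction, considering only read-after-write dependencies."""
    produced_at: dict[str, int] = {}  # register -> step at which its latest value is produced
    levels = []
    for ins in program:
        step = 0
        for reg in ins.srcs:
            if reg in produced_at:
                step = max(step, produced_at[reg] + 1)
        levels.append(step)
        produced_at[ins.dest] = step  # later readers depend on this (renamed) value
    return levels


program = [
    Instr("r1 = load a", "r1", ()),
    Instr("r2 = load b", "r2", ()),
    Instr("r3 = r1 + r2", "r3", ("r1", "r2")),
    Instr("r4 = r1 * 2", "r4", ("r1",)),
    Instr("r5 = r3 - r4", "r5", ("r3", "r4")),
]
by_step = defaultdict(list)
for ins, step in zip(program, dataflow_levels(program)):
    by_step[step].append(ins.text)
for step in sorted(by_step):
    print(step, by_step[step])
```

Here five instructions fit into three steps instead of five. A real superscalar core is further limited by its issue width and number of execution units, which this sketch ignores.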

Diminishing Returns

- Internal organization of processors is complex
  - Can get a great deal of parallelism
  - Further significant increases likely to be relatively modest
- Benefits from cache are reaching limit
- Increasing clock rate runs into power dissipation problem
  - Some fundamental physical limits are being reached

New Approach – Multiple Cores

- Multiple processors on a single chip
  - Large shared cache
- Within a processor, increase in performance is proportional to the square root of the increase in complexity
- If software can use multiple processors, doubling the number of processors almost doubles performance
- So, use two simpler processors on the chip rather than one more complex processor
- With two processors, larger caches are justified
  - Power consumption of memory logic is less than that of processing logic
- Example: IBM POWER4
  - Two cores based on PowerPC
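The case for multiple cores compares a square-root curve with a (nearly) linear one. A minimal sketch, assuming the square-root rule stated above for a single core, and using Amdahl's law (not on the slides) to model how much of the software can actually use extra cores; `parallel_fraction=1.0` reproduces the "almost doubles" best case. Both function names are hypothetical.

```python
import math


def single_core_speedup(complexity_ratio: float) -> float:
    """Rule of thumb from the slide: single-core performance grows with sqrt(complexity)."""
    return math.sqrt(complexity_ratio)


def multicore_speedup(cores: int, parallel_fraction: float = 1.0) -> float:
    """Amdahl's law: only parallel_fraction of the work is spread across the cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


# Spend twice the transistor budget on one bigger core, or on two copies of the original core.
print(f"one 2x-complex core: {single_core_speedup(2):.2f}x")
for p in (1.0, 0.9, 0.5):
    print(f"two simple cores, {p:.0%} parallel: {multicore_speedup(2, p):.2f}x")
```

One core of twice the complexity gives about 1.41x, while two simple cores give 2.00x for fully parallel software and about 1.82x at 90% parallel. At 50% parallel the two-core design (about 1.33x) falls behind the single bigger core, which is why the slide makes the gain conditional on software being able to use multiple processors.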

POWER4 Chip Organization

Pentium Evolution (1)

- 8080
  - First general-purpose microprocessor
  - 8-bit data path
  - Used in the first personal computer – Altair
- 8086
  - Much more powerful
  - 16-bit
  - Instruction cache, prefetches a few instructions
  - 8088 (8-bit external bus) used in the first IBM PC
- 80286
  - 16 Mbyte memory addressable
  - Up from 1 Mbyte
- 80386
  - 32-bit
  - Support for multitasking
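The jump from 1 Mbyte to 16 Mbyte of addressable memory follows directly from the width of the physical address: the 8086 used 20 address bits, the 80286 used 24, and the 80386 used 32. A small sketch of the arithmetic (the function name is hypothetical, and segmentation and virtual memory are ignored):

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 2 ** address_bits


MBYTE = 2 ** 20
for cpu, bits in (("8086", 20), ("80286", 24), ("80386", 32)):
    print(f"{cpu}: {bits}-bit address -> {addressable_bytes(bits) // MBYTE:,} Mbyte")
```

Each extra address bit doubles the reachable memory, so the 80386's 32-bit address reaches 4,096 Mbyte (4 Gbyte).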

Pentium Evolution (2)

- 80486
  - Sophisticated powerful cache and instruction pipelining
  - Built-in maths co-processor
- Pentium
  - Superscalar
  - Multiple instructions executed in parallel
- Pentium Pro
  - Increased superscalar organization
  - Aggressive register renaming
  - Branch prediction
  - Data flow analysis
  - Speculative execution

Pentium Evolution (3)

- Pentium II
  - MMX technology
  - Graphics, video & audio processing
- Pentium III
  - Additional floating-point instructions for 3D graphics
- Pentium 4
  - Note Arabic rather than Roman numerals
  - Further floating-point and multimedia enhancements
- Itanium
  - 64-bit
  - See chapter 15
- Itanium 2
  - Hardware enhancements to increase speed

PowerPC

- 1975, 801 minicomputer project (IBM): RISC
- Berkeley RISC I processor
- 1986, IBM commercial RISC workstation product, the RT PC
  - Not a commercial success
  - Many rivals with comparable or better performance
- 1990, IBM RISC System/6000
  - RISC-like superscalar machine
  - POWER architecture
- IBM alliance with Motorola (68000 microprocessors) and Apple (used the 68000 in the Macintosh)
- Result is the PowerPC architecture
  - Derived from the POWER architecture
  - Superscalar RISC
  - Apple Macintosh
  - Embedded chip applications

PowerPC Family (1)

- 601:
  - Brought quickly to market
  - 32-bit machine
- 603:
  - Low-end desktop and portable
  - 32-bit
  - Comparable performance with 601
  - Lower cost and more efficient implementation
- 604:
  - Desktop and low-end servers
  - 32-bit machine
  - Much more advanced superscalar design
  - Greater performance
- 620:
  - High-end servers
  - 64-bit architecture

PowerPC Family (2)

- 740/750:
  - Also known as G3
  - Two levels of cache on chip
- G4:
  - Increases parallelism and internal speed
- G5:
  - Improvements in parallelism and internal speed
  - 64-bit organization