Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10

ELEC 5200-001/6200-001
Computer Architecture and Design
Fall 2014
Performance of a Computer (Chapter 4)

Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
http://www.eng.auburn.edu/~vagrawal
vagrawal@eng.auburn.edu

What is Performance?

Response time: the time between the start and completion of a task.
Throughput: the total amount of work done in a given time.
Some performance measures:
– MIPS (million instructions per second).
– MFLOPS (million floating point operations per second), also GFLOPS, TFLOPS (10^12), etc.
– SPEC (Standard Performance Evaluation Corporation) benchmarks.
– LINPACK benchmarks, floating point computing, used for supercomputers.
– Synthetic benchmarks.

Small and Large Numbers

Small                        Large
10^-3    milli    m          10^3     kilo     k
10^-6    micro    μ          10^6     mega     M
10^-9    nano     n          10^9     giga     G
10^-12   pico     p          10^12    tera     T
10^-15   femto    f          10^15    peta     P
10^-18   atto                10^18    exa
10^-21   zepto               10^21    zetta
10^-24   yocto               10^24    yotta

Computer Memory Size

Number                           Prefix   bits   bytes
2^10    1,024                    K        Kb     KB
2^20    1,048,576                M        Mb     MB
2^30    1,073,741,824            G        Gb     GB
2^40    1,099,511,627,776        T        Tb     TB

Units for Measuring Performance

Time in seconds (s), microseconds (μs), nanoseconds (ns), or picoseconds (ps).
Clock cycle:
– Period of the hardware clock
– Example: one clock cycle means 1 nanosecond for a 1GHz clock frequency (or 1GHz clock rate)
– CPU time = (CPU clock cycles)/(clock rate)
Cycles per instruction (CPI): average number of clock cycles used to execute a computer instruction.

Components of Performance

CPU time for a program: time (seconds, etc.)
Instruction count: instructions executed by the program
CPI: average number of clock cycles per instruction
Clock cycle time: time period of clock (seconds, etc.)

Time, While You Wait, or Pay For

CPU time is the time taken by CPU to execute the program. It has two components:
– User CPU time is the time to execute the instructions of the program.
– System CPU time is the time used by the operating system to run the program.
Elapsed time (wall clock time) is the time between the start and end of a program.

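Example: Unix “time” Command

90.7u 12.9s 2:39 65%

The fields are user CPU time (90.7 s), system CPU time (12.9 s), elapsed time (2:39 = 159 s), and the percentage of elapsed time spent as CPU time:

(90.7 + 12.9)/159 × 100 = 65%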
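The percentage reported by “time” can be reproduced with a short calculation. This is a minimal sketch; the function name cpu_utilization is an illustrative choice and is not part of the Unix command.

```python
def cpu_utilization(user_s, system_s, elapsed_s):
    """Percentage of elapsed (wall clock) time that was spent as CPU time."""
    return (user_s + system_s) / elapsed_s * 100.0

# Elapsed time 2:39 converted to seconds.
elapsed = 2 * 60 + 39
util = cpu_utilization(90.7, 12.9, elapsed)
print(f"elapsed = {elapsed} s, CPU utilization = {util:.0f}%")
```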

Computing CPU Time

CPU time = Instruction count × CPI × Clock cycle time

         = (Instruction count × CPI)/(Clock rate)

         = (Instructions/Program) × (Clock cycles/Instruction) × (1/Clock rate)
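The CPU time equation translates directly into code. The sketch below is only an illustration: the function name and the sample numbers (one billion instructions, CPI of 2, 1 GHz clock) are assumptions, not values from the lecture.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 1e9 instructions, average CPI of 2.0, 1 GHz clock.
t = cpu_time(instruction_count=1e9, cpi=2.0, clock_rate_hz=1e9)
print(f"CPU time = {t} s")
```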

Comparing Computers C1 and C2

Run the same program on C1 and C2. Suppose both computers execute the same number (N) of instructions:

C1: CPI = 2.0, clock cycle time = 1 ns
    CPU time(C1) = N × 2.0 × 1 = 2.0N ns
C2: CPI = 1.2, clock cycle time = 2 ns
    CPU time(C2) = N × 1.2 × 2 = 2.4N ns

CPU time(C2)/CPU time(C1) = 2.4N/2.0N = 1.2, therefore, C1 is 1.2 times faster than C2.

Result can vary with the choice of program.
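The C1 versus C2 comparison can be checked numerically. In this sketch the instruction count N is an arbitrary placeholder, since it cancels in the ratio; the function name is illustrative.

```python
def cpu_time_ns(n_instructions, cpi, cycle_time_ns):
    """CPU time in nanoseconds = instruction count x CPI x clock cycle time."""
    return n_instructions * cpi * cycle_time_ns

N = 1_000_000  # hypothetical instruction count; the ratio does not depend on it
t_c1 = cpu_time_ns(N, cpi=2.0, cycle_time_ns=1.0)
t_c2 = cpu_time_ns(N, cpi=1.2, cycle_time_ns=2.0)
ratio = t_c2 / t_c1
print(f"C1 is {ratio:.1f} times faster than C2")
```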

Comparing Program Codes I & II

Code size for a program:
– Code I has 5 million instructions
– Code II has 6 million instructions
– Code I is more efficient. Is it?

Suppose a computer has three types of instructions: A, B and C, taking 1, 2 and 3 clock cycles, respectively.

Instr. type    Type A   Type B   Type C
CPI            1        2        3

Code    Instruction count in million
        Type A   Type B   Type C   Total
I       2        1        2        5
II      4        1        1        6

CPU cycles (code I) = 2×1 + 1×2 + 2×3 = 10 million
CPU cycles (code II) = 4×1 + 1×2 + 1×3 = 9 million
Code II is more efficient.

CPI(I) = 10/5 = 2
CPI(II) = 9/6 = 1.5
Code II is more efficient.

Caution: Code size is a misleading indicator of performance.
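The cycle counts and average CPI of the two codes can be recomputed from the instruction mix. The sketch assumes instruction types A, B and C take 1, 2 and 3 cycles, which is the assignment that reproduces the 10 million and 9 million cycle totals; the dictionary and function names are illustrative.

```python
# Assumed cycles per instruction type (consistent with the stated cycle totals).
CPI_BY_TYPE = {"A": 1, "B": 2, "C": 3}

# Instruction counts in millions, per code and per type.
MIX = {
    "I":  {"A": 2, "B": 1, "C": 2},
    "II": {"A": 4, "B": 1, "C": 1},
}

def total_cycles(mix):
    """Total clock cycles (in millions) for an instruction mix."""
    return sum(CPI_BY_TYPE[t] * count for t, count in mix.items())

def average_cpi(mix):
    """Average CPI = total cycles / total instruction count."""
    return total_cycles(mix) / sum(mix.values())

for code, mix in MIX.items():
    print(code, total_cycles(mix), "million cycles, CPI =", average_cpi(mix))
```

Code II executes more instructions but fewer cycles, which is why instruction count alone is a poor indicator.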

Rating of a Computer

MIPS: million instructions per second

MIPS = (Instruction count of a program)/(Execution time × 10^6)

MIPS rating of a computer is relative to a program.
Standard programs for performance rating:
– Synthetic benchmarks
– SPEC benchmarks (Standard Performance Evaluation Corporation)

Synthetic Benchmark Programs

Artificial programs that emulate a large set of typical “real” programs.
Whetstone benchmark – Algol and Fortran.
Dhrystone benchmark – Ada and C.
Disadvantages:
– No clear agreement on what a typical instruction mix should be.
– Benchmarks do not produce a meaningful result.
– Purpose of rating is defeated when compilers are written to optimize the performance rating.

Ada

Lady Augusta Ada Byron, Countess of Lovelace (1815-1852), daughter of Lord Byron (the poet who spent some time in a Swiss jail – in Chillon, not too far from Lausanne...). She was the assistant and patron of Charles Babbage; she wrote programs for his “Analytical Engine.”

[Picture: an original print from its time.]
http://www.cs.kuleuven.ac.be/~dirk/ada-belgium/pictures.html

Misleading Compilers

Consider a computer with a clock rate of 1 GHz.
Two compilers produce the following instruction mixes for a program:

Code from     Instruction count (billions)       CPU clock    CPI     CPU time*    MIPS**
              Type A   Type B   Type C   Total   cycles               (seconds)
Compiler 1    5        1        1        7       10×10^9      1.43    10           700
Compiler 2    10       1        1        12      15×10^9      1.25    15           800

Instruction types – A: 1-cycle, B: 2-cycle, C: 3-cycle

* CPU time = CPU clock cycles/clock rate

** MIPS = (Total instruction count/CPU time) × 10^-6
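The two footnote formulas are enough to recompute the table. The following sketch uses the stated 1 GHz clock and the 1-, 2- and 3-cycle instruction types; the function and variable names are illustrative.

```python
CLOCK_RATE_HZ = 1e9                          # 1 GHz clock
CYCLES_BY_TYPE = {"A": 1, "B": 2, "C": 3}    # cycles per instruction type

def rate_compiler(mix_billions):
    """Return (CPU clock cycles, CPI, CPU time in s, MIPS) for an instruction mix."""
    instructions = sum(mix_billions.values()) * 1e9
    cycles = sum(CYCLES_BY_TYPE[t] * n for t, n in mix_billions.items()) * 1e9
    cpu_time = cycles / CLOCK_RATE_HZ
    mips = instructions / cpu_time * 1e-6
    return cycles, cycles / instructions, cpu_time, mips

compiler1 = rate_compiler({"A": 5, "B": 1, "C": 1})
compiler2 = rate_compiler({"A": 10, "B": 1, "C": 1})
print("Compiler 1:", compiler1)
print("Compiler 2:", compiler2)
```

Compiler 2 earns the higher MIPS rating although its code takes longer to run, which is the sense in which the rating misleads.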

Peak and Relative MIPS Ratings

Peak MIPS:
– Choose an instruction mix to minimize CPI
– The rating can be too high and unrealistic for general programs

Relative MIPS: Use a reference computer system

Relative MIPS = [Time(ref)/Time] × MIPS(ref)

Historically, VAX-11/780, believed to have a 1 MIPS performance, was used as reference.
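As a small illustration of the relative MIPS formula, the sketch below uses a 1 MIPS reference machine (as believed for the VAX-11/780). The run times of 100 s and 4 s are made-up numbers for illustration only.

```python
def relative_mips(time_ref_s, time_s, mips_ref=1.0):
    """Relative MIPS = (Time(ref) / Time) x MIPS(ref)."""
    return time_ref_s / time_s * mips_ref

# Hypothetical run times: 100 s on the reference machine, 4 s on the machine rated.
rating = relative_mips(time_ref_s=100.0, time_s=4.0)
print(f"Relative MIPS = {rating}")
```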

Wĕbopēdia on MIPS

Acronym for million instructions per second. An old measure of a computer's speed and power, MIPS measures roughly the number of machine instructions that a computer can execute in one second.

In fact, some people jokingly claim that MIPS really stands for Meaningless Indicator of Performance.

Despite these problems, a MIPS rating can give you a general idea of a computer's speed. The IBM PC/XT computer, for example, is rated at ¼ MIPS, while Pentium-based PCs run at over 100 MIPS.

A 1994 MIPS Rating Chart

Computer                      MIPS     Price    $/MIPS
1975 IBM mainframe            10       $10M     1M
1976 Cray-1                   160      $20M     125K
1979 DEC VAX                  1        $200K    200K
1981 IBM PC                   0.25     $3K      12K
1984 Sun 2                    1        $10K     10K
1994 Pentium PC               66       $3K      46
1995 Sony PCX video game      500      $500     1
1995 Microunity set-top       1,000    $500     0.5

MFLOPS (megaFLOPS)

MFLOPS = (Number of floating-point operations in a program)/(Execution time × 10^6)

Only floating point operations are counted:
– Float, real, double; add, subtract, multiply, divide
MFLOPS rating is relevant in scientific computing. For example, programs like a compiler will measure almost 0 MFLOPS.
Sometimes misleading due to different implementations. For example, a computer that does not have a floating-point divide will register many FLOPS for a division.

Supercomputer Performance

[Figure: supercomputer performance; scale labels Megaflops, Gigaflops, Teraflops, Petaflops, Exaflops]

Top Supercomputers, June 2012
www.top500.org

Rank   Name                  Location     Cores        Clock GHz   Max. Pflops   Power MW   Eff. Pflops/MJoule
1      Titan/Cray            Oak Ridge    560,640      2.2         27.11         8.21       3.30
2      Sequoia IBM           USA          1,572,864    1.6         16.30         7.89       2.07
3      K computer Fujitsu    Japan        795,024      2.0         10.50         12.66      0.83
4      Mira IBM              USA          786,432      1.6         8.16          3.95       2.07
5      SuperMUC IBM          Germany      147,456      2.7         2.90          3.52       0.82

N. Leavitt, “Big Iron Moves Toward Exascale Computing,” Computer, vol. 45, no. 11, pp. 14-17, Nov. 2012.

The Future

Erik P. DeBenedictis of Sandia National Laboratories theorizes that a zettaflops (10^21, one sextillion FLOPS) computer is required to accomplish full weather modeling, which could cover a two-week time span accurately. Such systems might be built around 2030.

http://en.wikipedia.org/wiki/Supercomputer

Performance

Performance is measured for a given program or a set of n programs:

Av. execution time = (1/n) Σ Execution time (program i)        (arithmetic mean)

Av. execution time = [ ∏ Execution time (program i) ]^(1/n)    (geometric mean)

Performance is inverse of execution time:

Performance = 1/(Execution time)

Geometric vs. Arithmetic Mean

Reference computer times of n programs: r1, . . . , rn
Times of n programs on the computer under evaluation: T1, . . . , Tn
Normalized times: T1/r1, . . . , Tn/rn

Geometric mean = {(T1/r1) . . . (Tn/rn)}^(1/n)
               = {T1 . . . Tn}^(1/n) / {r1 . . . rn}^(1/n)        (Used)

Arithmetic mean = {(T1/r1) + . . . + (Tn/rn)}/n
                ≠ [{T1 + . . . + Tn}/n] / [{r1 + . . . + rn}/n]    (Not used)

J. E. Smith, “Characterizing Computer Performance with a Single Number,” Comm. ACM, vol. 31, no. 10, pp. 1202-1206, Oct. 1988.
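The property that makes the geometric mean the one that is used can be demonstrated numerically: the geometric mean of normalized times equals the ratio of the geometric means, while the arithmetic mean has no such property. The run times below are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))

def arithmetic_mean(values):
    return sum(values) / len(values)

# Hypothetical run times (seconds) for two programs.
T = [2.0, 8.0]    # computer under evaluation
r = [1.0, 16.0]   # reference computer
normalized = [t / ref for t, ref in zip(T, r)]

gm_of_ratios = geometric_mean(normalized)
ratio_of_gms = geometric_mean(T) / geometric_mean(r)
am_of_ratios = arithmetic_mean(normalized)
ratio_of_ams = arithmetic_mean(T) / arithmetic_mean(r)

print("geometric: ", gm_of_ratios, ratio_of_gms)    # the two agree
print("arithmetic:", am_of_ratios, ratio_of_ams)    # the two differ
```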

SPEC Benchmarks

Standard Performance Evaluation Corporation (SPEC)

SPEC89
– 10 programs
– SPEC performance ratio relative to VAX-11/780
– One program, matrix300, dropped because compilers could be engineered to improve its performance.
– www.spec.org

SPEC89 Performance Ratio for IBM Powerstation 550

[Figure: SPEC89 performance ratios; series: compiler, enhanced compiler]

SPEC95 Benchmarks

Eight integer and ten floating point programs, SPECint95 and SPECfp95.
Each program run time is normalized with respect to the run time of Sun SPARCstation 10/40 – the ratio is called SPEC ratio.
SPECint95 and SPECfp95 summary measurements are the geometric means of SPEC ratios.

SPEC CPU2000 Benchmarks

Twelve integer and 14 floating point programs, CINT2000 and CFP2000.
Each program run time is normalized to obtain a SPEC ratio with respect to the run time on Sun Ultra 5_10 with a 300MHz processor.
CINT2000 and CFP2000 summary measurements are the geometric means of SPEC ratios.

Reference CPU: Sun Ultra 5_10 300MHz Processor

[Figure: two panels, CINT2000 and CFP2000]

CINT2000: 3.4GHz Pentium 4, HT Technology (D850MD Motherboard)

[Figure: base ratio and opt. ratio for each benchmark program]

SPECint2000_base = 1341
SPECint2000 = 1389

Source: www.spec.org

Two Benchmark Results

Baseline: A uniform configuration not optimized for specific program:
– Same compiler with same settings and flags used for all benchmarks
– Other restrictions
Peak: Run is optimized for obtaining the peak performance for each benchmark program.

CINT2000: 1.7GHz Pentium 4 (D850MD Motherboard)

[Figure: base ratio and opt. ratio for each benchmark program; ratio axis from 0 to 900]

SPECint2000_base = 579
SPECint2000 = 588

CFP2000: 1.7GHz Pentium 4 (D850MD Motherboard)

[Figure: base ratio and opt. ratio for each benchmark program]

SPECfp2000_base = 648
SPECfp2000 = 659

Additional SPEC Benchmarks

SPECweb99: measures the performance of a computer in a networked environment.
Energy efficiency mode: Besides the execution time, energy efficiency of SPEC benchmark programs is also measured. Energy efficiency of a benchmark program is given by:

Energy efficiency = [1/(Execution time)]/(Power in watts)
                  = Program units/joule

Energy Efficiency

Efficiency averaged on n benchmark programs:

Efficiency = ( ∏ Efficiency_i )^(1/n), product taken over i = 1, . . . , n

where Efficiency_i is the efficiency for program i.

Relative efficiency:

Relative efficiency = (Efficiency of a computer)/(Eff. of reference computer)
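The three efficiency definitions (per program, averaged, and relative) can be chained as in the sketch below. All run times, power figures and the reference efficiency are hypothetical values for illustration; the function names are likewise assumptions.

```python
import math

def energy_efficiency(execution_time_s, power_w):
    """Energy efficiency = (1 / execution time) / power, in program units per joule."""
    return (1.0 / execution_time_s) / power_w

def mean_efficiency(efficiencies):
    """Geometric mean of the per-program efficiencies."""
    return math.prod(efficiencies) ** (1.0 / len(efficiencies))

def relative_efficiency(efficiency, reference_efficiency):
    return efficiency / reference_efficiency

# Hypothetical (execution time in s, power in W) for two benchmark programs.
measurements = [(2.0, 5.0), (5.0, 8.0)]
per_program = [energy_efficiency(t, p) for t, p in measurements]
overall = mean_efficiency(per_program)

# Hypothetical averaged efficiency of a reference computer.
reference = 0.025
print(per_program, overall, relative_efficiency(overall, reference))
```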

SPEC2000 Relative Energy Efficiency

[Figure: relative energy efficiency of Pentium M@1.6/0.6GHz (energy-efficient processor), Pentium 4-M@2.4GHz (reference), and Pentium III-M@1.2GHz; modes: always max. clock, laptop adaptive clock, min. power min. clock]

Energy and Time Perspectives

Clock cycle is the unit of computing work.
Cycle rate, f cycles per second:
– f is the rate of doing computing work
– Hardware speed, similar to mph for a car
Cycle efficiency, η cycles per joule:
– η is the computing work per energy unit
– Hardware efficiency, similar to mpg for a car
Results from recent work:
– A. Shinde, “Managing Performance and Efficiency of a Processor,” MEE Project Report, Auburn Univ., Dec. 2012.
– A. Shinde and V. D. Agrawal, “Managing Performance and Efficiency of a Processor,” Proc. 45th IEEE Southeastern Symposium on System Theory, Baylor Univ., TX, March 2013, pp. 59-62.

Energy/Cycle for an 8-bit Adder in 90nm CMOS Technology (PTM)

[Figure: energy per cycle of an 8-bit adder]

K. Kim, “Ultra Low Power CMOS Design,” PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.

Delay of an 8-bit Adder in 90nm CMOS Technology (PTM)

[Figure: delay of an 8-bit adder]

K. Kim, “Ultra Low Power CMOS Design,” PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.

Pentium M Processor

Published data: H. Hanson, K. Rajamani, S. Keckler, F. Rawson, S. Ghiasi, J. Rubio, “Thermal Response to DVFS: Analysis with an Intel Pentium M,” Proc. International Symp. Low Power Electronics and Design, 2007, pp. 219-224.

VDD = 1.2V
Maximum clock rate = 1.8GHz
Critical path delay, td = 1/1.8GHz = 555.56ps
Power consumption = 120W
Energy per cycle, EPC = 120/(1.8GHz) = 66.67nJ
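The critical path delay and energy per cycle follow from the clock rate and power alone, as this short sketch shows (variable names are illustrative):

```python
clock_rate_hz = 1.8e9   # maximum clock rate, 1.8 GHz
power_w = 120.0         # power consumption at that clock rate

critical_path_delay_s = 1.0 / clock_rate_hz     # one clock period
energy_per_cycle_j = power_w / clock_rate_hz    # joules spent in each cycle

print(f"td  = {critical_path_delay_s * 1e12:.2f} ps")
print(f"EPC = {energy_per_cycle_j * 1e9:.2f} nJ")
```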

Cycle Efficiency and Frequency for Pentium M

[Figure: cycle efficiency and frequency for Pentium M]

Example of Power Management

For a program that executes in 1.8 billion clock cycles.

Voltage    Frequency f        Cycle efficiency η    Execution time    Total energy    Power
VDD        (megacycles/s)     (megacycles/joule)    (seconds)         consumed        f/η
1.2 V      1800               15                    1.0               120 Joules      120W
0.6 V      277                70                    6.5               25 Joules       3.96W
200 mV     54.5               660                   33                2.72 Joules     0.083W
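Each row of the power management example follows from the cycle count, f and η: execution time is cycles/f, energy is cycles/η, and power is f/η. The sketch below recomputes the rows from the tabulated f and η values; the function and dictionary names are illustrative, and small differences from the tabulated figures are rounding.

```python
def run_profile(cycles, f_hz, eta_cycles_per_joule):
    """Return (execution time in s, energy in J, power in W) for a program."""
    time_s = cycles / f_hz
    energy_j = cycles / eta_cycles_per_joule
    power_w = f_hz / eta_cycles_per_joule
    return time_s, energy_j, power_w

CYCLES = 1.8e9  # the program executes in 1.8 billion clock cycles

# (frequency in cycles/s, cycle efficiency in cycles/joule) per supply voltage
OPERATING_POINTS = {
    "1.2 V":  (1800e6, 15e6),
    "0.6 V":  (277e6, 70e6),
    "200 mV": (54.5e6, 660e6),
}

for vdd, (f, eta) in OPERATING_POINTS.items():
    t, e, p = run_profile(CYCLES, f, eta)
    print(f"{vdd:>6}: time = {t:.2f} s, energy = {e:.2f} J, power = {p:.3f} W")
```

Lowering the voltage trades execution time for energy: the program runs roughly 33 times longer at 200 mV but uses about 44 times less energy.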

Ways of Improving Performance

Increase clock rate.
Improve processor organization for lower CPI:
– Pipelining
– Instruction-level parallelism (ILP): MIMD (Scalar)
– Data-parallelism: SIMD (Vector)
– Multiprocessing
Compiler enhancements that lower the instruction count or generate instructions with lower average CPI (e.g., by using simpler instructions).

Limits of Performance

Execution time of a program on a computer is 100 s:
– 80 s for multiply operations
– 20 s for other operations

Improve multiply n times:

Execution time = (80/n + 20) seconds

Limit: Even if n = ∞, execution time cannot be reduced below 20 s.
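A quick tabulation shows how fast the benefit of a faster multiplier saturates. This is a minimal sketch; the function name and the chosen values of n are illustrative.

```python
def execution_time(n, multiply_s=80.0, other_s=20.0):
    """Execution time in seconds when multiply operations run n times faster."""
    return multiply_s / n + other_s

for n in (1, 2, 4, 10, 100, 1000):
    print(f"n = {n:4d}: execution time = {execution_time(n):6.2f} s")
```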

Amdahl’s Law

The execution time of a system, in general, has two fractions – a fraction f_enh that can be sped up by a factor n, and the remaining fraction 1 – f_enh that cannot be improved. Thus, the possible speedup is:

Speedup = (Old time)/(New time) = 1/(1 – f_enh + f_enh/n)

G. M. Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” Proc. AFIPS Spring Joint Computer Conf., Atlantic City, NJ, April 1967, pp. 483-485.

[Photo: Gene Myron Amdahl, born 1922]
http://en.wikipedia.org/wiki/Gene_Amdahl
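Amdahl's law is a one-line function. The sketch below evaluates it for f_enh = 0.8, which corresponds to the earlier program that spends 80 s of its 100 s in multiply operations; the function name is illustrative.

```python
def amdahl_speedup(f_enh, n):
    """Speedup = 1 / (1 - f_enh + f_enh / n)."""
    return 1.0 / (1.0 - f_enh + f_enh / n)

print(amdahl_speedup(0.8, 4))             # enhanced part made 4 times faster
print(amdahl_speedup(0.8, float("inf")))  # upper bound, 1 / (1 - f_enh)
```

With f_enh = 0.8 the speedup can never exceed 5, no matter how large n becomes.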

Wisconsin Integrally Synchronized Computer (WISC), 1950-51

Parallel Processors: Shared Memory

Parallel Processors
Shared Memory, Infinite Bandwidth

N processors
Single processor: non-memory execution time = α
Memory access time = 1 – α
N processor run time, T(N) = 1 – α + α/N

Speedup = T(1)/T(N) = 1/(1 – α + α/N) = N/[(1 – α)N + α]

Maximum speedup = 1/(1 – α), when N = ∞

Run Time

[Figure: run time T(N) = 1 – α + α/N versus number of processors (N), approaching 1 – α]

Speedup

[Figure: speedup N/[(1 – α)N + α] versus number of processors (N), compared with the ideal speedup N (α = 1)]

Example

10% memory accesses, i.e., α = 0.9

Maximum speedup = 1/(1 – α) = 1.0/0.1 = 10, when N = ∞

What is the speedup with 10 processors?
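The question can be answered by evaluating the speedup formula N/[(1 – α)N + α]. A minimal sketch, with an illustrative function name:

```python
def speedup_infinite_bandwidth(n, alpha):
    """Speedup = N / ((1 - alpha) * N + alpha) for shared memory, infinite bandwidth."""
    return n / ((1.0 - alpha) * n + alpha)

s10 = speedup_infinite_bandwidth(10, alpha=0.9)
print(f"Speedup with 10 processors = {s10:.2f}")
```

Ten processors give a speedup of 10/1.9, a little over half of the limiting value of 10.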

Parallel Processors
Shared Memory, Finite Bandwidth

Memory access time = (1 – α)N
N processor run time, T(N) = (1 – α)N + α/N

Speedup = 1/[(1 – α)N + α/N] = N/[(1 – α)N^2 + α]

Run Time

[Figure: run time T(N) = (1 – α)N + α/N versus number of processors (N), together with the memory access term (1 – α)N]

Minimum Run Time

Minimize N processor run time,

T(N) = (1 – α)N + α/N

∂T(N)/∂N = 0

(1 – α) – α/N^2 = 0, N = [α/(1 – α)]^½

Min. T(N) = 2[α(1 – α)]^½, because ∂^2 T(N)/∂N^2 > 0.

Maximum speedup = 1/T(N) = 0.5[α(1 – α)]^-½

Example: α = 0.9
Maximum speedup = 1.67, when N = 3
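The optimum can be verified by evaluating T(N) at the analytic minimizer and at its neighbors. This is a minimal sketch with illustrative function names.

```python
import math

def run_time(n, alpha):
    """T(N) = (1 - alpha) * N + alpha / N for shared memory with finite bandwidth."""
    return (1.0 - alpha) * n + alpha / n

def optimal_processors(alpha):
    """N that minimizes T(N): sqrt(alpha / (1 - alpha))."""
    return math.sqrt(alpha / (1.0 - alpha))

alpha = 0.9
n_opt = optimal_processors(alpha)
t_min = run_time(round(n_opt), alpha)
print(f"N = {n_opt:.2f}, T(N) = {t_min:.2f}, maximum speedup = {1.0 / t_min:.2f}")

# Neighboring processor counts give longer run times.
print(run_time(2, alpha), run_time(4, alpha))
```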

Speedup

[Figure: speedup N/[(1 – α)N^2 + α] versus number of processors, compared with the ideal speedup N]

Parallel Processors: Distributed Memory

[Figure: processors connected through an interconnection network]

Parallel Processors
Distributed Memory

Memory access time = 1 – α, same as single processor
Communication overhead = β(N – 1)
N processor run time, T(N) = β(N – 1) + 1/N

Speedup = 1/[β(N – 1) + 1/N] = N/[βN(N – 1) + 1]

Minimum Run Time

Minimize N processor run time,

T(N) = β(N – 1) + 1/N

∂T(N)/∂N = 0

β – 1/N^2 = 0, N = β^-½

Min. T(N) = 2β^½ – β, because ∂^2 T(N)/∂N^2 > 0.

Maximum speedup = 1/T(N) = 1/(2β^½ – β)

Example: β = 0.01, Maximum speedup:
N = 10
T(N) = 0.19
Speedup = 5.26
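The distributed-memory example can be reproduced in the same way. A minimal sketch with illustrative function names:

```python
def run_time(n, beta):
    """T(N) = beta * (N - 1) + 1 / N for distributed memory."""
    return beta * (n - 1) + 1.0 / n

def optimal_processors(beta):
    """N that minimizes T(N): beta ** -0.5."""
    return beta ** -0.5

beta = 0.01
n_opt = round(optimal_processors(beta))
t_min = run_time(n_opt, beta)
print(f"N = {n_opt}, T(N) = {t_min:.2f}, speedup = {1.0 / t_min:.2f}")
```

Beyond the optimum, the communication overhead β(N – 1) grows faster than the 1/N term shrinks, so adding processors slows the program down.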

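Run Time

[Figure: run time T(N) = β(N – 1) + 1/N versus number of processors, together with the communication overhead β(N – 1)]

Speedup

[Figure: speedup N/[βN(N – 1) + 1] versus number of processors, compared with the ideal speedup N]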

Further Reading

G. M. Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” Proc. AFIPS Spring Joint Computer Conf., Atlantic City, NJ, Apr. 1967, pp. 483-485.

J. L. Gustafson, “Reevaluating Amdahl’s Law,” Comm. ACM, vol. 31, no. 5, pp. 532-533, May 1988.

M. D. Hill and M. R. Marty, “Amdahl’s Law in the Multicore Era,” Computer, vol. 41, no. 7, pp. 33-38, July 2008.

D. H. Woo and H.-H. S. Lee, “Extending Amdahl’s Law for Energy-Efficient Computing in the Many-Core Era,” Computer, vol. 41, no. 12, pp. 24-31, Dec. 2008.

S. M. Pieper, J. M. Paul and M. J. Schulte, “A New Era of Performance Evaluation,” Computer, vol. 40, no. 9, pp. 23-30, Sep. 2007.

S. Gal-On and M. Levy, “Measuring Multicore Performance,” Computer, vol. 41, no. 11, pp. 99-102, Nov. 2008.

S. Williams, A. Waterman and D. Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures,” Comm. ACM, vol. 52, no. 4, pp. 65-76, Apr. 2009.

U. Vishkin, “Is Multicore Hardware for General-Purpose Parallel Processing Broken?” Comm. ACM, vol. 57, no. 4, pp. 35-39, Apr. 2014.
