View
220
Download
2
Category
Preview:
Citation preview
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 11
ELEC 5200-001/6200-001ELEC 5200-001/6200-001Computer Architecture and DesignComputer Architecture and Design
Fall 2014Fall 2014 Performance of a Computer Performance of a Computer
(Chapter 4)(Chapter 4)Vishwani D. AgrawalVishwani D. Agrawal
James J. Danaher ProfessorJames J. Danaher ProfessorDepartment of Electrical and Computer EngineeringDepartment of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849Auburn University, Auburn, AL 36849http://www.eng.auburn.edu/~vagrawal
vagrawal@eng.auburn.edu
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 22
What is Performance?What is Performance?Response time: the time between the start and Response time: the time between the start and completion of a task.completion of a task.
Throughput: the total amount of work done in a Throughput: the total amount of work done in a given time.given time.
Some performance measures:Some performance measures:MIPS (million instructions per second).MIPS (million instructions per second).
MFLOPS (million floating point operations per second), also MFLOPS (million floating point operations per second), also GFLOPS, TFLOPS (10GFLOPS, TFLOPS (101212), etc.), etc.
SPEC (System Performance Evaluation Corporation) SPEC (System Performance Evaluation Corporation) benchmarks.benchmarks.
LINPACK benchmarks, floating point computing, used for LINPACK benchmarks, floating point computing, used for supercomputers.supercomputers.
Synthetic benchmarks.Synthetic benchmarks.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 33
Small and Large NumbersSmall and Large NumbersSmall Large
1010-3-3 millimilli mm 101033 kilokilo kk
1010-6-6 micromicro μμ 101066 megamega MM
1010-9-9 nanonano nn 101099 gigagiga GG
1010-12-12 picopico pp 10101212 teratera TT
1010-15-15 femtofemto ff 10101515 petapeta PP
1010-18-18 attoatto 10101818 exaexa
1010-21-21 zeptozepto 10102121 zettazetta
1010-24-24 yoctoyocto 10102424 yottayotta
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 44
Computer Memory SizeComputer Memory Size
Number bits bytes
221010 1,0241,024 KK KbKb KBKB
222020 1,048,5761,048,576 MM MbMb MBMB
223030 1,073,741,8241,073,741,824 GG GbGb GBGB
224040 1,099,511,627,7761,099,511,627,776 TT TbTb TBTB
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 55
Units for Measuring PerformanceUnits for Measuring PerformanceTime in seconds (s), microseconds (Time in seconds (s), microseconds (μμs), s), nanoseconds (ns), or picoseconds (ps).nanoseconds (ns), or picoseconds (ps).Clock cycleClock cycle
Period of the hardware clockPeriod of the hardware clockExample: one clock cycle means 1 nanosecond for Example: one clock cycle means 1 nanosecond for a 1GHz clock frequency (or 1GHz clock rate)a 1GHz clock frequency (or 1GHz clock rate)
CPU time = (CPU clock cycles)/(clock CPU time = (CPU clock cycles)/(clock rate)rate)
Cycles per instruction (CPI): average Cycles per instruction (CPI): average number of clock cycles used to execute a number of clock cycles used to execute a computer instruction.computer instruction.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 66
Components of PerformanceComponents of PerformanceComponents of Performance
Units
CPU time for a programCPU time for a program Time (seconds, etc.)Time (seconds, etc.)
Instruction countInstruction count Instructions executed by Instructions executed by the programthe program
CPICPI Average number of Average number of clock cycles per clock cycles per instructioninstruction
Clock cycle timeClock cycle time Time period of clock Time period of clock (seconds, etc.)(seconds, etc.)
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 77
Time, While You Wait, or Pay ForTime, While You Wait, or Pay For
CPU timeCPU time is the time taken by CPU to is the time taken by CPU to execute the program. It has two execute the program. It has two components:components:– User CPU time User CPU time is the time to execute the is the time to execute the
instructions of the program.instructions of the program.– System CPU time System CPU time is the time used by the is the time used by the
operating system to run the program.operating system to run the program.
Elapsed time (wall clock time) Elapsed time (wall clock time) is theis the time time between the start and end of a program.between the start and end of a program.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 88
Example: Unix “time” CommandExample: Unix “time” Command90.7u 12.9s 2:39 65%
Use
r C
PU
tim
ein
sec
on
ds
Sys
tem
CP
U t
ime
in s
eco
nd
s
Ela
pse
d t
ime
In m
in:s
ec
CP
U t
ime
as p
erce
nt
of
elap
sed
tim
e
90.7 + 12.9 ─────── × 100 = 65% 159
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 99
Computing CPU TimeComputing CPU Time
CPU time = Instruction count × CPI × Clock cycle time
Instruction count × CPI= ────────────────
Clock rate
Instructions Clock cycles 1 second= ──────── × ───────── × ────────
Program Instruction Clock rate
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 1010
Comparing Computers C1 and C2Comparing Computers C1 and C2
Run the same program on C1 and C2. Suppose both Run the same program on C1 and C2. Suppose both computers execute the same number ( computers execute the same number ( N N ) of instructions:) of instructions:
C1: CPI = 2.0, clock cycle time = 1 nsC1: CPI = 2.0, clock cycle time = 1 ns
CPU time(C1) = CPU time(C1) = NN × 2.0 × 1 = 2.0× 2.0 × 1 = 2.0NN ns nsC2: CPI = 1.2, clock cycle time = 2 nsC2: CPI = 1.2, clock cycle time = 2 ns
CPU time(C2) = CPU time(C2) = NN × 1.2 × 2 = 2.4× 1.2 × 2 = 2.4NN ns ns
CPU time(C2)/CPU time(C1) = 2.4CPU time(C2)/CPU time(C1) = 2.4NN/2.0/2.0NN = 1.2, therefore, = 1.2, therefore, C1C1 is is 1.21.2 times faster than times faster than C2.C2.
Result can vary with the choice of program.Result can vary with the choice of program.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 1111
Comparing Program Codes I & IIComparing Program Codes I & IICode size for a program:Code size for a program:– Code I has 5 million instructionsCode I has 5 million instructions– Code II has 6 million instructionsCode II has 6 million instructions– Code I is more efficient. Code I is more efficient. Is it?Is it?
Suppose a computer has three Suppose a computer has three types of instructions: A, B and C.types of instructions: A, B and C.CPU cycles (code I) = 10 millionCPU cycles (code I) = 10 millionCPU cycles (code II) = 9 millionCPU cycles (code II) = 9 millionCode II is more efficient.Code II is more efficient.
CPI( I ) = 10/5 = 2CPI( I ) = 10/5 = 2CPI( II ) = 9/6 = 1.5CPI( II ) = 9/6 = 1.5Code II is more efficient.Code II is more efficient.
Caution:Caution: Code size is a misleading Code size is a misleading indicator of performance. indicator of performance.
Instr. Type CPI
AA 11
BB 22
CC 33
Code
Instruction count in million
Type A
Type B
Type C
Total
II 22 11 22 55
IIII 44 11 11 66
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 1212
Rating of a ComputerRating of a ComputerMIPS: million instructions per secondMIPS: million instructions per second
Instruction count of a programInstruction count of a programMIPS = ───────────────────MIPS = ───────────────────
Execution time Execution time × 10× 1066
MIPS rating of a computer is relative to a MIPS rating of a computer is relative to a program.program.Standard programs for performance rating:Standard programs for performance rating:
Synthetic benchmarksSynthetic benchmarksSPEC benchmarks (System Performance Evaluation SPEC benchmarks (System Performance Evaluation Corporation)Corporation)
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 1313
Synthetic Benchmark ProgramsSynthetic Benchmark ProgramsArtificial programs that emulate a large set Artificial programs that emulate a large set of typical “real” programs.of typical “real” programs.Whetstone benchmark – Algol and Fortran.Whetstone benchmark – Algol and Fortran.Dhrystone benchmark – Ada and C.Dhrystone benchmark – Ada and C.Disadvantages:Disadvantages:– No clear agreement on what a typical No clear agreement on what a typical
instruction mix should be.instruction mix should be.– Benchmarks do not produce meaningful result.Benchmarks do not produce meaningful result.– Purpose of rating is defeated when compilers Purpose of rating is defeated when compilers
are written to optimize the performance rating.are written to optimize the performance rating.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 1414
AdaAda Lady Augusta Ada Byron, Countess of Lovelace (1815-1852), daughter of Lord Byron (the poet who spent some time in a Swiss jail – in Chillon, not too far from Lausanne...). She was the assistant and patron of Charles Babbage; she wrote programs for his “Analytical Engine.”
An original print from its time.http://www.cs.kuleuven.ac.be/~dirk/ada-belgium/pictures.html
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 1515
Misleading CompilersMisleading CompilersConsider a computer with a clock rate of 1 GHz.Consider a computer with a clock rate of 1 GHz.Two compilers produce the following instruction Two compilers produce the following instruction mixes for a program:mixes for a program:
Code from
Instruction count (billions)
CPU
clock
cycles
CPI
CPU
time*
(seconds)
MIPS**Type
AType
BType
CTotal
Compiler 1Compiler 1 55 11 11 77 1010×10×1099 1.431.43 1010 700700
Compiler 2Compiler 2 1010 11 11 1212 1515×10×1099 1.251.25 1515 800800Instruction types – A: 1-cycle, B: 2-cycle, C: 3-cycle
* CPU time = CPU clock cycles/clock rate
** MIPS = (Total instruction count/CPU time) × 10 – 6
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 1616
Peak and Relative MIPS RatingsPeak and Relative MIPS RatingsPeak MIPSPeak MIPS
Choose an instruction mix to minimize CPIChoose an instruction mix to minimize CPIThe rating can be too high and unrealistic for general programsThe rating can be too high and unrealistic for general programs
Relative MIPS: Use a reference computer systemRelative MIPS: Use a reference computer system
Time(ref)Time(ref)Relative MIPS = Relative MIPS = ────── ────── × MIPS(ref)× MIPS(ref)
TimeTime
Historically, VAX-11/ 780, believed to have aHistorically, VAX-11/ 780, believed to have a11 MIPS performance, was used as reference. MIPS performance, was used as reference.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 1717
Wĕbopēdia on Wĕbopēdia on MIPSMIPS
Acronym for Acronym for mmillion illion iinstructions nstructions pper er ssecondecond. An old . An old measure of a computer's speed and power, MIPS measure of a computer's speed and power, MIPS measures roughly the number of machine measures roughly the number of machine instructions that a computer can execute in one instructions that a computer can execute in one second.second.In fact, some people jokingly claim that MIPS really In fact, some people jokingly claim that MIPS really stands for stands for MMeaningless eaningless IIndicator of ndicator of PPerformance.erformance.Despite these problems, a MIPS rating can give Despite these problems, a MIPS rating can give you a general idea of a computer's speed. The IBM you a general idea of a computer's speed. The IBM PC/XT computer, for example, is rated at PC/XT computer, for example, is rated at ¼ MIPS¼ MIPS, , while Pentium-based PCs run at over while Pentium-based PCs run at over 100 MIPS100 MIPS. .
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 1818
A 1994 MIPS Rating ChartA 1994 MIPS Rating Chart
Computer MIPS Price $/MIPS
1975 IBM mainframe1975 IBM mainframe 1010 $10M$10M 1M1M
1976 Cray-11976 Cray-1 160160 $20M$20M 125K125K
1979 DEC VAX1979 DEC VAX 11 $200K$200K 200K200K
1981 IBM PC1981 IBM PC 0.250.25 $3K$3K 12K12K
1984 Sun 21984 Sun 2 11 $10K$10K 10K10K
1994 Pentium PC1994 Pentium PC 6666 $3K$3K 4646
1995 Sony PCX video game1995 Sony PCX video game 500500 $500$500 11
1995 Microunity set-top1995 Microunity set-top 1,0001,000 $500$500 0.50.5 New
Yor
k T
imes
, Apr
il 20
, 199
4
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 1919
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 2020
MFLOPS (megaFLOPS)MFLOPS (megaFLOPS)
Only floating point operations are counted:Only floating point operations are counted:– Float, real, double; add, subtract, multiply, divideFloat, real, double; add, subtract, multiply, divide
MFLOPS rating is relevant in scientific computing. For MFLOPS rating is relevant in scientific computing. For example, programs like a compiler will measure almost 0 example, programs like a compiler will measure almost 0 MFLOPS.MFLOPS.
Sometimes misleading due to different implementations. Sometimes misleading due to different implementations. For example, a computer that does not have a floating-point For example, a computer that does not have a floating-point divide, will register many FLOPS for a division.divide, will register many FLOPS for a division.
Number of floating-point operations in a programMFLOPS = ─────────────────────────────────
Execution time × 106
Supercomputer PerformanceSupercomputer Performance
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 2121
Gigaflops
Teraflops
Petaflops
Exaflops
http
://e
n.w
ikip
edia
.org
/wik
i/Sup
erco
mpu
ter
Megaflops
Top Supercomputers, June 2012Top Supercomputers, June 2012www.top500.org
Rank Name Location CoresClock GHz
Max. Pflops
Power MW
Eff. Pflops/MJoule
1 Titan/CrayOak
Ridge560,640 2.2 27.11 8.21 3.30
2 Sequoia IBM USA 1,572864 1.6 16.30 7.89 2.07
3K
computerFujitsu Japan
795,024 2.0 10.50 12.66 0.83
4 Mira IBM USA 786,432 1.6 8.16 3.95 2.07
5 SuperMUCIBM
Germany147,456 2.7 2.90 3.52 0.82
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 2222
N. Leavitt, “Big Iron Moves Toward Exascale Computing,” Computer, vol. 45,no. 11, pp. 14-17, Nov. 2012.
The FutureThe Future
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 2323
Erik P. DeBenedictis of Sandia National Laboratories theorizes that a zettaflops (1021) (one sextillion FLOPS) computer is required to accomplish full weather modeling, which could cover a two week time span accurately. Such systems might be built around 2030.
http://en.wikipedia.org/wiki/Supercomputer
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 2424
PerformancePerformance
Performance is measured for a given program or a Performance is measured for a given program or a set of programs:set of programs:
Av. eAv. execution timexecution time = (1/ = (1/nn) ) ΣΣ Execution time Execution time ((program i program i ))
oror
Av. execution time = Av. execution time = [ [ ∏∏ Execution time Execution time ((program i program i )) ]]1/1/nn
Performance is inverse of execution time:Performance is inverse of execution time:
PerformancePerformance = 1/( = 1/(Execution timeExecution time))
i =1
n
i =1
n
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 2525
Geometric vs. Arithmetic MeanGeometric vs. Arithmetic MeanReference computer times of n programs: r1, . . . , rnReference computer times of n programs: r1, . . . , rnTimes of n programs on the computer under evaluation: Times of n programs on the computer under evaluation: T1, . . . , TnT1, . . . , TnNormalized times: T1/r1, . . . , Tn/rnNormalized times: T1/r1, . . . , Tn/rnGeometric meanGeometric mean == {(T1/r1) . . . (Tn/rn)}{(T1/r1) . . . (Tn/rn)}1/n1/n
{T1 . . . Tn}{T1 . . . Tn}1/n1/n
= = UsedUsed{r1 . . . rn}{r1 . . . rn}1/n1/n
Arithmetic meanArithmetic mean = = {(T1/r1)+ . . . +(Tn/rn)}/n{(T1/r1)+ . . . +(Tn/rn)}/n{T1+ . . . +Tn}/n{T1+ . . . +Tn}/n
≠ ≠ Not usedNot used{r1+ . . . +rn}/n{r1+ . . . +rn}/n
J. E. Smith, “Characterizing Computer Performance with a Single J. E. Smith, “Characterizing Computer Performance with a Single Number,” Number,” Comm. ACMComm. ACM, vol. 31, no. 10, pp. 1202-1206, Oct. 1988., vol. 31, no. 10, pp. 1202-1206, Oct. 1988.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 2626
SPEC BenchmarksSPEC BenchmarksSystem Performance Evaluation Corporation System Performance Evaluation Corporation (SPEC)(SPEC)
SPEC89SPEC89– 10 programs10 programs– SPEC performance ratio relative to VAX-11/780SPEC performance ratio relative to VAX-11/780– One program, matrix300, dropped because One program, matrix300, dropped because
compilers could be engineered to improve its compilers could be engineered to improve its performance.performance.
– www.spec.org
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 2727
SPEC89 Performance Ratio forSPEC89 Performance Ratio forIBM Powerstation 550IBM Powerstation 550
0
100
200
300
400
500
600
700
800g
cc
esp
ress
o
spic
e
do
cuc
nas
a7 li
eqn
tott
mat
rix3
00
fpp
pp
tom
catv
compiler
enhanced compiler
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 2828
SPEC95 BenchmarksSPEC95 BenchmarksEight integer and ten floating point Eight integer and ten floating point programs, programs, SPECint95SPECint95 and and SPECfp95SPECfp95..
Each program run time is normalized with Each program run time is normalized with respect to the run time of respect to the run time of Sun Sun SPARCstation 10/40SPARCstation 10/40 – the ratio is called – the ratio is called SPEC ratioSPEC ratio..
SPECint95SPECint95 and and SPECfp95SPECfp95 summary summary measurements are the geometric means of measurements are the geometric means of SPEC ratios.SPEC ratios.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 2929
SPEC CPU2000 BenchmarksSPEC CPU2000 BenchmarksTwelve integer and 14 floating point Twelve integer and 14 floating point programs, programs, CINT2000CINT2000 and and CFP2000CFP2000..
Each program run time is normalized to Each program run time is normalized to obtain a obtain a SPEC ratioSPEC ratio with respect to the run with respect to the run time on time on Sun Ultra 5_10 with a 300MHz Sun Ultra 5_10 with a 300MHz processorprocessor..
CINT2000CINT2000 and and CFP2000CFP2000 summary summary measurements are the geometric means measurements are the geometric means of SPEC ratios.of SPEC ratios.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 3030
Reference CPU: Sun Ultra 5_10 Reference CPU: Sun Ultra 5_10 300MHz Processor300MHz Processor
0
500
1000
1500
2000
2500
3000
3500g
zip
vp
rg
cc
mc
fc
raft
yp
ars
er
eo
np
erl
bm
kg
ap
vo
rte
xb
zip
2tw
olf
wu
pw
ise
sw
imm
gri
da
pp
lum
es
ag
alg
el
art
eq
ua
ke
fac
ere
ca
mm
plu
ca
sfm
a3
ds
ixtr
ac
ka
ps
i
CINT2000
CFP2000
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 3131
CINT2000: 3.4GHz Pentium 4, HT CINT2000: 3.4GHz Pentium 4, HT Technology (D850MD Motherboard)Technology (D850MD Motherboard)
0
500
1000
1500
2000
2500
gzi
p
vpr
gcc
mcf
craf
ty
par
ser
eon
per
lbm
k
gap
vort
ex
bzi
p2
two
lf
Base ratio
Opt. ratio
SPECint2000_base = 1341SPECint2000 = 1389
Source: www.spec.org
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 3232
Two Benchmark ResultsTwo Benchmark Results
Baseline: A uniform configuration not Baseline: A uniform configuration not optimized for specific program:optimized for specific program:
Same compiler with same settings and flags used Same compiler with same settings and flags used for all benchmarksfor all benchmarks
Other restrictionsOther restrictions
Peak: Run is optimized for obtaining the Peak: Run is optimized for obtaining the peak performance for each benchmark peak performance for each benchmark program.program.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 3333
CINT2000: 1.7GHz Pentium 4CINT2000: 1.7GHz Pentium 4(D850MD Motherboard)(D850MD Motherboard)
0100200300400500600700800900
1000
gzi
p
vpr
gcc
mcf
craf
ty
par
ser
eon
per
lbm
k
gap
vort
ex
bzi
p2
two
lf
Base ratio
Opt. ratio
SPECint2000_base = 579SPECint2000 = 588
Source: www.spec.org
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 3434
CFP2000: 1.7GHz Pentium 4 CFP2000: 1.7GHz Pentium 4 (D850MD Motherboard)(D850MD Motherboard)
0
200
400
600
800
1000
1200
1400w
up
wis
esw
im
mg
rid
app
lum
esa
gal
gel art
equ
ake
face
rec
amm
plu
cas
fma3
dsi
xtra
ck
apsi
Base ratio
Opt. ratio
SPECfp2000_base = 648SPECfp2000 = 659
Source: www.spec.org
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 3535
Additional SPEC BenchmarksAdditional SPEC Benchmarks
SPECweb99: measures the performance of a SPECweb99: measures the performance of a computer in a networked environment.computer in a networked environment.
Energy efficiency mode: Besides the execution Energy efficiency mode: Besides the execution time, energy efficiency of SPEC benchmark time, energy efficiency of SPEC benchmark programs is also measured. Energy efficiency of a programs is also measured. Energy efficiency of a benchmark program is given by:benchmark program is given by:
1/(Execution time)1/(Execution time)Energy efficiency Energy efficiency == ────────────────────────
Power in wattsPower in watts
== Program units/jouleProgram units/joule
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 3636
Energy EfficiencyEnergy Efficiency
Efficiency averaged on Efficiency averaged on nn benchmark programs: benchmark programs:
nnEfficiencyEfficiency == (( ΠΠ Efficiency Efficiencyii ))
1/1/nn
i i =1=1
where Efficiencywhere Efficiencyii is the efficiency for program is the efficiency for program ii..
Relative efficiency:Relative efficiency:
Efficiency of a computerEfficiency of a computerRelative efficiency = ─────────────────Relative efficiency = ─────────────────
Eff. of reference Eff. of reference computercomputer
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 3737
SPEC2000 Relative Energy EfficiencySPEC2000 Relative Energy Efficiency
0
1
2
3
4
5
6
SP
EC
INT
20
00
SP
EC
FP
20
00
SP
EC
INT
20
00
SP
EC
FP
20
00
SP
EC
INT
20
00
SP
EC
FP
20
00
Pentium M@1.6/0.6GHz Energy-efficient procesor
Pentium 4-M@2.4GHz (Reference)
Pentium III-M@1.2GHz
Always max. clock
Laptop adaptive clk.
Min. power min. clock
Energy and Time PerspectivesEnergy and Time PerspectivesClock cycle is the unit of computing work.Clock cycle is the unit of computing work.
Cycle rate, f cycles per secondCycle rate, f cycles per secondf is the rate of doing computing workf is the rate of doing computing work
Hardware speed, similar to mph for a carHardware speed, similar to mph for a car
Cycle efficiency, Cycle efficiency, ηη cycles per joule cycles per jouleηη is the computing work per energy unit is the computing work per energy unit
Hardware efficiency, similar to mpg for a carHardware efficiency, similar to mpg for a car
Results from recent work:Results from recent work:– A. Shinde, “Managing Performance and Efficiency of a Processor,” A. Shinde, “Managing Performance and Efficiency of a Processor,”
MEE Project ReportMEE Project Report, Auburn Univ., Dec. 2012., Auburn Univ., Dec. 2012.– A. Shinde and V. D. Agarwal, “Managing Performance and Efficiency A. Shinde and V. D. Agarwal, “Managing Performance and Efficiency
of a Processor,” of a Processor,” Proc. 45th IEEE Southeastern Symposium on Proc. 45th IEEE Southeastern Symposium on System TheorySystem Theory, Baylor Univ., TX, March 2013, pp. 59-62., Baylor Univ., TX, March 2013, pp. 59-62.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 3838
Energy/Cycle for an 8-bit Adder in Energy/Cycle for an 8-bit Adder in 90nm CMOS Technology (PTM)90nm CMOS Technology (PTM)
K. Kim, “Ultra Low Power CMOS Design” K. Kim, “Ultra Low Power CMOS Design” PhD DissertationPhD Dissertation, Auburn , Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.University, Dept. of ECE, Auburn, Alabama, May 2011.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 3939
Delay of an 8-bit Adder in
90nm CMOS Technology (PTM)
K. Kim, “Ultra Low Power CMOS Design” PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 4040
Pentium M ProcessorPentium M Processor
Published data: H. Hanson, K. Rajamani, S. Keckler, F. Published data: H. Hanson, K. Rajamani, S. Keckler, F. Rawson, S. Ghiasi, J. Rubio, “Thermal Response to Rawson, S. Ghiasi, J. Rubio, “Thermal Response to DVFS: Analysis with an Intel Pentium M,” DVFS: Analysis with an Intel Pentium M,” Proc.Proc. International Symp. Low Power Electronics and DesignInternational Symp. Low Power Electronics and Design, , 2007, pp. 219-224.2007, pp. 219-224.
VDD = 1.2VVDD = 1.2V
Maximum clock rate = 1.8GHzMaximum clock rate = 1.8GHz
Critical path delay, td = 1/1.8GHz = 555.56psCritical path delay, td = 1/1.8GHz = 555.56ps
Power consumption = 120WPower consumption = 120W
Energy per cycle, EPC = 120/(1.8GHz) = 66.67nJEnergy per cycle, EPC = 120/(1.8GHz) = 66.67nJ
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 4141
Cycle Efficiency and Frequency for Pentium M
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 104242
Example of Power Management For a program that executes in 1.8 billion clock For a program that executes in 1.8 billion clock
cycles.cycles.
Voltage VDD
Frequency f
MHz
Cycle Efficiency,η
Execution Time
second
Total Energy
Consumed
Powerf/η
1.2 V1800
megacycles/s15
megacycles/joule1.0 120 Joules 120W
0.6 V277
megacycles/s70
megacycles/joule6.5 25 Joules 39.6W
200 mV54.5
megacycles/s660
megacycles/joule33 2.72 Joules 0.083W
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 4343
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 4444
Ways of Improving PerformanceWays of Improving Performance
Increase clock rate.Increase clock rate.
Improve processor organization for lower CPIImprove processor organization for lower CPIPipeliningPipelining
Instruction-level parallelism (ILP): MIMD (Scalar)Instruction-level parallelism (ILP): MIMD (Scalar)
Data-parallelism: SIMD (Vector)Data-parallelism: SIMD (Vector)
multiprocessingmultiprocessing
Compiler enhancements that lower the instruction Compiler enhancements that lower the instruction count or generate instructions with lower average count or generate instructions with lower average CPI (e.g., by using simpler instructions).CPI (e.g., by using simpler instructions).
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 4545
Limits of PerformanceLimits of PerformanceExecution time of a program on a Execution time of a program on a computer is 100 s:computer is 100 s:
80 s for multiply operations80 s for multiply operations
20 s for other operations20 s for other operations
Improve multiply Improve multiply nn times: times: 8080Execution time = (── + 20 ) secondsExecution time = (── + 20 ) seconds nn
Limit: Even if Limit: Even if nn = = ∞∞, execution time cannot , execution time cannot be reduced below 20 s.be reduced below 20 s.
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 4646
Amdahl’s LawAmdahl’s LawThe execution time of a The execution time of a
system, in general, has two system, in general, has two fractions – a fractionfractions – a fraction f fenhenh that that
can be speeded up by factor can be speeded up by factor nn, ,
and the remaining fraction 1 - and the remaining fraction 1 - ffenhenh that cannot be improved. that cannot be improved.
Thus, the possible speedup is:Thus, the possible speedup is:
G. M. Amdahl, “Validity of the G. M. Amdahl, “Validity of the
Single Processor Approach to Single Processor Approach to
Achieving Large-Scale Achieving Large-Scale
Computing Capabilities,” Computing Capabilities,” Proc. Proc.
AFIPS Spring Joint Computer AFIPS Spring Joint Computer
ConfConf., Atlantic City, NJ, April ., Atlantic City, NJ, April
1967, pp. 483-485.1967, pp. 483-485.
Old timeSpeedup = ──────
New time
1 = ──────────
1 – fenh + fenh/n
Gene Myron Amdahl born 1922
http://en.wikipedia.org/wiki/Gene_Amdahl
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 4747
Wisconsin Integrally Synchronized Wisconsin Integrally Synchronized Computer (WISC), 1950-51Computer (WISC), 1950-51
Parallel Processors: Shared MemoryParallel Processors: Shared Memory
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 4848
P P
P P
P P
M
Parallel ProcessorsParallel ProcessorsShared Memory, Infinite BandwidthShared Memory, Infinite Bandwidth
N processorsN processors
Single processor: non-memory execution time = Single processor: non-memory execution time = αα
Memory access time = 1 – Memory access time = 1 – αα
N processor run time, T(N)= 1 – N processor run time, T(N)= 1 – αα + + αα/N/N
T(1) T(1) 11 N N
Speedup = Speedup = ——— = —————— = —————————— = —————— = ———————
T(N)T(N) 1 – 1 – αα + + αα/N/N (1 – (1 – αα)N + )N + αα
Maximum speedup = 1/(1 – Maximum speedup = 1/(1 – αα), when N = ∞), when N = ∞
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 4949
Run TimeRun Time
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 5050
α
1 – α
1 2 3 4 5 6 7
No
rma
lize
d ru
n ti
me
, T(N
)
Number of processors (N)
α/N
T(N) = 1 – α + α/N
SpeedupSpeedup
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 5151
6
5
4
3
2
11 2 3 4 5 6
Sp
eed
up, T
(1)/
T(N
)
Number of processors (N)
Ideal, N(α = 1)
N(1 – α)N + α
ExampleExample
10% memory accesses, i.e., 10% memory accesses, i.e., αα = 0.9 = 0.9
Maximum speedup=Maximum speedup= 1/(1 – a)1/(1 – a)
== 1.0/0.1 = 10, 1.0/0.1 = 10, when N = ∞when N = ∞
What is the speedup with 10 What is the speedup with 10 processors?processors?
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 5252
Parallel ProcessorsParallel ProcessorsShared Memory, Finite BandwidthShared Memory, Finite Bandwidth
N processorsN processors
Single processor: non-memory execution time = Single processor: non-memory execution time = αα
Memory access time = (1 – Memory access time = (1 – αα)N )N
N processor run time, T(N) = (1 – N processor run time, T(N) = (1 – αα)N + )N + αα/N/N
11 NN
Speedup = Speedup = ———————— = ———————— = ——————————————
(1 – (1 – αα)N + )N + αα/N/N (1 – (1 – αα)N)N22 + + αα
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 5353
Run TimeRun Time
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 5454
α
1 – α
1 2 3 4 5 6 7
No
rma
lize
d ru
n ti
me
, T(N
)
Number of processors (N)
α/N
T(N) = (1 – α)N + α/N(1 – α)N
Minimum Run TimeMinimum Run Time
Minimize N processor run time,Minimize N processor run time,
T(N) = (1 – T(N) = (1 – αα)N + )N + αα/N/N
∂∂T(N)/∂N = 0T(N)/∂N = 0
1 – 1 – αα – – αα/N/N22 = 0, N = [ = 0, N = [αα/(1 – /(1 – αα)])]½½
Min. T(N) = 2Min. T(N) = 2[[αα(1 – (1 – αα)])]½½, because , because ∂∂22T(N)/∂NT(N)/∂N22 > 0. > 0.
Maximum speedup = 1/T(N) = 0.5Maximum speedup = 1/T(N) = 0.5[[αα(1 – (1 – αα)])]-½-½
Example: Example: αα = 0.9 = 0.9Maximum speedup = 1.67, when N = 3Maximum speedup = 1.67, when N = 3
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 5555
SpeedupSpeedup
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 5656
6
5
4
3
2
1
1 2 3 4 5 6
Sp
eed
up, T
(1)/
T(N
)
Number of processors (N)
Ideal, N
N(1 – α)N2 + α
Parallel Processors: Distributed MemoryParallel Processors: Distributed Memory
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 5757
P P
P P
P P
M
Inter-connectio
nnetwork
M
M
M
M
M
Parallel ProcessorsParallel ProcessorsDistributed MemoryDistributed Memory
N processorsN processors
Single processor: non-memory execution time = Single processor: non-memory execution time = αα
Memory access time = 1 – Memory access time = 1 – αα, same as single processor, same as single processor
Communication overhead = Communication overhead = ββ(N – 1)(N – 1)
N processor run time, T(N) = N processor run time, T(N) = ββ(N – 1) + 1/N(N – 1) + 1/N
11 N N
Speedup = Speedup = ———————— = ——————————————— = ———————
ββ(N – 1) + 1/N(N – 1) + 1/N ββN(N – 1) + 1N(N – 1) + 1
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 5858
Minimum Run TimeMinimum Run TimeMinimize N processor run time,Minimize N processor run time,
T(N) = T(N) = ββ(N – 1) + 1/N(N – 1) + 1/N
∂∂T(N)/∂N = 0T(N)/∂N = 0
ββ – 1/N – 1/N22 = 0, N = = 0, N = ββ-½-½
Min. T(N) = 2Min. T(N) = 2ββ½½ – – ββ, because , because ∂∂22T(N)/∂NT(N)/∂N22 > 0. > 0.
Maximum speedup = 1/T(N) = 1/(2Maximum speedup = 1/T(N) = 1/(2ββ½½ – – ββ))
Example: Example: ββ = 0.01, Maximum speedup: = 0.01, Maximum speedup:N = 10N = 10
T(N) = 0.19T(N) = 0.19
Speedup = 5.26Speedup = 5.26
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 5959
Run TimeRun Time
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 6060
01 10 20 30
No
rma
lize
d ru
n ti
me
, T(N
)
Number of processors (N)
1/N
T(N) = β(N – 1) + 1/N
β(N – 1)
1
SpeedupSpeedup
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 6161
12
10
8
6
4
2
2 4 6 8 10 12
Sp
eed
up, T
(1)/
T(N
)
Number of processors (N)
Ideal, N
NβN(N – 1) + 1
Fall 2014, Nov 10 . . .Fall 2014, Nov 10 . . . ELEC 5200-001/6200-001 Lecture 10ELEC 5200-001/6200-001 Lecture 10 6262
Further ReadingFurther ReadingG. M. Amdahl, “Validity of the Single Processor Approach to Achieving Large-G. M. Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” Scale Computing Capabilities,” Proc. AFIPS Spring Joint Computer ConfProc. AFIPS Spring Joint Computer Conf., ., Atlantic City, NJ, Apr. 1967, pp. 483-485.Atlantic City, NJ, Apr. 1967, pp. 483-485.
J. L. Gustafson, “Reevaluating Amdahl’s Law,” J. L. Gustafson, “Reevaluating Amdahl’s Law,” Comm. ACMComm. ACM, vol. 31, no. 5, pp. , vol. 31, no. 5, pp. 532-533, May 1988.532-533, May 1988.
M. D. Hill and M. R. Marty, “Amdahl’s Law in the Multicore Era,” M. D. Hill and M. R. Marty, “Amdahl’s Law in the Multicore Era,” ComputerComputer, vol. , vol. 41, no. 7, pp. 33-38, July 2008.41, no. 7, pp. 33-38, July 2008.
D. H. Woo and H.-H. S. Lee, “Extending Amdahl’s Law for Energy-Efficient D. H. Woo and H.-H. S. Lee, “Extending Amdahl’s Law for Energy-Efficient Computing in the Many-Core Era,” Computing in the Many-Core Era,” ComputerComputer, vol. 41, no. 12, pp. 24-31, Dec. , vol. 41, no. 12, pp. 24-31, Dec. 2008.2008.
S. M. Pieper, J. M. Paul and M. J. Schulte, “A New Era of Performance S. M. Pieper, J. M. Paul and M. J. Schulte, “A New Era of Performance Evaluation,” Evaluation,” ComputerComputer, vol. 40, no. 9, pp. 23-30, Sep. 2007., vol. 40, no. 9, pp. 23-30, Sep. 2007.
S. Gal-On and M. Levy, “Measuring Multicore Performance,” S. Gal-On and M. Levy, “Measuring Multicore Performance,” ComputerComputer, vol. 41, , vol. 41, no. 11, pp. 99-102, November 2008.no. 11, pp. 99-102, November 2008.
S. Williams, A. Waterman and D. Patterson, “Roofline: An Insightful Visual S. Williams, A. Waterman and D. Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures,” Performance Model for Multicore Architectures,” Comm. ACMComm. ACM, vol. 52, no. 4, pp. , vol. 52, no. 4, pp. 65-76, Apr. 2009.65-76, Apr. 2009.
U. Vishkin, “Is Multicore Hardware for General-Purpose Parallel Processing U. Vishkin, “Is Multicore Hardware for General-Purpose Parallel Processing Broken?” Broken?” Comm. ACMComm. ACM, vol. 57, no. 4, pp. 35-39, Apr. 2014., vol. 57, no. 4, pp. 35-39, Apr. 2014.
Recommended