56
4521 Digital System Architecture 1 June 11, 2022 Lecture 2a Computer Performance and Cost Pradondet Nilagupta (Based on lecture notes by Prof. Rand y Katz & Prof. Jan M. Rabaey , UC Berkeley)

204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

Embed Size (px)

Citation preview

Page 1: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 1April 20, 2023

Lecture 2a Computer Performance and Cost

Pradondet Nilagupta

(Based on lecture notes by Prof. Randy Katz

& Prof. Jan M. Rabaey , UC Berkeley)

Page 2: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 2April 20, 2023

Review: What is Computer Architecture?

Technology

ApplicationsComputer

Architect

Interfaces

Machine Organization

Measurement &Evaluation

ISA

AP

I

Link

I/O

Cha

n

Regs

IR

Page 3: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 3April 20, 2023

The Architecture Process

New concepts created

EstimateCost &

PerformanceSort

Good ideasMediocre

ideasBad ideas

Page 4: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 4April 20, 2023

Performance Measurement and Evaluation

Many dimensions to computer performance

– CPU execution time by instruction or sequence

– floating point

– integer

– branch performance

– Cache bandwidth

– Main memory bandwidth

– I/O performance bandwidth seeks pixels or polygons per second

Relative importance depends on applications

P

$

M

Page 5: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 5April 20, 2023

Evaluation Tools Benchmarks, traces, & mixes

– macrobenchmarks & suites application execution time

– microbenchmarks measure one aspect of performance

– traces replay recorded accesses

– cache, branch, register Simulation at many levels

– ISA, cycle accurate, RTL, gate, circuit trade fidelity for simulation rate

Area and delay estimation Analysis

– e.g., queuing theory

MOVE 39%BR 20%LOAD 20%STORE 10%ALU 11%

LD 5EA3ST 31FF….LD 1EA2….

Page 6: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 6April 20, 2023

Benchmarks

Microbenchmarks– measure one performance

dimension cache bandwidth main memory bandwidth procedure call overhead FP performance

– weighted combination of microbenchmark performance is a good predictor of application performance

– gives insight into the cause of performance bottlenecks

Macrobenchmarks– application execution time

measures overall performance, but on just one application

Perf. Dimensions

App

licat

ions

Mic

ro

Macro

Page 7: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 7April 20, 2023

Some Warnings about Benchmarks Benchmarks measure the

whole system

– application

– compiler

– operating system

– architecture

– implementation Popular benchmarks typically

reflect yesterday’s programs

– computers need to be designed for tomorrow’s programs

Benchmark timings often very sensitive to

– alignment in cache

– location of data on disk

– values of data Benchmarks can lead to

inbreeding or positive feedback

– if you make an operation fast (slow) it will be used more (less) often

so you make it faster (slower)

– and it gets used even more (less)

• and so on…

Page 8: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 8April 20, 2023

Architectural Performance Laws and Rules of Thumb

Page 9: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 9April 20, 2023

Measurement and Evaluation

Architecture is an iterative process:• Searching the space of possible designs• Make selections• Evaluate the selections made

Good measurement tools are required to accurately evaluate the selection.

Page 10: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 10April 20, 2023

Measurement Tools Benchmarks, Traces, Mixes Cost, delay, area, power estimation Simulation (many levels)

– ISA, RTL, Gate, Circuit

Queuing Theory Rules of Thumb Fundamental Laws

Page 11: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 11April 20, 2023

Measuring and Reporting Performance

What do we mean by one Computer is faster than another?– program runs less time

Response time or execution time– time that users see the output

Throughput – total amount of work done in a given time

Page 12: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 12April 20, 2023

Performance

“Increasing and decreasing” ?????

We use the term “improve performance” or “ improve execution time” When we mean increase performance and decrease execution time .

improve performance = increase performance

improve execution time = decrease execution time

Page 13: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 13April 20, 2023

Measuring Performance

Definition of time

Wall Clock time Response time Elapsed time

– A latency to complete a task including disk accesses, memory accesses, I/O activities, operating system overhead

Page 14: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 14April 20, 2023

Does Anybody Really Know What Time it is?

User CPU Time (Time spent in program) System CPU Time (Time spent in OS) Elapsed Time (Response Time = 159 Sec.) (90.7+12.9)/159 * 100 = 65%, % of lapsed time that is

CPU time. 45% of the time spent in I/O or running other programs

UNIX Time Command: 90.7u 12.9s 2:39 65%

Page 15: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 15April 20, 2023

example

UNIX time command

90.7u 12.95 2:39 65%

user CPU time is 90.7 sec

system CPU time is 12.9 sec

elapsed time is 2 min 39 sec. (159 sec)

% of elapsed time that is CPU time is 90.7 + 12.9 = 65%

159

Page 16: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 16April 20, 2023

Time

CPU time– time the CPU is computing

– not including the time waiting for I/O or running other program

User CPU time– CPU time spent in the program

System CPU time – CPU time spent in the operating system performing task requested

by the program decrease execution time

CPU time = User CPU time + System CPU time

Page 17: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 17April 20, 2023

Performance

System Performance– elapsed time on unloaded system

CPU performance– user CPU time on an unloaded system

Page 18: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 18April 20, 2023

Programs to Evaluate Processor Performance

Real Program Kernel

– Time critical excerpts

Toy benchmark– 10-100 lines of code, Sieve of Erastasthens, Puzzles, Quick Sort

Synthetic benchmark– Attempt to match average frequencies of real workloads

– similar to kernel

– Whetstone, Dhrystone

– not even pieces of real program, but kernel might be.

Page 19: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 19April 20, 2023

Benchmark Suite

collecting of benchmark, try to measure the performance of processor with a variety of application– SPECint92, SPECfp92

– see fig. 1.9

Page 20: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 20April 20, 2023

Benchmarking Games

Differing configurations used to run the same workload on two systems

Compiler wired to optimize the workload Test specification written to be biased towards one

machine Synchronized CPU/IO intensive job sequence used Workload arbitrarily picked Very small benchmarks used Benchmarks manually translated to optimize

performance

Page 21: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 21April 20, 2023

Common Benchmarking Mistakes (I)

Only average behavior represented in test workload Skewness of device demands ignored Loading level controlled inappropriately Caching effects ignored Buffer sizes not appropriate Inaccuracies due to sampling ignored

Page 22: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 22April 20, 2023

Common Benchmarking Mistakes (II)

Ignoring monitoring overhead Not validating measurements Not ensuring same initial conditions Not measuring transient (cold start) performance Using device utilizations for performance comparisons Collecting too much data but doing too little analysis

Page 23: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 23April 20, 2023

SPEC: System Performance Evaluation Cooperative

First Round 1989– 10 programs yielding a single number

Second Round 1992– SpecInt92 (6 integer programs) and SpecFP92 (14 floating point progra

ms)

– Compiler Flags unlimited. March 93 of DEC 4000 Model 610:spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(

b,a,c)”

wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200

nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas

Third Round 1995– Single flag setting for all programs; new set of programs “benchmarks u

seful for 3 years”

Page 24: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 24April 20, 2023

SPEC: System Performance Evaluation Cooperative First Round 1989

– 10 programs yielding a single number (“SPECmarks”)

Second Round 1992– SPECInt92 (6 integer programs) and SPECfp92 (14 floating

point programs) Compiler Flags unlimited. March 93 of DEC 4000 Model

610:

spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=

memcpy(b,a,c)”

wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200

nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas

Page 25: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 25April 20, 2023

SPEC: System Performance Evaluation Cooperative Third Round 1995

– new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)

– “benchmarks useful for 3 years”

– Single flag setting for all programs: SPECint_base95, SPECfp_base95

Page 26: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 26April 20, 2023

How to Summarize Performance Arithmetic mean (weighted arithmetic mean) tracks execution

time: (Ti)/n or (Wi*Ti)

Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n/ (1/Ri) or n/(Wi/Ri)

Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)– Arithmetic mean impacted by choice of reference machine

Use the geometric mean for comparison:(Ti)^1/n– Independent of chosen machine– but not good metric for total execution time

Page 27: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 27April 20, 2023

SPEC First Round One program: 99% of time in single line of code New front-end compiler could improve dramatically

Benchmark

SP

EC

Pe

rf

0

100

200

300

400

500

600

700

800gcc

epresso

spice

doduc

nasa7 li

eqntott

matrix300

fpppp

tomcatv

Page 28: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 28April 20, 2023

Impact of Means on SPECmark89 for IBM 550(without and with special compiler option) Ratio to VAX: Time: Weighted Time:

Program Before After Before After Before Aftergcc 30 29 49 51 8.91 9.22espresso 35 34 65 67 7.64 7.86spice 47 47 510 510 5.69 5.69doduc 46 49 41 38 5.81 5.45nasa7 78 144 258 140 3.43 1.86li 34 34 183 183 7.86 7.86eqntott 40 40 28 28 6.68 6.68matrix30078 730 58 6 3.43 0.37fpppp 90 87 34 35 2.97 3.07tomcatv 33 138 20 19 2.01 1.94Mean 54 72 124 108 54.42 49.99

Geometric Arithmetic Weighted Arith.

Ratio 1.33 Ratio 1.16 Ratio 1.09

Page 29: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 29April 20, 2023

The Bottom Line: Performance (and Cost)

• Time to run the task (ExTime)– Execution time, response time, latency

• Tasks per day, hour, week, sec, ns … (Performance)– Throughput, bandwidth

Plane

Boeing 747

BAD/Sud Concorde

Speed

610 mph

1350 mph

DC to Paris

6.5 hours

3 hours

Passengers

470

132

Throughput (pass/hr)

72

44

Page 30: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 30April 20, 2023

The Bottom Line: Performance (and Cost)

"X is n times faster than Y" means

ExTime(Y) Performance(X)

--------- = --------------- = n

ExTime(X) Performance(Y)

• Performance is the reciprocal of execution time

• Speed of Concorde vs. Boeing 747

• Throughput of Boeing 747 vs. Concorde

Page 31: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 31April 20, 2023

Performance Terminology

“X is n% faster than Y” means:ExTime(Y) Performance(X) n

--------- = -------------- = 1 + -----

ExTime(X) Performance(Y) 100

n = 100(Performance(X) - Performance(Y))

Performance(Y)

n = 100(ExTime(Y) - ExTime(X))

ExTime(X)

Page 32: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 32April 20, 2023

Example

Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

n = 100(ExTime(Y) - ExTime(X))

ExTime(X)

n = 100(15 - 10)

10

n = 50%

Page 33: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 33April 20, 2023

SpeedupSpeedup due to enhancement E: ExTime w/o E Performance w/ E

Speedup(E) = ------------- = -------------------

ExTime w/ E Performance w/o E

Suppose that enhancement E accelerates a fractionenhanced of

the task by a factor Speedupenhanced , and the remainder of

the task is unaffected, then what is

ExTime(E) = ?

Speedup(E) = ?

Page 34: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 34April 20, 2023

Amdahl’s Law

States that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time faster mode can be used

Page 35: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 35April 20, 2023

Amdahl’s Law

ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced

Speedupoverall =ExTimeold

ExTimenew

Speedupenhanced

=

1

(1 - Fractionenhanced) + Fractionenhanced

Speedupenhanced

Page 36: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 36April 20, 2023

Example of Amdahl’s Law

Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

Speedupoverall =

ExTimenew =

Page 37: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 37April 20, 2023

Example of Amdahl’s Law

Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

Speedupoverall = 1

0.95= 1.053

ExTimenew = ExTimeold x (0.9 + .1/2) = 0.95 x ExTimeold

Page 38: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 38April 20, 2023

Corollary: Make The Common Case Fast

All instructions require an instruction fetch, only a fraction require a data fetch/store.

– Optimize instruction access over data access Programs exhibit locality

– 90% of time in 10% of code

– Spatial Locality

– Temporal Locality

Access to small memories is faster

– Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories.

Reg'sCache

Memory Disk / Tape

Page 39: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 39April 20, 2023

RISC Philosophy

The simple case is usually the most frequent and the easiest to optimize!

Do simple, fast things in hardware and be sure the rest can be handled correctly in software

Page 40: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 40April 20, 2023

Metrics of Performance

Compiler

Programming Language

Application

DatapathControl

Transistors Wires Pins

ISA

Function Units

(millions) of Instructions per second: MIPS(millions) of (FP) operations per second: MFLOP/s

Cycles per second (clock rate)

Megabytes per second

Answers per monthOperations per second

Page 41: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 41April 20, 2023

% of Instructions Responsible for 80-90% of Instructions Executed

0

10

20

30

40

50

60

70

80

compress li ear

Page 42: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 42April 20, 2023

Aspects of CPU PerformanceCPU time = Seconds = Instructions x Cycles x

Seconds

(User) Program Program Instruction Cycle

CPU time = Seconds = Instructions x Cycles x Seconds

(User) Program Program Instruction Cycle

Instr. Cnt CPI Clock RateProgram

Compiler

Instr. Set

Organization

Technology

Page 43: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 44April 20, 2023

Marketing Metrics

MIPS = Instruction Count / Time * 10^6 = Clock Rate / CPI * 10^6

• Machines with different instruction sets ?

• Programs with different instruction mixes ?

– Dynamic frequency of instructions

• Uncorrelated with performance

MFLOP/s = FP Operations / Time * 10^6

• Machine dependent

• Often not where time is spent Normalized:

add,sub,compare,mult 1

divide, sqrt 4

exp, sin, . . . 8

Normalized:

add,sub,compare,mult 1

divide, sqrt 4

exp, sin, . . . 8

Page 44: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 45April 20, 2023

Cycles Per Instruction

CPU time = CycleTime * CPIi * Iii = 1

n

Avg. CPI = CPI * F where F = I i = 1

n

i i i i

Instruction Count

“Instruction Frequency”

Invest Resources where time is Spent!

CPI = Clock Cycles for a Program/Instr. Count

“Average Cycles per Instruction”

Page 45: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 46April 20, 2023

Example: Calculating CPI

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles CPI(i) (% Time)

ALU 50% 1 .5 (33%)

Load 20% 2 .4 (27%)

Store 10% 2 .2 (13%)

Branch 20% 2 .4 (27%)

1.5

1.5 is the Average CPI for this instruction mix

Page 46: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 47April 20, 2023

Organizational Trade-offs

Instruction Mix

Cycle Time

CPI

Compiler

Programming Language

Application

DatapathControl

Transistors Wires Pins

ISA

Function Units

Page 47: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 48April 20, 2023

Base Machine (Reg / Reg)

Op Freq Cycles

ALU 50% 1

Load 20% 2

Store 10% 2

Branch 20% 2Typical Mix

Trade-off ExampleAdd register / memory operations:

– One source operand in memory– One source operand in register– Cycle count of 2

Branch cycle count to increase to 3.

What fraction of the loads must be eliminated for this to pay off?

Page 48: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 49April 20, 2023

Example SolutionExec Time = Instr Cnt x CPI x Clock

Op Freq Cycles

ALU .50 1 .5

Load .20 2 .4

Store .10 2 .2

Branch .20 2 .3

Reg/Mem

1.00 1.5

Page 49: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 50April 20, 2023

Example SolutionExec Time = Instr Cnt x CPI x Clock

Op Freq Cycles Freq Cycles

ALU .50 1 .5 .5 – X 1 .5 – X

Load .20 2 .4 .2 – X 2 .4 – 2X

Store .10 2 .2 .1 2 .2

Branch .20 2 .3 .2 3 .6

Reg/Mem X 2 2X

1.00 1.5 1 – X (1.7 – X)/(1 – X)

CPINew must be normalized to new instruction frequency

CyclesNew

InstructionsNew

Page 50: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 51April 20, 2023

Example SolutionExec Time = Instr Cnt x CPI x Clock

Op Freq Cycles Freq Cycles

ALU .50 1 .5 .5 – X 1 .5 – X

Load .20 2 .4 .2 – X 2 .4 – 2X

Store .10 2 .2 .1 2 .2

Branch .20 2 .3 .2 3 .6

Reg/Mem X 2 2X

1.00 1.5 1 – X (1.7 – X)/(1 – X)

Instr CntOld x CPIOld x ClockOld = Instr CntNew x CPInew x ClockNew

1.00 x 1.5 = (1 – X) x (1.7 – X)/(1 – X)

Page 51: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 52April 20, 2023

Example SolutionExec Time = Instr Cnt x CPI x Clock

Op Freq Cycles Freq Cycles

ALU .50 1 .5 .5 – X 1 .5 – X

Load .20 2 .4 .2 – X 2 .4 – 2X

Store .10 2 .2 .1 2 .2

Branch .20 2 .3 .2 3 .6

Reg/Mem X 2 2X

1.00 1.5 1 – X (1.7 – X)/(1 – X)

Instr CntOld x CPIOld x ClockOld = Instr CntNew x CPINew x ClockNew

1.00 x 1.5 = (1 – X) x (1.7 – X)/(1 – X)

1.5 = 1.7 – X

0.2 = X

ALL loads must be eliminated for this to be a win!

Page 52: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 53April 20, 2023

How to Summarize Performance Arithmetic mean (weighted arithmetic mean) tracks e

xecution time: SUM(Ti)/n or SUM(Wi*Ti) Harmonic mean (weighted harmonic mean) of rates (e

.g., MFLOPS) tracks execution time: n/SUM(1/Ri) or n/SUM(Wi/Ri)

Normalized execution time is handy for scaling performance

But do not take the arithmetic mean of normalized execution time, use the geometric mean (prod(Ri)^1/n)

Page 53: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 54April 20, 2023

Means

n

iiT

n 1

1Arithmetic

mean

n

i iR

n

1

1

Geometric mean

n

i ri

inn

i ri

i

T

T

nT

T

1

1

1

log1

exp

Harmonic mean

Consistent independent of reference

Can be weighted. Represents total execution time

Page 54: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 55April 20, 2023

Which Machine is “Better”?

Computer A Computer B Computer CProgram P1(sec) 1 10 20Program P2 1000 100 20Total Time 1001 110 40

Page 55: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 56April 20, 2023

Weighted Arithmetic Mean

Assume three weighting schemes

P1/P2 Comp A Comp B Comp C.5/.5 500.5 55.0 20.909/.091 91.82 18.8 20.999/.001 2 10.09 20

Page 56: 204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 57April 20, 2023

Performance Evaluation Given sales is a function of performance relative to the co

mpetition, big investment in improving product as reported by performance summary

Good products created then have:– Good benchmarks– Good ways to summarize performance

If benchmarks/summary inadequate, then choose between improving product for real programs vs. improving product to get more sales;Sales almost always wins!

Execution time is the measure of performance