204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta

204521 Digital System Architecture 1April 20, 2023

Lecture 2a Computer Performance and Cost

Pradondet Nilagupta

(Based on lecture notes by Prof. Randy Katz

& Prof. Jan M. Rabaey , UC Berkeley)


Review: What is Computer Architecture?

Technology

ApplicationsComputer

Architect

Interfaces

Machine Organization

Measurement &Evaluation

ISA

AP

I

Link

I/O

Cha

n

Regs

IR


The Architecture Process

New concepts created

EstimateCost &

PerformanceSort

Good ideasMediocre

ideasBad ideas


Performance Measurement and Evaluation

Many dimensions to computer performance

– CPU execution time by instruction or sequence

– floating point

– integer

– branch performance

– Cache bandwidth

– Main memory bandwidth

– I/O performance bandwidth seeks pixels or polygons per second

Relative importance depends on applications

P

$

M


Evaluation Tools Benchmarks, traces, & mixes

– macrobenchmarks & suites application execution time

– microbenchmarks measure one aspect of performance

– traces replay recorded accesses

– cache, branch, register Simulation at many levels

– ISA, cycle accurate, RTL, gate, circuit trade fidelity for simulation rate

Area and delay estimation Analysis

– e.g., queuing theory

MOVE 39%BR 20%LOAD 20%STORE 10%ALU 11%

LD 5EA3ST 31FF….LD 1EA2….


Benchmarks

Microbenchmarks– measure one performance

dimension cache bandwidth main memory bandwidth procedure call overhead FP performance

– weighted combination of microbenchmark performance is a good predictor of application performance

– gives insight into the cause of performance bottlenecks

Macrobenchmarks– application execution time

measures overall performance, but on just one application

Perf. Dimensions

App

licat

ions

Mic

ro

Macro


Some Warnings about Benchmarks Benchmarks measure the

whole system

– application

– compiler

– operating system

– architecture

– implementation Popular benchmarks typically

reflect yesterday’s programs

– computers need to be designed for tomorrow’s programs

Benchmark timings often very sensitive to

– alignment in cache

– location of data on disk

– values of data Benchmarks can lead to

inbreeding or positive feedback

– if you make an operation fast (slow) it will be used more (less) often

so you make it faster (slower)

– and it gets used even more (less)

• and so on…


Architectural Performance Laws and Rules of Thumb


Measurement and Evaluation

Architecture is an iterative process:• Searching the space of possible designs• Make selections• Evaluate the selections made

Good measurement tools are required to accurately evaluate the selection.


Measurement Tools Benchmarks, Traces, Mixes Cost, delay, area, power estimation Simulation (many levels)

– ISA, RTL, Gate, Circuit

Queuing Theory Rules of Thumb Fundamental Laws


Measuring and Reporting Performance

What do we mean by one Computer is faster than another?– program runs less time

Response time or execution time– time that users see the output

Throughput – total amount of work done in a given time


Performance

“Increasing and decreasing” ?????

We use the term “improve performance” or “ improve execution time” When we mean increase performance and decrease execution time .

improve performance = increase performance

improve execution time = decrease execution time


Measuring Performance

Definition of time

Wall Clock time Response time Elapsed time

– A latency to complete a task including disk accesses, memory accesses, I/O activities, operating system overhead


Does Anybody Really Know What Time it is?

User CPU Time (Time spent in program) System CPU Time (Time spent in OS) Elapsed Time (Response Time = 159 Sec.) (90.7+12.9)/159 * 100 = 65%, % of lapsed time that is

CPU time. 45% of the time spent in I/O or running other programs

UNIX Time Command: 90.7u 12.9s 2:39 65%


example

UNIX time command

90.7u 12.95 2:39 65%

user CPU time is 90.7 sec

system CPU time is 12.9 sec

elapsed time is 2 min 39 sec. (159 sec)

% of elapsed time that is CPU time is 90.7 + 12.9 = 65%

159


Time

CPU time– time the CPU is computing

– not including the time waiting for I/O or running other program

User CPU time– CPU time spent in the program

System CPU time – CPU time spent in the operating system performing task requested

by the program decrease execution time

CPU time = User CPU time + System CPU time


Performance

System Performance– elapsed time on unloaded system

CPU performance– user CPU time on an unloaded system


Programs to Evaluate Processor Performance

Real Program Kernel

– Time critical excerpts

Toy benchmark– 10-100 lines of code, Sieve of Erastasthens, Puzzles, Quick Sort

Synthetic benchmark– Attempt to match average frequencies of real workloads

– similar to kernel

– Whetstone, Dhrystone

– not even pieces of real program, but kernel might be.


Benchmark Suite

collecting of benchmark, try to measure the performance of processor with a variety of application– SPECint92, SPECfp92

– see fig. 1.9


Benchmarking Games

Differing configurations used to run the same workload on two systems

Compiler wired to optimize the workload Test specification written to be biased towards one

machine Synchronized CPU/IO intensive job sequence used Workload arbitrarily picked Very small benchmarks used Benchmarks manually translated to optimize

performance


Common Benchmarking Mistakes (I)

Only average behavior represented in test workload Skewness of device demands ignored Loading level controlled inappropriately Caching effects ignored Buffer sizes not appropriate Inaccuracies due to sampling ignored


Common Benchmarking Mistakes (II)

Ignoring monitoring overhead Not validating measurements Not ensuring same initial conditions Not measuring transient (cold start) performance Using device utilizations for performance comparisons Collecting too much data but doing too little analysis


SPEC: System Performance Evaluation Cooperative

First Round 1989– 10 programs yielding a single number

Second Round 1992– SpecInt92 (6 integer programs) and SpecFP92 (14 floating point progra

ms)

– Compiler Flags unlimited. March 93 of DEC 4000 Model 610:spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(

b,a,c)”

wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200

nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas

Third Round 1995– Single flag setting for all programs; new set of programs “benchmarks u

seful for 3 years”


SPEC: System Performance Evaluation Cooperative First Round 1989

– 10 programs yielding a single number (“SPECmarks”)

Second Round 1992– SPECInt92 (6 integer programs) and SPECfp92 (14 floating

point programs) Compiler Flags unlimited. March 93 of DEC 4000 Model

610:

spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=

memcpy(b,a,c)”

wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200

nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas


SPEC: System Performance Evaluation Cooperative Third Round 1995

– new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)

– “benchmarks useful for 3 years”

– Single flag setting for all programs: SPECint_base95, SPECfp_base95


How to Summarize Performance Arithmetic mean (weighted arithmetic mean) tracks execution

time: (Ti)/n or (Wi*Ti)

Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n/ (1/Ri) or n/(Wi/Ri)

Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)– Arithmetic mean impacted by choice of reference machine

Use the geometric mean for comparison:(Ti)^1/n– Independent of chosen machine– but not good metric for total execution time


SPEC First Round One program: 99% of time in single line of code New front-end compiler could improve dramatically

Benchmark

SP

EC

Pe

rf

0

100

200

300

400

500

600

700

800gcc

epresso

spice

doduc

nasa7 li

eqntott

matrix300

fpppp

tomcatv


Impact of Means on SPECmark89 for IBM 550(without and with special compiler option) Ratio to VAX: Time: Weighted Time:

Program Before After Before After Before Aftergcc 30 29 49 51 8.91 9.22espresso 35 34 65 67 7.64 7.86spice 47 47 510 510 5.69 5.69doduc 46 49 41 38 5.81 5.45nasa7 78 144 258 140 3.43 1.86li 34 34 183 183 7.86 7.86eqntott 40 40 28 28 6.68 6.68matrix30078 730 58 6 3.43 0.37fpppp 90 87 34 35 2.97 3.07tomcatv 33 138 20 19 2.01 1.94Mean 54 72 124 108 54.42 49.99

Geometric Arithmetic Weighted Arith.

Ratio 1.33 Ratio 1.16 Ratio 1.09


The Bottom Line: Performance (and Cost)

• Time to run the task (ExTime)– Execution time, response time, latency

• Tasks per day, hour, week, sec, ns … (Performance)– Throughput, bandwidth

Plane

Boeing 747

BAD/Sud Concorde

Speed

610 mph

1350 mph

DC to Paris

6.5 hours

3 hours

Passengers

470

132

Throughput (pass/hr)

72

44


The Bottom Line: Performance (and Cost)

"X is n times faster than Y" means

ExTime(Y) Performance(X)

--------- = --------------- = n

ExTime(X) Performance(Y)

• Performance is the reciprocal of execution time

• Speed of Concorde vs. Boeing 747

• Throughput of Boeing 747 vs. Concorde


Performance Terminology

“X is n% faster than Y” means:ExTime(Y) Performance(X) n

--------- = -------------- = 1 + -----

ExTime(X) Performance(Y) 100

n = 100(Performance(X) - Performance(Y))

Performance(Y)

n = 100(ExTime(Y) - ExTime(X))

ExTime(X)


Example

Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

n = 100(ExTime(Y) - ExTime(X))

ExTime(X)

n = 100(15 - 10)

10

n = 50%


SpeedupSpeedup due to enhancement E: ExTime w/o E Performance w/ E

Speedup(E) = ------------- = -------------------

ExTime w/ E Performance w/o E

Suppose that enhancement E accelerates a fractionenhanced of

the task by a factor Speedupenhanced , and the remainder of

the task is unaffected, then what is

ExTime(E) = ?

Speedup(E) = ?


Amdahl’s Law

States that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time faster mode can be used


Amdahl’s Law

ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced

Speedupoverall =ExTimeold

ExTimenew

Speedupenhanced

=

1

(1 - Fractionenhanced) + Fractionenhanced

Speedupenhanced


Example of Amdahl’s Law

Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

Speedupoverall =

ExTimenew =


Example of Amdahl’s Law

Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

Speedupoverall = 1

0.95= 1.053

ExTimenew = ExTimeold x (0.9 + .1/2) = 0.95 x ExTimeold


Corollary: Make The Common Case Fast

All instructions require an instruction fetch, only a fraction require a data fetch/store.

– Optimize instruction access over data access Programs exhibit locality

– 90% of time in 10% of code

– Spatial Locality

– Temporal Locality

Access to small memories is faster

– Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories.

Reg'sCache

Memory Disk / Tape


RISC Philosophy

The simple case is usually the most frequent and the easiest to optimize!

Do simple, fast things in hardware and be sure the rest can be handled correctly in software


Metrics of Performance

Compiler

Programming Language

Application

DatapathControl

Transistors Wires Pins

ISA

Function Units

(millions) of Instructions per second: MIPS(millions) of (FP) operations per second: MFLOP/s

Cycles per second (clock rate)

Megabytes per second

Answers per monthOperations per second


% of Instructions Responsible for 80-90% of Instructions Executed

0

10

20

30

40

50

60

70

80

compress li ear


Aspects of CPU PerformanceCPU time = Seconds = Instructions x Cycles x

Seconds

(User) Program Program Instruction Cycle

CPU time = Seconds = Instructions x Cycles x Seconds

(User) Program Program Instruction Cycle

Instr. Cnt CPI Clock RateProgram

Compiler

Instr. Set

Organization

Technology


Marketing Metrics

MIPS = Instruction Count / Time * 10^6 = Clock Rate / CPI * 10^6

• Machines with different instruction sets ?

• Programs with different instruction mixes ?

– Dynamic frequency of instructions

• Uncorrelated with performance

MFLOP/s = FP Operations / Time * 10^6

• Machine dependent

• Often not where time is spent Normalized:

add,sub,compare,mult 1

divide, sqrt 4

exp, sin, . . . 8

Normalized:

add,sub,compare,mult 1

divide, sqrt 4

exp, sin, . . . 8


Cycles Per Instruction

CPU time = CycleTime * CPIi * Iii = 1

n

Avg. CPI = CPI * F where F = I i = 1

n

i i i i

Instruction Count

“Instruction Frequency”

Invest Resources where time is Spent!

CPI = Clock Cycles for a Program/Instr. Count

“Average Cycles per Instruction”


Example: Calculating CPI

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles CPI(i) (% Time)

ALU 50% 1 .5 (33%)

Load 20% 2 .4 (27%)

Store 10% 2 .2 (13%)

Branch 20% 2 .4 (27%)

1.5

1.5 is the Average CPI for this instruction mix


Organizational Trade-offs

Instruction Mix

Cycle Time

CPI

Compiler

Programming Language

Application

DatapathControl

Transistors Wires Pins

ISA

Function Units


Base Machine (Reg / Reg)

Op Freq Cycles

ALU 50% 1

Load 20% 2

Store 10% 2

Branch 20% 2Typical Mix

Trade-off ExampleAdd register / memory operations:

– One source operand in memory– One source operand in register– Cycle count of 2

Branch cycle count to increase to 3.

What fraction of the loads must be eliminated for this to pay off?


Example SolutionExec Time = Instr Cnt x CPI x Clock

Op Freq Cycles

ALU .50 1 .5

Load .20 2 .4

Store .10 2 .2

Branch .20 2 .3

Reg/Mem

1.00 1.5



Op Freq Cycles Freq Cycles

ALU .50 1 .5 .5 – X 1 .5 – X

Load .20 2 .4 .2 – X 2 .4 – 2X

Store .10 2 .2 .1 2 .2

Branch .20 2 .3 .2 3 .6

Reg/Mem X 2 2X

1.00 1.5 1 – X (1.7 – X)/(1 – X)

CPINew must be normalized to new instruction frequency

CyclesNew

InstructionsNew




ALU .50 1 .5 .5 – X 1 .5 – X

Load .20 2 .4 .2 – X 2 .4 – 2X

Store .10 2 .2 .1 2 .2

Branch .20 2 .3 .2 3 .6

Reg/Mem X 2 2X

1.00 1.5 1 – X (1.7 – X)/(1 – X)

Instr CntOld x CPIOld x ClockOld = Instr CntNew x CPInew x ClockNew

1.00 x 1.5 = (1 – X) x (1.7 – X)/(1 – X)




ALU .50 1 .5 .5 – X 1 .5 – X

Load .20 2 .4 .2 – X 2 .4 – 2X

Store .10 2 .2 .1 2 .2

Branch .20 2 .3 .2 3 .6

Reg/Mem X 2 2X

1.00 1.5 1 – X (1.7 – X)/(1 – X)

Instr CntOld x CPIOld x ClockOld = Instr CntNew x CPINew x ClockNew

1.00 x 1.5 = (1 – X) x (1.7 – X)/(1 – X)

1.5 = 1.7 – X

0.2 = X

ALL loads must be eliminated for this to be a win!


How to Summarize Performance Arithmetic mean (weighted arithmetic mean) tracks e

xecution time: SUM(Ti)/n or SUM(Wi*Ti) Harmonic mean (weighted harmonic mean) of rates (e

.g., MFLOPS) tracks execution time: n/SUM(1/Ri) or n/SUM(Wi/Ri)

Normalized execution time is handy for scaling performance

But do not take the arithmetic mean of normalized execution time, use the geometric mean (prod(Ri)^1/n)


Means

n

iiT

n 1

1Arithmetic

mean

n

i iR

n

1

1

Geometric mean

n

i ri

inn

i ri

i

T

T

nT

T

1

1

1

log1

exp

Harmonic mean

Consistent independent of reference

Can be weighted. Represents total execution time


Which Machine is “Better”?

Computer A Computer B Computer CProgram P1(sec) 1 10 20Program P2 1000 100 20Total Time 1001 110 40


Weighted Arithmetic Mean

Assume three weighting schemes

P1/P2 Comp A Comp B Comp C.5/.5 500.5 55.0 20.909/.091 91.82 18.8 20.999/.001 2 10.09 20


Performance Evaluation Given sales is a function of performance relative to the co

mpetition, big investment in improving product as reported by performance summary

Good products created then have:– Good benchmarks– Good ways to summarize performance

If benchmarks/summary inadequate, then choose between improving product for real programs vs. improving product to get more sales;Sales almost always wins!

Execution time is the measure of performance

Documents

204521 Digital System Architecture 1 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 28 ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta