02_Performance_ECE552-2014 - complete.pdf


  • 8/9/2019 02_Performance_ECE552-2014 - complete.pdf

    1/13

ECE 552: Performance
Prof. Natalie Enright Jerger

Lecture notes based on slides created by Amir Roth of the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

Lecture notes enhanced by Milo Martin, Mark Hill, and David Wood, with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood.

Before we start... To have a meaningful discussion about modern architectures, we must discuss metrics to evaluate them

Discuss how metrics are impacted by Moore's Law

Moore's Law: devices per chip double every 18-24 months

    Empirical Evaluation Metrics

    Performance


    Cost

    Power

    Reliability

Often more important in combination than individually

    Performance/cost (MIPS/$)

    Performance/power (MIPS/W)

    Basis for Design decisions

    Purchasing decisions

Performance

Performance metrics

    Latency

    Throughput

    Reporting performance

    Benchmarking and averaging

CPU performance equation and performance trends

Two definitions

    Latency (execution time):

    Throughput (bandwidth):


Very different: throughput can exploit parallelism, latency cannot

Often contradictory

Choose definition that matches goals (most frequently throughput)

    Latency/Throughput Example

Example: move people from A to B, 10 miles

Car: capacity = 5, speed = 60 miles/hour

    Bus: capacity = 60, speed = 20 miles/hour

    Latency:

    Throughput:
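A minimal sketch of the arithmetic, assuming each vehicle shuttles back and forth continuously (that round-trip assumption for throughput is mine, not stated above):

```python
# Move people from A to B, 10 miles.
# Car: capacity 5, speed 60 mph.  Bus: capacity 60, speed 20 mph.
DISTANCE = 10  # miles

def latency_hours(speed_mph):
    """One-way trip time: the latency seen by one load of passengers."""
    return DISTANCE / speed_mph

def throughput_pph(capacity, speed_mph):
    """Passengers delivered per hour, assuming continuous round trips."""
    round_trip_hours = 2 * DISTANCE / speed_mph
    return capacity / round_trip_hours

print(latency_hours(60))       # car: ~0.167 h = 10 minutes
print(latency_hours(20))       # bus: 0.5 h = 30 minutes
print(throughput_pph(5, 60))   # car: 15 passengers/hour
print(throughput_pph(60, 20))  # bus: 60 passengers/hour
```

So the car wins on latency (3x faster per trip) while the bus wins on throughput (4x more passengers per hour): the two metrics pick opposite winners.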

    Performance Improvement

Processor A is X times faster than processor B if

Latency(P,A) = Latency(P,B) / X

Throughput(P,A) = X * Throughput(P,B)

Processor A is X% faster than processor B if


Latency(P,A) = Latency(P,B) / (1 + X/100)

Throughput(P,A) = (1 + X/100) * Throughput(P,B)

    Car/bus example Latency?

    Throughput?

    What Is P in Latency(P,A)? Program

Latency(A) makes no sense, processor executes some program

    But which one? Actual target workload?

    Some representative benchmark program(s)?


    Some small kernel benchmarks (micro-benchmarks)

    Adding/Averaging Performance Numbers

    You can add latencies, but not throughput

    Latency(P1+P2, A) = Latency(P1,A) + Latency(P2,A)

Throughput(P1+P2,A) != Throughput(P1,A) + Throughput(P2,A)

    1 km @ 30 kph + 1 km @ 90 kph

    0.033 hours at 30 kph + 0.011 hours at 90 kph

    Throughput(P1+P2,A) =

    2 / [(1/ Throughput(P1,A)) + (1/ Throughput(P2,A))]

Same goes for means (averages)

Arithmetic: (1/N) * Σ_{P=1..N} Latency(P)

Harmonic: N / Σ_{P=1..N} (1/Throughput(P))


Geometric: N-th root of Π_{P=1..N} Speedup(P)
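The 1 km example above, sketched out; it shows why rates (speeds, throughputs) combine with a harmonic mean while latencies (times) simply add:

```python
# Two 1 km segments driven at 30 kph and 90 kph.
speeds = [30, 90]  # kph, over equal distances

arithmetic_mean = sum(speeds) / len(speeds)               # 60 kph -- wrong
harmonic_mean = len(speeds) / sum(1 / s for s in speeds)  # 45 kph -- right

# Cross-check with raw latencies: times add, then divide total distance.
total_hours = 1 / 30 + 1 / 90   # ~0.033 + ~0.011 hours
print(2 / total_hours)          # ~45.0 kph, matches the harmonic mean
```

The arithmetic mean overstates the speed because the slow segment takes three times as long as the fast one.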

CPU Performance Equation

Multiple aspects to performance: helps to isolate them

Latency(P,A) = seconds / program = (instructions / program) * (cycles / instruction) * (seconds / cycle)

    Instructions / program:

    Cycles / instruction:

    Seconds / cycle:

For low latency (better performance), minimize all three

Hard: they often pull against each other
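A back-of-the-envelope sketch of the equation; the instruction count, CPI, and clock rate below are made-up numbers for illustration:

```python
# Latency = (insns/program) * (cycles/insn) * (seconds/cycle)
insn_count = 100_000_000  # dynamic instruction count (hypothetical)
cpi = 1.5                 # average cycles per instruction (hypothetical)
clock_hz = 2e9            # 2 GHz clock -> 0.5 ns cycle time

latency_seconds = insn_count * cpi * (1 / clock_hz)
print(latency_seconds)    # ~0.075 s: halve any one factor, halve the latency
```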


Cycles per Instruction (CPI)

This course is mostly about improving CPI

Cycles/instruction for the average instruction

IPC = 1/CPI

    Different instructions have different cycle costs

    E.g., integer add typically takes 1 cycle, FP divide takes > 10

    Assumes you know something about insn frequencies

CPI Example

A program executes equal numbers of integer, FP, and memory operations

    Cycles per instruction type:

    integer = 1, memory = 2, FP = 3

    What is the CPI?

    Caveat: this sort of calculation ignores

    dependences completely

    Back-of-the-envelope arguments only
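Working the slide's question: with equal thirds of each type, CPI is just the frequency-weighted average of the per-type cycle costs:

```python
# Cycle costs per instruction type, from the slide.
costs = {"integer": 1, "memory": 2, "fp": 3}

# Equal mix: each type is 1/3 of the dynamic instruction stream,
# so CPI is the plain average of the costs.
cpi = sum(costs.values()) / len(costs)
print(cpi)  # 2.0
```

As the caveat says, this ignores dependences entirely; it is a back-of-the-envelope number only.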

    Measuring CPI

How are CPI and execution time actually measured?


    Execution time: time (Unix): wall clock + CPU + system

CPI = (CPU time * clock frequency) / dynamic insn count

    How is dynamic instruction count measured?

Want CPI breakdowns (CPI_CPU, CPI_MEM, etc.) to see what to fix

CPI breakdowns

Hardware event counters

Calculate CPI using counter frequencies/event costs

    Cycle-level micro-architecture simulation (e.g., SimpleScalar)

    + Measures breakdown exactly provided

    + Models micro-architecture faithfully

    + Runs realistic workloads (some)

    Method of choice for many micro-architects (and you)

Improving CPI

This course is more about improving CPI than frequency

Historically, clock accounts for 70%+ of performance improvement

    Achieved via deeper pipelines

    This has changed

    Deep pipelining is not power efficient

    Physical speed limits are approaching

1GHz: 1999, 2GHz: 2001, 3GHz: 2002, 3.8GHz: 2004, 5GHz: 2008

    Intel Core 2: 1.8-3.2GHz: 2008

    Techniques we will look at

Caching, speculation, multiple issue, out-of-order issue, multiprocessing, more...


Moore helps because CPI reduction requires transistors

Definition of parallelism -- more transistors

    But best example is caches

Another CPI Example

Assume a processor with insn frequencies and costs

Integer ALU: 50%, 1 cycle

Load: 20%, 5 cycles

Store: 10%, 1 cycle

Branch: 20%, 2 cycles

Which change would improve performance more?

    A. Branch prediction to reduce branch cost to 1 cycle?

    B. A bigger data cache to reduce load cost to 3 cycles?

    Compute CPI

    Base =

    A =

    B =
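A sketch of the computation (frequencies and cycle costs straight from the slide; keeping frequencies as integer percentages makes the arithmetic exact):

```python
# (frequency in %, cycles) per instruction type, from the slide.
base = {"alu": (50, 1), "load": (20, 5), "store": (10, 1), "branch": (20, 2)}

def cpi(mix):
    """Frequency-weighted average cycles per instruction."""
    return sum(pct * cycles for pct, cycles in mix.values()) / 100

opt_a = dict(base, branch=(20, 1))  # A: branch prediction, branches -> 1 cycle
opt_b = dict(base, load=(20, 3))    # B: bigger data cache, loads -> 3 cycles

print(cpi(base))   # 2.0
print(cpi(opt_a))  # 1.8
print(cpi(opt_b))  # 1.6 -> the bigger data cache (B) helps more
```

B wins because loads are both frequent and expensive, so shaving their cost removes more weighted cycles than fixing branches.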

CPI Example 3

Operation  Frequency  Cycles
ALU        45%        1
Load       20%        1
Store      15%        2
Branch     20%        2


You can reduce store to 1 cycle, but this slows the clock down by 20%

    Old CPI =

    New CPI =

    Speedup = Old time/New time

Now, if ALU ops were 2 cycles originally and stores were 1 cycle, and you could reduce ALU to 1 cycle while slowing down the clock by 20%,

This optimization:

Example of Amdahl's law: you don't want to speed up a small fraction to the detriment of the rest
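Both scenarios sketched out with the slide's mix (ALU 45%, load 20%, store 15%, branch 20%); time per instruction is CPI times cycle time, so a 20% slower clock multiplies the new time by 1.2:

```python
FREQS = (45, 20, 15, 20)  # % of: ALU, load, store, branch

def cpi(costs):
    """Weighted CPI for per-type cycle costs in (ALU, load, store, branch) order."""
    return sum(f * c for f, c in zip(FREQS, costs)) / 100

# Scenario 1: stores go from 2 cycles to 1, clock slows by 20%.
old_cpi = cpi((1, 1, 2, 2))           # 1.35
new_cpi = cpi((1, 1, 1, 2))           # 1.20
print(old_cpi / (new_cpi * 1.2))      # ~0.94 -> a net slowdown

# Scenario 2: ALU ops go from 2 cycles to 1 (stores already 1 cycle).
old_cpi2 = cpi((2, 1, 1, 2))          # 1.65
new_cpi2 = cpi((1, 1, 1, 2))          # 1.20
print(old_cpi2 / (new_cpi2 * 1.2))    # ~1.15 -> a real speedup
```

Speeding up the infrequent stores loses overall, while speeding up the frequent ALU ops wins even with the slower clock: the fraction you optimize has to be big enough to pay for what you give up.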


Performance Rule of Thumb: Amdahl's Law

f: fraction that can run in parallel (be sped up)

1-f: fraction that must run serially

Speed-up = Time(1 CPU) / Time(n CPUs) = 1 / ((1-f) + f/n)


Pretty good ideal scaling for a modest number of cores

Large numbers of cores require a lot of parallelism

Amdahl's Law is not just about parallelism: f is the fraction that you can speed up

Your performance will always be limited by the (1-f) part
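The law as a one-liner; f = 0.95 below is an illustrative value, not from the slides:

```python
def amdahl_speedup(f, n):
    """Speedup when fraction f runs in parallel on n cores; 1-f stays serial."""
    return 1 / ((1 - f) + f / n)

print(amdahl_speedup(0.95, 16))     # ~9.1: decent scaling at modest core counts
print(amdahl_speedup(0.95, 64))     # ~15.4: diminishing returns
print(amdahl_speedup(0.95, 10**9))  # ~20: capped at 1/(1-f), no matter how many cores
```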

Summary

Latency = seconds/program = (instructions/program) * (cycles/instruction) * (seconds/cycle)

    Instructions/program: dynamic instruction count

    Function of program, compiler, instruction set architecture (ISA)

[Figure: two plots of speedup vs. number of cores, axes running 0-16 and 0-64]


    Cycles/instruction: CPI

    Function of program, compiler, ISA, micro-architecture

    Seconds/cycle: clock period

    Function of micro-architecture, technology parameters

    To improve performance, optimize each component

    Focus mostly on CPI in this course

Other Metrics

Will (try to) come back to:

    Cost

    Power

    Reliability

Interested in learning more? Grad school
