Performance Counters on Intel® Core™ 2 Duo Xeon® Processors Intel® Software College


Copyright © 2007, Intel Corporation. All rights reserved.

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.


Objective

Upon successful completion of this module, you will be able to:

• Use the VTune™ Performance Analyzer to identify micro-architectural bottlenecks in software running on Intel® Core™ 2 Duo Xeon® processors

• Address performance bottlenecks on Intel® Core™ 2 Duo Xeon® processors


Agenda

Core® micro-architecture review

Event basics

Events for identifying bottlenecks on Intel® Core™ 2 Duo Xeon® processors

Summary


Next Generation Micro-Architecture: Intel® Core™ 2 Duo Processor

[Diagram: two cores, CPU-0 and CPU-1, each with a 32 KB L1 data cache (L1D), a 32 KB L1 instruction cache (L1I), L0/L1 DTLBs, and a PMH. Both cores share a 4 MB L2 cache, which connects to the FSB.]


Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity. Some resources (Bus Unit, L2 Cache, etc…) are shared between cores.

Architecture Block and Instruction Flow

[Block diagram, in instruction-flow order:

• Fetch / Decode: Branch Target Buffer, Next IP, 32 KB Instruction Cache, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT)

• Reservation Stations (RS), 32 entry; Scheduler / Dispatch Ports

• Execute: ports feeding FP Add, FP Div/Mul, Integer Arithmetic, Integer Shift/Rotate, SIMD, SIMD Integer Arithmetic, Load, Store Addr, and Store Data

• Memory Order Buffer (MOB) and 32 KB Data Cache

• Retire: Re-Order Buffer (ROB), 96 entry; IA Register Set

• Bus Unit, to L2 Cache/Memory]


Agenda

Core® micro-architecture review

Event basics

Events for identifying bottlenecks on Intel® Core™ 2 Duo Xeon® processors

Summary


VTune™ Analyzer Event Basics

Events Versus Samples

A performance counter increments on the CPU every time an event occurs

A sample of the execution context is recorded every time a performance counter overflows

Events = samples * sample after value
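
The relationship above can be turned into a one-line estimate. The following is a minimal sketch in Python, not part of the VTune™ analyzer; the function name and the numbers are hypothetical and chosen only for illustration.

```python
def estimate_events(samples: int, sample_after_value: int) -> int:
    """Estimate the total event count from the number of samples collected.

    One sample is recorded each time the counter overflows, and the counter
    overflows once every `sample_after_value` events.
    """
    return samples * sample_after_value


# Hypothetical run: 1,500 samples with a sample-after value of 2,000,000.
print(estimate_events(1_500, 2_000_000))  # 3000000000
```

The estimate is only as fine-grained as the sample-after value: events that occurred after the last overflow are not represented by any sample.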


VTune™ Analyzer Event Basics: Retired Versus Non-Retired Events

Retired events include only events that occur due to instructions that are committed to the machine state.

• For example, when measuring the Loads Retired event, a load that occurs on a mispredicted execution path is not counted

Most retired events can also be precise events.

• No event skid


VTune™ Analyzer Event Basics: Event Skid

In the disassembly and source views, an event can be attributed to an instruction a few lines after the one where it actually occurred. This skid is due to interrupt latency.


VTune™ Analyzer Event Basics: Precise Events

Do not suffer from event skid

Use hardware to record the address where the event occurs

Reduce the number of events you can collect at once


VTune™ Analyzer Event Basics: Precise Events (cont.)

[Screenshots: the same view with precise events on and with precise events off.]


VTune™ Analyzer Event Basics: Event Ratios

Calculate common processor performance metrics

Built in to VTune™ analyzer


VTune™ Analyzer Event Basics: Clockticks and Instructions Retired

Clockticks measure CPU cycles

Clockticks/processor frequency = time in seconds

Instructions retired = the number of instructions committed to the processor state (executed completely)

Cycles per instruction (CPI) = clockticks / instructions retired

High CPI usually indicates opportunities for optimization.
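
As a worked example of the two ratios on this slide, here is a short Python sketch. The function names, the 2.4 GHz frequency, and the counter values are assumptions made up for illustration; they are not measurements.

```python
def elapsed_seconds(clockticks: int, frequency_hz: float) -> float:
    """Time in seconds = clockticks / processor frequency."""
    return clockticks / frequency_hz


def cycles_per_instruction(clockticks: int, instructions_retired: int) -> float:
    """CPI = clockticks / instructions retired."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return clockticks / instructions_retired


# Hypothetical counts on an assumed 2.4 GHz processor.
clockticks = 4_800_000_000
instructions_retired = 3_200_000_000

print(elapsed_seconds(clockticks, 2.4e9))                    # 2.0
print(cycles_per_instruction(clockticks, instructions_retired))  # 1.5
```

A CPI of 1.5 here means the code retires two instructions every three cycles on average; whether that is high depends on the workload.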


VTune™ Analyzer Event Basics: Clockticks Versus Non-halted Clockticks

Clockticks = halted + non-halted cycles (but no sleep cycles)

• The clockticks event measures cycles when the physical processor is not in any sleep mode.

• The non-halted clockticks event measures the cycles in which a logical processor is neither asleep nor halted.

If you measure clockticks on a system with Hyper-Threading Technology enabled while running a single-threaded application, you will see many samples around the halt instruction in processor.sys.


Agenda

Core® micro-architecture review

Event basics

Performance tuning for Intel® Core™ 2 Duo Xeon® processors

• Events for performance

• Performance optimization methodology

• X86 cycle accounting

Summary


Performance Events along µ-op Flow (1)

[Block diagram: the architecture block and instruction flow diagram (Fetch / Decode, Reservation Stations, Scheduler / Dispatch Ports, Execute, Memory Order Buffer, 32 KB Data Cache, Retire, and Bus Unit to L2 Cache/Memory), used here as the map for locating performance events along the µ-op flow.]


Memory Access (Examples)

• Latencies
  • L1 miss that hits L2: ~10 cycles
  • L2 miss, access to memory: ~300 cycles (server/FBD)
  • L2 miss, access to memory: ~165 cycles (desktop/DDR2)

• Cache Bandwidth
  • Bandwidth to cache: ~8.5 bytes/cycle

• Memory Bandwidth
  • Desktop: ~6 GB/sec/socket (Linux*)
  • Server: ~3.5 GB/sec/socket


Performance Events for the Front End

Event | P (precise) | Description
CPU_CLK_UNHALTED | |
INST_RETIRED.ANY_P | P |
INST_RETIRED.LOADS | |
INST_RETIRED.STORES | |
BUS_TRANS_ANY | | all bus transactions
BUS_TRANS_MEM | | bus transactions to memory
BUS_TRANS_BURST | | whole cache lines to memory
BUS_TRANS_BRD | | whole line reads from memory
BUS_TRANS_WB | | writebacks (no NT writes)
BUS_TRANS_RFO | | cache lines in for RFO (no HW prefetch)
BUS_DRDY_CLOCKS.ALL_AGENTS | | all busy bus cycles
BUS_DRDY_CLOCKS.THIS_AGENT | | all busy bus cycles due to writes
MEM_LOAD_RETIRED.L2_LINE_MISS | P | L2 demand misses
MMX2_PRE_MISS.T1 | | SW prefetch to L1 inst
MMX2_PRE_MISS.T2 | | SW prefetch to L2 inst
MMX2_PRE_MISS.STORES | | non-temporal stores executed
L2_LINES_IN.SELF.DEMAND | | L2 cache lines in for RFO, load, SW prefetch
L2_LINES_IN.SELF.PREFETCH | | L2 cache lines in for HW prefetch
L2_LINES_OUT.SELF.DEMAND | | demanded L2 cache lines evicted
L2_LINES_OUT.SELF.PREFETCH | | HW prefetch L2 cache lines evicted

Memory BW = 64*Bus_Trans_Mem*freq/Cpu_Clk_Unhalted
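
The bandwidth formula can be evaluated directly from the two event counts. The sketch below is in Python and is only illustrative: it reads the factor 64 as the number of bytes moved per bus transaction to memory (one cache line), and the function name, frequency, and counts are hypothetical.

```python
CACHE_LINE_BYTES = 64  # the factor 64 in the formula, read as bytes per transaction


def memory_bandwidth_bytes_per_sec(bus_trans_mem: int,
                                   cpu_clk_unhalted: int,
                                   frequency_hz: float) -> float:
    """Memory BW = 64 * Bus_Trans_Mem * freq / Cpu_Clk_Unhalted.

    Cpu_Clk_Unhalted / freq is the measured time in seconds, so this is
    bytes transferred divided by elapsed time.
    """
    if cpu_clk_unhalted <= 0:
        raise ValueError("cpu_clk_unhalted must be positive")
    return CACHE_LINE_BYTES * bus_trans_mem * frequency_hz / cpu_clk_unhalted


# Hypothetical counts on an assumed 2.4 GHz processor (2 seconds of unhalted cycles).
bw = memory_bandwidth_bytes_per_sec(
    bus_trans_mem=100_000_000,
    cpu_clk_unhalted=4_800_000_000,
    frequency_hz=2.4e9,
)
print(f"{bw / 1e9:.2f} GB/s")  # 3.20 GB/s
```

Comparing the result with the platform figures on the Memory Access slide indicates how close the workload is to saturating memory bandwidth.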


Lab Activity 1: Calculating the Memory Access Bandwidth

In this lab, you will calculate the bandwidth of memory with the performance counter events using the VTune™ analyzer


Performance Events along µ-op Flow (2)

[Block diagram: the architecture block and instruction flow diagram (Fetch / Decode, Reservation Stations, Scheduler / Dispatch Ports, Execute, Memory Order Buffer, 32 KB Data Cache, Retire, and Bus Unit to L2 Cache), with the annotations “Resource_Stalls measures here” and “transfer from Decode”.]


Performance Events of Resource_Stalls

µ-op flow to OOO engine blocked by downstream cause

Resource_Stalls.BR_MISS_CLEAR
• Pipeline stalls due to flushing mispredicted branches
• Combine in Resource_Stalls.CLEAR
• Mispredicted branch followed by FP instruction

Resource_Stalls.ROB_FULL
• 96 instructions in ROB

Resource_Stalls.LD_ST
• All Store or Load buffers in use

Resource_Stalls.RS_FULL
• 32 instructions waiting for inputs in Reservation Station


Measuring Instruction Starvation

There really is no good way to do this
• Anti-correlate with Resource_Stalls.RS_FULL

There could be
• Cycles Decode queue is empty
• Cycles RS is empty
• Cycles ROB is empty


Performance Events along µ-op Flow (3)

[Block diagram: the architecture block and instruction flow diagram (Fetch / Decode, Reservation Stations, Scheduler / Dispatch Ports, Execute, Memory Order Buffer, 32 KB Data Cache, Retire, and Bus Unit to L2 Cache), with the annotations “Rs_uops_dispatched measures at Execution” and “Other stalls measures at Execution”.]


Measuring Efficiency in the Execution Stage

The OOO engine optimizes the issue of µ-ops from the Reservation Station to the functional units

• µ-ops wait in the Reservation Station until their inputs are available

• RS_UOPS_DISPATCHED measures the number of µ-ops dispatched from the RS on each cycle

There are chains preventing OOO engine from executing in parallel

• Partial Register Stall

• Partial Flag Register Stall

• Domain bypass

• Others…


Performance Events along µ-op Flow (4)

[Block diagram: the architecture block and instruction flow diagram (Fetch / Decode, Reservation Stations, Scheduler / Dispatch Ports, Execute, Memory Order Buffer, 32 KB Data Cache, Retire, and Bus Unit to L2 Cache), with the annotation “µ-ops retired measures at Retirement”.]


Retirement vs Dispatch

Which counters to work on first?

• For loops, difference is due to OOO execution

• Fewer false positives when “Stalls” are measured at Dispatch

• Retirement is generally more important than Dispatch


Performance Optimization Methodology

This style of optimization has 2 components

• Minimizing instruction count (path length)
  • A sort of “tree height” minimization

• Minimizing deviations from ideal execution
  • Generically thought of as “stall cycles”

Treating both equally is critical


Stalls, Execution Imperfection and Performance Analysis

Stall cycles are used to indicate less than perfect execution
• An architectural decomposition of “stalls” can be used to guide the selection of architectural events
• The IP correlation of “stalls” and arch events then guides the optimization effort

Stalls have 4 basic components in x86
• Front End stalls
  • Execution stage instruction starvation (Front End)
• Mispredicted branch pipeline flushing
• Execution stalls
  • (Waiting on input/Scoreboard, L2 miss, BW, DTLB, glass jaws, etc.)
• Cycles wasted executing instructions that are not retired


X86 Cycle Accounting and SW Optimization

Cpu_clk_unhalted = “stalls” + dispatch = “stalls” + non_ret_dispatch + ret_dispatch

Traditional Stall Removal

Reduce Branch Mispredictions, PGO

Improve Optimization to Reduce Instruction Count, Split Loops to Increase ILP

Resource_stalls.br_miss_clear will estimate stalls due to Pipeline Flush


Cycle Accounting on X86

Cycles = “stalls” + dispatch
• An equality by definition

Cycles ~ CPU_CLK_UNHALTED.CORE

• For CPU-intensive applications/sampling

Stall Cycles = Cycles with NO uops Dispatched = RS_UOPS_DISPATCHED.CYCLES_NONE

Dispatch Cycles = RS_UOPS_DISPATCHED
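
The split of cycles into stalls and dispatch can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a VTune™ analyzer feature: the function name and counts are hypothetical, and dispatch cycles are derived here as total cycles minus the cycles in which no uops were dispatched, which follows from the equality Cycles = “stalls” + dispatch.

```python
def cycle_accounting(unhalted_core_cycles: int,
                     cycles_no_uops_dispatched: int) -> dict:
    """Split unhalted cycles into a stall fraction and a dispatch fraction.

    unhalted_core_cycles      -- total cycles (the unhalted core clock count)
    cycles_no_uops_dispatched -- cycles in which no uops were dispatched
    """
    if unhalted_core_cycles <= 0:
        raise ValueError("unhalted_core_cycles must be positive")
    if not 0 <= cycles_no_uops_dispatched <= unhalted_core_cycles:
        raise ValueError("stall cycles must lie between 0 and total cycles")
    dispatch_cycles = unhalted_core_cycles - cycles_no_uops_dispatched
    return {
        "stall_fraction": cycles_no_uops_dispatched / unhalted_core_cycles,
        "dispatch_fraction": dispatch_cycles / unhalted_core_cycles,
    }


# Hypothetical counts: 40% of the cycles dispatched nothing.
print(cycle_accounting(1_000_000, 400_000))
```

By construction the two fractions sum to 1, which is the "equality by definition" noted on the slide.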


Cycle Accounting on X86 (cont.)

Dispatch ~ cycles_dispatch_retiring_uops + cycles_dispatch_non_retiring_uops

• Assumes no overlap of retired/non-retired uops
• Worst-case scenario

Non-retired uops = rs_uops_dispatched - (uops_retired.any + uops_retired.fused)

• Non-retired uop cycles ~ non-retired uops / avg_uops_per_cycle

Fractional Wasted Work = rs_uops_dispatched / (uops_retired.any + uops_retired.fused) - 1
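
The non-retired uop count and the fractional wasted work can be computed side by side. The Python sketch below uses hypothetical function names and made-up counts; the slide does not define avg_uops_per_cycle, so it is not computed here.

```python
def non_retired_uops(rs_uops_dispatched: int,
                     uops_retired_any: int,
                     uops_retired_fused: int) -> int:
    """Non-retired uops = dispatched - (retired.any + retired.fused)."""
    return rs_uops_dispatched - (uops_retired_any + uops_retired_fused)


def fractional_wasted_work(rs_uops_dispatched: int,
                           uops_retired_any: int,
                           uops_retired_fused: int) -> float:
    """Fractional wasted work = dispatched / (retired.any + retired.fused) - 1."""
    retired = uops_retired_any + uops_retired_fused
    if retired <= 0:
        raise ValueError("retired uop count must be positive")
    return rs_uops_dispatched / retired - 1


# Hypothetical counts: 1.2M uops dispatched, 1.0M of them retired.
print(non_retired_uops(1_200_000, 900_000, 100_000))  # 200000
print(round(fractional_wasted_work(1_200_000, 900_000, 100_000), 3))  # 0.2
```

A value of 0.2 means that, in this made-up case, one uop in six that was dispatched did not retire, typically because it sat on a mispredicted path.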


Pulling Cycle Accounting Together

[Chart: “Cycle Accounting” (vertical axis 0 to 1.2), dividing cycles into Executing and Stalls.]

Illustrative Example Only, Not Real Data


Decomposing Stalls: Elephants First

Pipeline Flush = Resource_Stalls.Br_Miss_Clear / cycles

L2 Hits = (MEM_LOAD_RETIRED.L1D_LINE_MISS - MEM_LOAD_RETIRED.L2_LINE_MISS) * 10 / cycles

DTLB/L2 Miss = event count * penalty / cycles

FE + Scoreboard = Stalls - all of the above

[Chart: “Stall Decomposition”, two bars (vertical axis 0 to 1.2). Series: Executing, FE + Scoreboard, Pipeline Flush, DTLB, L2 Hits, L2 Misses, and Stall Total.]

Illustrative Example Only, Not Real Data
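
The decomposition formulas on this slide can be applied mechanically once the counts are in hand. The Python sketch below is one possible reading, with all names hypothetical. The 10-cycle L2 hit cost comes from the formula itself and the 165-cycle L2 miss penalty from the desktop figure on the Memory Access slide; the DTLB penalty of 10 cycles is an assumed placeholder, since the deck does not give one. All input counts are made up.

```python
def decompose_stalls(cycles: int, stall_cycles: int, br_miss_clear: int,
                     l1d_line_miss: int, l2_line_miss: int, dtlb_miss: int,
                     *, l2_miss_penalty: float, dtlb_penalty: float,
                     l2_hit_penalty: float = 10) -> dict:
    """Express each stall component as a fraction of total cycles."""
    pipeline_flush = br_miss_clear / cycles
    # Loads that missed L1D but not L2 are the L2 hits.
    l2_hits = (l1d_line_miss - l2_line_miss) * l2_hit_penalty / cycles
    l2_misses = l2_line_miss * l2_miss_penalty / cycles
    dtlb = dtlb_miss * dtlb_penalty / cycles
    stall_total = stall_cycles / cycles
    # Whatever the named components do not explain is FE + Scoreboard.
    fe_scoreboard = stall_total - (pipeline_flush + l2_hits + l2_misses + dtlb)
    return {
        "Pipeline Flush": pipeline_flush,
        "L2 Hits": l2_hits,
        "L2 Misses": l2_misses,
        "DTLB": dtlb,
        "FE + Scoreboard": fe_scoreboard,
        "Stall Total": stall_total,
    }


# Hypothetical counts; the DTLB penalty is an assumption, not a figure from the deck.
parts = decompose_stalls(
    cycles=1_000_000, stall_cycles=500_000, br_miss_clear=50_000,
    l1d_line_miss=12_000, l2_line_miss=1_000, dtlb_miss=2_000,
    l2_miss_penalty=165, dtlb_penalty=10,
)
for name, fraction in parts.items():
    print(f"{name:16s} {fraction:.3f}")
```

Because the penalties are approximate, the residual FE + Scoreboard term can come out too small or even negative when the named components are over-counted.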


Decomposing Unstalled Cycles

[Chart: “Decomposing Unstalled Cycles”, a single bar (vertical axis 0.75 to 1.05) divided into Uops Retiring, OOO Bursts, Non_retired, and Stalls.]

Non_Retired = (1 - (Uops_retired.any + Uops_retired.fused) / RS_Uops_Dispatched) * RS_Uops_Dispatched.Cycles_None / CPU_CLK_UNHALTED.CORE

OOO Bursts = Uops_Retired.Any - Stalls - Non_Retired

Illustrative Example Only, Not Real Data


Pulling it All Together

Risks Over-counting / Minimizing FE + Scoreboard

But Offers a Guide to Execution Inefficiencies

[Chart: “Cycle Decomposition”, a single bar (vertical axis 0 to 1.2) divided into Uops Retiring, OOO Bursts, Non_retired, FE + Scoreboard, Pipeline Flush, DTLB, L2 Hits, and L2 Misses.]

Illustrative Example Only, Not Real Data


The “Big 4” Events for Performance

CYCLES, STALLS, UNPREFETCHED LOADS and BANDWIDTH

CPU_CLK_UNHALTED.CORE

RS_UOPS_DISPATCHED.CYCLES_NONE

MEM_LOAD_RETIRED.L2_LINE_MISS

BUS_TRANS_ANY.SELF


Architectural Pitfalls: The Ants

Issue | Performance Counter | Approx. Penalty (cycles)
Store to unknown address precedes load | Load_Blocks.ADR | ~5
Store forwarding 4 bytes from middle of 8 | Load_Blocks.Overlap_Store | ~6
Store to known address precedes load offset by N*4096 | Load_Blocks.Overlap_Store | ~6
Load from 2 cache lines (not in L1D) | Load_Blocks.UNTIL_RETIRE | ~22
Load from 2 cache lines with preceding store (not in L1D) | Load_Blocks.UNTIL_RETIRE | ~20
Length Changing Prefix (16-bit imm) | ILD_STALLS | ILD_STALLS, or ~6 per

These contribute to “FE + Scoreboard”.

And don’t forget Micro-Fusion, Macro-Fusion, etc.


A Heuristic Break-down for Stall Analysis

[Flowchart: starting from “Stalls?”, the analysis branches into Front End Stalls, Resource Stalls, Exe Unit Stalls, and Retirement Efficiency and others. Annotations on the branches:

• the “Big 4 (L2 cache)”, L1D cache

• RS related and RAT related

• Register related, Domain related

• Instructions decoding, LCP…]


A Heuristic Break-down for Stall Analysis (cont.)

Stall | Component | Counter Name | Solutions
Front End | L2 cache | MEM_LOAD_RETIRED.L2_LINE_MISS | Alignment
Front End | DTLB | MEM_LOAD_RETIRED.DTLB_MISS | SW prefetch
Front End | L1 data cache | MEM_LOAD_RETIRED.L1D_LINE_MISS |
Front End | Instruction Queue | INST_QUEUE.FULL | Decode pattern
Front End | Branch prediction | RESOURCE_STALLS.BR_MISS_CLEAR | PGO, removing uncertainty or branch
Execution Core | Reservation station | RESOURCE_STALLS.RS_FULL |
Execution Core | ReOrder Buffer | RAT_STALLS.ROB_READ_PORT |
Execution Core | ReOrder Buffer | RESOURCE_STALLS.ROB_FULL |
Execution Core | Dispatching | RS_UOPS_DISPATCHED |
Execution Core | Partial updating | RAT_STALLS.FLAGS | Whole register update
Execution Core | Partial updating | RAT_STALLS.PARTIAL_CYCLES |
Execution Core | Domain swing | RESOURCE_STALLS.FPCW |
Execution Core | Domain swing | FP_MMX_TRANS.TO_MMX |
Execution Core | Domain swing | FP_MMX_TRANS.TO_FP |
Memory | | BUS_TRANS_ANY |


Lab Activity 2: Using SW tools to reduce the instruction count (path length)

In this lab, you will practice using the Intel compiler vectorization switch to reduce the instruction count.


Lab Activity 3: Addressing the performance bottleneck in the Front End

In this lab, you will use “Big 4” event analysis to identify and address a performance issue in the Front End of the processor.


Lab Activity 4: Addressing the performance bottleneck in the Execution Core

In this lab, you will identify and address a performance issue in the execution core of the processor.


A Loop Methodology

• Identify hot functions and raise optimization
  • Fix alignments, split loops to enhance vectorization

• Identify BW limited functions
  • Merge BW loops with FP limited loops

• Identify L2 misses and add SW prefetch

• Optimize flow through OOO Engine
  • Use loop splitting to assist here


More Detailed Event Selection Hierarchy

FIRST PASS EVENTS                     | Sample After Value
CPU_CLK_UNHALTED.CORE                 | 2,000,000
RS_UOPS_DISPATCHED.CYCLES_NONE        | 2,000,000
UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED | 2,000,000
RS_UOPS_DISPATCHED                    | 2,000,000
MEM_LOAD_RETIRED.L2_LINE_MISS         | 10,000
INST_RETIRED.ANY_P                    | 2,000,000

Loops
BUS_TRANS_ANY.SELF                    | 100,000
BUS_TRANS_ANY.ALL_AGENTS              | 100,000

Branch Dominated
RESOURCE_STALLS.BR_MISS_CLEAR         | 2,000,000

SAV values selected so ratio of samples ~ absorbs penalty


More Detailed Event Selection Hierarchy (cont.)

SECOND LEVEL EVENTS                      | Sample After Value
MEM_LOAD_RETIRED.DTLB_MISS               | 20,000
MEM_LOAD_RETIRED.L1D_LINE_MISS           | 200,000
BR_CND_EXEC, BR_CND_EXEC_MISPRED         | 2,000,000
BR_CALL_EXEC, BR_CALL_EXEC_MISPRED       | 200,000
RESOURCE_STALLS.RS_FULL (anti correlate) | 2,000,000
ILD_STALLS                               | 200,000
LOAD_BLOCK.STORE_OVERLAP                 | 200,000

SAV values selected so ratio of samples ~ absorbs penalty

Example: the L1 miss / L2 hit penalty is 10 cycles
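The SAV rule above can be checked with a little arithmetic. The following is a minimal sketch, not part of the original deck: it assumes that one sample stands for SAV occurrences of the event and that cycle events use the 2,000,000 SAV from the first-pass table. The function names are illustrative only.

```python
# Each sample stands for SAV occurrences of its event. If one occurrence
# costs `penalty_cycles`, one sample stands for SAV * penalty_cycles cycles.
CYCLES_SAV = 2_000_000  # SAV listed for CPU_CLK_UNHALTED.CORE in the first-pass table


def cycles_per_sample(sav, penalty_cycles):
    """Estimated cycles represented by a single sample of a penalty event."""
    return sav * penalty_cycles


def sample_weight(sav, penalty_cycles, cycles_sav=CYCLES_SAV):
    """How many cycle samples one penalty-event sample is worth."""
    return cycles_per_sample(sav, penalty_cycles) / cycles_sav


# L1 miss / L2 hit: SAV 200,000 and a 10-cycle penalty (values from the slide).
l1_weight = sample_weight(200_000, 10)
print(l1_weight)  # 1.0
```

With these values one L1 miss sample corresponds to about one cycle sample, so sample counts for the two events can be compared directly.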


Summary

• Utilize Core™ micro-architecture for software performance
  • Front end
  • OOO execution core

• Use the VTune™ analyzer to identify micro-architectural bottlenecks in your software.

• Use a cycles accounting methodology to improve the performance.


Micro-Architecture Comparison

                            | Intel NetBurst™ ++                | NGMA **
Pipeline Stages             | 31                                | 14
Threads per core            | 2                                 | 1
L1 Cache Org.               | (12K uop Trace Cache / 16K Data)  | (32K I / 32K Data)
L2 Cache Org.               | 2 x 2MB                           | 1 x 4MB (shared)
Instr. Decoders             | 1                                 | 4
Integer Units               | 2 (2x core freq)                  | 3 (1x core freq)
SIMD Units                  | 2 x 64-bits                       | 3 x 128-bits
SIMD Inst. Issued per Clock | 1                                 | 3
FP Units                    | 3 (Add/Mul/Div)                   | 3 (Add/Mul/Div)
FP Inst. Issued per Clock   | 1                                 | Up to 2 (Add + Mul or Div)
Power                       | 135W                              | 80W

++ Cedar Mill/Dempsey
** NGMA = Next Generation Micro-Architecture (Conroe/Woodcrest)
= per core


Execution Unit Comparisons

[Diagram: execution unit port layout of the two micro-architectures. Intel NetBurst® Micro-Architecture: ports feeding FP Add/Mul/Div, Integer Shift/Rotate, SIMD, Integer Multiply, and two Integer Arithmetic units running at 2x core frequency. NGMA: Ports 0, 1, and 5 feeding FP Add, FP Div/Mul, Integer Shift/Rotate, SIMD, and Integer Arithmetic units; Port 2: Load; Port 4: Store.]


DTLB Structure

DTLB component | entries | ways | sets | miss event             | ~ miss penalty
L0 small page  | 16      | 4    | 4    | Dtlb_Misses.L0_miss    | 2
L1 small page  | 256     | 4    | 64   | Dtlb_Misses.L1_miss    | typical ~ 10
L0 Large Page  | 16      | 4    | 4    | Dtlb_Misses.L0_miss_LG | 2
L1 Large Page  | 32      | 4    | 8    | Dtlb_Misses.L1_miss_LG | typical ~ 11-12
HW Page Walks  |         |      |      | PMH.Walks              | ~PMH.Cycles

[Chart: DTLB Access Penalty. x-axis: number of pages accessed (0 to 1200); y-axis: cycles (0 to 25). Series: L2 $ Hit, L1DTLB Miss; L1 $ Hit, L1DTLB Miss; L1 $ Hit, L1DTLB Hit.]

Disclaimer: Data is from a pointer chasing microbenchmark and for illustrative purposes only
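The table's geometry can be turned into a rough estimate of how much memory each DTLB level covers. This is a sketch under assumptions that are not on the slide: small pages are taken to be 4 KiB and large pages 2 MiB (large pages can also be 4 MiB in some paging modes). The entry and way counts are copied from the table; the helper names are illustrative.

```python
KIB = 1024
MIB = 1024 * KIB

# (entries, ways, page size). Entries and ways come from the table;
# the page sizes are assumptions (4 KiB small, 2 MiB large).
DTLB_LEVELS = {
    "L0 small page": (16, 4, 4 * KIB),
    "L1 small page": (256, 4, 4 * KIB),
    "L0 large page": (16, 4, 2 * MIB),
    "L1 large page": (32, 4, 2 * MIB),
}


def sets(entries, ways):
    """Number of sets in a set-associative structure."""
    return entries // ways


def reach_bytes(entries, page_size):
    """Memory covered when every entry maps a distinct page."""
    return entries * page_size


for name, (entries, ways, page_size) in DTLB_LEVELS.items():
    reach_kib = reach_bytes(entries, page_size) // KIB
    print(f"{name}: {sets(entries, ways)} sets, reach {reach_kib} KiB")
```

Under the 4 KiB assumption the L1 small-page DTLB covers 256 pages, or 1 MiB; a working set that touches more pages than that can be expected to start missing.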


PEBS Usage and Issues

• Using Precise Event Based Sampling captures architectural state at the time of the event occurrence

• Basic Block Execution = average of inst_retired over the BB

• However, inst_retired does not give a flat distribution within a basic block.
  • Therefore the average over the basic block should be used
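The averaging rule can be sketched as follows. This is an illustration rather than a VTune feature: the input format (a list of per-instruction sample counts for one basic block), the function name, and the sample values are all hypothetical.

```python
def basic_block_executions(samples_per_instruction, sav):
    """Estimate executions of a basic block from INST_RETIRED samples.

    samples_per_instruction: sample counts attributed to each instruction
    of the block (hypothetical input format).
    sav: Sample After Value used for the event.
    """
    if not samples_per_instruction:
        raise ValueError("basic block has no instructions")
    average = sum(samples_per_instruction) / len(samples_per_instruction)
    return average * sav


# Samples pile up unevenly on individual instructions, yet every instruction
# in the block retired the same number of times.
skewed = [0, 7, 1, 0, 12, 0]  # made-up sample counts
estimate = basic_block_executions(skewed, sav=2_000_000)
print(round(estimate))
```

Reading the count from any single instruction (0 or 12 here) would badly under- or over-estimate the block; the average uses all samples that landed in the block.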


Manipulating the XML File

<EVENT>
<HELPID>CB08</HELPID>
<CODE>0xCB</CODE> <!-- event number -->
<UMASK>0x08</UMASK> <!-- event mask or user mask -->
<OTHER>0x53</OTHER> <!-- CMASK, INV, etc. -->
<COMMON>0x601001</COMMON> <!-- bitmask for groups the event is in; add 2 to put it in "favorites" -->
<WEIGHT>0</WEIGHT>
<COUNTER>0</COUNTER> <!-- counters that can be used; precise events must use 0 -->
<NAME>MEM_LOAD_RETIRED.L2_LINE_MISS</NAME>
<DESCRIPTION>L2 cache line missed by retired loads (precise event).</DESCRIPTION>
<HELP_FILE>pmm.chm</HELP_FILE>
<OVERFLOW>10000</OVERFLOW> <!-- default SAV -->
<PRECISE_EVENT>yes</PRECISE_EVENT> <!-- identifier for precise events -->
</EVENT>


DL’s New Favorite

<EVENT>
<HELPID>A000</HELPID>
<CODE>0xA0</CODE>
<UMASK>0x00</UMASK>
<OTHER>0x1D3</OTHER> <!-- setting cmask = 1 and inv = 1 -->
<COMMON>0x503</COMMON>
<WEIGHT>0</WEIGHT>
<COUNTER>0</COUNTER> <!-- forcing counter 0 -->
<NAME>RS_UOPS_DISPATCHED_c1_inv</NAME> <!-- new name -->
<DESCRIPTION>Uops Dispatched from the RS</DESCRIPTION>
<HELP_FILE>pmm.chm</HELP_FILE>
<OVERFLOW>2000000</OVERFLOW>
</EVENT>

Cycles Where NO Uops are Dispatched From RS
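The two OTHER values in these event definitions (0x53 and 0x1D3) can be decoded by hand. The sketch below assumes that OTHER holds the event select register bits above the umask, shifted down by 16 (USR in bit 0, OS in bit 1, E, PC, INT, a reserved bit, EN, INV, then CMASK from bit 8). That layout is inferred from the values on the slides, not from documentation of the XML format, and the helper names are illustrative.

```python
# Bit positions inside OTHER, assuming OTHER = (event select register >> 16).
FLAG_BITS = {"USR": 0, "OS": 1, "E": 2, "PC": 3, "INT": 4, "EN": 6, "INV": 7}
CMASK_SHIFT = 8


def decode_other(other):
    """Split an OTHER value into its flag bits and CMASK."""
    fields = {name: (other >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["CMASK"] = (other >> CMASK_SHIFT) & 0xFF
    return fields


def encode_other(cmask=0, **flags):
    """Build an OTHER value from CMASK and the flag names that are set."""
    value = (cmask & 0xFF) << CMASK_SHIFT
    for name, enabled in flags.items():
        if enabled:
            value |= 1 << FLAG_BITS[name]
    return value


print(decode_other(0x53))   # default event: USR, OS, INT, EN set; CMASK 0
print(decode_other(0x1D3))  # edited event: same flags plus INV; CMASK 1
print(hex(encode_other(cmask=1, USR=1, OS=1, INT=1, EN=1, INV=1)))  # 0x1d3
```

Under this reading, the only difference between the stock value 0x53 and the edited 0x1D3 is INV = 1 and CMASK = 1, which matches the annotation on the OTHER line.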


Loop Distribution for Resource Management

For(i…i++){
  inst1
  inst2
  inst3
  .
  .
  .
  instN (final store)
}

For(i..i+=blk){
  for(j=i;j<i+blk;j++){
    inst1
    inst2
    .
    instM
    store_intermediate[j-i]
  }
  for(j=i;j<i+blk;j++){
    load_intermediate[j-i]
    instM+1
    .
    instN (final store)
  }
}

Shorter Loops -> Greater Unrolling -> Greater ILP
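The transformation can be sketched in runnable form. Python is used here only to show that the distributed loop computes the same result as the original; the performance effect the slide describes depends on compiled code and is not visible in Python. The arithmetic standing in for inst1..instN, the function names, and the block size are all made up for the illustration.

```python
import math


def fused(x):
    """Original loop: every stage runs in one long loop body."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        t = x[i] * x[i] + 1.0        # stands in for inst1..instM
        out[i] = math.sqrt(t) * 0.5  # stands in for instM+1..instN (final store)
    return out


def distributed(x, blk=4):
    """Distributed loop: two short loops per block, linked by a small buffer."""
    n = len(x)
    out = [0.0] * n
    intermediate = [0.0] * blk
    for i in range(0, n, blk):
        end = min(i + blk, n)  # the last block may be shorter than blk
        for j in range(i, end):
            intermediate[j - i] = x[j] * x[j] + 1.0        # store_intermediate[j-i]
        for j in range(i, end):
            out[j] = math.sqrt(intermediate[j - i]) * 0.5  # load_intermediate[j-i]
    return out


data = [float(k) for k in range(10)]
print(fused(data) == distributed(data, blk=4))  # True
```

The intermediate buffer holds only `blk` elements, so it is reused for every block instead of growing with the array.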


Cycle Accounting on X86

• Non retired uop cycles ~ non retired uops / avg_uops_per_cycle
  ~ rs_uops_dispatched:c1 * ( 1 - (uops_retired.any + uops_retired.fused) / rs_uops_dispatched )

CPU_CLK_UNHALTED = Stalls + non_retired + effective
  = rs_uops_dispatched:c1:i1
  + rs_uops_dispatched:c1 * ( 1 - (uops_retired.any + uops_retired.fused) / rs_uops_dispatched )
  + Effective_cycles
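The decomposition above is easy to apply to raw counts. The following is a minimal sketch with made-up numbers; the function and argument names are illustrative, and Effective_cycles is obtained simply as what remains of CPU_CLK_UNHALTED after the other two terms are subtracted.

```python
def cycle_accounting(clk_unhalted, dispatched, dispatch_cycles, stall_cycles,
                     uops_retired_any, uops_retired_fused):
    """Split CPU_CLK_UNHALTED into stall, non-retired, and effective cycles.

    dispatched      -> RS_UOPS_DISPATCHED
    dispatch_cycles -> RS_UOPS_DISPATCHED:c1
    stall_cycles    -> RS_UOPS_DISPATCHED:c1:i1
    """
    if dispatched <= 0:
        raise ValueError("RS_UOPS_DISPATCHED must be positive")
    retired = uops_retired_any + uops_retired_fused
    # Fraction of dispatched uops that never retired, scaled to cycles.
    non_retired = dispatch_cycles * (1.0 - retired / dispatched)
    effective = clk_unhalted - stall_cycles - non_retired
    return {"stalls": stall_cycles, "non_retired": non_retired, "effective": effective}


# Made-up counts, chosen only to make the arithmetic easy to follow.
breakdown = cycle_accounting(
    clk_unhalted=1000, dispatched=2000, dispatch_cycles=800, stall_cycles=200,
    uops_retired_any=1200, uops_retired_fused=300,
)
print(breakdown)  # {'stalls': 200, 'non_retired': 200.0, 'effective': 600.0}
```

In this example a quarter of the dispatched uops never retire, so a quarter of the 800 dispatch cycles (200) is charged to non-retired work.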


Methodology Overview

The traditional view of performance tuning on X86 processors has focused on instruction retirement

The OOO engine has always been viewed as an impenetrable and incomprehensible beast

This is perhaps not the best perspective


Four Component HW Prefetcher

• L1 Cache Prefetch (first in Intel® Core Duo Processor)
  • DCU or Streaming prefetcher
    • DCU = Data Cache Unit
  • IP prefetch
    • Repeated stride load at frequently executed IP

• L2 Prefetch (similar to Pentium™ 4 processor)


VTune™ Analyzer Edit Event

See Backup Slides for Creating New Pre-Edited Events in XML File


Some Features of the PMU

Event select register fields (high bits to low bits): CMASK | INV | EN | INT | PC | E | OS | USR | umask | Event #

• CMASK: value to be compared against
• INV: invert the comparison from GE to LT
• EN: enable counters
• INT: APIC interrupt enable
• PC: pin control
• E: count on changing edge
• OS: count Ring 0 execution
• USR: count Ring 3 execution

Setting CMASK = 1 and INV = 1 for RS_UOPS_DISPATCHED counts cycles where NO UOPS WERE DISPATCHED == Stalls (RS_UOPS_DISPATCHED.CYCLES_NONE)
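The effect of CMASK and INV can be modeled on a per-cycle trace. This is a simplified software model for illustration, not a description of the hardware implementation: the trace values and function name are made up, and INV is ignored when CMASK is 0.

```python
def count_event(per_cycle_values, cmask=0, inv=False):
    """Simplified model of a counter with CMASK/INV applied.

    cmask == 0: add the raw event count of every cycle.
    cmask > 0:  count cycles whose value is >= cmask,
                or < cmask when inv is set.
    """
    if cmask == 0:
        return sum(per_cycle_values)
    if inv:
        return sum(1 for v in per_cycle_values if v < cmask)
    return sum(1 for v in per_cycle_values if v >= cmask)


# Made-up trace: uops dispatched in each of 8 cycles.
trace = [0, 3, 1, 0, 0, 4, 2, 0]
total_uops = count_event(trace)                       # RS_UOPS_DISPATCHED
busy_cycles = count_event(trace, cmask=1)             # RS_UOPS_DISPATCHED:c1
stall_cycles = count_event(trace, cmask=1, inv=True)  # RS_UOPS_DISPATCHED:c1:i1
print(total_uops, busy_cycles, stall_cycles)  # 10 4 4
```

Every cycle falls into exactly one of the two CMASK = 1 buckets, so the busy and stall counts add up to the length of the trace.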


A Methodology?

Total Cycles ~ CPU_CLK_UNHALTED

RS_UOPS_DISPATCHED:c1 (execution cycles)

RS_UOPS_DISPATCHED:c1:i1 (stall cycles)

CPU_CLK_UNHALTED can be decomposed into execution and stall cycles in the OOO engine

Requires >99% CPU Utilization OR User PL only/sampling

EVENTS COUNT EVEN DURING HALTED CYCLES


VTune™ Analyzer Event Basics

Thread Specific and Independent Event Categories

Thread Specific (TS) – Sample count is per logical processor.

Thread Independent (TI) – Sample count is per physical processor.

• All events are attributed to logical processor 0 – WATCH OUT: The Addresses Might Be Incorrect!

Thread specific ESCR limited (TS-E) – Sample count is per logical processor but only data for one logical processor can be captured in a single run.

If not specified, the event is TS.


The Distribution of uops/cycle

emon -q -t0 -C \(RS_UOPS_DISPATCHED:v\) -f $1_uop_count.txt $1

Up to N uops/cycle

emon -q -t0 -C \(RS_UOPS_DISPATCHED:c1:i1:v\) -F $1_uop_count.txt $1

emon -q -t0 -C \(RS_UOPS_DISPATCHED:c2:i1:v\) -F $1_uop_count.txt $1

emon -q -t0 -C \(RS_UOPS_DISPATCHED:c3:i1:v\) -F $1_uop_count.txt $1

emon -q -t0 -C \(RS_UOPS_DISPATCHED:c4:i1:v\) -F $1_uop_count.txt $1

emon -q -t0 -C \(RS_UOPS_DISPATCHED:c5:i1:v\) -F $1_uop_count.txt $1

emon -q -t0 -C \(RS_UOPS_DISPATCHED:c6:i1:v\) -F $1_uop_count.txt $1

emon -q -t0 -C \(RS_UOPS_DISPATCHED:c7:i1:v\) -F $1_uop_count.txt $1

Subtract the N-1 value from the N value to get the number of cycles that dispatched exactly N uops

[Chart: uops dispatched per cycle. x-axis: 0 to 8; y-axis: count, 0 to 18,000,000,000.]

Distribution of the Instruction Level Parallelism (example: a[i] = exp(x[i]); in a simple loop)
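The subtraction step can be sketched as a small post-processing routine. It assumes the counts from the :c1:i1 through :c7:i1 runs have already been read into a list; the parsing of the emon output file is not shown, and the numbers below are made up.

```python
def uops_per_cycle_histogram(less_than_counts):
    """Turn cumulative :cN:i1 counts into cycles with exactly k uops.

    less_than_counts[n - 1] is the count measured with :cN:i1, i.e. the
    number of cycles that dispatched fewer than N uops (N = 1, 2, ...).
    """
    histogram = {}
    previous = 0
    for k, cumulative in enumerate(less_than_counts):
        histogram[k] = cumulative - previous  # cycles with exactly k uops
        previous = cumulative
    return histogram


# Made-up counts for c1:i1 .. c7:i1 (cumulative, so non-decreasing).
measured = [400, 550, 700, 820, 900, 960, 1000]
hist = uops_per_cycle_histogram(measured)
print(hist)  # {0: 400, 1: 150, 2: 150, 3: 120, 4: 80, 5: 60, 6: 40}
```

Because each run is a separate measurement, real counts vary slightly from run to run, and a difference can come out slightly negative for a workload that is not perfectly repeatable.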