33
PTU – Intel® Performance Tuning Utility

Nehalem Intel PTU Guide

Embed Size (px)

Citation preview

Page 1: Nehalem Intel PTU Guide

PTU – Intel® Performance Tuning Utility

Page 2: Nehalem Intel PTU Guide

10/23/20092

Intel® PTU Sample ViewBasic block grouping and execution flow analysis

Page 3: Nehalem Intel PTU Guide

10/23/20093

Core 2 Architecture - Basic Blocks

L2

MS

Instruction Fetch16 bytes/clk

Alloc & Rename4 µops/clk

ReOrder Buffer 96 entries

Port 0ALU/SSE

FMul

DCU32 Kb

Reservation Station 32 entries

Port 1ALU/SSE

FAdd

Port 5ALU/SSEBranch

Port 2LOADS

Port 3STA

Next IP

FS

B

I$

Length Decoding6 inst/clk

IQ / LSD18 entries

BPU32 bytes/clk

BAC

MOB32 LB + 20 SB

Router

DTLB

DPL

L2Q/XQ

Decoders 4:1:1:17 µops/clk

IDQ11 entries

4 µops/clk

Port 4STD

dispatch 6 µops/clk

FB (8 entries)

Result BusRRF

Page 4: Nehalem Intel PTU Guide

10/23/20094

Execution

Reservation station (32 entries)F

P

FMUL

FDIV

FADD

FSHUF

FPROM

SIM

D SIALU2

SISHIFT

SIALU1 SIMUL

SISHUF

INT

ALU1

SHIFT1

JEU

ALU2 ALU3

SHIFT2

LEA

IMUL

Re

su

lt B

us

ROB

LOADAGU

STOREAGU

STOREDATA(MIU)

DCU

Port 5Port 0 Port 1 Port 4Port 2 Port 3 M

OBLB

32 entriesSB

20 entries

DTLB

L2 / FSB

PMH

Page 5: Nehalem Intel PTU Guide

10/23/20095

Core™ Architecture PMU

The PMU of Intel® Core™ architecture offers some 700 events ( including sub-events ) compared to– Netburst™ Architecture: Some 140

Event counters (5 in total) are not identical– Three frequently used (“basic”) events INST_RETIRED.ANY,

CPU_CLK_UNHALTED.CORE and CPU_CLK_UNHALTED.REFhave dedicated counter registers (1:1 mapping)

– Some events are only counted by PMC0 ( precise and at-retirement events not belonging to the three events above )

– All other events are counted either by PMC0 or PMC1

Counter structure allows collection of up to 5 events in one run - however, depending on the event selection, typically 2-3 only– This is important to remember when running VTune/PTU

Sampling

Page 6: Nehalem Intel PTU Guide

10/23/20096

PTU – Background and Motivation

A replacement for Intel® VTune™ Performance Analyzer ?

• No ! - but an alternative

• A lot of components are shared ( e.g. Linux* Sampling drivers )

• A free tool but requires Intel® VTune™ Performance Analyzer license

A tool for experienced developers – Intel internally and externally

• Less focus on user guidance and help functionality as VTune

• Focus is on getting results fast; pragmatic view

• No registration/RPM/registry – simply unpack and start !

• Same look and feel / same GUI - on all platforms ( Eclipse-based )

• Has been Intel-internal tool only but now made available to everybody !

A test environment for sophisticated new features/prototypes available in VTune (potentially) only in future releases

• Includes pre-launch platform support

Make simple things easy, make extraordinary possible

Page 7: Nehalem Intel PTU Guide

10/23/20097

Intel® PTU Analysis Methods at a Glance

Event Based Sampling (PMU hardware Events)

• Optional event Multiplexing – more events collected in one run

• Includes sophisticated Data Access Profiling

• Includes features to compare multiple runs

Statistical Call Graph Sampling (with loop detection)

• No instrumentation / very low overhead

• Statistical (pragmatical) approach – thus not 100% precise

Instrumentation-based Call Graph/Count Analysis

• Instrumentation similar to VTune Callgraph but using PIN technology

• Intrusive (factor 2 or more) but 100% precise

Heap / Memory Access Profiling

Page 8: Nehalem Intel PTU Guide

10/23/20098

Structural Components

Data collectors – command utilities

- Driver based and non-driver based collectors

- Collection controllers

Data Viewers – command utilities

- Data converters

- Symbol resolver

- Filtering, Comparing and Displaying

Eclipsed-based GUI (“a plug-in for Eclipse”)

- Windows* and Linux* - single look and feel

- Pre-built Eclipse package part of distribution

* Other names and brands may be claimed as the property of others.

Page 9: Nehalem Intel PTU Guide

10/23/20099

PTU Architecture – High Level Overview

Eclipse* GUI

SEP

Command LineCollectorsvtsarun,vtssrun

Command LineViewersvtsaview,

vtsadiff, vtssview

ExperimentDirectory

ConfigurationCollection

Command LineQueries

Data

Data conversion(symbols, tb5 -> database)

* Other names and brands may be claimed as the property of others.Kernel Driver PMU

Hardware

Page 10: Nehalem Intel PTU Guide

10/23/200910

Command Line utilities

vtsarun – Sampling collector

vtssrun – Statistical Call graph collector

vtcgrun – Exact Call graph collector

vthprun – Heap profiler collector

vtsaview – Sampling viewer

vtdpview – Data profiler viewer

vtsadiff – Hotspot difference viewer

vtssrun – Statistical Call graph viewer

vtcgview – Exact Call graph viewer

vthpview – Heap profiler viewer

Collectors Viewers

Command line utilities can be used independent of graphical interfaceGUI functionality is completely based on these “batch applications”

Page 11: Nehalem Intel PTU Guide

10/23/200911

Some Interesting Usability Features

Basic blocks as underlying data representation

• Aggregation, manipulation per basic block

Drill-down to Source, Disassembly and Control Flow graph view for a procedure

• All three (can be) displayed simultaneously

Results difference (text mode, csv export, GUI mode)

• GUI for different binaries, and for identical sources

CPU Frequency Changing

• Ability to modify CPU frequency via GUI

Configuration per PMU, export/import

Sampling Hotspot View expansion to show events per CPU

Event over IP chart

Assembler display choice ( Intel versus AT&T )

Page 12: Nehalem Intel PTU Guide

10/23/200912

PTU Workspace & Projects

Main concepts

Eclipse* Workspace

– Project

• Experiment

– Modules (for experiment sharing)

– Sources (for experiment sharing)

• Project defines the application and its workload (parameters)

• Simply zipping up an experiment creates a self contained test case

PTU Project is workload specification

* Other names and brands may be claimed as the property of others.

Page 13: Nehalem Intel PTU Guide

10/23/200913

PTU Profile Configurations

• Workspace, Project, Profile Configuration are in GUI world only

– Command line tools know only about Experiments

• 5 predefined configurations come in PTU distribution

– Basic Statistical Callgraph, Basic Sampling and others

– You can create your own configurations as well

Project + Profile Configuration produces the Experiment

* Other names and brands may be claimed as the property of others.

• Project is definition of the workload

• Profile Configuration is definition of collection method

– You apply a Profile Configuration to a Project to collect Experiment

– GUI generates a command line for collector and invokes it

Page 14: Nehalem Intel PTU Guide

10/23/200914

Project vs. Profiling Configuration

Experiment ResultProject 3

Experiment ResultProject 2

Project 1

Configuration 2Configuration 1

Configurations

Projects

• Profiling Configuration is orthogonal to the Project concept

• Profiling Configuration created by user can be applied to any Project

Page 15: Nehalem Intel PTU Guide

10/23/200915

Event based Sampling

Use Basic Sampling predefined configuration or create your own with events you need in the experiment

Granularity in base display is function, module, basic block or instruction (RVA)

Standard drill down to source/asm by row in spread sheet

Page 16: Nehalem Intel PTU Guide

10/23/200916

Add the Control Flow Graph with the Tool bar Button

Page 17: Nehalem Intel PTU Guide

10/23/200917

All Three Views at once

Page 18: Nehalem Intel PTU Guide

10/23/200918

Events over IP histogram

• In Hotspot view right click on row of interest, appropriate item invokes

Events over IP histogram for selected module

• Drill-down to highlighted region of histogram to source and/or disassembly view

Page 19: Nehalem Intel PTU Guide

10/23/200919

Data Access Profiling

Intel® Pentium® processors, Intel® Core™2 processor family and Itanium® Processor family have Precise Event Based Sampling (PEBS)

Precise events have hardware to capture the Instruction Pointer of the instruction that caused the event

• Ex: On Intel® Core™2 Processor family, an L2 load miss that retrieves a cacheline can be identified with MEM_LOAD_RETIRED.L2_LINE_MISS

Further the values of all registers are also captured in a buffer

• At completion of the instruction on Intel® Core™2 Processor Family

• Prior to execution of the instruction on Pentium® 4 Processors

Coupled with the disassembly, the addresses of load and store instructions can be reconstructed

Page 20: Nehalem Intel PTU Guide

10/23/200920

Basic Data Access Profiling – Easy to Use Pre-Defined Event Set

Page 21: Nehalem Intel PTU Guide

10/23/200921

Data Access Result: Here Heavy Contention on marked LineMultiple Threads Accessing Different Offsets Indicate False Sharing

Page 22: Nehalem Intel PTU Guide

10/23/200922

But the reality is like this …

This was the model that we used so far…

Aggregate over time

Aggregate over time + over one more

dimension

From one Hotspot View to two of them…

Page 23: Nehalem Intel PTU Guide

10/23/200923

Filtering - projecting slices onto another dimension

Sorting – repositioning segments of the axes

Applying granularity – changing scale of the axis

Intel® PTU Data Access profiling operates with two hotspots views: Code and Data hotspots

Easy Navigation between both “Dimensions”

Page 24: Nehalem Intel PTU Guide

10/23/200924

Differences of EBS Measurments

Intel® PTU 3.x supports an analysis of differences of experiments

This requires

• Event names must be the same

• Load Modules have the same names– They can be the same, with data taken on different machines

– They can be different but built from the same source

• Allowing differences to be analyzed down to source view

– They can be completely different (sources and binaries)

• PTU will compare functions with the same names for modules with the same names

Single click on first experiment

Ctrl single click on second experiment

Right click and select compare

Page 25: Nehalem Intel PTU Guide

10/23/200925

Data blocked 2X2 unrolled Matrix Multiply compiled at -O2

Page 26: Nehalem Intel PTU Guide

10/23/200926

Data blocked 2X2 unrolled Matrix Multiply compiled at -O3 –QxT

Page 27: Nehalem Intel PTU Guide

10/23/200927

Only Significant Difference is Cycle CountCreate Difference DisplayControl click to select 2 experiments

Right click to select “Compare Experiments”

Page 28: Nehalem Intel PTU Guide

10/23/200928

Differences of SamplesDifferences in Cycles Shown in msec to Correct for Comparison of Machines at Different Frequencies

Page 29: Nehalem Intel PTU Guide

10/23/200929

Same Source can Display Difference per Source Line

Page 30: Nehalem Intel PTU Guide

10/23/200930

Statistical Call Graph

Along with standard features supports multithreaded apps and 32 bit binaries on 64 bit OS’s

Easy to identify path to the hotspot

Page 31: Nehalem Intel PTU Guide

10/23/200931

Exact Call Graph

• Reconstructs the application’s Call graph by instrumenting function entry and exit points using the PIN dynamic instrumentation framework

• Records a caller-callee pair together with timing and call count information

• Text viewer provides the following information:

– List of functions with their parents, children, and timing information

– Flat data: list of functions with timing and call information with no relationships

– Per-thread view for multithreaded applications

• Collection overhead for computationally-intensive console applications 1.8-2x

Page 32: Nehalem Intel PTU Guide

10/23/200932

Call Count GUI View

• Displays a limited set of exact call graph data: a function name, module name, and call count, which is a number of times the function was called

Page 33: Nehalem Intel PTU Guide

10/23/200933

Summary

• Intel® PTU - Easy-to-use performance debugging tool

• Focus on getting results quickly

• The tool is easy to use but the performance methodology supported perfectly by this tool requires some experience – it is not a tool for the beginner !!

• Using it and providing feedback /submitting bugs you help make it better for you and contribute to future Performance Suite

– User Forum at whatif.intel.com

Call to Action:

• Download, install and use Intel® PTU !