Kernel-level Measurement for Integrated Parallel Performance Views

A. Nataraj, A. Malony, S. Shende, A. Morris
{anataraj,malony,sameer,amorris}@cs.uoregon.edu
Performance Research Lab, University of Oregon
Cluster 2006


Outline

- Motivations
- Objectives
- ZeptoOS project
- KTAU Architecture
- Case Studies - the Performance Views
- Perturbation Study
- KTAU improvements
- Future work
- Acknowledgements


Motivation

- Application performance is a consequence of both user-level execution and OS-level operation
- Good tools exist for observing user-level performance: user-level events, communication events, execution time measures, hardware performance
- Fewer tools exist to observe OS-level influences on application performance; ideally we would like to observe both simultaneously


Scale and Performance Sensitivity

- HPC systems continue to scale to larger processor counts; application performance becomes more sensitive, and OS factors can lead to performance bottlenecks [Petrini'03, Jones'03, ...]
- System/application performance effects are complex; isolating system-level factors is non-trivial
- Comprehensive performance understanding requires observing all performance factors, their relative contributions, and their interrelationships. Can we correlate OS and application performance?


Phase Performance Effects

(Figure: phase profile showing waiting time due to the OS; the overhead accumulates across phases.)


Program - OS Interactions

- Direct: applications invoke the OS for certain services - syscalls, and internal OS routines called from syscalls
- Indirect: OS operations without explicit invocation by the application - preemptive scheduling, (HW) interrupt handling, OS background activity; these can occur at any time
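Direct interactions can be timed even from user space. A minimal sketch (illustrative only, not KTAU, assuming a Linux/POSIX environment) that times a single direct interaction, a read() syscall:

    /* Time one direct program-OS interaction (a read() syscall).
     * Indirect activity (an interrupt, a preemption) landing inside
     * the measured window is silently charged to read() here, which
     * is exactly the blind spot kernel-level measurement removes. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        char buf[4096];
        struct timespec t0, t1;
        int fd = open("/dev/zero", O_RDONLY);
        if (fd < 0) return 1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        ssize_t n = read(fd, buf, sizeof buf);   /* direct interaction */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("read() of %zd bytes took %ld ns\n", n,
               (t1.tv_sec - t0.tv_sec) * 1000000000L
               + (t1.tv_nsec - t0.tv_nsec));
        close(fd);
        return 0;
    }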


Program - OS Interactions (continued)

- Direct interactions are easier to handle: synchronous with user code, in process context
- Indirect interactions are more difficult: usually asynchronous and in interrupt context, harder to measure (where are the boundaries?), and harder to correlate and integrate with application measurements


Performance Perspectives

- Kernel-wide: aggregate kernel activity of all active processes; useful to understand overall OS behavior and to identify and remove kernel hot spots, but cannot show application-specific OS actions
- Process-centric: OS performance in a specific application context; virtualizes and maps performance back to processes; shows interactions among programs, daemons, and system services; exposes sources of performance problems; supports tuning the OS for a specific workload, and the application for the OS
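On stock Linux the two perspectives map loosely onto existing /proc interfaces; a sketch (illustrative only, not the KTAU API):

    /* Contrast the two perspectives using standard /proc files:
     * /proc/stat aggregates CPU time kernel-wide, while
     * /proc/self/stat attributes activity to this one process. */
    #include <stdio.h>

    static void first_line(const char *path) {
        char line[512];
        FILE *f = fopen(path, "r");
        if (!f) return;
        if (fgets(line, sizeof line, f))
            printf("%s: %s", path, line);
        fclose(f);
    }

    int main(void) {
        first_line("/proc/stat");       /* kernel-wide view */
        first_line("/proc/self/stat");  /* process-centric view */
        return 0;
    }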


Existing Approaches

- User-space-only measurement tools: many tools work only at user level and cannot observe system-level performance influences
- Kernel-level-only measurement tools: most provide only the kernel-wide perspective and lack proper mapping/virtualization; some provide process-centric views but cannot integrate OS and user-level measurements


Existing Approaches (continued)

- Combined or integrated user/kernel measurement tools: a few tools allow fine-grained measurement and can correlate kernel and user-level performance, but they typically focus only on direct OS interactions; indirect interactions are not normally merged, and parallel workloads (MPI ranks, OpenMP threads, ...) are not explicitly recognized

We need an integrated approach to parallel performance observation and analysis that supports both perspectives.


High-Level Objectives

- Support low-overhead OS performance measurement at multiple levels of function and detail
- Provide both kernel-wide and process-centric perspectives of OS performance
- Merge user-level and kernel-level performance information across all program-OS interactions
- Provide online information and the ability to function without a daemon where possible
- Support both profiling and tracing for kernel-wide and process-centric views in parallel systems
- Leverage existing parallel performance analysis tools for observing, collecting, and analyzing parallel data


ZeptoOS

- The DOE "OS/RTS for Extreme Scale Scientific Computation" program (effective OS/runtime for petascale systems) funded the ZeptoOS project: Argonne National Lab and University of Oregon
- What are the fundamental limits and advanced designs required for petascale operating system suites? Behaviour at large scales; management and optimization of OS suites; collective operations; fault tolerance; OS performance analysis


ZeptoOS and TAU/KTAU

- Each component of the ZeptoOS work requires a lot of fine-grained OS measurement
- How and why do the various OS source and configuration changes affect parallel applications?
- How do we correlate performance data between OS components, and between the parallel application and the OS?
- Solution: TAU/KTAU, an integrated methodology and framework to measure the performance of applications and the OS kernel


ZeptoOS Strategy

- "Small Linux on big computers": IBM BG/L and other systems (e.g., Cray XT3)
- Argonne: modified Linux on BG/L I/O nodes (ION); modified Linux for BG/L compute nodes (TBD); specialized I/O daemon on the I/O node (ZOID) (TBD)
- Oregon: KTAU - integration of the TAU infrastructure into the Linux kernel; integration with ZeptoOS and installation on the BG/L ION; ports to other 32-bit and 64-bit Linux platforms


KTAU Architecture


Controlled Experiments

- Exercise the kernel in a controlled fashion; check whether KTAU produces the expected, correct, and meaningful views
- Test machines:
  - Neutron: 4-CPU Intel P3 Xeon, 550 MHz, 1 GB RAM, Linux 2.6.14.3 (ktau)
  - Neuronic: 16 nodes, 2-CPU Intel P4 Xeon, 2.8 GHz, 2 GB RAM/node, RedHat Enterprise Linux 2.4 (ktau)
- Benchmarks:
  - NPB LU and SP applications [NPB]: simulated computational fluid dynamics (CFD) applications; a regular-sparse, block lower and upper triangular system solution
  - LMBENCH [LMBENCH]: suite of micro-benchmarks exercising the Linux kernel; scenarios - interrupt load, scheduling load, exceptions


Controlled Experiments: Observing Interrupts

Benchmark: NPB-SP application on 16 nodes (user-level inclusive and exclusive time views).

- Approx. 25 secs dilation in total inclusive time. Why?
- Approx. 16 secs dilation in MPSP() exclusive time. Why?
- The user-level profile does not tell the whole story!


Controlled Experiments: Observing Interrupts (continued)

User+OS inclusive and exclusive time views. Kernel-space time is taken by: do_softirq(), schedule(), do_IRQ(), sys_poll(), icmp_rcv(), icmp_reply().

With OS time merged in, the MPSP() exclusive-time difference is only 4 secs, and the exclusive-time view clearly identifies the culprits: schedule(), do_IRQ(), icmp_reply(), do_softirq().

Pings cause interrupts (do_IRQ), which are in turn handled after the interrupt by soft-interrupts (do_softirq); the actual handling routine is icmp_reply/icmp_rcv. The large number of softirqs causes ksoftirqd to be scheduled in, causing SP to be scheduled out.
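Interrupt pressure of this kind can also be corroborated from user space; a sketch (not part of KTAU) that dumps the kernel's per-CPU interrupt counters, to be snapshotted before and after a run:

    /* Dump /proc/interrupts: one row per IRQ line with per-CPU counts.
     * Diffing two snapshots taken around a run shows which lines
     * (e.g. the NIC interrupts raised by the pings above) fired. */
    #include <stdio.h>

    int main(void) {
        char line[1024];
        FILE *f = fopen("/proc/interrupts", "r");
        if (!f) { perror("/proc/interrupts"); return 1; }
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }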


Controlled Experiments: Observing Scheduling

- NPB LU application on 8 CPUs (neuronic.nic.uoregon.edu)
- Simulate daemon interference using a "toy" daemon that periodically wakes up and performs some activity (a sketch of such a daemon follows below)
- What does KTAU show? Different views...

A: Aggregated kernel-wide view (each row is a single host)
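The slides do not include the toy daemon's code; a minimal sketch of such a periodic interference generator, where the 100 ms period and 10 ms busy burst are invented for illustration:

    /* Hypothetical "toy" interference daemon: sleep, then burn CPU
     * for a short burst, preempting whatever else runs on this CPU. */
    #include <time.h>

    static void burn_ns(long long ns) {
        struct timespec t0, t;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        do {
            clock_gettime(CLOCK_MONOTONIC, &t);
        } while ((t.tv_sec - t0.tv_sec) * 1000000000LL
                 + (t.tv_nsec - t0.tv_nsec) < ns);
    }

    int main(void) {
        struct timespec period = { 0, 100 * 1000 * 1000 };  /* 100 ms */
        for (;;) {
            nanosleep(&period, NULL);     /* quiet phase: victim runs */
            burn_ns(10LL * 1000 * 1000);  /* 10 ms busy burst */
        }
    }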


Controlled Experiments: Observing Scheduling (continued)

B: Process-level view (each row is a single process on host 8). Selecting node 8 for a closer look shows the 2 NPB LU processes plus the 'toy' daemon activity.


Controlled Experiments: Observing Scheduling (continued)

C: Voluntary / involuntary scheduling (each row is a single MPI rank)

- Instrumentation differentiates voluntary from involuntary scheduling; the experiment was re-run on a 4-processor SMP
- Local slowdown shows up as preemptive scheduling: one rank is preempted by the 'toy' daemon
- Remote slowdown shows up as voluntary scheduling: the other 3 ranks yield the CPU voluntarily and wait!
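KTAU obtains this distinction by instrumenting the scheduler itself. As a rough user-space proxy, Linux kernels newer than the 2.6.14 used here expose per-process counters for the same split; a sketch:

    /* Print the voluntary vs. nonvoluntary (preemptive) context-switch
     * counters for the current process. A coarse stand-in only: counts,
     * not the per-event durations that scheduler probes can record. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (!f) { perror("/proc/self/status"); return 1; }
        while (fgets(line, sizeof line, f))
            if (strstr(line, "ctxt_switches"))
                fputs(line, stdout);
        fclose(f);
        return 0;
    }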


Controlled Experiments: Observing Exceptions

(Figures: LMBENCH page-fault scenario - call group relations and the program-OS call graph.)


Controlled Experiments: Tracing

Merging application and OS traces (MPI_Send and the OS routines beneath it):

- Fine-grained tracing shows detail inside interrupts and bottom halves
- Visualized using VAMPIR trace visualization [VAMPIR]


Larger-Scale Runs

- Run parallel benchmarks at larger scale (128 dual-CPU nodes)
- Identify (and remove) system-level performance issues; understand the perturbation overheads introduced by KTAU
- NPB benchmark: LU application [NPB] - simulated computational fluid dynamics (CFD) application; a regular-sparse, block lower and upper triangular system solution
- ASC benchmark: Sweep3D [Sweep3D] - solves a 3-D, time-independent, neutron particle transport equation on an orthogonal mesh
- Test machine: Chiba City Linux cluster (ANL) - 128 dual-CPU Pentium III nodes, 450 MHz, 512 MB RAM/node, Linux 2.6.14.2 (ktau) kernel, connected by Ethernet


Larger-Scale Runs

- Experienced problems on Chiba City by chance: initially ran the NPB-LU and Sweep3D codes in a 128x1 configuration, then in a 64x2 configuration
- Extreme performance hit (72% slower!) with the 64x2 runs
- Used KTAU views to identify and solve issues iteratively, eventually bringing the performance gap down to 13% for LU and 9% for Sweep3D


Larger-scale Runs

64x2 configuration: user-level MPI_Recv vs. MPI_Recv OS interactions

- Two ranks show relatively very low MPI_Recv() time
- The same two ranks' MPI_Recv() differs from the mean in OS-SCHED (OS-level scheduling)


Larger-scale Runs

Voluntary vs. preemptive scheduling (note: x-axis log scale)

- Two ranks have very low voluntary scheduling durations
- The same two ranks have very large preemptive scheduling durations


Larger-scale Runs

Why the preemption? Node-level view of interrupt activity on ccn10:

- Only the NPB LU processes (PID 4066 and PID 4068) are active - no other significant activity!
- 64x2 pinned: interrupt activity is bimodal across the MPI ranks


Larger-scale Runs

TCP within the compute phase (time and calls):

- Many more OS-TCP calls, taking approx. 100% longer; 100% more background OS-TCP activity in the compute phase means more imbalance!
- The 'merged' performance data identifies overlap and imbalance. Why does a purely compute-bound region have lots of I/O?
  1. I/O-compute overlap: explicit MPI or OS buffering (TCP sends)
  2. Imbalance in workload: TCP receives from on-time finishers


Larger-scale Runs

Cost per call of OS-level TCP: OS-TCP is costlier on SMP.

- IRQ balancing blindly distributes interrupts and bottom halves, e.g. handling a TCP-related bottom half (BH) on CPU 0 for an LU process running on CPU 1
- The result is cache issues! [COMSWARE]
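A standard Linux countermeasure (a general mechanism, not one the slides prescribe) is to pin an interrupt line to the CPU running its consumer, so the TCP bottom halves stay cache-warm; the IRQ number below is a placeholder:

    /* Pin a (hypothetical) IRQ line 24 to CPU 1 by writing a CPU
     * bitmask to /proc/irq/<N>/smp_affinity. Requires root. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/irq/24/smp_affinity", "w");
        if (!f) { perror("smp_affinity"); return 1; }
        fprintf(f, "2\n");   /* bitmask 0x2 = CPU 1 only */
        fclose(f);
        return 0;
    }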


Perturbation Study

Five configurations:

- Baseline: vanilla kernel, un-instrumented benchmark
- Ktau-Off: kernel instrumentation compiled in but turned off (disabled-probe effect)
- Prof-All: all kernel instrumentation turned on
- Prof-Sched: only the scheduler subsystem's instrumentation turned on
- Prof-All+TAU: Prof-All, plus user-level TAU instrumentation enabled

Workloads (test machine: Chiba City, ANL):

- NPB LU application benchmark: 16 nodes, all 5 configurations, mean over 5 runs each
- ASC Sweep3D: 128 nodes, Baseline and Prof-All+TAU, mean over 5 runs each


Perturbation Study

- The disabled-probe effect and single-subsystem instrumentation (e.g. scheduling) are very cheap
- The complete integrated profiling cost is under 3% on average, and as low as 1.58%

Sweep3D on 128 nodes:

                   Base      Prof-All+TAU
  Elapsed time     368.25    369.9
  Avg. slowdown              0.49%


Future Work

- Dynamic measurement control
- Improved performance data sources
- Improved integration with TAU's user-space capabilities: better correlation of user and kernel performance; full callpaths and phase-based profiling; merged user/kernel traces (already available)
- Integration of TAU and KTAU with Supermon
- Porting efforts to IA-64, PPC-64, and AMD Opteron
- ZeptoOS characterization efforts: BG/L I/O node, dynamically adaptive kernels


Recent Work on ZeptoOS Project

- Accurate identification of "noise" sources: modified Linux on BG/L should be efficient
- Effect of OS "noise" on synchronization / collectives
- Which OS aspects (code paths, configurations, attached devices) induce which types of interference?
- This requires both user-level and OS measurement: if the noise sources can be identified, the interference can be removed or alleviated


Approach

- ANL's Selfish benchmark identifies "detours" (noise events) in user space: it runs a tight loop with an expected (ideal) duration and logs the times and durations of detours. It shows the durations and frequencies of events but NOT their cause or source (a minimal sketch of the detour loop follows below)
- KTAU OS tracing records OS activity; correlating by time of occurrence (KTAU uses the same time source as the Selfish benchmark) lets us infer the type of OS activity, if any, causing each detour
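A minimal sketch of the detour-loop idea (not ANL's actual Selfish code; the calibration pass and the 100x threshold are invented for illustration):

    /* Selfish-style detour detector: timestamp in a tight loop; any
     * gap far above the calibrated minimum means the loop was
     * interrupted (scheduled out, interrupt handled, etc.). */
    #include <stdio.h>
    #include <time.h>

    static long long now_ns(void) {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return (long long)t.tv_sec * 1000000000LL + t.tv_nsec;
    }

    int main(void) {
        long long best = -1;
        for (int i = 0; i < 1000; i++) {        /* calibrate ideal gap */
            long long t0 = now_ns(), t1 = now_ns();
            if (best < 0 || t1 - t0 < best) best = t1 - t0;
        }
        if (best < 1) best = 1;
        for (long i = 0; i < 10000000L; i++) {  /* detect detours */
            long long t0 = now_ns(), t1 = now_ns();
            if (t1 - t0 > 100 * best)           /* arbitrary threshold */
                printf("detour at %lld ns, lasted %lld ns\n", t0, t1 - t0);
        }
        return 0;
    }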


OS/User Performance View of Scheduling

(Figure: merged OS/user timeline highlighting preemptive scheduling.)


OS/User View of OS Background Activity




Other Applications of KTAU

Need to fill this in … (is there time)


Acknowledgements

- Department of Energy's Office of Science
- National Science Foundation
- University of Oregon (UO) core team: Aroon Nataraj (PhD student), Prof. Allen D. Malony, Dr. Sameer Shende (Senior Scientist), Alan Morris (Senior Software Engineer), Suravee Suthikulpanit (MS student, graduated)
- Argonne National Lab (ANL) contributors: Pete Beckman, Kamil Iskra, Kazutomo Yoshii