21
Zellescher Weg 12 WIL A 208 Tel. +49 351 - 463 – 34217 Michael Kluge ([email protected]) Visualization of Lustre RPC Traces with Vampir EOFS Workshop September 2011 – Paris Center for Information Services and High Performance Computing

Visualization of Lustre RPC Traces with Vampir · 2016. 3. 10. · Zellescher Weg 12 WIL A 208 Tel. +49 351 - 463 – 34217 Michael Kluge ([email protected]) Visualization

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

  • Zellescher Weg 12

    WIL A 208

    Tel. +49 351 - 463 – 34217

    Michael Kluge ([email protected])

    Visualization of Lustre RPC Traces with Vampir

    EOFS Workshop September 2011 – Paris

    Center for Information Services and High Performance Computing

  • Folie 2

    Content

    !   Problem statement

    !   OTF and Vampir introduction

    !   Converter design

    !   Time synchronisation

    !   Screenshots

    !   Directions

  • Folie 3

    Problem statement

    !   Lustre RPC traces consist of millions of events

    !   Data size is in the GB range

    !   One trace per server and client

    !   Log files are text files

    !   Evaluation?

    !   Visualization?

  • Folie 4

    OTF

    !   One index file

    !   One definition file

    !   Multiple event streams

    !   Each stream can store events for many processes

    !   Typical usage: application traces

  • Folie 5

    Vampir - Overview

    Michael Kluge

    Vampir Trace

    Vampir Trace

    Trace File

    (OTF)

    Vampir 7

    Trace Bundle

    VampirServer

    CPU CPU

    CPU CPU CPU CPU

    CPU CPU

    Multi-Core Program

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    Many-Core Program

  • Folie 6

    Vampir - Performance Radar

    !   Color coded

    !   many counter timelines

    !   Like heat maps in gnuplot

    !   Basic arithmetics on counter data

    Michael Kluge

  • Folie 7

    Approach

    !   Rewrite Lustre RPC logs to OTF trace file bundle

    !   Use the Performance Radar to show:

    – RPCs in flight per client – Average RPC completion time – Queue times on the server – Different types of RPCs – …

  • Folie 8

    Challenges

    !   Conversation of lots of unsynchronized individual files

    !   Find a ‘file name’ to ‘host name’ mapping

    !   Final event streams need to be ordered in time

    !   Five events per RPC

    – Client send –  Server receive –  Server start work –  Server end work – Client ack

    !   Catch different protocol versions

    !   Deal with server side logs only

    client

    server

  • Folie 9

    Time Synchronisation

    !   Initial time offsets unknown

    !   Unknown whether tracing started synchronously

    !   Solution:

    –  Provide a {RPC log file, file type, NID list} tuple – Read the first N seconds from each file –  Try to find matching RPC pairs – Create a timer offset map

  • Folie 10

    timeline for a server

    Internal Data Structure

    XID OpCode Client Server

    ptr ptr

    ptr

    ptr ptr

    Time Type RPC

    Time Type RPC

    Time Type RPC

    Time Type RPC

    Time Type RPC

    Time Type RPC

    Time Type RPC

    Time Type RPC

    timeline for a client

    Time Type RPC

    Time Type RPC

    Time Type RPC

    Time Type RPC

    Time Type RPC

    Time Type RPC

    Time Type RPC

    Time Type RPC

  • Folie 11

    100GBit Testbed

    QDR InfiniBand

    Cluster IB-S

    witc

    h 5x

    Server 1

    Server 2

    Server 16

    16x 10G Uplink 100G Link, 37 Miles 16x 10G Downlink

    16 Servers 16x FC8 to 2x S2A9900

  • Folie 12

    Screenshot – RPCs in flight on the clients

  • Folie 13

    Screenshot – average time to complete an RPC

  • Folie 14

    Screenshot – average time to complete an RPC

  • Folie 15

    Screenshot – RPCs queued on the server

  • Folie 16

    Screenshot – average RPC processing time (1)

  • Folie 17

    Screenshot – average RPC processing time (2)

  • Folie 18

    Screenshot – Jaguar (server log only)

  • Folie 19

    Where to go from here

    !   More metrics needed?

    !   Use markers to mark interesting locations

    !   Locks are in the log files

    !   How about a complete log from jaguar

    !   Other possibilities to explore:

    – make use of the Python interface for OTF – make use of the Octave import for OTF files

  • Folie 20

    Time for Questions

  • Folie 21

    Some internal limits

    !   Queued (open) RPCs per process

    !   Maximum RPC completion time

    !   Time window for average calculations

    !   Maximum allowed time synchronisation