tensorflow-tracing Performance Tuning in Production May 2019 Sayed Hadi Hashemi Paul Rausch Benjamin Rabe Kuan-Yen Chou Simeng Liu Volodymyr Kindratenko Roy H Campbell

Performance Tuning in Production - USENIX...tensorflow-tracing Performance Tuning in Production May 2019 Sayed Hadi Hashemi Paul Rausch Benjamin Rabe Kuan-Yen Chou Simeng Liu Volodymyr

tensorflow-tracing: Performance Tuning in Production

May 2019

Sayed Hadi Hashemi, Paul Rausch, Benjamin Rabe, Kuan-Yen Chou, Simeng Liu, Volodymyr Kindratenko, Roy H. Campbell

Performance Changes

[Figure: Iteration Time (y-axis) plotted against Time (x-axis)]

Change of Model / Hyperparameters: e.g. Batch Size

Storage / Network / Memory

Driver / Software Stack / Misconfiguration

Performance Tuning: Developer

Code Probe: Python
DAG Probe: TensorBoard
Whole DAG Runtime Execution: Chrome Tracing

Images Credit: Google Brain
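
The Chrome Tracing view named on this slide loads JSON in the Trace Event Format. As a rough illustration only (this is not the code TensorFlow or tensorflow-tracing uses), the sketch below times two stand-in steps and serializes them as complete ("X") events. The `record` helper and the step names are assumptions made up for this example.

```python
import json
import time


def record(events, name, fn, pid=0, tid=0):
    """Run fn() and append one complete ("X") trace event describing it."""
    start = time.perf_counter()
    result = fn()
    end = time.perf_counter()
    events.append({
        "name": name,
        "ph": "X",                   # complete event: a start plus a duration
        "ts": start * 1e6,           # timestamps are expressed in microseconds
        "dur": (end - start) * 1e6,  # duration, also in microseconds
        "pid": pid,
        "tid": tid,
    })
    return result


def to_chrome_trace(events):
    """Serialize events in the JSON object form the trace viewer accepts."""
    return json.dumps({"traceEvents": events})


events = []
record(events, "load_batch", lambda: sum(range(10_000)))  # hypothetical step
record(events, "train_step", lambda: sum(range(50_000)))  # hypothetical step
trace_json = to_chrome_trace(events)
```

Saving `trace_json` to a file and loading that file from chrome://tracing should show the two steps as bars on one timeline row.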

Performance Tuning: Admin

Application-Level
  Pros
    Effective
  Cons
    Code Modification
    Advance Planning
    Could be complicated (e.g. T2T)

Resource-Level (netstat, nvidia-smi, NSight, dstat)
  Pros
    Easy to Use
    No Code Modification
    No Advance Planning
    General Availability
  Cons
    Too Coarse
    Don’t distinguish different tasks
    The report time is too small
    Data is hard to interpret without context
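
To make the resource-level trade-off concrete, here is a minimal sketch of sampling one of the tools listed above, nvidia-smi, from Python. It assumes an NVIDIA driver is installed when `sample_gpus` is called; the helper names and the example output string are made up for illustration.

```python
import subprocess

# nvidia-smi query for per-GPU utilization (%) and used memory (MiB) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'utilization, memory' lines (one per GPU) into a list of dicts."""
    samples = []
    for index, line in enumerate(text.strip().splitlines()):
        # Assumes numeric fields; unsupported counters would need extra handling.
        util, mem = (field.strip() for field in line.split(","))
        samples.append({"gpu": index, "util_pct": int(util), "mem_used_mib": int(mem)})
    return samples


def sample_gpus():
    """Run nvidia-smi once and parse its output (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Made-up output for two GPUs, so the parser can be tried without a GPU.
example = parse_gpu_csv("87, 10240\n3, 512\n")
```

The samples are per device, not per job or per task, which is the "don't distinguish different tasks" limitation noted on the slide.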

Challenges: Admin

Detect Problems
  Find the Baseline
  Detect Anomaly

Root Cause Analysis
  Runtime Profiling/Tracing without modification/planning
  Data Exchange
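
One simple way to approach "Find the Baseline" and "Detect Anomaly" is to summarize early iteration times and flag later iterations that drift far from that summary. The sketch below uses a median and a median absolute deviation (MAD). The threshold, the function names, and the sample timings are assumptions chosen for illustration; the slides do not say how tensorflow-tracing itself detects anomalies.

```python
import statistics


def fit_baseline(iteration_times):
    """Summarize baseline iteration times as (median, median absolute deviation)."""
    median = statistics.median(iteration_times)
    mad = statistics.median([abs(t - median) for t in iteration_times])
    return median, mad


def is_anomalous(iteration_time, baseline, threshold=5.0):
    """Flag an iteration that lies more than `threshold` MADs from the median."""
    median, mad = baseline
    # Guard against a zero MAD when the baseline samples are all identical.
    spread = max(mad, 1e-9)
    return abs(iteration_time - median) / spread > threshold


# Made-up iteration times in seconds, for illustration only.
baseline = fit_baseline([0.50, 0.52, 0.49, 0.51, 0.50, 0.53])
flags = [is_anomalous(t, baseline) for t in [0.51, 0.50, 0.95]]
```

A median-based summary is used here because a few slow warm-up iterations would otherwise skew a mean-based baseline.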

MonkeyPatching
  Intercepts Framework Calls
  No need for code modification

Admin Portal
  Runs at the start of a job
  Collects Task-Based Profiling to Establish Baseline
  On-Demand Tracing / No need for advance planning

Tracing File Format
  Portable format
  CLI to explore traces

tensorflow-tracing: MonkeyPatching

[Diagram: Application, session.run, TensorFlow, with tensorflow-tracing MonkeyPatching placed between the application and TensorFlow]

Disabled
  No interception
  Only Manage Selected Sessions

Per Application
  Intercept an application
  Manage all the sessions

System-wide
  Intercept the global library
  Manage all applications
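
The general monkeypatching pattern behind this slide can be sketched in a few lines of Python. The `Session` class below is a stand-in written for this example, not the real TensorFlow class, and `patch_run` is a hypothetical helper; the sketch shows the technique, not tensorflow-tracing's actual implementation.

```python
import functools
import time


class Session:
    """Stand-in for a framework session class; NOT the real TensorFlow API."""

    def run(self, fetches):
        return [f * 2 for f in fetches]


def patch_run(cls, on_call):
    """Replace cls.run with a wrapper that reports each call's duration."""
    original = cls.run

    @functools.wraps(original)
    def traced_run(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original(self, *args, **kwargs)
        finally:
            # Report the elapsed time even if the wrapped call raised.
            on_call(time.perf_counter() - start)

    cls.run = traced_run

    def unpatch():
        cls.run = original

    return unpatch


durations = []
unpatch = patch_run(Session, durations.append)

# Application code is unchanged: it still just calls session.run(...).
result = Session().run([1, 2, 3])

unpatch()  # restore the original method
```

Because the patch is applied to the class rather than to one instance, every session created in the process is intercepted, which loosely corresponds to the "Manage all the sessions" mode above.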

tensorflow-tracing: Admin Portal

Separate Different Tasks

tensorflow-tracing: Collection

Profile
  Collect Automatically
  Low Overhead (≈0%)
  Establish the Baseline

Trace
  Collect On Demand
  High Overhead (≈3%)
  Root Cause Analysis
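
The split between an always-on profile and an on-demand trace can be illustrated with a small collector. This is a sketch under stated assumptions: the `Collector` class, its fields, and the task name are hypothetical, the data tensorflow-tracing really gathers is richer, and the overhead figures on the slide are not reproduced here.

```python
import time
from collections import defaultdict


class Collector:
    """Keeps cheap per-task totals always; keeps per-call events only on demand."""

    def __init__(self):
        self.profile = defaultdict(lambda: {"calls": 0, "total_s": 0.0})
        self.trace_events = []
        self.tracing = False  # flipped on demand, e.g. by an admin

    def observe(self, task, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        elapsed = time.perf_counter() - start

        # Profile path: two counters per task, updated on every call.
        stats = self.profile[task]
        stats["calls"] += 1
        stats["total_s"] += elapsed

        # Trace path: one record per call, stored only while tracing is enabled.
        if self.tracing:
            self.trace_events.append(
                {"task": task, "start": start, "elapsed_s": elapsed}
            )
        return result


collector = Collector()
collector.observe("train", sum, [1, 2, 3])  # profiled only
collector.tracing = True                    # on-demand tracing starts
collector.observe("train", sum, [4, 5, 6])  # profiled and traced
collector.tracing = False
```

The profile stays small no matter how long the job runs, while the trace grows with every call, which is one reason to keep tracing off by default.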

tensorflow-tracing: Availability

Deploy
  Campus-wide Deep Learning Cluster
  In use at NCSA since Fall 2018

Apache-2
  Downloaded +4k times from Pip

Quick Start
  pip install tensorflow-tracer

Source Code
  https://github.com/xldrx/tensorflow-tracer

Demo

Experiences: Common Causes

Network
  Transfer Timing [Hashemi et al, SysML19]
  Congestion
  Wrong Network Interface

Storage
  NFS Exhaustion - Rogue Application
  Small Reads vs TFRecords

Platform
  Software Stack
  Drivers
  Containers

Device Placement
  CPU/GPU
  Locality

TicTac Result

Hashemi et al, SysML19

Tensor2Tensor

tensor2tensor

Questions

Image Credit: The Neverhood

This work is supported by: National Science Foundation under Grant No. 1725729

Quick Start
  pip install tensorflow-tracer

Source Code
  https://github.com/xldrx/tensorflow-tracer
