Initial Characterization of I/O in Large-Scale Deep Learning Applications
Fahim Chowdhury, Jialin Liu, Quincey Koziol, Thorsten Kurth, Steven Farrell, Suren Byna, Prabhat, and Weikuan Yu
November 12, 2018



Page 1

Initial Characterization of I/O in Large-Scale Deep Learning Applications

Fahim Chowdhury, Jialin Liu, Quincey Koziol, Thorsten Kurth, Steven Farrell, Suren Byna, Prabhat, and Weikuan Yu

November 12, 2018

Page 2

Outline


Objectives

DL Benchmarks at NERSC

Profiling Approaches

Experimental Results

Future Work


Page 4

Objectives

Deep Learning (DL) applications demand large-scale computing facilities.

DL applications require efficient I/O support in the data processing pipeline to accelerate the training phase.

The goals of this project are:
Exploring I/O patterns invoked by multiple DL applications running on HPC systems
Addressing possible bottlenecks caused by I/O in the training phase
Developing optimization strategies to overcome the possible I/O bottlenecks


Page 6

Outline


Objectives

DL Benchmarks at NERSC

Profiling Approaches

Experimental Results

Future Work

Page 7

HEPCNNB Overview

High Energy Physics Deep Learning Convolutional Neural Network Benchmark (HEPCNNB)

Runs on distributed TensorFlow using Horovod
Can generate particle events that can be described by Standard Model physics, as well as particle events with R-parity-violating supersymmetry
Uses a 496 GB dataset of 2048 HDF5 files representing particle collisions generated by Delphes, a fast Monte-Carlo generator, at CERN
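Reading 2048 HDF5 files across many Horovod workers implies partitioning the file list by rank. As an illustration of that idea only (this is not the benchmark's actual code, and `shard_files` is a hypothetical helper), a contiguous near-equal split might look like:

```python
def shard_files(files, rank, num_ranks):
    """Assign each rank a contiguous, near-equal slice of the file list.

    Illustrative sketch: HEPCNNB's actual partitioning scheme may differ.
    """
    per_rank, remainder = divmod(len(files), num_ranks)
    start = rank * per_rank + min(rank, remainder)
    end = start + per_rank + (1 if rank < remainder else 0)
    return files[start:end]

# Example: 2048 HDF5 files split across 8 ranks -> 256 files each,
# with every file assigned to exactly one rank.
files = [f"collisions_{i:04d}.h5" for i in range(2048)]  # hypothetical names
shards = [shard_files(files, r, 8) for r in range(8)]
```

Under such a scheme each rank reads only its own files, which matters later when comparing local versus global shuffling.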

Page 8

CDB Overview

Climate Data Benchmark (CDB)

Runs on distributed TensorFlow using Horovod
Can act as an image recognition model to detect extreme-weather patterns
Uses a 3.5 TB dataset of 62,738 HDF5 images representing climate data
Leverages the TensorFlow Dataset API and Python's multiprocessing package for input pipelining
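The point of input pipelining is to overlap data loading with training compute so the GPU is not starved. The benchmark does this with tf.data and multiprocessing; the toy `prefetch` generator below (a hypothetical helper, not the benchmark's code) sketches the same producer/consumer idea with a background thread and a bounded buffer:

```python
import queue
import threading

def prefetch(generator, buffer_size=4):
    """Run `generator` in a background thread, buffering up to
    `buffer_size` items so the consumer rarely waits on I/O.

    Minimal sketch of the input-pipelining idea; the benchmark itself
    uses the TensorFlow Dataset API rather than this helper.
    """
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks end of stream

    def worker():
        for item in generator:
            q.put(item)
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# Example: batches are produced in the background while being consumed.
batches = list(prefetch(iter(range(10))))
```

The bounded queue is the key design choice: it applies backpressure so the loader cannot run arbitrarily far ahead of the trainer and exhaust memory.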

Page 9

Outline


Objectives

DL Benchmarks at NERSC

Profiling Approaches

Experimental Results

Future Work

Page 10

Profiling Approaches

Develop a Python-based TimeLogger tool to profile the application layer
Determine the total latency from a merged interval list for each training component
Explore the TensorFlow Runtime Tracing Metadata Visualization (TRTMV) tool developed at Google and extract I/O-specific metadata
Working on integrating runtime metadata from the application and framework layers

Work available at: https://github.com/NERSC/DL-Parallel-IO
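Deriving total latency from a merged interval list means coalescing overlapping timestamped intervals before summing, so concurrent operations are not double-counted. A minimal sketch of that computation (an assumed illustration; TimeLogger's actual implementation in the repository above may differ):

```python
def total_latency(intervals):
    """Merge overlapping (start, end) intervals, then sum their lengths.

    Sketch of the merged-interval-list idea: two reads that overlap in
    time contribute only their combined wall-clock span, not the sum of
    their individual durations.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return sum(end - start for start, end in merged)

# Two overlapping reads (0-4 s and 3-7 s) plus a separate one (10-12 s)
# cover 7 s + 2 s = 9 s of wall-clock time.
latency = total_latency([(0.0, 4.0), (3.0, 7.0), (10.0, 12.0)])
```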


Page 11

Outline


Objectives

DL Benchmarks at NERSC

Profiling Approaches

Experimental Results

Future Work

Page 12

HEPCNNB Latency Breakdown

[Chart: I/O share of training latency — Local Shuffle: 3.60%, 3.08%, 3.17%, 2.91%, 1.44%; Global Shuffle: 8.01%, 7.72%, 6.83%, 6.16%, 1.49%]

I/O takes more time when Global Shuffling is introduced.
Global Shuffling affects I/O even for a small dataset and only 5 epochs of training.
The I/O bottleneck can become more severe with increasing epochs.
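The local/global distinction drives the I/O cost: local shuffling permutes samples only within each worker's shard, so every worker keeps reading the same files, while global shuffling permutes across the whole dataset, scattering reads over all files. A toy sketch of the two policies (hypothetical helper names, assuming the shard size divides evenly; not the benchmark's code):

```python
import random

def local_shuffle(dataset, num_shards, seed=0):
    """Shuffle each shard independently: samples never cross shard
    boundaries, so each worker's reads stay within its own files."""
    rng = random.Random(seed)
    size = len(dataset) // num_shards
    shards = [dataset[i * size:(i + 1) * size] for i in range(num_shards)]
    for shard in shards:
        rng.shuffle(shard)
    return [x for shard in shards for x in shard]

def global_shuffle(dataset, seed=0):
    """Shuffle across the entire dataset: any sample can end up on any
    worker, which spreads reads over every file each epoch."""
    rng = random.Random(seed)
    out = list(dataset)
    rng.shuffle(out)
    return out

data = list(range(16))
local = local_shuffle(data, num_shards=4)   # first 4 entries stay in {0..3}
glob = global_shuffle(data)                 # any entry anywhere
```

Global shuffling gives better statistical mixing per epoch, but as the measurements above suggest, it trades locality for randomness and raises I/O cost.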

Page 13

HEPCNNB Read Bandwidth

I/O takes more time when Global Shuffling is introduced.
Global Shuffling affects I/O even for a small dataset and only 5 epochs of training.
The I/O bottleneck can become more severe with increasing epochs.

[Chart: read bandwidth — Local Shuffle: 9.91, 21.69, 44.20, 91.53, 194.99; Global Shuffle: 4.33, 8.76, 15.90, 30.86, 187.99]

Page 14

CDB Latency and Read Bandwidth

The percentage of training time spent on I/O is higher when the dataset is larger.
The I/O percentage increases with the number of nodes.
Training benefits more from scaling than I/O does.

[Chart: I/O latency share — 8.73%, 15.05%, 10.63%, 11.04%; read bandwidth — 0.30, 0.33, 1.14, 2.57]

Page 15

Outline


Objectives

DL Benchmarks at NERSC

Profiling Approaches

Experimental Results

Future Work

Page 16

Future Work

To integrate TRTMV results with TimeLogger data for better profiling of the highly parallelized I/O pipeline
To explore the I/O patterns and determine possible I/O bottlenecks in distributed TensorFlow
To develop an optimized cross-framework I/O strategy to overcome the possible I/O bottlenecks

Page 17


Thank You