
Page 1

FLEET: Flexible Efficient Ensemble Training for Heterogeneous Deep Neural Networks

Hui Guan, Laxmikant Kishor Mokadam, Xipeng Shen, Seung-Hwan Lim, Robert Patton

Page 2

Build an image classifier?

Storage → Pre-processing (CPU) → Training (GPU)

Pre-processing: decoding, rotation, cropping, …

Training the Deep Neural Network (DNN) involves hyperparameter tuning:
• # layers
• # parameters in each layer
• Learning rate scheduling
• …
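The CPU-side pre-processing stage above (decode → rotate → crop) is essentially a data pipeline. Below is a minimal sketch assuming TensorFlow's tf.data API; the slides do not prescribe a framework, and the file pattern, rotation, and crop size here are illustrative only:

    import tensorflow as tf

    def preprocess(path):
        raw = tf.io.read_file(path)                          # read encoded bytes from storage
        img = tf.io.decode_jpeg(raw, channels=3)             # decoding
        img = tf.image.rot90(img)                            # rotation (fixed 90 degrees for simplicity)
        img = tf.image.random_crop(img, size=[224, 224, 3])  # cropping (assumes images >= 224x224)
        return tf.cast(img, tf.float32) / 255.0

    dataset = (tf.data.Dataset.list_files("train/*.jpg")     # hypothetical data directory
               .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
               .batch(32)
               .prefetch(tf.data.AUTOTUNE))                  # overlap CPU pre-processing with GPU training

The prefetch at the end is what realizes the CPU/GPU split on this slide: the CPU prepares the next batches while the GPU trains on the current one.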

Page 3

Ensemble Training
• Concurrently train a set of DNNs on a cluster of nodes.

Storage feeds N independent pipelines, one per model:
  Pre-processing → Training (model 1)
  Pre-processing → Training (model 2)
  …
  Pre-processing → Training (model N)

Page 4

Each pipeline repeats the same work on the same data:
  Storage → Pre-processing → Training (model 1)
  Storage → Pre-processing → Training (model 2)
  …
  Storage → Pre-processing → Training (model N)

Pre-processing is redundant across the pipelines.

Page 5

Pittman et al., 2018: eliminate pipeline redundancies in pre-processing through data sharing.
• Reduces CPU usage by 2-11X
• Achieves up to 10X speedups with 15% energy consumption

Pittman, Randall, et al. "Exploring flexible communications for streamlining DNN ensemble training pipelines." SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018.

Page 6

Ensemble training with data sharing

Storage → Pre-processing (shared, on the CPU) → Training of models 1, 2, …, N (each on a GPU)

Page 7

[Same pipeline: storage → shared pre-processing (CPU) → training of models 1…N (GPU)]

With data sharing, the training goes even slower!

Page 8

Heterogeneous Ensemble

A set of DNNs with different architectures and configurations:
• Varying training rate
• Varying convergence speed

Page 9

Heterogeneous Ensemble: varying training rate

Training rate: the compute throughput of the processing units used for training the DNN. In the example pipeline, the three models train at 40, 100, and 100 images/sec.

Page 10

Heterogeneous Ensemble: varying training rate

If a DNN consumes data more slowly (40 images/sec vs. 100 images/sec for the others), the other DNNs have to wait for it before the current set of cached batches can be evicted: the slowest DNN becomes the bottleneck.

Page 11

Heterogeneous Ensemble: varying training rate, varying convergence speed

Due to differences in architectures and hyper-parameters, some DNNs converge more slowly than others (e.g., 50 epochs vs. 40 epochs).

Page 12

Heterogeneous Ensemble: varying convergence speed

A subset of DNNs has already converged (at 40 epochs), while the shared pre-processing has to keep working for the remaining ones (50 epochs), leaving resources under-utilized.

Page 13

Our solution: FLEET

A flexible ensemble training framework for efficiently training a heterogeneous set of DNNs, addressing both varying training rate and varying convergence speed.

1.12-1.92X speedup

Page 14

Our solution: FLEET

A flexible ensemble training framework for efficiently training a heterogeneous set of DNNs: 1.12-1.92X speedup.

Contributions:
1. Optimal resource allocation

Page 15

Our solution: FLEET

A flexible ensemble training framework for efficiently training a heterogeneous set of DNNs: data-parallel distributed training for the varying training rate; checkpointing for the varying convergence speed.

Contributions:
1. Optimal resource allocation
2. Greedy allocation algorithm

Page 16

Our solution: FLEET

(As above: data-parallel distributed training for varying training rate; checkpointing for varying convergence speed.)

Contributions:
1. Optimal resource allocation
2. Greedy allocation algorithm
3. A set of techniques to solve challenges in implementing FLEET

Page 17

Focus of This Talk

Contributions:
1. Optimal resource allocation
2. Greedy allocation algorithm
3. A set of techniques to solve challenges in implementing FLEET

Page 18

Resource Allocation Problem

[Shared pre-processing on the CPU feeds DNN 1, DNN 2, …, DNN N, each training on GPUs]

Optimal CPU allocation: set the number of pre-processing processes to the value that just meets the computing requirements of the training DNNs.

What is an optimal GPU allocation?

Page 19

GPU Allocation

[DNN 1, DNN 2, DNN 3, and DNN 4, each on one GPU; two GPUs per node across Node 1 and Node 2]

Page 20

GPU Allocation: 1 GPU to 1 DNN

[DNN 1-DNN 4 on four GPUs across Node 1 and Node 2]

Training rates: 100, 80, 80, and 40 images/sec.

With data sharing, the slowest DNN determines the training rate of the ensemble training pipeline.
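For the rates on this slide, that means the shared pipeline advances only at the pace of its slowest consumer:

    r_{\text{pipeline}} = \min(100,\ 80,\ 80,\ 40) = 40 \text{ images/sec}

so the three faster DNNs spend most of their time waiting.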

Page 21

GPU Allocation: Different GPUs to Different DNNs

Another way to allocate GPUs: train only DNN 1 and DNN 4 together with data sharing. DNN 1 keeps one GPU (100 images/sec); DNN 4 gets the other three GPUs (105 images/sec with data-parallel training).

Reduce waiting time. Increase utilization.
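A quick check of the gain, using only the numbers from this slide and the previous one: with one GPU per DNN, the pair {DNN 1, DNN 4} would run at min(100, 40) = 40 images/sec; with the 1 + 3 split it runs at min(100, 105) = 100 images/sec, a 2.5X higher pipeline rate.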

Page 22

GPU Allocation: Different GPUs to Different DNNs

Flotilla: a set of DNNs to be trained together with data sharing (e.g., DNN 1 and DNN 4).

[Same allocation: DNN 1 on one GPU at 100 images/sec; DNN 4 on three GPUs at 105 images/sec]

Page 23

GPU Allocation: Different GPUs to Different DNNs

We need to create a list of flotillas to train all DNNs to convergence.

Flotilla 1: DNN 1 (1 GPU), DNN 4 (3 GPUs).
Flotilla 2: DNN 2 and DNN 3 (2 GPUs each).
…

Page 24

Optimal Resource Allocation

Given a set of DNNs to train and a cluster of nodes, find:
(1) the list of flotillas, and
(2) the GPU assignment within each flotilla,
such that the end-to-end ensemble training time is minimized.

This problem is NP-hard.
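One way to write the objective down (a sketch only; the slides do not give the exact formulation, so the symbols below are our assumptions): let F be the list of flotillas, g_i the number of GPUs given to DNN i, r_i(g_i) its profiled training rate, |D| the dataset size, and E_f the number of epochs run during flotilla f. Since data sharing ties each flotilla to its slowest member, the end-to-end time is

    T(\mathcal{F}, g) = \sum_{f \in \mathcal{F}} \frac{E_f \cdot |D|}{\min_{i \in f} r_i(g_i)},
    \qquad \text{minimize } T \ \text{ s.t. } \sum_{i \in f} g_i \le G \ \ \forall f \in \mathcal{F},

where G is the total number of GPUs in the cluster.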

Page 25

Greedy Allocation Algorithm

Dynamically determine the list of flotillas based on:
(1) whether each DNN has converged, and
(2) the training rate of each DNN.

Once a flotilla is created, derive an optimal GPU assignment for it.

Page 26

Greedy Allocation Algorithm

Starting from the DNN ensemble, loop until every DNN has converged:
• Create a new flotilla
• Assign GPUs to the DNNs in the flotilla
• Train the DNNs in the flotilla with data sharing
• Profile the training rates of each DNN on m GPUs
Converged DNNs leave the ensemble.
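As a structural sketch, the loop can be written as below. All names are illustrative placeholders rather than FLEET's actual API, and "training" is simulated so the snippet runs standalone; a rate-aware create_flotilla that reproduces the Page 30 example appears after that slide.

    # Sketch of the greedy outer loop; helper names are hypothetical.

    def train_with_data_sharing(flotilla, epochs_left, epochs_per_round=10):
        """Simulate one flotilla round; return the DNNs that converged."""
        for dnn in flotilla:
            epochs_left[dnn] -= epochs_per_round
        return {dnn for dnn in flotilla if epochs_left[dnn] <= 0}

    def create_flotilla(remaining, num_gpus):
        """Placeholder policy: up to num_gpus DNNs, one GPU each.
        (Pages 29-30 refine this using profiled training rates.)"""
        return set(sorted(remaining)[:num_gpus])

    def greedy_ensemble_training(dnns_epochs, num_gpus):
        epochs_left = dict(dnns_epochs)
        remaining = set(epochs_left)
        while remaining:
            flotilla = create_flotilla(remaining, num_gpus)
            converged = train_with_data_sharing(flotilla, epochs_left)
            remaining -= converged          # converged DNNs release their GPUs

    greedy_ensemble_training({"DNN 1": 40, "DNN 2": 40, "DNN 3": 40, "DNN 4": 50},
                             num_gpus=4)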

Page 27

Greedy Allocation Algorithm: profiling

Training rates (images/sec) of DNNs on 1-4 GPUs:

#GPUs    1    2    3    4
DNN 1  100  190  270  350
DNN 2   80  150  220  280
DNN 3   80  150  200  240
DNN 4   40   75  105  120

Page 28

Greedy Allocation Algorithm

The loop breaks down into three steps:
Step 1: Flotilla Creation
Step 2: GPU Assignment
Step 3: Model Training

Page 29

Step 1: Flotilla Creation

#1: DNNs in the same flotilla should be able to reach a similar training rate when a proper number of GPUs is assigned to each DNN.
→ Reduces GPU waiting time.

#2: Pack as many DNNs as possible into one flotilla.
→ Avoids inefficiency due to sublinear scaling; allows more DNNs to share pre-processing.

Page 30

Step 1: Flotilla Creation

Using the profiled training rates (images/sec):

#GPUs    1    2    3    4
DNN 1  100  190  270  350
DNN 2   80  150  220  280
DNN 3   80  150  200  240
DNN 4   40   75  105  120

DNN 1: #GPUs = 1 (100 images/sec). DNN 4: #GPUs = 3 (105 images/sec).
GPUs available: 4 - 1 → 3; 3 - 3 → 0.
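Here is a sketch of a flotilla-creation rule that reproduces this example. The matching rule, "closest achievable rate to the fastest single-GPU DNN", is one plausible reading of the slide and an assumption on our part, not necessarily FLEET's exact rule:

    RATES = {  # profiled images/sec on 1-4 GPUs (table above)
        "DNN 1": [100, 190, 270, 350],
        "DNN 2": [80, 150, 220, 280],
        "DNN 3": [80, 150, 200, 240],
        "DNN 4": [40, 75, 105, 120],
    }

    def create_flotilla(rates, remaining, num_gpus):
        # Seed with the fastest single-GPU DNN; its rate becomes the target
        # the rest of the flotilla should match (principle #1 on Page 29).
        seed = max(remaining, key=lambda d: rates[d][0])
        target = rates[seed][0]
        flotilla, free = {seed: 1}, num_gpus - 1
        candidates = set(remaining) - {seed}
        while free > 0 and candidates:
            # Among all (DNN, #GPUs) options that fit, pick the one whose
            # profiled rate is closest to the target.
            best = None
            for dnn in candidates:
                for g in range(1, min(free, len(rates[dnn])) + 1):
                    diff = abs(rates[dnn][g - 1] - target)
                    if best is None or diff < best[2]:
                        best = (dnn, g, diff)
            dnn, g, _ = best
            flotilla[dnn] = g
            free -= g
            candidates.remove(dnn)
        return flotilla

    print(create_flotilla(RATES, set(RATES), num_gpus=4))
    # -> {'DNN 1': 1, 'DNN 4': 3}: DNN 1 at 100 and DNN 4 at 105 images/sec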

Page 31

Step 2: GPU Assignment

#1: When assigning multiple GPUs to a DNN, try to use GPUs in the same node.
#2: Try to assign DNNs that need a smaller number of GPUs to the same node.
→ Reduces the variation in communication latency.

Example (2 GPUs per node): Node 1 hosts DNN 1 and one GPU of DNN 4; Node 2 hosts DNN 4's other two GPUs.
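A small sketch of a locality-aware placement following these two heuristics, under assumptions: fixed GPU count per node, and illustrative names that are not FLEET's API.

    def assign_gpus(flotilla, gpus_per_node, num_nodes):
        """flotilla: dict DNN -> #GPUs. Returns dict DNN -> list of (node, gpu) slots."""
        free = {n: list(range(gpus_per_node)) for n in range(num_nodes)}
        placement = {}
        # Place DNNs with larger GPU demands first, so they can still find
        # co-located GPUs within a single node when possible (heuristic #1).
        for dnn in sorted(flotilla, key=lambda d: -flotilla[d]):
            need = flotilla[dnn]
            node = next((n for n in free if len(free[n]) >= need), None)
            slots = []
            if node is not None:
                slots = [(node, free[node].pop()) for _ in range(need)]
            else:
                # Otherwise spill across nodes, draining the emptiest nodes first;
                # small DNNs then end up packed together on what remains (heuristic #2).
                for n in sorted(free, key=lambda n: -len(free[n])):
                    while free[n] and len(slots) < need:
                        slots.append((n, free[n].pop()))
            placement[dnn] = slots
        return placement

    print(assign_gpus({"DNN 1": 1, "DNN 4": 3}, gpus_per_node=2, num_nodes=2))
    # One GPU of DNN 4 shares a node with DNN 1, matching the slide's example
    # up to node renaming.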

Page 32

Greedy Allocation Algorithm

Step 1 (Flotilla Creation) and Step 2 (GPU Assignment) address the varying training rate, using data-parallel distributed training to speed up the slower DNNs.

Page 33

Greedy Allocation Algorithm

Step 3 (Model Training) addresses the varying convergence speed, using checkpointing so that un-converged DNNs can pause when a flotilla ends and resume in a later one.
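A minimal sketch of per-DNN checkpointing between flotillas, assuming TensorFlow's tf.train.Checkpoint API; the slides do not show FLEET's actual implementation, and the directory layout here is illustrative.

    import tensorflow as tf

    def make_manager(dnn_id, model, optimizer, directory="ckpts"):
        # One checkpoint per DNN, so each model can pause when its flotilla
        # ends and resume in a later flotilla exactly where it left off.
        ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer,
                                   epoch=tf.Variable(0))
        manager = tf.train.CheckpointManager(ckpt, f"{directory}/{dnn_id}",
                                             max_to_keep=1)
        return ckpt, manager

    # On flotilla teardown:  manager.save()
    # On flotilla startup:   ckpt.restore(manager.latest_checkpoint)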

Page 34

Step 3: Model Training

(Profiled training rates as on Page 27.)

Once a DNN converges, mark it as complete and release its GPUs.

Stop training the flotilla once fewer than 80% of its GPUs remain active for training.
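The stop rule can be stated in a few lines; a sketch with illustrative names (flotilla maps each DNN to its GPU count, as in the earlier sketches):

    def should_stop_flotilla(flotilla, converged, threshold=0.8):
        total = sum(flotilla.values())                       # all GPUs in the flotilla
        active = sum(g for dnn, g in flotilla.items() if dnn not in converged)
        return active < threshold * total                    # end flotilla below 80% active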

Page 35

Step 3: Model Training

Consider only the un-converged DNNs when creating the next flotilla.

Page 36

Experiment Settings
• Heterogeneous ensemble
  • 100 DNNs derived from DenseNets and ResNets
  • Training rate on a single GPU: 21-176 images/sec
• Summit-Dev @ ORNL, each node with:
  • 2 IBM POWER8 CPUs and 256 GB DRAM
  • 4 NVIDIA Tesla P100 GPUs
• Dataset
  • Caltech256: 30K training images (240-minute limit)

Page 37

Counterparts for Comparisons
• Baseline
  • Train each DNN on one GPU independently
  • Randomly pick one yet-to-be-trained DNN whenever a GPU is free
• Homogeneous training (Pittman et al., 2018)
  • Train each DNN on one GPU with data sharing
  • When #GPUs < #DNNs, randomly pick a subset of DNNs to train after the previous subset is done
• FLEET-G (global paradigm)
• FLEET-L (local paradigm)
  • Train remaining DNNs once some GPUs are released
  • Pick the DNNs to train using the greedy algorithm from FLEET-G

Page 38

End-to-End Speedups

[Bar chart: speedup (0-2.5X over the baseline) for (#GPUs, #DNNs) settings from (20,100) to (160,100); bars for Homogeneous [24], FLEET-L, and FLEET-G.]

Homogeneous: slowdowns occur because the other GPUs wait for the slowest DNN to finish.

Page 39

End-to-End Speedups

[Same bar chart as the previous slide.]

FLEET-G: the best overall performance, with 1.12-1.92X speedups over the baseline.

FLEET-L: notable but smaller speedups, due to less favorable allocation decisions.

The overhead of scheduling and checkpointing is at most 0.1% and 6.3%, respectively, of the end-to-end training time across all settings.

Page 40

Conclusions and Future Work

• Systematically explored strategies for flexible ensemble training of a heterogeneous set of DNNs
  • Optimal resource allocation → greedy allocation algorithm
  • Software implementation: data-parallel distributed training, dynamic GPU-DNN mappings, checkpointing, data sharing
• Future work: apply FLEET to real hyperparameter tuning and neural architecture search workloads
