Virtualized Infrastructures, Systems, & Applications
KnowledgeNet: Disaggregated and Distributed Training and Serving of Deep Neural Networks
Saman Biookaghazadeh, Yitao Chen, Kaiqi Zhao, Ming Zhao
Arizona State University
http://visa.lab.asu.edu
Background
● Deep neural networks (DNNs) have shown great potential
○ Solving many challenging problems
○ Relying on large systems to learn on big data
● Limitations of centralized learning
○ Cannot scale to handle future demands
  • "20 billion connected devices by 2020"
  • Real-time learning/decision making
○ Cannot exploit the capabilities of the increasingly powerful edge
  • Multiprocessing, accelerators
Objectives
● Distributed learning across heterogeneous resources
○ Including diverse and potentially weak resources
○ Across edge devices and cloud resources
● Limitations of existing solutions
○ Data parallelism does not work for heterogeneous resources
○ Conventional backpropagation-based training makes model parallelism difficult
○ Edge devices cannot train large models
○ Model compression works only for inference, not training
Proposed approach
● Using synthetic gradients to achieve model parallelism
○ Split a large DNN model into many small models (see the sketch below)
○ Deploy small models on distributed and potentially weak resources
○ Use synthetic gradients to decouple the training of the small models
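Below is a minimal sketch of the model-splitting idea, assuming PyTorch; the split_model helper, the layer sizes, and the cut points are illustrative assumptions, not the authors' implementation.

import torch.nn as nn

def split_model(layers, boundaries):
    # Partition an ordered list of layers into independent sub-models at the given cut points.
    parts, start = [], 0
    for end in boundaries + [len(layers)]:
        parts.append(nn.Sequential(*layers[start:end]))
        start = end
    return parts

# A small VGG-like stack, loosely following the layer names on the later slides (hypothetical sizes).
layers = [
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # conv1_x + pool1
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # conv2_x + pool2
    nn.Flatten(), nn.Linear(128 * 8 * 8, 256), nn.ReLU(),          # fc4_1
    nn.Linear(256, 10),                                            # fc4_3 -> prediction
]
# Cut after each pooling stage; each resulting sub-model could be placed on a different worker or edge device.
sub_models = split_model(layers, boundaries=[3, 6])

Once the model is split, each sub-model only needs to exchange activations and error signals with its neighbors, which is where synthetic gradients come in (following slides).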
Conventional training
● To train a neural network model
○ Forward path: calculate the loss of the class prediction
○ Backward path: calculate the gradients of the loss w.r.t. the parameters
● Needs tens of thousands of iterations to achieve high accuracy
● Layers are coupled and training is sequential (a minimal sketch of this loop follows the figure below)
● Limited scalability in training
[Figure: a VGG-style network (Conv1_1, Conv1_2, Pool1, Conv2_1, Conv2_2, Pool2, Conv3_1, Conv3_2, Conv3_3, Pool3, FC4_1, FC4_2, FC4_3) producing prediction y; under conventional training the forward and backward paths keep all layers tightly coupled]
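For contrast, here is a minimal sketch of the conventional loop described above, assuming PyTorch and a hypothetical model and data loader; every update needs a full forward pass to the loss and a full backward pass through all layers, which is what couples them.

import torch
import torch.nn.functional as F

def train(model, loader, epochs=1, lr=0.01):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            logits = model(x)                  # forward path: class prediction
            loss = F.cross_entropy(logits, y)  # loss of the prediction
            loss.backward()                    # backward path: gradients of the loss w.r.t. all parameters
            optimizer.step()                   # one tightly coupled, sequential update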
Training with synthetic gradients
● Use another small neural network to predict the error → synthetic gradients
● Use synthetic gradients to train the small model instead of waiting for the actual gradients from backpropagation (a minimal sketch follows the figure below)
[Figure: sub-model M3 (Conv3_1, Conv3_2, Conv3_3, Pool3) produces layer outputs h(3_1)-h(3_3) and is updated with errors δ(3_2)-δ(3_4); at its boundary with M4 (FC4_1, ...), a small predictor emits a synthetic error δ(4_1), trained to minimize its difference from the actual δ(4_1). h(i): output of the ith layer; δ(i): error of the ith layer. This decouples the layers.]
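A minimal sketch of decoupled training with synthetic gradients, assuming PyTorch; the SyntheticGradient predictor, sub_model, shapes, and learning rates are illustrative assumptions rather than the authors' code.

import torch
import torch.nn as nn

class SyntheticGradient(nn.Module):
    # Small auxiliary network that predicts the error d(loss)/d(h) from a layer output h.
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h):
        return self.net(h)

sub_model = nn.Sequential(nn.Linear(784, 256), nn.ReLU())  # e.g. one small model such as M3
sg = SyntheticGradient(256)                                # predicts the error at sub_model's output
opt_model = torch.optim.SGD(sub_model.parameters(), lr=0.01)
opt_sg = torch.optim.SGD(sg.parameters(), lr=0.001)

def local_update(x):
    # Update the sub-model immediately with the predicted (synthetic) error,
    # without waiting for backpropagation from downstream layers.
    h = sub_model(x)
    opt_model.zero_grad()
    h.backward(sg(h).detach())
    opt_model.step()
    return h.detach()

def sg_update(h, true_error):
    # When the actual error eventually arrives from downstream, refit the predictor
    # by minimizing the difference between the synthetic and the actual error.
    opt_sg.zero_grad()
    loss = ((sg(h) - true_error) ** 2).mean()
    loss.backward()
    opt_sg.step()

The downstream sub-model can run the same pattern on its own device; the only traffic between them is the activation h and, occasionally, the true error used to refit the predictor.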
Proposed approach
● Using knowledge transfer to improve training on weak resources
○ Train a large teacher model in the cloud
○ Train many small student models on the devices
○ Exploit the knowledge of the teacher to train the students
○ Improve the accuracy and convergence speed of the students
Knowledge transfer
[Figure: the teacher network (conv1-conv6, fc1, softmax) and the student network (conv1-conv3, fc1, softmax) process the same input data; RMSE losses connect selected teacher and student layer outputs, and the student softmax is also trained with cross entropy against the correct labels]
Minimize the squared difference between layer outputs (RMSE) to allow the student to learn from the teacher's representations
Layer mapping (student → teacher): 1st → 1st, 2nd → 4th, 3rd → 6th, softmax → softmax
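A minimal sketch of this combined loss, assuming PyTorch; the LAYER_MAP indices, the feature lists, and the assumption that mapped layer outputs have matching shapes (otherwise a projection would be needed) are illustrative, not the authors' implementation.

import torch
import torch.nn.functional as F

# Hypothetical student-layer -> teacher-layer index map (1st->1st, 2nd->4th, 3rd->6th).
LAYER_MAP = {0: 0, 1: 3, 2: 5}

def transfer_loss(student_feats, teacher_feats, student_logits, teacher_logits, labels):
    # Cross entropy against the correct labels supervises the student directly.
    loss = F.cross_entropy(student_logits, labels)
    # RMSE terms pull selected student layer outputs toward the teacher's representations.
    for s_idx, t_idx in LAYER_MAP.items():
        s, t = student_feats[s_idx], teacher_feats[t_idx].detach()
        loss = loss + torch.sqrt(F.mse_loss(s, t))
    # Softmax-to-softmax term matches the teacher's output distribution as well.
    loss = loss + torch.sqrt(F.mse_loss(F.softmax(student_logits, dim=1),
                                        F.softmax(teacher_logits, dim=1).detach()))
    return loss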
Preliminary results
● Using synthetic gradients to achieve model parallelism
○ Model: a simple 4-layer network
  • One convolutional layer and three fully-connected layers
○ MNIST dataset
○ 500K iterations of training
Training accuracy:
Model           Backpropagation   Synthetic gradients
4-layer model   0.984             0.977
8-layer model   0.992             0.924
VGG16           0.994             0.845
[Plot: validation accuracy (0.7-1.1) vs. training iterations (0-500K) for synthetic gradients (sg) and backpropagation (bp)]
Preliminary results
● Using knowledge transfer to improve training on weak resources
○ Teacher model: VGG-16 (8.5M parameters)
○ Student model: miniaturized VGG-16 (3.2M parameters)
○ Training the student for 100 epochs
Model                   Accuracy
Teacher                 76.88%
Student (dependent)     74.12%
Student (independent)   61.24%
“Are Existing Knowledge Transfer Techniques Effective For Deep Learning on Edge Devices?” R. Sharma, S. Biookaghazadeh, and M. Zhao, EDGE, 2018
Conclusions and future work
● Rethink DNN platforms to handle future learning needs
○ Utilize heterogeneous resources to train DNNs
○ Distribute learning across edge and cloud
● Achieve efficient model parallelism
○ Use synthetic gradients to decouple the training of distributed models
○ Need to further improve its accuracy for complex models
● Overcome the resource constraints of edge devices
○ Use knowledge transfer to help train on-device models
○ Need to further study its effectiveness under scenarios with limited data and limited supervision
Acknowledgement
● National Science Foundation
○ GEARS project: CNS-1629888
○ CNS-1619653, CNS-1562837, CNS-1629888, CMMI-1610282, IIS-1633381
● VISA Lab @ ASU
○ Saman, Yitao, Kaiqi, and others
● Thank you!
○ Questions and suggestions?
○ Come to our poster at Happy Hour!