AutoTVM & Device Fleet

Machine Learning based Program Optimizer

Learning to Optimize Tensor Programs

High-level data flow graph and optimizations

Learning to generate optimized programs for new operator workloads and hardware

Hardware

Frameworks

Search over Possible Program Transformations

Hardware

  • Loop Transformations
  • Thread Bindings
  • Cache Locality
  • Thread Cooperation
  • Tensorization
  • Latency Hiding

Billions of possible optimization choices

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Compute Description
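
The compute description above states what C contains, not how to loop over it. As a rough illustration only (plain Python rather than TVM; the function names and the tile size are assumptions made for this sketch), the reference loop nest and the tiled variant below both compute C[y][x] as the sum over k of A[k][y] * B[k][x]. Tiling is one example of the loop transformations the search chooses among.

```python
def matmul_reference(A, B, m, n, l):
    """C[y][x] = sum_k A[k][y] * B[k][x]; A is l x m, B is l x n."""
    C = [[0] * n for _ in range(m)]
    for y in range(m):
        for x in range(n):
            for k in range(l):
                C[y][x] += A[k][y] * B[k][x]
    return C


def matmul_tiled(A, B, m, n, l, tile=4):
    """Same result with the y and x loops split into tiles."""
    C = [[0] * n for _ in range(m)]
    for yo in range(0, m, tile):
        for xo in range(0, n, tile):
            for k in range(l):
                for y in range(yo, min(yo + tile, m)):
                    for x in range(xo, min(xo + tile, n)):
                        C[y][x] += A[k][y] * B[k][x]
    return C


# Small integer inputs so the two variants can be compared exactly.
m, n, l = 6, 5, 3
A = [[k * m + y for y in range(m)] for k in range(l)]
B = [[k * n + x + 1 for x in range(n)] for k in range(l)]
same = matmul_reference(A, B, m, n, l) == matmul_tiled(A, B, m, n, l)
```

Both variants produce identical output; only the loop structure, and therefore the memory access pattern, differs.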

Learning-based Program Optimizer

[Diagram labels: Program Optimizer, Program, Code Generator, Runtime Measurements]

High experiment cost: each trial costs ~1 second

Learning-based Program Optimizer

[Diagram labels: Program Optimizer, Program, Code Generator, Cost Model]

Need reliable cost model per hardware

Learning-based Program Optimizer

[Diagram labels: Program Optimizer, Program, Code Generator, Training data D, Learning, Statistical Cost Model]

Unique Problem Characteristics
  • Relatively low experiment cost
  • Domain-specific problem structure
  • Large quantity of similar tasks
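
The loop on this slide (measure programs, collect training data D, learn a statistical cost model, use it to pick the next programs) can be sketched roughly as follows. Everything here is an assumption made for illustration: `measure` is a synthetic stand-in for a real hardware run, the search space is a small grid of tile sizes, and the cost model is a nearest-neighbour average rather than the models named in the talk.

```python
import random


def measure(config):
    """Stand-in for a real hardware run (synthetic cost, lower is better)."""
    ty, tx = config
    return abs(ty - 8) + abs(tx - 16) + 1.0


def predict(config, data):
    """Toy cost model: mean cost of the 3 nearest measured configs."""
    if not data:
        return 0.0
    nearest = sorted(
        data,
        key=lambda d: abs(d[0][0] - config[0]) + abs(d[0][1] - config[1]),
    )[:3]
    return sum(cost for _, cost in nearest) / len(nearest)


def tune(space, rounds=5, batch=4, seed=0):
    rng = random.Random(seed)
    data = []          # training data D: (config, measured cost) pairs
    measured = set()
    for _ in range(rounds):
        candidates = [c for c in space if c not in measured]
        if not candidates:
            break
        rng.shuffle(candidates)                           # random tie-breaking
        candidates.sort(key=lambda c: predict(c, data))   # model ranks candidates
        for config in candidates[:batch]:                 # measure only the top few
            data.append((config, measure(config)))
            measured.add(config)
    return min(data, key=lambda d: d[1]), data


space = [(ty, tx) for ty in (1, 2, 4, 8, 16, 32) for tx in (1, 2, 4, 8, 16, 32)]
best, history = tune(space)
```

Only `rounds * batch` configurations are ever measured; the model decides which ones, which is the point of replacing exhaustive measurement with a learned cost model.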

Program-aware Cost Modeling

High-Level Configuration

Low-level Abstract Syntax Tree (shared between tasks)

for y in range(8):
    for x in range(8):
        C[y][x] = 0
        for k in range(8):
            C[y][x] += A[k][y] * B[k][x]

Statistical features, used with Boosted Tree Ensembles

Touched memory per loop level:

        C    A    B
    y   64   64   64
    x   8    8    64
    k   1    8    8

Outer loop length:

    y   1
    x   8
    k   64

[Diagram: TreeGRU. Labels: for (y), for (x), for (k); context vec of y, context vec of x, context vec of k; soft scatter; +; final embedding]
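
The feature values on this slide can be reproduced by brute-force enumeration. The sketch below is an illustration only, with assumed names (`LOOPS`, `ACCESSES`, `loop_features`): for each loop level it counts the distinct elements of C, A, and B touched while that loop and the loops nested in it run, and records the product of the enclosing loop extents as the outer loop length.

```python
from itertools import product

# Loop nest from the slide, outermost first: (loop variable, extent).
LOOPS = [("y", 8), ("x", 8), ("k", 8)]

# Index expression of each buffer access in the loop body.
ACCESSES = {
    "C": lambda y, x, k: (y, x),
    "A": lambda y, x, k: (k, y),
    "B": lambda y, x, k: (k, x),
}


def loop_features(loops, accesses):
    features = {}
    outer_length = 1
    for depth, (name, extent) in enumerate(loops):
        inner = loops[depth:]    # this loop and everything nested inside it
        outer = loops[:depth]    # enclosing loops, held fixed at iteration 0
        touched = {}
        for buf, index in accesses.items():
            elems = set()
            for point in product(*(range(e) for _, e in inner)):
                full = (0,) * len(outer) + point
                elems.add(index(*full))
            touched[buf] = len(elems)
        features[name] = {"touched": touched, "outer_length": outer_length}
        outer_length *= extent
    return features


features = loop_features(LOOPS, ACCESSES)
```

For the 8x8x8 loop nest this yields the same numbers as the two tables above, for example 1, 8, and 8 touched elements of C, A, and B at the k loop.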

Effectiveness of ML based Model

[Figure: One Conv2D Layer of ResNet18 on Titan X. x-axis: Number of Trials (0 to 800). y-axis: Relative Speedup (0.00 to 1.50). Curves: Baseline: cuDNN; TVM: Random Search; TVM: ML-based Model.]

Transfer Learning Among Different Workloads

Historical Optimization Tasks

Domain Invariant Program Representations

Transferable Models to speed up new tasks

NVIDIA GPU Optimization (GTX 1080 Ti)

[Figure: bar chart. y-axis: Latency (ms), 0 to 7. Models: ResNet-50, MobileNet, VGG-19, Inception V3, DenseNet-121. Series: MXNet + TensorRT 4.0; AutoTVM.]

AMD GPU Optimization (Vega FE)

[Figure: bar chart. y-axis: Latency (ms), 0 to 7. Models: ResNet-50, MobileNet, DenseNet-121. Series: MIOpen; AutoTVM.]

Bonus (INT8, GTX 1080)

[Figure: bar chart. y-axis: Latency (ms), 0E+00 to 1.6E-04. Workloads: 1-7-512-512-1, 4-7-512-512-1, 1-7-512-512-3, 4-7-512-512-3, 1-14-256-256-1, 4-14-256-256-1. Series: cuDNN; AutoTVM.]

High Level: Scaling Automatic Performance Profiling

[Diagram: Fleet Tracker]

Low Level: Portable RPC Tracker + Server

Running optimization services
  • Prioritizer
  • Workload 1, Workload 2, Workload 3
  • ML-based cost model
  • Optimization Service: RPC client, cross compiler

Resource Manager (Tracker)

Shared cluster of heterogeneous devices
  • Nvidia GPU Server: RPC RT, CUDA tasks
  • Android Phone: RPC RT, OpenCL tasks
  • AMD GPU Server: RPC RT, ROCm tasks
  • Zynq FPGA board: RPC RT, JIT driver
  • Raspberry Pi: RPC RT, ARM tasks

Other diagram labels: Resource token, Resource Allocation, RPC Session Data Path, Hardware bitstream

Red modules can be reconfigured remotely in each session

RPC Communication Flow

Client and Tracker:
  • request device
  • return handle

Client and Device:
  • upload code
  • run code
  • return run time

Device:
  • device free
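
The message sequence on this slide can be sketched as a small in-process simulation. This is not the TVM RPC API: the `Tracker` and `Device` classes, the device name, and the idea that "code" is a callable returning a pretend run time are all assumptions made for illustration, as is the reading that "device free" is reported to the tracker.

```python
from collections import deque


class Device:
    def __init__(self, name):
        self.name = name
        self.code = None

    def upload(self, code):
        self.code = code                 # "upload code"

    def run(self):
        return self.code()               # "run code", then "return run time"


class Tracker:
    def __init__(self):
        self.free = deque()
        self.log = []

    def device_free(self, device):
        self.log.append(("device free", device.name))
        self.free.append(device)

    def request_device(self):
        self.log.append(("request device", None))
        device = self.free.popleft()     # raises IndexError if no device is free
        self.log.append(("return handle", device.name))
        return device


def measure_on_fleet(tracker, code):
    device = tracker.request_device()
    try:
        device.upload(code)
        return device.run()
    finally:
        tracker.device_free(device)      # hand the device back for the next client


tracker = Tracker()
tracker.device_free(Device("rpi-0"))
run_time = measure_on_fleet(tracker, lambda: 0.0125)
```

Releasing the device in a `finally` block is a design choice of this sketch: a failed run should not leave a device permanently marked busy in a shared fleet.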

Model to Tuned Implementation

Model → (operator extraction) → Bag of Operators → (AutoTVM tuning) → Tuned Model
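
A minimal sketch of the pipeline above, under stated assumptions: the model is a hypothetical list of (operator, workload shape) layers, `extract_operators` and `tune` are made-up helper names, and `tune` returns a placeholder label instead of running any real tuning. The point it illustrates is that repeated layers collapse into one entry in the bag of operators, so each unique workload is tuned once.

```python
from collections import Counter

# Hypothetical model: a list of (operator, workload shape) layers.
model = [
    ("conv2d", (1, 64, 56, 56, 3)),
    ("conv2d", (1, 64, 56, 56, 3)),
    ("conv2d", (1, 128, 28, 28, 3)),
    ("dense", (1, 512, 1000)),
]


def extract_operators(layers):
    """Operator extraction: collapse layers into a bag of unique workloads."""
    return Counter(layers)


def tune(workload):
    """Placeholder for tuning; returns a made-up config label."""
    op, shape = workload
    return f"tuned-{op}-{'x'.join(map(str, shape))}"


bag = extract_operators(model)
tuned_configs = {workload: tune(workload) for workload in bag}

# Tuned model: every layer paired with the config found for its workload.
tuned_model = [(layer, tuned_configs[layer]) for layer in model]
```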

Next: Autoscheduler, Lianmin @ 16:30

Handcrafted Schedule Templates
  • conv2d, x86
  • conv2d, GPU, winograd
  • conv2d, ARM, spatial packing