An Effective Dynamic Scheduling Runtime and Tuning System for Heterogeneous Multi and Many-Core Desktop Platforms
Authors: Alécio P. D. Binotto, Carlos E. Pereira, Arjan Kuijper, André Stork, and Dieter W. Fellner
ytchen 2012.09.19


Outline
• Introduction
• Motivation
• System
• Experiment results
• Related work
• Conclusion


Introduction
• High-performance platforms are commonly required for scientific and engineering algorithms that must meet timing constraints.
• Both computation time and performance need to be optimized.
• Efficiency matters for both huge domain sizes and small problems.


Introduction
• Our dynamic scheduling method combines two phases:
  o a first assignment phase for a set of high-level tasks (algorithms, for example), based on a pre-processing benchmark that acquires basic performance samples of the tasks on the processing units (PUs);
  o a runtime phase that obtains real performance measurements of the tasks and feeds them into a performance database.


Motivation
• 3D Computational Fluid Dynamics (CFD)
• Large computations
  o velocity field
  o local pressure
• Examples
  o planes
  o cars


Motivation
• Three iterative solvers for systems of linear equations (SLEs): Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient
  o Jacobi: an iterative method for solving a system of linear equations whose matrix is diagonally dominant, that is, in each row the diagonal element has the largest absolute value.
  o Red-Black Gauss-Seidel: an iterative method used to solve a linear system of equations resulting from the finite difference discretization of partial differential equations.
  o Conjugate Gradient: an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite.
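As a minimal sketch of the first solver listed above, the Jacobi iteration for a small diagonally dominant system can be written in plain Python. The function name `jacobi`, the tolerance, the iteration limit, and the example system are illustrative assumptions; none of them come from the paper.

```python
def jacobi(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b with the Jacobi iteration (A should be diagonally dominant)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        # Every component is computed from the previous iterate only.
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        # Stop when the largest component change falls below the tolerance.
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x


# Hypothetical 3x3 diagonally dominant system whose exact solution is (1, 2, 3).
A = [[10.0, 1.0, 1.0],
     [1.0, 10.0, 1.0],
     [1.0, 1.0, 10.0]]
b = [15.0, 24.0, 33.0]
solution = jacobi(A, b)
```

Because each component of the new iterate depends only on the previous iterate, the updates are independent of one another, which is what makes this solver a natural candidate for parallel PUs.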


System overview
• Unit of Allocation (UA): each UA is represented as a task.


Platform Independent Programming Model
• OpenCL
• In its basic principle, the API encapsulates implementations of a task (methods, algorithms, parts of code, etc.) for different PUs, leveraging intrinsic hardware features while keeping the task platform independent.
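The idea of one task carrying several PU-specific implementations can be sketched as follows. This is plain Python rather than OpenCL, and the `Task` class, the PU labels, and the two `scale_*` functions are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    """A unit of work with one implementation per processing unit (PU)."""
    name: str
    implementations: Dict[str, Callable[[List[float]], List[float]]] = field(default_factory=dict)

    def run(self, pu: str, data: List[float]) -> List[float]:
        # The caller only names the PU; the matching implementation is looked up here.
        return self.implementations[pu](data)


def scale_cpu(data):
    # Stand-in for a CPU-tuned implementation.
    return [2.0 * v for v in data]


def scale_gpu(data):
    # Stand-in for a GPU kernel launch; here it simply computes the same result.
    return list(map(lambda v: 2.0 * v, data))


task = Task("scale", {"CPU": scale_cpu, "GPU": scale_gpu})
cpu_result = task.run("CPU", [1.0, 2.0])
gpu_result = task.run("GPU", [1.0, 2.0])
```

With this shape, a scheduler can decide where a task runs without knowing how each implementation is written.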


Profiler and Database
• The profiler monitors the tasks' execution times and characteristics and stores them in a timing performance database.
• The recorded characteristics include the input data (size and type) and the data transfers between PUs, among others.


Profiler and Database
• Performance is measured on the host (CPU) by counting clocks, which intrinsically takes into account the data transfer times between the CPU and the PU, possible initialization and synchronization times on the PUs, and latency.
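A host-side measurement of this kind can be sketched as below. The sketch assumes wall-clock seconds from Python's `time.perf_counter` in place of raw clock counts, and the `timing_db` dictionary, its key layout, and the `profile` function are hypothetical.

```python
import time
from collections import defaultdict

# Hypothetical timing database: (task name, PU, domain size) -> measured times.
timing_db = defaultdict(list)


def profile(task_name, pu, domain_size, run):
    """Time `run()` on the host and record the measurement.

    The timer wraps the whole call, so anything the call does (transfers,
    initialization, synchronization) is included in the measurement.
    """
    start = time.perf_counter()
    result = run()
    elapsed = time.perf_counter() - start
    timing_db[(task_name, pu, domain_size)].append(elapsed)
    return result


# Stand-in workload; a real task would launch work on the chosen PU.
total = profile("sum_squares", "CPU", 1000, lambda: sum(i * i for i in range(1000)))
```

Timing around the whole call is the design choice that matters here: the scheduler cares about the cost seen by the application, not only the kernel time on the PU.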


Dynamic Scheduler
• First, it establishes an initial scheduling guess over the PUs as soon as the application(s) start.
  o First Assignment Phase – FAP
• Second, for every newly arriving task, it schedules the task by consulting the timing database.
  o Runtime Assignment Phase – RAP


First Assignment Phase – FAP
• Given a set of tasks with predefined costs for the PUs stored in the database, the first assignment phase schedules the tasks over the asymmetric PUs.
• Objective: lowest total execution time
  o m: the number of PUs (here m = 2)
  o n: the number of considered tasks
  o i: task
  o j: processor
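The objective formula itself is not recoverable from the slide text, so the following is only a sketch under a stated assumption: the PUs run concurrently, each PU runs its tasks one after another, and the "total execution time" is the time at which the most loaded PU finishes. It searches all assignments exhaustively, which is practical only for small n. The function name and the cost values are made up.

```python
from itertools import product


def first_assignment(costs):
    """Exhaustively pick, for each task i, a PU j minimizing the finish time.

    costs[i][j] is the predefined cost of task i on PU j.
    Assumption: the total execution time is the largest per-PU sum of costs.
    """
    n = len(costs)
    m = len(costs[0])
    best_time, best_assignment = float("inf"), None
    for assignment in product(range(m), repeat=n):
        load = [0.0] * m
        for i, j in enumerate(assignment):
            load[j] += costs[i][j]
        if max(load) < best_time:
            best_time, best_assignment = max(load), assignment
    return best_assignment, best_time


# Made-up costs for n = 3 tasks on m = 2 PUs (column 0 and column 1).
costs = [[4.0, 9.0],
         [3.0, 5.0],
         [6.0, 2.0]]
assignment, finish_time = first_assignment(costs)
```

The search visits m^n assignments, which illustrates why a cheaper heuristic is attractive once the number of tasks grows.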


Runtime Assignment Phase – RAP
• The arrival of new tasks is modeled as a FIFO (First In, First Out) queue.
• Assignment reconfiguration: tasks that were already scheduled but not yet executed change their assignment if doing so yields a performance gain.
• When the database has no entry for a task with a specific domain size, the lookup function retrieves the data recorded for the most similar domain size.
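The fallback lookup described in the last bullet can be sketched as a nearest-size search. The database layout, the function name `lookup_cost`, the use of absolute difference as the similarity measure, and all timing values are assumptions made for illustration.

```python
def lookup_cost(db, task, pu, domain_size):
    """Return the recorded time for (task, pu) at the closest known domain size."""
    entries = db[(task, pu)]  # maps domain size -> measured time
    if domain_size in entries:
        return entries[domain_size]
    # No exact entry: fall back to the recorded size nearest to the request.
    nearest = min(entries, key=lambda size: abs(size - domain_size))
    return entries[nearest]


# Made-up timing entries for one task on two PUs.
db = {
    ("jacobi", "GPU"): {64: 1.2, 128: 2.0, 256: 5.5},
    ("jacobi", "CPU"): {64: 0.9, 128: 3.1, 256: 12.0},
}
exact = lookup_cost(db, "jacobi", "GPU", 128)
approx = lookup_cost(db, "jacobi", "CPU", 200)
```

Here the request for size 200 has no exact entry, so the value recorded for size 256 is returned.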


Experiment results
• Domain sizes and execution costs of the tasks on the PUs


Experiment results
• Comparison of allocation heuristics
  o Legend: 0 = GPU, 1 = CPU


Experiment results
• Overhead of the dynamic scheduling using ALG. 2 and its gain in comparison to scheduling all tasks to the GPU


Experiment results
• Scheduling techniques for 24 tasks
  o Overhead: the time to perform the scheduling
  o Solve time: the execution time to compute the tasks
  o Total time: overhead + solve time
  o Error: the total time of a technique in comparison to the optimal solution without its overhead
    • ex: (7660 - 6130) / 6130
  o Optimal: exhaustive search
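The error definition above reduces to a one-line relative difference. In this small sketch the function name `scheduling_error` is hypothetical, and the two numbers are the ones from the slide's example.

```python
def scheduling_error(total_time, optimal_solve_time):
    """Relative excess of a technique's total time over the optimal solve time."""
    return (total_time - optimal_solve_time) / optimal_solve_time


# Values from the example on the slide.
error = scheduling_error(7660, 6130)
```

For the slide's example this evaluates to 1530 / 6130, roughly 0.2496.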


Experiment results
• Scheduling 24 tasks in the FAP + 42 tasks arriving in the RAP


Related work
• Distributed processing on a CPU-GPU platform
• Scheduling on a CPU-GPU platform
  o HEFT (Heterogeneous-Earliest-Finish-Time)


Related work
• Comparison: StarPU vs. this paper
  o execution model: codelets vs. OpenCL
  o method: low-level vs. high-level
  o motivation: CFD, matrix multiplication
  o system: runtime system
  o scheduling: database


Conclusion
• This paper presents a context-aware runtime and tuning system based on a compromise between reducing the execution time of engineering applications and the overhead of computing the schedule.
• We combined a model for a first scheduling, based on an off-line performance benchmark, with a runtime model that keeps track of the real execution times of the tasks, with the goal of extending the scheduling process of OpenCL.


Conclusion
• We achieved an execution time gain of 21.77% in comparison to the static assignment of all tasks to the GPU, with a scheduling error of only 0.25% compared to exhaustive search.


Thanks for listening!