Tile Reduction: the first step towards tile aware parallelization in OpenMP
Ge Gan, Department of Electrical and Computer Engineering, Univ. of Delaware



Overview
• Background
• Motivation
• A new idea: Tile Reduction
• Experimental Results
• Conclusion
• Related Work
• Future Work

2

Tile/Tiling
• Natural representation of the data objects that are heavily used in scientific algorithms
• Tiling improves data locality
• Tiling can increase parallelism and reduce synchronization in parallel programs
• An effective compiler optimization technique
• Essentially a program design paradigm
• Supported in many parallel programming languages: ZPL, CAF, HTA, etc.

3

OpenMP
• OpenMP is the de facto standard for shared-memory parallel programming
• Provides a simple and flexible interface for developing portable and scalable parallel applications
• Supports incremental parallelization
• Maintains sequential consistency
• "Tile oblivious": no directive or clause can annotate a data tile and carry that information to the compiler

4

A Motivating Example

5

Parallelizing: the traditional way (1)

6

Parallelizing: the traditional way (2)

• Can only leverage the traditional scalar reduction in OpenMP
• Parallelism is trivial
• Data locality is not bad
• Not natural and intuitive

7

The Expected Parallelization

8

• View the innermost two loops as one macro operation performed on the 2x2 data tiles
• Aggregate the data tiles in parallel
• More parallelism
• Better data locality

Tile Reduction Interface

9

Terms
• Reduction tile: the data tile under reduction
• Tile descriptor: the "multi-dimensional array" in the list construct
• Reduction kernel loops: the loops involved in performing "one" recursive calculation
• Tile name
• Dimension descriptor: the tuples following the tile name

10

A Use Case

11

Tiled Matrix Multiplication

Tile Reduction Applied on the Tiled Matrix Multiplication Code

Code Generation (1)

12

• Distribute the iterations of the parallelized loop among the threads

• Allocate memory for the private copy of the tile used in the local recursive calculation

• Perform the local recursive calculation which is specified by the reduction kernel loops

• Update the global copy of the reduction tile

Code Generation (2)

13

Experimental Results (1)

14

2D Histogram Reduction

Experimental Results (2)

15

Matrix-Matrix Multiplication

Experimental Results (3)

16

Matrix-Vector Multiplication

Conclusions

17

• As one of the building blocks of tile aware parallelization, tile reduction brings more opportunities to parallelize dense matrix applications

• For some benchmarks, tile reduction is a more natural and intuitive way to reason about the best parallelization decision

• For some benchmarks, tile reduction not only can improve data locality, but also can expose more parallelism

• Friendly to programmers
• Code generation is as simple as for the scalar reduction in current OpenMP
• Runtime overhead is trivial

Related Work

18

• Parallel reduction is supported in:
• C**: Viswanathan, G., Larus, J.R.: User-defined reductions for efficient communication in data-parallel languages. Technical Report 1293, University of Wisconsin-Madison (Jan 1996)

• SAC: Scholz, S.B.: On defining application-specific high-level array operations by means of shape invariant programming facilities. In: APL ’98: Proceedings of the APL98 conference on Array processing language, New York, NY, USA, ACM (1998) 32–38

• ZPL: Deitz, S.J., Chamberlain, B.L., Snyder, L.: High-level language support for user-defined reductions. J. Supercomput. 23(1) (2002) 23–37

• UPC Consortium: UPC Collective Operations Specifications V1.0 A publication of the UPC Consortium (2003)

• MPI Forum: MPI: A message-passing interface standard (version 1.0). Technical report (May 1994). URL http://www.mcs.anl.gov/mpi/mpi-report.ps

• Kambadur, P., Gregor, D., Lumsdaine, A.: Openmp extensions for generic libraries. In: Lecture Notes in Computer Science: OpenMP in a New Era of Parallelism, IWOMP’08, International Workshop on OpenMP. Volume 5004/2008., Springer Berlin / Heidelberg (2008) 123–133

Future Work

19

• Design and develop OpenMP pragma directives that help the compiler generate efficient data-movement code for parallel applications running on many-core platforms with highly non-uniform memory systems, such as the Cyclops-64 processor