Software Pipelining on Multi-Core Architectures Alban Douillet, Guang R. Gao CISC 879 : Software Support for Multicore Architectures Tom St. John Dept of Electrical and Computer Engineering University of Delaware

Page 1: Software Pipelining on Multi-Core Architectures

Software Pipelining on Multi-Core Architectures

Alban Douillet, Guang R. Gao

CISC 879 : Software Support for Multicore Architectures

Tom St. John, Dept of Electrical and Computer Engineering

University of Delaware

Page 2: Software Pipelining on Multi-Core Architectures

Outline

• Introduction

• Single-Dimension Software Pipelining

• Multi-Threaded Software Pipelining

• Experiments

Page 3: Software Pipelining on Multi-Core Architectures

Introduction

• Software pipelining is among the most successful loop optimizations

• Can it be applied to multi-core chips?

• What extensions are required?

Page 4: Software Pipelining on Multi-Core Architectures

Single-Dimension SWP

• Does not simply pipeline innermost loop

- Pipelines most profitable loop level

• Loop levels enclosing selected loop left to global scheduler

- Selected loop seen as outermost loop

- Inner loops executed sequentially

• Able to take advantage of ILP/data locality properties present in other loops
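
The level selection described on this slide can be sketched as follows. This is only an illustration: the per-level cycle estimates are hypothetical inputs, the slides do not say how profitability is computed, and `select_loop_level` is a name invented here.

```python
def select_loop_level(estimated_cycles):
    """Pick the loop level to software-pipeline.

    estimated_cycles[d] is a hypothetical cost estimate for pipelining
    the loop at depth d (0 = outermost). Lower is treated as more profitable.
    """
    depth = len(estimated_cycles)
    selected = min(range(depth), key=lambda d: estimated_cycles[d])
    return {
        # Loops around the selected one are not pipelined here.
        "enclosing": list(range(0, selected)),
        # The selected loop is treated as the outermost loop of the pipelined nest.
        "selected": selected,
        # Loops inside the selected one run sequentially.
        "inner": list(range(selected + 1, depth)),
    }


plan = select_loop_level([900, 400, 650])
```

With these made-up estimates the middle loop is selected, the outermost loop stays outside the pipelined nest, and the innermost loop runs sequentially.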

Page 5: Software Pipelining on Multi-Core Architectures

SSP Example

Page 6: Software Pipelining on Multi-Core Architectures

SSP Example

Page 7: Software Pipelining on Multi-Core Architectures

SSP Example

Page 8: Software Pipelining on Multi-Core Architectures

Multi-Threaded SWP

• Several Obstacles Exist

• Dependences/resource constraints must be respected

• Operation cannot be scheduled before all dependences are satisfied

• Memory dependences may exist between thread units

- Synchronization is required

Page 9: Software Pipelining on Multi-Core Architectures

Multi-Threaded Final Schedule

• Schedule each group of Sn iterations on a thread unit using round-robin approach

- Workload balance is fair

• Sn is max number of iterations that can be executed in parallel without resource conflict

• Thread units may share same instruction cache
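
A minimal sketch of the round-robin assignment, assuming the outermost iterations are numbered from 0 and that a final partial group is allowed. The function name and the returned layout are illustrative choices, not part of the original method.

```python
def assign_groups(num_iterations, sn, num_thread_units):
    """Assign groups of `sn` consecutive outermost iterations to thread units.

    Returns a dict mapping each thread unit to a list of (start, end)
    half-open iteration ranges, handed out in round-robin order.
    """
    schedule = {tu: [] for tu in range(num_thread_units)}
    for group, start in enumerate(range(0, num_iterations, sn)):
        end = min(start + sn, num_iterations)  # the last group may be shorter
        schedule[group % num_thread_units].append((start, end))
    return schedule


schedule = assign_groups(num_iterations=10, sn=2, num_thread_units=3)
```

Here five groups of two iterations are spread over three thread units, so no unit holds more than one group more than any other.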

Page 10: Software Pipelining on Multi-Core Architectures

Final Schedule Example

Page 11: Software Pipelining on Multi-Core Architectures

Final Schedule Example

Page 12: Software Pipelining on Multi-Core Architectures

Data Dependencies

• Data dependencies may exist between outermost iterations

• Synchronization points are chosen to minimize code duplication during code generation

- WAIT is placed before each repeating pattern

- SIGNAL is placed after each pattern

• Synchronization delay guarantees the correctness of the schedule

Page 13: Software Pipelining on Multi-Core Architectures

Synchronization Delay Example

Page 14: Software Pipelining on Multi-Core Architectures

Synchronization Delay Example

Page 15: Software Pipelining on Multi-Core Architectures

Synchronization Delay Example

Page 16: Software Pipelining on Multi-Core Architectures

Synchronization

• Each thread has two counters

- Synchronization counter counts number of synchronization signals received

- Clock counter incremented after each WAIT

• When a thread reaches a WAIT, execution continues only if the synchronization counter is greater than or equal to the clock counter

• WAIT implemented with an active loop

• SIGNAL is a non-blocking atomic add-in-memory instruction
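
The two-counter protocol can be modeled in software as below. This is a sketch under stated assumptions, not the Cyclops64 implementation: Python threads stand in for thread units, a lock stands in for the atomic add-in-memory instruction, the clock counter is assumed to start at 1 so that the k-th WAIT needs k signals, and the first thread unit is pre-signaled once because its first group has no predecessor. The dependence pattern (each group waits for the previous group) is also a simplification chosen for the demo.

```python
import threading
import time


class ThreadUnit:
    """Hypothetical model of one thread unit's synchronization state."""

    def __init__(self):
        self.sync_counter = 0    # number of SIGNALs received so far
        self.clock_counter = 1   # assumption: the k-th WAIT needs k signals
        self._lock = threading.Lock()  # stands in for the hardware atomic add

    def signal(self):
        # Non-blocking for the sender: just bump the receiver's counter.
        with self._lock:
            self.sync_counter += 1

    def wait(self):
        # Active loop: spin until sync_counter >= clock_counter.
        while self.sync_counter < self.clock_counter:
            time.sleep(0)  # yield so other Python threads can make progress
        self.clock_counter += 1  # incremented after each WAIT


def run_ring(num_units, num_groups):
    """Run groups round-robin on a ring of units; return the completion order."""
    units = [ThreadUnit() for _ in range(num_units)]
    order = []
    units[0].signal()  # assumption: group 0 has nothing to wait for

    def worker(uid):
        for group in range(uid, num_groups, num_units):
            units[uid].wait()                       # WAIT before the pattern
            order.append(group)                     # the work of this group
            units[(uid + 1) % num_units].signal()   # SIGNAL after the pattern

    threads = [threading.Thread(target=worker, args=(u,)) for u in range(num_units)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


order = run_ring(num_units=3, num_groups=10)
```

Because every group waits for a signal from the unit running the previous group, the groups complete in program order even though they run on different threads.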

Page 17: Software Pipelining on Multi-Core Architectures

Innermost Loop Tiling

• Allows for coarser-grain synchronization

• Execution of Nn - 1 instances of the innermost loop pattern is tiled into tiles of G iterations

• WAIT and SIGNAL are issued at the entrance and exit of each tile

• Gmin, value of G that minimizes final schedule length, can be approximated at compile time
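
To illustrate how tiling coarsens synchronization, the sketch below lays out the event sequence for one thread unit. The function name and the string labels are illustrative only, and no attempt is made to estimate Gmin, since the slides do not give its formula.

```python
def tile_innermost(num_instances, g):
    """Return the event list for `num_instances` pattern instances in tiles of `g`.

    A WAIT is issued at the entrance of each tile and a SIGNAL at its exit.
    """
    events = []
    for start in range(0, num_instances, g):
        end = min(start + g, num_instances)  # the last tile may be shorter
        events.append("WAIT")
        events.extend(f"pattern[{i}]" for i in range(start, end))
        events.append("SIGNAL")
    return events


fine = tile_innermost(num_instances=5, g=1)
coarse = tile_innermost(num_instances=5, g=2)
```

With G = 1 every pattern instance is bracketed by its own WAIT and SIGNAL. With G = 2 the same five instances need only three WAIT/SIGNAL pairs, at the cost of delaying each SIGNAL until the whole tile has finished.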

Page 18: Software Pipelining on Multi-Core Architectures

Cross-Iteration Register Dependences

• Assume thread units do not share registers

• Insert copy operations to copy value from one thread unit to next

• Register dependence transformed into memory dependence

• Issue memory spill instruction to copy from register to scratch-pad memory of destination thread

• Value restored using local memory load

Page 19: Software Pipelining on Multi-Core Architectures

Cross-Iteration Register Dependences

• Memory spill instructions only need to be issued by the last iteration of an iteration group

• Memory restore instructions only need to be issued by the first iteration

• If distance of dependence is greater than 1, cascading copies and memory spills/restores will bring value to target iteration
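
The spill and restore scheme can be sketched for the simplest case, a running sum carried from each iteration to the next (dependence distance 1). Everything here is a simplified model: the scratch-pads are a plain list with one slot per thread unit, the groups are simulated one after another rather than on real threads, and the WAIT/SIGNAL pairs that would order the spill before the restore are omitted.

```python
def run_with_spills(values, sn, num_units):
    """Compute prefix sums with the carried value passed between thread units.

    Iterations are grouped `sn` at a time and assigned round-robin. Only the
    first iteration of a group restores, and only the last one spills.
    """
    scratchpad = [0] * num_units  # hypothetical: one slot per thread unit
    results = []
    for group, start in enumerate(range(0, len(values), sn)):
        unit = group % num_units
        reg = scratchpad[unit]            # restore: local memory load
        for v in values[start:start + sn]:
            reg = reg + v                 # inside a group the value stays in a register
            results.append(reg)
        # Spill: write into the scratch-pad of the destination thread unit.
        scratchpad[(unit + 1) % num_units] = reg
    return results


sums = run_with_spills([1, 2, 3, 4, 5, 6, 7], sn=2, num_units=3)
```

The result matches a purely sequential prefix sum, which is the property the copy, spill and restore operations are meant to preserve.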

Page 20: Software Pipelining on Multi-Core Architectures

Cross-Iteration Dependence Example

Page 21: Software Pipelining on Multi-Core Architectures

Correctness & Properties

• The multi-core final schedule represented by the schedule function is correct

• The multi-threaded final schedule is deadlock-free

• The synchronization signal guarantees that the memory accesses preceding it on the same thread unit have been committed

Page 22: Software Pipelining on Multi-Core Architectures

Experimental Framework

• The MTS method has been implemented on the Open64 compiler retargeted for the IBM Cyclops64 architecture

• Loop nests from the Livermore Suite, SPEC2000 and NAS were used

Page 23: Software Pipelining on Multi-Core Architectures

Execution Time Speedup

• MTS schedules showed very good scalability, with relative speedup between 57.5 and 81 for 99 threads
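
As a quick reading aid, the reported speedups can be converted to parallel efficiency (speedup divided by thread count). The helper below is an illustrative calculation on the two endpoint figures from the slide, not a result from the original experiments.

```python
def parallel_efficiency(speedup, num_threads):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / num_threads


low = round(parallel_efficiency(57.5, 99), 3)
high = round(parallel_efficiency(81, 99), 3)
```

On 99 threads this corresponds to roughly 58% to 82% of ideal linear scaling.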

Page 24: Software Pipelining on Multi-Core Architectures

Loop Tiling Factor

• Timing results using tiling factor Gmin match results using best empirical tiling factor