6. SubkernelFormationPass 2
Step 1. Assign top kernel for all
[Figure labels: Module, Kernel, DFG0, CFG0, CFG1, CFG2, entry, exit]
7. SubkernelFormationPass 3
[Figure labels: Module, Kernel 0, Kernel 1, kernel_0, kernel_1, Schedule, assign Sub Kernels, CFG0, CFG1, CFG2]
8. SubkernelFormationPass 1
Algorithm
1) Start at a kernel entry point that dominates all remaining blocks.
2) Create a strongly connected subgraph with N instructions and no barriers.
   a) This is a new kernel.
3) For all edges leaving the graph:
   a) save all live registers
   b) save the target block's id
   c) create a new scheduler block that includes an indirect branch to each of the targets
   d) redirect each edge to the kernel exit point
   e) create a new kernel rooted in the new scheduler block, goto 1
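
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pass's actual implementation: the CFG is a plain dictionary, the function name `form_subkernels` and the `scheduler_<n>` block names are made up for this sketch, regions are grown greedily by breadth-first search (connected, not checked for strong connectivity or dominance), and barriers and live-register saving are not modeled. Edges back into an earlier subkernel are recorded as exits but not otherwise handled.

```python
from collections import deque


def form_subkernels(cfg, entry, max_instructions):
    """Greedily split a CFG into subkernels of at most max_instructions.

    cfg maps a block id to {"size": instruction count, "succ": [block ids]}.
    Returns a list of {"blocks": [...], "exits": [...]} dicts, one per subkernel.
    All names here are hypothetical; barriers and live registers are not modeled.
    """
    cfg = {block: dict(info) for block, info in cfg.items()}  # work on a copy
    assigned = set()
    subkernels = []
    root = entry                      # step 1: entry point of the remaining graph
    while root is not None:
        # Step 2: grow a region from the root within the instruction budget.
        region, size, real_blocks = [], 0, 0
        frontier = deque([root])
        while frontier:
            block = frontier.popleft()
            if block in assigned or block in region:
                continue
            info = cfg[block]
            # The first real block is always accepted so the loop makes progress.
            if real_blocks and size + info["size"] > max_instructions:
                continue
            region.append(block)
            size += info["size"]
            real_blocks += 0 if info.get("scheduler") else 1
            frontier.extend(info["succ"])
        assigned.update(region)

        # Step 3: every edge leaving the region becomes a saved target id.
        exits = []
        for block in region:
            for target in cfg[block]["succ"]:
                if target not in region and target not in exits:
                    exits.append(target)
        subkernels.append({"blocks": region, "exits": exits})

        # Steps 3c and 3e: a scheduler block branching to each pending target
        # becomes the root of the next subkernel.
        pending = [target for target in exits if target not in assigned]
        if pending:
            root = f"scheduler_{len(subkernels)}"
            cfg[root] = {"size": 0, "succ": pending, "scheduler": True}
        else:
            root = None
    return subkernels


example_cfg = {
    "entry": {"size": 2, "succ": ["a", "b"]},
    "a": {"size": 3, "succ": ["exit"]},
    "b": {"size": 3, "succ": ["exit"]},
    "exit": {"size": 1, "succ": []},
}
for kernel in form_subkernels(example_cfg, "entry", 5):
    print(kernel)
```

With a budget of 5 instructions, the first subkernel holds `entry` and `a`, and its two leaving edges (to `b` and `exit`) are routed through a scheduler block that roots the second subkernel.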
The call instruction stores the address of the next
instruction, so execution can resume at that point after executing
a ret instruction. A call is assumed to be divergent unless the
.uni suffix is present, indicating that the call is guaranteed to
be non-divergent, meaning that all threads in a warp have identical
values for the guard predicate and call target.
(Ref: PTX ISA 2.1)
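
The uniformity condition in the quote can be written as a small check. This is a sketch only: the function name `call_is_uniform` and the list-per-warp representation are assumptions made for illustration, not part of PTX or any tool.

```python
def call_is_uniform(guards, targets):
    """Return True if all threads in a warp agree on guard predicate and call target.

    guards[i] and targets[i] are the values seen by thread i (hypothetical model).
    """
    # Identical (guard, target) pairs across the warp means the call cannot diverge.
    return len(set(zip(guards, targets))) <= 1


print(call_is_uniform([True, True, True, True], ["f", "f", "f", "f"]))
print(call_is_uniform([True, False, True, True], ["f", "f", "f", "f"]))
```

Only when this condition is guaranteed for every execution may the call carry the .uni suffix.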
28. RemoveBarrierPass 3
Example
Assign a = a + 1 in a CTA with different threads.
    a = a + 1;
    sync();    // sync memory and registers ...
    b = b + 1;
Thread 1 executes b + 1; thread 2 executes a + 1.
Thread 1 waits for thread 2 to finish ...
    bar.sync()
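
The waiting behavior can be reproduced on a CPU with Python's `threading.Barrier` standing in for bar.sync. This is a sketch, not GPU code: the function name `run_cta` is made up, and, unlike the slide, `b` is computed from a neighboring thread's `a` so that the effect of the barrier is visible in the result.

```python
import threading


def run_cta(num_threads=2):
    """Run num_threads CPU threads through a = a + 1; barrier; b = ... (illustrative)."""
    a = [0] * num_threads
    b = [0] * num_threads
    barrier = threading.Barrier(num_threads)  # stands in for bar.sync

    def thread_body(tid):
        a[tid] = a[tid] + 1                   # a = a + 1
        barrier.wait()                        # wait until every thread has updated its a
        neighbor = (tid + 1) % num_threads
        b[tid] = a[neighbor] + 1              # reads another thread's a after the barrier

    threads = [threading.Thread(target=thread_body, args=(tid,))
               for tid in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return a, b


print(run_cta(2))
```

Without the barrier, a thread could read its neighbor's `a` before the neighbor has incremented it.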
32. RemoveBarrierPass 3
Sample method: load/store to a new memory address
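
One way to read the load/store idea is sketched below, under loud assumptions: each barrier is replaced by stores of the live registers to spill slots followed by a kernel exit, and the code after the barrier starts by loading them back. The function name `split_at_barriers`, the `spill` symbol, the 4-byte slot size, and the caller-supplied live-register lists are all hypothetical, and the emitted strings are PTX-like pseudo-instructions, not valid PTX.

```python
def split_at_barriers(instructions, live_registers):
    """Split a straight-line instruction list at each bar.sync.

    live_registers[i] lists the registers live across the i-th barrier
    (assumed to be supplied by the caller; no liveness analysis is done here).
    Returns one instruction list per resulting segment.
    """
    segments = [[]]
    barrier_index = 0
    for instruction in instructions:
        if instruction.strip().startswith("bar.sync"):
            live = live_registers[barrier_index]
            # Before the barrier: store each live register to its own slot, then exit.
            for slot, register in enumerate(live):
                segments[-1].append(f"st.local [spill+{4 * slot}], {register}")
            segments[-1].append("exit")
            # After the barrier: a new segment begins by reloading the same slots.
            segments.append([f"ld.local {register}, [spill+{4 * slot}]"
                             for slot, register in enumerate(live)])
            barrier_index += 1
        else:
            segments[-1].append(instruction)
    return segments


for segment in split_at_barriers(
        ["add a, a, 1", "bar.sync 0", "add b, b, 1"], [["a", "b"]]):
    print(segment)
```

Each returned segment is barrier-free, so all threads can run the first segment to completion before any thread starts the second.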