ocelot

PTXOptimizer @ ocelot by Sean-Chen, 2011/04/15

PTXOptimizer

SubkernelFormationPass
RemoveBarrierPass
LinearScanRegisterAllocationPass
MIMDThreadSchedulingPass

SubkernelFormationPass

Algorithm

1) start at a kernel entry point that dominates all remaining blocks
2) create a strongly connected subgraph with N instructions and no barriers
   a) this is a new kernel
3) for all edges leaving the graph
   a) save all live registers
   b) save the target block's id
   c) create a new scheduler block that includes an indirect branch to each of the targets
   d) redirect each edge to the kernel exit point
   e) create a new kernel rooted in the new scheduler block, goto 1
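
The partitioning loop in steps 1) to 3) can be sketched on a toy control-flow graph. This is a minimal sketch, not Ocelot's implementation: Block, Subkernel and formSubkernels are hypothetical names, the region is grown breadth-first from its root instead of being checked for dominance or strong connectivity, a block with a barrier is placed alone in its own subkernel, and saving live registers (step 3a) is not modeled.

```cpp
#include <queue>
#include <set>
#include <utility>
#include <vector>

// Hypothetical toy basic block; not Ocelot's CFG type.
struct Block {
    int instructions;             // number of instructions in the block
    bool hasBarrier;              // true if the block contains a barrier
    std::vector<int> successors;  // ids of the successor blocks
};

// One subkernel: its blocks and the edges (from, to) that leave it.
struct Subkernel {
    std::vector<int> blocks;
    std::vector<std::pair<int, int>> exits;
};

// Split the graph into subkernels of at most maxInstructions instructions.
// A root block is always accepted, even if it alone exceeds the budget,
// so the loop always makes progress.
std::vector<Subkernel> formSubkernels(const std::vector<Block>& cfg,
                                      int entry, int maxInstructions) {
    std::vector<Subkernel> result;
    std::vector<bool> assigned(cfg.size(), false);
    std::queue<int> roots;
    roots.push(entry);

    while (!roots.empty()) {
        int root = roots.front();
        roots.pop();
        if (assigned[root]) continue;

        Subkernel sk;
        int budget = maxInstructions;
        std::queue<int> frontier;
        frontier.push(root);

        while (!frontier.empty()) {
            int b = frontier.front();
            frontier.pop();
            if (assigned[b]) continue;
            bool isRoot = sk.blocks.empty();
            // A non-root block joins only without a barrier and within budget.
            if (!isRoot && (cfg[b].hasBarrier || cfg[b].instructions > budget))
                continue;
            assigned[b] = true;
            sk.blocks.push_back(b);
            budget -= cfg[b].instructions;
            // Do not grow past a barrier block.
            if (!cfg[b].hasBarrier)
                for (int s : cfg[b].successors) frontier.push(s);
        }

        // Step 3: record every edge leaving the region; each target
        // becomes the root of a later subkernel (step 3e, "goto 1").
        std::set<int> members(sk.blocks.begin(), sk.blocks.end());
        for (int b : sk.blocks)
            for (int s : cfg[b].successors)
                if (!members.count(s)) {
                    sk.exits.push_back({b, s});
                    roots.push(s);
                }
        result.push_back(sk);
    }
    return result;
}
```

In a real pass the recorded exit edges would be redirected to the kernel exit point and the saved target id would drive the indirect branch in the scheduler block.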

Sample methods

Create a new kernel

New_kernel = new Kernel();

Assign a new CFG to the kernel

New_kernel->cfg() = new CFG();
Org_Kernel->cfg()->update();

Update PTX graph

PTX->cfg()->update()

Update module

module->update()

Re-schedule()

SSA graph
Dominator tree
Control tree
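
As an illustration of what recomputing the dominator tree involves after the graph changes, here is a minimal sketch of the standard iterative dominator computation on a toy graph. It does not use Ocelot's classes: the function name dominators and the predecessor-list representation are assumptions made for this example, and block 0 is assumed to be the entry.

```cpp
#include <cstddef>
#include <vector>

// preds[b] lists the predecessors of block b; block 0 is the entry.
// Returns dom, where dom[b][d] is true when block d dominates block b.
std::vector<std::vector<bool>> dominators(
        const std::vector<std::vector<int>>& preds) {
    const std::size_t n = preds.size();
    // Start from "every block dominates every block" and shrink the sets.
    std::vector<std::vector<bool>> dom(n, std::vector<bool>(n, true));
    if (n == 0) return dom;

    // The entry is dominated only by itself.
    dom[0].assign(n, false);
    dom[0][0] = true;

    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t b = 1; b < n; ++b) {
            // Intersect the dominator sets of all predecessors.
            std::vector<bool> next(n, true);
            for (int p : preds[b])
                for (std::size_t d = 0; d < n; ++d)
                    next[d] = next[d] && dom[p][d];
            // A block without predecessors is unreachable in this sketch.
            if (preds[b].empty()) next.assign(n, false);
            next[b] = true;  // every block dominates itself
            if (next != dom[b]) {
                dom[b] = next;
                changed = true;
            }
        }
    }
    return dom;
}
```

For a diamond-shaped graph (0 branches to 1 and 2, both join at 3), block 3 is dominated only by block 0 and by itself, because neither side of the diamond lies on every path to it.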

Why do it?

Reduce kernel loading and enable parallel execution.

PS: this is a trade-off between fork-join and kernel communication.

RemoveBarrierPass

How to do it?

Replace the barrier instruction with a function call.

Definition

The call instruction stores the address of the next instruction, so execution can resume at that point after executing a ret instruction. A call is assumed to be divergent unless the .uni suffix is present, indicating that the call is guaranteed to be non-divergent, meaning that all threads in a warp have identical values for the guard predicate and call target.

Ref: PTX ISA 2.1

Example

Assign a = a + 1 in a CTA with different threads.

a = a + 1;
sync(); // @ sync mem reg ....
b = b + 1;

b + 1 = thread1
a + 1 = thread2

thread1 waits for thread2 to finish...

bar.sync()

Sample methods (load/store to a new memory address)

Find the branch location and replace it with a function call

Brn = Kernel->cfg()->terminator()->Branch();
Instruction *IT = new Instruction(IR::FunctionCall);
Kernel->cfg()->insert(IT);
Kernel->cfg()->remove(Brn);

Assign Function call type

IT->d() = IR::addressType;    // destination register
IT->a() = IR::addressType;    // source register
IT->type() = IR::FunctionCall;

Link update

IT->Predecessor()->update();
IT->Successor()->update();

Call back to original pointer

new end pointer = org end pointer

Why do it?

Reduce the time threads spend waiting at each barrier synchronization check.
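
The idea of turning a barrier into a return to a scheduler can be sketched as follows, using the a = a + 1; sync(); b = b + 1 example above. This is only an illustration under stated assumptions: ThreadContext, runUntilBarrier and runCta are hypothetical names, the threads of the CTA are run one after another on a single host thread, and the values that live across the barrier are kept in a per-thread context in memory.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-thread context: values that must survive the barrier
// are stored in memory so that the thread can be resumed later.
struct ThreadContext {
    int resumePoint = 0;  // 0: before the barrier, 1: after it, 2: finished
    int a = 0;
    int b = 0;
};

// Kernel body split at the barrier. Instead of waiting in bar.sync,
// the thread records where to resume and returns to the scheduler.
void runUntilBarrier(ThreadContext& t) {
    switch (t.resumePoint) {
    case 0:
        t.a = t.a + 1;      // code before the barrier
        t.resumePoint = 1;  // barrier reached: yield to the scheduler
        return;
    case 1:
        t.b = t.b + 1;      // code after the barrier
        t.resumePoint = 2;  // thread finished
        return;
    default:
        return;
    }
}

// Scheduler: each pass runs every unfinished thread up to its next
// yield point, so all threads pass the barrier before any continues.
std::vector<ThreadContext> runCta(std::size_t threadCount) {
    std::vector<ThreadContext> threads(threadCount);
    bool anyRan = true;
    while (anyRan) {
        anyRan = false;
        for (ThreadContext& t : threads) {
            if (t.resumePoint != 2) {
                runUntilBarrier(t);
                anyRan = true;
            }
        }
    }
    return threads;
}
```

Calling runCta(4) runs four emulated threads; after it returns, every context holds a == 1 and b == 1, and no thread executed b = b + 1 before all threads had executed a = a + 1.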

LinearScanRegisterAllocationPass

Based on the SSA graph

Find PHINodes

kernel->dfg()->hasPHINode()?

Replace all live-in values in the PHINode

foreach (kernel->dfg()->PHINode()->aliveIn()) ...

Update graph

kernel->cfg()->update()
Predecessor
Successor

Why do it?

Replace registers with local/shared memory.

More parallelism in thread access.
More data sharing.
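
For reference, the textbook linear scan algorithm that gives the pass its name can be sketched as below. This is the generic algorithm over live intervals, not a transcription of Ocelot's pass: Interval and linearScan are hypothetical names, the live intervals are assumed to be computed already, and a spilled value (reg == -1) stands for a value that would be kept in memory instead of a register.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical live interval of one virtual register.
struct Interval {
    int start;  // first instruction index where the value is live
    int end;    // last instruction index where the value is live
    int reg;    // assigned physical register, or -1 if spilled to memory
};

// Classic linear scan: walk the intervals by increasing start point and
// keep at most registerCount of them in registers at any time.
// Returns the number of spilled intervals.
int linearScan(std::vector<Interval>& intervals, int registerCount) {
    std::vector<Interval*> order;
    for (Interval& i : intervals) order.push_back(&i);
    std::sort(order.begin(), order.end(),
              [](const Interval* x, const Interval* y) {
                  return x->start < y->start;
              });

    std::vector<int> freeRegs;
    for (int r = registerCount - 1; r >= 0; --r) freeRegs.push_back(r);

    std::vector<Interval*> active;  // intervals currently in registers
    int spills = 0;
    for (Interval* cur : order) {
        // Expire intervals that ended before the current one starts.
        for (auto it = active.begin(); it != active.end();) {
            if ((*it)->end < cur->start) {
                freeRegs.push_back((*it)->reg);
                it = active.erase(it);
            } else {
                ++it;
            }
        }
        if (!freeRegs.empty()) {
            cur->reg = freeRegs.back();
            freeRegs.pop_back();
            active.push_back(cur);
        } else {
            // No register left: spill whichever interval ends last.
            auto victim = std::max_element(
                active.begin(), active.end(),
                [](const Interval* x, const Interval* y) {
                    return x->end < y->end;
                });
            if (victim != active.end() && (*victim)->end > cur->end) {
                cur->reg = (*victim)->reg;  // take over the register
                (*victim)->reg = -1;        // the victim goes to memory
                *victim = cur;
            } else {
                cur->reg = -1;              // the current interval is spilled
            }
            ++spills;
        }
    }
    return spills;
}
```

With the intervals [0,10], [1,3], [2,8], [4,6] and two registers, the long interval [0,10] is the one that gets spilled, and the register freed when [1,3] ends is reused for [4,6].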

MIMDThreadSchedulingPass

Definition

Predicated Execution

.reg .pred p, q, r;

Example

if (i < n)
    j = j + 1;

setp.lt.s32 p, i, n; // compare i to n
@!p bra L1;          // if false, branch over
add.s32 j, j, 1;
L1: ...

Find the branch instruction and the dominators

Dom = kernel->dominator_tree();
Post = kernel->post_dominator_tree();
kernel->terminator()->hasBranch()?

Replace the branch with a predicated instruction

Instruction *IT = new Instruction(IR::Instruction::Pred);
kernel->Instruction->Insert(IT);
kernel->Instruction->erase(Bn);

Update graph

kernel->cfg()->update();
kernel->PTX()->update();

Why do it?

More parallelism in thread access.
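
The branch-to-predicate replacement can be illustrated on the PTX example above. The following is a toy sketch, not Ocelot code: Inst and ifConvert are hypothetical names, instructions are kept as plain strings, and only the simplest pattern is handled, namely a branch guarded by a negated predicate that skips exactly one unguarded instruction.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy instruction: either a label line or a guarded instruction.
struct Inst {
    std::string label;  // non-empty if this line is a label such as "L1"
    std::string guard;  // "", "p" or "!p"
    std::string text;   // e.g. "add.s32 j, j, 1" or "bra L1"
};

// Rewrite   @!p bra L;  inst;  L:   into   @p inst;  L:
std::vector<Inst> ifConvert(const std::vector<Inst>& code) {
    std::vector<Inst> out;
    for (std::size_t i = 0; i < code.size(); ++i) {
        const Inst& br = code[i];
        bool matches =
            i + 2 < code.size() &&
            br.label.empty() &&
            br.guard.size() > 1 && br.guard[0] == '!' &&  // negated guard
            code[i + 1].label.empty() &&
            code[i + 1].guard.empty() &&                  // one plain instruction
            !code[i + 2].label.empty() &&
            br.text == "bra " + code[i + 2].label;        // branch skips it
        if (matches) {
            Inst predicated = code[i + 1];
            predicated.guard = br.guard.substr(1);  // "!p" becomes "p"
            out.push_back(predicated);
            i += 1;  // skip the instruction just emitted; the label is kept
        } else {
            out.push_back(br);
        }
    }
    return out;
}
```

Applied to the four lines of the example, the result is setp.lt.s32 p, i, n; followed by @p add.s32 j, j, 1; and the label L1, so the threads of a warp no longer take different paths at this point.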

gpuocelot

http://code.google.com/p/gpuocelot/wiki/References
