ocelot


  • PTXOptimizer @ ocelot
    • by Sean-Chen, 2011/04/15

  • PTXOptimizer
    • SubkernelFormationPass
    • RemoveBarrierPass
    • LinearScanRegisterAllocationPass
    • MIMDThreadSchedulingPass

SubkernelFormationPass 2

  • Step 1. Assign top kernel for all
  • [Figure labels: Module, Kernel, DFG0, CFG0, CFG1, CFG2, a, b, c, entry, exit]

SubkernelFormationPass 3

  • [Figure labels: Module, Kernel 0, Kernel 1, kernel_0, kernel_1, Schedule, assign Sub Kernels, CFG0, CFG1, CFG2]

SubkernelFormationPass 1

  • Algorithm
    • 1) Start at a kernel entry point that dominates all remaining blocks.
    • 2) Create a strongly connected subgraph with N instructions and no barriers.
      • a) This is a new kernel.
    • 3) For all edges leaving the graph:
      • a) Save all live registers.
      • b) Save the target block's id.
      • c) Create a new scheduler block that includes an indirect branch to each of the targets.
      • d) Redirect each edge to the kernel exit point.
      • e) Create a new kernel rooted in the new scheduler block, then go to 1.
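
The steps above can be sketched on a toy control-flow graph. This is a simplified greedy region-growing partition written for illustration, not the code in SubkernelFormationPass.cpp: the `Block` and `Subkernel` types, the `endsWithBarrier` flag, and `formSubkernels` are all hypothetical names, and the sketch skips the dominance check, the register spills, and the scheduler block, recording only which target block ids each subkernel exits to.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <set>
#include <vector>

// Hypothetical toy basic block; not Ocelot's own block type.
struct Block {
    int instructions;             // number of instructions in the block
    bool endsWithBarrier;         // assumption: a barrier can only end a block
    std::vector<int> successors;  // indices of successor blocks
};

struct Subkernel {
    std::vector<int> blocks;    // blocks owned by this subkernel, root first
    std::set<int> exitTargets;  // block ids reached by edges leaving it
};

// Grow regions of at most maxInstructions instructions, never extending a
// region past a barrier. Every edge that leaves a region records its target
// (step 3b) and seeds a later subkernel (step 3e).
std::vector<Subkernel> formSubkernels(const std::vector<Block>& cfg,
                                      int entry, int maxInstructions) {
    assert(entry >= 0 && static_cast<std::size_t>(entry) < cfg.size());
    std::vector<int> owner(cfg.size(), -1);  // subkernel id per block
    std::vector<Subkernel> result;
    std::queue<int> roots;
    roots.push(entry);

    while (!roots.empty()) {
        int root = roots.front();
        roots.pop();
        if (owner[root] != -1) continue;  // already placed in a subkernel

        int id = static_cast<int>(result.size());
        Subkernel sk;
        int count = 0;
        std::queue<int> frontier;
        frontier.push(root);

        while (!frontier.empty()) {
            int b = frontier.front();
            frontier.pop();
            if (owner[b] != -1) continue;
            bool fits = count + cfg[b].instructions <= maxInstructions;
            if (b != root && !fits) continue;  // the root is always taken
            owner[b] = id;
            sk.blocks.push_back(b);
            count += cfg[b].instructions;
            if (cfg[b].endsWithBarrier) continue;  // do not grow past a barrier
            for (int s : cfg[b].successors) frontier.push(s);
        }

        // Edges leaving the region become exit targets and new roots.
        for (int b : sk.blocks) {
            for (int s : cfg[b].successors) {
                if (owner[s] != id) {
                    sk.exitTargets.insert(s);
                    roots.push(s);
                }
            }
        }
        result.push_back(sk);
    }
    return result;
}
```

For a four-block chain 0 → 1 → 2 → 3 where block 1 ends with a barrier, calling `formSubkernels(cfg, 0, 10)` yields two subkernels: blocks {0, 1} with exit target 2, and blocks {2, 3} with no exits. A real pass would also have to handle exit targets that land in the middle of an existing subkernel, which this sketch only records.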

Ref: SubkernelFormationPass.cpp

SubkernelFormationPass 4

  • Sample methods (pseudocode)
    • Create a new kernel
      • Kernel = new kernel();
    • Assign a new CFG to the kernel
      • New_kernel->cfg() = new CFG();
      • Org_Kernel->cfg()->update();
    • Update the PTX graph
      • PTX->cfg()->update()
    • Update the module
      • module->update()
    • Re-schedule()
      • SSA graph
      • Dominator tree
      • Control tree

SubkernelFormationPass 5

  • Why do it?
    • Reduce kernel loading cost and allow subkernels to run in parallel.
    • PS: this is a trade-off between fork-join parallelism and kernel communication.

RemoveBarrierPass 1

  • Ref: ocelot-pact.pdf

RemoveBarrierPass 2

  • How to do it?
    • Replace the barrier instruction with a function call.
  • Definition
    • The call instruction stores the address of the next instruction, so execution can resume at that point after executing a ret instruction. A call is assumed to be divergent unless the .uni suffix is present, indicating that the call is guaranteed to be non-divergent, meaning that all threads in a warp have identical values for the guard predicate and call target.
  • Ref: PTX ISA 2.1

RemoveBarrierPass 3

  • Example
    • Assign a = a + 1 in a CTA with different threads.
      • a = a + 1; sync(); // @ sync mem reg ....
      • b = b + 1;
      • b + 1 = thread1
      • a + 1 = thread2
    • thread1 waits for thread2 to finish...
      • bar.sync()

RemoveBarrierPass 3

  • Sample methods (pseudocode): load/store to a new memory address
    • Find the branch location and replace it with a function call
      • Brn = Kernel->cfg()->terminator()->Branch();
      • Instruction *IT = new Instruction(IR::FunctionCall);
      • Kernel->cfg()->insert(IT);
      • Kernel->cfg()->remove(Brn);
    • Assign the function call type
      • IT->d() = IR::addressType; // dest register
      • IT->a() = IR::addressType; // source register
      • IT->type() = IR::FunctionCall;
    • Link update
      • IT->Predecessor()->update()
      • IT->Successor()->update()
    • Call back to the original pointer
      • new end pointer = org end pointer
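
The rewrite described above can be illustrated on a flat instruction list. This is a minimal sketch under stated assumptions, not Ocelot's RemoveBarrierPass: the `Opcode` enum, the `Instruction` struct, `replaceBarriersWithCalls`, and the handler name passed in are all hypothetical, and the sketch leaves out the load/store of live state that the slide mentions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy opcodes; a real PTX IR has many more.
enum class Opcode { Add, Barrier, Call, Ret };

// Hypothetical toy instruction: an opcode plus a textual operand.
struct Instruction {
    Opcode op;
    std::string operand;  // operands, or the call target for Opcode::Call
};

// Replace every barrier with a call to `handler` and return how many
// instructions were rewritten. The call is expected to return to the
// instruction that followed the barrier, which stays in place.
int replaceBarriersWithCalls(std::vector<Instruction>& body,
                             const std::string& handler) {
    assert(!handler.empty());
    int replaced = 0;
    for (Instruction& inst : body) {
        if (inst.op != Opcode::Barrier) continue;
        inst.op = Opcode::Call;
        inst.operand = handler;
        ++replaced;
    }
    return replaced;
}
```

Rewriting in place keeps the instruction order intact, so the instruction after the former barrier is still the resume point once the call returns.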

RemoveBarrierPass 4

  • Why do it?
    • Reduce the time threads spend waiting at each barrier synchronization check.

LinearScanRegisterAllocationPass 1

  • [Figure residue: @ %r1]
  • Ref: ocelot-pact.pdf

LinearScanRegisterAllocationPass 2

  • Sample methods (pseudocode)
    • Based on the SSA graph
      • Find PHI nodes
        • kernel->dfg()->hasPHINode()?
      • Replace all live-in values in each PHI node
        • Foreach(kernel->dfg()->PHINode()->aliveIn())...
      • Update the graph
        • kernel->cfg()->update()
        • Predecessor
        • Successor
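
The slides do not spell out the allocation step itself. As a point of comparison, here is a minimal sketch of the textbook linear scan algorithm over live intervals; it is not taken from Ocelot's LinearScanRegisterAllocationPass. The `Interval` struct and `linearScan` function are hypothetical, and the sketch assumes live intervals have already been computed (for example from the SSA graph after PHI nodes are resolved).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical live interval for one virtual register; not an Ocelot type.
struct Interval {
    int vreg;      // virtual register number
    int start;     // first instruction index where the value is live
    int end;       // last instruction index where the value is live
    int preg;      // assigned physical register, or -1 if none
    bool spilled;  // true if the value has to live in memory instead
};

// Walk intervals by start point, keep the currently live ("active") ones,
// and spill the interval that ends last when no physical register is free.
void linearScan(std::vector<Interval>& intervals, int numRegs) {
    assert(numRegs > 0);
    for (Interval& iv : intervals) { iv.preg = -1; iv.spilled = false; }
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });

    std::vector<Interval*> active;  // kept sorted by increasing end point
    std::vector<int> freeRegs;      // register 0 is handed out first
    for (int r = numRegs - 1; r >= 0; --r) freeRegs.push_back(r);

    auto addActive = [&active](Interval* iv) {
        auto pos = std::upper_bound(
            active.begin(), active.end(), iv,
            [](const Interval* a, const Interval* b) { return a->end < b->end; });
        active.insert(pos, iv);
    };

    for (Interval& cur : intervals) {
        // Expire intervals that ended before the current one starts.
        while (!active.empty() && active.front()->end < cur.start) {
            freeRegs.push_back(active.front()->preg);
            active.erase(active.begin());
        }
        if (freeRegs.empty()) {
            Interval* last = active.back();  // active interval ending last
            if (last->end > cur.end) {
                cur.preg = last->preg;       // steal its register
                last->preg = -1;
                last->spilled = true;
                active.pop_back();
                addActive(&cur);
            } else {
                cur.spilled = true;          // current interval ends last
            }
        } else {
            cur.preg = freeRegs.back();
            freeRegs.pop_back();
            addActive(&cur);
        }
    }
}
```

With two physical registers and intervals [0,10], [1,3], [2,5], [4,6], the long interval [0,10] is the one spilled, and the three short intervals share the two registers. Spilled values are the ones that would be moved out of registers into memory.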

LinearScanRegisterAllocationPass 3

  • Why do it?
    • Replace registers with local shared memory.
    • More parallelism in thread access.
    • More data sharing.

MIMDThreadSchedulingPass 1

  • Definition
    • Predicated execution
      • .reg .pred p, q, r;
    • Example
      • C source
        • if (i < n)
        •     j = j + 1;
      • PTX
        • setp.lt.s32 p, i, n; // compare i to n
        • @!p bra L1; // if false, branch over
        • add.s32 j, j, 1;
        • L1: ...

  • [Figure labels: j=j+1, j=j+1]

MIMDThreadSchedulingPass 2

  • Sample methods (pseudocode)
    • Find the branch instruction and the dominators
      • Dom = kernel->dominator_tree();
      • Post = kernel->post_dominator_tree();
      • kernel->terminator()->hasBranch()?
    • Replace the branch with a predicated instruction
      • Instruction *IT = new Instruction(IR::Instruction::Pred);
      • kernel->Instruction->Insert(IT);
      • kernel->Instruction->erase(Bn);
    • Update the graph
      • kernel->cfg()->update();
      • kernel->PTX()->update();
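
Replacing a branch with predicated instructions can be sketched on a flat instruction list. This is a toy if-conversion written for illustration, not Ocelot's MIMDThreadSchedulingPass: the `Kind` enum, the `Inst` struct, and the `negate` and `ifConvert` helpers are hypothetical. It only handles a guarded forward branch that skips a straight run of unguarded instructions, and it assumes those instructions are safe to predicate and do not redefine the predicate.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Kind { Plain, Branch, Label };

// Hypothetical toy instruction. `guard` is "", "p", or "!p"; `text` holds
// the opcode and operands, the branch target, or the label name.
struct Inst {
    Kind kind;
    std::string guard;
    std::string text;
};

// Flip a guard: "p" becomes "!p" and "!p" becomes "p".
std::string negate(const std::string& guard) {
    assert(!guard.empty());
    return guard[0] == '!' ? guard.substr(1) : "!" + guard;
}

// Find the first pattern "@g bra L; <plain instructions>; L:" and turn it
// into "@!g <plain instructions>; L:". Returns true if a rewrite happened.
bool ifConvert(std::vector<Inst>& code) {
    for (std::size_t i = 0; i < code.size(); ++i) {
        if (code[i].kind != Kind::Branch || code[i].guard.empty()) continue;

        // Skip the run of unguarded plain instructions after the branch.
        std::size_t j = i + 1;
        while (j < code.size() && code[j].kind == Kind::Plain &&
               code[j].guard.empty()) {
            ++j;
        }
        bool reachesTarget = j < code.size() && code[j].kind == Kind::Label &&
                             code[j].text == code[i].text;
        if (!reachesTarget) continue;

        // The skipped instructions run exactly when the branch is not taken.
        std::string bodyGuard = negate(code[i].guard);
        for (std::size_t k = i + 1; k < j; ++k) code[k].guard = bodyGuard;
        code.erase(code.begin() + static_cast<std::ptrdiff_t>(i));
        return true;
    }
    return false;
}
```

Applied to the sequence `setp.lt.s32 p, i, n; @!p bra L1; add.s32 j, j, 1; L1:`, the branch disappears and the add becomes `@p add.s32 j, j, 1`, so all threads follow the same instruction stream and the predicate decides which ones commit the result.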

MIMDThreadSchedulingPass 3

  • Why do it?
    • More parallelism in thread access.

Reference

  • gpuocelot
    • http://code.google.com/p/gpuocelot/wiki/References