GPU Memory II ─ Memory Hardware and Bank Conflict
Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes


  • GPU Memory II

GPU Memory II ─ Memory Hardware and Bank Conflict

    Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

  • GPU Memory II

    CUDA Device Memory Space: Review

    Each thread can:
    - R/W per-thread registers
    - R/W per-thread local memory
    - R/W per-block shared memory
    - R/W per-grid global memory
    - Read only per-grid constant memory
    - Read only per-grid texture memory

    The host can R/W global, constant, and texture memories.

    [Figure: the (Device) Grid contains Block (0,0) and Block (1,0); each block has its own Shared Memory, and each of its threads (Thread (0,0), Thread (1,0)) has its own Registers and Local Memory; Global, Constant, and Texture Memory are shared across the grid and accessible from the Host.]
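    A minimal CUDA sketch of these memory spaces (illustrative only; the kernel and variable names are made up for this example):

    __constant__ float kBias;                 // per-grid constant memory (read-only in kernels)

    __global__ void spacesDemo(float* gOut)   // gOut points into per-grid global memory
    {
        __shared__ float tile[16];            // per-block shared memory
        float r = threadIdx.x * 2.0f;         // 'r' lives in a per-thread register
                                              // (large per-thread arrays would spill to local memory)
        tile[threadIdx.x] = r;                // assumes a 16-thread block
        __syncthreads();
        gOut[threadIdx.x] = tile[threadIdx.x] + kBias;
    }

    // Host side: cudaMemcpyToSymbol(kBias, &h_bias, sizeof(float));
    //            spacesDemo<<<1, 16>>>(d_out);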

  • GPU Memory II

    Parallel Memory Sharing

    Local Memory: per-thread
    - Private per thread
    - Auto variables, register spill

    Shared Memory: per-block
    - Shared by threads of the same block
    - Inter-thread communication

    Global Memory: per-application
    - Shared by all threads
    - Inter-grid communication

    [Figure: each thread has its own Local Memory; each block has its own Shared Memory; sequential grids in time (Grid 0, Grid 1, ...) share Global Memory.]

  • GPU Memory II

    Hardware Overview

    The Streaming Processor Array consists of Thread Processor Clusters (TPCs). Each TPC contains Streaming Multiprocessors (SMs) plus a texture unit (TEX). Each SM contains an Instruction Fetch/Dispatch unit, an Instruction L1 cache, a Data L1 cache, Shared Memory, Streaming Processors (SPs), and Special Function Units (SFUs).

    [Figure: a Streaming Processor Array of eight TPCs; each TPC holds SMs and a TEX unit; each SM holds Instruction Fetch/Dispatch, Instruction L1, Data L1, Shared Memory, eight SPs, and two SFUs.]

  • GPU Memory II

    Register File

    - The Register File (RF) is 32 KB per SM and provides 4 operands/clock.
    - The texture pipe can also read/write the RF; 2 SMs share 1 TEX.
    - The Load/Store pipe can also read/write the RF.

    [Figure: SM datapath showing I$ L1, the Multithreaded Instruction Buffer, RF, C$ L1, Shared Mem, Operand Select, and the MAD and SFU units.]

  • GPU Memory II

    Programmer View of Register File

    - There are 8192 registers in each SM in G80.
    - Registers are dynamically partitioned across all Blocks assigned to the SM.
    - Once assigned to a Block, a register is NOT accessible by threads in other Blocks.
    - Each thread in a Block can only access registers assigned to itself.

    [Figure: the register file partitioned among 4 blocks vs. 3 blocks.]

  • GPU Memory II

    Matrix Multiplication Example

    If each Block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM?
    - Each Block requires 10 * 256 = 2560 registers.
    - 8192 = 3 * 2560 + change.
    - So three Blocks can run on an SM, as far as registers are concerned.

    What if each thread increases its register use by 1?
    - Each Block now requires 11 * 256 = 2816 registers.
    - 8192 < 2816 * 3.
    - Only two Blocks can run on an SM: a 1/3 reduction of parallelism!
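    A quick sketch of this arithmetic in C (illustrative; maxBlocksPerSM is a made-up helper that considers registers only, ignoring the other occupancy limits):

    #include <stdio.h>

    // How many blocks fit in the SM's register file?
    int maxBlocksPerSM(int regsPerThread, int threadsPerBlock, int regsPerSM)
    {
        int regsPerBlock = regsPerThread * threadsPerBlock;
        return regsPerSM / regsPerBlock;   // integer division discards the "change"
    }

    int main(void)
    {
        printf("%d\n", maxBlocksPerSM(10, 256, 8192));  // 3 blocks
        printf("%d\n", maxBlocksPerSM(11, 256, 8192));  // 2 blocks
        return 0;
    }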

  • GPU Memory II

    More on Dynamic Partitioning

    - Dynamic partitioning gives more flexibility to compilers/programmers.
    - One can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each.
    - This allows for finer-grain threading than traditional CPU threading models.
    - The compiler can trade off between instruction-level parallelism and thread-level parallelism.

  • GPU Memory II

    Constant

    - Immediate address constants
    - Indexed address constants
    - Constants are stored in DRAM and cached on chip (L1 per SM).
    - A constant value can be broadcast to all threads in a Warp.
    - Extremely efficient way of accessing a value that is common for all threads in a Block!

    [Figure: SM datapath with the constant cache (C$ L1) feeding Operand Select alongside RF and Shared Mem.]
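    A minimal sketch of declaring and filling constant memory (illustrative; the symbol name coeffs and the sizes are made up; cudaMemcpyToSymbol is the runtime call for writing it from the host):

    __constant__ float coeffs[16];    // lives in constant memory, cached per SM

    __global__ void scale(float* data)
    {
        // Every thread in the warp reads coeffs[0]: one broadcast, no penalty.
        data[threadIdx.x] *= coeffs[0];
    }

    // Host side:
    //   float h_coeffs[16] = { /* ... */ };
    //   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
    //   scale<<<1, 256>>>(d_data);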

  • GPU Memory II

    Shared Memory

    - Each multiprocessor has 16 KB of Shared Memory.
    - Organized as 16 banks of 32-bit words.
    - We will discuss access patterns later.
    - Visible to all threads in a thread block, with read and write access.

    [Figure: SM datapath with Shared Mem alongside RF and C$ L1.]

  • GPU Memory II

    Matrix Multiplication Example

    Explore a tile-based implementation with Shared Memory.

    Questions:
    - How is shared memory organized?
    - What are the issues when accessing shared memory?

  • GPU Memory II

    Tile-Based Multiplication

    - One block computes one square sub-matrix Psub of size BLOCK_SIZE.
    - One thread computes one element of Psub.
    - Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square in shape.

    [Figure: P partitioned into BLOCK_SIZE x BLOCK_SIZE tiles; block indices (bx, by) select the tile Psub and thread indices (tx, ty) in 0..bsize-1 select its element; the corresponding strips of M (M.height x M.width) and N (N.height x N.width) feed the tile.]

  • GPU Memory II

    Tiled Matrix Multiplication Kernel ─ Review

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
    1.  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    2.  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    3.  int bx = blockIdx.x;  int by = blockIdx.y;
    4.  int tx = threadIdx.x; int ty = threadIdx.y;

    5.  int Row = by * TILE_WIDTH + ty;
    6.  int Col = bx * TILE_WIDTH + tx;

    7.  float Pvalue = 0;

    8.  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
          // Collaborative loading of Md and Nd tiles into shared memory
    9.    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    10.   Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
    11.   __syncthreads();

    12.   for (int k = 0; k < TILE_WIDTH; ++k)
    13.     Pvalue += Mds[ty][k] * Nds[k][tx];
    14.   __syncthreads();
    15. }
    16. Pd[Row*Width+Col] = Pvalue;
    }
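    A hedged host-side sketch of launching this kernel (assumes square matrices with Width a multiple of TILE_WIDTH; error checking omitted; names other than MatrixMulKernel are made up):

    #define TILE_WIDTH 16

    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        size_t bytes = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;
        cudaMalloc(&Md, bytes);  cudaMalloc(&Nd, bytes);  cudaMalloc(&Pd, bytes);
        cudaMemcpy(Md, M, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(Nd, N, bytes, cudaMemcpyHostToDevice);

        dim3 block(TILE_WIDTH, TILE_WIDTH);                 // one thread per Psub element
        dim3 grid(Width / TILE_WIDTH, Width / TILE_WIDTH);  // one block per tile
        MatrixMulKernel<<<grid, block>>>(Md, Nd, Pd, Width);

        cudaMemcpy(P, Pd, bytes, cudaMemcpyDeviceToHost);
        cudaFree(Md);  cudaFree(Nd);  cudaFree(Pd);
    }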

  • GPU Memory II

    Shared Memory Organization

    Parallel memory architecture:
    - Memory is divided into banks; this is essential to achieve high bandwidth.
    - Each bank can service one address per cycle.
    - A memory can service as many simultaneous accesses as it has banks.
    - Multiple simultaneous accesses to a bank result in a bank conflict.
    - Conflicting accesses are serialized.

    [Figure: shared memory drawn as Bank 0 through Bank 15.]

  • GPU Memory II

    Shared Memory Access Issues

    No bank conflicts:
    - Linear addressing, stride == 1.
    - Random 1:1 permutation.

    [Figure: Threads 0-15 mapped one-to-one onto Banks 0-15, first in order (stride 1), then in a permuted order.]

  • GPU Memory II

    Shared Memory Access Issues

    2-way bank conflicts:
    - Linear addressing, stride == 2.

    8-way bank conflicts:
    - Linear addressing, stride == 8.

    [Figure: with stride 2, threads i and i+8 land on the same bank (two threads per bank); with stride 8, eight threads land on each of banks 0 and 8.]

  • GPU Memory II

    How Addresses Map to Banks in CUDA

    - Each bank has a bandwidth of 32 bits per clock cycle.
    - Successive 32-bit words are assigned to successive banks.
    - G80 has 16 banks, so bank = (word address) % 16.
    - 16 is also the size of a half-warp: there are no bank conflicts between different half-warps, only within a single half-warp.
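    A tiny sketch of this mapping (bankOf is a made-up helper; 16 banks is the G80 figure above):

    // Bank index of a 32-bit shared-memory word on G80-class hardware.
    int bankOf(int wordIndex) { return wordIndex % 16; }

    // Example: for __shared__ float s[N], thread t reading s[base + stride*t]
    // hits bank bankOf(base + stride*t). With stride 1 a half-warp's 16 threads
    // cover all 16 banks; with stride 2 they cover only 8, two threads per bank.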

  • GPU Memory II

    Shared Memory Performance

    Shared memory is as fast as registers if there are no bank conflicts.

    The fast case:
    - If all threads of a half-warp access different banks, there is no bank conflict.
    - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast).

    The slow case:
    - Bank conflict: multiple threads in the same half-warp access the same bank.
    - The accesses must be serialized.
    - Cost = max # of simultaneous accesses to a single bank.

  • GPU Memory II

    Linear Addressing (1D)

    Given:

    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];

    This is only bank-conflict-free if s shares no common factors with the number of banks: 16 on G80, so s must be odd.

    [Figure: thread-to-bank mappings for s=1 and s=3, both conflict-free.]
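    The rule generalizes: for stride s on 16 banks, the conflict degree is gcd(s, 16), so any odd s is conflict-free. A small illustrative helper (conflictDegree is a made-up name):

    int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    // Number of threads of a half-warp that collide on one bank when
    // addressing shared memory with stride s on a 16-bank machine.
    int conflictDegree(int s) { return gcd(s, 16); }

    // conflictDegree(1) == 1 and conflictDegree(3) == 1 (conflict-free);
    // conflictDegree(2) == 2 (2-way); conflictDegree(8) == 8 (8-way).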

  • GPU Memory II

    Data Types and Bank Conflicts

    This has no conflicts if the type of shared is 32 bits:

    foo = shared[baseIndex + threadIdx.x];

    But not if the data type is smaller.

    4-way bank conflicts:

    __shared__ char shared[];
    foo = shared[baseIndex + threadIdx.x];

    2-way bank conflicts:

    __shared__ short shared[];
    foo = shared[baseIndex + threadIdx.x];

    [Figure: four consecutive chars (or two consecutive shorts) share one 32-bit bank word, so consecutive threads collide on the same bank.]

  • GPU Memory II

    Structs and Bank Conflicts

    Struct assignments compile into as many memory accesses as there are struct members:

    struct vector { float x, y, z; };
    struct myType {
        float f;
        int   c;
    };
    __shared__ struct vector vectors[64];
    __shared__ struct myType myTypes[64];

    This has no bank conflicts for vector; the struct size is 3 words, so each thread makes 3 accesses to contiguous banks (3 shares no common factor with 16):

    struct vector v = vectors[baseIndex + threadIdx.x];

    This has 2-way bank conflicts for myType (2 accesses per thread):

    struct myType m = myTypes[baseIndex + threadIdx.x];

  • GPU Memory II

    Common Array Bank Conflict Patterns (1D)

    Each thread loads 2 elements into shared memory. 2-way-interleaved loads result in 2-way bank conflicts:

    int tid = threadIdx.x;
    shared[2*tid]   = global[2*tid];
    shared[2*tid+1] = global[2*tid+1];

    This makes sense for traditional CPU threads: locality in cache line usage and reduced sharing traffic. Not in shared memory usage, where there are no cache line effects but there are banking effects.

    [Figure: threads 0, 1, 2, ... mapped two-per-bank onto banks 0-15.]

  • GPU Memory II

    A Better Array Access Pattern

    Each thread loads one element in every consecutive group of blockDim elements:

    shared[tid]              = global[tid];
    shared[tid + blockDim.x] = global[tid + blockDim.x];

    [Figure: threads 0-15 mapped one-to-one onto banks 0-15 for each load.]

  • GPU Memory II

    Common Bank Conflict Patterns (2D)

    Operating on a 2D array of floats in shared memory, e.g. image processing.

    Example: 16x16 block
    - Each thread processes a row, so the threads in a block access the elements in each column simultaneously (example: row 1 in purple).
    - 16-way bank conflicts: rows all start at bank 0.

    Bank indices without padding: every row is 0 1 2 3 4 5 6 7 ... 15, so a column lies entirely in one bank.

    Solution 1) Pad the rows:
    - Add one float to the end of each row.

    Bank indices with padding: row r starts at bank r (0 1 2 ... 15, then 1 2 3 ... 0, then 2 3 4 ... 1, ...), so a column touches all 16 banks.

    Solution 2) Transpose before processing:
    - Suffer bank conflicts during the transpose.
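    A minimal sketch of Solution 1 (the padded column is the only change; the kernel and names are illustrative):

    #define BLOCK 16

    __global__ void rowProcess(float* img)
    {
        // +1 pad per row: row r now starts at bank (r * 17) % 16 == r,
        // so the 16 threads reading one column hit 16 different banks.
        __shared__ float tile[BLOCK][BLOCK + 1];

        int tx = threadIdx.x, ty = threadIdx.y;
        tile[ty][tx] = img[ty * BLOCK + tx];
        __syncthreads();

        // Column access: conflict-free thanks to the padding
        // (here the tile is simply written back transposed).
        img[ty * BLOCK + tx] = tile[tx][ty];
    }

    // Usage: rowProcess<<<1, dim3(BLOCK, BLOCK)>>>(d_img);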

  • GPU Memory II

    Does Matrix Multiplication Incur Shared Memory Bank Conflicts?

    12.   for (int k = 0; k < TILE_WIDTH; ++k)
    13.     Pvalue += Mds[ty][k] * Nds[k][tx];
    14.   __syncthreads();
    15. }

    [Figure: Md, Nd, and Pd as WIDTH x WIDTH matrices; the TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub is computed from the corresponding strips of Md and Nd.]

  • GPU Memory II

    Does Matrix Multiplication Incur Shared Memory Bank Conflicts?

    12.   for (int k = 0; k < TILE_WIDTH; ++k)
    13.     Pvalue += Mds[ty][k] * Nds[k][tx];
    14.   __syncthreads();
    15. }

    In flattened (row-major) form the two accesses are:

    Mds[ty*TILE_WIDTH + k]    Nds[k*TILE_WIDTH + tx]
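    A host-side sketch that prints which word and bank each lane of a half-warp touches for the two accesses (illustrative; assumes 16 banks and tx varying fastest, as on the following slides):

    #include <stdio.h>

    // Word index and bank touched by each of the 16 lanes of a half-warp
    // at inner-loop step k, for the Mds[ty][k] and Nds[k][tx] accesses.
    void showBanks(int tileWidth, int k)
    {
        for (int lane = 0; lane < 16; ++lane) {
            int tx = lane % tileWidth;
            int ty = lane / tileWidth;
            int mdsWord = ty * tileWidth + k;    // Mds[ty][k]
            int ndsWord = k * tileWidth + tx;    // Nds[k][tx]
            printf("lane %2d: Mds word %3d (bank %2d), Nds word %3d (bank %2d)\n",
                   lane, mdsWord, mdsWord % 16, ndsWord, ndsWord % 16);
        }
    }

    int main(void)
    {
        showBanks(16, 0);  // Mds: one word for all lanes (broadcast); Nds: 16 distinct banks
        showBanks(8, 0);   // Mds: lanes split across two words (the slides' 8-way case);
                           // Nds: compare with the slides' discussion
        return 0;
    }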

  • GPU Memory II

    Does Matrix Multiplication Incur Shared Memory Bank Conflicts?

    12.   for (int k = 0; k < TILE_WIDTH; ++k)
    13.     Pvalue += Mds[ty][k] * Nds[k][tx];
    14.   __syncthreads();
    15. }

    Mds[ty*TILE_WIDTH + k], for TILE_WIDTH = 16:
    - The whole half-warp is accessing the same shared memory location.
    - A conflict? No: the GPU supports broadcasting.

  • GPU Memory II

    Does Matrix Multiplication Incur Shared Memory Bank Conflicts?

    12.   for (int k = 0; k < TILE_WIDTH; ++k)
    13.     Pvalue += Mds[ty][k] * Nds[k][tx];
    14.   __syncthreads();
    15. }

    Mds[ty*TILE_WIDTH + k], for TILE_WIDTH = 8:
    - The first half and the second half of the half-warp access two different shared memory locations.
    - 8-way bank conflict.

  • GPU Memory II

    Does Matrix Multiplication Incur Shared Memory Bank Conflicts?

    12.   for (int k = 0; k < TILE_WIDTH; ++k)
    13.     Pvalue += Mds[ty][k] * Nds[k][tx];
    14.   __syncthreads();
    15. }

    Mds[ty*TILE_WIDTH + k], for TILE_WIDTH = 4:
    - 4-way bank conflict.

  • GPU Memory II

    Does Matrix Multiplication Incur Shared Memory Bank Conflicts?

    12.   for (int k = 0; k < TILE_WIDTH; ++k)
    13.     Pvalue += Mds[ty][k] * Nds[k][tx];
    14.   __syncthreads();
    15. }

    Nds[k*TILE_WIDTH + tx], for TILE_WIDTH = 16:
    - Each thread in a half-warp accesses a different shared memory location.
    - No conflict.

  • GPU Memory II

    Does Matrix Multiplication Incur Shared Memory Bank Conflicts?

    12.   for (int k = 0; k < TILE_WIDTH; ++k)
    13.     Pvalue += Mds[ty][k] * Nds[k][tx];
    14.   __syncthreads();
    15. }

    Nds[k*TILE_WIDTH + tx], for TILE_WIDTH = 8:
    - Since the memory storage organization for a 2D array is row-major, this behaves the same as TILE_WIDTH = 16.
    - No conflict.