Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture

Student: Carlo C. del Mundo*, Virginia Tech (Undergrad)
Advisor: Dr. Wu-chun Feng*§, Virginia Tech
* Department of Electrical and Computer Engineering, § Department of Computer Science, Virginia Tech

synergy.cs.vt.edu

Forecast: Hardware-Software Co-Design

Software (Transpose)

Hardware (K20c and shuffle)

NVIDIA Kepler K20c Shuffle Mechanism

Q: What is shuffle?

Cheaper data movement

• Faster than shared memory
• Only in NVIDIA Kepler and newer GPUs (compute capability 3.0+)
• Limited to a warp

>>> Idea: reduce data communication between threads <<<
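To make the "limited to a warp" point concrete, here is a minimal host-side C++ emulation of what a shuffle does: every lane reads a register value directly from another lane of the same warp, with no shared-memory buffer in between. The names `kWarpSize`, `emulate_shfl`, and `rotate_left` are hypothetical and exist only for this sketch; on a GPU the corresponding CUDA intrinsic is `__shfl` in Kepler-era toolkits and `__shfl_sync` in CUDA 9 and later.

```cpp
#include <array>
#include <cstddef>

// Hypothetical host-side model: one register value per lane of a 32-thread warp.
constexpr std::size_t kWarpSize = 32;
using WarpRegs = std::array<int, kWarpSize>;
using LaneMap = std::array<std::size_t, kWarpSize>;

// Emulates a shuffle: lane i receives the value held by lane src_lane[i].
// Every lane reads the old register contents, as it would in hardware.
WarpRegs emulate_shfl(const WarpRegs& value, const LaneMap& src_lane) {
    WarpRegs out{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = value[src_lane[lane] % kWarpSize];
    }
    return out;
}

// Example pattern: each lane takes the value of the lane `shift` positions
// to its right, wrapping around inside the warp.
WarpRegs rotate_left(const WarpRegs& value, std::size_t shift) {
    LaneMap src{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        src[lane] = (lane + shift) % kWarpSize;
    }
    return emulate_shfl(value, src);
}
```

Because the source lane is always taken modulo the warp size, the exchange can never reach a thread outside the warp, which is the restriction the slide lists.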

Q: What are you solving?

• Enable efficient data communication
  – Shared Memory (the “old” way)
  – Shuffle (the “new” way)

Approach

• Evaluate shuffle using matrix transpose
  – Matrix transpose is a data communication step in FFT
• Devised Shuffle Transpose Algorithm
  – Consists of horizontal (inter-thread shuffles) and vertical (intra-thread) stages
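As an illustration of how horizontal and vertical stages can combine into a transpose, the sketch below emulates a 4x4 tile on the host, with 4 threads holding 4 registers each. It is one possible composition under stated assumptions, not necessarily the authors' exact algorithm: the vertical stage reuses the index expression from Code 1 in the Analysis slides, while the two horizontal lane mappings and all names (`Tile`, `horizontal_rotate`, `vertical_rotate`, `horizontal_reflect`, `shuffle_transpose`) are assumptions made for this example.

```cpp
#include <array>

// Hypothetical model: regs[k][t] is register k of thread t.
using Tile = std::array<std::array<int, 4>, 4>;

// Horizontal stage (inter-thread shuffle): for register k,
// thread t reads from thread (t + k) % 4.
Tile horizontal_rotate(const Tile& in) {
    Tile out{};
    for (int k = 0; k < 4; ++k)
        for (int t = 0; t < 4; ++t)
            out[k][t] = in[k][(t + k) % 4];
    return out;
}

// Vertical stage (intra-thread): each thread rotates its own registers,
// using the same index expression as Code 1.
Tile vertical_rotate(const Tile& in) {
    Tile out{};
    for (int t = 0; t < 4; ++t)
        for (int k = 0; k < 4; ++k)
            out[k][t] = in[(4 - t + k) % 4][t];
    return out;
}

// Second horizontal stage (inter-thread shuffle): for register k,
// thread t reads from thread (k - t) mod 4.
Tile horizontal_reflect(const Tile& in) {
    Tile out{};
    for (int k = 0; k < 4; ++k)
        for (int t = 0; t < 4; ++t)
            out[k][t] = in[k][(4 + k - t) % 4];
    return out;
}

// Composition of the three stages; the result satisfies out[k][t] == in[t][k].
Tile shuffle_transpose(const Tile& in) {
    return horizontal_reflect(vertical_rotate(horizontal_rotate(in)));
}
```

Only the two horizontal stages move data between threads, so on a GPU only they would map to shuffle instructions. The vertical stage stays inside each thread's registers.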

Analysis

• Bottleneck: Intra-thread data movement (Stage 2: Vertical)

[Figure: register file, one column of registers per thread: t0 t1 t2 t3]

Code 1 (NAIVE)

    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

Slide annotation: 15x

Code 1 (NAIVE)

    // Index depends on tid, so it cannot be resolved at compile time.
    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 2 (DIV)

    // In-place rotation with constant indices; each branch diverges.
    int tmp = src_registers[0];
    if (tid == 1) {             // Divergence
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    } else if (tid == 2) {      // Divergence
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    } else if (tid == 3) {      // Divergence
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }

Code 3 (SELP OOP)

    // tid == 0: registers stay in place
    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    // tid == 1: rotate by one
    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    // tid == 2: rotate by two
    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    // tid == 3: rotate by three
    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];

General strategies

• Registers are fast.
• CUDA local memory is slow.
  – Compiler is forced to place data into CUDA local memory if array indices CANNOT be determined at compile time.

Analysis annotations, in the order they appear alongside Codes 1, 2, and 3: 15x, 6%, 44%
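The three variants are meant to perform the same per-thread register rotation and differ only in how the indices are expressed. The host-side C++ sketch below checks that equivalence for `tid` 0 through 3. The function names and the nested-ternary form of the select variant are assumptions made for brevity. The slides use one ternary per assignment, and "OOP" is read here as "out of place".

```cpp
#include <array>

using Regs = std::array<int, 4>;

// Code 1 (NAIVE): the source index is computed from tid at run time.
Regs rotate_naive(const Regs& src, int tid) {
    Regs dst{};
    for (int k = 0; k < 4; ++k)
        dst[k] = src[(4 - tid + k) % 4];
    return dst;
}

// Code 2 (DIV): in-place moves with constant indices under per-tid branches.
Regs rotate_div(Regs r, int tid) {
    int tmp = r[0];
    if (tid == 1) {
        r[0] = r[3]; r[3] = r[2]; r[2] = r[1]; r[1] = tmp;
    } else if (tid == 2) {
        r[0] = r[2]; r[2] = tmp; tmp = r[1]; r[1] = r[3]; r[3] = tmp;
    } else if (tid == 3) {
        r[0] = r[1]; r[1] = r[2]; r[2] = r[3]; r[3] = tmp;
    }
    return r;
}

// Code 3 (SELP): out-of-place selects; every array index is a literal.
Regs rotate_selp(const Regs& s, int tid) {
    Regs d{};
    d[0] = (tid == 0) ? s[0] : (tid == 1) ? s[3] : (tid == 2) ? s[2] : s[1];
    d[1] = (tid == 0) ? s[1] : (tid == 1) ? s[0] : (tid == 2) ? s[3] : s[2];
    d[2] = (tid == 0) ? s[2] : (tid == 1) ? s[1] : (tid == 2) ? s[0] : s[3];
    d[3] = (tid == 0) ? s[3] : (tid == 1) ? s[2] : (tid == 2) ? s[1] : s[0];
    return d;
}

// Returns true when all three variants agree for every tid in 0..3.
bool all_variants_agree() {
    const Regs src{10, 11, 12, 13};
    for (int tid = 0; tid < 4; ++tid) {
        const Regs expected = rotate_naive(src, tid);
        if (rotate_div(src, tid) != expected) return false;
        if (rotate_selp(src, tid) != expected) return false;
    }
    return true;
}
```

In the select variant every array index is a compile-time constant, which is the condition the "General strategies" bullets give for keeping data in registers rather than CUDA local memory.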

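Results

[Figure: results chart; per the Conclusion slide, gray bars show communication and black bars show computation]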

Conclusion

• Overall Performance
  – Max. Speedup (Amdahl’s Law): 1.19-fold
  – Achieved Speedup: 1.17-fold
• Surprise Result
  – Goal: Accelerate communication (“gray bar”)
  – Result: Accelerated the computation also (“black bar”)

Thank You!

• Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture
  – Student: Carlo del Mundo, Virginia Tech (undergrad)
  – Overall Performance
    • Theoretical Speedup: 1.19-fold
    • Achieved Speedup: 1.17-fold

Code 1 (NAIVE)

    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

Appendix

Motivation

• Goal
  – Accelerating an application based on hardware-specific mechanisms (e.g., “the hardware-software co-design process”)
• Case Study
  – Application: Matrix transpose as part of a 256-pt FFT
  – Architecture: NVIDIA Kepler K20c
    • Use shuffle to accelerate communication
• Results
  – Max. Theoretical Speedup: 1.19-fold
  – Achieved Speedup: 1.17-fold

Background: The New and Old

• Shuffle
  – Idea
    • Communicate data within a warp w/o shared memory
  – Pros
    • Faster (1 cycle to perform load and store)
    • Eliminate the use of shared memory → higher thread occupancy
  – Cons
    • Poorly understood
    • Only available on Kepler (compute capability 3.0) and newer GPUs
    • Limited to 32 threads (one warp)
• Shared Memory
  – Idea
    • Scratchpad memory to communicate data
  – Pros
    • Easy to program
    • Scales to a block (up to 1024 threads on Kepler)
  – Cons
    • Prone to bank conflicts
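The bank-conflict drawback can be illustrated with a small host-side C++ model. It assumes 32 banks that are each one 4-byte word wide, so a word index maps to bank `word % 32`; Kepler also offers an 8-byte bank mode that this sketch ignores. The names `kBanks`, `kWarp`, and `conflict_degree` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>

constexpr int kBanks = 32;  // assumption: 32 banks, each one 4-byte word wide
constexpr int kWarp = 32;   // threads per warp

// Lane i of a warp accesses word index i * stride_words.
// Returns the largest number of distinct words that land in one bank,
// i.e. how many serialized passes the access would need in this model.
int conflict_degree(int stride_words) {
    std::array<std::set<int>, kBanks> words_per_bank;
    for (int lane = 0; lane < kWarp; ++lane) {
        const int word = lane * stride_words;
        words_per_bank[word % kBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank)
        worst = std::max(worst, words.size());
    return static_cast<int>(worst);
}
```

In this model a stride of 1 is conflict-free and a stride of 32 (reading a column of a 32-word-wide tile) puts every lane in the same bank. A stride of 33, the usual one-word padding trick, is conflict-free again. Shuffle sidesteps the issue because it does not go through shared memory.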