
COSC 6385 Computer Architecture - Data Level Parallelism (II)

Edgar Gabriel

Fall 2018

SIMD Instructions

• Originally developed for multimedia applications
• Same operation executed for multiple data items
• Uses a fixed-length register and partitions the carry chain to allow utilizing the same functional unit for multiple operations
  – E.g. a 64-bit adder can be utilized for two 32-bit add operations simultaneously
• Instructions originally not intended to be used by a compiler, but for handcrafting specific operations in device drivers
• All elements in a register have to be on the same memory page to avoid a page fault within the instruction


SIMD Instructions

• MMX (Multi-Media Extensions) - 1996
  – Existing 64-bit floating point registers could be used for eight 8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extensions) – 1999
  – Successor to the MMX instructions
  – Separate 128-bit registers added for sixteen 8-bit, eight 16-bit, or four 32-bit operations
• SSE2 – 2001, SSE3 – 2004, SSE4 – 2007
  – Added support for double-precision operations
• AVX (Advanced Vector Extensions) – 2010
  – 256-bit registers added
• AVX2 – 2013
  – widened integer operations to 256 bits and added fused multiply-add; 512-bit registers only arrived later with AVX-512 (2016)

AVX Instructions

AVX Instruction   Description
VADDPD            Add four packed double-precision operands
VSUBPD            Subtract four packed double-precision operands
VMULPD            Multiply four packed double-precision operands
VDIVPD            Divide four packed double-precision operands
VFMADDPD          Multiply and add four packed double-precision operands
VFMSUBPD          Multiply and subtract four packed double-precision operands
VCMPxx            Compare four packed double-precision operands for EQ, NEQ, LT, LTE, GT, GE, ...
VMOVAPD           Move four aligned packed double-precision operands
VBROADCASTSD      Broadcast one double-precision operand to four locations in a 256-bit register
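As a hedged illustration (not from the original slides), the sketch below maps a DAXPY-style loop, y[i] = a*x[i] + y[i], onto these instructions through the corresponding C intrinsics from immintrin.h; the function name daxpy_avx and the assumption that n is a multiple of 4 are ours.

    #include <immintrin.h>

    /* Hedged sketch: _mm256_broadcast_sd compiles to VBROADCASTSD,
       _mm256_mul_pd to VMULPD, and _mm256_add_pd to VADDPD.
       Assumes n is a multiple of 4 (four doubles per 256-bit register). */
    void daxpy_avx(int n, double a, const double *x, double *y)
    {
        __m256d va = _mm256_broadcast_sd(&a);     /* a in all four lanes */
        for (int i = 0; i < n; i += 4) {
            __m256d vx = _mm256_loadu_pd(&x[i]);  /* four packed doubles */
            __m256d vy = _mm256_loadu_pd(&y[i]);
            vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
            _mm256_storeu_pd(&y[i], vy);
        }
    }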


Graphics Processing Units (GPU)

• Hardware in graphics units similar to vector processors
  – Works well with data-level parallel problems
  – Scatter-gather transfers
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Graphics Processing Units (II)

• Using NVIDIA GPUs as an example
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as CUDA threads
  – Programming model is “Single Instruction Multiple Threads” (SIMT)
    • GPU hardware handles thread management, not applications or the OS


Example: Vector Addition

• Sequential code:

    int main ( int argc, char **argv )
    {
        int A[N], B[N], C[N];   /* N assumed to be defined, e.g. #define N 1024 */
        int i;
        for ( i=0; i<N; i++) {
            C[i] = A[i] + B[i];
        }
        return (0);
    }

• CUDA: replace the loop by N threads, each executing one element of the vector add operation

Example: Vector Addition (II)

• CUDA: replace the loop by N threads, each executing one element of the vector add operation
• Question: How does each thread know which element to execute?
  – threadIdx: each thread has an id which is unique within its thread block
    • of type uint3 (dimensions are declared with the similar type dim3), essentially struct { unsigned int x, y, z; }
  – blockDim: total number of threads in the thread block
    • a thread block can be 1D, 2D or 3D (see the sketch below)
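A hedged sketch (ours, not from the slides) of how these built-in variables combine into a unique index when a 2-D thread block is used; the kernel name and the doubling operation are illustrative only.

    __global__ void scale2d ( float *d_data )
    {
        int x = threadIdx.x;            /* position within the block, x-dimension */
        int y = threadIdx.y;            /* position within the block, y-dimension */
        int i = y * blockDim.x + x;     /* flattened, unique index within the block */
        d_data[i] = 2.0f * d_data[i];
    }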


Example: Vector Addition (III)

• Initial CUDA kernel (assuming a 1-D thread block -> only the x-dimension is used):

    void vecadd ( int *d_A, int *d_B, int* d_C)
    {
        int i = threadIdx.x;
        d_C[i] = d_A[i] + d_B[i];
        return;
    }

• This code is limited by the maximum number of threads in a thread block
  – Upper limit on the max. number of threads in one block
  – if the vector is longer, we have to create multiple thread blocks

How does the compiler know which code to compile for the CPU and which for the GPU?

• A specifier tells the compiler where a function will be executed
  -> the compiler can generate code for the corresponding processor
• Executed on the CPU, called from the CPU (default if not specified):

    __host__ void func(...);

• CUDA kernel to be executed on the GPU, called from the CPU:

    __global__ void func(...);

• CUDA kernel to be executed on the GPU, called from the GPU:

    __device__ void func(...);
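A hedged sketch (ours) showing the three specifiers together; square, squareAll and launch are illustrative names, not part of CUDA.

    __device__ float square ( float x )        /* GPU code, callable from GPU code */
    {
        return x * x;
    }

    __global__ void squareAll ( float *d_v )   /* GPU kernel, launched from the CPU */
    {
        int i = threadIdx.x;
        d_v[i] = square(d_v[i]);
    }

    __host__ void launch ( float *d_v, int n ) /* CPU code (the default) */
    {
        squareAll<<<1, n>>>(d_v);
    }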


Example: Vector Addition (IV)

• so the CUDA kernel is in reality:

    __global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
    {
        int i = threadIdx.x;
        d_C[i] = d_A[i] + d_B[i];
        return;
    }

• Note:
  – d_A, d_B, and d_C are in global memory
  – int i is in the local memory of the thread

• If you have multiple thread blocks:

    __global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
    {
        /* blockIdx.x: ID of the thread block that this thread is part of
           blockDim.x: number of threads in a thread block */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        d_C[i] = d_A[i] + d_B[i];
        return;
    }
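One caveat the slide leaves implicit: if N is not an exact multiple of the block size, threads in the last block index past the end of the vectors. A hedged variant (ours) passes the length n explicitly and guards the access:

    __global__ void vecAdd ( int *d_A, int *d_B, int* d_C, int n )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ( i < n )                  /* skip threads beyond the vector length */
            d_C[i] = d_A[i] + d_B[i];
        return;
    }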


Using more than one element per thread

    __global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
    {
        /* thread i handles the contiguous chunk of NUMELEMENTS elements
           starting at i*NUMELEMENTS */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j;
        for ( j=i*NUMELEMENTS; j<(i+1)*NUMELEMENTS; j++)
            d_C[j] = d_A[j] + d_B[j];
        return;
    }
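A common alternative (ours, not from the slides) is a grid-stride loop: neighboring threads touch neighboring elements in every iteration, which keeps the accesses coalesced and handles any vector length n.

    __global__ void vecAddStride ( int *d_A, int *d_B, int* d_C, int n )
    {
        int stride = gridDim.x * blockDim.x;   /* total number of threads */
        int j;
        for ( j = blockIdx.x * blockDim.x + threadIdx.x; j < n; j += stride )
            d_C[j] = d_A[j] + d_B[j];
        return;
    }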

NVIDIA Instruction Set Architecture

• Parallel Thread Execution (PTX)
  – is an abstraction of the hardware instruction set
  – uses virtual registers
  – translation to machine code is performed in software
  – Example for one iteration of a loop executing y[i] = a*x[i] + y[i] with a block size of 512 threads per block (a is assumed to already be in RD4):

    shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 = 2^9)
    add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * a
    add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])


Conditional Branching

• Branch hardware uses internal masks
• A branch synchronization stack supports nested branch instructions
  – Entries consist of masks for each SIMD lane (CUDA thread)
  – Instruction markers manage when a branch diverges into multiple execution paths
    • push on a divergent branch
  – ...and when paths converge
    • act as barriers
    • pop the stack
• For equal-length IF-ELSE conditions, code will operate at 50% efficiency
  – in each lane, either the IF or the ELSE part is not executing (see the sketch below)
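A hedged sketch (ours) of a kernel that diverges within a warp: even and odd lanes take different paths, so the hardware executes both paths with the respective other half of the lanes masked off.

    __global__ void diverge ( float *d_v )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ( i % 2 == 0 )
            d_v[i] = d_v[i] * 2.0f;   /* executed with the odd lanes masked off  */
        else
            d_v[i] = d_v[i] + 1.0f;   /* executed with the even lanes masked off */
    }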

Nvidia GT200

• The GT200 is a multi-core chip with a two-level hierarchy
  – focuses on high throughput for data-parallel workloads
• 1st level of hierarchy: 10 Thread Processing Clusters (TPCs)
• 2nd level of hierarchy: each TPC has
  – 3 Streaming Multiprocessors (SMs) (an SM corresponds to one core in a conventional processor)
  – a texture pipeline (used for memory access)
• Global Block Scheduler:
  – issues thread blocks to SMs with available capacity
  – simple round-robin algorithm, but taking resource availability (e.g. of shared memory) into account


Nvidia GT200

Image Source: David Kanter, “Nvidia GT200: Inside a Parallel Processor”, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1

Streaming multi-processor (I)

• Instruction fetch, decode and issue logic
• 8 32-bit ALU units (often referred to as Streaming Processors (SPs), or confusingly called ‘cores’ by Nvidia)
• 8 branch units
  – a thread encountering a branch will stall until it is resolved (no speculation); branch delay: 4 cycles
• Two 64-bit special units for less frequent operations
  – 64-bit operations are 8-12 times slower than 32-bit operations!
• 1 special function unit for ‘unusual’ instructions
  – transcendental functions, interpolations, reciprocal square roots
  – these take anywhere from 16 to 32 cycles to execute


Streaming multi-processor (II)

• Single issue with SIMD capabilities
• Can execute up to 8 thread blocks / 1024 threads concurrently
• Does not support speculative execution or branch prediction
• Instructions are scoreboarded to reduce stalls
• Each SP has access to 2048 register file entries, each 32 bits wide
  – a double-precision number has to utilize two adjacent registers
  – the register file can be used by up to 128 threads concurrently

Streaming multi-processor (III)

Image Source: David Kanter, “Nvidia GT200: Inside a Parallel Processor”, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1


Streaming multi-processor (IV)

• Execution units of an SM run at twice the frequency of the fetch and issue logic as well as of the memory and register file
• 64 KB register file that is partitioned across all SPs
• 16 KB shared memory that can be used for communication between the threads running on the SPs of the same SM (see the sketch below)
  – organized in 4096 entries with 16 banks (= 32-bit bank width)
  – accessing shared memory is as fast as accessing a register!
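A hedged sketch (ours) of threads in one block communicating through this shared memory; it assumes a block size of exactly 256 threads, and __syncthreads() makes sure all writes complete before any thread reads.

    __global__ void reverseBlock ( float *d_v )
    {
        __shared__ float tmp[256];      /* lives in the SM's 16 KB shared memory */
        int i = threadIdx.x;
        tmp[i] = d_v[i];
        __syncthreads();                /* barrier: all writes done before reads */
        d_v[i] = tmp[blockDim.x - 1 - i];
    }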

Load/Store operations

• Generated in the SMs, but handled by the SM controller in the TPC
  – the load pipeline shares hardware with the texture pipeline
  – shared by the three SMs
  – mutually exclusive usage of the load and texture pipelines
  – effective address calculation + mapping of 40-bit virtual addresses to physical addresses by the MMU
• Texture cache:
  – 2-D addressing
  – read-only caches without cache coherence
    • the entire cache hierarchy is invalidated if a data item is modified
  – texture caches are used to save bandwidth and power; they are not really faster than texture memory


CUDA Memory Model

CUDA Memory Model (II)

• cudaError_t cudaMalloc(void** devPtr, size_t size)
  – allocates size bytes of device (global) memory pointed to by *devPtr
  – returns cudaSuccess for no error
• cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
  – dst = destination memory address
  – src = source memory address
  – count = bytes to copy
  – kind = type of transfer (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice)
• cudaError_t cudaFree(void* devPtr)
  – frees memory allocated with cudaMalloc
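A hedged usage fragment (ours) for inside main(): every call above returns a cudaError_t, and cudaGetErrorString() turns it into a readable message.

    float *d_a;
    cudaError_t err = cudaMalloc( (void**)&d_a, N*sizeof(float) );
    if ( err != cudaSuccess ) {
        fprintf( stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err) );
        return 1;                       /* abort: device allocation failed */
    }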

Slide based on a lecture by Matt Heavener, CS, State Univ. of NY at Buffalo, http://www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf


Example: Vector Addition (V)

    int main ( int argc, char ** argv) {
        float a[N], b[N], c[N];      /* host arrays; N assumed to be a multiple of 256 */
        float *d_a, *d_b, *d_c;      /* device pointers */

        cudaMalloc( (void**)&d_a, N*sizeof(float));
        cudaMalloc( (void**)&d_b, N*sizeof(float));
        cudaMalloc( (void**)&d_c, N*sizeof(float));

        cudaMemcpy( d_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy( d_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

        dim3 threadsPerBlock(256);   // 1-D array of threads
        dim3 blocksPerGrid(N/256);   // 1-D grid

        vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);

        /* result copy: destination (host) first, then source (device) */
        cudaMemcpy( c, d_c, N*sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }
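Note that blocksPerGrid(N/256) truncates, so elements are missed whenever N is not a multiple of 256. A hedged fix (ours) rounds up and relies on the guarded kernel variant sketched earlier:

    dim3 threadsPerBlock(256);
    dim3 blocksPerGrid( (N + 255) / 256 );            /* ceiling division */
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);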

Nvidia Fermi processor

• Next-generation processor from Nvidia
• Removed one level of hierarchy
  – contains 16 SM processors, but no notion of TPCs anymore
• Each SM processor has
  – 32 ALU units (Nvidia ‘cores’, SIMD ‘lanes’ in the book)
    • compared to 8 on the GT200
    • organized as two units with 16 ALUs each
  – 16 load/store units
    • compared to 1 for three SMs on the GT200
  – 64 KB local SRAM that can be split into L1 cache and shared memory (16 KB/48 KB or 48 KB/16 KB; see the sketch below)
  – 4 special function units
    • compared to 1 on the GT200
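A hedged sketch (ours) of requesting that split from the CUDA runtime; cudaFuncSetCacheConfig() states a per-kernel preference that the hardware may honor.

    /* prefer 48 KB shared memory / 16 KB L1 for the vecAdd kernel */
    cudaFuncSetCacheConfig( vecAdd, cudaFuncCachePreferShared );
    /* or prefer 16 KB shared memory / 48 KB L1 */
    cudaFuncSetCacheConfig( vecAdd, cudaFuncCachePreferL1 );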


Nvidia Fermi SM processor

Image Source: Peter N. Glaskowsky, “Nvidia’s Fermi: The First Complete GPU Architecture”, http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA%27s_Fermi-The_First_Complete_GPU_Architecture.pdf

Nvidia Fermi processor

• Can manage up to 1,536 threads simultaneously per SM
  – compared to 1024 per SM on the GT200
• Register file increased to 128 KB (32k entries)
• New: modified address space using 40-bit addresses
  – global, shared and local addresses are ranges within that address space
• New: support for atomic read-modify-write operations (see the sketch below)
• New: support for predicated instructions
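A hedged sketch (ours) of the atomic read-modify-write support: atomicAdd() serializes the updates in hardware, so concurrent threads can build a global sum without a race (atomicAdd on float requires compute capability 2.0, i.e. Fermi).

    __global__ void sumAll ( const float *d_v, float *d_sum )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        atomicAdd( d_sum, d_v[i] );   /* *d_sum assumed initialized to 0.0f */
    }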


Similarities and Differences between GPU and Vector Processors

• Memory organization and management
  – All GPU memory accesses are gather-scatter
    -> special hardware to recognize address coalescing (see the sketch below)
    -> memory latency hidden by the large number of threads and by scoreboarding
  – Loading data into a vector register is contiguous by default
    -> special support for gather-scatter operations
    -> costs of a load/store operation are amortized over the large number of elements accessed at once
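A hedged sketch (ours) of what the coalescing hardware can and cannot merge: neighboring threads reading neighboring addresses collapse into a few wide transactions, while a strided pattern does not.

    __global__ void copyKernel ( float *dst, const float *src, int stride )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        dst[i] = src[i];              /* coalesced: thread i touches address i */
        /* dst[i] = src[i * stride];     strided: one transaction per thread  */
    }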

Similarities and Differences between GPU and Vector Processors (II)

• Processor organization and ISA
  – A vector register holds an entire vector <-> on a GPU the vector is distributed across registers in different ALUs
  – Much higher number of ALUs/threads supported on a GPU than the number of lanes in a vector processor
  – A PTX instruction is similar to a vector instruction
  – Both approaches use mask registers to handle conditional instructions
    -> mask set by the compiler for vector processors
    -> mask set at runtime by hardware on a GPU


Similarities and Differences between GPU and Vector Processors (III)

• A scalar unit executes scalar operations in a vector processor
• A GPU could use the regular CPU for scalar operations
  – high cost of data transfers between GPU and CPU memory
  – scalar code is therefore often executed on the GPU