Page 1:

UC Regents Spring 2014 © UCB

CS 152 L22: GPU + SIMD + Vectors

2014-4-15

John Lazzaro (not a prof - “John” is always OK)

CS 152 Computer Architecture and Engineering

www-inst.eecs.berkeley.edu/~cs152/

TA: Eric Love

Lecture 22 -- GPU + SIMD + Vectors I

Page 2:

Today: Architecture for data parallelism

The Landscape: Three chips that deliver TeraOps/s in 2014, and how they differ.

GK110: nVidia’s flagship Kepler GPU, customized for compute applications.

Short Break

E5-2600v2: Stretching the Xeon server approach for compute-intensive apps.

Page 3:

Sony/IBM Playstation PS3 Cell Chip - Released 2006

Page 4:

Sony PS3 Cell Processor SPE Floating-Point

Single-Instruction Multiple-Data (SIMD): four 32-bit lanes.

4 single-precision multiply-adds issue in lockstep (SIMD) per cycle. 6 cycle latency (in blue).

6 gamer SPEs, 3.2 GHz clock --> 150 GigaOps/s
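The 150 GigaOps/s figure can be checked with a few lines of arithmetic. This is a minimal sketch; it assumes each multiply-add counts as two operations (one multiply, one add), which the slide does not state explicitly.

```python
# Peak single-precision throughput implied by the slide's numbers.
SPES = 6              # SPEs available to games (from the slide)
MACS_PER_SPE = 4      # 4-wide SIMD multiply-add per cycle (from the slide)
OPS_PER_MAC = 2       # assumption: one multiply plus one add
CLOCK_HZ = 3.2e9      # 3.2 GHz clock (from the slide)

peak_ops = SPES * MACS_PER_SPE * OPS_PER_MAC * CLOCK_HZ
print(f"{peak_ops / 1e9:.1f} GigaOps/s")  # 153.6 GigaOps/s
```

The slide rounds this down to 150 GigaOps/s.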

Page 5:

Sony PS3 Cell Processor SPE Floating-Point

Single-Instruction Multiple-Data: four 32-bit lanes.

In the 1970s a big part of a computer architecture class would be learning how to build units like this.

Top-down (f.p. format) && Bottom-up (logic design)

Page 6:

Sony PS3 Cell Processor SPE Floating-Point

The PS3 ceded ground to Xbox not because it was underpowered, but because it was hard to program.

Today, the formats are standards (IEEE f.p.) and the bottom-up is now “EE.”

Architects focus on how to organize floating point units into programmable machines for application domains.

Page 7:

2014: TeraOps/Sec Chips

Page 8:

Intel E5-2600v2

12-core Xeon Ivy Bridge

0.52 TeraOps/s

12 cores @ 2.7 GHz. Each core can issue 16 single-precision operations per cycle.

$2,600 per chip

Haswell: 1.04 TeraOps/s

Page 9:

EECS 150: Graphics Processors UC Regents Fall 2013 © UCB

nVidia GPU

5.12 TeraOps/s

2880 MACs @ 889 MHz (single-precision multiply-adds)

Kepler GK 110

$999 GTX Titan Black with 6GB GDDR5 (and 1 GPU)

Page 10:

XC7VX980T

Xilinx Virtex 7 with the most DSP blocks.

Typical application: Medical imaging scanners, for first stage of processing after the A/D converters.

3600 MACs @ 714 MHz. Comparable to single-precision floating-point.

5.14 TeraOps/s

$16,824 per chip

(die photo of a related part)
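The three headline TeraOps/s numbers all follow from the same product: parallel units × operations per unit per cycle × clock rate. The sketch below reproduces them from the figures on the slides. The dictionary layout and the `peak_teraops` helper are illustrative names, and counting a MAC as two operations is an assumption.

```python
# Peak throughput for the three 2014 chips, using only the slides' figures.
chips = {
    # name: (parallel units, ops per unit per cycle, clock in Hz, price in $)
    "E5-2600v2": (12, 16, 2.7e9, 2600),     # 12 cores, 16 ops/cycle each
    "GK110": (2880, 2, 889e6, 999),         # 2880 MACs; price is for the card
    "XC7VX980T": (3600, 2, 714e6, 16824),   # 3600 MACs in DSP blocks
}

def peak_teraops(units, ops_per_cycle, clock_hz):
    """Peak operations per second, in units of 10^12."""
    return units * ops_per_cycle * clock_hz / 1e12

for name, (units, ops, clock, price) in chips.items():
    tera = peak_teraops(units, ops, clock)
    print(f"{name}: {tera:.2f} TeraOps/s, ${price / tera:,.0f} per TeraOp/s")
```

The price column is only a rough comparison: the Xeon and Xilinx prices are per chip, while the $999 GK110 price includes the card and its 6GB of GDDR5.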

Page 11:

Intel E5-2600v2

12 cores @ 2.7 GHz. Each core can issue 16 single-precision ops/cycle.

How?

Haswell cores issue 32/cycle.

Page 12:

Die closeup of one Sandy Bridge core

Advanced Vector Extensions (AVX) unit

Smaller than L3 cache, but larger than L2 cache. Relative area has increased in Haswell.

Page 13:

Programmer's Model: AVX

IA-32 Nehalem

8 128-bit registers

Each register holds 4 IEEE single-precision floats

The programmer's model has many variants, which we will introduce in the slides that follow.

Page 14:

Example AVX Opcode

VMULPS XMM4 XMM2 XMM3

XMM4 = XMM2 * XMM3 (op = *)

Multiply two 4-element vectors of single-precision floats, element by element.

New issue every cycle. 5 cycle latency (Haswell).

Aside from its use of a special register set, VMULPS executes like normal IA-32 instructions.
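As a rough illustration of what the opcode computes, here is a pure-Python model of its element-by-element semantics. It does not run on the AVX unit. The function name `vmulps` and the `f32` rounding helper are illustrative, and the lists stand in for the XMM registers.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def vmulps(src1, src2):
    """Model of a 4-lane single-precision multiply: dst[i] = src1[i] * src2[i]."""
    assert len(src1) == len(src2) == 4
    return [f32(a * b) for a, b in zip(src1, src2)]

xmm2 = [1.0, 2.0, 3.0, 4.0]
xmm3 = [0.5, 0.5, 2.0, 2.0]
xmm4 = vmulps(xmm2, xmm3)
print(xmm4)  # [0.5, 1.0, 6.0, 8.0]
```

All four products come from one instruction; no lane depends on any other lane.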

Page 15:

Sandy Bridge, Haswell

Sandy Bridge extends register set to 256 bits: vectors are twice the size.

x86-64 AVX/AVX2 has 16 registers (IA-32: 8)

Haswell adds 3-operand instructions: a*b + c, fused multiply-add (FMA)

2 EX units with FMA --> 2X increase in ops/cycle
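The 16 and 32 ops/cycle figures quoted for the two generations can be rebuilt from the register width. This is a sketch under one stated assumption: the pre-Haswell core is modelled as one 256-bit multiply unit plus one 256-bit add unit, which the slides imply but do not spell out.

```python
REGISTER_BITS = 256
FLOAT_BITS = 32
LANES = REGISTER_BITS // FLOAT_BITS   # 8 single-precision floats per register

def ops_per_cycle(units, ops_per_lane):
    """Single-precision operations issued per core per cycle."""
    return units * LANES * ops_per_lane

# Assumption: Sandy/Ivy Bridge has one multiply unit and one add unit,
# so each unit contributes one op per lane.
ivy_bridge = ops_per_cycle(units=2, ops_per_lane=1)   # 16
# Haswell: both units do a fused a*b + c, so two ops per lane.
haswell = ops_per_cycle(units=2, ops_per_lane=2)      # 32

CORES, CLOCK_HZ = 12, 2.7e9
print(ivy_bridge, haswell)
print(f"{haswell * CORES * CLOCK_HZ / 1e12:.2f} TeraOps/s")  # 1.04 TeraOps/s
```

The last line matches the "Haswell: 1.04 TeraOps/s" figure for a 12-core part at the same 2.7 GHz clock.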

Page 16:

OoO Issue: Haswell (2013)

Haswell sustains 4 micro-op issues per cycle. One possibility: 2 for AVX, and 2 for loads, stores and book-keeping.

Haswell has two copies of the FMA engine, on separate ports.

Page 17:

AVX: Not just single-precision floating-point

AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc ...

256-bit version -> double-precision vectors of length 4
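The point that the register is just 128 bits, and the opcode decides how to read them, can be shown with Python's `struct` module. This is only an analogy: a 16-byte buffer stands in for one register, and little-endian layout is assumed.

```python
import struct

# 16 bytes = 128 bits, standing in for one register's contents.
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
assert len(raw) == 16

as_floats = struct.unpack("<4f", raw)    # 4 single-precision floats
as_doubles = struct.unpack("<2d", raw)   # same bits read as 2 doubles
as_bytes = struct.unpack("<16B", raw)    # same bits read as 16 8-bit integers

print(as_floats)      # (1.0, 2.0, 3.0, 4.0)
print(as_bytes[:4])   # (0, 0, 128, 63), the bytes of 1.0f
print(len(as_doubles), len(as_bytes))    # 2 16
```

Reading float data as doubles gives meaningless values. The hardware does no type checking, so the choice of instruction variant is the only "type" the data has.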

Page 18:

Exception Model

MXCSR: AVX condition codes register

Floating-point exceptions: Always a contentious issue in ISA design ...

Page 19:

Exception Handling

Use MXCSR to configure AVX to halt program for divide by zero, etc ...

Or, configure AVX for “show must go on” semantics: on error, results are set to +Inf, -Inf, NaN, ...
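The two policies can be contrasted with a small software model of a vector divide. The `trap_on_divide_by_zero` flag is a hypothetical stand-in for the configuration held in MXCSR; the sketch does not reproduce the register's actual bit layout.

```python
import math

def vdivps(a, b, trap_on_divide_by_zero):
    """Lane-by-lane divide with a selectable divide-by-zero policy (model only)."""
    out = []
    for x, y in zip(a, b):
        if y == 0.0:
            if trap_on_divide_by_zero:
                # "Halt the program" policy.
                raise FloatingPointError("divide by zero")
            # "Show must go on" policy: produce an IEEE special value.
            if x == 0.0 or math.isnan(x):
                out.append(math.nan)            # 0/0 is NaN
            else:
                sign = math.copysign(1.0, x) * math.copysign(1.0, y)
                out.append(sign * math.inf)     # signed infinity
        else:
            out.append(x / y)
    return out

result = vdivps([1.0, -1.0, 0.0, 8.0], [0.0, 0.0, 0.0, 2.0],
                trap_on_divide_by_zero=False)
print(result)  # [inf, -inf, nan, 4.0]
```

With the flag set to `True`, the same call raises on the first lane instead of returning a result.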

Page 20:

Data moves

AVX register file reads pass through permute and shuffle networks in both “X” and “Y” dimensions.

Many AVX instructions rely on this feature ...

Page 21:

Pure data move opcode.

Or, part of a math opcode.

Page 22:

Permutes over 2 sets of 4 fields of one vector.

Arbitrary data alignment

Shuffling two vectors.
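A software model makes the two data-movement patterns concrete. The functions below are illustrative sketches, loosely modelled on the in-lane permute and two-source shuffle forms of AVX. The names and the selector encoding (a tuple of indices rather than an immediate byte) are assumptions made for readability.

```python
def permute_in_lanes(v, sel):
    """Permute an 8-field vector as 2 independent sets of 4 fields.

    sel holds 4 indices in 0..3; the same selection is applied to each set.
    """
    assert len(v) == 8 and len(sel) == 4
    out = []
    for base in (0, 4):
        out.extend(v[base + s] for s in sel)
    return out

def shuffle_two(a, b, sel):
    """Shuffle two 8-field vectors: in each set of 4 output fields,
    the first two are selected from a and the last two from b."""
    assert len(a) == len(b) == 8 and len(sel) == 4
    out = []
    for base in (0, 4):
        out += [a[base + sel[0]], a[base + sel[1]],
                b[base + sel[2]], b[base + sel[3]]]
    return out

v = [10, 11, 12, 13, 14, 15, 16, 17]
print(permute_in_lanes(v, (3, 2, 1, 0)))  # [13, 12, 11, 10, 17, 16, 15, 14]

a = list(range(8))
b = list(range(100, 108))
print(shuffle_two(a, b, (0, 1, 2, 3)))    # [0, 1, 102, 103, 4, 5, 106, 107]
```

In both models no field crosses from one set of 4 to the other.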

Page 23:

Memory System

Gather: Reading non-unit-stride memory locations into arbitrary positions in an AVX register, while minimizing redundant reads.

Values in memory. Specified indices. Final result.
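The three labels in the figure (values in memory, specified indices, final result) map directly onto a small model of gather. The `lines_touched` helper illustrates the "minimizing redundant reads" point. It assumes 64-byte cache lines holding 16 four-byte floats, which is an assumption of this sketch and not a figure from the slide.

```python
def gather(memory, indices):
    """result[i] = memory[indices[i]] for every lane i."""
    return [memory[i] for i in indices]

def lines_touched(indices, elems_per_line=16):
    """Distinct cache lines a gather must read (assumed 16 floats per line)."""
    return len({i // elems_per_line for i in indices})

memory = [float(i) for i in range(64)]       # values in memory
indices = [0, 3, 17, 18, 40, 41, 42, 63]     # specified indices
result = gather(memory, indices)             # final result

print(result)                  # [0.0, 3.0, 17.0, 18.0, 40.0, 41.0, 42.0, 63.0]
print(lines_touched(indices))  # 4
```

Eight lanes are filled from only four distinct lines, so an implementation that reads each line once avoids half of the naive per-element reads.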

Page 24:

Positive observations ...

Best for applications that are a good fit for Xeon’s memory system: large on-chip caches, up to a TeraByte of DRAM, but only moderate bandwidth requirements to DRAM.

Applications that do “a lot of everything” -- integer, random-access loads/stores, string ops -- gain access to a significant fraction of a TeraOp/s of floating point, with no context switching.

If you’re planning on experimenting with GPUs, you need a Xeon server anyway ... aside from $$$, why not buy a high-core-count variant?

Page 25:

Negative observations ...

AVX changes each generation, in a backward compatible way, to add the latest features.

AVX is difficult for compilers. Ideally, someone has written a library of hand-crafted AVX assembly code that does exactly what you want.

Two FMA units per core (50% of issue width) is probably the limit. So, scaling vector size or scaling core count are the only upgrade paths.

0.52 TeraOp/s (Ivy Bridge) << 5.12 TeraOp/s (GK110)

And $2700 (chip only) >> $999 (Titan Black card).

59.6 GB/s << 336 GB/s (memory bandwidth)
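To put numbers on the "<<" and ">>" comparisons, the ratios can be computed directly from the figures on this slide. Nothing here is measured; it is arithmetic on the quoted values.

```python
# (Ivy Bridge Xeon, GK110 / Titan Black) pairs quoted on the slide.
teraops = (0.52, 5.12)
price_usd = (2700, 999)
dram_gb_per_s = (59.6, 336)

throughput_ratio = teraops[1] / teraops[0]
price_ratio = price_usd[0] / price_usd[1]
bandwidth_ratio = dram_gb_per_s[1] / dram_gb_per_s[0]

print(f"GK110 peak throughput: {throughput_ratio:.1f}X Xeon")   # 9.8X
print(f"Xeon price:            {price_ratio:.1f}X Titan Black")  # 2.7X
print(f"GK110 DRAM bandwidth:  {bandwidth_ratio:.1f}X Xeon")     # 5.6X
```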

Page 26:

Break

Page 27:

nVidia GPU

Kepler GK 110

The granularity of SMX cores (15 per die) matches the Xeon core count (12 per die)

Page 28:

SMX core (28 nm)

Sandy Bridge core (32 nm)

Page 29:

889 MHz GK 110 SMX core vs 2.7 GHz Haswell core

1024-bit SIMD vectors: 4X more than Haswell. 32 single-precision floats or 16 double-precision floats.

[Figure: SMX execution units: 6 single-precision, 2 double-precision, special ops, memory ops]

Execution units vs. Haswell: 3X (single-precision), 1X (double-precision)

Clock speed vs Ivy Bridge Xeon: 3X slower

Net per core: 4X single-precision, 1.33X double-precision
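The 4X and 1.33X figures are the product of the three ratios on this slide: wider vectors, more execution units, slower clock. The sketch below uses the slide's rounded "3X slower" clock ratio. The unit count of 6 single-precision blocks per SMX is read off the figure labels and should be treated as an assumption.

```python
VECTOR_RATIO = 1024 // 256   # SMX vector width vs Haswell: 4X
SP_UNIT_RATIO = 3            # single-precision execution units vs Haswell
DP_UNIT_RATIO = 1            # double-precision execution units vs Haswell
CLOCK_SLOWDOWN = 3           # slide's rounding of 2.7 GHz / 889 MHz

sp_speedup = VECTOR_RATIO * SP_UNIT_RATIO / CLOCK_SLOWDOWN
dp_speedup = VECTOR_RATIO * DP_UNIT_RATIO / CLOCK_SLOWDOWN
print(f"{sp_speedup:.2f}X single-precision, {dp_speedup:.2f}X double-precision")

# Cross-check against the 2880 MACs quoted for the whole die.
SMX_PER_DIE = 15
SP_UNITS_PER_SMX = 6         # assumption: counted from the figure labels
LANES = 1024 // 32           # 32 single-precision floats per vector
print(SMX_PER_DIE * SP_UNITS_PER_SMX * LANES)  # 2880
```

The exact clock ratio is about 3.04, so the per-core advantage is slightly below the rounded 4X and 1.33X.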

Page 30:

CS 152 L14: Cache Design and Coherency UC Regents Spring 2014 © UCB

Organization: Multi-threaded like Niagara

Thread scheduler

2048 registers in total. Several programmer models available. Largest model has 256 registers per thread, supporting 8 active threads.
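The trade-off between registers per thread and active threads is a fixed-budget division. In the sketch below only the 256-register model comes from the slide; the smaller models are hypothetical values chosen to show the trend.

```python
TOTAL_REGISTERS = 2048   # from the slide

def active_threads(registers_per_thread):
    """How many threads fit when each one gets a fixed slice of the register file."""
    return TOTAL_REGISTERS // registers_per_thread

# 256 is the slide's largest model; 128, 64 and 32 are hypothetical.
for regs in (256, 128, 64, 32):
    print(f"{regs} registers/thread -> {active_threads(regs)} active threads")
```

Fewer registers per thread means more threads available to hide latency, at the cost of more pressure on each thread's register allocation.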

Page 31:

Organization: Multi-threaded, In-order

Thread scheduler

The SIMD math units live here

Each cycle, 3 threads can issue 2 in-order instructions.

Page 32:

Bandwidth to DRAM is 5.6X Xeon Ivy Bridge

But, DRAM limited to 6GB, and all caches are small compared to Xeon

Page 33:

nVidia GPU

5.12 TeraOps/s

2880 MACs @ 889 MHz (single-precision multiply-adds)

Kepler GK 110

$999 GTX Titan Black with 6GB GDDR5 (and 1 GPU)

Page 34:

On Thursday

To be continued ...

Have fun in section!