
Page 1

UC Regents Spring 2014 © UCB

CS 152 L22: GPU + SIMD + Vectors

2014-4-15

John Lazzaro (not a prof - “John” is always OK)

CS 152 Computer Architecture and Engineering

www-inst.eecs.berkeley.edu/~cs152/

TA: Eric Love

Lecture 22 -- GPU + SIMD + Vectors I

Page 2

Today: Architecture for data parallelism

The Landscape: Three chips that deliver TeraOps/s in 2014, and how they differ.

GK110: nVidia’s flagship Kepler GPU, customized for compute applications.

Short Break

E5-2600v2: Stretching the Xeon server approach for compute-intensive apps.

Page 3

Sony/IBM Playstation PS3 Cell Chip - Released 2006

Page 4

Sony PS3 Cell Processor SPE Floating-Point

Single-Instruction Multiple-Data (SIMD): four 32-bit lanes.

4 single-precision multiply-adds issue in lockstep (SIMD) per cycle. 6 cycle latency (in blue).

6 gamer SPEs, 3.2 GHz clock --> 150 GigaOps/s
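The 150 GigaOps/s figure can be reproduced from the numbers on the slide. The sketch below assumes that each multiply-add counts as two operations, which is not stated on the slide but is the reading that matches its total; the constant names are illustrative.

```python
# Peak single-precision throughput implied by the slide's figures.
SPES = 6          # "gamer" SPEs (from the slide)
LANES = 4         # 32-bit lanes per SIMD instruction (from the slide)
OPS_PER_MAC = 2   # assumption: one multiply-add counts as 2 operations
CLOCK_HZ = 3.2e9  # 3.2 GHz clock (from the slide)

cell_peak_ops = SPES * LANES * OPS_PER_MAC * CLOCK_HZ
print(f"{cell_peak_ops / 1e9:.1f} GigaOps/s")  # prints "153.6 GigaOps/s"
```

The slide rounds 153.6 down to 150.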

Page 5

Sony PS3 Cell Processor SPE Floating-Point

Single-Instruction Multiple-Data (SIMD): four 32-bit lanes.

In the 1970s a big part of a computer architecture class would be learning how to build units like this.

Top-down (f.p. format) && Bottom-up (logic design)

Page 6

Sony PS3 Cell Processor SPE Floating-Point

The PS3 ceded ground to Xbox not because it was underpowered, but because it was hard to program.

Today, the formats are standards (IEEE f.p.) and the bottom-up is now “EE.”

Architects focus on how to organize floating point units into programmable machines for application domains.

Page 7

2014: TeraOps/Sec Chips

Page 8

Intel E5-2600v2

12-core Xeon Ivy Bridge

0.52 TeraOps/s

12 cores @ 2.7 GHz. Each core can issue 16 single-precision operations per cycle.

$2,600 per chip

Haswell: 1.04 TeraOps/s
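Both TeraOps/s figures follow from cores x clock x operations per cycle. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and the Haswell line assumes the same 12 cores and 2.7 GHz clock with the 32 ops/cycle quoted later in the deck.

```python
def peak_ops_per_second(cores, clock_hz, ops_per_cycle_per_core):
    """Peak throughput if every core issues its maximum every cycle."""
    return cores * clock_hz * ops_per_cycle_per_core

# Ivy Bridge E5-2600v2: figures taken from the slide.
ivy_bridge = peak_ops_per_second(12, 2.7e9, 16)
# Haswell: assumption of identical core count and clock, 32 ops/cycle.
haswell = peak_ops_per_second(12, 2.7e9, 32)

print(f"Ivy Bridge: {ivy_bridge / 1e12:.2f} TeraOps/s")  # 0.52
print(f"Haswell:    {haswell / 1e12:.2f} TeraOps/s")     # 1.04
```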

Page 9

EECS 150: Graphics Processors UC Regents Fall 2013 © UCB

nVidia GPU

Kepler GK 110

5.12 TeraOps/s

2880 MACs @ 889 MHz, single-precision multiply-adds

$999: GTX Titan Black with 6GB GDDR5 (and 1 GPU)

Page 10

XC7VX980T

Xilinx Virtex 7 with the most DSP blocks.

5.14 TeraOps/s

3600 MACs @ 714 MHz. Comparable to single-precision floating-point.

$16,824 per chip

Typical application: Medical imaging scanners, for first stage of processing after the A/D converters.

(die photo of a related part)
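The GPU and FPGA figures are both quoted as a count of multiply-accumulate (MAC) units at a clock rate. As a rough check, assuming each MAC contributes two operations per cycle (an assumption that matches the quoted totals), the arithmetic looks like this; `mac_peak` is an illustrative name.

```python
def mac_peak(macs, clock_hz, ops_per_mac=2):
    """Peak ops/s for a chip described as N MAC units at a given clock."""
    return macs * clock_hz * ops_per_mac

gk110 = mac_peak(2880, 889e6)    # nVidia Kepler GK 110
virtex7 = mac_peak(3600, 714e6)  # Xilinx XC7VX980T

print(f"GK110:    {gk110 / 1e12:.2f} TeraOps/s")    # 5.12
print(f"Virtex 7: {virtex7 / 1e12:.2f} TeraOps/s")  # 5.14
```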

Page 11

Intel E5-2600v2

How?

12 cores @ 2.7 GHz. Each core can issue 16 single-precision ops/cycle.

Haswell cores issue 32/cycle.

Page 12

Die closeup of one Sandy Bridge core

Advanced Vector Extensions (AVX) unit

Smaller than L3 cache, but larger than L2 cache. Relative area has increased in Haswell.

Page 13

Programmer's Model

AVX

IA-32 Nehalem

8 128-bit registers

Each register holds 4 IEEE single-precision floats

The programmer's model has many variants, which we will introduce in the slides that follow.

Page 14

Example AVX Opcode

VMULPS XMM4 XMM2 XMM3

XMM4 = XMM2 op XMM3, where op = *

Multiply two 4-element vectors of single-precision floats, element by element.

New issue every cycle. 5 cycle latency (Haswell).

Aside from its use of a special register set, VMULPS executes like normal IA-32 instructions.
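To make the element-by-element semantics concrete, here is a small software model of the operation. It is only a sketch: `vmulps` is an illustrative Python function, not a binding to the real instruction, and Python floats are double precision, so the inputs are chosen to be exactly representable.

```python
def vmulps(src1, src2):
    """Model of a 4-lane packed multiply: lane i of the result is src1[i] * src2[i]."""
    if len(src1) != 4 or len(src2) != 4:
        raise ValueError("a 128-bit register holds exactly 4 single-precision floats")
    return [a * b for a, b in zip(src1, src2)]

xmm2 = [1.0, 2.0, 3.0, 4.0]
xmm3 = [0.5, 0.5, 2.0, 2.0]
xmm4 = vmulps(xmm2, xmm3)
print(xmm4)  # [0.5, 1.0, 6.0, 8.0]
```

All four lanes are independent, which is what lets the hardware compute them in lockstep.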

Page 15

Sandy Bridge, Haswell

Sandy Bridge extends register set to 256 bits: vectors are twice the size.

x86-64 (Intel 64) AVX/AVX2 has 16 registers (IA-32: 8)

Haswell adds 3-operand fused multiply-add (FMA) instructions: a*b + c

2 EX units with FMA --> 2X increase in ops/cycle
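The 16 and 32 ops/cycle figures quoted for the Xeon cores can be rebuilt from the register width and the number of execution units. The sketch below assumes that the pre-Haswell cores issue one packed multiply and one packed add per cycle on separate units; the slides do not spell this out, so treat it as one plausible accounting.

```python
REGISTER_BITS = 256  # AVX register width from Sandy Bridge onward
FLOAT_BITS = 32      # IEEE single precision
lanes = REGISTER_BITS // FLOAT_BITS  # 8 floats per register

# Assumption: Sandy Bridge / Ivy Bridge issue one multiply and one add per cycle.
sandy_ivy_ops_per_cycle = lanes * 2

# Haswell: 2 FMA units, and each fused multiply-add counts as 2 ops per lane.
FMA_UNITS = 2
OPS_PER_FMA = 2
haswell_ops_per_cycle = lanes * FMA_UNITS * OPS_PER_FMA

print(sandy_ivy_ops_per_cycle, haswell_ops_per_cycle)  # 16 32
```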

Page 16

OoO Issue

Haswell (2013)

Haswell sustains 4 micro-op issues per cycle. One possibility: 2 for AVX, and 2 for Loads, Stores and book-keeping.

Haswell has two copies of the FMA engine, on separate ports.

Page 17

AVX: Not just single-precision floating-point

AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc ...

256-bit version -> double-precision vectors of length 4
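The idea that one register is just 128 bits, and the opcode decides how to slice it, can be illustrated with Python's `struct` module. This is an analogy in software, not a model of the hardware; the variable names are illustrative.

```python
import struct

# 16 bytes = 128 bits, filled here by packing four single-precision floats.
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

as_floats = struct.unpack("<4f", raw)   # 4 x 32-bit floats
as_doubles = struct.unpack("<2d", raw)  # same bits read as 2 x 64-bit doubles
as_int8s = struct.unpack("<16b", raw)   # same bits read as 16 x 8-bit integers

print(len(raw) * 8, len(as_floats), len(as_doubles), len(as_int8s))  # 128 4 2 16

# A 256-bit register holds this many doubles:
print(256 // 64)  # 4
```

The bit pattern never changes between the three views; only the interpretation does.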

Page 18

Exception Model

MXCSR: AVX condition codes register

Floating-point exceptions: Always a contentious issue in ISA design ...

Page 19

Exception Handling

Use MXCSR to configure AVX to halt program for divide by zero, etc ...

Or, configure AVX for “show must go on” semantics: on error, results are set to +Inf, -Inf, NaN, ...
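The two policies can be contrasted with a toy divide routine. This is a hypothetical software model: the `trap` flag stands in for the MXCSR configuration and does not reflect the register's actual bit layout.

```python
import math

def divide(a, b, trap=False):
    """Divide a by b under one of two exception policies (toy model)."""
    if b == 0.0:
        if trap:
            # "Halt the program" policy.
            raise FloatingPointError("divide by zero")
        # "Show must go on" policy: substitute a special value.
        if a == 0.0 or math.isnan(a):
            return math.nan
        return math.copysign(math.inf, a) * math.copysign(1.0, b)
    return a / b

print(divide(1.0, 0.0))   # inf
print(divide(-1.0, 0.0))  # -inf
print(divide(0.0, 0.0))   # nan
```

With `trap=True` the same inputs raise `FloatingPointError` instead of returning a special value.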

Page 20

Data moves

AVX register file reads pass through permute and shuffle networks in both “X” and “Y” dimensions.

Many AVX instructions rely on this feature ...

Page 21

Pure data move opcode.

Or, part of a math opcode.

Page 22

Permutes over 2 sets of 4 fields of one vector.

Arbitrary data alignment

Shuffling two vectors.
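The slides do not name the opcodes behind these diagrams, so the following is a generic sketch of the two data-movement patterns rather than the semantics of any particular AVX instruction. Both helper names are made up for illustration.

```python
def permute_in_lanes(vec, select):
    """Apply the same 4-entry selection inside each 4-field half of an 8-field vector."""
    if len(vec) != 8 or len(select) != 4:
        raise ValueError("expected an 8-field vector and 4 selectors")
    out = []
    for base in (0, 4):  # the two sets of 4 fields
        out.extend(vec[base + s] for s in select)
    return out

def shuffle_two(a, b, select):
    """Build a result by picking fields, by index, from the concatenation a + b."""
    pool = list(a) + list(b)
    return [pool[s] for s in select]

print(permute_in_lanes([0, 1, 2, 3, 4, 5, 6, 7], [3, 2, 1, 0]))  # [3, 2, 1, 0, 7, 6, 5, 4]
print(shuffle_two([0, 1, 2, 3], [10, 11, 12, 13], [0, 4, 1, 5]))  # [0, 10, 1, 11]
```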

Page 23

Memory System

Gather: Reading non-unit-stride memory locations into arbitrary positions in an AVX register, while minimizing redundant reads.

Values in memory. Specified indices. Final result.
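A gather can be modelled in a few lines. In this sketch, `gather` is an illustrative function; the dictionary that avoids re-reading a repeated index is just one simple way to picture "minimizing redundant reads" and says nothing about how the hardware does it.

```python
def gather(memory, indices):
    """Return ([memory[i] for i in indices], number of distinct reads performed)."""
    fetched = {}
    for i in indices:
        if i not in fetched:  # read each distinct location only once
            fetched[i] = memory[i]
    return [fetched[i] for i in indices], len(fetched)

memory = [100.0 + k for k in range(16)]  # values in memory
indices = [3, 0, 3, 12, 7, 7, 1, 15]     # specified indices (non-unit stride)
result, reads = gather(memory, indices)

print(result)  # [103.0, 100.0, 103.0, 112.0, 107.0, 107.0, 101.0, 115.0]
print(reads)   # 6
```

Eight result fields are filled with six memory reads because indices 3 and 7 each appear twice.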

Page 24

Positive observations ...

Best for applications that are a good fit for Xeon’s memory system: Large on-chip caches, up-to-a-TeraByte of DRAM, but only moderate bandwidth requirements to DRAM.

Applications that do “a lot of everything” -- integer, random-access loads/stores, string ops -- gain access to a significant fraction of a TeraOp/s of floating point, with no context switching.

If you’re planning on experimenting with GPUs, you need a Xeon server anyway ... aside from $$$, why not buy a high-core-count variant?

Page 25

Negative observations ...

AVX changes each generation, in a backward compatible way, to add the latest features.

AVX is difficult for compilers. Ideally, someone has written a library of hand-crafted AVX assembly code that does exactly what you want.

Two FMA units per core (50% of issue width) is probably the limit. So, scaling vector size or scaling core count are the only upgrade paths.

0.52 TeraOp/s (Ivy Bridge) << 5.12 TeraOp/s (GK110)

And $2700 (chip only) >> $999 (Titan Black card).

59.6 GB/s << 336 GB/s (memory bandwidth)
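The three inequalities above can be turned into ratios using only the numbers on the slide. The variable names are illustrative.

```python
# Figures as quoted on the slide.
xeon_teraops, gpu_teraops = 0.52, 5.12
xeon_price, gpu_price = 2700, 999
xeon_gbps, gpu_gbps = 59.6, 336.0

compute_ratio = gpu_teraops / xeon_teraops  # GPU advantage in peak ops
price_ratio = xeon_price / gpu_price        # Xeon chip cost relative to GPU card
bandwidth_ratio = gpu_gbps / xeon_gbps      # GPU advantage in DRAM bandwidth

print(f"{compute_ratio:.1f}X compute")      # 9.8X compute
print(f"{price_ratio:.1f}X price")          # 2.7X price
print(f"{bandwidth_ratio:.1f}X bandwidth")  # 5.6X bandwidth
```

The bandwidth ratio agrees with the 5.6X figure quoted later in the deck.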

Page 26

Break

Page 27

nVidia GPU

Kepler GK 110

The granularity of SMX cores (15 per die) matches the Xeon core count (12 per die)

Page 28

SMX core (28 nm)

Sandy Bridge core (32 nm)

Page 29

889 MHz GK 110 SMX core vs 2.7 GHz Haswell core

1024-bit SIMD vectors: 4X more than Haswell. 32 single-precision floats or 16 double-precision floats.

[Figure: SMX execution units. 6 single-precision units, 2 double-precision units, special ops, memory ops.]

Execution units vs. Haswell: 3X (single-precision), 1X (double-precision)

Clock speed vs Ivy Bridge Xeon: 3X slower

Net per-core throughput (vector width x units / clock): 4X single-precision, 1.33X double-precision
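These per-core numbers tie back to the chip-level MAC count. The sketch below uses the six single-precision units from the figure and the 15 SMX cores per die quoted earlier in the deck; the way the three ratios are combined is my reading of the slide, not something it states explicitly.

```python
VECTOR_BITS = 1024
sp_lanes = VECTOR_BITS // 32  # 32 single-precision floats per vector
dp_lanes = VECTOR_BITS // 64  # 16 double-precision floats per vector

SP_UNITS_PER_SMX = 6  # from the figure
SMX_PER_DIE = 15      # from the earlier GK 110 slide
macs_per_die = sp_lanes * SP_UNITS_PER_SMX * SMX_PER_DIE
print(macs_per_die)  # 2880

# Per-core throughput relative to Haswell: width ratio x unit ratio / clock ratio.
width_ratio = VECTOR_BITS / 256  # 4X wider vectors
clock_ratio = 3                  # SMX clock is about 3X slower
sp_relative = width_ratio * 3 / clock_ratio  # 3X the single-precision units
dp_relative = width_ratio * 1 / clock_ratio  # 1X the double-precision units
print(f"{sp_relative:.2f}X single, {dp_relative:.2f}X double")  # 4.00X single, 1.33X double
```

The 2880 result matches the MAC count quoted for the GK 110.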

Page 30

CS 152 L14: Cache Design and Coherency UC Regents Spring 2014 © UCB

Organization: Multi-threaded like Niagara

Thread scheduler

2048 registers in total. Several programmer models available. Largest model has 256 registers per thread, supporting 8 active threads.

Page 31

Organization: Multi-threaded, In-order

Thread scheduler

The SIMD math units live here

Each cycle, 3 threads can issue 2 in-order instructions.

Page 32

Bandwidth to DRAM is 5.6X Xeon Ivy Bridge

But, DRAM limited to 6GB, and all caches are small compared to Xeon

Page 33

nVidia GPU

Kepler GK 110

5.12 TeraOps/s

2880 MACs @ 889 MHz, single-precision multiply-adds

$999: GTX Titan Black with 6GB GDDR5 (and 1 GPU)

Page 34

On Thursday

To be continued ...

Have fun in section!