COSC 6385 Computer Architecture - Multi-Processors (II)

The IBM Cell, Intel Larrabee and Nvidia G80 processors

Edgar Gabriel

Fall 2008

References

• Intel Larrabee:
  [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan: “Larrabee: a many-core x86 architecture for visual computing”, ACM Trans. Graph., vol. 27, no. 3 (August 2008), pp. 1-15.
      http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf

• IBM Cell processor:
  [2] C. R. Johns, D. A. Brokenshire: “Introduction to the Cell Broadband Engine Architecture”, IBM Journal of Research and Development, vol. 51, no. 5, pp. 503-519.
      http://www.research.ibm.com/journal/rd/515/johns.pdf
  [3] M. Kistler, M. Perrone, F. Petrini: “Cell Multiprocessor Communication Network: Built for Speed”, IEEE Micro, vol. 26, no. 3, pp. 10-23.
      http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf

• Nvidia G80:
  [4] Scott Wasson: “Nvidia GeForce 8800 graphics processor”.
      http://techreport.com/articles.x/11211/1

Larrabee Motivation

• Comparison of two architectures with the same number of transistors
  – The simplified in-order core delivers half the single-stream performance
  – 20x increase in peak vector throughput for multi-stream execution (160 vs. 8 per clock)

                     2 out-of-order cores   10 in-order cores
Instruction issue    4                      2
VPU per core         4-wide SSE             16-wide
L2 cache size        4 MB                   4 MB
Single stream        4 per clock            2 per clock
Vector throughput    8 per clock            160 per clock
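The vector throughput row of the table can be reproduced from the other rows. The sketch below assumes that peak vector throughput is simply the number of cores times the VPU width; the dictionary layout and function name are illustrative only.

```python
# Figures copied from the comparison table; the structure is illustrative.
designs = {
    "2 out-of-order cores": {"cores": 2, "vpu_width": 4, "single_stream": 4},
    "10 in-order cores": {"cores": 10, "vpu_width": 16, "single_stream": 2},
}

def peak_vector_throughput(design):
    """Vector results per clock if every VPU lane of every core is busy."""
    return design["cores"] * design["vpu_width"]

ooo = designs["2 out-of-order cores"]
ino = designs["10 in-order cores"]

vector_ratio = peak_vector_throughput(ino) / peak_vector_throughput(ooo)
single_stream_ratio = ino["single_stream"] / ooo["single_stream"]

print(peak_vector_throughput(ooo), peak_vector_throughput(ino))  # 8 160
print(vector_ratio, single_stream_ratio)                         # 20.0 0.5
```

These are peak numbers; they say nothing about how much of the vector width a real workload can keep busy.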

Larrabee Overview

• Many-core visual computing architecture
• Based on x86 CPU cores
  – Extended version of the regular x86 instruction set
  – Supports subroutines and page faulting
• The number of x86 cores can vary depending on the implementation and processor version
• Fixed-function units for texture filtering
  – Other graphics operations such as rasterization or post-shader blending are done in software

Larrabee Overview (II)

Image Source: [1]

Overview of a Larrabee Core (I)

Image Source: [1]

Overview of a Larrabee Core (II)

• x86 core derived from the Pentium processor
  – No out-of-order execution
• Standard Pentium instruction set with the addition of
  – 64-bit instructions
  – Instructions for pre-fetching data into the L1 and L2 caches
  – Support for 4 simultaneous threads, with separate registers for each thread
• Each core is augmented with a wide vector processing unit (VPU)
• 32 KB L1 instruction cache, 32 KB L1 data cache
• 256 KB ‘local subset’ of the L2 cache
  – The L2 cache is coherent across all cores

Vector Processing Unit in Larrabee

• 16-wide VPU executing integer, single-precision and double-precision floating-point operations
• VPU supports gather-scatter operations
  – The 16 elements can be loaded from or stored to up to 16 different addresses
• Support for predicated instructions using a mask control register (if-then-else statements)
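To make gather-scatter and predication concrete, here is a minimal pure-Python emulation of a 16-lane vector unit. It is only a sketch of the semantics: the function names are invented for illustration and are not Larrabee instructions, and real hardware handles all lanes in one instruction rather than in a loop.

```python
VPU_WIDTH = 16  # lanes per vector, as stated on the slide

def gather(memory, addresses):
    """Load one element per lane from up to 16 different addresses."""
    assert len(addresses) == VPU_WIDTH
    return [memory[a] for a in addresses]

def scatter(memory, addresses, values):
    """Store one element per lane to up to 16 different addresses."""
    for a, v in zip(addresses, values):
        memory[a] = v

def predicated_select(mask, then_vals, else_vals):
    """Per-lane if-then-else: both sides are computed, the mask picks one."""
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

memory = list(range(100, 164))                        # 64 words of toy memory
addresses = [(3 * i) % 64 for i in range(VPU_WIDTH)]  # strided, non-contiguous

x = gather(memory, addresses)
mask = [v % 2 == 0 for v in x]                        # condition: "x is even"
y = predicated_select(mask, [v // 2 for v in x], [3 * v + 1 for v in x])
scatter(memory, addresses, y)
```

The mask replaces a branch: lanes where the condition holds take the "then" result, all other lanes take the "else" result, so all 16 lanes stay in lock-step.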

Inter-Processor Ring Network

• Bi-directional ring network
• 512 bits wide per direction
• Routing decisions are made before a message is injected into the network
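One way to picture a routing decision made before injection is a shortest-direction choice on the ring. The sketch below assumes exactly that policy; the number of ring stops, the tie-breaking rule and the function name are hypothetical and not taken from [1].

```python
RING_WIDTH_BITS = 512                      # per direction, from the slide
BYTES_PER_DIRECTION = RING_WIDTH_BITS // 8

def route_on_ring(src, dst, num_stops):
    """Pick direction and hop count before the message enters the ring.

    Assumption: the sender takes the shorter way around and prefers
    clockwise on a tie.
    """
    clockwise = (dst - src) % num_stops
    counter_clockwise = (src - dst) % num_stops
    if clockwise <= counter_clockwise:
        return ("clockwise", clockwise)
    return ("counter-clockwise", counter_clockwise)

print(BYTES_PER_DIRECTION)        # 64
print(route_on_ring(0, 3, 16))    # ('clockwise', 3)
print(route_on_ring(0, 13, 16))   # ('counter-clockwise', 3)
```

Because the direction is fixed at the sender, the ring stops themselves never have to decide anything; they only pass a message on or deliver it.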

Larrabee Programming Models

• Most applications can be executed without modification due to the full support of the x86 instruction set
• Support for POSIX threads to create multiple threads
  – API extended by thread affinity parameters
• Recompiling code with Larrabee’s native compiler automatically generates code that uses the VPUs
• Alternative parallel approaches
  – Intel Threading Building Blocks
  – Larrabee-specific OpenMP directives

Larrabee Performance

Image Source: [1]

IBM Cell Overview (I)

• Cell Broadband Engine Architecture (CBEA) defined by a consortium of IBM, Sony, and Toshiba
• Originally targeting the multimedia industry
  – E.g. PlayStation 3, Toshiba HDTV, etc.
• Also sold by IBM as regular compute blades
  – IBM QS20, QS21, QS22
• Main idea: heterogeneous microprocessor consisting of
  – one (or more) general-purpose processor elements (PPE) and
  – one (or more) synergistic processor elements (SPEs)

Cell Architecture block diagram

Image Source: [2]

• Two generations available so far:

– Cell BE:

• 204.8 GFLOPS single precision peak performance

• 14.6 GFLOPS double precision peak performance

– PowerXCell 8i (2008):

• 204.8 GFLOPS single precision peak performance

• 102.4 GFLOPS double precision peak performance

– Both have 1 PPE and 8 SPEs
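The peak figures above can be reproduced with a short calculation. The sketch rests on assumptions that are not stated on the slide: a 3.2 GHz clock, only the 8 SPEs counted, one fused multiply-add (2 FLOPs) per lane per issue, and a double-precision instruction issued only every 7 cycles on the first-generation Cell BE.

```python
CLOCK_GHZ = 3.2      # assumption: clock rate is not given on the slide
NUM_SPES = 8         # from the slide
FLOPS_PER_FMA = 2    # assumption: a fused multiply-add counts as 2 FLOPs

def peak_gflops(lanes, issue_interval_cycles=1):
    """Peak GFLOPS of all SPEs for a given SIMD lane count and issue rate."""
    return NUM_SPES * lanes * FLOPS_PER_FMA * CLOCK_GHZ / issue_interval_cycles

single_precision = peak_gflops(lanes=4)                        # 4 x 32 bit per register
double_powerxcell = peak_gflops(lanes=2)                       # 2 x 64 bit per register
double_cell_be = peak_gflops(lanes=2, issue_interval_cycles=7) # assumed 7-cycle issue

print(round(single_precision, 1), round(double_powerxcell, 1), round(double_cell_be, 1))
# 204.8 102.4 14.6
```

Under these assumptions the two generations differ only in how often a double-precision instruction can be issued, which matches the unchanged single-precision figure.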

General Purpose Processor (PPE)

• Based on the IBM PowerPC processor
  – Supports multiple simultaneous operating environments (virtualization)
  – E.g. can execute an instance of a real-time operating system and an instance of a non-real-time operating system
• Performs management and application control functions

Synergistic Processor Element (SPE)

• SIMD processor used for offloading compute-intensive, data-parallel operations from the PPE
• Each SPE has its own local storage and can access data only from the local storage
  – Current versions of the Cell processor: 256 KB local storage
• The local storage is connected to main memory through a Memory Flow Controller (MFC)
  – The MFC moves data between main memory and local storage, or between two SPEs
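The consequence for software is that data larger than the local storage has to be streamed through it in pieces. The toy model below illustrates that pattern in plain Python. The functions mfc_get and mfc_put are hypothetical stand-ins, not the Cell SDK API, and the element size and the share of the local storage reserved for data are assumptions.

```python
LOCAL_STORE_BYTES = 256 * 1024   # local storage size from the slide
ELEM_BYTES = 4                   # assumption: 32-bit elements
# Assumption: half of the local storage is left for code and stack.
CHUNK_ELEMS = (LOCAL_STORE_BYTES // 2) // ELEM_BYTES

def mfc_get(main_memory, start, count):
    """Hypothetical stand-in for an MFC transfer: main memory -> local storage."""
    return main_memory[start:start + count]

def mfc_put(main_memory, start, local_store):
    """Hypothetical stand-in for an MFC transfer: local storage -> main memory."""
    main_memory[start:start + len(local_store)] = local_store

def spe_scale(main_memory, factor):
    """Scale an array that does not fit into the local storage, chunk by chunk."""
    transfers = 0
    for start in range(0, len(main_memory), CHUNK_ELEMS):
        local_store = mfc_get(main_memory, start, CHUNK_ELEMS)
        local_store = [v * factor for v in local_store]  # compute on local data only
        mfc_put(main_memory, start, local_store)
        transfers += 1
    return transfers

data = list(range(100_000))
chunks = spe_scale(data, 2)
print(CHUNK_ELEMS, chunks)   # 32768 4
```

A real implementation would typically overlap the transfer of the next chunk with the computation on the current one; the sketch keeps the two steps sequential for clarity.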

MFC commands

Image Source: [2]

Synergistic Processor Element (SPE) (II)

• Each SPE has 128 registers
• Each register is 128 bits wide and can hold
  – Sixteen 8-bit integers or
  – Eight 16-bit integers or
  – Four 32-bit integers or single-precision floating-point numbers or
  – Two 64-bit integers or double-precision floating-point numbers
• Most instructions supported by the synergistic processor unit utilize all elements in a register -> SIMD
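The lane counts follow directly from dividing the register width by the element width. The sketch below checks this and emulates one SIMD addition on a 16-byte value using Python's struct module. The helper names are illustrative, and the byte order chosen for packing only matters for the emulation.

```python
import struct

REGISTER_BITS = 128  # register width from the slide

def lanes(element_bits):
    """Number of elements of the given width that fit into one register."""
    return REGISTER_BITS // element_bits

def simd_add_f32(reg_a, reg_b):
    """Add two 16-byte registers lane-wise as four 32-bit floats."""
    a = struct.unpack(">4f", reg_a)
    b = struct.unpack(">4f", reg_b)
    return struct.pack(">4f", *(x + y for x, y in zip(a, b)))

print([lanes(bits) for bits in (8, 16, 32, 64)])   # [16, 8, 4, 2]

reg_a = struct.pack(">4f", 1.0, 2.0, 3.0, 4.0)
reg_b = struct.pack(">4f", 10.0, 20.0, 30.0, 40.0)
result = struct.unpack(">4f", simd_add_f32(reg_a, reg_b))
print(result)                                      # (11.0, 22.0, 33.0, 44.0)
```

One call to simd_add_f32 stands for a single SIMD instruction: four additions are performed on one pair of registers.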

Simplified representation of a current Cell processor

Image Source: [3]

Element Interconnect Bus

• PPE and SPEs communicate through the Element Interconnect Bus
  – Contains a shared command bus
    • Sets up end-to-end transactions
    • Used for coherence protocols
  – Point-to-point data interconnect
    • Four 16-byte-wide rings: two for clockwise and two for counter-clockwise data transfers
    • Each ring transfers 128-byte packets (= cache block size of an SPE)
    • Communication cost between two SPEs varies between 1 hop and 6 hops
  – Overall bandwidth: 204.8 GB/s
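A back-of-the-envelope check of these numbers is sketched below. The ring width, packet size and ring count come from the slide. The 1.6 GHz bus clock and the rule that the shared command bus can start at most one 128-byte transfer per bus cycle are assumptions made here to arrive at the quoted bandwidth.

```python
RING_WIDTH_BYTES = 16   # from the slide
PACKET_BYTES = 128      # from the slide
BUS_CLOCK_GHZ = 1.6     # assumption: the bus runs at half of a 3.2 GHz core clock

# A 128-byte packet occupies a 16-byte-wide ring for this many bus cycles.
cycles_per_packet = PACKET_BYTES // RING_WIDTH_BYTES

# Assumption: the command bus starts at most one 128-byte transfer per bus
# cycle, which caps the aggregate data rate regardless of the ring count.
peak_gb_per_s = PACKET_BYTES * BUS_CLOCK_GHZ

print(cycles_per_packet)   # 8
print(peak_gb_per_s)       # 204.8
```

Under these assumptions the command bus, not the raw width of the four rings, is what limits the overall bandwidth.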

Comparison of IBM Cell and Intel Larrabee

• Both use a large number of small and simple cores
• Both use a high-bandwidth ring bus to communicate between the cores
• Intel Larrabee is homogeneous, while IBM Cell is a heterogeneous processor (difference between PPE and SPE)
• IBM Cell requires data to be moved explicitly to the ‘local store’, while Larrabee can address any memory area
  – Programs for the Cell have to be written taking the limited amount of memory available to an SPE into account

Nvidia G80

• Parallel Stream Processor
  – Each green block in the block diagram is a stream processor (SP)
  – 16 stream processors are grouped and connected by an L1 cache
  – Each G80 has 8 groups with 16 SPs = 128 SPs total
  – Each SP is a general-purpose processor running at 1.35 GHz
  – Each SP operates on a single element (scalar)
  – The groups are connected by a crossbar-style switch, which also connects them to six ROPs
  – Each ROP has its own L2 cache and a 64-bit-wide interface to graphics memory (frame buffer)
  – 6 * 64 bits = 384-bit path to memory
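The totals on this slide are simple products, which the following sketch recomputes. The peak arithmetic rate at the end is not on the slide; it assumes that each SP completes one multiply-add (2 FLOPs) per clock and is meant as a rough upper bound only.

```python
GROUPS = 8              # from the slide
SPS_PER_GROUP = 16      # from the slide
SP_CLOCK_GHZ = 1.35     # from the slide
ROPS = 6                # from the slide
ROP_INTERFACE_BITS = 64 # from the slide

total_sps = GROUPS * SPS_PER_GROUP
memory_path_bits = ROPS * ROP_INTERFACE_BITS

# Assumption: one multiply-add (2 FLOPs) per SP per clock.
peak_mad_gflops = total_sps * 2 * SP_CLOCK_GHZ

print(total_sps, memory_path_bits)   # 128 384
print(round(peak_mad_gflops, 1))     # 345.6
```

Because each SP is scalar, the parallelism comes from running many SPs side by side rather than from wide vector registers as in the Cell SPE or the Larrabee VPU.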

Nvidia G80 (I)

Performance comparison G80 to IBM Cell

Source: http://gametomorrow.com/blog/index.php/2007/09/05/cell-vs-g80/

• Ray Tracing Application