42
September 13, 2010 INF5063: Programming heterogeneous multi-core processors

INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

  • Upload
    lamkien

  • View
    220

  • Download
    2

Embed Size (px)

Citation preview

Page 1: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

September 13, 2010

INF5063: Programming heterogeneous multi-core processors

Page 2: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Overview

  Course topic and scope

  Background for the use and parallel processing using heterogeneous multi-core processors

  Examples of heterogeneous architectures

Page 3: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063: The Course

Page 4: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

People

  Håvard Espeland email: haavares @ ifi

  Håkon Kvale Stensland email: haakonks @ ifi

  Carsten Griwodz email: griff @ ifi

  Pål Halvorsen email: paalh @ ifi

Page 5: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Time and place

  Lectures: Fridays 13.15 – 15.00 Store Aud. ??? Veilabben???

  NB! The web page states that we will have group exercises on

Thursdays 10.15 - 12.00, 3B. However, there will NOT be any weekly exercises, but this hour is assigned for your mandatory assignments (we will NOT be there).

Page 6: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

About INF5063: Topic & Scope

  Content: The course gives …

-  … an overview of heterogeneous multi-core processors in general and three variants in particular and a modern general-purpose core (architectures and use)

-  … an introduction to working with heterogeneous multi-core processors

•  Intel IXP 2400 network processor card

•  SSEx for x86

•  nVIDIA’s family of GPUs and the CUDA programming framework

•  The Cell Broadband Engine Architecture

-  … some ideas of how to use/program heterogeneous multi-core processors

Page 7: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

About INF5063: Topic & Scope   Tasks:

The important part of the course is lab-assignments where you program each of the three examples of heterogeneous multi-core processors

  1 exercise (not graded) on Intel IXP

-  packet counter – download, run and extend wwpingbump

  3 graded home exams (counting 33% each):

-  Deliver code

-  Make a demonstration and explain your design and code to the class

1.  On the x86

•  Video encoding – Improve performance of video compression by using SSE instructions.

2.  On the nVIDIA graphics cards

•  Video encoding – Improve the performance of video compression by using the G80 architecture

3.  On the Cell processor

•  Video encoding – the same as above, but exploit the parallelity of the Cell processor’s SBEs

Page 8: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Available Resources

  Resources will be placed at

- http://www.ifi.uio.no/~griff/INF5063 - Login: inf5063 - Password: ixp

- Manuals, papers, code example, …

Page 9: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

Background and Motivation:

Moore’s Law

Page 10: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Motivation: Intel View

  >billion transistors integrated 2010: •  2,3 billion - Intel 8-Core Xeon Nehalem-EX •  3,0 billion - nVidia GF100 (Fermi)

1971: •  2,300 - Intel 4004

Page 11: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Motivation: Intel View

  >billion transistors integrated   Clock frequency can still increase

2010: •  5 (6) GHz – IBM Power6

Page 12: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Motivation: Intel View

  >billion transistors integrated   Clock frequency can still increase   Future applications will demand TIPS

2010: 147,600 MIPS @ 3.3 GHz – Intel Core i7

Extreme Edition i980EE (6 cores)

Page 13: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Motivation: Intel View

  >billion transistors integrated   Clock frequency can still increase   Future applications will demand TIPS   Power? Heat?

Page 14: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Motivation: Intel View

  Soon >billion transistors integrated   Clock frequency can still increase   Future applications will demand TIPS   Power? Heat?

Page 15: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Motivation “Future applications will demand TIPS”

“Think platform beyond a single processor”

“Exploit concurrency at multiple levels”

“Power will be the limiter due to complexity and leakage”

Distribute workload on multiple cores

Page 16: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Symmetric Multi-Core Processors

Phenom X4

Page 17: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Symmetric Multi-Core Processors

UltraSparc

Page 18: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Intel Multi-Core Processors

  Symmetric multi-processors allow multi-threaded applications to achieve higher performance at less die area and power consumption than single-core processors

Page 19: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Symmetric Multi-Core Processors   Good - Growing computational power

  Problematic - Growing die sizes - Unused resources

•  Some cores used much more than others •  Many core parts frequently unused

  Why not spread the load better? -  Functions exist only once per core -  Parallel programming is hard

⇒  Asymmetric multi-core processors

Page 20: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Asymmetric Multi-Core Processors

  Asymmetric multi-processors consume power and provide increased computational power only on demand

Highly parallel Moderately parallel Sequential

Page 21: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Motivation “Future applications will demand TIPS”

“Think platform beyond a single processor”

“Exploit concurrency at multiple levels”

“Power will be the limiter due to complexity and leakage”

Distributed workload on multiple cores + simple processors are “easier” to program

+consume less energy

heterogeneous multi-core processors

Page 22: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Co-Processors

  The original IBM PC included a socket for an Intel 8087 floating point co-processor (FPU) -  50-fold speed up of floating point operations

  Intel kept the co-processor up to i486 -  486DX contained an optimized i487 block -  Still separate pipeline (pipeline flush when starting and ending use) -  Communication over an internal bus

  Commodore Amiga was one of the earlier machines that used multiple processors - Motorola 680x0 main processor -  Blitter (block image transferrer - moving data, fill operations, line

drawing, performing boolean operations) -  Copper (Co-Processor - change address for video RAM on the fly)

Page 23: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

What now – are today’s cores really “Symmetric”?

Nehalem

Page 24: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Review of General Data Path on Conventional Computer Hardware Architectures

communication system

application

user space

kernel space

sending:

communication system

application

receiving:

communication system

application

forwarding:

transport (TCP/UDP)

network (IP)

link

Page 25: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Network Processors: Main Idea

Traditional system: - slow - resource demanding - shared with other operations

Network processors: - a computer within the computer - special, programmable hardware - offloads host resources

Page 26: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

IXA: Internet Exchange Architecture

  IXA - a broad term to describe the Intel network architecture - HW & SW, control- & data plane

  IXP: Internet Exchange Processor - processor that implements IXA

- IXP1200 is the first IXP chip (4 versions)

- IXP2xxx has now replaced the first version

Page 27: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

IXA: Internet Exchange Architecture   IXP1200 basic features -  1 embedded 232 MHz StrongARM -  6 packet 232 MHz µengines -  onboard memory -  4 x 100 Mbps Ethernet ports -  multiple, independent busses -  low-speed serial interface -  interfaces for external memory

and I/O busses -  …

Page 28: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

IXA: Internet Exchange Architecture   IXP2400 basic features

-  1 embedded 600 MHz XScale -  8 packet 600 MHz µengines -  onboard memory -  3 x 1 Gbps Ethernet ports -  multiple, independent busses -  low-speed serial interface -  interfaces for external memory

and I/O busses -  …

Page 29: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

IXP1200 Architecture

RISC processor: - StrongARM running Linux - control, higher layer protocols and exceptions - 232 MHz

Microengines: - low-level devices with limited set of instructions - transfers between memory devices - packet processing - 232 MHz

Access units: - coordinate access to external units

Scratchpad: - on-chip memory - used for IPC and synchronization

Page 30: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

IXP1200 IXP2400

SRAM

FLASH

MEMORY MAPPED

I/O

DRAM

SRAM access

SDRAM access

SCRATCH memory

PCI access

IX access

Embedded RISK CPU

(StrongARM)

PCI bus

IX bus

DRAM bus

SRAM bus

microengine 2

microengine 1

microengine 5

microengine 4

microengine 3

microengine 6

multiple independent

internal buses

IXP1200

Page 31: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

IXP2400

IXP2400 Architecture

microengine 8

SRAM

coprocessor

FLASH

DRAM

SRAM access

SDRAM access

SCRATCH memory

PCI access

MSF access

Embedded RISK CPU (XScale)

PCI bus

receive bus

DRAM bus

SRAM bus

microengine 2

microengine 1

microengine 5

microengine 4

microengine 3

multiple independent

internal buses slowport

access

transmit bus

Page 32: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Graphics Processing Units (GPUs)

buss connector

& memory

hub

GPU: a dedicated graphics rendering device

2D 3D

First GPUs,

  80s: for early 2D operations Amiga and Atari used a blitter, Amiga had also the copper

  90s: 3D hardware for game consoles like PS and N64 3dfx Voodoos 3D add-on card for PCs

New powerful GPUs, e.g.,:

  Nvidia GeForce GX280   240 1476 MHz core   1 GB memory   memory BW: 159 GB/sec   PCI Express 2.0

  similar to other manufacturers …

Page 33: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

General Purpose Computing on GPU   The -  high arithmetic precision -  extreme parallel nature -  optimized, special-purpose instructions -  available resources - …

… of the GPU allows for general, non-graphics related operations to be performed on the GPU

  Generic computing workload is off-loaded from CPU and to GPU

⇒  More generically: Heterogeneous multi-core processing

Page 34: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

nVIDIA G200 / GF100

-  1.4 / 3 billion transistors -  240 / 512 shaders

-  512 / 384 bit memory bus (GDDR3 / 5)

-  159 / 177 GB/sec memory bandwidth

-  933 / 1344 Gflops

-  PCI Express 2.0

Page 35: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

nVIDIA GT200 Streaming Multiprocessor ( SM )

Store to

SP 0 RF 0

SP 1 RF 1

SP 2 RF 2

SP 3 RF 3

SP 4 RF 4

SP 5 RF 5

SP 6 RF 6

SP 7 RF 7

Constant L 1 Cache

L 1 Fill

Load from Memory

Load Texture

S F U

S F U

Instruction Fetch Instruction L 1 Cache

Thread / Instruction Dispatch

L 1 Fill

Work

Control

Results

Shared Memory

Store to Memory

  Stream Multiprocessors (SMs) -  fundamental thread block unit -  8 stream processors (SPs)

(scalar ALU for threads) -  2 super function units (SFUs)

(cos, sin, log, ...) -  8 32KB local register files (RFs) -  16 kB level 1 cache -  64 kB shared memory -  256 kB global level 2 cache

  Number of stream multiprocessors -  1 - Quadro NVS 130M -  16 - GeForce 8800 GTX -  30 - GeForce GTX 285 -  4x30 - Tesla S1070

Page 36: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

Memory Bandwidth for CPU and GPU

Marketed as GPGPUs

Page 37: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

STI (Sony, Toshiba, IBM) Cell   Motivation for the Cell -  Cheap processor -  Energy efficient -  For games and media processing -  Short time-to-market

  Conclusion -  Use a multi-core chip -  Design around an existing, power-

efficient design -  Add simple cores specific for game and

media processing requirements

Page 38: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

STI (Sony, Toshiba, IBM) Cell   Cell is a 9-core processor -  combining a light-weight general-

purpose processor with multiple co-processors into a coordinated whole

-  Power Processing Element (PPE) •  conventional Power processor •  not supposed to perform all

operations itself, acting like a controller

•  running conventional OSes •  16 KB instruction/data level 1 cache •  512 KB level 2 cache

Page 39: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

STI (Sony, Toshiba, IBM) Cell

-  Synergistic Processing Elements (SPE)

•  specialized co-processors for specific types of code, i.e., very high performance vector processors

•  local stores •  can do general purpose operations •  the PPE can start, stop, interrupt

and schedule processes running on an SPE

-  Element Interconnect Bus (EIB) •  internal communication bus •  connects on-chip system elements:

  PPE & SPEs   the memory controller (MIC)   two off-chip I/O interfaces

•  25.6 GBps each way

Page 40: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

STI (Sony, Toshiba, IBM) Cell - memory controller

•  Rambus XDRAM interface to Rambus XDR memory

•  dual channels at 12.8 GBps 25.6 GBps

- I/O controller •  Rambus FlexIO interface which

can be clocked independently

•  dual configurable channels

•  maximum ~ 76.8 GBps

Page 41: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

STI (Sony, Toshiba, IBM) Cell

-  Cell has in essence traded running everything at moderate speed for the ability to run certain types of code at high speed

- used for example in •  Sony PlayStation 3:

  3.2 GHz clock   6 SPEs for general operations   1 SPE for security for the OS

•  Toshiba home cinema:   decoding of 48 HDTV MPEG streams dozens of thumbnail videos simultaneously on screen

•  IBM blade centers:   3.2 GHz clock   Linux ≥ 2.6.11

Page 42: INF5063: Programming heterogeneous multi-core · PDF fileheterogeneous multi-core processors ... - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo

The End: Summary

  Heterogeneous multi-core processors are already everywhere

 Challenge: programming - Need to know the capabilities of the system - Different abilities in different cores - Memory bandwidth - Memory sharing efficiency - Need new methods to program the different

components

  Next time: how to start programming the Intel IXP