IBM cell broadband engine

Embed Size (px)

Citation preview

  • 8/8/2019 IBM cell broadband engine

    1/81

    Cell Broadband Engine

    Cell Broadband Engine - enabling densitycomputing for data-rich environments

    2006 IBM Corporation

    Cell BE enabling density computing

    for data rich environmentsMichael GschwindBruce DAmoraAlexandre Eichenberger

  • 8/8/2019 IBM cell broadband engine

    2/81

    Cell Broadband Engine

    2006 IBM Corporation2 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell History IBM, SCEI/Sony, Toshiba Alliance formed in 2000

    Design Center opened in March 2001

    Based in Austin, Texas

    Hardware designed in parallel with software

    February 7, 2005: First external technical disclosures

    August 15, 2005: First external architecture disclosures

    August 25, 2005: Cell Launch - Architecture released

  • 8/8/2019 IBM cell broadband engine

    3/81

    Cell Broadband Engine

    2006 IBM Corporation3 Cell Broadband Engine - enabling density computing for data-rich environments

    Acknowledgements

    Cell is the result of a partnership between SCEI/Sony,

    Toshiba, and IBM Cell represents the work of more than 400 people

    starting in 2000 and a design investment of about

    $400M

  • 8/8/2019 IBM cell broadband engine

    4/81

    Cell Broadband Engine

    2006 IBM Corporation4 Cell Broadband Engine - enabling density computing for data-rich environments

    More about Cell Broadband Engine

    http://www.research.ibm.com/cell

    Online resources

    SpecificationDocumentation

    Open Source and Proprietary Tools

    Operating SystemPlatform Simulator

  • 8/8/2019 IBM cell broadband engine

    5/81

    Cell Broadband Engine

    2006 IBM Corporation5 Cell Broadband Engine - enabling density computing for data-rich environments

    Agenda

    8:00 9:00 Motivation and Architecture

    9:00 9:30 Heterogeneous Application Model

    9:30 10:30 Compilation and Auto-Parallelization

    10:30 11:00 BREAK

    11:00 12:00 Programming Models & Applications

  • 8/8/2019 IBM cell broadband engine

    6/81

    Cell Broadband Engine

    Cell Broadband Engine - enabling densitycomputing for data-rich environments

    2006 IBM Corporation

    Cell Broadband Engine Architecture

    Michael Gschwind

  • 8/8/2019 IBM cell broadband engine

    7/81

    Cell Broadband Engine

    2006 IBM Corporation7 Cell Broadband Engine - enabling density computing for data-rich environments

    Computer Architecture at the turn of the millenium

    The end of architecture was being proclaimed

    Frequency scaling as performance driverState of the art microprocessors

    multiple instruction issueout of order architectureregister renamingdeep pipelines

    Little or no focus on compilers

    Questions about the need for compiler research Academic papers focused on microarchitecture tweaks

  • 8/8/2019 IBM cell broadband engine

    8/81

    Cell Broadband Engine

    2006 IBM Corporation8 Cell Broadband Engine - enabling density computing for data-rich environments

    The age of frequency scaling

    SCALING:

    Voltage: V/

    Oxide: tox/

    Wire width: W/

    Gate width: L/

    Diffusion: xd/

    Substrate: * NA

    RESULTS:

    Higher Density: ~2

    Higher Speed: ~

    Power/ckt: ~1/ 2

    Power Density:~Constant

    p substrate, doping *NA

    Scaled Device

    L/ xd/

    GATE

    n+source

    n+drain

    WIRINGVoltage, V /

    W/tox/

    Source: Dennard et al., JSSC 1974.

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    710131619222528313437

    Total FO4 Per Stage

    Performance

    bips

    deeper pipeline

  • 8/8/2019 IBM cell broadband engine

    9/81

    Cell Broadband Engine

    2006 IBM Corporation9 Cell Broadband Engine - enabling density computing for data-rich environments

    Frequency scaling

    A trusted standby

    Frequency as the tide that raises all boats

    Increased performance across all applications

    Kept the industry going for a decade

    Massive investment to keep going

    Equipment cost

    Power increase

    Manufacturing variability

    New materials

  • 8/8/2019 IBM cell broadband engine

    10/81

    Cell Broadband Engine

    2006 IBM Corporation10 Cell Broadband Engine - enabling density computing for data-rich environments

    but reality was closing in!

    0

    0.2

    0.4

    0.6

    0.8

    1

    710131619222528313437

    Total FO4 Pe r Stage

    RelativetoO

    ptimalFO4

    bips

    bips^3/W

    0.001

    0.01

    0.1

    1

    10

    100

    1000

    0.010.11

    Active PowerLeakage Power

  • 8/8/2019 IBM cell broadband engine

    11/81

    Cell Broadband Engine

    2006 IBM Corporation11 Cell Broadband Engine - enabling density computing for data-rich environments

    Power crisis

    0.1

    1

    10

    100

    1000

    0.010.11gate length Lgate (m)

    classic scaling

    Tox()

    Vdd

    Vt

    The power crisis is notnatural

    Created by deviating fromideal scaling theory

    Vdd and Vt not scaled by

    additional performance withincreased voltage

    Laws of physics

    Pchip = ntransistors * Ptransistor

    Marginal performance gainper transistor low

    significant power increase

    decreased power/performance

    efficiency

  • 8/8/2019 IBM cell broadband engine

    12/81

    Cell Broadband Engine

    2006 IBM Corporation12 Cell Broadband Engine - enabling density computing for data-rich environments

    The power inefficiency of deep pipelining

    0

    0.2

    0.4

    0.6

    0.8

    1

    710131619222528313437

    Total FO4 Per Stage

    Relative

    toOptimalF

    O4

    bips

    bips^3/W

    Deep pipelining increases number of latches and switching rate

    power increases with at least f2

    Latch insertion delay limits gains of pipelining

    Power-performance optimal Performance optimal

    Source: Srinivasan et al., MICRO 2002

    deeper pipeline

  • 8/8/2019 IBM cell broadband engine

    13/81

    Cell Broadband Engine

    2006 IBM Corporation13 Cell Broadband Engine - enabling density computing for data-rich environments

    Hitting the memory wall

    Latency gap

    Memory speedup lags behindprocessor speedup

    limits ILP

    Chip I/O bandwidth gap

    Less bandwidth per MIPS

    Latency gap as applicationbandwidth gap

    Typically (much) less than

    chip I/O bandwidth

    -60

    -40

    -20

    0

    20

    40

    60

    80

    100

    1990 2003

    normalized

    MFLOPS Memory latency

    Source: McKee, Computing Frontiers 2004

    Source: Burger et al., ISCA 1996

    usable bandwidth avg. request size

    roundtrip latencyno. of inflight

    requests

  • 8/8/2019 IBM cell broadband engine

    14/81

    Cell Broadband Engine

    2006 IBM Corporation14 Cell Broadband Engine - enabling density computing for data-rich environments

    Our Y2K challenge

    10 performance of desktop systems

    1 TeraFlop / second with a four-node configuration

    1 Byte bandwidth per 1 Flop

    golden rule for balanced supercomputer design

    scalable design across a range of design points

    mass-produced and low cost

  • 8/8/2019 IBM cell broadband engine

    15/81

    Cell Broadband Engine

    2006 IBM Corporation15 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell Design Goals

    Provide the platform for the future of computing

    10 performance of desktop systems shipping in 2005

    Computing density as main challenge

    Dramatically increase performance per X X = Area, Power, Volume, Cost,

    Single core designs offer diminishing returns on

    investmentIn power, area, design complexity and verification cost

  • 8/8/2019 IBM cell broadband engine

    16/81

    Cell Broadband Engine

    2006 IBM Corporation16 Cell Broadband Engine - enabling density computing for data-rich environments

    Necessity as the mother of invention Increase power-performance efficiency

    Simple designs are more efficient in terms of power and area

    Increase memory subsystem efficiency

    Increasing data transaction size

    Increase number of concurrently outstanding transactions

    Exploit larger fraction of chip I/O bandwidth

    Use CMOS density scaling

    Exploit density instead of frequency scaling to deliver increasedaggregate performance

    Use compilers to extract parallelism from application

    Exploit application parallelism to translate aggregate

    performance to application performance

  • 8/8/2019 IBM cell broadband engine

    17/81

    Cell Broadband Engine

    2006 IBM Corporation17 Cell Broadband Engine - enabling density computing for data-rich environments

    Exploit application-level parallelism

    Data-level parallelism

    SIMD parallelism improves performance with little overhead

    Thread-level parallelism

    Exploit application threads with multi-core design approach

    Memory-level parallelism

    Improve memory access efficiency by increasing number ofparallel memory transactions

    Compute-transfer parallelism

    Transfer data in parallel to execution by exploiting applicationknowledge

  • 8/8/2019 IBM cell broadband engine

    18/81

    Cell Broadband Engine

    2006 IBM Corporation18 Cell Broadband Engine - enabling density computing for data-rich environments

    Concept phase ideas

    ChipMultiprocessor

    efficient use ofmemory interface

    large architected

    register set

    high frequencyand low power!

    reverse Vddscaling for low

    power

    modular designdesign reuse

    control cost of

    coherency

  • 8/8/2019 IBM cell broadband engine

    19/81

    Cell Broadband Engine

    2006 IBM Corporation19 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell Architecture

    Heterogeneous multi-core system

    architecture Power Processor

    Element for control tasks

    Synergistic ProcessorElements for data-

    intensive processing

    Synergistic ProcessorElement (SPE)

    consists of Synergistic Processor

    Unit (SPU)

    Synergistic Memory Flow

    Control (SMF)

    16B/cycle (2x)16B/cycle

    BIC

    FlexIOTM

    MIC

    DualXDRTM

    16B/cycle

    EIB (up to 96B/cycle)

    16B/cycle

    64-bit Power Architecture with VMX

    PPE

    SPE

    LS

    SXUSPU

    SMF

    PXUL1

    PPU

    16B/cycle

    L232B/cycle

    LS

    SXUSPU

    SMF

    LS

    SXUSPU

    SMF

    LS

    SXUSPU

    SMF

    LS

    SXUSPU

    SMF

    LS

    SXUSPU

    SMF

    LS

    SXUSPU

    SMF

    LS

    SXUSPU

    SMF

  • 8/8/2019 IBM cell broadband engine

    20/81

    Cell Broadband Engine

    2006 IBM Corporation20 Cell Broadband Engine - enabling density computing for data-rich environments

    Shifting the Balance of Power with Cell Broadband Engine

    Data processor instead of control system

    Control-centric code stable over timeBig growth in data processing needs

    Modeling

    Games

    Digital media

    Scientific applications

    Todays architectures are built on a 40 year old data modelEfficiency as defined in 1964

    Big overhead per data operation

    Parallelism added as an after-thought

  • 8/8/2019 IBM cell broadband engine

    21/81

    Cell Broadband Engine

    2006 IBM Corporation21 Cell Broadband Engine - enabling density computing for data-rich environments

    Powering Cell the Synergistic Processor Unit

    VRFVRF

    PERMPERM LSULSUVV FF PP UU VV FF XX UU

    Local StoreLocal Store

    Single PortSingle Port

    SRAMSRAMIssue /Issue /BranchBranch

    Fetch

    ILB

    16B x 2

    2 instructions

    16B x 3 x 2

    64B64B

    16B16B

    128B128B

    Source: Gschwind et al., Hot Chips 17, 2005

    SMFSMF

    C ll B db d E i

  • 8/8/2019 IBM cell broadband engine

    22/81

    Cell Broadband Engine

    2006 IBM Corporation22 Cell Broadband Engine - enabling density computing for data-rich environments

    Density Computing in SPEs

    Today, execution units only fraction of core area and power

    Bigger fraction goes to other functions Address translation and privilege levels Instruction reordering Register renaming Cache hierarchy

    Cell changes this ratio to increase performance per area andpower

    Architectural focus on data processing

    Wide datapaths More and wide architectural registers Data privatization and single level processor-local store All code executes in a single (user) privilege level Static scheduling

    C ll B db d E i

  • 8/8/2019 IBM cell broadband engine

    23/81

    Cell Broadband Engine

    2006 IBM Corporation23 Cell Broadband Engine - enabling density computing for data-rich environments

    Streamlined Architecture Architectural focus on simplicity

    Aids achievable operating frequency

    Optimize circuits for common performance case

    Compiler aids in layering traditional hardware functions

    Leverage 20 years of architecture research

    Focus on statically scheduled data parallelism

    Focus on data parallel instructions No separate scalar execution units Scalar operations mapped onto data parallel dataflow

    Exploit wide data paths

    Data processing Instruction fetch

    Address impediments to static scheduling

    Large register set Reduce latencies by eliminating non-essential functionality

    C ll B db d E i

  • 8/8/2019 IBM cell broadband engine

    24/81

    Cell Broadband Engine

    2006 IBM Corporation24 Cell Broadband Engine - enabling density computing for data-rich environments

    SPE Highlights RISC organization

    32 bit fixed instructions

    Load/store architecture

    Unified register file User-mode architecture

    No page translation within SPU

    SIMD dataflow Broad set of operations (8, 16, 32, 64 Bit) Graphics SP-Float

    IEEE DP-Float

    256KB Local Store

    Combined I & D DMA block transfer

    using Power Architecture memorytranslation

    LS

    LS

    LS

    LS

    GPR

    FXU ODD

    FXU EVN

    SFPDP

    CONT

    ROL

    CHANNEL

    DMA SMMATO

    SBIRTB

    BEB

    FWD

    14.5mm2 (90nm SOI)Source: Kahle, Spring Processor Forum 2005

    C ll B db d E i

  • 8/8/2019 IBM cell broadband engine

    25/81

    Cell Broadband Engine

    2006 IBM Corporation25 Cell Broadband Engine - enabling density computing for data-rich environments

    dataparallelselect

    staticprediction

    simplerarch

    sequentialfetch

    localstore

    high

    frequency

    shorterpipeline

    large

    basicblocks

    determ.latency

    largeregisterfile

    instructionbundling

    DLP

    widedata

    paths

    computedensity

    Synergistic Processing

    single

    port

    staticscheduling

    scalarlayering

    ILPopt.

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    26/81

    Cell Broadband Engine

    2006 IBM Corporation26 Cell Broadband Engine - enabling density computing for data-rich environments

    Efficient data-sharing between scalar and SIMD processing

    Legacy architectures separate scalar and SIMDprocessing

    Data sharing between SIMD and scalar processing unitsexpensive

    Transfer penalty between register files

    Defeats data-parallel performance improvement in manyscenarios

    Unified register file facilitates data sharing for

    efficient exploitation of data parallelismAllow exploitation of data parallelism without data transfer

    penalty

    Data-parallelism alwaysan improvement

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    27/81

    Cell Broadband Engine

    2006 IBM Corporation27 Cell Broadband Engine - enabling density computing for data-rich environments

    Compiling for the Cell Broadband Engine

    The lesson of RISC computing

    Architecture provides fast, streamlined primitives to compilerCompiler uses primitives to implement higher-level idioms

    If the compiler cant target it do not include in architecture

    Compiler focus throughout projectPrototype compiler soon after first proposal

    Cell compiler team has made significant advances in

    Automatic SIMD code generation Automatic parallelization

    Data privatization

    Programmability

    Programmer

    Productivity

    Raw Hardware

    Performance

    Cell

    Design

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    28/81

    Cell Broadband Engine

    2006 IBM Corporation28 Cell Broadband Engine - enabling density computing for data-rich environments

    SPU PIPELINE FRONT END

    SPU PIPELINE BACK END

    Branch Instruction

    Load/Store Instruction

    IF Instruction Fetch

    IB Instruction Buffer

    ID Instruction Decode

    IS Instruction Issue

    RF Register File Access

    EX ExecutionWB Write Back

    Floating Point Instruction

    Permute Instruction

    EX1 EX2 EX3 EX4

    EX1 EX2 EX3 EX4 EX5 EX6

    EX1 EX2 WB

    WB

    RF1 RF2

    WB

    Fixed Point Instruction

    EX1 EX2 EX3 EX4 EX5 EX6 WB

    IF1 IF2 IF3 IF4 IF5 IB1 IB2 ID1 ID2 ID3 IS1 IS2

    SPU Pipeline

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    29/81

    Cell Broadband Engine

    2006 IBM Corporation29 Cell Broadband Engine - enabling density computing for data-rich environments

    Permute Unit

    Load-Store Unit

    Floating-Point Unit

    Fixed-Point UnitBranch Unit

    Channel Unit

    Result Forwarding and Staging

    Register File

    Local Store(256kB)

    Single Port SRAM

    128B Read 128B Write

    DMA Unit

    Instruction Issue Unit / Instruction Line Buffer

    8 Byte/Cycle 16 Byte/Cycle 128 Byte/Cycle64 Byte/Cycle

    On-Chip Coherent Bus

    SPE Block Diagram

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    30/81

    Cell Broadband Engine

    2006 IBM Corporation30 Cell Broadband Engine - enabling density computing for data-rich environments

    SPU Communication: Channel Architecture

    SPU uses channels to communicate with

    environment (incl. SMF)Access to special purpose registers

    Processor status

    Communication channels

    Requests to SMF

    SMF status

    Mailboxes

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    31/81

    Cell Broadband Engine

    2006 IBM Corporation31 Cell Broadband Engine - enabling density computing for data-rich environments

    SPU Channel Features

    Communication using channels numbered 0 - 127

    Implementation dependent

    Unidirectional

    Have capacity Burst without SPU execution stop if capacity available

    Channel operations are blocking

    On write if full On read if empty

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    32/81

    Cell Broadband Engine

    2006 IBM Corporation32 Cell Broadband Engine - enabling density computing for data-rich environments

    SPU Channel Access

    rdch RT, ca RT

  • 8/8/2019 IBM cell broadband engine

    33/81

    Cell Broadband Engine

    2006 IBM Corporation33 Cell Broadband Engine - enabling density computing for data-rich environments

    Synergistic Memory Flow Control

    Data Bus

    Snoop BusControl BusTranslate Ld/StMMIO

    DMA Engine DMAQueue

    RMTMMUAtomicFacility

    Bus I/F Control

    LS

    SXU

    SPU

    MMIO

    SPE

    SMF

    SMF implements memorymanagement and mapping

    SMF operates in parallel to SPU

    Independent compute and transferCommand interface from SPU

    DMA queue decouples SMF & SPU

    MMIO-interface for remote nodes

    Block transfer between systemmemory and local store

    SPE programs reference systemmemory using user-leveleffective address space

    Ease of data sharing

    Local store to local store transfers

    Protection

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    34/81

    Cell Broadband Engine

    2006 IBM Corporation34 Cell Broadband Engine - enabling density computing for data-rich environments

    Synergistic Memory Flow Control

    Data Bus

    Snoop BusControl BusTranslate Ld/StMMIO

    DMA Engine DMAQueue

    RMTMMUAtomicFacility

    Bus I/F Control

    LS

    SXU

    SPU

    MMIO

    SPE

    SMF

    SMF implements memorymanagement and mapping DMA for data transfer

    MMU for page translation AF for coherent data

    BIF for access to Element InterconnectBus

    RMT for resource management

    SMF operates in parallel to SPU

    Independent compute and transfer

    Channel command interface fromSPU DMA queue decouples SMF SPU

    SPU can synchronize with SMF

    MMIO-based interface for remote nodes

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    35/81

    Cell Broadband Engine

    2006 IBM Corporation35 Cell Broadband Engine - enabling density computing for data-rich environments

    System-wide Virtual Memory Architecture

    Data Bus

    Snoop BusControl BusTranslate Ld/StMMIO

    DMA Engine DMAQueue

    RMTMMUAtomicFacility

    Bus I/F Control

    LS

    SXU

    SPU

    MMIO

    SPE

    SMF

    SMF MMU follows PowerArchitecture Virtual MemoryarchitectureTwo-level translation

    Segmentation and paging

    PPE and SPEs share memory map

    SPE programs reference systemmemory using user-level effectiveaddress space

    Ease of data sharing

    Local store to local store transfers

    Protection

    Exceptions delivered to PPESLB miss

    Page fault

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    36/81

    Cell Broadband Engine

    2006 IBM Corporation36 Cell Broadband Engine - enabling density computing for data-rich environments

    Data Transfer with SMF DMAC

    Data Bus

    Snoop BusControl BusTranslate Ld/StMMIO

    DMA Engine DMAQueue

    RMTMMUAtomicFacility

    Bus I/F Control

    LS

    SXU

    SPU

    MMIO

    SPE

    SMF

    DMA Unit implements block datatransfer transfer specifies system memory and

    local store address

    system memory address is effectiveaddress

    translated to physical address by SMFMMU

    variable block size from 1B to 16KB

    DMA transfers LS system memory

    LS LS

    LS I/O Transfers

    DMA list command SMF program transfers data in parallel

    to computation

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    37/81

    Ce oadba d g e

    2006 IBM Corporation37 Cell Broadband Engine - enabling density computing for data-rich environments

    Synergistic Memory Flow Control

    Bus Interface Up to 16 outstanding DMA

    requests Requests up to 16KByte

    Token-based Bus AccessManagement

    Source: Clark et al., Hot Chips 17, 2005

    Element Interconnect Bus Four 16 byte data rings

    Multiple concurrent transfers

    96B/cycle peak bandwidth

    Over 100 outstanding requests

    200+ GByte/s @ 3.2+ GHz

    16B 16B 16B 16B

    Data Arb

    16B 16B 16B 16B

    16B 16B16B 16B16B 16B16B 16B

    16B

    16B16B

    16B

    16B

    16B16B

    16B

    SPE0 SPE2 SPE4 SPE6

    SPE7SPE5SPE3SPE1

    MIC

    PPE

    BIF/IOIF0

    IOIF1

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    38/81

    g

    2006 IBM Corporation38 Cell Broadband Engine - enabling density computing for data-rich environments

    Memory efficiency as key to application performance

    Greatest challenge in translating peak into app performance

    Peak Flops useless without way to feed data

    Cache miss provides too little data too late

    Inefficient for streaming / bulk data processing

    Initiates transfer when transfer results are already needed

    Application-controlled data fetch avoids not-on-time data delivery

    SMF is a better way to look at an old problem

    Fetch data blocks based on algorithm

    Blocks reflect application data structures

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    39/81

    g

    2006 IBM Corporation39 Cell Broadband Engine - enabling density computing for data-rich environments

    Traditional memory usage

    Long latency memory access operations exposed

    Cannot overlap 500+ cycles with computation Memory access latency severely impacts application performance

    computation mem protocol mem idle mem contention

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    40/81

    g

    2006 IBM Corporation40 Cell Broadband Engine - enabling density computing for data-rich environments

    Exploiting memory-level parallelism Reduce performance impact of memory accesses with concurrent

    access

    Carefully scheduled memory accesses in numeric code Out-of-order execution increases chance to discover more

    concurrent accesses

    Overlapping 500 cycle latency with computation using OoO illusory

    Bandwidth limited by queue size and roundtrip latency

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    41/81

    g

    2006 IBM Corporation41 Cell Broadband Engine - enabling density computing for data-rich environments

    Exploiting MLP with Chip Multiprocessing More threads more memory level parallelism

    overlap accesses from multiple cores

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    42/81

    2006 IBM Corporation42 Cell Broadband Engine - enabling density computing for data-rich environments

    A new form of parallelism: CTP Compute-transfer parallelism

    Concurrent execution of compute and transfer increasesefficiency

    Avoid costly and unnecessary serialization

    Application thread has two threads of control

    SPU computation thread

    SMF transfer thread

    Optimize memory access and data transfer at the

    application levelExploit programmer and application knowledge about

    data access patterns

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    43/81

    2006 IBM Corporation43 Cell Broadband Engine - enabling density computing for data-rich environments

    Memory access in Cell with MLP and CTP Super-linear performance gains observed

    decouple data fetch and use

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    44/81

    2006 IBM Corporation44 Cell Broadband Engine - enabling density computing for data-rich environments

    multicore

    manyoutstanding

    transfers

    sequentialfetch

    highthroughput

    timelydata

    delivery

    shared

    I & D

    eliminatecache miss

    logic

    parallelcompute &

    transfer

    improvedcode gen

    decouplefetch andaccess

    explicitdatatransfer

    storagedensity

    Synergistic Memory Management

    single

    port

    applicationcontrol

    reducedcoherence

    cost

    lowlatencyaccess

    blockfetch

    deeperfetchqueue

    localstore

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    45/81

    2006 IBM Corporation45 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell BE Processor Components

    Power Core(PPE)

    L2 Cache

    NCU

    N

    N

    In the Beginning

    the Power Architecture Processor

    Custom Designed for high frequency, area

    and power efficiency

    Power Processor Element (PPE)Industry-standard 64-bit IBM PowerArchitecture processor

    PowerPC AS 2.0.2

    2-Way Hardware MultithreadedL1 : 32KB I ; 32KB D

    L2 : 512KB

    Coherent load/store

    VMX

    3.2+ GHz

    Realtime Control

    Locking L2 Cache & TLB

    Bandwidth Reservation

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    46/81

    2006 IBM Corporation46 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell BE Processor Components

    Element Interconnect Busdata ring for internal communication

    Four 16 byte data rings,supporting multiple transfers

    96B/cycle peak bandwidth

    Over 100 outstanding requests

    200+ GByte/s @ 3.2+ GHz

    96 Byte/Cycle 200+GB/sec @ 3.2+GHz

    Element Interconnect Bus

    Power Core(PPE)

    L2 Cache

    NCU

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    47/81

    2006 IBM Corporation47 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell BE Processor Components

    SPE provides computationalperformance

    Dual issue 32-bit SIMD architecture

    Dedicated resources

    128-entry 128-bit VRF256KB Local Store

    Each SMF can be dynamicallyconfigured to protect resources

    Dedicated DMA engineUp to 16 outstanding requests

    96 Byte/Cycle 200+GB/sec @ 3.2+GHz

    Element Interconnect Bus

    Power Core(PPE)

    L2 Cache

    NCU

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    48/81

    2006 IBM Corporation48 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell BE Processor Components

    SMF provides memory management& mapping

    SPE Local Store aliased into systemmemory map

    SMF controls SPE DMA accessesImplements page translation and

    protection

    Using Power Architecture systemmemory map

    System memory map compatiblewith Power Architecture VirtualMemory architecture

    S/W controllable from PPE MMIO

    DMA 1,2,4,8,16,128 Byte

    16Kbytetransfers for memory and I/O access

    96 Byte/Cycle 200+GB/sec @ 3.2+GHz

    Element Interconnect Bus

    Power Core(PPE)

    L2 Cache

    NCU

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    49/81

    2006 IBM Corporation49 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell BE Processor Components

    I/O provides high bandwidthDual XDR controller

    25.6GB/s @ 3.2Gbps

    Two configurable interfaces

    76.8GB/s @ 6.4Gbps

    Configurable number of Bytes

    Coherent or I/O ModeInterconnect

    Supports multiple systemconfigurations

    25 GB / sDRAM

    MIC

    IOIF1

    IOIF0

    Southbridge

    I/O

    20 GB / sBIF or IOIF1

    5 GB / s

    96 Byte/Cycle 200+GB/sec @ 3.2+GHz

    Element Interconnect Bus

    Power Core(PPE)

    L2 Cache

    NCU

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    MIC

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    50/81

    2006 IBM Corporation50 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell BE Processor Components

    IIC Internal Interrupt ControllerHandles SPE Interrupts

    Handles External Interrupts

    From Coherent Interconnect

    From IOIF0 or IOIF1

    Interrupt Priority Level Control

    Duplicated for each PPEhardware thread

    25 GB / sDRAM

    MIC

    IOIF1

    IOIF0

    Southbridge

    I/O

    20 GB / sBIF or IOIF1

    5 GB / s

    96 Byte/Cycle 200+GB/sec @ 3.2+GHz

    Element Interconnect Bus

    Power Core(PPE)

    L2 Cache

    NCU

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    MIC

    IIC

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    51/81

    2006 IBM Corporation51 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell BE Processor Components

    IOT implements I/O Bus MasterTranslation

    Translates bus address to systemaddress

    Two Level translationI/O Segments: 256 MB

    I/O Pages: 4KB, 64KB,1MB, 16MB

    I/O Device Identifier / page for LPAR

    IOST and IOPT Cachehardware/software managed

    25 GB / sDRAM

    MIC

    IOIF1

    IOIF0

    Southbridge

    I/O

    20 GB / sBIF or IOIF1

    5 GB / s

    96 Byte/Cycle 200+GB/sec @ 3.2+GHz

    Element Interconnect Bus

    Power Core(PPE)

    L2 Cache

    NCU

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    MIC

    IIC IOT

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    52/81

    2006 IBM Corporation52 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell BE Processor ComponentsToken Manager provides Bandwidth

    Reservation for shared resourcesOptionally used for RT tasks or LPAR

    Multiple Resource Allocation Groups

    Generates access tokens at configurablerate for each allocation group

    1 per each memory bank (16 total)

    2 for each IOIF (4 total)

    Requestors assigned RAG ID byOS/hypervisor

    Each SPEPPE L2 / NCU

    IOIF 0 Bus Master

    IOIF 1 Bus Master

    Priority order for using another RAGs

    unused tokensResource overcommit warning interrupt

    25 GB / sDRAM

    MIC

    IOIF1

    IOIF0

    Southbridge

    I/O

    20 GB / sBIF or IOIF1

    5 GB / s

    96 Byte/Cycle 200+GB/sec @ 3.2+GHz

    Element Interconnect Bus

    Power Core(PPE)

    L2 Cache

    NCU

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    L

    ocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    LocalStore

    SPU

    MFC

    N

    AUC

    MIC

    IIC IOTTKM

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    53/81

    2006 IBM Corporation53 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell BE Implementation Characteristics

    241M transistors

    235mm2

    Design operates across widefrequency range

    Optimize for power & yield

    > 200 GFlops (SP) @3.2GHz

    > 20 GFlops (DP) @3.2GHz

    Up to 25.6 GB/s memory bandwidth

    Up to 75 GB/s I/O bandwidth

    100+ simultaneous bustransactions

    16+8 entry DMA queue per SPE

    Relative

    Frequency Increase vs. Power Consumption

    Voltage

    Source: Kahle, Spring Processor Forum 2005

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    54/81

    2006 IBM Corporation54 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    55/81

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    56/81

    2006 IBM Corporation56 Cell Broadband Engine - enabling density computing for data-rich environments

    Compiling and linking an integrated executable

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    57/81

    2006 IBM Corporation57 Cell Broadband Engine - enabling density computing for data-rich environments

    Loading and execution of a program

    .text:pu_main:

    spu_create_thread(0, spu0, spu0_main);spu_create_thread(1, spu1, spu1_main);

    printf:

    spu0:

    spu1:

    .data:

    EIB

    PPE

    PXUL1

    PPU

    16B/cycle

    L232B/cycle

    SPE

    .text:

    spu0_main:

    .data:

    SXU

    SPU

    SMF

    LS

    SXU

    SPU

    SMF

    LS

    SXU

    SPU

    SMF

    LS

    SXU

    SPU

    SMF

    LS

    SXU

    SPU

    SMF

    LS

    SXUSPU

    SMF

    LS

    SXU

    SPU

    SMF

    LS

    SXU

    SPU

    SMF

    LS

    SXU

    SPU

    SMF

    .text:

    spu0_main:

    .data:

    .text:

    spu1_main:

    .data:

    System memory

    PPE image loadsand executes;

    PPE initiates

    SMF transfer; SMF data transfer start SPU at

    specified address;

    SMF starts SPUexecution

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    58/81

    2006 IBM Corporation58 Cell Broadband Engine - enabling density computing for data-rich environments

    SPE thread creation

    PPE transfers mini-loader and parameters

    256b program

    Mini-loader transfers thread memory image

    Embedded in Power Architecture executable

    Provided to spe_create_thread() call

    SPE side program load more efficient

    SPE has access more queue entriesParallel loading on 8 SPEs

    Direct channel access vs. MMIO access

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    59/81

    2006 IBM Corporation59 Cell Broadband Engine - enabling density computing for data-rich environments

    Application Source& Libraries

    PPE object files SPE object files

    Physical SPEs

    Cell Broadband Engine-aware OS (Linux)

    SPE Virtualization / Scheduling Layer

    SPE threadsPPE threads

    Physical PPE

    PPE

    T1 T2 SPESPE SPE SPE

    SPESPE SPE SPE

    Heterogeneous Multi-threading and OS

    management Heterogeneous Multi-

    Threading Model

    PPE ThreadsSPE Threads

    SPE DMA EA = PPEProcess EA Space

    OS supports Create &Destroy SPE tasks

    Atomic Update Primitivesused for Mutex

    SPE Context Fully Managed

    OS assignment of SPEthreads

    Programmer directed using

    affinity mask

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    60/81

    2006 IBM Corporation60 Cell Broadband Engine - enabling density computing for data-rich environments

    Linux on Cell BE All software in STIDC written on Linux OS

    Started with Linux 2.4 PPC64 on Cell Simulator

    SPEs exposed as I/O Devices (function offload model)

    SPE DMA required pre-pinned memory

    Inflexible programming model

    Moved to 2.6.3

    Added heterogenous thread model via system call moved to SPUFS in 2.6.12

    SPE thread API created (similar to pthreads library)

    User mode direct and indirect SPE access models

    Full pre-emptive SPE context management

    spe_ptrace() added for gdb support

    spe_schedule() for thread to physical SPE assignment

    currently FIFO run to completion

    SPE threads share address space with parent PPE process (through DMA)

    Demand paging for SPE accesses

    Shared hardware page table with PPE

    SPE Error, Event and Signal handling directed to parent PPE thread SPE elf objects wrapped into PPE shared objects with extended gld

    SPE-side mini-loader

    madvise() extended for L2 cache and TLB locking/preloading (realtime feature)

    All patches for Cell in architecture dependent layer (subtree of PPC64)

    Except for a few shameless hacks - being removed in 2.6.12 Publishing Initial Cell BE Patches for 2.6.12

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    61/81

    2006 IBM Corporation61 Cell Broadband Engine - enabling density computing for data-rich environments

    SPE data transfer (from SPE or PPE) Embedded in program

    At compile time

    At runtime Direct store to SPU LS

    Using memory map alias when SPU LS mapped into memory map

    Mailbox

    Channel in Synergistic Processor Architecture MMIO in IBM Power Architecture core

    Externally initiated transfer

    Using SMF block transfer capabilities

    From PPE or remote SPE

    SPU-initiated transfer

    Based on an address provided using one of these four methods

    Based on address computed from data obtained by these five methods

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    62/81

    2006 IBM Corporation62 Cell Broadband Engine - enabling density computing for data-rich environments

    Data sharing between PPE and SPE

    SPE addresses system memory via SMF copy-in/copy-out using effective address

    System memory pointers can be shared between PPEand SPE

    PPE can access SPE local store using SMF orusing memory accesses

    PPE enqueues SMF requests via memory mapped I/O

    Aliasing of SPE LS gives PPE addressability as systemaddress

    Add LS base address of local store to use in PPE

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    63/81

    2006 IBM Corporation63 Cell Broadband Engine - enabling density computing for data-rich environments

    Access to data using common effective addressesppe_sum_all(float *a){for (i=0; i

  • 8/8/2019 IBM cell broadband engine

    64/81

    2006 IBM Corporation64 Cell Broadband Engine - enabling density computing for data-rich environments

    Access to data using common effective addressesppe_sum_all(float *a){for (i=0; i

  • 8/8/2019 IBM cell broadband engine

    65/81

    2006 IBM Corporation65 Cell Broadband Engine - enabling density computing for data-rich environments

    Access to data using common effective addressesppe_sum_all(float *a){for (i=0; i

  • 8/8/2019 IBM cell broadband engine

    66/81

    2006 IBM Corporation66 Cell Broadband Engine - enabling density computing for data-rich environments

    Access to data using common effective addressesppe_sum_all(float *a){for (i=0; i

  • 8/8/2019 IBM cell broadband engine

    67/81

    2006 IBM Corporation67 Cell Broadband Engine - enabling density computing for data-rich environments

    Access to data using common effective addressesppe_sum_all(float *a){for (i=0; i

  • 8/8/2019 IBM cell broadband engine

    68/81

    2006 IBM Corporation68 Cell Broadband Engine - enabling density computing for data-rich environments

    Access to data using common effective addressesppe_sum_all(float *a){for (i=0; i

  • 8/8/2019 IBM cell broadband engine

    69/81

    2006 IBM Corporation69 Cell Broadband Engine - enabling density computing for data-rich environments

    Mailbox communication

    SPE

    unsigned int mbox = spu_read_in_mbox();

    spu_write_out_mbox(mbox);

    PPE

    while (spe_stat_in_mbox(speid) == 0);spe_write_in_mbox(speid,data);

    unsigned int rmbox = spe_read_out_mbox(speid);

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    70/81

    2006 IBM Corporation70 Cell Broadband Engine - enabling density computing for data-rich environments

    Atomic updates and semaphores

    Atomic updates

    atomic_set((atomic_ea_t)(ptrAtomicData),0xffffffff);atomic_set, atomic_add, atomic_sub,atomic_dec_and_test,...

    Mutex lock/unlock

    mutex_lock (cond_mutex_ea);

    cond_signal (cond_ea);

    mutex_unlock (cond_mutex_ea);

  • 8/8/2019 IBM cell broadband engine

    71/81

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    72/81

    2006 IBM Corporation72 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell Real Address Memory Map

    Local storage of each SPE aliased in system memory map Direct (uncacheable) access by PPE

    Used for LS LS transfer Access control via page table

    QoS memory is pinned system memory bandwidth and latency guarantee

    managed by O/S

    I/O devices external to BE defined by system and I/O architecture

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    73/81

    2006 IBM Corporation73 Cell Broadband Engine - enabling density computing for data-rich environments

    System management of Cell BE resources

    Cell BE implements full set of Power Architecture

    virtualization and dynamic partitioningSupport of partition configuration state

    Logical Partition ID etc.

    Full state management by PPEAccess via memory mapped I/O registers

    Grouped by privilege level

    Access control to MMIO facilities controlled by page accesscontrol

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    74/81

    2006 IBM Corporation74 Cell Broadband Engine - enabling density computing for data-rich environments

    Per SPE Resources (PPE Side)

    Problem State

    8 Entry MFC Command Queue Interface

    DMA Command and Queue Status

    DMA Tag Status Query Mask

    DMA Tag Status

    32 bit Mailbox Status and Data from SPU

    32 bit Mailbox Status and Data to SPU

    4 deep FIFO

    Signal Notification 1

    Signal Notification 2

    SPU Run Control

    SPU Next Program Counter

    SPU Execution Status

    4K Physical Page Boundary

    4K Physical Page Boundary

    Optionally Mapped 256K Local Store

    Privileged 1 State (OS)

    4K Physical Page Boundary

    SPU Privileged Control

    SPU Channel Counter Initialize

    SPU Channel Data Initialize

    SPU Signal Notification Control

    SPU Decrementer Status & Control

    MFC DMA Control

    MFC Context Save / Restore Registers

    SLB Management Registers

    Privileged 2 State(OS or Hypervisor)

    4K Physical Page Boundary

    SPU Master Run Control

    SPU ID

    SPU ECC Control

    SPU ECC Status

    SPU ECC AddressSPU 32 bit PU Interrupt Mailbox

    MFC Interrupt Mask

    MFC Interrupt Status

    MFC DMA Privileged Control

    MFC Command Error Register

    MFC Command Translation Fault Register

    MFC SDR (PT Anchor)MFC ACCR (Address Compare)

    MFC DSSR (DSI Status)

    MFC DAR (DSI Address)

    MFC LPID (logical partition ID)

    MFC TLB Management Registers

    Optionally Mapped 256K Local Store

    4K Physical Page Boundary

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    75/81

    2006 IBM Corporation75 Cell Broadband Engine - enabling density computing for data-rich environments

    Per SPE Resources (SPU Side)

    128 - 128 bit GPRs

    External Event Status (Channel 0)

    Decrementer Event

    Tag Status Update Event

    DMA Queue Vacancy Event

    SPU Incoming Mailbox Event

    Signal 1 Notification Event

    Signal 2 Notification Event

    Reservation Lost Event

    External Event Mask (Channel 1)

    External Event Acknowledgement (Channel 2)

    Signal Notification 1 (Channel 3)

    Signal Notificaiton 2 (Channel 4)

    Set Decrementer Count (Channel 7)

    Read Decrementer Count (Channel 8)

    16 Entry MFC Command Queue Interface (Channels 16-21)

    DMA Tag Group Query Mask (Channel 22)

    Request Tag Status Update (Channel 23)

    Immediate

    Conditional - ALL

    Conditional - ANY

    Read DMA Tag Group Status (Channel 24)

    DMA List Stall and Notify Tag Status (Channel 25)

    DMA List Stall and Notify Tag Acknowledgement (Channel 26)

    Lock Line Command Status (Channel 27)

    Outgoing Mailbox to PU (Channel 28)

    Incoming Mailbox from PU (Channel 29)Outgoing Interrupt Mailbox to PU (Channel 30)

    SPU Direct Access Resources SPU Indirect Access Resources

    (via EA Addressed DMA)

    System Memory

    Memory Mapped I/OThis SPU Local Store

    Other SPU Local Store

    Other SPU Signal Registers

    Atomic Update (Cacheable Memory)

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    76/81

    2006 IBM Corporation76 Cell Broadband Engine - enabling density computing for data-rich environments

    Memory Flow Controller Commands

    LSA - Local Store Address (32 bit)

    EA - Effective Address (32 or 64 bit)

    TS - Transfer Size (16 bytes to 16K bytes)LS - DMA List Size (8 bytes to 16 K bytes)

    TG - Tag Group(5 bit)

    CL - Cache Management / Bandwidth Class

    DMA Commands

    Command Parameters

    Put - Transfer from Local Store to EA space

    Puts - Transfer and Start SPU execution

    Putr- Put Result - (Arch. Scarf into L2)

    Putl - Put using DMA List in Local Store

    Putrl - Put Result using DMA List in LS (Arch)

    Get - Transfer from EA Space to Local StoreGets - Transfer and Start SPU execution

    Getl - Get using DMA List in Local Store

    Sndsig - Send Signal to SPU

    Command Modifiers:

    f: Embedded Tag Specific Fence

    Command will not start until all previous commandsin same tag group have completed

    b: Embedded Tag Specific Barrier

    Command and all subsiquent commands in same

    tag group will not start until previous commands in same

    tag group have completed

    SL1 Cache Management Commands

    sdcrt - Data cache region touch (DMA Get hint)

    sdcrtst - Data cache region touch for store (DMA Put hint)

    sdcrz - Data cache region zero

    sdcrs - Data cache region storesdcrf - Data cache region flush

    Synchronization Commands

    Lockline (Atomic Update) Commands:

    getllar- DMA 128 bytes from EA to LS and set Reservation

    putllc - Conditionally DMA 128 bytes from LS to EA

    putlluc - Unconditionally DMA 128 bytes from LS to EAbarrier- all previous commands complete before subsequent

    commands are started

    mfcsync - Results of all previous commands in Tag group

    are remotely visible

    mfceieio - Results of all preceding Puts commands in same

    group visible with respect to succeeding Get commands

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    77/81

    2006 IBM Corporation77 Cell Broadband Engine - enabling density computing for data-rich environments

    Raising the bar with parallelism

    Data-parallelism and static ILP in both core types

    Results in low overhead per operation Multithreaded programming is key to great Cell BE performance

    Exploit application parallelism with 9 cores

    Regardless of whether code exploits DLP & ILP

    Challenge regardless of homogeneity/heterogeneity

    Leverage parallelism between data processing and data transfer

    A new level of parallelism exploiting bulk data transfer

    Simultaneous processing on SPUs and data transfer on SMFs

    Offers superlinear gains beyond MIPS-scaling

  • 8/8/2019 IBM cell broadband engine

    78/81

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    79/81

    2006 IBM Corporation79 Cell Broadband Engine - enabling density computing for data-rich environments

    System configurations

    Cell BEProcessor

    XDR XDR

    IOIF

    BIF

    Cell BEProcessor

    XDR XDR

    IOIF

    CellBE

    Processor

    XDR

    XDR

    IOIF

    BIF

    CellBE

    Processor

    XDR

    XDR

    IOIFswitch

    Cell BEProcessor

    XDR XDR

    IOIF BIF

    Cell BEProcessor

    XDR XDR

    IOIF

    Cell BEProcessor

    XDR XDR

    IOIF0 IOIF1

    Game console systems

    Blades

    HDTV

    Home media servers

    Supercomputers

    Cell

    Design

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    80/81

    2006 IBM Corporation80 Cell Broadband Engine - enabling density computing for data-rich environments

    Cell: a Synergistic System Architecture Cell is not a collection of different processors, but a synergistic whole

    Operation paradigms, data formats and semantics consistent

    Share address translation and memory protection model

    SPE optimized for efficient data processing

    SPEs share Cell system functions provided by Power Architecture

    SMF implements interface to memory

    Copy in/copy out to local storage

    Power Architecture provides system functions

    Virtualization

    Address translation and protection

    External exception handling

    EIB integrates system as data transport hub

    Cell Broadband Engine

  • 8/8/2019 IBM cell broadband engine

    81/81

    Copyright International Business Machines Corporation 2006.All Rights Reserved. Printed in the United States June 2006.

    The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both.IBM IBM Logo Power Architecture

    Other company, product and service names may be trademarks or service marks of others.

    All information contained in this document is subject to change without notice. The products described in this document areNOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could resultin death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change

    IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnityunder the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specificenvironments, and is presented as an illustration. The results obtained in other operating environments may vary.

    While the information contained herein is believed to be accurate, such information is preliminary, and should not be reliedupon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.

    THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liablefor damages arising directly or indirectly from any use of the information contained in this document.