Programming the Cell Broadband Engine Processor

Robert Cooper
With thanks to: Brian Bouzas, Luke Cico, Jon Greene, Maike Geng, Jamie Kenny, Michael Pepe, Myra Prelle

© 2006 Mercury Computer Systems, Inc.



Outline

• Cell architecture and performance
• Mercury approach to programming Cell
• MultiCore Framework
• Final thoughts


Cell BE Processor Architecture

• Resembles distributed memory multiprocessor with explicit DMA over a fabric


Mercury Multi-DSP Board (1996)


Programming Cell: What’s Good and What’s Hard

Good

• SPE
    • No second guessing about cache replacement algorithm
    • Very deterministic pipeline
    • 128 registers mask pipeline latency very well
• Ring and XDR
    • DMA has negligible impact on SPE local store bandwidth
    • Generous ring bandwidth means topology is seldom an issue
• PPE
    • Standard Power® core

Hard

• SPE
    • Burden on software to get code and data into local store
    • Local store is small compared to ring latency
    • Branch prediction is manual and very restricted
• Ring and XDR
    • 128 byte alignment necessary for best performance
    • XDR bandwidth is a bottleneck
    • Linking Cell chips in coherent mode increases latency
• PPE
    • Performance is modest
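The 128 byte alignment point can be made concrete with a small sketch. This is portable C11 rather than Cell SDK code, and the helper names (`round_up_to_dma_align`, `dma_buffer_alloc`, `is_dma_aligned`) are hypothetical, chosen only for illustration. It shows one way to obtain a buffer whose address and size are both multiples of 128 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DMA_ALIGN 128u  /* alignment the slide names for best performance */

/* Round n up to the next multiple of DMA_ALIGN (a power of two). */
static size_t round_up_to_dma_align(size_t n)
{
    return (n + DMA_ALIGN - 1u) & ~(size_t)(DMA_ALIGN - 1u);
}

/* Hypothetical helper: allocate a buffer whose start address and size
   are both multiples of 128 bytes. Release it with free(). */
static void *dma_buffer_alloc(size_t n_bytes)
{
    assert(n_bytes > 0);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment. */
    return aligned_alloc(DMA_ALIGN, round_up_to_dma_align(n_bytes));
}

/* Non-zero if p starts on a 128 byte boundary. */
static int is_dma_aligned(const void *p)
{
    return ((uintptr_t)p % DMA_ALIGN) == 0;
}
```

Rounding the size up as well as the address keeps every transfer a whole number of 128 byte blocks, at the cost of a little padding at the end of each buffer.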


How Much Faster Is Cell?

Relative performance of Cell and leading general purpose processors

[Figure: bar chart of relative performance (vertical axis, 0 to 60) on five benchmarks: single precision complex FFTs (1K point, 8K point, 64K point) and symmetric image filters (15x15 16-bit, 15x15 8-bit). Processors compared: Cell BE 3.2 GHz, Freescale 744x 975 MHz, Pentium 3.6 GHz 2MB L2, Opteron 2.4 GHz, PPC 970 2.0 GHz. Cell BE bar labels: 32, 56, 30, 13, 25. Freescale 744x is 1.0 throughout; the remaining bar labels lie between 0.9 and 2.1.]

• Performance relative to 1GHz Freescale 744x (i.e. Freescale = 1)

• In all cases, we are comparing Mercury optimized Cell algorithm implementations with the best available (Mercury or 3rd party) implementations on other processors

• Did not compare with dual core x86 processors


Goals for Programming Cell

• Achieve high performance: The only reason for choosing Cell

• Ease of programming: An important aspect of this is programmer portability

• Code portability
    • Important for large legacy code bases written in C/C++, Fortran
    • And new code developed for Cell should be portable to current and anticipated multiprocessor architectures


Mercury Approach to Programming Cell

• Very pragmatic
    • Can’t wait for tools to mature
    • Develop our own tools when it makes sense

• Emphasis on explicitly programming the architecture rather than trying to hide it
    • When the tools are immature, this allows us to get maximum performance

• Achieve ease-of-use and portability through function offload model
    • Run legacy code on PPE
    • Offload compute intensive workload to SPEs


Mercury Elements of Cell Ecosystem

• Maximum performance tools
    • SPE local algorithm libraries
        • Mercury Scientific Algorithm Library (SAL) ported to SPE
    • Explicit multicore middleware
        • MultiCore Framework
    • Assembly language generation and preprocessing tools

• Portability tools
    • Portable algorithm libraries callable from PPE
        • PPE SAL runs on PPE vector unit
        • Multicore SAL executes in parallel on SPEs

• Performance debugging tool
    • Mercury Trace Analysis Tool (TATL) extended to work on SPEs

• Complement other tools in ecosystem
    • Compilers, languages, libraries, debuggers, simulators


MultiCore Framework

• An API for programming heterogeneous multicores that contain explicit non-cached memory hierarchies

• Provides an abstract view of the hardware oriented toward computation of multidimensional data sets

• First implementation is for the Cell BE processor


MCF Abstractions

• Function offload model
    • Worker Teams: Allocate tasks to SPEs
    • Plug-ins: Dynamically load and unload functions from within worker programs

• Data movement
    • Distribution Objects: Define how n-dimensional data is organized in memory
    • Tile Channels: Move data between SPEs and main memory
    • Re-org Channels: Move data among SPEs
    • Multibuffering: Overlap data movement and computation

• Miscellaneous
    • Barrier and semaphore synchronization
    • DMA-friendly memory allocator
    • DMA convenience functions
    • Performance profiling
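The multibuffering idea, overlapping data movement with computation, can be sketched with two local buffers. This is a minimal illustration in portable C, not MCF code: `fetch_tile`, `compute_tile`, `process_double_buffered`, and `TILE_LEN` are hypothetical names, and `memcpy` stands in for a DMA transfer. Because `memcpy` is synchronous, nothing actually overlaps here; the sketch only shows the order in which transfers would be issued relative to the computation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE_LEN 256  /* floats per tile; an arbitrary size for this sketch */

/* Stand-in for starting a transfer from main memory into local store.
   memcpy is synchronous; on Cell this would be an asynchronous DMA get. */
static void fetch_tile(float *local, const float *main_mem, size_t tile_index)
{
    memcpy(local, main_mem + tile_index * TILE_LEN, TILE_LEN * sizeof(float));
}

/* Example computation on one tile: sum its elements. */
static double compute_tile(const float *local)
{
    double sum = 0.0;
    for (size_t i = 0; i < TILE_LEN; i++)
        sum += local[i];
    return sum;
}

/* Process n_tiles tiles using two buffers that alternate roles. */
static double process_double_buffered(const float *main_mem, size_t n_tiles)
{
    static float buf[2][TILE_LEN];
    double total = 0.0;

    assert(main_mem != NULL);
    if (n_tiles == 0)
        return 0.0;

    fetch_tile(buf[0], main_mem, 0);  /* prime the pipeline with tile 0 */
    for (size_t i = 0; i < n_tiles; i++) {
        /* Issue the transfer for the next tile into the other buffer. */
        if (i + 1 < n_tiles)
            fetch_tile(buf[(i + 1) % 2], main_mem, i + 1);
        /* A real implementation would wait here for tile i's transfer,
           then compute while tile i + 1 is still in flight. */
        total += compute_tile(buf[i % 2]);
    }
    return total;
}
```

With asynchronous transfers, the time to move tile i + 1 is hidden behind the computation on tile i, as long as the computation takes at least as long as the transfer.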



MCF Distribution Objects

Frame: one complete data set in main memory

• Distribution Object parameters:
    • Number of dimensions
    • Frame size
    • Tile size and tile overlap
    • Array indexing order
    • Compound data type organization (e.g. split / interleaved)
    • Partitioning policy across workers, including partition overlap
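The "split / interleaved" organization named among the Distribution Object parameters can be illustrated with complex data. In the interleaved layout, real and imaginary parts alternate in one array; in the split layout, they live in two separate arrays. The sketch below is plain C with hypothetical function names and says nothing about how MCF performs the conversion internally.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved layout: re0, im0, re1, im1, ...
   Split layout: re[] holds all real parts, im[] all imaginary parts. */

/* Convert n complex values from interleaved to split layout. */
static void interleaved_to_split(const float *interleaved,
                                 float *re, float *im, size_t n)
{
    assert(interleaved != NULL && re != NULL && im != NULL);
    for (size_t i = 0; i < n; i++) {
        re[i] = interleaved[2 * i];
        im[i] = interleaved[2 * i + 1];
    }
}

/* Convert n complex values from split back to interleaved layout. */
static void split_to_interleaved(const float *re, const float *im,
                                 float *interleaved, size_t n)
{
    assert(interleaved != NULL && re != NULL && im != NULL);
    for (size_t i = 0; i < n; i++) {
        interleaved[2 * i]     = re[i];
        interleaved[2 * i + 1] = im[i];
    }
}
```

The split layout tends to suit SIMD units, since a vector register then holds only real parts or only imaginary parts.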


Tile: unit of work for an SPE
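How a frame breaks into tiles along one dimension follows from the frame size, tile size, and tile overlap. The sketch below is an assumption-laden illustration, not MCF's algorithm: the function names are hypothetical, and shifting the last tile back so it stays inside the frame is a choice made here for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Number of tiles needed to cover `frame` elements along one dimension
   when each tile spans `tile` elements and neighbouring tiles share
   `overlap` elements. */
static size_t tiles_along(size_t frame, size_t tile, size_t overlap)
{
    assert(tile > overlap && frame >= tile);
    size_t step = tile - overlap;                 /* advance between tile origins */
    return (frame - tile + step - 1) / step + 1;  /* ceil((frame - tile) / step) + 1 */
}

/* First element covered by tile `index`. A tile that would run past the
   end of the frame is shifted back so that it ends exactly at the frame
   boundary (an assumption of this sketch). */
static size_t tile_origin(size_t frame, size_t tile, size_t overlap, size_t index)
{
    assert(tile > overlap && frame >= tile);
    size_t origin = index * (tile - overlap);
    return (origin + tile > frame) ? frame - tile : origin;
}
```

For example, a 100 element dimension with 32 element tiles and an overlap of 8 needs four tiles, starting at elements 0, 24, 48, and 68. For an n-dimensional frame the same calculation applies to each dimension independently.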


MCF Partition Assignment


[Figure: the frame divided into partitions, assigned to SPE 0, SPE 1, and SPE 2]
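One simple partitioning policy is a block partition: each worker receives a contiguous run of tiles, with sizes differing by at most one. The sketch below assumes that policy and uses a hypothetical function name; the slides do not say which policies MCF offers beyond naming partitioning policy and partition overlap as parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Block partition of n_tiles tiles over n_workers workers.
   Writes the index of the worker's first tile and its tile count.
   When the tiles do not divide evenly, the lowest-numbered workers
   receive one extra tile each. */
static void partition_range(size_t n_tiles, size_t n_workers, size_t worker,
                            size_t *first, size_t *count)
{
    assert(n_workers > 0 && worker < n_workers);
    assert(first != NULL && count != NULL);

    size_t base  = n_tiles / n_workers;  /* tiles every worker gets */
    size_t extra = n_tiles % n_workers;  /* workers that get one more */

    *count = base + (worker < extra ? 1 : 0);
    *first = worker * base + (worker < extra ? worker : extra);
}
```

With 10 tiles and 3 workers, worker 0 gets tiles 0 to 3, worker 1 gets tiles 4 to 6, and worker 2 gets tiles 7 to 9.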


MCF Tile Channels


[Figure: a tile channel serving the partitions assigned to SPE 0, SPE 1, and SPE 2]


MCF Manager Program

    main(int argc, char **argv) {
        mcf_m_net_create();
        mcf_m_net_initialize();

        // Add worker tasks
        mcf_m_net_add_task();
        mcf_m_team_run_task();

        // Specify data organization
        mcf_m_tile_distribution_create_3d("in");
        mcf_m_tile_distribution_set_partition_overlap("in");
        mcf_m_tile_distribution_create_3d("out");

        // Create and connect to tile channels
        mcf_m_tile_channel_create("in");
        mcf_m_tile_channel_create("out");
        mcf_m_tile_channel_connect("in");
        mcf_m_tile_channel_connect("out");

        // Get empty source buffer
        mcf_m_tile_channel_get_buffer("in");

        // fill input data here

        // Send it to workers
        mcf_m_tile_channel_put_buffer("in");

        // Wait for results from workers
        mcf_m_tile_channel_get_buffer("out");

        // process output data here
    }


MCF Worker Program

    mcf_w_main (int n_bytes, void * p_arg_ls) {
        // Create and connect to tile channels
        mcf_w_tile_channel_create("in");
        mcf_w_tile_channel_create("out");
        mcf_w_tile_channel_connect("in");
        mcf_w_tile_channel_connect("out");

        while (! mcf_w_tile_channel_is_end_of_channel("in")) {
            // Get full source buffer
            mcf_w_tile_channel_get_buffer("in");

            // Get empty destination buffer
            mcf_w_tile_channel_get_buffer("out");

            // Do math and fill destination buffer

            // Put back empty source buffer
            mcf_w_tile_channel_put_buffer("in");

            // Put back full destination buffer
            mcf_w_tile_channel_put_buffer("out");
        }
    }
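The get/put pattern of the worker loop can be exercised without Cell hardware by substituting a mock channel. Everything in the sketch below is hypothetical (`mock_channel`, `mock_get_buffer`, `mock_put_buffer`, `mock_is_end`, `mock_worker`): it is not the MCF API, it runs in a single thread, and it hands out pointers into an ordinary array instead of moving data by DMA. It only mirrors the shape of the loop: get a full source tile, get an empty destination tile, compute, then put both back.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_LEN 4  /* floats per tile; kept tiny for the sketch */

/* Hypothetical stand-in for a tile channel: a cursor over the tiles of
   one frame held in ordinary memory. */
typedef struct {
    float *frame;    /* the frame, n_tiles * TILE_LEN floats */
    size_t n_tiles;  /* number of tiles in the frame */
    size_t next;     /* index of the next tile to hand out */
} mock_channel;

/* Non-zero once every tile has been handed out and returned. */
static int mock_is_end(const mock_channel *ch)
{
    return ch->next >= ch->n_tiles;
}

/* Hand out the current tile buffer. */
static float *mock_get_buffer(mock_channel *ch)
{
    assert(!mock_is_end(ch));
    return ch->frame + ch->next * TILE_LEN;
}

/* Return the current tile buffer and advance to the next tile. */
static void mock_put_buffer(mock_channel *ch)
{
    ch->next++;
}

/* Worker loop with the same shape as the slide's worker program.
   The "math" here simply doubles every element. */
static void mock_worker(mock_channel *in, mock_channel *out)
{
    while (!mock_is_end(in)) {
        const float *src = mock_get_buffer(in);  /* full source buffer */
        float *dst = mock_get_buffer(out);       /* empty destination buffer */

        for (size_t i = 0; i < TILE_LEN; i++)
            dst[i] = 2.0f * src[i];

        mock_put_buffer(in);   /* put back empty source buffer */
        mock_put_buffer(out);  /* put back full destination buffer */
    }
}
```

A mock of this kind is in the spirit of the "SPE emulation on GPP to ease debugging" item, in that the worker's logic can be tested on a general purpose processor before it runs on an SPE.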


MCF Implementation

• Consists of
    • PPE library
    • SPE library and tiny executive (12 KB)

• Utilizes Cell Linux “libspe” support
    • But amortizes expensive system calls
    • Reduces overhead from milliseconds to microseconds
    • Provides faster and smaller footprint memory allocation library

• Based on Data Reorg standard http://www.data-re.org

• Derived from existing Mercury technologies
    • Other Mercury RDMA-based middleware
    • DSP product experience with small footprint, non-cached architectures


MCF Experience

• Works very well for regular N-dimensional matrix data and image data
    • Used by customers and for almost all internal application development

• Next steps
    • Data re-org channels for SPE-SPE data movement
    • Support for irregular, variable length data
    • Better integration with cluster-level middleware
    • Overlays / code movement
    • SPE emulation on GPP to ease debugging
    • Extend to GPP multicores

• What we need from the community for our “performance first” approach:
    • OS to get out of the way of SPE and DMA performance
    • Compilers that treat carefully written SPU intrinsics code with respect


Summing Up

• The Cell BE processor can achieve one to two orders of magnitude performance improvement over current general purpose processors
    • Lean SPE core saves space and power
    • And makes it easier for software to approach peak performance

• Cell is a distributed memory multiprocessor on a chip
    • Prior experience on these architectures translates easily to Cell

• But for most programmers, Cell is a new architecture
    • Successful adoption by programmers is Cell’s biggest challenge
    • And the history of other new processor architectures is not encouraging

• We need a range of tools that span the continuum from ease-of-use to high performance


Final Thoughts: Criteria for Evaluating Cell Software Tools