
Page 1: Programming Many-Core Systems with GRAMPS

Jeremy Sugerman, 14 May 2010

Page 2: The single fast core era is over

• Trends:
  – Changing metrics: ‘scale out’, not just ‘scale up’
  – Increasing diversity: many different mixes of ‘cores’

• Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core

Problem: How does one program all this complexity?!

Page 3: High-level programming models

• Two major advantages over threads & locks:
  – Constructs to express/expose parallelism
  – Scheduling support to help manage concurrency, communication, and synchronization

• Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …

Page 4: My biases: workloads

• Interesting applications have irregularity
• Large bundles of coherent work are efficient
• Producer-consumer idiom is important

Goal: Rebuild coherence dynamically by aggregating related work as it is generated.

Page 5: My target audience

• Highly informed, but (good) lazy
  – Understands the hardware and best practices
  – Dislikes rote; prefers power over constraints

Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.

Page 6: Contributions: Design of GRAMPS

• Programs are graphs of stages and queues

• Queues:
  – Maximum capacities, packet sizes

• Stages:
  – No, limited, or total automatic parallelism
  – Fixed, variable, or reduction (in-place) outputs

[Figure: Simple Graphics Pipeline]

Page 7: Contributions: Implementation

• Broad application scope:
  – Rendering, MapReduce, image processing, …

• Multi-platform applicability:
  – GRAMPS runtimes for three architectures

• Performance:
  – Scale-out parallelism, controlled data footprint
  – Compares well to schedulers from other models

• (Also: Tunable)

Page 8: Outline

• GRAMPS overview
• Study 1: Future graphics architectures
• Study 2: Current multi-core CPUs
• Comparison with schedulers from other parallel programming models

Page 9: GRAMPS Overview

Page 10: GRAMPS

• Programs are graphs of stages and queues
  – Expose the program structure
  – Leave the program internals unconstrained

Page 11: Writing a GRAMPS program

• Design the application graph and queues:

• Design the stages
• Instantiate and launch (sketched below).

Credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html

[Figure: Cookie Dough Pipeline]
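Though the deck shows this step as a picture, the three steps can be sketched in code. Below is a minimal, self-contained C++ analogue of the cookie-dough pipeline; every name in it (Packet, BoundedQueue, the mixer and oven callables) is invented for illustration and is not the GRAMPS API.

#include <cstddef>
#include <iostream>
#include <optional>
#include <queue>
#include <utility>
#include <vector>

struct Packet { std::vector<int> items; };        // a "collection" packet

// Step 1: design the queues. GRAMPS queues are bounded and carry packets.
struct BoundedQueue {
    std::size_t capacity;                         // maximum capacity, in packets
    std::queue<Packet> buf;
    bool push(Packet p) {                         // fails when full: producer must wait
        if (buf.size() >= capacity) return false;
        buf.push(std::move(p));
        return true;
    }
    std::optional<Packet> pop() {
        if (buf.empty()) return std::nullopt;
        Packet p = std::move(buf.front());
        buf.pop();
        return p;
    }
};

int main() {
    BoundedQueue dough{/*capacity=*/4};

    // Step 2: design the stages, as plain callables in this sketch.
    auto mixer = [&](int batch) { dough.push(Packet{{batch, batch, batch}}); };
    auto oven = [&] {
        while (auto p = dough.pop())
            std::cout << "baked a packet of " << p->items.size() << " cookies\n";
    };

    // Step 3: instantiate and launch. Here the stages run serially; in GRAMPS
    // the runtime schedules them concurrently against the bounded queue.
    for (int batch = 0; batch < 3; ++batch) mixer(batch);
    oven();
}

The sketch only fixes the vocabulary of stages, queues, and packets; in real GRAMPS the graph is data that the runtime schedules against, which is what the rest of the talk exploits.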

Page 12: Queues

• Bounded size, operate at “packet” granularity
  – “Opaque” and “Collection” packets

• GRAMPS can optionally preserve ordering
  – Required for some workloads, adds overhead (sketch below)
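Why preserving order adds overhead, in one small sketch: packets committed out of order must be buffered until their sequence number comes up, and that buffered data is exactly queue footprint. This ReorderBuffer is illustrative only, not the GRAMPS implementation.

#include <cstdint>
#include <iostream>
#include <map>

struct ReorderBuffer {
    std::uint64_t next = 0;                    // next sequence number to deliver
    std::map<std::uint64_t, int> pending;      // held-back packets: the overhead
    void commit(std::uint64_t seq, int packet) {
        pending.emplace(seq, packet);
        while (!pending.empty() && pending.begin()->first == next) {
            std::cout << "deliver packet " << pending.begin()->second << "\n";
            pending.erase(pending.begin());
            ++next;
        }
    }
};

int main() {
    ReorderBuffer rb;
    rb.commit(2, 200);   // arrives early: buffered, enlarging footprint
    rb.commit(0, 0);     // in order: delivered immediately
    rb.commit(1, 100);   // unblocks buffered packet 2 as well
}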

Page 13: Thread (and Fixed) stages

• Preemptible, long-lived, stateful
  – Often merge, compare, or repack inputs

• Queue operations: Reserve/Commit (sketch below)
• (Fixed: Thread stages in custom hardware)
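A hypothetical sketch of the Reserve/Commit idiom; the queue type and signatures are invented, but the pattern is the point: reserve a window of input packets, work on it in place, then commit to release it. In GRAMPS these two calls are also where a Thread stage can be preempted.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Toy input queue granting windows of packets at reserve/commit granularity.
struct InQueue {
    std::vector<int> packets;
    std::size_t head = 0;
    std::pair<const int*, std::size_t> reserve(std::size_t n) {
        std::size_t avail = std::min(n, packets.size() - head);
        return {packets.data() + head, avail};   // window of up to n packets
    }
    void commit(std::size_t n) { head += n; }    // releases the window
};

int main() {
    InQueue in{{3, 1, 4, 1, 5, 9}};
    long sum = 0;                                // Thread stages may keep state
    for (;;) {
        auto [window, n] = in.reserve(2);
        if (n == 0) break;                       // upstream done, queue drained
        for (std::size_t i = 0; i < n; ++i) sum += window[i];
        in.commit(n);
    }
    std::cout << "merged/accumulated value: " << sum << "\n";
}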

Page 14: Shader stages

• Automatically parallelized:
  – Horde of non-preemptible, stateless instances
  – Pre-reserve/post-commit

• Push: variable/conditional output support
  – GRAMPS coalesces elements into full packets (sketch below)
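A sketch of push-style output: each stateless Shader instance may emit zero or more elements, and the runtime coalesces loose elements into full packets before they enter the output queue. The Coalescer below is invented for illustration.

#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

struct Coalescer {
    std::size_t packet_size;                       // e.g., sized to SIMD width
    std::vector<int> partial;                      // loose pushed elements
    std::vector<std::vector<int>> full_packets;    // what downstream sees
    void push(int element) {
        partial.push_back(element);
        if (partial.size() == packet_size) {       // only full packets move on
            full_packets.push_back(std::move(partial));
            partial.clear();
        }
    }
};

int main() {
    Coalescer out{/*packet_size=*/4};
    for (int i = 0; i < 10; ++i) {                 // a shader "horde", run serially here
        if (i % 2 == 0) out.push(i);               // variable, conditional output
        if (i % 3 == 0) out.push(-i);
    }
    std::cout << out.full_packets.size() << " full packets, "
              << out.partial.size() << " elements still coalescing\n";
}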

Page 15: Queue sets: Mutual exclusion

• Independent exclusive (serial) subqueues
  – Created statically or on first output
  – Densely or sparsely indexed (sketch below)

• Bonus: Automatically instanced Thread stages

[Figure: Cookie Dough Pipeline]
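A minimal model of a queue set, with invented names: one logical queue fans out into serial subqueues keyed by index, so work for one key (one mixing bowl, one screen tile, …) stays mutually exclusive while different keys proceed in parallel.

#include <iostream>
#include <map>
#include <queue>

struct QueueSet {
    std::map<int, std::queue<int>> sub;          // sparsely indexed subqueues
    void push(int key, int packet) {
        sub[key].push(packet);                   // subqueue created on first output
    }
};

int main() {
    QueueSet qs;
    qs.push(0, 10); qs.push(1, 11); qs.push(0, 12);
    // Each subqueue is serial, so it needs no locking internally; GRAMPS can
    // hand each one to its own automatically instanced Thread stage.
    for (auto& [key, q] : qs.sub)
        std::cout << "subqueue " << key << ": " << q.size() << " packets\n";
}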

Page 16: Queue sets: Mutual exclusion (continued)

[Figure: Cookie Dough (with queue set)]

Page 17: A few other tidbits

• Instanced Thread stages

• Queues as barriers / read all-at-once

• In-place Shader stages / coalescing inputs

Page 18: Formative influences

• The Graphics Pipeline, early GPGPU
• “Streaming”
• Work-queues and task-queues

Page 19: Study 1: Future Graphics Architectures

(with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)

Page 20: Graphics is a natural first domain

• Table stakes for commodity parallelism
• GPUs are full of heterogeneity
• Poised to transition from fixed/configurable pipeline to programmable
• We have a lot of experience in it

Page 21: The Graphics Pipeline in GRAMPS

• Graph, setup are (application) software
  – Can be customized or completely replaced

• Like the transition to programmable shading
  – Not (unthinkably) radical

• Fits current hw: FIFOs, cores, rasterizer, …

Page 22: Reminder: Design goals

• Broad application scope
• Multi-platform applicability
• Performance: scale-out, footprint-aware

Page 23: The Experiment

• Three renderers:
  – Rasterization, Ray Tracer, Hybrid

• Two simulated future architectures
  – Simple scheduler for each

Page 24: Scope: Two(-plus) renderers

[Figures: Rasterization Pipeline (with ray tracing extension) and Ray Tracing Graph]

Page 25: Platforms: Two simulated systems

[Figures: CPU-Like: 8 Fat Cores + Rast; GPU-Like: 1 Fat Core, 4 Micro Cores, Rast, Sched]

Page 26: Performance: Metrics

“Maximize machine utilization while keeping working sets small”

• Priority #1: Scale-out parallelism
  – Parallel utilization

• Priority #2: ‘Reasonable’ bandwidth / storage
  – Worst case total footprint of all queues
  – Inherently a trade-off versus utilization (formalized below)
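Made concrete (my formalization; the slides state only the prose), with p hardware contexts, run length T, busy_i the busy time of context i, and bytes_q(t) the bytes resident in queue q at time t:

\[
\text{Utilization} \;=\; \frac{1}{p\,T}\sum_{i=1}^{p}\text{busy}_i,
\qquad
\text{Footprint} \;=\; \max_{0 \le t \le T}\;\sum_{q \in \text{Queues}} \text{bytes}_q(t)
\]

Deeper queues tend to raise each busy_i but also raise the worst-case sum, which is the trade-off the last bullet names.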

Page 27: Performance: Scheduling

Simple prototype scheduler (both platforms; toy sketch below):
• Static stage priorities, labeled (Lowest) to (Highest) across the graph
• Only preempt on Reserve and Commit
• No dynamic weighting of current queue sizes
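A toy version of that policy, with invented scaffolding (the real schedulers ran inside the two simulators): pick the highest-priority stage that has pending input, run it, and reconsider only when it next hits a Reserve or Commit. The figure's labels suggest later stages get the higher priorities, which drains queues toward the sink and keeps footprint bounded.

#include <iostream>
#include <vector>

struct Stage {
    const char* name;
    int priority;    // static, assigned once from the graph
    int pending;     // input packets currently queued
};

// Highest-priority runnable stage wins; no weighting by queue size.
Stage* pick(std::vector<Stage>& stages) {
    Stage* best = nullptr;
    for (auto& s : stages)
        if (s.pending > 0 && (!best || s.priority > best->priority))
            best = &s;
    return best;
}

int main() {
    std::vector<Stage> graph = {
        {"vertex", 0, 5}, {"raster", 1, 2}, {"fragment", 2, 3}};
    while (Stage* s = pick(graph)) {
        std::cout << "run " << s->name << "\n";   // until its next Reserve/Commit
        --s->pending;
    }
}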

Page 28: Performance: Results

• Utilization: 95+% for all but rasterized fairy (~80%)
• Footprint: < 600KB CPU-like, < 1.5MB GPU-like
• Surprised how well the simple scheduler worked
• Maintaining order costs footprint

Page 29: Study 2: Current Multi-core CPUs

(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

Page 30: Reminder: Design goals

• Broad application scope
• Multi-platform applicability
• Performance: scale-out, footprint-aware

Page 31: The Experiment

• 9 applications, 13 configurations
• One (more) architecture: multi-core x86
  – It’s real (no simulation here)
  – Built with pthreads, locks, and atomics
• Per-pthread task-priority-queues with work-stealing
  – More advanced scheduling

Page 32: Scope: Application bonanza

• GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though)
• MapReduce: Hist (reduce / combine), LR (reduce / combine), PCA
• Cilk(-like): Mergesort
• CUDA: Gaussian, SRAD
• StreamIt: FM, TDE

Page 33: Scope: Many different idioms

[Figure: application graphs for FM, Merge Sort, Ray Tracer, SRAD, and MapReduce]

Page 34: Platform: 2x Quad-core Nehalem

• Queues: copy in/out, global (shared) buffer
• Threads: user-level scheduled contexts
• Shaders: create one task per input packet

[Figure: Native: 8 HyperThreaded Core i7 cores]

Page 35: Performance: Metrics (Reminder)

“Maximize machine utilization while keeping working sets small”

• Priority #1: Scale-out parallelism
• Priority #2: ‘Reasonable’ bandwidth / storage

Page 36: Performance: Scheduling

• Static per-stage priorities (still)
• Work-stealing task-priority-queues (selection logic sketched below)
• Eagerly create one task per packet (naïve)
• Keep running stages until a low watermark
  – (Limited dynamic weighting of queue depths)
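A single-threaded model of the selection logic implied here, with invented names: each hardware thread owns a priority queue of tasks ordered by stage priority, takes locally first, and steals when it runs dry. The real runtime does this concurrently with pthreads, locks, and atomics; that a thief takes the victim's highest-priority task is my assumption.

#include <cstddef>
#include <iostream>
#include <queue>
#include <vector>

struct Task { int stage_priority; const char* work; };
struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.stage_priority < b.stage_priority;
    }
};
using TaskPQ = std::priority_queue<Task, std::vector<Task>, ByPriority>;

bool next_task(std::vector<TaskPQ>& workers, std::size_t self, Task& out) {
    if (!workers[self].empty()) {                       // local queue first
        out = workers[self].top(); workers[self].pop();
        return true;
    }
    for (std::size_t v = 0; v < workers.size(); ++v)    // otherwise steal
        if (v != self && !workers[v].empty()) {
            out = workers[v].top(); workers[v].pop();
            return true;
        }
    return false;
}

int main() {
    std::vector<TaskPQ> workers(2);
    workers[0].push({1, "generate rays"});     // one task per input packet
    workers[0].push({2, "shade hits"});
    Task t;
    while (next_task(workers, /*self=*/1, t))  // worker 1 is empty: it steals
        std::cout << t.work << "\n";
}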

Page 37: Performance: Good Scale-out

• (Footprint: Good; detail a little later)

[Figure: parallel speedup vs. hardware threads]

Page 38: Performance: Low Overheads

• ‘App’ and ‘Queue’ time are both useful work.

[Figure: execution time breakdown (8 cores / 16 hyperthreads)]

Page 39: Comparison with Other Schedulers

(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

Page 40: Three archetypes

• Task-Stealing (Cilk, TBB):
  – Low overhead with fine-granularity tasks
  – No producer-consumer, priorities, or data-parallelism
• Breadth-First (CUDA, OpenCL):
  – Simple scheduler (one stage at a time)
  – No producer-consumer, no pipeline parallelism
• Static (StreamIt / streaming):
  – No runtime scheduler; complex schedules
  – Cannot adapt to irregular workloads

Page 41: GRAMPS is a natural framework

[Table: GRAMPS, Task-Stealing, Breadth-First, and Static compared on Shader Support, Producer-Consumer, Structured ‘Work’, and Adaptive]

Page 42: The Experiment

• Re-use the exact same application code
• Modify the scheduler per archetype:
  – Task-Stealing: unbounded queues, no priorities, (amortized) preempt to child tasks
  – Breadth-First: unbounded queues, one stage at a time, top-to-bottom
  – Static: unbounded queues, offline per-thread schedule using SAS / SGMS

Page 43: Seeing is believing (ray tracer)

[Figure: ray tracer execution traces under GRAMPS, Breadth-First, Static (SAS), and Task-Stealing]

Page 44: Comparison: Execution time

• Mostly similar: good parallelism, load balance

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First)]

Page 45: Comparison: Execution time

• Breadth-First can exhibit load imbalance

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First)]

Page 46: Comparison: Execution time

• Task-Stealing can ping-pong and cause contention

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First)]

Page 47: Comparison: Footprint

• Breadth-First is pathological (as expected)

[Figure: relative packet footprint versus GRAMPS (log scale)]

Page 48: Footprint: GRAMPS & Task-Stealing

[Figures: relative packet footprint and relative task footprint]

Page 49: Footprint: GRAMPS & Task-Stealing

GRAMPS gets insight from the graph:
• (Application-specified) queue bounds
• Group tasks by stage for priority, preemption

[Figures: MapReduce and Ray Tracer footprint comparisons]

Page 50: Static scheduling is challenging

• Generating good Static schedules is *hard*.
• Static schedules are fragile:
  – Small mismatches compound
  – Hardware itself is dynamic (cache traffic, IRQs, …)
• Limited upside: dynamic scheduling is cheap!

[Figures: execution time and packet footprint]

Page 51: Discussion (for multi-core CPUs)

• Adaptive scheduling is the obvious choice.
  – Better load balance / handling of irregularity

• Semantic insight (the application graph) gives a big advantage in managing footprint.

• More cores and development maturity mean more complex graphs, and thus more advantage.

Page 52: Conclusion

Page 53: Contributions Revisited

• GRAMPS programming model design
  – Graph of heterogeneous stages and queues

• Good results from actual implementation
  – Broad scope: wide range of applications
  – Multi-platform: three different architectures
  – Performance: high parallelism, good footprint

Page 54: Anecdotes and intuitions

• Structure helps: an explicit graph is handy.
• Simple (principled) dynamic scheduling works.
• Queues impedance-match heterogeneity.
• Graphs with cycles and push both paid off.
• (Also: paired instrumentation and visualization help enormously)

Page 55: Conclusion: Future trends revisited

• Core counts are increasing
  – Parallel programming models

• Memory and bandwidth are precious
  – Working set, locality (i.e., footprint) management

• Power and performance are driving heterogeneity
  – All ‘cores’ need to communicate, interoperate

GRAMPS fits them well.

Page 56: Thanks

• Eric, for agreeing to make this happen.
• Christos, for throwing helpers at me.
• Kurt, Mendel, and Pat, for, well, a lot.
• John Gerth, for tireless computer servitude.
• Melissa (and Heather and Ada before her)

Page 57: Thanks

• My practice audiences
• My many collaborators
• Daniel, Kayvon, Mike, Tim
• Supporters at NVIDIA, ATI/AMD, Intel
• Supporters at VMware
• Everyone who entertained, informed, challenged me, and made me think

Page 58: Thanks

• My funding agencies:
  – Rambus Stanford Graduate Fellowship
  – Department of the Army Research
  – Stanford Pervasive Parallelism Laboratory

Page 59: Q&A

• Thank you for listening!
• Questions?

Page 60: Extra Material (Backup)

Page 61: Data: CPU-Like & GPU-Like

Page 62: Footprint Data: Native

Page 63: Tunability

• Diagnosis:
  – Raw counters, statistics, logs
  – Grampsviz

• Optimize / Control (one possible knob struct sketched below):
  – Graph topology (e.g., sort-middle vs. sort-last)
  – Queue watermarks (e.g., 10x win for ray tracing)
  – Packet size: match SIMD widths, share data
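One plausible shape for these knobs, as a per-queue configuration struct; the names and defaults are invented, and the comments record the tuning intuition from this slide.

#include <cstddef>
#include <iostream>

struct QueueKnobs {
    std::size_t capacity = 32;       // deeper queue: more coherence to aggregate,
                                     // but a larger worst-case footprint
    std::size_t low_watermark = 4;   // keep a stage running until its inputs dip
                                     // below this (the ~10x ray tracing win)
    std::size_t packet_size = 8;     // match the SIMD width so one packet fills
                                     // the lanes and instances can share data
};

int main() {
    QueueKnobs ray_queue;            // hypothetical queue feeding ray-hit shading
    ray_queue.capacity = 256;        // trade footprint for coherence
    std::cout << "cap=" << ray_queue.capacity
              << " low=" << ray_queue.low_watermark
              << " pkt=" << ray_queue.packet_size << "\n";
}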

Page 64: Tunability: Grampsviz (1)

• GPU-Like: Rasterization pipeline

Page 65: Tunability: Grampsviz (2)

• CPU-Like: Histogram (MapReduce)

[Figure: Grampsviz view showing the Reduce and Combine stages]

Page 66: Tunability: Knobs

• Graph topology/design:

[Figure: Sort-Middle vs. Sort-Last graph topologies]

• Sizing critical queues:

Page 67: Alternatives

Page 68: A few other tidbits

• In-place Shader stages / coalescing inputs
• Instanced Thread stages
• Queues as barriers / read all-at-once

[Figure: Image Histogram Pipeline]

Page 69: Performance: Good Scale-out

• (Footprint: Good; detail a little later)

[Figure: parallel speedup vs. hardware threads]

Page 70: Seeing is believing (ray tracer)

[Figure: ray tracer execution traces under GRAMPS, Static (SAS), Task-Stealing, and Breadth-First]

Page 71: Comparison: Execution time

• Small ‘Sched’ time, even with large graphs

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First)]