
Page 1: Programming Many-Core Systems with GRAMPS

Jeremy Sugerman, 14 May 2010

Page 2: The single fast core era is over

• Trends:
  – Changing metrics: ‘scale out’, not just ‘scale up’
  – Increasing diversity: many different mixes of ‘cores’

• Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core

Problem: How does one program all this complexity?!

Page 3: High-level programming models

• Two major advantages over threads & locks:
  – Constructs to express/expose parallelism
  – Scheduling support to help manage concurrency, communication, and synchronization

• Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …

Page 4: My biases: workloads

• Interesting applications have irregularity
• Large bundles of coherent work are efficient
• Producer-consumer idiom is important

Goal: Rebuild coherence dynamically by aggregating related work as it is generated.

Page 5: My target audience

• Highly informed, but (good) lazy
  – Understands the hardware and best practices
  – Dislikes rote; prefers power over constraints

Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.

Page 6: Contributions: Design of GRAMPS

• Programs are graphs of stages and queues

• Queues:
  – Maximum capacities, packet sizes

• Stages:
  – No, limited, or total automatic parallelism
  – Fixed, variable, or reduction (in-place) outputs

[Figure: Simple Graphics Pipeline]

Page 7: Contributions: Implementation

• Broad application scope:
  – Rendering, MapReduce, image processing, …

• Multi-platform applicability:
  – GRAMPS runtimes for three architectures

• Performance:
  – Scale-out parallelism, controlled data footprint
  – Compares well to schedulers from other models

• (Also: Tunable)

Page 8: Outline

• GRAMPS overview
• Study 1: Future graphics architectures
• Study 2: Current multi-core CPUs
• Comparison with schedulers from other parallel programming models

Page 9: GRAMPS Overview

Page 10: GRAMPS

• Programs are graphs of stages and queues
  – Expose the program structure
  – Leave the program internals unconstrained

Page 11: Writing a GRAMPS program

• Design the application graph and queues:

• Design the stages
• Instantiate and launch (sketched below).

Credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html

[Figure: Cookie Dough Pipeline]
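Though the deck shows this step as a picture, the three steps can be sketched in code. Below is a minimal, self-contained C++ analogue of the cookie-dough pipeline; every name in it (Packet, BoundedQueue, the mixer and oven callables) is invented for illustration and is not the GRAMPS API.

#include <cstddef>
#include <iostream>
#include <optional>
#include <queue>
#include <utility>
#include <vector>

struct Packet { std::vector<int> items; };        // a "collection" packet

// Step 1: design the queues. GRAMPS queues are bounded and carry packets.
struct BoundedQueue {
    std::size_t capacity;                         // maximum capacity, in packets
    std::queue<Packet> buf;
    bool push(Packet p) {                         // fails when full: producer must wait
        if (buf.size() >= capacity) return false;
        buf.push(std::move(p));
        return true;
    }
    std::optional<Packet> pop() {
        if (buf.empty()) return std::nullopt;
        Packet p = std::move(buf.front());
        buf.pop();
        return p;
    }
};

int main() {
    BoundedQueue dough{/*capacity=*/4};

    // Step 2: design the stages, as plain callables in this sketch.
    auto mixer = [&](int batch) { dough.push(Packet{{batch, batch, batch}}); };
    auto oven = [&] {
        while (auto p = dough.pop())
            std::cout << "baked a packet of " << p->items.size() << " cookies\n";
    };

    // Step 3: instantiate and launch. Here the stages run serially; in GRAMPS
    // the runtime schedules them concurrently against the bounded queue.
    for (int batch = 0; batch < 3; ++batch) mixer(batch);
    oven();
}

The sketch only fixes the vocabulary of stages, queues, and packets; in real GRAMPS the graph is data that the runtime schedules against, which is what the rest of the talk exploits.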

Page 12: Queues

• Bounded size, operate at “packet” granularity
  – “Opaque” and “Collection” packets

• GRAMPS can optionally preserve ordering
  – Required for some workloads, adds overhead (sketch below)
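Why preserving order adds overhead, in one small sketch: packets committed out of order must be buffered until their sequence number comes up, and that buffered data is exactly queue footprint. This ReorderBuffer is illustrative only, not the GRAMPS implementation.

#include <cstdint>
#include <iostream>
#include <map>

struct ReorderBuffer {
    std::uint64_t next = 0;                    // next sequence number to deliver
    std::map<std::uint64_t, int> pending;      // held-back packets: the overhead
    void commit(std::uint64_t seq, int packet) {
        pending.emplace(seq, packet);
        while (!pending.empty() && pending.begin()->first == next) {
            std::cout << "deliver packet " << pending.begin()->second << "\n";
            pending.erase(pending.begin());
            ++next;
        }
    }
};

int main() {
    ReorderBuffer rb;
    rb.commit(2, 200);   // arrives early: buffered, enlarging footprint
    rb.commit(0, 0);     // in order: delivered immediately
    rb.commit(1, 100);   // unblocks buffered packet 2 as well
}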

Page 13: Thread (and Fixed) stages

• Preemptible, long-lived, stateful
  – Often merge, compare, or repack inputs

• Queue operations: Reserve/Commit (sketch below)
• (Fixed: Thread stages in custom hardware)
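A hypothetical sketch of the Reserve/Commit idiom; the queue type and signatures are invented, but the pattern is the point: reserve a window of input packets, work on it in place, then commit to release it. In GRAMPS these two calls are also where a Thread stage can be preempted.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Toy input queue granting windows of packets at reserve/commit granularity.
struct InQueue {
    std::vector<int> packets;
    std::size_t head = 0;
    std::pair<const int*, std::size_t> reserve(std::size_t n) {
        std::size_t avail = std::min(n, packets.size() - head);
        return {packets.data() + head, avail};   // window of up to n packets
    }
    void commit(std::size_t n) { head += n; }    // releases the window
};

int main() {
    InQueue in{{3, 1, 4, 1, 5, 9}};
    long sum = 0;                                // Thread stages may keep state
    for (;;) {
        auto [window, n] = in.reserve(2);
        if (n == 0) break;                       // upstream done, queue drained
        for (std::size_t i = 0; i < n; ++i) sum += window[i];
        in.commit(n);
    }
    std::cout << "merged/accumulated value: " << sum << "\n";
}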

Page 14: Shader stages

• Automatically parallelized:
  – Horde of non-preemptible, stateless instances
  – Pre-reserve/post-commit

• Push: variable/conditional output support
  – GRAMPS coalesces elements into full packets (sketch below)
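A sketch of push-style output: each stateless Shader instance may emit zero or more elements, and the runtime coalesces loose elements into full packets before they enter the output queue. The Coalescer below is invented for illustration.

#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

struct Coalescer {
    std::size_t packet_size;                       // e.g., sized to SIMD width
    std::vector<int> partial;                      // loose pushed elements
    std::vector<std::vector<int>> full_packets;    // what downstream sees
    void push(int element) {
        partial.push_back(element);
        if (partial.size() == packet_size) {       // only full packets move on
            full_packets.push_back(std::move(partial));
            partial.clear();
        }
    }
};

int main() {
    Coalescer out{/*packet_size=*/4};
    for (int i = 0; i < 10; ++i) {                 // a shader "horde", run serially here
        if (i % 2 == 0) out.push(i);               // variable, conditional output
        if (i % 3 == 0) out.push(-i);
    }
    std::cout << out.full_packets.size() << " full packets, "
              << out.partial.size() << " elements still coalescing\n";
}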

Page 15: Queue sets: Mutual exclusion

• Independent exclusive (serial) subqueues
  – Created statically or on first output
  – Densely or sparsely indexed (sketch below)

• Bonus: Automatically instanced Thread stages

[Figure: Cookie Dough Pipeline]
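A minimal model of a queue set, with invented names: one logical queue fans out into serial subqueues keyed by index, so work for one key (one mixing bowl, one screen tile, …) stays mutually exclusive while different keys proceed in parallel.

#include <iostream>
#include <map>
#include <queue>

struct QueueSet {
    std::map<int, std::queue<int>> sub;          // sparsely indexed subqueues
    void push(int key, int packet) {
        sub[key].push(packet);                   // subqueue created on first output
    }
};

int main() {
    QueueSet qs;
    qs.push(0, 10); qs.push(1, 11); qs.push(0, 12);
    // Each subqueue is serial, so it needs no locking internally; GRAMPS can
    // hand each one to its own automatically instanced Thread stage.
    for (auto& [key, q] : qs.sub)
        std::cout << "subqueue " << key << ": " << q.size() << " packets\n";
}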

Page 16: Queue sets: Mutual exclusion (continued)

[Figure: Cookie Dough (with queue set)]

Page 17: A few other tidbits

• Instanced Thread stages

• Queues as barriers / read all-at-once

• In-place Shader stages / coalescing inputs

Page 18: Formative influences

• The Graphics Pipeline, early GPGPU
• “Streaming”
• Work-queues and task-queues

Page 19: Study 1: Future Graphics Architectures

(with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)

Page 20: Graphics is a natural first domain

• Table stakes for commodity parallelism
• GPUs are full of heterogeneity
• Poised to transition from fixed/configurable pipeline to programmable
• We have a lot of experience in it

Page 21: The Graphics Pipeline in GRAMPS

• Graph, setup are (application) software
  – Can be customized or completely replaced

• Like the transition to programmable shading
  – Not (unthinkably) radical

• Fits current hw: FIFOs, cores, rasterizer, …

Page 22: Reminder: Design goals

• Broad application scope
• Multi-platform applicability
• Performance: scale-out, footprint-aware

Page 23: The Experiment

• Three renderers:
  – Rasterization, Ray Tracer, Hybrid

• Two simulated future architectures
  – Simple scheduler for each

Page 24: Scope: Two(-plus) renderers

[Figures: Rasterization Pipeline (with ray tracing extension) and Ray Tracing Graph]

Page 25: Platforms: Two simulated systems

[Figures: CPU-Like: 8 Fat Cores + Rast; GPU-Like: 1 Fat Core, 4 Micro Cores, Rast, Sched]

Page 26: Performance: Metrics

“Maximize machine utilization while keeping working sets small”

• Priority #1: Scale-out parallelism
  – Parallel utilization

• Priority #2: ‘Reasonable’ bandwidth / storage
  – Worst case total footprint of all queues
  – Inherently a trade-off versus utilization (formalized below)
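Made concrete (my formalization; the slides state only the prose), with p hardware contexts, run length T, busy_i the busy time of context i, and bytes_q(t) the bytes resident in queue q at time t:

\[
\text{Utilization} \;=\; \frac{1}{p\,T}\sum_{i=1}^{p}\text{busy}_i,
\qquad
\text{Footprint} \;=\; \max_{0 \le t \le T}\;\sum_{q \in \text{Queues}} \text{bytes}_q(t)
\]

Deeper queues tend to raise each busy_i but also raise the worst-case sum, which is the trade-off the last bullet names.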

Page 27: Performance: Scheduling

Simple prototype scheduler (both platforms; toy sketch below):
• Static stage priorities, labeled (Lowest) to (Highest) across the graph
• Only preempt on Reserve and Commit
• No dynamic weighting of current queue sizes
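A toy version of that policy, with invented scaffolding (the real schedulers ran inside the two simulators): pick the highest-priority stage that has pending input, run it, and reconsider only when it next hits a Reserve or Commit. The figure's labels suggest later stages get the higher priorities, which drains queues toward the sink and keeps footprint bounded.

#include <iostream>
#include <vector>

struct Stage {
    const char* name;
    int priority;    // static, assigned once from the graph
    int pending;     // input packets currently queued
};

// Highest-priority runnable stage wins; no weighting by queue size.
Stage* pick(std::vector<Stage>& stages) {
    Stage* best = nullptr;
    for (auto& s : stages)
        if (s.pending > 0 && (!best || s.priority > best->priority))
            best = &s;
    return best;
}

int main() {
    std::vector<Stage> graph = {
        {"vertex", 0, 5}, {"raster", 1, 2}, {"fragment", 2, 3}};
    while (Stage* s = pick(graph)) {
        std::cout << "run " << s->name << "\n";   // until its next Reserve/Commit
        --s->pending;
    }
}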

Page 28: Performance: Results

• Utilization: 95+% for all but rasterized fairy (~80%)
• Footprint: < 600KB CPU-like, < 1.5MB GPU-like
• Surprised how well the simple scheduler worked
• Maintaining order costs footprint

Page 29: Study 2: Current Multi-core CPUs

(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

Page 30: Reminder: Design goals

• Broad application scope
• Multi-platform applicability
• Performance: scale-out, footprint-aware

Page 31: The Experiment

• 9 applications, 13 configurations
• One (more) architecture: multi-core x86
  – It’s real (no simulation here)
  – Built with pthreads, locks, and atomics
• Per-pthread task-priority-queues with work-stealing
  – More advanced scheduling

Page 32: Scope: Application bonanza

• GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though)
• MapReduce: Hist (reduce / combine), LR (reduce / combine), PCA
• Cilk(-like): Mergesort
• CUDA: Gaussian, SRAD
• StreamIt: FM, TDE

Page 33: Scope: Many different idioms

[Figure: application graphs for FM, Merge Sort, Ray Tracer, SRAD, and MapReduce]

Page 34: Platform: 2x Quad-core Nehalem

• Queues: copy in/out, global (shared) buffer
• Threads: user-level scheduled contexts
• Shaders: create one task per input packet

[Figure: Native: 8 HyperThreaded Core i7 cores]

Page 35: Performance: Metrics (Reminder)

“Maximize machine utilization while keeping working sets small”

• Priority #1: Scale-out parallelism
• Priority #2: ‘Reasonable’ bandwidth / storage

Page 36: Performance: Scheduling

• Static per-stage priorities (still)
• Work-stealing task-priority-queues (selection logic sketched below)
• Eagerly create one task per packet (naïve)
• Keep running stages until a low watermark
  – (Limited dynamic weighting of queue depths)
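A single-threaded model of the selection logic implied here, with invented names: each hardware thread owns a priority queue of tasks ordered by stage priority, takes locally first, and steals when it runs dry. The real runtime does this concurrently with pthreads, locks, and atomics; that a thief takes the victim's highest-priority task is my assumption.

#include <cstddef>
#include <iostream>
#include <queue>
#include <vector>

struct Task { int stage_priority; const char* work; };
struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.stage_priority < b.stage_priority;
    }
};
using TaskPQ = std::priority_queue<Task, std::vector<Task>, ByPriority>;

bool next_task(std::vector<TaskPQ>& workers, std::size_t self, Task& out) {
    if (!workers[self].empty()) {                       // local queue first
        out = workers[self].top(); workers[self].pop();
        return true;
    }
    for (std::size_t v = 0; v < workers.size(); ++v)    // otherwise steal
        if (v != self && !workers[v].empty()) {
            out = workers[v].top(); workers[v].pop();
            return true;
        }
    return false;
}

int main() {
    std::vector<TaskPQ> workers(2);
    workers[0].push({1, "generate rays"});     // one task per input packet
    workers[0].push({2, "shade hits"});
    Task t;
    while (next_task(workers, /*self=*/1, t))  // worker 1 is empty: it steals
        std::cout << t.work << "\n";
}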

Page 37: Performance: Good Scale-out

• (Footprint: Good; detail a little later)

[Figure: parallel speedup vs. hardware threads]

Page 38: Performance: Low Overheads

• ‘App’ and ‘Queue’ time are both useful work.

[Figure: execution time breakdown (8 cores / 16 hyperthreads)]

Page 39: Comparison with Other Schedulers

(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

Page 40: Three archetypes

• Task-Stealing (Cilk, TBB):
  – Low overhead with fine-granularity tasks
  – No producer-consumer, priorities, or data-parallelism
• Breadth-First (CUDA, OpenCL):
  – Simple scheduler (one stage at a time)
  – No producer-consumer, no pipeline parallelism
• Static (StreamIt / streaming):
  – No runtime scheduler; complex schedules
  – Cannot adapt to irregular workloads

Page 41: GRAMPS is a natural framework

[Table: GRAMPS, Task-Stealing, Breadth-First, and Static compared on Shader Support, Producer-Consumer, Structured ‘Work’, and Adaptive]

Page 42: The Experiment

• Re-use the exact same application code
• Modify the scheduler per archetype:
  – Task-Stealing: unbounded queues, no priorities, (amortized) preempt to child tasks
  – Breadth-First: unbounded queues, one stage at a time, top-to-bottom
  – Static: unbounded queues, offline per-thread schedule using SAS / SGMS

Page 43: Seeing is believing (ray tracer)

[Figure: ray tracer execution traces under GRAMPS, Breadth-First, Static (SAS), and Task-Stealing]

Page 44: Comparison: Execution time

• Mostly similar: good parallelism, load balance

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First)]

Page 45: Comparison: Execution time

• Breadth-First can exhibit load imbalance

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First)]

Page 46: Comparison: Execution time

• Task-Stealing can ping-pong and cause contention

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First)]

Page 47: Comparison: Footprint

• Breadth-First is pathological (as expected)

[Figure: relative packet footprint versus GRAMPS (log scale)]

Page 48: Footprint: GRAMPS & Task-Stealing

[Figures: relative packet footprint and relative task footprint]

Page 49: Footprint: GRAMPS & Task-Stealing

GRAMPS gets insight from the graph:
• (Application-specified) queue bounds
• Group tasks by stage for priority, preemption

[Figures: MapReduce and Ray Tracer footprint comparisons]

Page 50: Static scheduling is challenging

• Generating good Static schedules is *hard*.
• Static schedules are fragile:
  – Small mismatches compound
  – Hardware itself is dynamic (cache traffic, IRQs, …)
• Limited upside: dynamic scheduling is cheap!

[Figures: execution time and packet footprint]

Page 51: Discussion (for multi-core CPUs)

• Adaptive scheduling is the obvious choice.
  – Better load balance / handling of irregularity

• Semantic insight (the application graph) gives a big advantage in managing footprint.

• More cores and development maturity mean more complex graphs, and thus more advantage.

Page 52: Conclusion

Page 53: Contributions Revisited

• GRAMPS programming model design
  – Graph of heterogeneous stages and queues

• Good results from actual implementation
  – Broad scope: wide range of applications
  – Multi-platform: three different architectures
  – Performance: high parallelism, good footprint

Page 54: Anecdotes and intuitions

• Structure helps: an explicit graph is handy.
• Simple (principled) dynamic scheduling works.
• Queues impedance-match heterogeneity.
• Graphs with cycles and push both paid off.
• (Also: paired instrumentation and visualization help enormously)

Page 55: Conclusion: Future trends revisited

• Core counts are increasing
  – Parallel programming models

• Memory and bandwidth are precious
  – Working set, locality (i.e., footprint) management

• Power and performance are driving heterogeneity
  – All ‘cores’ need to communicate, interoperate

GRAMPS fits them well.

Page 56: Thanks

• Eric, for agreeing to make this happen.
• Christos, for throwing helpers at me.
• Kurt, Mendel, and Pat, for, well, a lot.
• John Gerth, for tireless computer servitude.
• Melissa (and Heather and Ada before her)

Page 57: Thanks

• My practice audiences
• My many collaborators
• Daniel, Kayvon, Mike, Tim
• Supporters at NVIDIA, ATI/AMD, Intel
• Supporters at VMware
• Everyone who entertained, informed, challenged me, and made me think

Page 58: Thanks

• My funding agencies:
  – Rambus Stanford Graduate Fellowship
  – Department of the Army Research
  – Stanford Pervasive Parallelism Laboratory

Page 59: Q&A

• Thank you for listening!
• Questions?

Page 60: Extra Material (Backup)

Page 61: Data: CPU-Like & GPU-Like

Page 62: Footprint Data: Native

Page 63: Tunability

• Diagnosis:
  – Raw counters, statistics, logs
  – Grampsviz

• Optimize / Control (one possible knob struct sketched below):
  – Graph topology (e.g., sort-middle vs. sort-last)
  – Queue watermarks (e.g., 10x win for ray tracing)
  – Packet size: match SIMD widths, share data
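One plausible shape for these knobs, as a per-queue configuration struct; the names and defaults are invented, and the comments record the tuning intuition from this slide.

#include <cstddef>
#include <iostream>

struct QueueKnobs {
    std::size_t capacity = 32;       // deeper queue: more coherence to aggregate,
                                     // but a larger worst-case footprint
    std::size_t low_watermark = 4;   // keep a stage running until its inputs dip
                                     // below this (the ~10x ray tracing win)
    std::size_t packet_size = 8;     // match the SIMD width so one packet fills
                                     // the lanes and instances can share data
};

int main() {
    QueueKnobs ray_queue;            // hypothetical queue feeding ray-hit shading
    ray_queue.capacity = 256;        // trade footprint for coherence
    std::cout << "cap=" << ray_queue.capacity
              << " low=" << ray_queue.low_watermark
              << " pkt=" << ray_queue.packet_size << "\n";
}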

Page 64: Tunability: Grampsviz (1)

• GPU-Like: Rasterization pipeline

Page 65: Tunability: Grampsviz (2)

• CPU-Like: Histogram (MapReduce)

[Figure: Grampsviz view showing the Reduce and Combine stages]

Page 66: Tunability: Knobs

• Graph topology/design:

[Figure: Sort-Middle vs. Sort-Last graph topologies]

• Sizing critical queues:

Page 67: Alternatives

Page 68: A few other tidbits

• In-place Shader stages / coalescing inputs
• Instanced Thread stages
• Queues as barriers / read all-at-once

[Figure: Image Histogram Pipeline]

Page 69: Performance: Good Scale-out

• (Footprint: Good; detail a little later)

[Figure: parallel speedup vs. hardware threads]

Page 70: Seeing is believing (ray tracer)

[Figure: ray tracer execution traces under GRAMPS, Static (SAS), Task-Stealing, and Breadth-First]

Page 71: Comparison: Execution time

• Small ‘Sched’ time, even with large graphs

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First)]