57
Bill Dally | Chief Scientist and SVP, Research NVIDIA | Professor (Research), EE&CS, Stanford Efficiency and Programmability: Enablers for ExaScale

Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Embed Size (px)

Citation preview

Page 1: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Bill Dally | Chief Scientist and SVP, Research NVIDIA | Professor (Research), EE&CS, Stanford

Efficiency and Programmability: Enablers for ExaScale

Page 2: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Scientific Discovery and Business Analytics

Driving an Insatiable Demand for More Computing Performance

Page 3: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Compute Communication Memory &

Storage

HPC Analytics

Page 4: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Source: C Moore, Data Processing in ExaScale-ClassComputer Systems, Salishan, April 2011

The End of Historic Scaling

Page 5: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

“Moore’s Law gives us more transistors…

Dennard scaling made them useful.”

Bob Colwell, DAC 2013, June 4, 2013

Page 6: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

18,688 NVIDIA Tesla K20X GPUs

27 Petaflops Peak: 90% of Performance from GPUs

17.59 Petaflops Sustained Performance on Linpack

Numerous real science applications

2.14GF/W – Most efficient accelerator

TITAN

Page 7: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

18,688 NVIDIA Tesla K20X GPUs

27 Petaflops Peak: 90% of Performance from GPUs

17.59 Petaflops Sustained Performance on Linpack

Numerous real science applications

2.14GF/W – Most efficient accelerator

TITAN

Page 8: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

20PF 18,000 GPUs

10MW 2 GFLOPs/W ~10

7 Threads

You Are Here

1,000PF (50x) 72,000HCNs (4x)

20MW (2x) 50 GFLOPs/W (25x)

~1010 Threads (1000x)

2013

2020

Page 9: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

0

5

10

15

20

25

30

35

40

45

50

2013 2014 2015 2016 2017 2018 2019 2020

GFLO

PS/W

Needed

Process

EFFICIENCY GAP

Page 10: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery
Page 11: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

CIRCUITS 3X

PROCESS 2.2X

1

10

2013 2014 2015 2016 2017 2018 2019 2020

GFLO

PS/W

Needed

Process

Page 12: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Simpler Cores = Energy Efficiency

Source: Azizi [PhD 2010]

Page 13: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

CPU 1690 pJ/flop

Optimized for Latency

Caches

Westmere 32 nm

GPU 140 pJ/flop

Optimized for Throughput

Explicit Management of On-chip Memory

Kepler 28 nm

Page 14: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

1

10

2013 2014 2015 2016 2017 2018 2019 2020

GFLO

PS/W

Needed

Process

CIRCUITS 3X

PROCESS 2.2X

ARCHITECTURE 4X

Page 15: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Programmers, Tools, and Architecture Need to Play Their Positions

Programmer

Architecture Tools

Page 16: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Programmers, Tools, and Architecture Need to Play Their Positions

Algorithm

All of the parallelism

Abstract locality

Fast mechanisms

Exposed costs

Combinatorial optimization

Mapping

Selection of mechanisms

Programmer

Architecture Tools

Page 17: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

DRAM I/O DRAM I/O DRAM I/O DRAM I/ONW I/O

LO

C

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

LO

C

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

LO

C

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

LO

C

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

LO

C

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

LO

C

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

LO

C

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

LO

C

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

NOC

SM

SM

SM

SM

DRAM I/O DRAM I/O DRAM I/O DRAM I/ONW I/O

DR

AM

I/OD

RA

M I/O

DR

AM

I/OD

RA

M I/O

NW

I/O

DR

AM

I/OD

RA

M I/O

DR

AM

I/OD

RA

M I/O

NW

I/O

L2

Banks

XBAR

NOC

SM

La

ne

La

ne

La

ne

La

ne

La

ne

La

ne

La

ne

La

ne

SM

SM

Page 18: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

• <1ms Latency

• Scalable bandwidth

• Small messages 50% @ 32B

• Global adaptive routing

• PGAS

• Collectives & Atomics

• MPI Offload

An Enabling HPC Network

Page 19: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

An Open HPC Network Ecosystem

Common:

• Software-NIC API

• NIC-Router Channel

Processor/NIC Vendors

System Vendors

Networking Vendors

Page 20: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Programmer

Architecture Tools

Programming

Parallelism

Heterogeneity

Hierarchy

Power

25x Efficiency with 2.2x from process

Page 21: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

“Super” Computing From Super Computers to Super Phones

Page 22: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Backup

Page 23: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery
Page 24: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Source: Moore, Electronics 38(8) April 19, 1965

In The Past, Demand Was Fueled

by Moore’s Law

Page 25: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Source: Dally et al. “The Last Classical Computer”, ISAT Study, 2001

ILP Was Mined Out in 2001

0.0001

0.001

0.01

0.1

1

10

100

1000

10000

100000

1000000

10000000

1980 1990 2000 2010 2020

Perf (ps/Inst)

Linear (ps/Inst)

30:1

1,000:1

30,000:1

Page 26: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Source: Moore, ISSCC Keynote, 2003

Voltage Scaling Ended in 2005

Page 27: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Moore’s law is alive and well, but…

Instruction-level parallelism (ILP) was mined out in 2001

Voltage scaling (Dennard scaling) ended in 2005

Most power is spent on communication

What does this mean to you?

Summary

Page 28: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Source: C Moore, Data Processing in ExaScale-ClassComputer Systems, Salishan, April 2011

The End of Historic Scaling

Page 29: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

All performance is from parallelism

Machines are power limited (efficiency IS performance)

Machines are communication limited (locality IS performance)

In the Future

Page 30: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Two Major Challenges

Programming

Parallel (1010 threads)

Hierarchical

Heterogeneous

Energy Efficiency

25x in 7 years (~2.2x from process)

Page 31: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

1

10

2013 2014 2015 2016 2017 2018 2019 2020

GFLO

PS/W

Needed

Process

Page 32: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

How is Power Spent in a CPU?

In-order Embedded OOO Hi-perf

Clock + Control Logic

24%

Data Supply

17%

Instruction Supply

42%

Register File

11%

ALU 6% Clock + Pins

45%

ALU

4%

Fetch

11%

Rename

10%

Issue

11%

RF

14%

Data Supply

5%

Dally [2008] (Embedded in-order CPU) Natarajan [2003] (Alpha 21264)

Page 33: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Processor Technology 40 nm 10nm

Vdd (nominal) 0.9 V 0.7 V

DFMA energy 50 pJ 7.6 pJ

64b 8 KB SRAM Rd 14 pJ 2.1 pJ

Wire energy (256 bits, 10mm) 310 pJ 174 pJ

Memory Technology 45 nm 16nm

DRAM interface pin bandwidth 4 Gbps 50 Gbps

DRAM interface energy 20-30 pJ/bit 2 pJ/bit

DRAM access energy 8-15 pJ/bit 2.5 pJ/bit

Keckler [Micro 2011], Vogelsang [Micro 2010]

Energy Shopping List

FP Op lower bound

=

4 pJ

Page 34: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery
Page 35: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

CIRCUITS 3X

PROCESS 2.2X

1

10

2013 2014 2015 2016 2017 2018 2019 2020

GFLO

PS/W

Needed

Process

Page 36: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Throughput-Optimized Core (TOC)

Latency-Optimized Core

(LOC)

PC

PC

Branch Predict

I$

Register Rename

ALU 1 ALU 2 ALU 4 ALU 3

Reorder Buffer

Instruction Window

ALU 1 ALU 2 ALU 4 ALU 3

I$

Select

Register File

PCs

Register File

Page 37: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

SIMT Lanes

Streaming Multiprocessor (SM)

Warp Scheduler

Shared Memory 32 KB

15% of SM Energy Main Register File

32 banks

ALU SFU MEM TEX

Page 38: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Hierarchical Register File

0%

20%

40%

60%

80%

100%

Perc

ent

of

All V

alu

es

Pro

duced

Read >2 Times

Read 2 Times

Read 1 Time

Read 0 Times

0%

20%

40%

60%

80%

100%

Perc

ent

of

All V

alu

es

Pro

duced

Lifetime >3

Lifetime 3

Lifetime 2

Lifetime 1

Page 39: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Register File Caching (RFC)

S F U

M E M

T E X

Operand Routing

Operand Buffering

MRF 4x128-bit Banks (1R1W)

RFC 4x32-bit (3R1W) Banks

ALU

Page 40: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Energy Savings from RF Hierarchy 54% Energy Reduction

Source: Gebhart, et. al (Micro 2011)

Page 41: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Two Major Challenges

Programming

Parallel (1010 threads)

Hierarchical

Heterogeneous

Energy Efficiency

25x in 7 years (~2.2x from process)

Page 42: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Skills on LinkedIn Size (approx) Growth (rel)

C++ 1,000,000 -8%

Javascript 1,000,000 -1%

Python 429,000 7%

Fortran 90,000 -11%

MPI 21,000 -3%

x86 Assembly 17,000 -8%

CUDA 14,000 9%

Parallel programming 13,000 3%

OpenMP 8,000 2%

TBB 389 10%

6502 Assembly 256 -13%

Source: linkedin.com/skills (as of Jun 11, 2013)

Mainstream

Programming

Parallel and

Assembly

Programming

Page 43: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

forall molecule in set: # 1E6 molecules

forall neighbor in molecule.neighbors: # 1E2 neighbors ea

forall force in forces: # several forces

# reduction

molecule.force += force(molecule, neighbor)

Parallel Programming is Easy

Page 44: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

We Can Make It Hard

pid = fork() ; // explicitly managing threads

lock(struct.lock) ; // complicated, error-prone synchronization

// manipulate struct

unlock(struct.lock) ;

code = send(pid, tag, &msg) ; // partition across nodes

Page 45: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Programmers, Tools, and Architecture Need to Play Their Positions

Programmer

Architecture Tools

Page 46: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Programmers, Tools, and Architecture Need to Play Their Positions

Algorithm

All of the parallelism

Abstract locality

Fast mechanisms

Exposed costs

Combinatorial optimization

Mapping

Selection of mechanisms

Programmer

Architecture Tools

Page 47: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

OpenACC: Easy and Portable

!$acc parallel loop do i = 1, 20*128 !dir$ unroll 1000 do j = 1, 5000000 fa(i) = a * fa(i) + fb(i) end do end do

do i = 1, 20*128 do j = 1, 5000000 fa(i) = a * fa(i) + fb(i) end do end do Serial Code: SAXPY

Page 48: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Conclusion

Page 49: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Source: C Moore, Data Processing in ExaScale-ClassComputer Systems, Salishan, April 2011

The End of Historic Scaling

Page 50: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Parallelism is the source of all performance

Power limits all computing

Communication dominates power

Page 51: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Two Challenges

Programming

Parallelism

Heterogeneity

Hierarchy

Power

25x Efficiency with 2.2x from process

Page 52: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Programmer

Architecture Tools

Programming

Parallelism

Heterogeneity

Hierarchy

Power

25x Efficiency with 2.2x from process

Page 53: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

“Super” Computing From Super Computers to Super Phones

Page 54: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

64-bit DP 20pJ 26 pJ 256 pJ

1 nJ

500 pJ Efficient off-chip link

256-bit buses

16 nJ DRAM Rd/Wr

256-bit access 8 kB SRAM 50 pJ

20mm

Communication Takes More Energy Than Arithmetic

Page 55: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Key to Parallelism: Independent operations on independent

data

sum(map(multiply, x, x))

every pair-wise multiply is independent

parallelism is permitted

Page 56: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Flat computation

Key to Locality: Data decomposition should drive mapping

total = sum(x)

vs.

tiles = split(x)

partials = map(sum, tiles)

total = sum(partials)

Page 57: Efficiency and Programmability: Enablers for ExaScaleon-demand.gputechconf.com/.../2013/...Exascale.pdf · Efficiency and Programmability: Enablers for ExaScale . Scientific Discovery

Explicit decomposition

Key to Locality: Data decomposition should drive mapping

total = sum(x)

vs.

tiles = split(x)

partials = map(sum, tiles)

total = sum(partials)