Programmable Accelerators
Jason Lowe-Power
powerjg@cs.wisc.edu
cs.wisc.edu/~powerjg

University of Wisconsin–Madison
pages.cs.wisc.edu/.../ibm-workshop-programmable-accelerators-for-p…


Increasing specialization

Need to program these accelerators

Challenges
1. Consistent pointers
2. Data movement
3. Security (Fast)

This talk: GPGPUs

*NVIDIA via anandtech.com

Programming accelerators (baseline)

CPU-side code:

    int main() {
        int a[N], b[N], c[N];

        init(a, b, c);

        add(a, b, c);

        return 0;
    }

Accelerator-side code:

    void add(int *a, int *b, int *c) {
        for (int i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }

Programming accelerators (GPU)

Accelerator-side code:

    void add_gpu(int *a, int *b, int *c) {
        /* Each work-item starts at its own index and strides by the total count. */
        for (int i = get_global_id(0); i < N; i += get_global_size(0)) {
            c[i] = a[i] + b[i];
        }
    }

CPU-side code:

    int main() {
        int a[N], b[N], c[N];

        init(a, b, c);

        add_gpu(a, b, c);

        return 0;
    }
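
The strided loop in add_gpu can be hard to picture from the slide alone. Below is a minimal sketch in plain C that emulates it on a CPU. The names add_work_item and add_emulated are hypothetical, and the parameters gid and gsize stand in for get_global_id(0) and get_global_size(0); nothing here is taken from a real GPU runtime.

```c
#include <stddef.h>

/* Emulate one work-item of the strided loop: `gid` stands in for
 * get_global_id(0) and `gsize` for get_global_size(0). */
void add_work_item(const int *a, const int *b, int *c,
                   size_t n, size_t gid, size_t gsize)
{
    for (size_t i = gid; i < n; i += gsize)
        c[i] = a[i] + b[i];
}

/* Run every work-item in turn; on a GPU these would run in parallel. */
void add_emulated(const int *a, const int *b, int *c,
                  size_t n, size_t gsize)
{
    for (size_t gid = 0; gid < gsize; gid++)
        add_work_item(a, b, c, n, gid, gsize);
}
```

With n = 10 and gsize = 4, work-item 0 handles elements 0, 4, and 8, work-item 1 handles 1, 5, and 9, and so on, so every element is written exactly once.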

Programming accelerators (GPU)

Accelerator-side code:

    void add_gpu(int *a, int *b, int *c) {
        for (int i = get_global_id(0); i < N; i += get_global_size(0)) {
            c[i] = a[i] + b[i];
        }
    }

CPU-side code:

    int main() {
        int a[N], b[N], c[N];
        int *d_a, *d_b, *d_c;

        /* Allocate separate device buffers. */
        cudaMalloc(&d_a, N * sizeof(int));
        cudaMalloc(&d_b, N * sizeof(int));
        cudaMalloc(&d_c, N * sizeof(int));

        init(a, b, c);

        /* Copy the inputs to the device. */
        cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

        /* The kernel must be given the device pointers, not the host arrays. */
        add_gpu(d_a, d_b, d_c);

        /* Copy the result back to the host. */
        cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);

        return 0;
    }

Programming accelerators (GOAL)

Accelerator-side code:

    void add_gpu(int *a, int *b, int *c) {
        for (int i = get_global_id(0); i < N; i += get_global_size(0)) {
            c[i] = a[i] + b[i];
        }
    }

CPU-side code:

    int main() {
        int a[N], b[N], c[N];

        init(a, b, c);

        add_gpu(a, b, c);

        return 0;
    }

Key challenges

[Diagram: the CPU loads a[i] through a virtual address pointer (ld: 0x1000000); the MMU translates it to physical address 0x5000, which goes to the cache and memory.]

Key challenges

[Diagram: the same load of a[i] (virtual address pointer ld: 0x1000000, physical address 0x5000), now with a GPU that has its own MMU and cache alongside the CPU's MMU and cache, both in front of memory.]

Consistent pointers
Data movement

Consistent pointers
Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]

Data movement
Heterogeneous System Coherence [MICRO 2013]

Heterogeneous System

[Diagram: four CPU cores, each with an L1 and an L2 cache; eight GPU cores, each with an L1 cache, and one GPU L2 cache; a directory; a shared memory controller.]

Why not CPU solutions?

It’s all about bandwidth!

Translating 100s of addresses

500 GB/s at the directory (many accesses per cycle)

[Chart: Theoretical Memory Bandwidth (GB/s). *NVIDIA via anandtech.com]
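
To put "many accesses per cycle" into numbers, here is a back-of-the-envelope sketch in C. The 64-byte block size and the 1 GHz directory clock used in the usage note are assumptions for illustration, not figures from the talk, and accesses_per_cycle is a hypothetical helper name.

```c
/* Directory accesses per cycle needed to sustain a given bandwidth,
 * assuming one directory access per cache block transferred. */
double accesses_per_cycle(double bytes_per_second,
                          double block_bytes,
                          double clock_hz)
{
    return bytes_per_second / block_bytes / clock_hz;
}
```

Under those assumptions, accesses_per_cycle(500e9, 64.0, 1e9) is a little under 8, so the directory would have to service several requests every cycle.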

Consistent pointers
Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]

Data movement
Heterogeneous System Coherence [MICRO 2013]

Why virtual addresses?

[Diagram: Virtual memory]

Why virtual addresses?

[Diagram: virtual memory alongside a separate GPU address space. Labels: "Simply copy data", "Transform to new pointers" (shown twice).]

Bandwidth problem

[Diagram: the CPU sends virtual memory requests to the TLB, which emits physical memory requests.]

Bandwidth problem

[Diagram: 16 lanes, the GPU processing elements of one GPU core.]

Solution: Filtering

[Diagram: the 16 lanes (GPU processing elements of one GPU core) feed a shared memory (scratchpad) and a coalescer, with the TLB after them. Request-rate labels in the figure: 1x, 0.45x, 0.06x.]
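
The effect of placing the TLB after the coalescer can be sketched in a few lines of C. This is not the hardware design from the talk; it only counts how many distinct pages one group of lane addresses touches, which is the number of translations a post-coalescer TLB would have to service for that group. The 4 KB page size, the MAX_LANES bound, and the name count_unique_pages are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12   /* assumed 4 KB pages */
#define MAX_LANES  64   /* assumed upper bound on lanes per group */

/* Count the distinct virtual pages touched by one group of lane addresses.
 * Without filtering, every lane would issue its own translation request. */
size_t count_unique_pages(const uint64_t *lane_addrs, size_t n_lanes)
{
    uint64_t pages[MAX_LANES];
    size_t n_pages = 0;

    if (n_lanes > MAX_LANES)
        n_lanes = MAX_LANES;   /* clamp to the scratch array size */

    for (size_t i = 0; i < n_lanes; i++) {
        uint64_t vpn = lane_addrs[i] >> PAGE_SHIFT;
        int seen = 0;
        for (size_t j = 0; j < n_pages; j++) {
            if (pages[j] == vpn) { seen = 1; break; }
        }
        if (!seen)
            pages[n_pages++] = vpn;
    }
    return n_pages;
}
```

Sixteen lanes reading consecutive 4-byte elements starting at 0x1000000 touch a single page, so 16 lane requests collapse into one translation. Sixteen lanes striding by 4096 bytes touch 16 pages and get no benefit.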

Poor performance
Average 3x slowdown

Bottleneck 1: Bursty TLB misses

Average: 60 outstanding requests

Max: 140 requests

Huge queuing delays

Solution: Highly-threaded page table walker
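
A toy model shows why a burst of misses queues up behind a walker that handles few walks at a time. The sketch below is an assumption-laden illustration, not a result from the talk: burst_drain_cycles is a hypothetical helper, walks are assumed to start in fixed batches, and the walk latency and walker counts in the usage note are made-up inputs.

```c
/* Cycles to drain a burst of `misses` TLB misses when the walker can run
 * `walkers` page walks at once and each walk takes `walk_latency` cycles.
 * Toy model: walks start in batches and batches do not overlap. */
unsigned long burst_drain_cycles(unsigned long misses,
                                 unsigned long walkers,
                                 unsigned long walk_latency)
{
    if (walkers == 0)
        return 0;   /* invalid configuration; nothing can be modeled */
    unsigned long batches = (misses + walkers - 1) / walkers;  /* ceiling */
    return batches * walk_latency;
}
```

For a burst of 60 misses and an assumed 100-cycle walk, a single-threaded walker needs 6000 cycles under this model, while a walker with 32 concurrent walks needs 200.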

Bottleneck 2: High miss rate

Large 128-entry TLB doesn’t help

Many address streams

Need low latency

Solution: Shared page-walk cache
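
One way to see why a page-walk cache lowers miss latency: nearby pages share most of the upper levels of the page table, so those steps of the walk can be served from a cache. The sketch below assumes standard 4-level x86-64 paging with 4 KB pages and 9 index bits per level; walk_index and shared_upper_levels are hypothetical helper names, and the code does not model the cache itself.

```c
#include <stdint.h>

/* Index into page-table level `level` (3 = PML4, 2 = PDPT, 1 = PD,
 * 0 = page table) for a 48-bit x86-64 virtual address with 4 KB pages. */
unsigned walk_index(uint64_t vaddr, int level)
{
    return (unsigned)((vaddr >> (12 + 9 * level)) & 0x1FF);
}

/* Number of upper levels (PML4, PDPT, PD) whose entries two addresses
 * share, counted from the root. A page-walk cache can serve those steps. */
int shared_upper_levels(uint64_t a, uint64_t b)
{
    int shared = 0;
    for (int level = 3; level >= 1; level--) {
        if (walk_index(a, level) != walk_index(b, level))
            break;
        shared++;
    }
    return shared;
}
```

Addresses 0x1000000 and 0x1001000 are adjacent pages and share all three upper-level entries, so after the first walk only the final level needs a memory access. Addresses 0x1000000 and 0x1200000 fall in different 2 MB ranges and share two.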

Performance: Low overhead

Worst case: 12% slowdown

Average: Less than 2% slowdown

Consistent pointers
Supporting x86-64 Address Translation for 100s of GPU Lanes

Shared virtual memory is important

Non-exotic MMU design
• Post-coalescer L1 TLBs
• Highly-threaded page table walker
• Page walk cache

Full compatibility with minimal overhead

Still room to optimize

Consistent pointers
Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]

Data movement
Heterogeneous System Coherence [MICRO 2013]

Legacy Interface

[Diagram: four CPU cores (each with L1 and L2), eight GPU cores (each with L1, one GPU L2), a directory, and memory. An "Inv" label appears in the figure.]

1. CPU writes memory
2. CPU initiates DMA
3. GPU direct access

High bandwidth
No directory access

CC Interface

[Diagram: four CPU cores (each with L1 and L2), eight GPU cores (each with L1, one GPU L2), a directory, and memory. An "Inv" label appears in the figure.]

1. CPU writes memory
2. GPU access

Bottleneck: Directory
1. Access rate
2. Buffering

Directory Bottleneck 1: Access rate

[Chart: directory accesses per cycle (axis 0 to 4.5) for benchmarks bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm.]

Many requests per cycle

Difficult to design multi-ported directory

Directory Bottleneck 2: Buffering

[Chart: maximum MSHRs (log axis, 1 to 100000) for benchmarks bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm.]

Must track many outstanding requests

Huge queuing delays

Solution: Reduce pressure on directory

HSC Design

[Diagram: four CPU cores (each with L1 and L2), eight GPU cores (each with L1, one GPU L2), two region buffers, a region directory, and memory. The region directory is labeled "Only permission traffic".]

Goal: Direct access (B/W) + Cache coherence

Add:
Region Directory
Region Buffers

Decouples permission from access
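
The idea of decoupling permission from access can be sketched as a small C model. This is a simplified illustration, not the design from the paper: it assumes 1 KB regions and a 256-entry direct-mapped region buffer, ignores invalidations, read versus write permission, and sharing with other cores, and all names (region_buffer, region_buffer_access, and so on) are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define REGION_SHIFT   10    /* assumed 1 KB regions */
#define BUFFER_ENTRIES 256   /* assumed direct-mapped region buffer size */

struct region_buffer {
    uint64_t tag[BUFFER_ENTRIES];
    uint8_t  valid[BUFFER_ENTRIES];
    unsigned long directory_requests;  /* permission traffic */
    unsigned long direct_accesses;     /* data traffic that bypasses the directory */
};

void region_buffer_init(struct region_buffer *rb)
{
    memset(rb, 0, sizeof(*rb));
}

/* Model one memory access from a core behind this region buffer. */
void region_buffer_access(struct region_buffer *rb, uint64_t paddr)
{
    uint64_t region = paddr >> REGION_SHIFT;
    unsigned slot = (unsigned)(region % BUFFER_ENTRIES);

    if (!rb->valid[slot] || rb->tag[slot] != region) {
        /* No permission held for this region: ask the region directory once. */
        rb->directory_requests++;
        rb->tag[slot] = region;
        rb->valid[slot] = 1;
    }
    /* With permission held, the data access itself goes straight to memory. */
    rb->direct_accesses++;
}
```

Streaming through sixteen 64-byte blocks of one region costs one directory request and sixteen direct accesses in this model, which is the spatial-locality effect the design relies on.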

HSC: Performance Improvement

[Chart: normalized speed-up (axis 0 to 5) for benchmarks bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm.]

Data movement
Heterogeneous System Coherence

Want cache coherence without sacrificing bandwidth

Major bottlenecks in current coherence implementations
1. High bandwidth difficult to support at directory
2. Extreme resource requirements

Heterogeneous System Coherence
Leverages spatial locality
Reduces bandwidth and resource requirements by 95%

Increasing specialization

Need to program these accelerators

Challenges
1. Consistent pointers
2. Data movement
3. Security (Fast)

This talk: GPGPUs

*NVIDIA via anandtech.com

Security & tightly-integrated accelerators

[Diagram: trusted CPU cores (each with L1 and L2) and untrusted accelerators (each with L1 and TLB), an IOMMU, and memory holding OS (protected) data and process data.]

What if accelerators come from 3rd parties? Untrusted!

All accesses via IOMMU
Safe
Low performance

Bypass IOMMU
High performance
Unsafe

Border control: sandboxing accelerators [MICRO 2015]

[Diagram: trusted CPU cores (each with L1 and L2) and untrusted accelerators (each with L1 and TLB), each accelerator paired with a Border Control unit, an IOMMU, and memory holding OS (protected) data and process data.]

Solution: Border control

Key Idea: Decouple translation from safety

Safety + Performance
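
The key idea, decoupling translation from safety, can be illustrated with a toy C model. The accelerator is free to translate addresses on its own, and a separate trusted check decides whether each physical request is allowed. This is a sketch under stated assumptions, not the MICRO 2015 design: the per-page permission table, the 4 KB page size, the 1024-page toy memory, and every function name are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12      /* assumed 4 KB pages */
#define NUM_PAGES  1024    /* assumed size of the toy physical memory */

enum { PERM_READ = 1, PERM_WRITE = 2 };

/* Per-physical-page permissions, written only by the trusted side. */
static uint8_t perm_table[NUM_PAGES];

/* Trusted path: record what the process may do with a physical page
 * at the moment a translation is handed to the accelerator. */
void border_grant(uint64_t ppn, uint8_t perms)
{
    if (ppn < NUM_PAGES)
        perm_table[ppn] |= perms;
}

/* Trusted path: drop all permissions, e.g. when the process exits. */
void border_revoke_all(void)
{
    for (uint64_t i = 0; i < NUM_PAGES; i++)
        perm_table[i] = 0;
}

/* Check applied to every physical request leaving the accelerator. */
bool border_check(uint64_t paddr, bool is_write)
{
    uint64_t ppn = paddr >> PAGE_SHIFT;
    if (ppn >= NUM_PAGES)
        return false;
    uint8_t need = is_write ? PERM_WRITE : PERM_READ;
    return (perm_table[ppn] & need) != 0;
}
```

After border_grant(5, PERM_READ), a read of physical address 0x5000 passes the check, while a write to 0x5000 or any access to 0x6000 is rejected, regardless of what the accelerator's own TLB or caches contain.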

Conclusions

Challenges
1. Consistent addresses: GPU MMU Design
2. Data movement: Heterogeneous System Coherence
3. Security: Border Control

Goal: Enable programmers to use the whole chip

Consistent pointers
Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]
Jason Power, Mark D. Hill, David A. Wood

Data movement
Heterogeneous System Coherence [MICRO 2013]
Jason Power, Arkaprava Basu*, Junli Gu*, Sooraj Puthoor*, Bradford M. Beckmann*, Mark D. Hill, Steven K. Reinhardt*, David A. Wood

Security
Border Control: Sandboxing Accelerators [MICRO 2015]
Lena E. Olson, Jason Power, Mark D. Hill, David A. Wood

Contact: Jason Lowe-Power
powerjg@cs.wisc.edu
cs.wisc.edu/~powerjg

I’m on the job market this year!
Graduating in Spring

Other work

Analytic database + Tightly-integrated GPUs

When to use 3D Die-Stacked Memory for Bandwidth-Constrained Big-Data Workloads [BPOE 2016]
Jason Lowe-Power, Mark D. Hill, David A. Wood

Towards GPUs being mainstream in analytic processing [DaMoN 2015]
Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, David A. Wood

Implications of Emerging 3D GPU Architecture on the Scan Primitive [SIGMOD Rec. 2015]
Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, David A. Wood

Simulation Infrastructure

gem5-gpu: A Heterogeneous CPU-GPU Simulator [CAL 2014]
Jason Power, Joel Hestness, Marc S. Orr, Mark D. Hill, David A. Wood

Comparison to CAPI/OpenCAPI

Property                          CAPI   My work
Same virtual address space        Yes    Yes
Cache coherent                    Yes    Yes
System safety from accelerator    Yes    Yes
Assumes on-chip accel.            No     Yes
Allows accel. physical caches     No     Yes
Allows pre-translation            No     Yes

Allows for high-performance accelerator optimizations