Hybrid access-specific software cache techniques for the Cell BE …zz124/cs516_spring2015/... · 2015. 4. 20. · [Figure: "Complexity (MFLOP)" axis; "Speedup not including" label]

Hybrid access-specific software cache techniques for the Cell BE architecture

April 16, 2015

Cell BE architecture

Traditional cache approach

Software cache
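As a rough illustration of what a software cache does on an SPE, the sketch below simulates a direct-mapped, write-back software cache in plain C: "main memory" is an ordinary array and `memcpy` stands in for the DMA transfer into the local store. The sizes, the names `cache_lookup` and `cache_flush`, and the write-back policy are assumptions chosen for illustration; this is not the design of either cache named in the slides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sizes (assumptions, not taken from the slides). */
#define MEM_SIZE   (1u << 16)   /* bytes of simulated main memory     */
#define LINE_SIZE  128u         /* bytes per cache line               */
#define NUM_LINES  64u          /* lines held in the simulated store  */

static uint8_t  main_memory[MEM_SIZE];            /* stands in for off-chip memory */
static uint8_t  line_data[NUM_LINES][LINE_SIZE];  /* stands in for the local store */
static uint32_t line_tag[NUM_LINES];              /* line address held in each slot */
static int      line_valid[NUM_LINES], line_dirty[NUM_LINES];
static unsigned long cache_hits, cache_misses;

/* Direct-mapped lookup: returns a local pointer for main-memory offset addr,
   writing back a dirty victim and copying the enclosing line in on a miss. */
static uint8_t *cache_lookup(uint32_t addr, int is_write) {
    assert(addr < MEM_SIZE);
    uint32_t tag = addr / LINE_SIZE;   /* which line of main memory        */
    uint32_t idx = tag % NUM_LINES;    /* which slot it must live in       */
    if (!line_valid[idx] || line_tag[idx] != tag) {
        cache_misses++;
        if (line_valid[idx] && line_dirty[idx])   /* write back the victim */
            memcpy(&main_memory[line_tag[idx] * LINE_SIZE], line_data[idx], LINE_SIZE);
        memcpy(line_data[idx], &main_memory[tag * LINE_SIZE], LINE_SIZE); /* the "DMA get" */
        line_tag[idx] = tag;
        line_valid[idx] = 1;
        line_dirty[idx] = 0;
    } else {
        cache_hits++;
    }
    if (is_write)
        line_dirty[idx] = 1;
    return &line_data[idx][addr % LINE_SIZE];
}

/* Write every dirty line back to main memory (e.g. at the end of a kernel). */
static void cache_flush(void) {
    for (uint32_t i = 0; i < NUM_LINES; i++) {
        if (line_valid[i] && line_dirty[i]) {
            memcpy(&main_memory[line_tag[i] * LINE_SIZE], line_data[i], LINE_SIZE);
            line_dirty[i] = 0;
        }
    }
}
```

With these sizes, a sequential byte walk misses once per 128-byte line and hits otherwise, while two offsets exactly 64 lines apart map to the same slot and keep evicting each other. Every access also pays for the tag check, which is the kind of overhead a compiler-managed software cache has to keep small.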

High locality cache

Transactional cache

C code transformation

Cell benchmark

Cache overhead

Cell vs POWER5 benchmark

Scalability of Cell

Scalability of POWER5

Efficient computation of sum-products on GPUs through software-managed cache

April 16, 2015

Marginalization of a product of functions
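Marginalizing a product of functions means multiplying factors that share variables and summing out the variables that should not appear in the result. A minimal sketch of the two-factor case, assuming discrete variables whose functions are stored as dense row-major tables (the name `mpf_marginalize_y` is hypothetical): with the product as ⊗ and summation as the marginalizing operation, summing $y$ out of $f(x,y) \otimes g(y,z)$ is exactly a matrix product, which is why matrix multiplication is a natural model for MPF kernels.

```c
#include <assert.h>
#include <stddef.h>

/* x, y, z range over {0..nx-1}, {0..ny-1}, {0..nz-1}.
   f is an nx-by-ny table, g is ny-by-nz, k receives nx-by-nz (all row-major).
   Computes k(x,z) = sum over y of f(x,y) * g(y,z), i.e. marginalizes y out. */
static void mpf_marginalize_y(const double *f, const double *g, double *k,
                              size_t nx, size_t ny, size_t nz) {
    assert(f != NULL && g != NULL && k != NULL);
    for (size_t x = 0; x < nx; x++) {
        for (size_t z = 0; z < nz; z++) {
            double acc = 0.0;                       /* running marginal sum */
            for (size_t y = 0; y < ny; y++)
                acc += f[x * ny + y] * g[y * nz + z];
            k[x * nz + z] = acc;
        }
    }
}
```

Setting `nz = 1` turns `g` into a function of $y$ alone and yields a marginal over $x$; larger problems with more variables reduce to the same pattern once the shared and summed-out variables are grouped, which is where the indexing and data layout become the hard part.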

MPF bucketization

Computing MPF

MPF access patterns

MPF problem

MPF kernel

Arithmetic intensity

$A = \dfrac{\text{compute operations}}{\text{memory operations}}$

Arithmetic intensity (example 1)

$k(x) = f(x) \otimes g(x)$

$A = \frac{1}{3}$: one $\otimes$ per element, against two loads and one store.
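To make the definition concrete, a small sketch that counts operations while computing the pointwise product, under a simple cost model assumed here: each element load or store is one memory operation, each ⊗ (taken to be ordinary multiplication) is one compute operation, and registers, caches and vector loads are ignored. The names `op_count`, `pointwise_product` and `arithmetic_intensity` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Running totals under the assumed cost model. */
typedef struct { unsigned long compute, memory; } op_count;

/* k[i] = f[i] otimes g[i], with otimes taken to be multiplication. */
static void pointwise_product(const double *f, const double *g, double *k,
                              size_t n, op_count *ops) {
    for (size_t i = 0; i < n; i++) {
        double a = f[i], b = g[i];   /* 2 loads             */
        k[i] = a * b;                /* 1 multiply, 1 store */
        ops->memory += 3;
        ops->compute += 1;
    }
}

/* A = compute operations / memory operations. */
static double arithmetic_intensity(op_count ops) {
    assert(ops.memory > 0);
    return (double)ops.compute / (double)ops.memory;
}
```

For any `n > 0` the counters come out as `n` compute and `3n` memory operations, so `arithmetic_intensity` returns 1/3 regardless of the input size: a pure streaming kernel like this one is bound by memory traffic.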

Arithmetic intensity (example 2)

Matrices M × N and N × K

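$A = \frac{2N-1}{2N+1}$: each output element takes $N$ multiplications and $N-1$ additions, against $2N$ loads and one store.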
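The same counting model applied to a naive matrix product, where every operand is fetched from memory each time it is used. The sketch (`matmul_uncached` and `op_count` are hypothetical names) tallies $N$ multiplications and $N-1$ additions per output element, against $2N$ loads and one store.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned long compute, memory; } op_count;

/* C (M x K) = A (M x N) times B (N x K), row-major, with no reuse:
   every operand is counted as a fresh load each time it is used. */
static void matmul_uncached(const double *A, const double *B, double *C,
                            size_t M, size_t N, size_t K, op_count *ops) {
    assert(N > 0);
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < K; j++) {
            double acc = A[i * N] * B[j];            /* 2 loads, 1 multiply        */
            ops->memory += 2;
            ops->compute += 1;
            for (size_t n = 1; n < N; n++) {
                acc += A[i * N + n] * B[n * K + j];  /* 2 loads, 1 multiply, 1 add */
                ops->memory += 2;
                ops->compute += 2;
            }
            C[i * K + j] = acc;                      /* 1 store                    */
            ops->memory += 1;
        }
    }
}
```

For $M = 2$, $N = 3$, $K = 2$ the four output elements cost 20 compute and 28 memory operations in total, i.e. $5/7 = (2 \cdot 3 - 1)/(2 \cdot 3 + 1)$. However large $N$ gets, the ratio stays below 1.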

Arithmetic intensity (example 2 with cache)

Matrices M × N and N × K

$A = \dfrac{2N-1}{N\left(\frac{1}{M}+\frac{1}{K}\right)+1} = \dfrac{2-\frac{1}{N}}{\frac{1}{M}+\frac{1}{N}+\frac{1}{K}}$
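The uncached and cached models can be compared numerically. The sketch evaluates the uncached formula, the cached closed form, and the cached ratio computed from whole-kernel totals, assuming each input element is loaded exactly once ($MN + NK$ loads) and each output stored once ($MK$ stores). Dividing both totals by $MNK$ recovers the closed form, so the last two functions should agree. All function names are hypothetical.

```c
#include <assert.h>

/* Uncached model: (2N - 1) compute ops per (2N + 1) memory ops. */
static double intensity_uncached(double N) {
    assert(N > 0);
    return (2.0 * N - 1.0) / (2.0 * N + 1.0);
}

/* Cached model, closed form: (2 - 1/N) / (1/M + 1/N + 1/K). */
static double intensity_cached(double M, double N, double K) {
    assert(M > 0 && N > 0 && K > 0);
    return (2.0 - 1.0 / N) / (1.0 / M + 1.0 / N + 1.0 / K);
}

/* Cached model from totals: MK(2N - 1) compute ops over MN + NK + MK memory ops. */
static double intensity_cached_totals(double M, double N, double K) {
    assert(M > 0 && N > 0 && K > 0);
    double compute = M * K * (2.0 * N - 1.0);
    double memory  = M * N + N * K + M * K;
    return compute / memory;
}
```

For $M = N = K = 64$ the uncached intensity stays just under 1 ($127/129$), while the cached one is $127/3 \approx 42.3$. For square matrices the cached value is $(2N-1)/3$, so it grows roughly linearly with the matrix size.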

Speed

Cached arithmetic intensity

Index vector

$\langle x, z, w, y \rangle$

Benchmarks

Random data performance

Random data speedup

Performance by cache size

Performance by # cache pages per thread block

Loop unrolling

Texture cache

Speedup overhead

Speedup with/without overhead