Alleviating memory-bandwidth limitations for scalability and energy efficiency
Lessons learned from the optimization of SpMxV

Georgios Goumas [email protected]
Computing Systems Laboratory
National Technical University of Athens

Oct 3, 2013
Parallel Processing for Energy Efficiency (PP4EE)


Outline

1. Compression as an approach to scale up memory-bound applications
2. Sparse Matrices and SpMxV
3. CSX: A new storage format for sparse matrices
4. Conclusions – Areas of future research – Discussion


Application classes (based on their performance on shared memory systems)

[Figure: four cores, each with its own L1 cache, two L2 caches, and main memory (or off-chip cache).]

✓ Good scalability
✓ temporal locality
✓ no synchronization
✓ load balance

✗ Applications with intensive memory accesses
✗ (very) poor temporal locality
✗ high memory-to-computation ratio
✗ limited scalability due to contention on memory

Applications with intensive memory accesses (Example: memcomp benchmark)

memcomp: 1 LOAD, k ADDs, FP (double), unrolled

[Figure: speedup (1 to 8) vs. cores utilized (1 to 8), one curve for each of k = 1, ..., 8.]
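The memcomp access pattern (one load followed by k floating-point additions) can be sketched roughly as below. This is an illustration only, not the benchmark's source: the name `memcomp_kernel`, its signature, and the use of an inner loop in place of manual unrolling are assumptions.

```c
#include <stddef.h>

/* Hypothetical sketch of a memcomp-style kernel: for every element
 * loaded from memory (1 LOAD), perform k double-precision additions.
 * A larger k raises the computation-to-memory ratio. */
double memcomp_kernel(const double *a, size_t n, int k)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        double v = a[i];              /* 1 LOAD */
        for (int j = 0; j < k; j++)   /* k ADDs (unrolled in the benchmark) */
            acc += v;
    }
    return acc;
}
```

The parameter k is the knob that the k = 1, ..., 8 curves vary: with small k almost all time goes to memory traffic, while larger k gives each load more arithmetic to hide behind.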

Improving performance using compression

exchange memory cycles for CPU cycles

[Figure: execution time split into a compute part (c) and a memory part (m), shown for serial and parallel (4 cores) execution. With compression the parts become c' (including the decompression cost) and m'; on 4 cores the decompression cost is amortized (cost amortization).]
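One way to read the diagram is through a simple additive cost model. The model is an assumption made here for illustration and is not stated on the slide: compute time divides evenly across cores, memory time is limited by shared bandwidth and does not shrink, so T(p) = c/p + m. The function name and the sample numbers are hypothetical.

```c
/* Assumed model: compute time scales with the core count,
 * memory time is bounded by shared bandwidth and does not. */
double model_time(double compute, double memory, int cores)
{
    return compute / cores + memory;
}
```

For instance, take c = 4, m = 4 for the uncompressed kernel and c' = 6, m' = 2 for the compressed one. Serially both cost 8 units, so compression buys nothing. On 4 cores the model gives 5 versus 3.5 units: the extra decompression work is spread over the cores, while the memory saving is kept in full.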


Outline

1. Compression as an approach to scale up memory-bound applications
2. Sparse Matrices and SpMxV
    Storage formats
    Sparse-matrix vector multiplication: SpMxV
    SpMxV performance
3. CSX: A new storage format for sparse matrices
4. Conclusions – Areas of future research – Discussion

Sparse Matrices

Dominated by zeroes

Applications:
    PDEs
    Graphs
    Linear Programming

Efficient Representation (space and computation)
    non-zero values (value data)
    structure (index data)

Sparse storage formats
    COO: Basic
    CSR: most common, base-line
    BCSR: state of the art

CSR (Compressed Sparse Row)

    5.4  1.1  0    0    0    0
    0    6.3  0    7.7  0    8.8
    0    0    1.1  0    0    0
    0    0    2.9  0    3.7  2.9
    9.0  0    0    1.1  4.5  0
    1.1  0    2.9  3.7  0    1.1

row_ptr (nrows+1 entries): 0 2 5 6 9 12 16
col_ind (nnz entries):     0 1 1 3 5 2 2 4 5 0 3 4 0 2 3 5
values (nnz entries):      5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1
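As a rough illustration of how the three CSR arrays relate to the matrix, the sketch below builds them from a dense row-major array. The function name `dense_to_csr` and its signature are assumptions for this example. Real sparse matrices are normally never materialized in dense form.

```c
#include <stddef.h>

/* Build CSR arrays from a dense row-major matrix.
 * row_ptr must hold nrows+1 entries; col_ind and values must hold at
 * least nnz entries. Returns the number of non-zeros written. */
size_t dense_to_csr(const double *dense, size_t nrows, size_t ncols,
                    int *row_ptr, int *col_ind, double *values)
{
    size_t nnz = 0;
    for (size_t i = 0; i < nrows; i++) {
        row_ptr[i] = (int)nnz;                 /* where row i starts */
        for (size_t j = 0; j < ncols; j++) {
            double v = dense[i * ncols + j];
            if (v != 0.0) {
                col_ind[nnz] = (int)j;         /* structure (index data) */
                values[nnz] = v;               /* non-zero value (value data) */
                nnz++;
            }
        }
    }
    row_ptr[nrows] = (int)nnz;                 /* one past the last row */
    return nnz;
}
```

Applied to the 6×6 example matrix, this yields nnz = 16 and the row_ptr and col_ind arrays listed on the slide.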


SpMxV (sparse-matrix vector multiplication)

y = A · x, A is sparse

Important computational kernel
    Solving PDEs (GMRES, CG) for CFD, economic modeling
    Graphs (PageRank)
    abundant amount of research work

\[
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{pmatrix}
\cdot
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix}
y_1 = \sum_i a_{1i} \cdot x_i \\
y_2 = \sum_i a_{2i} \cdot x_i \\
y_3 = \sum_i a_{3i} \cdot x_i \\
y_4 = \sum_i a_{4i} \cdot x_i
\end{pmatrix}
\]

With $a_{22} = a_{23} = 0$: $y_2 = a_{21} \cdot x_1 + a_{24} \cdot x_4$


CSR SpMxV

    /* y must be zero-initialized; N is the number of rows */
    for (i = 0; i < N; i++)
        /* row limits: row i spans row_ptr[i] .. row_ptr[i+1]-1 */
        for (j = row_ptr[i]; j < row_ptr[i+1]; j++)
            y[i] += values[j] * x[col_ind[j]];  /* indirect access to x */

row_ptr: 0 2 5 6 9 12 16
col_ind: 0 1 1 3 5 2 2 4 5 0 3 4 0 2 3 5
x:       x0 x1 x2 x3 x4 x5
values:  5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1
y:       y0 y1 y2 y3 y4 y5

Example for i=3:
    row limits: row_ptr[3] = 6, row_ptr[4] = 9
    col_ind[6..8] = 2, 4, 5 selects x2, x4, x5 (indirect access)
    values[6..8] = 2.9, 3.7, 2.9 are multiplied (∗) with them and summed (∑): y3 = 2.9·x2 + 3.7·x4 + 2.9·x5
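A self-contained version of the loop above might look like the following sketch. The function name `csr_spmv` and the local accumulator are choices made for this example; the loop structure is the one shown on the slide.

```c
/* y = A * x for a matrix A stored in CSR. */
void csr_spmv(int nrows, const int *row_ptr, const int *col_ind,
              const double *values, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;                      /* per-row accumulator */
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += values[j] * x[col_ind[j]];  /* indirect access to x */
        y[i] = sum;                            /* overwrites y, no pre-zeroing needed */
    }
}
```

Per non-zero the kernel performs one multiply-add but reads one value and one column index, in addition to the indirectly addressed element of x. This ratio is why the following slides focus on memory traffic rather than on computation.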


Traditional SpMxV optimization methods

traditional goal: optimizing computation

specialized sparse storage formats (exploitation of “regularities”)

examples (regularity ↔ format):
    2D blocks of constant size ↔ BCSR [Im and Yelick ’01]
    1D blocks of variable size ↔ [Pinar and Heath ’99]
    Diagonals ↔ DIAG

Traditional SpMxV optimization: BCSR [Im and Yelick ’01]

CSR extension:
    r×c blocks instead of elements ⇒ per-block index information
    optimize computation (register blocking) ⇒ specialized SpMxV versions for r×c
    padding may be required

A =
    4.6  9.3  0    0    0    0    2.4  5.6
    8.6  8.2  0    0    0    0    5.3  1.6
    0    0    0    0    1.9  7.9  0    0
    0    0    0    0    7.1  0    0    0
    0    0    8.6  1.7  2.4  7.6  0    0
    0    0    3.9  2.2  3.0  3.3  0    0
    0    0    0    0    1.8  0    7.9  1.2
    0    0    0    0    0    7.8  1.0  5.3

brow_ptr: (0 2 3 5 7)

bcol_ind: (0 6 4 2 4 4 6)

blocks (2×2, in storage order):
    4.6 9.3    2.4 5.6    1.9 7.9    8.6 1.7    2.4 7.6    1.8 0      7.9 1.2
    8.6 8.2    5.3 1.6    7.1 0      3.9 2.2    3.0 3.3    0   7.8    1.0 5.3

bval: (4.6 9.3 8.6 8.2 2.4 5.6 5.3 1.6 1.9 7.9 7.1 0.0 . . . )
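A 2×2-specialized BCSR kernel along the lines described above could be sketched as follows. The function name and signature are assumptions for this example. It follows the slide's conventions: bcol_ind holds the column of each block's first element, and bval stores each block row-major, including padding zeros.

```c
/* y = A * x for a matrix stored in BCSR with fixed 2x2 blocks.
 * nbrows is the number of block rows (matrix rows / 2). */
void bcsr_spmv_2x2(int nbrows, const int *brow_ptr, const int *bcol_ind,
                   const double *bval, const double *x, double *y)
{
    for (int bi = 0; bi < nbrows; bi++) {
        double y0 = 0.0, y1 = 0.0;            /* kept in registers */
        for (int b = brow_ptr[bi]; b < brow_ptr[bi + 1]; b++) {
            const double *blk = &bval[4 * b]; /* 4 values per 2x2 block */
            int col = bcol_ind[b];            /* one column index per block */
            double x0 = x[col], x1 = x[col + 1];
            y0 += blk[0] * x0 + blk[1] * x1;
            y1 += blk[2] * x0 + blk[3] * x1;
        }
        y[2 * bi] = y0;
        y[2 * bi + 1] = y1;
    }
}
```

The example matrix needs 7 column indices instead of one per non-zero, but it also stores three explicit zeros as padding, which is the data increase the slide warns about.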


SpMxV performance (CSR)

related work → several performance issues

performance evaluation on 100 matrices [Goumas et al. ’09]

memory bandwidth is the bottleneck (optimization attempts to improve computations are misguided)

[Figure: plot with x-axis ticks 1, 2, 4, 8 and y-axis ticks 0.8 to 3.2; legend: all, average.]

compression for improving SpMxV performance (reduce working set)
    for matrices larger than cache


CSR SpMxV working set (nnz ≫ N)

$s_{idx}$: column index size
$s_{val}$: value size
$nnz$: non-zero values
$ws$: working set size

\[
ws = \overbrace{nnz \cdot s_{idx}}^{\text{index data}} + \overbrace{nnz \cdot s_{val}}^{\text{value data}}
\]

$s_{idx}$ = 32 bit, $s_{val}$ = 64 bit ⟹ index data make up 1/3 of the working set, value data 2/3
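The working-set formula can be turned into a small calculator. The sketch below uses hypothetical function names and takes sizes in bytes. Like the slide's formula, it ignores row_ptr and the x and y vectors, which is reasonable when nnz ≫ N.

```c
#include <stddef.h>

/* Working-set size in bytes: ws = nnz * s_idx + nnz * s_val. */
size_t csr_working_set(size_t nnz, size_t idx_bytes, size_t val_bytes)
{
    return nnz * idx_bytes + nnz * val_bytes;
}

/* Fraction of the working set that is index data. */
double csr_index_fraction(size_t idx_bytes, size_t val_bytes)
{
    return (double)idx_bytes / (double)(idx_bytes + val_bytes);
}
```

With 4-byte indices and 8-byte values the index fraction is 1/3, so under this formula index compression alone can remove at most about a third of the matrix traffic.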

Outline

1. Compression as an approach to scale up memory-bound applications
2. Sparse Matrices and SpMxV
3. CSX: A new storage format for sparse matrices
    Index compression to optimize SpMxV
    CSX: The Compressed Sparse eXtended storage format
    CSX substructures
    CSX implementation
    Experimental Evaluation
4. Conclusions – Areas of future research – Discussion

Index compression

Initial remarks:
    SpMxV is a memory-bound kernel
    data compression can be a viable approach
    index data seem a good target for compression (include a lot of redundancy)

Specialized storage formats
    indirectly may lead to index compression
    typically exploit regularities: e.g., 2D blocks, diagonals, etc.
    e.g., BCSR
        one column index per block
        original goal: register blocking
        but may lead to data increase due to padding

Storage formats that explicitly target index compression
    delta encoding (DCSR [Willcock and Lumsdaine ’06], CSR-DU)
    CSX (generalization of CSR-DU)


First step towards index compression: Delta Encoding (applied in each matrix row)

index data → column indices

Delta encoding for column indices ([Willcock and Lumsdaine ’06])
    store delta distance from previous index, not absolute value
    instead of $ci_i$, store:
    \[ \delta_i = ci_i - ci_{i-1} \;\Rightarrow\; \delta_i \le ci_i \;\Rightarrow\; \text{(potentially) less space per index} \]
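Per-row delta encoding can be sketched as below. This is a minimal illustration, not the DCSR or CSR-DU implementation. The function names, the choice to keep the first index of a row unchanged, and the assumption that indices within a row are sorted in ascending order are all made for this example.

```c
/* Delta-encode the column indices of one row. The first index is kept
 * as-is (an assumption of this sketch); every later entry stores the
 * distance from its predecessor. */
void delta_encode_row(const int *col_ind, int len, int *delta)
{
    for (int i = 0; i < len; i++)
        delta[i] = (i == 0) ? col_ind[0] : col_ind[i] - col_ind[i - 1];
}

/* Inverse transform: a running sum restores the absolute indices. */
void delta_decode_row(const int *delta, int len, int *col_ind)
{
    int cur = 0;
    for (int i = 0; i < len; i++) {
        cur += delta[i];
        col_ind[i] = cur;
    }
}

/* Smallest of {8, 16, 32} bits that can hold every delta of the row
 * (deltas are non-negative when the indices are sorted). */
int delta_width_bits(const int *delta, int len)
{
    int max = 0;
    for (int i = 0; i < len; i++)
        if (delta[i] > max)
            max = delta[i];
    if (max <= 0xFF)
        return 8;
    if (max <= 0xFFFF)
        return 16;
    return 32;
}
```

For the last row of the CSR example (column indices 0, 2, 3, 5) the deltas are 0, 2, 1, 2, which fit in 8 bits each instead of 32.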


CSX motivation
current approaches are not aggressive enough: there are many more regularities to exploit

regularities and sparse storage formats
    BCSR, VBL [Pinar and Heath ’99], DIAG

[Figure: two 10×10 example sparsity patterns.]

multiple regularities ↔ composite formats [Agarwal et al. ’92]
    multiple sub-matrices, each in a different format
    A·x = (A0 + A1)·x = A0·x + A1·x

CSX: Compressed Sparse eXtended (in a nutshell)

Objectives
    target memory-bandwidth limitations of SpMxV (implied large matrices)
    adapt to matrix structure and architecture
    generate efficient code

Approach
    apply aggressive index compression
    exploit a wide set of matrix “regularities”
    employ code generation to produce efficient code tailored per matrix
    3 preprocessing phases:
        detection of regularities
        matrix encoding
        code generation
    drastically reduce preprocessing times

CSX substructures (regularities supported by CSX)

Horizontal (delta run-length-encoding, drle)
    sequential elements with a constant difference δ
    (y, x + i·δ) → (y, x) (y, x + δ) (y, x + 2·δ) . . .
    e.g., col. indices 1,2,3,4,5 (δ = 1) or 2,4,6,8,10 (δ = 2)

Other 1D directions (Vertical, Diagonal, Anti-Diagonal)
    Vertical: (y + i·δ, x)
    Diagonal: (y + i·δ, x + i·δ)
    Anti-Diagonal: (y − i·δ, x + i·δ)

2D blocks
    (x + i)×(y + j) (double nested loop)
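Detecting a horizontal substructure amounts to finding runs of column indices with a constant difference. The sketch below is a simplified, hypothetical detector and not the CSX implementation. The function name and signature are assumptions. The other 1D directions can presumably be handled the same way after transforming the coordinates, but the slides do not spell this out.

```c
/* Length of the longest run starting at cols[start] whose consecutive
 * column indices differ by a constant delta. Stores that delta in
 * *delta_out (0 when the run has fewer than two elements). */
int horizontal_run(const int *cols, int len, int start, int *delta_out)
{
    if (start + 1 >= len) {
        *delta_out = 0;
        return (start < len) ? 1 : 0;
    }
    int delta = cols[start + 1] - cols[start];
    int n = 2;
    while (start + n < len &&
           cols[start + n] - cols[start + n - 1] == delta)
        n++;
    *delta_out = delta;
    return n;
}
```

A run of n elements with constant δ can be described by its start column, δ, and n, rather than by n separate column indices.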


CSX substructures on matrices

[Figure: non-zero elements coverage (%) (0 to 100) per matrix, broken down by substructure type.
Matrices: xenon2, ASIC_680k, torso3, Chebyshev4, Hamrle3, pre2, cage13, atmosmodj, ohne2, kkt_power, TSOPF_RS_b2383, Ga41As41H72, Freescale1, rajat31, F1, parabolic_fem, offshore, consph, bmw7st_1, G3_circuit, thermal2, m_t1, bmwcra_1, hood, crankseg_2, nd12k, af_5_k101, inline_1, ldoor, boneS10.
Legend: d8, d16, d32, h(1), h(2), v(1), v(2), d(1), d(3), d(11), d(857), d(1714), rd(1), br(2,2), br(2,3), br(2,4), br(2,6), br(2,9), br(2,12), br(2,18), br(3,3), br(3,6), br(3,9), br(3,12), br(3,15), br(3,18), br(4,4), br(5,5), br(5,15), br(7,7), br(7,14), br(7,21), bc(3,2).]

CSX Encoding

CTL unit layout (Head followed by Body):

    field     width
    nr        1 bit
    rjmp      1 bit
    id        6 bits
    size      8 bits
    ujmp      variable int
    ucol      variable int
    deltas    fixed {8,16,32} bits each, repeated

[Figure: 10×10 example matrix with its detected substructures labeled h(1), ad(1), bc(4,2), v(1), d(2), bc(4,2), bc(3,2).]

Encoded units of the example (ujmp is present only when rjmp = 1):

    nr  rjmp  id  size  [ujmp]:ucol   substructure
    1   0     0   4     0             h(1)
    0   0     1   4     5             ad(1)
    1   1     2   8     1:0           bc(4,2)
    0   0     3   4     9             v(1)
    1   0     4   4     3             d(2)
    1   1     2   8     2:2           bc(4,2)
    1   0     5   6     2             bc(3,2)
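The fixed-width part of a unit head (nr, rjmp, id) fits in a single byte, followed by a size byte. The sketch below shows one possible packing. The bit positions are an assumption made for illustration, since the slide gives only the field widths (1, 1 and 6 bits), and the function names are hypothetical. The variable-length ujmp and ucol fields are not covered.

```c
#include <stdint.h>

/* Assumed bit layout of the first head byte (the slide only gives the
 * field widths): nr in bit 7, rjmp in bit 6, id in the low 6 bits. */
uint8_t ctl_pack_flags(int nr, int rjmp, int id)
{
    return (uint8_t)(((nr & 1) << 7) | ((rjmp & 1) << 6) | (id & 0x3F));
}

/* Accessors that undo the packing above. */
int ctl_nr(uint8_t flags)   { return (flags >> 7) & 1; }
int ctl_rjmp(uint8_t flags) { return (flags >> 6) & 1; }
int ctl_id(uint8_t flags)   { return flags & 0x3F; }
```

Under this assumed layout, the bc(4,2) unit with nr = 1, rjmp = 1, id = 2 packs to the byte 0xC2, followed by a size byte of 8. A 6-bit id allows up to 64 distinct substructure types per matrix.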

CSX Code generation

[Figure: code-generation pipeline. Encoded matrix and SpMV source templates → C code generator → SpMV source → Clang front-end → (emit LLVM) → LLVM module → Execution Engine → native SpMV code → function pointer → call.]

top-level SpMxV template
    big case statement based on substructure

code for each substructure in the matrix

CSX preprocessing cost

⇒ what about preprocessing (compression) cost?

depends on the application
    frequently, the matrix is used across numerous SpMxV runs
    ⇒ sufficient repetitions → overhead will be amortized

methods to reduce preprocessing cost:
    reduce the number of substructures scanned
    sample the matrix for substructures
    parallelize preprocessing
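The amortization argument can be made concrete with a break-even calculation. The sketch below is an illustration under simple assumptions: a fixed preprocessing cost and constant per-run times, all in the same unit. The function name and the sample numbers are hypothetical.

```c
/* Smallest number of SpMxV runs n for which preprocessing pays off,
 * i.e. the smallest n with prep + n * t_new <= n * t_base.
 * Returns -1 when the new format is not faster per run. */
long break_even_runs(double prep, double t_base, double t_new)
{
    if (t_new >= t_base)
        return -1;
    double n = prep / (t_base - t_new);
    long k = (long)n;
    if ((double)k < n)      /* round up without needing libm */
        k++;
    return k;
}
```

For example, if preprocessing costs as much as 100 baseline runs and each run afterwards takes half the baseline time, the overhead is recovered after 200 runs. An iterative solver that needs fewer runs than that would not benefit, which is why the answer depends on the application.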

Experimental preliminaries

matrix suite:
    30 matrices
    University of Florida sparse matrix collection [Davis and Hu, 2011]
    real world applications, large variety of applications
    including problems without an underlying 2D/3D geometry
    do not fit into aggregate cache

compare against:
    CSR
    BCSR (we always select the best performing block)
    VBL (1D variable length blocks)

double (64-bit) floating point values

CSX Compression ratio

[Figure: compression ratio (%) (-80 to 40) for each of the 30 matrices, xenon2 to boneS10; series: BCSR, VBL, CSX, Maximum.]

maximum: only consider values (i.e., no index data)
CSX always the best option
CSX never has negative compression

Page 46: Alleviating memory-bandwidth limitations for scalability and energy … · 2013. 10. 11. · Alleviatingmemory-bandwidthlimitationsfor scalabilityandenergyefficiency LessonslearnedfromtheoptimizationofSpMxV

CSX: Evaluation on SMP

[Diagram: cache hierarchy of the two SMP platforms]

Harpertown
    2 × 4 = 8 cores
    L2: 6 MiB, per 2 cores

Dunnington
    4 × 6 = 24 cores
    L2: 3 MiB, per 2 cores
    L3: 16 MiB


CSX: SMP: Average speedup over serial CSR (share-all core filling policy)

Harpertown
[Plot: speedup over serial CSR (y-axis) vs. threads (1, 2, 4, 8) for CSR, BCSR, VBL, CSX]

improvement over MT CSR for 8 threads:
    CSX: 26.4%
    VBL: 18.5%
    BCSR: 4.1%

Dunnington
[Plot: speedup over serial CSR (y-axis) vs. threads (1, 2, 6, 12, 24) for CSR, BCSR, VBL, CSX]

improvement over MT CSR for 24 threads:
    CSX: 61%
    VBL: 28.8%
    BCSR: 6.3%
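The two metrics on this slide, speedup over serial CSR and improvement over multithreaded (MT) CSR, are related as sketched below. The formulas are the usual definitions and are assumed here; the timings are hypothetical and are not the measurements behind the reported averages.

```python
def speedup(t_serial_csr, t_format):
    """How many times faster a run is than serial CSR."""
    return t_serial_csr / t_format

def improvement_over(t_baseline, t_format):
    """Performance improvement over a baseline, in percent."""
    return 100.0 * (t_baseline / t_format - 1.0)

# Hypothetical execution times in milliseconds (assumed).
t_serial_csr = 100.0
t_mt_csr = 25.0   # multithreaded CSR
t_mt_csx = 20.0   # multithreaded CSX, same thread count

print(speedup(t_serial_csr, t_mt_csr))       # speedup of MT CSR
print(speedup(t_serial_csr, t_mt_csx))       # speedup of MT CSX
print(improvement_over(t_mt_csr, t_mt_csx))  # CSX improvement over MT CSR
```

The improvement equals the ratio of the two speedups minus one, so it can be read directly off the plotted curves at a given thread count.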


CSX: Preprocessing cost

Dunnington
[Plot: performance improvement over M/T CSR (y-axis, 1 to 1.6) vs. serial CSR SpMV operations (x-axis, 8 to 512) for CSX-delta, CSX-horiz, CSX-sampling, CSX-full]

Gainestown (NUMA)
[Plot: performance improvement over M/T CSR (y-axis, 1 to 1.2) vs. serial CSR SpMV operations (x-axis, 16 to 1024) for CSX-delta, CSX-horiz, CSX-sampling, CSX-full]


Energy and power measurements

Total energy (idle cores included), matrix af_5_k101
[Plot: total energy breakdown (uncore, core, dram) in J, 0 to 1400, together with csr and csx speedup w.r.t. CSR with 1 thread (1 to 10), for csr and csx at 1, 2, 4, 8, 16, 24, 32 threads]

Power, dense matrix
[Plot: average power (W), 5 to 45, broken down into core, uncore, dram, for 1, 2, 4, 8, 16, 24, 32 threads]


More info on CSX (papers)

CSX details:
K. Kourtis, V. Karakasis, G. Goumas, and N. Koziris. “CSX: an extended compression format for spmv on shared memory systems,” 16th ACM symposium on Principles and practice of parallel programming (PPoPP ’11). ACM, New York, NY, USA, 247-256.

V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, N. Koziris, “An Extended Compression Format for the Optimization of Sparse Matrix-Vector Multiplication,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930-1940, Oct. 2013.

CSX for symmetric matrices:
T. Gkountouvas, V. Karakasis, K. Kourtis, G. Goumas, and N. Koziris. “Improving the performance of the symmetric sparse matrix-vector multiplication in multicore”. In 27th IEEE International Parallel & Distributed Processing Symposium (IPDPS’13), Boston, MA, USA, 2013.

CSX integrated with Elmer multiphysics simulation software:
V. Karakasis, G. Goumas, K. Nikas, N. Koziris, J. Ruokolainen, and P. Råback. “Using State-of-the-Art Sparse Matrix Optimizations for Accelerating the Performance of Multiphysics Simulations”. In PARA 2012: Workshop on State-of-the-Art in Scientific and Parallel Computing, Helsinki, Finland, 2012. Springer.

Value compression for SpMxV:
K. Kourtis, G. Goumas and N. Koziris, “Exploiting Compression Opportunities to Improve SpMxV Performance on Shared Memory Systems,” ACM Transactions on Architecture and Code Optimization (TACO), Vol 7, No 3, December 2011.

The energy profile of CSX and CSR:
J. C. Meyer, V. Karakasis, J. Cebrián, L. Natvig, D. Siakavaras, and K. Nikas. “Energy-efficient sparse matrix autotuning with CSX – A trade-off study”. In Ninth Workshop on High-Performance, Power-Aware Computing (HPPAC’13), IPDPS’13, Boston, MA, USA, 2013.


More info on CSX

download code:
    http://www.cslab.ece.ntua.gr/csx/
    https://github.com/cslab-ntua/csx

current status:
    working on the release of an API and library
    support tools (disk representation, file format converters)




Conclusions

Compression can improve SpMxV performance
CSX applies aggressive index data compression to optimize SpMxV
    supports arbitrary regularities
    tunable preprocessing cost
    yet, preprocessing can be a concern
    outperforms baseline and state-of-the-art alternatives
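As a rough illustration of what index data compression means, the sketch below delta-encodes the column indices of a single row and picks the narrowest integer width that holds every delta. It is a simplified stand-in, not the CSX implementation: the function names are hypothetical, and the real format additionally detects substructures (the CSX-delta and CSX-horiz variants above hint at this) and stores them in its own encoding.

```python
def delta_encode_row(col_indices):
    """Return (first_column, deltas) for one row's sorted column indices."""
    if not col_indices:
        return None, []
    deltas = [b - a for a, b in zip(col_indices, col_indices[1:])]
    return col_indices[0], deltas

def delta_width_bytes(deltas):
    """Smallest unsigned width (1, 2 or 4 bytes) that holds every delta."""
    largest = max(deltas, default=0)
    for width in (1, 2, 4):
        if largest < 1 << (8 * width):
            return width
    raise ValueError("delta does not fit in 32 bits")

# Hypothetical row with a dense run followed by a gap.
row = [3, 4, 5, 6, 200, 201]
first, deltas = delta_encode_row(row)

csr_index_bytes = 4 * len(row)                                  # 32-bit indices
delta_index_bytes = 4 + delta_width_bytes(deltas) * len(deltas)  # first column + deltas
print(deltas, csr_index_bytes, delta_index_bytes)
```

Runs of identical deltas, such as the three consecutive 1s in this row, are the kind of regularity a substructure-aware format can describe even more compactly than one delta per element.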


Areas of current and future research
and (hopefully) opportunities for collaboration

relevant to CSX, SpMxV and compression
    apply compression to the floating-point values of the matrix (recall: these consume 2/3 of the data!)
    generalize to other applications
    investigate opportunities for hardware support (scalability, space, energy)

contention-aware scheduling
    time and space scheduling of resource-hungry applications (for homogeneous and heterogeneous CMPs)
    performance prediction

energy-aware computing
    power and energy-aware algorithms and techniques
    predict execution behavior based on power consumption snapshots


EOF

Thank you!
Questions?
