Memory Optimization
Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Spring 2010


Page 1

Spring 2010

Memory Optimization

Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

Page 2

Outline

• Issues with the Memory System
• Loop Transformations
• Data Transformations
• Prefetching
• Alias Analysis

Page 3

Memory Hierarchy

Registers                        1 - 2 ns      32 B – 512 B
L1 Private Cache                 3 - 10 ns     16 KB – 128 KB
L2/L3 Shared Cache               8 - 30 ns     1 MB – 16 MB
Main Memory (DRAM)               60 - 250 ns   1 GB – 128 GB
Permanent Storage (Hard Disk)    5 - 20 ms     250 MB – 4 TB

Saman Amarasinghe, 6.035, ©MIT Fall 2006

Page 4

Processor-Memory Gap

[Figure: performance (log scale) vs. year, 1980-2000. Microprocessor performance grows at about 60%/year (2x every 1.5 years) while DRAM performance grows at about 9%/year (2x every 10 years), so the gap widens steadily.]

Page 5

Cache Architecture

                           Pentium D   Core Duo   Core 2 Duo   Athlon 64
L1 code (per core)
  size                     12 K uops   32 KB      32 KB        64 KB
  associativity            8 way       8 way      8 way        2 way
  line size                64 bytes    64 bytes   64 bytes     64 bytes
L1 data (per core)
  size                     16 KB       32 KB      32 KB        64 KB
  associativity            8 way       8 way      8 way        8 way
  line size                64 bytes    64 bytes   64 bytes     64 bytes
L1 to L2 latency           4 cycles    3 cycles   3 cycles     3 cycles
L2 (shared)
  size                     4 MB        4 MB       4 MB         1 MB
  associativity            8 way       8 way      16 way       16 way
  line size                64 bytes    64 bytes   64 bytes     64 bytes
L2 to L3 (off) latency     31 cycles   14 cycles  14 cycles    20 cycles

Page 6

Cache Misses

• Cold misses
  – First time a data item is accessed
• Capacity misses
  – Data got evicted between accesses because a lot of other data (more than the cache size) was accessed
• Conflict misses
  – Data got evicted because a subsequent access fell on the same cache line (due to associativity)
• True sharing misses (multicores)
  – Another processor accessed the data between the accesses
• False sharing misses (multicores)
  – Another processor accessed different data in the same cache line between the accesses

Page 7

Data Reuse

• Temporal reuse
  – A given reference accesses the same location in multiple iterations
    for i = 0 to N
      for j = 0 to N
        A[j] =
• Spatial reuse
  – Accesses to different locations within the same cache line
    for i = 0 to N
      for j = 0 to N
        B[i, j] =
• Group reuse
  – Multiple references access the same location
    for i = 0 to N
      A[i] = A[i-1] + 1

Page 8

Outline

• Issues with the Memory System
• Loop Transformations
• Data Transformations
• Prefetching
• Alias Analysis

Page 9

Loop Transformations

• Transform the iteration space to reduce the number of misses
• Reuse distance
  – For a given access, the number of other data items accessed before that data is accessed again
• Reuse distance > cache size
  – Data is spilled between accesses
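The reuse-distance definition above can be made concrete with a small sketch (not from the lecture; the function name and the trace encoding are my own):

```python
def reuse_distances(trace):
    """For each access, the number of distinct *other* data items
    touched since the previous access to the same item
    (infinite for a first access, i.e. a cold miss)."""
    last_seen = {}   # item -> index of its most recent access
    dists = []
    for t, item in enumerate(trace):
        if item in last_seen:
            dists.append(len(set(trace[last_seen[item] + 1 : t])))
        else:
            dists.append(float("inf"))
        last_seen[item] = t
    return dists

# 'a' is reused after 2 distinct other items: under LRU, a fully
# associative cache holding at most 2 items would miss on the reuse.
print(reuse_distances(["a", "b", "c", "a"]))  # [inf, inf, inf, 2]
```

This is the quantity the "reuse distance > cache size implies a spill" rule is stated over.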

Page 10

Loop Transformations

for i = 0 to N
  for j = 0 to N
    for k = 0 to N
      A[k,j]

Reuse distance = N²
If cache size < 16 doubles? A lot of capacity misses

Page 11

Loop Transformations

for i = 0 to N
  for j = 0 to N
    for k = 0 to N
      A[k,j]

Loop interchange:

for j = 0 to N
  for i = 0 to N
    for k = 0 to N
      A[k,j]

Page 12

Loop Transformations

for j = 0 to N
  for i = 0 to N
    for k = 0 to N
      A[k,j]

Cache line size > data size
Cache line size = L
Reuse distance = LN
If cache size < 8 doubles? Again a lot of capacity misses

Page 13

Loop Transformations

for j = 0 to N
  for i = 0 to N
    for k = 0 to N
      A[k,j]

Loop interchange:

for k = 0 to N
  for i = 0 to N
    for j = 0 to N
      A[k,j]
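The payoff of putting the fastest-varying subscript innermost can be simulated with a toy direct-mapped cache (a sketch; the cache parameters and the model are mine, not the lecture's):

```python
def misses(accesses, n=32, line=8, nlines=64):
    """Count misses in a direct-mapped cache with `nlines` lines of
    `line` elements each, for accesses to a row-major n x n array."""
    cache = [None] * nlines   # slot -> block currently resident
    m = 0
    for i, j in accesses:
        block = (i * n + j) // line   # which cache line the element is in
        slot = block % nlines         # where a direct-mapped cache puts it
        if cache[slot] != block:
            cache[slot] = block
            m += 1
    return m

n = 32
row_order = [(i, j) for i in range(n) for j in range(n)]  # j innermost
col_order = [(i, j) for j in range(n) for i in range(n)]  # i innermost

print(misses(row_order), misses(col_order))  # 128 1024
```

With the column-major-style traversal every access misses, while the row-major traversal misses only once per cache line: the same accesses, reordered, cut misses by 8x here.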

Page 14

Loop Transformations

for i = 0 to N
  for j = 0 to N
    for k = 0 to N
      A[i,j] = A[i,j] + B[i,k] + C[k,j]

• No matter what loop transformation you do, one array access has to traverse the full array multiple times

Page 15

Example: Matrix Multiply

[Figure: a 1 × 1024 row of the result is computed from a 1 × 1024 row of A and the full 1024 × 1024 matrix B.]

Data accessed per row of the result: 1024 + 1024 + 1024 × 1024 = 1,050,624

Page 16

Loop Tiling

for i = 0 to N
  for j = 0 to N

Tiled with block size b:

for ii = 0 to ceil(N/b)
  for jj = 0 to ceil(N/b)
    for i = b*ii to min(b*ii+b-1, N)
      for j = b*jj to min(b*jj+b-1, N)
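A full tiled matrix multiply built from this loop skeleton might look as follows (a Python sketch; the lecture gives only the skeleton, so the loop body and names are mine):

```python
def matmul_naive(A, B, n):
    """Reference n x n product on lists-of-lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_tiled(A, B, n, b):
    """Same product, visiting b x b tiles so each tile of B is reused
    while it is still cache-resident (short reuse distance)."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for kk in range(0, n, b):
            for jj in range(0, n, b):
                # min(...) handles the ragged tiles when b does not divide n
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Tiling only reorders the additions, so both versions compute the same result; what changes is how far apart the repeated touches of each B tile are.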

Page 17

Example: Matrix Multiply

[Figure: untiled, a 1 × 1024 row of the result reads a 1 × 1024 row of A and all of the 1024 × 1024 matrix B; data accessed = 1,050,624. Tiled, a 1024 × 32 strip of the result reads a 1024 × 32 strip of A and one 32 × 32 tile of B; data accessed = 66,560.]

Page 18

Outline

• Issues with the Memory System
• Loop Transformations
• Data Transformations
• Prefetching
• Alias Analysis

Page 19

False Sharing Misses

for J =
  forall I =
    X(I, J) =

[Figure: array X drawn over its cache lines; parallelizing the I loop makes different processors write neighboring elements that share a cache line.]

Page 20

Conflict Misses

for J =
  forall I =
    X(I, J) =

[Figure: array X and its mapping into cache memory; the accessed elements map onto the same cache lines and evict each other.]

Page 21

Eliminating False Sharing and Conflict Misses

for J =
  forall I =
    X(I, J) =

[Figure: array X after a data transformation, with each processor's elements contiguous in memory.]

Page 22

Data Transformations

• Similar to loop transformations
• All the accesses have to be updated
  – Whole-program analysis is required

Page 23

Strip-Mining
Create two dimensions from one (here with block size 4)

Storage declaration:  N  becomes  N/4 × 4
Array access:         i  becomes  (i1, i2)
Memory layout:        unchanged

Page 24

Strip-Mining: create two dimensions from one (block size 4)
Permutation: change the memory layout (permutation matrix [0 1; 1 0])

Storage declaration:  N  becomes  N/4 × 4;  permuted: N1 × N2 becomes N2 × N1
Array access:         i  becomes  (i1, i2);  permuted: (i1, i2) becomes (i2, i1)
Memory layout:        transposed accordingly

Page 25

Data Transformation Algorithm

• Rearrange data: each processor's data is contiguous
• Use data decomposition
  – *, block, cyclic, block-cyclic
• Transform each dimension according to the decomposition
• Use a combination of strip-mining and permutation primitives
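For a one-dimensional array, the strip-mine-then-permute recipe collapses to a single index remapping. A sketch for the cyclic case (function and variable names are mine):

```python
def cyclic_position(i, n, p):
    """New storage offset of element i after strip-mining by the
    processor count p (i -> (i // p, i mod p)) and then permuting the
    two dimensions, so that everything owned by processor (i mod p)
    becomes contiguous in memory."""
    assert n % p == 0, "sketch assumes p divides n"
    return (i % p) * (n // p) + (i // p)

n, p = 8, 2
layout = [cyclic_position(i, n, p) for i in range(n)]
print(layout)  # [0, 4, 1, 5, 2, 6, 3, 7]
```

Processor 0's elements (indices 0, 2, 4, 6) land at offsets 0-3 and processor 1's at offsets 4-7: the cyclic decomposition now reads and writes contiguous, false-sharing-free regions.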

Page 26

Example I: (Block, Block)

[Figure: a 2-D array indexed by (i1, i2), decomposed (block, block) across processors.]

Page 27

Example I: (Block, Block)

[Figure: strip-mining splits i1 into its quotient and remainder parts (i1 /, i1 mod).]

Page 28

Example I: (Block, Block)

[Figure: after strip-mining, a permutation makes each processor's block contiguous in memory.]

Page 29

Example I: (Cyclic, *)

[Figure: the array decomposed (cyclic, *): dimension i1 is dealt out cyclically across processors.]

Page 30

Example I: (Cyclic, *)

[Figure: strip-mining by the processor count P splits i1 into i1 / P and i1 mod P.]

Page 31

Example I: (Cyclic, *)

[Figure: after strip-mining by P, a permutation moves the i1 mod P dimension outermost, so each processor's cyclically assigned elements become contiguous.]

Page 32

Performance

[Figure: speedup on 1-32 processors for LU decomposition (256×256 and 1K×1K) and a 5-point stencil (512×512), comparing parallelizing the outer loop, best computation placement, and best computation placement + data transformations.]

Page 33

Optimizations

• Modulo and division operations in the index calculation
  – Very high overhead
• Use standard techniques
  – Loop-invariant code removal, CSE
  – Strength reduction exploiting properties of modulo and division
  – Use knowledge about the program

Page 34

Outline

• Issues with the Memory System
• Loop Transformations
• Data Transformations
• Prefetching
• Alias Analysis

Page 35

Prefetching

• A cache miss stalls the processor for hundreds of cycles
  – Start fetching the data early so it'll be available when needed
• Pros
  – Reduction of cache misses, increased performance
• Cons
  – Prefetches contend for fetch bandwidth
    • Solution: hardware only issues prefetches on unused bandwidth
  – A prefetch evicts a data item that may still be used
    • Solution: don't prefetch too early
  – The prefetch is still pending when the memory is accessed
    • Solution: don't prefetch too late
  – Prefetched data is never used
    • Solution: prefetch only data guaranteed to be used
  – Too many prefetch instructions
    • Solution: prefetch only if the access is going to miss in the cache

Page 36

Prefetching

• Compiler-inserted
  – Use reuse analysis to identify misses
  – Partition the program and insert prefetches
• Run-ahead thread (helper threads)
  – Create a separate thread that runs ahead of the main thread
  – The run-ahead thread only does the computation needed for control flow and address calculations
  – The run-ahead thread performs data (pre)fetches
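The shape of compiler-inserted prefetching can be sketched with a stand-in prefetch operation (the distance, the stub, and the loop are illustrative, not the lecture's algorithm):

```python
PREFETCH_DISTANCE = 16  # iterations of lead time, tuned to the miss latency

def prefetch(index):
    """Stand-in for a non-binding hardware prefetch instruction:
    it only warms the cache and never faults, so here it is a no-op."""
    pass

def sum_with_prefetch(a):
    """The loop a compiler might emit: each iteration issues a
    prefetch far enough ahead that a[i + d] has arrived by the
    time the computation reaches it."""
    s = 0
    n = len(a)
    for i in range(n):
        if i + PREFETCH_DISTANCE < n:
            prefetch(i + PREFETCH_DISTANCE)
        s += a[i]
    return s

print(sum_with_prefetch(list(range(100))))  # 4950
```

The guard on `i + PREFETCH_DISTANCE < n` is the "prefetch only data guaranteed to be used" rule from the previous slide; the distance constant is the "not too early, not too late" trade-off.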

Page 37

Outline

• Issues with the Memory System
• Loop Transformations
• Data Transformations
• Prefetching
• Alias Analysis

Page 38

Alias Analysis

• Aliases destroy local reasoning
  – Simple, local transformations require global reasoning in the presence of aliases
  – A critical issue in pointer-heavy code
  – The problem is even worse for multithreaded programs
• Two solutions
  – Alias analysis
    • Tools to tell us the potential aliases
  – Change the programming language
    • Languages have no facilities for talking about aliases
    • Want to make local reasoning possible

Courtesy of Alex Aiken. Used with permission.

Page 39

Aliases

• Definition: two pointers that point to the same location are aliases

• Example:
  Y = &Z
  X = Y
  *X = 3   /* changes the value of *Y */

Courtesy of Alex Aiken. Used with permission.

Page 40

Example

foo(int * A, int * B, int * C, int N)
  for i = 0 to N-1
    A[i] = A[i] + B[i] + C[i]

• Is this loop parallel?
• It depends:

  int X[1000]; int Y[1000]; int Z[1000];
  foo(X, Y, Z, 1000);             /* disjoint arrays */

  int X[1000];
  foo(&X[1], &X[0], &X[2], 998);  /* A aliases B and C */
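Why the second call is not parallel can be demonstrated by comparing the sequential loop with a "parallel-legal" version that reads all inputs before writing (a Python sketch over one shared list; the names are mine):

```python
def foo_seq(X, a, b, c, n):
    """A[i] = A[i] + B[i] + C[i], where A, B, C are views of X
    starting at offsets a, b, c (mimicking the C pointer arguments)."""
    for i in range(n):
        X[a + i] = X[a + i] + X[b + i] + X[c + i]

def foo_par(X, a, b, c, n):
    """A legal parallelization only if A aliases neither B nor C:
    all reads happen before any write."""
    rhs = [X[a + i] + X[b + i] + X[c + i] for i in range(n)]
    for i in range(n):
        X[a + i] = rhs[i]

# Overlapping views, as in foo(&X[1], &X[0], &X[2], 998):
X1, X2 = list(range(6)), list(range(6))
foo_seq(X1, 1, 0, 2, 3)
foo_par(X2, 1, 0, 2, 3)
print(X1 == X2)  # False: the loop carries a dependence through X
```

With disjoint offsets (the first call site) the two versions agree, so the compiler may parallelize only when alias analysis can prove the arguments don't overlap.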

Page 41

Points-To Analysis

• Consider:
  P = &Q
  Y = &Z
  X = Y
  *X = P

• Informally:
  – P can point to Q
  – Y can point to Z
  – X can point to Z
  – Z can point to Q

Courtesy of Alex Aiken. Used with permission.

Page 42

Points-To Relations

• A graph
  – Nodes are program names
  – Edge (x, y) says x may point to y
• Finite set of names
  – Implies each name represents many heap cells
  – Correctness: if *x = y in any state of any execution, then (x, y) is an edge in the points-to graph

Courtesy of Alex Aiken. Used with permission.
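A minimal flow-insensitive analysis in this style can be sketched as a fixed point over set constraints (an Andersen-style sketch; the statement encoding is mine and covers only the forms used in the slide's example):

```python
from collections import defaultdict

def points_to(stmts):
    """stmts: ('addr', x, y) for x = &y, ('copy', x, y) for x = y,
    ('store', x, y) for *x = y. Returns may-point-to sets, iterating
    until no set grows (flow-insensitive: statement order is ignored)."""
    pts = defaultdict(set)
    changed = True
    while changed:
        changed = False
        for kind, x, y in stmts:
            if kind == "addr":
                updates = {x: {y}}
            elif kind == "copy":
                updates = {x: set(pts[y])}
            else:  # store: everything x may point to gains pts[y]
                updates = {t: set(pts[y]) for t in set(pts[x])}
            for target, new in updates.items():
                if not new <= pts[target]:
                    pts[target] |= new
                    changed = True
    return pts

# The previous slide's program: P = &Q; Y = &Z; X = Y; *X = P
pts = points_to([("addr", "P", "Q"), ("addr", "Y", "Z"),
                 ("copy", "X", "Y"), ("store", "X", "P")])
print(pts["X"], pts["Z"])  # {'Z'} {'Q'}
```

The result reproduces the slide's informal reading: P can point to Q, Y and X can point to Z, and Z can point to Q.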

Page 43

Sensitivity

• Context sensitivity
  – Separate different uses of functions
  – "Different" is the key: if the analysis thinks the input is the same, reuse the old results
• Flow sensitivity
  – Flow insensitivity means any permutation of the program statements gives the same result
  – Flow-sensitive analysis is similar to data-flow analysis

Page 44

Conclusion

• Memory systems are designed to give a huge performance boost for "normal" operations
• The performance gap between good and bad memory usage is huge
• Program analyses and transformations are needed
• Can off-load this task to the compiler

Page 45

MIT OpenCourseWare
http://ocw.mit.edu

6.035 Computer Language Engineering, Spring 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.