
Page 1: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Lecture 2: CSE 260 – Parallel Computation (Fall 2015), Scott B. Baden

Memory hierarchies

Memory locality optimizations

Page 2: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Announcements
•  Wait list
•  Can you login to Bang and Moodle?
•  Did you submit the XSEDE form?
•  Office hours
•  Programming Lab #1

Scott B. Baden / CSE 260, UCSD / Fall '15 2

Page 3: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Summary from last time
•  Parallelism is inevitable owing to technological factors limiting clock speeds
•  With increased parallelism on chip, computation is becoming cheaper
•  But data motion is becoming more costly relative to computation
•  The resulting disruption has a huge impact on how we write high-quality applications

Scott B. Baden / CSE 260, UCSD / Fall '15 3

Page 4: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Increased performance – so what’s the catch?
•  A well-behaved single-processor algorithm may behave poorly on a parallel computer, and may need to be reformulated numerically
•  There currently exists no tool that can convert a serial program into an efficient parallel program … for all applications … all of the time … on all hardware
•  Many performance programming issues
u  The new code may look dramatically different from the original

Scott B. Baden / CSE 260, UCSD / Fall '15 4

Page 5: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Application-specific knowledge is important
•  The more we know about the application…
   … specific problem … math/physics … initial data …
   … context for analyzing the output …
   … the more we can improve performance
•  Particularly challenging for irregular problems
•  Domain-specific knowledge is important in optimizing performance, especially locality
•  Parallelism introduces many new tradeoffs
u  Redesign the software
u  Rethink the problem solving technique

Scott B. Baden / CSE 260, UCSD / Fall '15 5

Page 6: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Today’s Lecture
•  Memory hierarchies
•  Managing memory locality
•  HW #1

Scott B. Baden / CSE 260, UCSD / Fall '15 6

Page 7: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

The processor-memory gap
•  The result of technological trends
•  The difference between processing and memory speeds has been growing exponentially over time

Scott B. Baden / CSE 260, UCSD / Fall '15 7

http://www.extremetech.com (see http://tinyurl.com/ohmbo7y)

Page 8: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

An important principle: locality
•  Memory accesses generally come from nearby addresses:
u  In space (spatial locality)
u  In time (temporal locality)
•  Often involves loops
•  Opportunities for reuse
•  What are examples of non-localized accesses?

for t = 0 to T-1
  for i = 1 to N-2
    u[i] = (u[i-1] + u[i+1]) / 2
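A minimal C sketch of the averaging loop above, to make the two kinds of locality concrete; the array size N, step count T, and the function name smooth are illustrative assumptions, not part of the original slide.

#include <stdio.h>

#define N 1024
#define T 100

/* Repeatedly average interior points of u.
 * Spatial locality: u[i-1], u[i], u[i+1] fall in the same or adjacent cache lines.
 * Temporal locality: each u[i] is reused on the next i iteration (as u[i-1])
 * and again on the next time step t. */
static double u[N];

void smooth(void) {
    for (int t = 0; t < T; t++)
        for (int i = 1; i < N - 1; i++)
            u[i] = (u[i-1] + u[i+1]) / 2.0;
}

int main(void) {
    for (int i = 0; i < N; i++) u[i] = (double)i;   /* simple initial data */
    smooth();
    printf("u[N/2] = %g\n", u[N/2]);
    return 0;
}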

Scott B. Baden / CSE 260, UCSD / Fall '15 8

Page 9: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Memory hierarchies
•  Reuse data (and instructions) using a hierarchy of smaller but faster memories
•  Put things in faster memory if we reuse them frequently

[Memory hierarchy diagram: approximate access times in clock periods (CP) and capacities]
u  CPU registers: 1 CP, 1 word
u  L1 cache: 2-3 CP; 32 to 64 KB; lines of 10 to 100 B
u  L2 cache: O(10) CP; 256 KB to 4 MB; lines of 10 to 100 B
u  DRAM: O(100) CP; GBs
u  Disk: O(10⁶) CP; many GB or TB

Scott B. Baden / CSE 260, UCSD / Fall '15 9

Page 10: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Bang’s Memory Hierarchy
•  Intel “Clovertown” processor: Intel Xeon E5355 (introduced 2006)
•  Two “Woodcrest” dies (Core2) on a multichip module
•  Two “sockets”
•  Intel 64 and IA-32 Architectures Optimization Reference Manual, Table 2.16

[Diagram (Sam Williams et al.): two sockets, each with four Core2 cores (32K L1 per core) and two 4 MB shared L2 caches, connected by front-side buses at 10.66 GB/s to a chipset with 4x64b controllers driving 667 MHz FBDIMMs at 10.6 GB/s (write) and 21.3 GB/s (read)]

Cache parameters (techreport.com/articles.x/10021/2):
u  L1: associativity 8; access latency 3 clocks, throughput 1 clock
u  L2: associativity 16; access latency 14* clocks, throughput 2 clocks
u  Line size: 64 B (L1 and L2); write update policy: writeback
* Software-visible latency will vary depending on access patterns and other factors

Scott B. Baden / CSE 260, UCSD / Fall '15 10

Page 11: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Examining Bang’s Memory Hierarchy
•  /proc/cpuinfo summarizes the processor
u  vendor_id : GenuineIntel
u  model name : Intel(R) Xeon(R) CPU E5345 @2.33GHz
u  cache size : 4096 KB
u  cpu cores : 4
•  processor : 0 through processor : 7
•  Detailed information at /sys/devices/system/cpu/cpu*/cache/index*/*

[Same two-socket Clovertown diagram as on the previous slide]

Scott B. Baden / CSE 260, UCSD / Fall '15 11
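A small C sketch, shown as one way to dump the per-level cache details from the sysfs path cited above; it assumes a Linux system exposing the standard level, type, size, and coherency_line_size files, and only probes cpu0.

#include <stdio.h>
#include <string.h>

/* Dump cpu0's cache hierarchy from Linux sysfs.
 * Each cache level appears as .../cache/indexN/{level,type,size,coherency_line_size}. */
static int read_field(int idx, const char *field, char *buf, size_t len) {
    char path[256];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu0/cache/index%d/%s", idx, field);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    if (!fgets(buf, (int)len, f)) buf[0] = '\0';
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip trailing newline */
    return 1;
}

int main(void) {
    char level[32], type[32], size[32], line[32];
    for (int idx = 0; read_field(idx, "level", level, sizeof level); idx++) {
        read_field(idx, "type", type, sizeof type);
        read_field(idx, "size", size, sizeof size);
        read_field(idx, "coherency_line_size", line, sizeof line);
        printf("index%d: L%s %s cache, size %s, line size %s bytes\n",
               idx, level, type, size, line);
    }
    return 0;
}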

Page 12: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Stampede’s Sandy Bridge Memory Hierarchy
•  /proc/cpuinfo summarizes the processor
u  vendor_id : GenuineIntel
u  model name : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
u  cache size : 20480 KB
u  cpu cores : 8
•  processor : 0 through processor : 15
•  Detailed information at /sys/devices/system/cpu/cpu*/cache/index*/*

www.cac.cornell.edu/Stampede/CodeOptimization/multicore.aspx

Scott B. Baden / CSE 260, UCSD / Fall '15 12

Page 13: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

The 3 C’s of cache misses
•  Cold start (compulsory)
•  Capacity
•  Conflict

realworldtech.com

Scott B. Baden / CSE 260, UCSD / Fall '15 13

Page 14: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Today’s Lecture
•  Memory hierarchies
•  Managing memory locality
•  HW #1

Scott B. Baden / CSE 260, UCSD / Fall '15 17

Page 15: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Matrix Multiplication

•  An important core operation in many numerical algorithms

•  Given two conforming matrices A and B, form the matrix product A × B: A is m × n, B is n × p, so the product is m × p
•  Operation count: O(n³) multiply-adds for an n × n square matrix

•  Discussion follows from Demmel www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect02.html

Scott B. Baden / CSE 260, UCSD / Fall '15 18

Page 16: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Performance of Matrix Multiply

[Performance plot: matrix multiply reaches 8.14 GFlops; peak rate R∞ = 4 × 2.33 = 9.32 GFlops, so ≈ 87% of peak]

Scott B. Baden / CSE 260, UCSD / Fall '15 19

Page 17: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Unblocked Matrix Multiplication

for i = 0 to n-1
  for j = 0 to n-1
    for k = 0 to n-1
      C[i,j] += A[i,k] * B[k,j]

Scott B. Baden / CSE 260, UCSD / Fall '15 20

[Diagram: C(i,j) = Σk A(i,k) B(k,j), with row i of A and column j of B producing element (i,j) of C]
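A runnable C version of the unblocked loop nest above, offered as a minimal sketch; the flat row-major layout and the name matmul_naive are assumptions for illustration.

#include <stdio.h>
#include <stdlib.h>

/* Unblocked (ijk) matrix multiply: C += A * B for n x n matrices
 * stored in row-major order as flat arrays. */
void matmul_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

int main(void) {
    int n = 256;
    double *A = malloc(n*n*sizeof *A), *B = malloc(n*n*sizeof *B);
    double *C = calloc(n*n, sizeof *C);
    for (int i = 0; i < n*n; i++) { A[i] = rand()/(double)RAND_MAX; B[i] = rand()/(double)RAND_MAX; }
    matmul_naive(n, A, B, C);
    printf("C[0] = %g\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}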

Page 18: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Analysis of performance

for i = 0 to n-1        // for each iteration i, load all of B into cache
  for j = 0 to n-1      // for each iteration (i,j), load A[i,:] into cache
                        // for each iteration (i,j), load and store C[i,j]
    for k = 0 to n-1
      C[i,j] += A[i,k] * B[k,j]

[Diagram: C[i,j] += A[i,:] * B[:,j]]

Scott B. Baden / CSE 260, UCSD / Fall '15 21

Page 19: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Analysis of performance

for i = 0 to n-1        // B[:,:] : n × n²/L loads = n³/L, where L = cache line size (words)
  for j = 0 to n-1      // A[i,:] : n²/L loads
                        // C[i,j] : n²/L loads + n²/L stores = 2n²/L
    for k = 0 to n-1
      C[i,j] += A[i,k] * B[k,j]

Total: (n³ + 3n²)/L

[Diagram: C[i,j] += A[i,:] * B[:,j]]

Scott B. Baden / CSE 260, UCSD / Fall '15 22

Page 20: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Computational Intensity

•  Performance is limited by the ratio of the computation done to the volume of data moved between the processor and main memory
•  We call this ratio the computational intensity, or q; it is intrinsic to the application

q = 2n³ / (n³ + 3n²) ≈ 2   as n → ∞

Scott B. Baden / CSE 260, UCSD / Fall '15 23

Page 21: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Blocking for cache: motivation
•  To increase performance, we need to move less data
•  Assume a 2-level memory hierarchy: fast and slow
•  Minimum running time = f × tf when all data is in the fast memory, where f = # arithmetic ops and tf = time per flop
•  Actual time = f × tf + m × tm = f × tf × (1 + (tm/tf) × 1/q), where m = # words transferred, tm = slow memory access time, and q = f/m is the computational intensity
•  The ratio tm/tf establishes the machine balance, a “hardware computational intensity”: we need q ≥ tm/tf to come within a factor of two of the minimum running time

Scott B. Baden / CSE 260, UCSD / Fall '15 24
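A tiny C helper sketching the two-level time model above; the parameter values in main are made-up numbers purely for illustration.

#include <stdio.h>

/* Estimated running time under the two-level model:
 * T = f*tf + m*tm = f*tf * (1 + (tm/tf) * (1/q)), where q = f/m. */
double model_time(double f, double tf, double m, double tm) {
    double q = f / m;                       /* computational intensity */
    return f * tf * (1.0 + (tm / tf) / q);  /* compute term + memory term */
}

int main(void) {
    double n  = 1000;                  /* hypothetical matrix dimension */
    double f  = 2 * n * n * n;         /* flops for an n x n matmul */
    double m  = n * n * n + 3 * n * n; /* words moved by the unblocked loop */
    double tf = 1e-10, tm = 1e-8;      /* assumed time per flop / per word (s) */
    printf("q = %.2f, estimated time = %.3f s\n", f / m, model_time(f, tf, m, tm));
    return 0;
}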

Page 22: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Blocked Matrix Multiply

•  Divide A, B, C into N × N grids of sub-blocks, each b × b with b = n/N
•  All 3 blocks must fit into cache together
•  Assume we have a good-quality library to perform matrix multiplication on the sub-blocks

Scott B. Baden / CSE 260, UCSD / Fall '15 25

[Diagram: block decomposition of the n × n matrices; the block size is b = Θ(M^(1/2)), where M is the fast memory size]

Page 23: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Blocked Matrix Multiplication

for i = 0 to N-1
  for j = 0 to N-1
    // load each block C[i,j] into cache, once : n²
    // (b = n/N = block size)
    for k = 0 to N-1
      // load each block A[i,k] and B[k,j], N³ block loads each
      // = 2N³ × (n/N)² = 2Nn² words
      C[i,j] += A[i,k] * B[k,j]   // do a b × b matrix multiply
    // write each block C[i,j] back, once : n²

Total: (2N + 2) n² words moved

[Diagram: C[i,j] += A[i,k] * B[k,j] on b × b blocks]

Scott B. Baden / CSE 260, UCSD / Fall '15 26
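A runnable C sketch of the blocked algorithm above, with a single level of blocking; the block size BS and the name matmul_blocked are illustrative choices, and n is assumed to be a multiple of BS for simplicity.

#include <stdio.h>
#include <stdlib.h>

#define BS 32   /* block size b; tune so three BS x BS blocks fit in cache */

/* One-level blocked matrix multiply: C += A * B, row-major n x n arrays,
 * with n assumed to be a multiple of BS. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i += BS)
        for (int j = 0; j < n; j += BS)
            for (int k = 0; k < n; k += BS)
                /* multiply the BS x BS blocks: C[i.., j..] += A[i.., k..] * B[k.., j..] */
                for (int ii = i; ii < i + BS; ii++)
                    for (int kk = k; kk < k + BS; kk++) {
                        double a = A[ii*n + kk];
                        for (int jj = j; jj < j + BS; jj++)
                            C[ii*n + jj] += a * B[kk*n + jj];
                    }
}

int main(void) {
    int n = 256;
    double *A = malloc(n*n*sizeof *A), *B = malloc(n*n*sizeof *B);
    double *C = calloc(n*n, sizeof *C);
    for (int i = 0; i < n*n; i++) { A[i] = 1.0; B[i] = 1.0; }
    matmul_blocked(n, A, B, C);
    printf("C[0] = %g (expect %d)\n", C[0], n);
    free(A); free(B); free(C);
    return 0;
}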

Page 24: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Lower Bound on Performance
•  If Mfast = size of fast memory: we need 3b² ≤ Mfast, i.e. b ≤ (Mfast/3)^(1/2)
•  To run at half of peak speed: Mfast ≥ 3b² = 3(tm/tf)²
•  We change the order of arithmetic, so we get slightly different answers due to roundoff
•  Lower bound theorem (Hong & Kung, 1981): any reorganization of this algorithm (that uses only associativity) is limited to q = O(Mfast^(1/2)), i.e. # words moved between fast and slow memory = Ω(n³ / Mfast^(1/2))
•  From the previous slide: (2N + 2)n² words moved, where N = n/b, ≈ 2n³/b

Scott B. Baden / CSE 260, UCSD / Fall '15 27
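A short worked step connecting the traffic count above to the Hong & Kung bound; this derivation is added here for clarity and is not text from the original slide.

% Words moved by the blocked algorithm, using b <= sqrt(Mfast/3):
\text{words} \approx \frac{2n^3}{b}
   \ge \frac{2n^3}{\sqrt{M_{\mathrm{fast}}/3}}
   = 2\sqrt{3}\,\frac{n^3}{\sqrt{M_{\mathrm{fast}}}}
   = \Omega\!\left(\frac{n^3}{\sqrt{M_{\mathrm{fast}}}}\right)

So one level of blocking with b = Θ(Mfast^(1/2)) already meets the lower bound, up to a constant factor.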

Page 25: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Computational Intensity

•  Blocking raises the computational intensity by reducing the number of reads from main memory

   q = 2n³ / ((2N + 2)n²) = n / (N + 1) ≈ n/N = b   as n → ∞

Scott B. Baden / CSE 260, UCSD / Fall '15 28

Page 26: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

The results

[Results plot: matrix multiply reaches 8.14 GFlops, ≈ 87% of peak R∞ = 4 × 2.33 = 9.32 GFlops]

Scott B. Baden / CSE 260, UCSD / Fall '15 29

Page 27: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

The roofline model

•  As we’ve seen, the time spent in matrix multiplication includes both data motion time and the time spent computing
•  In a modern processor, we may be able to perform both at the same time
•  Thus, the total running time is the maximum of the two (see the formula below)
•  We can vary q (and the computation)

Scott B. Baden / CSE 260, UCSD / Fall '15 30

Total time = max { f × tf (computation) , m × tm = q⁻¹ × f / BW (data movement) }, where BW = tm⁻¹ is the main memory bandwidth
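A small C sketch of the resulting roofline bound on attainable performance; the peak and bandwidth figures in main are hypothetical values for illustration, not measurements from the slides.

#include <stdio.h>

/* Roofline bound: the attainable flop rate is limited either by the peak
 * compute rate or by q (flops per word) times the memory bandwidth. */
double attainable_gflops(double peak_gflops, double bw_gwords_per_s, double q) {
    double memory_bound = q * bw_gwords_per_s;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    double peak = 9.32;   /* GFlops (hypothetical, cf. R-infinity above) */
    double bw   = 1.33;   /* Gwords/s (hypothetical main memory bandwidth) */
    for (double q = 0.5; q <= 16.0; q *= 2)
        printf("q = %5.2f  ->  %.2f GFlops attainable\n", q, attainable_gflops(peak, bw, q));
    return 0;
}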

Page 28: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Roofline model
•  Peak performance is intrinsic to the processor: a theoretical limit based on the clock speed
•  Lower rooflines correspond to successively degraded performance as we remove optimizations
•  If our hardware has a multiplier and an adder, but the algorithm can use only the adder, then we can do no better than ½ of peak
•  Similarly with SIMD parallelism and ILP

[Roofline plot (Sam Williams): attainable GFLOP/s vs. q (FLOP:Byte ratio), with ceilings for peak DP, mul/add imbalance, without SIMD, and without ILP]

Scott B. Baden / CSE 260, UCSD / Fall '15 31

Page 29: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Roofline model – impact of locality
•  In matrix multiply, we only count compulsory misses
•  But conflict misses can occur as well, lowering the actual FLOP:Byte ratio

[Roofline plot (Sam Williams): attainable GFLOP/s vs. actual FLOP:Byte ratio; the operating point shifts left as we go from only compulsory miss traffic to compulsory + conflict miss traffic]

[Diagram (Larry Carter): a column of a row-major matrix is spread across many cache lines (shown in red)]

Scott B. Baden / CSE 260, UCSD / Fall '15 32

Page 30: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Avoiding conflict misses
•  Copying and cache-friendly layouts

[Diagram (Larry Carter): element numbering in memory order. Row-major layout: a column of the matrix is spread across many cache lines (one per row, shown in red). Reorganized into 2x2 blocks: each block occupies consecutive memory locations, so it falls in fewer cache lines]

Scott B. Baden / CSE 260, UCSD / Fall '15 33
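A C sketch of the copy optimization suggested above: packing a row-major matrix into a 2 × 2 blocked layout so that each block is contiguous. The function name to_blocked_2x2, the block-row ordering of blocks, and the assumption that n is even are mine, not from the slide.

#include <stdio.h>

/* Copy a row-major n x n matrix into a layout where each 2x2 block
 * is stored contiguously (blocks laid out in block-row-major order).
 * Assumes n is even. */
void to_blocked_2x2(int n, const double *a, double *blocked) {
    int out = 0;
    for (int bi = 0; bi < n; bi += 2)          /* block row */
        for (int bj = 0; bj < n; bj += 2)      /* block column */
            for (int i = bi; i < bi + 2; i++)  /* rows within the block */
                for (int j = bj; j < bj + 2; j++)
                    blocked[out++] = a[i*n + j];
}

int main(void) {
    double a[16], b[16];
    for (int i = 0; i < 16; i++) a[i] = i;     /* 4 x 4 example, row-major */
    to_blocked_2x2(4, a, b);
    for (int i = 0; i < 16; i++) printf("%g ", b[i]);
    printf("\n");  /* 0 1 4 5  2 3 6 7  ... : each 2x2 block is contiguous */
    return 0;
}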

Page 31: Lecture 2 CSE 260 – Parallel Computation (Fall 2015) Scott

Programming Assignment #1
•  Implement high-performance matrix multiplication
•  Provided code
u  Single-level blocked matrix multiplication code
u  Reference: calls highly optimized matrix multiply, provided by ATLAS
•  Further optimization required
u  Hierarchical blocking (at least 2 levels of cache; TLB, too)
u  Cache-friendly layouts
u  SSE (vectorization, next time)
u  Compiler optimizations
u  Other hand optimizations may be needed
•  See www.cs.berkeley.edu/~demmel/cs267_Spr12/Lectures/lecture02_memhier_jwd12.ppt

Scott B. Baden / CSE 260, UCSD / Fall '15 34