
Benchmark Analysis of Multi-Core Processor Memory Contention

Tyler Simon, Computer Sciences Corp.
James McGalliard, FEDSIM

SCMG, Richmond/Raleigh, April 2009


Presentation Overview

Multi-Core Processors & Performance
Discover System & Workload
Benchmark Results & Discussion
• Cubed Sphere: Wall Clock, MPI Time
• Memory Kernel
Concluding Remarks


Multi-Core Processors & Performance

Moore’s Law worked for more than 30 years
Problems with current leakage and heat
Processors can’t grow much larger at current clock rates


Multi-Core Processors & Performance

Industry has turned to multi-core processors to continue price/performance improvements

Early multi-core designs were just multiple single-core processors attached to each other


Discover System & Workload

The NASA Center for Computational Sciences (NCCS) in Greenbelt, MD

Goddard is the world’s largest organization of Earth scientists and engineers


Discover System & Workload

“Discover” is currently the largest NCCS system: a Linux Networx & IBM cluster with 4,648 CPUs, using dual- and quad-core Dempsey, Woodcrest, and Harpertown processors.


Discover System & Workload

The predominant NCCS workload is global climate modeling. An important modeling code is the cubed sphere.

Today’s presentation includes cubed sphere benchmark results as well as synthetic kernels.


Cubed Sphere Benchmark Results & Discussion

The benchmark tests were run with the finite volume cubed sphere dynamic core application running the “benchmark 1” test case, a 15-day simulation from a balanced hydrostatic baroclinic state at 100-km resolution with 26 levels.

The test case was run on Discover using only the Intel Harpertown quad-core processors, using 3, 6, and 12 nodes and varying the active cores per node among 2, 4, 6, and 8.


Cores, Processors, Nodes

Cores = central processing units, including the logic needed to execute the instruction set, registers & local cache

Processors = one or more cores on a single chip, in a single socket, including shared cache, network and memory access connections

Node = a board with one or more processors and local memory, network attached

[Diagram: an 8-core, 2-socket node with two 4-core Harpertown processors and local memory. Note: the diagram does not represent physical layout.]


Benchmark Results & Discussion

Note that the largest performance problem with the GEOS-5 workload is I/O, a problem that is growing for everyone in the HPC community.

The dynamical core is about ¼ of the processor workload at this time. The remaining ¾ is physics, which is much more cache and memory friendly than the dynamics.


Benchmark Results & Discussion

Nodes   Cores per Node   Total Cores   Wall Time (s)   % Communication
  12          2               24           371.1             5.3
  12          4               48           212.8            10.19
  12          6               72           181.9            17.33
  12          8               96           178.5            20.74
   6          2               12           676.1             3.96
   6          4               24           411.6             7.01
   6          6               36           339.2            12.82
   6          8               48           318.3            14.16
   3          2                6          1336.9             2.77
   3          4               12           771.2             5.58
   3          6               18           658.8            11.68
   3          8               24           601.3             9.88


Benchmark Results & Discussion

Running the cubed sphere benchmark shows that, for the same total core count, using fewer cores per node can improve the runtime by 38%.

Results show performance degradation, as cores per node increase, in application runtime, MPI behavior, and on-chip cache performance.


Benchmark Results & Discussion

Parallel efficiency is the single-core execution time divided by the processor-time (runtime × total cores).
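
As a minimal sketch of this formula (not the authors' code), efficiency can be computed as below; the single-core time used here is an assumed, hypothetical value, since the smallest run in the table above uses 6 total cores.

#include <stdio.h>

/* Parallel efficiency = T1 / (Tp * p): single-core execution time
   divided by processor-time (runtime * total cores). */
static double parallel_efficiency(double t_single, double t_parallel, int total_cores)
{
    return t_single / (t_parallel * (double)total_cores);
}

int main(void)
{
    /* Illustrative only: t_single is a hypothetical single-core time;
       371.1 s on 24 cores is the 12-node, 2-cores-per-node row above. */
    double t_single = 7000.0;   /* assumed, not measured */
    printf("efficiency = %.2f\n", parallel_efficiency(t_single, 371.1, 24));
    return 0;
}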


Benchmark Results & Discussion

This chart shows that just reducing the number of cores per node increases efficiency by an average of 53%.


Benchmark Results & Discussion

The application’s use of MPI, and the placement of work into ranks and then onto cores, is important when considering overall runtime.
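
A minimal illustration (assumed instrumentation, not part of the benchmark) of making that placement visible: each rank reports the node it runs on, so the rank-to-node mapping for a given cores-per-node setting can be checked.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(node, &len);

    /* Print the node each rank was placed on; with 2 of 8 cores
       active per node, the same ranks spread across more nodes. */
    printf("rank %d of %d on node %s\n", rank, size, node);

    MPI_Finalize();
    return 0;
}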

The following charts show how MPI performance is affected by various core configurations for the 24-core problem size at 2, 4, and 8 cores per node.


Benchmark Results & Discussion

Charts 2a, 2b, and 2c show, with the green line, the total time spent in MPI per rank for the entire benchmark run at 2, 4, and 8 cores per node, respectively.

For 2 cores per node the MPI time is around 23 seconds; for 4 cores, around 26 seconds; and for 8 cores we see more fluctuation, with MPI costing around 60 seconds. System time also creeps up slowly as we increase the per-node core count.
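
A minimal sketch of how per-rank MPI time could be accumulated by hand with MPI_Wtime; this is an assumed, illustrative approach, not how the charts here were produced.

#include <mpi.h>
#include <stdio.h>

/* Accumulate the time this rank spends inside MPI by wrapping
   communication calls with MPI_Wtime. */
static double mpi_time = 0.0;

static void timed_allreduce(double *in, double *out, int n, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, comm);
    mpi_time += MPI_Wtime() - t0;
}

int main(int argc, char **argv)
{
    int rank;
    double local = 1.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    timed_allreduce(&local, &global, 1, MPI_COMM_WORLD);

    printf("rank %d: %.6f s in MPI\n", rank, mpi_time);
    MPI_Finalize();
    return 0;
}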


Benchmark Results & Discussion

2 active cores per node: runtime ~370 seconds, MPI ~20 seconds


Benchmark Results & Discussion

4 active cores per node: runtime ~410 seconds, MPI ~30 seconds


Benchmark Results & Discussion

8 active cores per node: runtime ~600 seconds (almost double vs. 2 cores), MPI ~60 seconds (triple vs. 2 cores)


Benchmark Results & Discussion

The next charts break down the overall MPI use from the previous charts.

They show how time is spent within MPI on a per-rank basis and how each MPI rank is affected by the core configuration.

Note the variation in the maximum y-axis value between charts.


Benchmark Results & Discussion

Components of the green line in the earlier charts.

2 active cores per node: about 20 seconds total MPI time, the same as before.


Benchmark Results & Discussion

4 active cores per node: about 30 seconds total MPI time.


Benchmark Results & Discussion

8 active cores per node: about 60 seconds total MPI time, with more variability.


Benchmark Results & Discussion

On-chip resource contention between cores is a well documented byproduct of the “multi-core revolution” and manifests itself in significant reductions in available memory bandwidth.

This fight for cache bandwidth between cores affects application runtime and, as expected, is exacerbated by increasing the number of cores used per node.


Benchmark Results & Discussion

The following charts provide a brief overview of the costs associated with using a multi-core chip (Woodcrest vs. Harpertown) in terms of reading and writing to the cache with various block sizes, cache miss latency, and cache access time.

Note that the Woodcrest nodes run at 2.66 GHz and have a 4 MB shared L2 cache, thus 2 MB per core, with each core having access to a 32K L1 (16K for data and 16K for instructions).

The Harpertown nodes have two quad-core chips running at 2.5 GHz. Within each quad-core chip, each pair of cores has access to a 6 MB L2 cache and a 64K L1 cache (32K × 2).

Chip         Cores   Chips/Node   Clock      L1         L2
Woodcrest      2         2        2.66 GHz   32K each   4M shared
Harpertown     4         2        2.5 GHz    64K/pair   6M shared/pair
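
The memory kernel behind these charts is adapted from Hennessy & Patterson (see Reference). A minimal sketch of the general stride-access technique, not the exact kernel used here:

#include <stdio.h>
#include <time.h>

#define ARRAY_BYTES (64 * 1024 * 1024)   /* array larger than any L2 cache */
static volatile char data[ARRAY_BYTES];

int main(void)
{
    /* Sweep stride sizes; larger strides defeat cache-line reuse and
       expose cache-miss latency and memory bandwidth limits. */
    for (long stride = 64; stride <= ARRAY_BYTES / 2; stride *= 2) {
        long accesses = 0;
        clock_t t0 = clock();
        for (int rep = 0; rep < 16; rep++) {
            for (long i = 0; i < ARRAY_BYTES; i += stride) {
                data[i] = data[i] + 1;   /* one read + one write */
                accesses++;
            }
        }
        double ns = 1e9 * (double)(clock() - t0) / CLOCKS_PER_SEC / (double)accesses;
        printf("stride %8ld bytes: %.1f ns per read+write\n", stride, ns);
    }
    return 0;
}

To reproduce the contention measurements, one copy of such a loop would run on each active core, so the shared L2 and memory bandwidth are stressed simultaneously.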


Benchmark Results & Discussion

[Chart: read + write time (ns), 0 to 700, vs. stride size in Kbytes (0.001 to 100000), for 2 active cores (Woodcrest), 4 active cores (Woodcrest), and 8 active cores (Harpertown).]


Benchmark Results & Discussion

There is a dramatic increase in latency as the number of active cores increases: from 2 to 4 cores, approximately a factor of 4.

From 2 to 8 cores (with roughly similar performance between Woodcrest and Harpertown), about a factor of 8 increase.

There are plateaus and steep inclines, particularly in the single-core results, showing that there is sensitivity to locality. Locality is under the programmer’s control.
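
As a small, generic illustration of locality under the programmer's control (not taken from the benchmark), the same arithmetic can be cache-friendly or cache-hostile depending only on access order:

#include <stdio.h>

#define N 1024

static double a[N][N];

/* Row-major traversal touches consecutive addresses and reuses each
   cache line; column-major traversal strides N*8 bytes per access
   and misses far more often, even though the arithmetic is identical. */
static double sum_rows(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

static double sum_cols(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    printf("%f %f\n", sum_rows(), sum_cols());
    return 0;
}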


Concluding Remarks

This presentation examined performance differences for the cubed sphere benchmark on Harpertown nodes by varying the active core count per node, e.g., 38% better runtime on 2 cores per node vs. 8.

MPI performance degrades if total core count and problem size are fixed but core density increases, e.g., MPI time triples going from 2 to 8 cores per node.


Concluding Remarks

Cache read and write times also degrade as core density increases, e.g., about 8-fold going from 2 to 8 cores per node.

Runtime seems to be affected by the number of cores used per node, due to resource contention in the multi-core environment.


Concluding Remarks

Scheduling a job to run on fewer processors can improve run-time, at a cost of reduced processor utilization (which may not be a problem).

To date, most NCCS user/scientists are more concerned with science details and portability than with performance optimization. We expect their concern to grow with fixed clock rates and memory contention.


Concluding Remarks

Application-level optimization is more work for the user.

The direction of all processor designs is towards cells and hybrids. It is unknown if this will work.

A hybrid approach, e.g., using OpenMP or cell processors, may improve performance by avoiding off-chip memory contention.
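
A minimal sketch of the hybrid idea, assuming one MPI rank per processor or node with OpenMP threads across its cores (illustrative only, not the GEOS-5 code):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    double local = 0.0, global = 0.0;

    /* One MPI rank per socket/node; OpenMP threads share that rank's
       memory, so intra-node work avoids extra MPI traffic. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (i + 1.0);

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}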


POCs

[email protected] (nasa.gov)
[email protected] (gsa.gov)


Reference

Hennessy, John and Patterson, David. Computer Architecture: A Quantitative Approach, 2nd Edition. Morgan Kaufmann, San Mateo, California. The memory stride kernel is adapted from a code appearing on page 477.


Diagram of a Generic Dual-Core Processor


Sandia Multi-Core Performance Prediction


[Chart: read + write time (ns), 0 to 50, vs. stride size in Kbytes (0.001 to 100000), for 2 active cores (Woodcrest), 4 active cores (Woodcrest), and 8 active cores (Harpertown).]
