

Intel Xeon Processor 7500/6500 Series Memory Performance Optimization
October 2010

Memory Performance Optimization for the IBM System x3850/x3950 X5, x3690 X5, and BladeCenter HX5 Platforms, Using Intel Xeon 7500 and 6500 Series Processors

By Dustin Fredrickson, Ganesh Balakrishnan, Charles Stephan, Alicia Wimbush, and George Patsilaras
IBM System x and BladeCenter Performance


TABLE OF CONTENTS

Abstract
Introduction
System Architecture
    x3850 X5 and x3950 X5
    x3690 X5
    HX5 Blade Server
Memory Performance
    Memory Speed
        Factors Controlling Memory Frequency
            Processor Type
            DIMM Frequency
            System Settings
        Memory Frequency Impacts on Memory Benchmarks
        Memory Speed Impacts on Applications
    DIMM Types and Ranks
        DIMM Ranks and STREAM Triad
        DIMM Ranks and Application Performance
    Memory Population and Balance
        Hemisphere Mode: What is it and how do I enable it?
        Memory Balance Impact on Memory Throughput
        Memory Balance Impact on Application Performance
        Memory Population Across Processors
        Performance Impact of Using Multiple DIMM Types and Capacities
    Memory Redundancy and Performance
        Normal Memory Mode
        Memory Mirroring Mode
        Memory Sparing
        Memory Throughput and Latency for Various Memory Modes
Best Practices for Optimal Memory Performance
Conclusion


Abstract

This paper examines the architecture and memory performance of IBM System x platforms utilizing the Intel® Xeon® 7500 and 6500 Series processors. While the all-new architecture of these platforms can provide significant performance gains over predecessor products, it also introduces many new memory performance considerations. The performance analysis in this paper covers memory latency, bandwidth, and application performance, and addresses performance issues related to memory speed and memory population for each IBM platform. The paper also examines optimal memory configurations and best practices for the platforms, and makes recommendations on configuring them for optimal performance.

Introduction

The Xeon 7500 Series is the latest generation of Intel processors, with up to 8 cores and 16 threads per processor socket. Based on 45nm manufacturing technology, these processors are designed to enable systems to scale from 2 to 8 processor sockets, locally connected or scaled via a node controller. Natively supporting 2 integrated memory controllers and 4 high-speed scalable memory interfaces, as well as up to 24MB of shared L3 cache, these processors provide headroom for the most demanding, highly threaded applications. The Xeon 6500 Series processors are architecturally similar to the 7500 Series, but are limited to 2-socket local or 4-socket scaled configurations, and are not available in the fastest clock speeds or with the largest L3 cache sizes. These processors are designed to enable a price-optimized 2-socket, enterprise-class server with all the RAS and performance features of the 7500 Series. The remainder of this paper refers only to the 7500 Series processors; however, the 6500 Series functions and performs similarly in many cases, subject to the restrictions mentioned above. The 7500 Series provides a common building block across a number of new IBM platforms, including the IBM BladeCenter® HX5, the 2U IBM System x3690 X5, and the 4U x3850 X5 and x3950 X5 rack servers.

A number of key performance features are introduced with the Xeon 7500 Series processors that are significantly different from previous 4-socket (or higher) capable processors. As depicted in Figure 1, these include:

• Dual memory controllers integrated into each processor to maximize bandwidth.

– Like the Xeon 5500 and 5600 Series processors, the 7500 Series processors exploit a Non-Uniform Memory Access (NUMA) architecture, as opposed to the shared front-side bus and shared memory controller architectures of previous generation systems.

• Dual Scalable Memory Interconnect (SMI) Channels per memory controller (4 SMI channels per CPU socket), which operate in lock-step.

• Memory buffers connected to each SMI channel that provide a bridge between the SMI channel and two DDR3 memory channels.

– Memory frequency is a fixed ratio of the SMI speed, as follows:

    6.4 GTps SMI = 1066MHz memory frequency
    5.86 GTps SMI = 978MHz memory frequency
    4.8 GTps SMI = 800MHz memory frequency

– Each processor supports up to 8 total DDR3 memory channels and 16 total DIMMs.

• Support for running memory at its top speed, regardless of the number of DIMMs installed or the number of ranks per DIMM.

– This is unlike the Xeon 5500 or 5600 Series, which downclock memory depending on the number of DIMM slots populated and the number of ranks per DIMM.
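The fixed SMI-to-DDR3 ratio listed above can be expressed as a simple lookup. This is an illustrative sketch only (not an IBM or Intel tool), with the pairings transcribed from the bullets above:

```python
# Illustrative sketch: the fixed SMI-to-DDR3 frequency ratio,
# transcribed from the bullet list above.
SMI_TO_MEM_MHZ = {
    6.4:  1066,  # 6.4 GTps SMI  -> 1066MHz memory
    5.86: 978,   # 5.86 GTps SMI -> 978MHz memory
    4.8:  800,   # 4.8 GTps SMI  -> 800MHz memory
}

def memory_mhz_for_smi(smi_gtps):
    """Return the DDR3 memory frequency paired with a given SMI speed."""
    return SMI_TO_MEM_MHZ[smi_gtps]
```

For example, a processor whose SMI links run at 5.86 GTps clocks its memory at 978MHz.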


• Hemisphere Memory Mode (explained in detail later), an advanced memory interleaving mechanism that optimizes memory performance in the Xeon 7500 and 6500 Series.

• A high-speed serial interconnect called the QuickPath Interconnect (QPI), which connects the processors to each other and to the I/O subsystem.

– QPI links are capable of 6.4, 5.86 or 4.8 GTps (gigatransfers per second), depending on the processor SKU.

Figure 1. Xeon 7500 Series processor and memory architecture

Note that the Xeon 7500 Series architecture requires that DIMMs be installed in lock-stepped pairs, at equivalent DIMM locations across the two memory buffers of a processor's memory controller. Figure 1 illustrates one such lock-stepped pair, with each DIMM fed from the second memory channel of memory buffers A and B, respectively. Each DIMM pair has a similar requirement, as will be outlined for each system in the following sections.

System Architecture

In this section, we explore the system architectures of the IBM System x and BladeCenter servers based on the 7500 Series processors. Figure 2 shows a block diagram of the memory subsystem of each processor and the processors' QPI interconnects for a 4-socket Xeon 7500 processor-based system. Note that the 4 QPI links on each processor allow a direct connection to each of the 3 neighboring processors as well as to the I/O subsystem (not shown). This directly connected arrangement allows maximum bandwidth and the lowest possible latency for communication between processors.



Figure 2. Xeon 7500 Series 4-socket system architecture

x3850 X5 and x3950 X5

The x3850 X5 is the follow-on 4-socket rack platform, replacing the x3850 M2. This system can be configured with up to 64 DIMM slots (1TB), the maximum supported natively in a 4-socket Intel Xeon 7500 processor-based system. The 64 DIMMs are organized with 16 attached to each processor, as shown in Figure 2. In the x3850 X5, the 16 DIMMs attached to each processor are distributed across 2 memory cards, with 8 DIMMs on each. The x3950 X5 is identical to the x3850 X5 with the addition of a scalability kit, and is used in multiple-node (chassis) configurations. (The x3850 X5 can be upgraded to an x3950 X5 simply by adding a scalability kit.) A 2-node configuration contains twice as much of everything, including processors, memory, I/O slots, and ports, as the x3850 X5. In addition, an IBM MAX5 memory expansion unit can be attached to the x3850 X5. The 1U MAX5 unit adds an additional 32 DIMM slots, increasing the memory capacity of the server by 50% (up to 1.5TB per node). For the purposes of this paper, however, the benchmark testing was limited to a single-chassis configuration without MAX5.



Each of the memory cards is inserted vertically into the x3850/x3950 X5 chassis. The memory card locations and their SMI connections to each processor are shown in Figure 3.

Figure 3. x3850 X5 memory card population, top chassis view

Figure 4 shows the DIMM layout for these memory cards. Note that each card is fed 2 SMI channels (shown in blue), with each feeding a memory buffer. Each memory buffer then feeds 2 channels of DDR3 DIMMs (shown in green).



Figure 4. x3850/x3950 X5 memory card DIMM layout, side view

Table 1 provides the recommended DIMM population order per memory card. While the second column of Table 1 indicates the minimum lock-stepped DIMM pairs that must be populated together for the system to function, the first column indicates the groupings of DIMMs that should be installed together to ensure that all memory channels are populated. Note that Balanced DIMM Group 1 should be populated across both memory cards of a processor before Balanced DIMM Group 2 is populated. These DIMM groups correspond to the first and second DIMMs on each memory channel, respectively, and the DIMMs within each group should be of the same type and capacity for optimal performance.

Memory Card Population Order on x3850/x3950 X5

Balanced Memory Channel Group    DIMM Population Order per Card    Populate These DIMM Slots on Each Card
Balanced DIMM Group 1            Pair 1                            1 and 8
                                 Pair 2                            3 and 6
Balanced DIMM Group 2            Pair 3                            2 and 7
                                 Pair 4                            4 and 5

Table 1. DIMM population order for each x3850/x3950 X5 memory card

Note: The two memory cards attached to each processor must be configured identically, with the same capacity and type of DIMMs in the same DIMM locations on each card, in order to enable an important memory performance optimization called Hemisphere Mode, discussed further in the section on Hemisphere Mode, below.
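As a minimal illustration of the note above (a hypothetical helper, not an IBM configuration tool), the Hemisphere Mode requirement on the x3850/x3950 X5 reduces to comparing the population of the two memory cards attached to the same processor:

```python
def hemisphere_mode_possible(card_a, card_b):
    """Hypothetical check: Hemisphere Mode requires the two memory cards of a
    processor to be populated identically -- the same DIMM capacity and type
    in the same slot positions on each card.

    card_a, card_b: dict mapping DIMM slot number (1-8) -> (capacity_gb, dimm_type)
    """
    return card_a == card_b

# Balanced DIMM Group 1 (slots 1, 8, 3, 6) filled with matching 4GB DIMMs on both cards:
card_1 = {1: (4, "DDR3-1066"), 8: (4, "DDR3-1066"),
          3: (4, "DDR3-1066"), 6: (4, "DDR3-1066")}
card_2 = dict(card_1)
print(hemisphere_mode_possible(card_1, card_2))  # True
```

Changing even one slot on one card (a different capacity, type, or position) makes the comparison fail, matching the paper's warning that a single mismatched DIMM pair can break this optimization.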



Optimal memory performance will always occur when all the DIMM slots are populated with the same capacity and type of DIMM.

x3690 X5

The x3690 X5 is a 2-socket, enterprise-class platform supporting the Intel Xeon 7500 and 6500 Series processors and up to 32 DIMM slots (512GB). For configurations of more than 16 DIMMs, and to ensure memory is populated on both processor sockets, an optional memory riser card must be installed. As with the x3850 X5 and x3950 X5, an IBM MAX5 memory expansion unit can be attached to the x3690 X5 chassis, adding 32 DIMM slots and doubling the memory capacity of the server (up to 1TB). For the purposes of this paper, however, the benchmark testing was limited to local memory configurations, without MAX5. Figure 5 shows the basic CPU and memory architecture of the 2-socket x3690 X5 system, with the optional riser card configuration highlighted. Note that unlike the 4-socket platform, this system architecture allows dual QPI links to be used for the connection between the processors. This feature enables up to 2x the throughput for remote processor memory accesses.

Figure 5. x3690 X5 2-socket architecture

Figure 6 shows the DIMM orientation of the x3690 X5 from a top-down view of the server chassis.



Figure 6. x3690 X5 mainboard layout

Note that with just one processor installed, all 16 memory slots of the base system are available for use without having to install the optional memory riser card. Figure 7 shows the layout of the optional memory riser card of the x3690 X5. This card is installed horizontally over the top of the DIMMs on the system's mainboard. Note that all of the second processor's DIMMs are located on the memory mezzanine card.



Figure 7. x3690 X5 memory riser card DIMM layout

The x3690 X5 DIMM population order for processors 0 and 1 is provided in Table 2 and Table 3, respectively. While the minimum functional requirements of the system only dictate that DIMMs be installed in equivalent DIMM pairs, per the third column of each table, this population does not necessarily guarantee the best performance. To optimize memory performance on the x3690 X5, it is recommended that DIMMs of equivalent capacity and type be installed across all 8 DDR3 memory channels of a processor to create balanced DIMM configurations. The first columns of Table 2 and Table 3 show the two available Balanced DIMM Channel Groups of the x3690 X5. Balanced DIMM Group 1 consists of the first DIMMs in all 8 DDR3 channels, and Balanced DIMM Group 2 consists of the second DIMMs in each of the channels.



Note: While the two Balanced DIMM Groups on a processor do not have to be populated with the same DIMM types to ensure Hemisphere Mode and memory balance, optimal interleaving will only occur when all the DIMM slots are populated with the same capacity and type of DIMM. If Balanced DIMM Groupings are not possible, DIMMs should at least be installed in their Hemisphere Groups, each comprising 2 DIMM pairs. These are specified in the second columns of Table 2 and Table 3.

CPU-0 Memory Population on x3690 X5

Balanced Memory Channel Group    Hemisphere Group      DIMM Population Order    Populate These DIMM Slots
CPU-0 Balanced DIMM Group 1      CPU-0 Hemisphere 1    Pair 1                   1 and 8
                                                       Pair 2                   9 and 16
                                 CPU-0 Hemisphere 2    Pair 3                   3 and 6
                                                       Pair 4                   11 and 14
CPU-0 Balanced DIMM Group 2      CPU-0 Hemisphere 3    Pair 5                   2 and 7
                                                       Pair 6                   10 and 15
                                 CPU-0 Hemisphere 4    Pair 7                   4 and 5
                                                       Pair 8                   12 and 13

Table 2. DIMM population order for x3690 X5 processor 0

CPU-1 Memory Population on x3690 X5 (requires optional mezzanine card)

Balanced Memory Channel Group    Hemisphere Group      DIMM Population Order    Populate These DIMM Slots
CPU-1 Balanced DIMM Group 1      CPU-1 Hemisphere 1    Pair 1                   17 and 24
                                                       Pair 2                   25 and 32
                                 CPU-1 Hemisphere 2    Pair 3                   19 and 22
                                                       Pair 4                   27 and 30
CPU-1 Balanced DIMM Group 2      CPU-1 Hemisphere 3    Pair 5                   18 and 23
                                                       Pair 6                   26 and 31
                                 CPU-1 Hemisphere 4    Pair 7                   20 and 21
                                                       Pair 8                   28 and 29

Table 3. DIMM population order for x3690 X5 processor 1 with memory mezzanine card
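To make the groupings in Tables 2 and 3 concrete, here is an illustrative sketch (not an IBM tool) that encodes the Balanced DIMM Groups and checks whether a set of populated slots forms a balanced configuration for a given processor:

```python
# Slot groupings transcribed from Tables 2 and 3 (x3690 X5).
BALANCED_GROUPS = {
    0: (  # CPU-0
        {1, 8, 9, 16, 3, 6, 11, 14},     # Balanced DIMM Group 1 (first DIMM on each channel)
        {2, 7, 10, 15, 4, 5, 12, 13},    # Balanced DIMM Group 2 (second DIMM on each channel)
    ),
    1: (  # CPU-1 (on the memory mezzanine card)
        {17, 24, 25, 32, 19, 22, 27, 30},
        {18, 23, 26, 31, 20, 21, 28, 29},
    ),
}

def is_balanced(populated_slots, cpu):
    """A processor is balanced when Balanced DIMM Group 1 is completely filled
    (covering all 8 DDR3 channels), optionally followed by all of Group 2."""
    slots = set(populated_slots)
    g1, g2 = BALANCED_GROUPS[cpu]
    return slots == g1 or slots == g1 | g2
```

For example, `is_balanced([1, 8, 9, 16, 3, 6, 11, 14], 0)` returns `True`, while populating only slots 1 and 8 (a single pair) does not.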

HX5 Blade Server

The HX5 is a scalable 2-socket Xeon 7500 processor-based blade with 8 DIMM slots per processor, for a total of 16 DIMM slots per blade. From an architectural standpoint, the key differences between the HX5 and its rack counterparts, as shown in Figure 8, are as follows:

• HX5 supports 8 DIMMs per processor instead of 16.

• HX5 uses more energy-efficient Memory Buffers, supporting a maximum frequency of 978MHz.

• HX5 supports the Speed Burst Card option which, like the dual QPI links of the x3690 X5, doubles the QPI bandwidth between the two processors.


Figure 8. Single-node HX5 architecture

In addition, the HX5 can be scaled to two nodes for a total of 4 processors. Two HX5 blades can be connected by installing a 2-node Scalability Connector, as shown in Figure 9. By scaling the number of nodes from 1 to 2, the system doubles not only the number of processors and DIMMs but also the I/O capability, resulting in better overall scaling.



Figure 9. Two-node HX5 architecture

As shown in Figure 10, each processor is connected to four memory buffers using high-speed SMI links. Each memory buffer connects to two DIMM slots, one DIMM slot per DDR3 memory channel.

Note: All DIMM slots must be populated in order to use all memory channels.



Figure 10. DIMM layout for HX5 single node

Table 4 and Table 5 show the memory population order for processors 0 and 1 of the HX5 blade, respectively. While the minimum functional requirements of the system only dictate that DIMMs be installed in equivalent DIMM pairs, per the second column of each table, it is highly recommended that DIMMs of equivalent capacity and type be used within each of the Hemisphere Groups, defined in the first column.

Note: While different DIMMs may be used between the Hemisphere Groups, the best interleaving performance is achieved only when both Hemisphere Groups of a processor are identically configured. For the HX5, this means that best performance is achieved only when all DIMMs are of equal type and capacity. Similarly, while not a requirement for memory channel interleaving or Hemisphere Mode, best performance is typically achieved when both CPUs are configured with equivalent memory capacity, ensuring optimal NUMA locality for most application types.



CPU-0 Memory Population on HX5

Hemisphere Group      DIMM Population Order    Populate These DIMM Locations
CPU-0 Hemisphere 1    Pair 1                   DIMM slots 1 and 4
                      Pair 2                   DIMM slots 5 and 8
CPU-0 Hemisphere 2    Pair 3                   DIMM slots 2 and 3
                      Pair 4                   DIMM slots 6 and 7

Table 4. DIMM population order for HX5 processor 0

CPU-1 Memory Population on HX5

Hemisphere Group      DIMM Population Order    Populate These DIMM Locations
CPU-1 Hemisphere 1    Pair 1                   DIMM slots 9 and 12
                      Pair 2                   DIMM slots 13 and 16
CPU-1 Hemisphere 2    Pair 3                   DIMM slots 10 and 11
                      Pair 4                   DIMM slots 14 and 15

Table 5. DIMM population order for HX5 processor 1

Like its rack-optimized cousins, the HX5 also supports a MAX5 memory expansion unit, installed in an adjacent blade slot to create a double-wide 60mm server. The MAX5 unit for the HX5 offers 24 DIMM slots, allowing the HX5 to support up to 40 DIMMs (320GB).

Memory Performance

There are many variables that affect memory performance in a Xeon 7500 processor-based server, including memory speed, memory ranks, and memory population across the server's many memory channels and processors. While one might assume that having as many as 32 separate DDR3 memory channels in a 4-socket platform like the x3850 X5 and x3950 X5 would make careful attention to memory configuration unnecessary, this is far from true. In fact, incorrectly configuring a single pair of DIMMs can cost the server over a third of its memory performance. This section walks through the critical memory configuration choices for this architecture and their impacts on performance.

Memory Speed

Memory performance is controlled by a number of factors, memory speed being one of the most critical. It is therefore important to understand not only the performance characteristics when memory frequency is changed, but also the factors that control memory speed in a particular architecture. In the Intel Xeon 7500 Series, memory can be clocked at three frequencies: 1066MHz, 978MHz, and 800MHz. The maximum memory speed is controlled entirely by the processor type and the frequency of the DIMM. Unlike with the Xeon 5600 Series processors, it is NOT controlled by the number of DIMMs per channel or the ranks per DIMM.


Factors Controlling Memory Frequency

As stated above, memory frequency is limited by the maximum common frequency supported by the processors and DIMMs, and can be set lower by the user if desired.

Processor Type

The processor type is one of the primary factors that determine the maximum possible memory frequency. Table 6 shows the maximum memory frequency for each processor.

Category    Processor    TDP Rating (Watts)    Core Freq. (GHz)    Core Count    Cache Size (MB)    QPI Speed (GTps)    Max. Memory Freq. (MHz)
Advanced    X7560        130                   2.26                8             24                 6.4                 1066
Advanced    X7550        130                   2.00                8             18                 6.4                 1066
Advanced    X6550        130                   2.00                8             18                 6.4                 1066
Standard    E7540        105                   2.00                6             18                 6.4                 1066
Standard    E6540        105                   2.00                6             18                 6.4                 1066
Standard    E7530        105                   1.86                6             12                 5.86                978
Basic       E7520        95                    1.86                4             18                 4.8                 800
Basic       E6510        105                   1.73                4             12                 4.8                 800
LV          L7555        95                    1.86                8             24                 5.86                978
LV          L7545        95                    1.86                6             18                 5.86                978
HPC         X7542        130                   2.66                6             18                 5.86                978

Table 6. Processor SKUs and capabilities

DIMM Frequency

The DIMM type is another factor that controls the maximum memory frequency. The supported DDR3 DIMMs are available at three frequencies: 1333MHz, 1066MHz, and 800MHz. The maximum memory frequency is the lower of the DIMM frequency and the maximum frequency supported by the processor; the system automatically clocks DIMMs down, if needed, to match the frequency supported by the processor. DIMMs should therefore be selected in relation to the capability of the processor. For example, if 1333MHz DIMMs are used with an X7560 processor (maximum 1066MHz), the memory will run at 1066MHz, so there is no benefit to using the faster, and possibly more expensive, memory.

System Settings

If desired, memory can be set to a frequency lower than the platform maximum via settings in the system's UEFI (Unified Extensible Firmware Interface) shell. Memory frequency is typically set lower to save energy in environments with little memory performance sensitivity.
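The frequency rules from the last few sections can be summarized in a small sketch (an illustrative helper, not actual firmware logic): the effective memory speed is the minimum of the processor limit, the DIMM rating, and any lower UEFI setting.

```python
def effective_memory_mhz(cpu_max_mhz, dimm_mhz, uefi_setting_mhz=None):
    """Illustrative only: memory runs at the lowest of the processor's maximum
    supported frequency, the DIMM's rated frequency, and (optionally) a lower
    frequency selected in UEFI."""
    freq = min(cpu_max_mhz, dimm_mhz)
    if uefi_setting_mhz is not None:
        freq = min(freq, uefi_setting_mhz)
    return freq

# X7560 (1066MHz max) with 1333MHz DIMMs: memory runs at 1066MHz
print(effective_memory_mhz(1066, 1333))                        # 1066
# Same system with UEFI set to 800MHz to save energy:
print(effective_memory_mhz(1066, 1333, uefi_setting_mhz=800))  # 800
```

This mirrors the X7560 example above: the 1333MHz DIMMs are automatically clocked down to the processor's 1066MHz limit.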


Memory Frequency Impacts on Memory Benchmarks

To understand the performance impact of running memory at different frequencies, we used an x3850 X5 system configured with four X7560 processors and 64 4GB quad-rank DIMMs. Figure 11 shows the peak memory throughput per processor, measured using a highly optimized, IBM-internal memory load generation tool. As shown in the figure, throughput increases 11% when the frequency is raised from 800MHz to 978MHz, and another 6% when raised from 978MHz to 1066MHz.

Figure 11. Peak memory bandwidth at different memory frequencies

To calculate memory bandwidth for more than one processor, the numbers in this chart can simply be multiplied by the number of processors. This memory tool, like some HPC applications, is 100% NUMA-local: it runs separate threads on each processor core with virtually no shared memory, so there are very few remote memory requests. Because this x3850 X5 platform is also configured with a fully-connected QPI topology, where all processors are directly connected to one another, scaling from 1 to 4 processors is near-linear.

Note: The peak throughputs generated using this memory load generator represent the absolute maximum achievable with the highest-speed "top bin" processor. With lower processor speeds or less efficient memory-loading applications, the processor cores can become the bottleneck, and the memory bandwidth sustainable at the application layer may be lower.

Figure 12 shows the peak memory bandwidth for each of the X5 platforms.
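The near-linear scaling described above for NUMA-local workloads amounts to simple multiplication. A back-of-envelope sketch, using the per-processor values from Figure 11:

```python
# Back-of-envelope sketch assuming the near-linear NUMA-local scaling
# described above (fully-connected QPI, virtually no remote requests).

PER_PROCESSOR_GBPS = {1066: 27.1, 978: 25.6, 800: 23.0}  # Figure 11 values

def aggregate_bandwidth_gbps(freq_mhz: int, processors: int) -> float:
    """Estimate peak system memory bandwidth for a NUMA-local workload."""
    return PER_PROCESSOR_GBPS[freq_mhz] * processors

# A 4-processor x3850 X5 at 1066MHz: roughly 4 x 27.1 GBps
print(aggregate_bandwidth_gbps(1066, 4))  # 108.4
```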

[Figure 11 data — Peak Memory Throughput Per Processor (GBps): 1066MHz: 27.1; 978MHz: 25.6; 800MHz: 23.0]


Figure 12. Chart showing peak memory bandwidth by system

The x3850 X5 was populated with four X7560 processors and 64 4GB quad-rank DIMMs running at 1066MHz. As expected, the observed memory throughput was 4 times the per-processor value shown in Figure 11.

The x3690 X5 was populated with two X7560 processors and 32 4GB quad-rank DIMMs running at 1066MHz. The memory throughput observed on this platform was 2 times that shown in Figure 11.

The HX5 blade can clock memory at a maximum speed of only 978MHz because it uses low-voltage Memory Buffers and the ‘LV’ processor models, which have a maximum supported memory frequency of 978MHz. In addition, it has half the number of DIMM slots per processor. These factors account for the lower throughputs as compared to the other platforms shown.

One of the most commonly used benchmarks to gauge memory performance is the STREAM memory benchmark. This benchmark consists of 4 parts, but most commonly the TRIAD subcomponent is quoted. As shown in Figure 13, TRIAD throughput on the x3850 X5 is approximately 18.6GBps per processor.

[Figure 12 data — Peak Memory Throughput By System (GBps): x3850 X5 4P: 108.3; HX5 4P: 99.3; x3690 X5 2P: 54.1; HX5 2P: 49.6]


Figure 13. Chart showing STREAM TRIAD performance

The TRIAD component of STREAM involves 2 read operations and 1 write operation from the application's perspective. Under the covers, however, for every write operation submitted by the application, the processor must first read the data into the processor cache from memory before writing it back out to memory. This is called Read for Ownership, and it ensures that two processor caches never modify the same piece of data at the same time. As a result, at the memory bus level, TRIAD generates 3 read operations and 1 write operation. Therefore, when TRIAD reports 18.6GBps at the application level, the actual bus-level memory throughput is 18.6 * 4/3 = 24.8GBps. This is important to understand only because these bus-level bandwidths are sometimes quoted as memory throughput numbers, even though this level of bandwidth is not what the application actually sees.

Like other memory tools, STREAM bandwidth scales linearly with the number of processors in the IBM platforms discussed in this paper, and aggregate system bandwidth can be determined by multiplying the numbers shown in Figure 13 by the number of processors installed. This scaling assumption is valid when each thread of STREAM is able to allocate memory local to the processor it is scheduled on, which may not occur in some non-optimized versions of the benchmark. While STREAM is often quoted as the standard benchmark for memory bandwidth, the bandwidth it can actually generate is considerably less than the peak memory bandwidth of the platform.

Memory Speed Impacts on Applications

Figure 14 compares actual application performance at differing memory frequencies. For integer applications, which generally tend to be more processor-frequency and cache sensitive than memory-frequency sensitive, the performance impacts from memory speed changes are relatively low.
The SPECjbb2005 benchmark falls into this category, and shows virtually no gains from increased memory frequency. The SPECint_rate_base2006 benchmark, in general, also shows very small increases due to memory frequency changes. However, not all integer-based applications are equal, and certain subcomponent workloads of this benchmark, like the



462.libquantum workload, show as much as 11% gain going from 800MHz to 1066MHz memory frequency. Floating point workloads are generally more memory intensive and tend to gain more performance from higher memory bandwidth. This trend is reflected in Figure 14, where we see a 4% gain going from 800MHz to 978MHz and another 4% gain from 978MHz to 1066MHz memory speeds. Of course, these are geometric mean values, and some subcomponent workloads like 433.milc can show gains approaching 33%, consistent with the memory speed increase from 800MHz to 1066MHz.
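Stepping back to the STREAM TRIAD discussion above, the read-for-ownership accounting can be captured in a small helper. This is simply a sketch of the arithmetic described earlier:

```python
# Read for Ownership: each application-level write first pulls the target
# line into cache, so the bus sees 3 reads + 1 write for TRIAD's
# application-visible 2 reads + 1 write -- a 4/3 traffic multiplier.

def bus_level_bandwidth_gbps(reported_triad_gbps: float) -> float:
    """Convert application-reported TRIAD bandwidth to bus-level bandwidth."""
    return reported_triad_gbps * 4.0 / 3.0

# The 18.6GBps reported per processor implies ~24.8GBps on the memory bus:
print(round(bus_level_bandwidth_gbps(18.6), 1))  # 24.8
```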

Figure 14. Chart showing application performance at various memory frequencies

DIMM Types and Ranks

Earlier, we pointed out that the number of ranks per DIMM does not change DIMM frequency. However, the number of ranks per channel does change memory performance. In this section we investigate how this works. The data in this section is presented to support the following assertions:

• Using all DDR3 memory channels (8 per processor) is critical for performance in environments which have any memory sensitivity.

• More ranks per memory channel is always better with the Xeon 7500 processors.

Three types of DIMMs were used for this analysis: 4GB 4Rx8 (quad-rank x8), 2GB 2Rx8 (dual-rank x8), and 1GB 1Rx8 (single-rank x8) DIMMs. Performance was measured using a highly optimized low-level memory performance tool, the industry-standard STREAM memory benchmark, and the SPECjbb2005 application benchmark. In all cases, enough memory was provided to the processors that the memory capacity differences among the configurations did not affect the results.

Three different memory configurations for each type of memory DIMM were used for the study. While the following performance information was collected on the x3850 X5 platform, the results and methodologies will apply across each of the Xeon 7500 processor-based platforms. The memory configurations used were as follows:

• Full-memory population, using 2 DIMMs on each memory channel — 8 DIMMs per x3850/x3950 X5 memory card

[Figure 14 data — Relative application performance at 800MHz / 978MHz / 1066MHz: SPECint_rate_base: 100 / 101 / 103; SPECfp_rate_base: 100 / 104 / 108; SPECjbb: 100 / 100 / 101]


• Half-memory population, using 1 DIMM on each memory channel — 4 DIMMs per x3850/x3950 X5 memory card (slots 1, 3, 6, and 8 in Figure 4)

• Quarter-populated memory, using 1 DIMM on just half the memory channels — 2 DIMMs per x3850/x3950 X5 memory card (slots 1 and 8 in Figure 4)

The memory was installed following “best practices” guidelines in order to maximize use of all available DDR3 memory channels where possible. In all cases, all memory cards are utilized with equivalent capacity and type of DIMM to provide optimal memory balance.

DIMM Ranks and STREAM Triad

In this section we use the STREAM benchmark to stress the memory subsystem. Figure 15 illustrates the effects on memory throughput of varying the number of ranks per channel.

Figure 15. STREAM throughput comparisons by DIMM ranks

As seen in the data, a non-optimal memory configuration can show as much as 58% lower bandwidth than the optimized full population of quad-rank DIMMs. Though not shown in this data, workloads with a greater memory-write component can be affected even more: comparing the top and bottom memory configurations of Figure 15, as much as 78% performance loss was seen for 100% memory-write workloads.

DIMM Ranks and Application Performance

In this section we use the SPECjbb2005 benchmark to investigate what happens when we vary the number of ranks per channel. Unlike the STREAM benchmark used in the previous sections, SPECjbb2005 is largely core and cache sensitive and stresses the memory subsystem very little. Figure 16 illustrates relative SPECjbb2005 performance with respect to ranks per memory channel. The smaller memory capacities were omitted from this testing because those configurations did not provide enough memory capacity for the benchmark to run efficiently, so any comparisons would be inaccurate.

[Figure 15 data — Relative STREAM Triad throughput by DIMM population per processor: 16x 4GB (4R): 100; 16x 2GB (2R): 95; 16x 1GB (1R): 89; 8x 4GB (4R): 98; 8x 2GB (2R): 89; 8x 1GB (1R): 73; 4x 4GB (4R): 55; 4x 2GB (2R): 52; 4x 1GB (1R): 42]


Figure 16. SPECjbb2005 performance comparison by DIMM ranks

This data suggests that the number of ranks per channel is much less critical for environments that do not stress the memory subsystem; the performance losses comparing optimal and non-optimal balanced memory configurations for this application type were negligible.

Memory Population and Balance

Memory population balance is more critical to performance for the Xeon 7500 Series processor than in any previous Xeon platform. The contributing factors are: memory interleaving optimizations across the many memory channels, memory balance in NUMA environments, and a new memory interleaving mechanism introduced with this processor called Hemisphere Mode. As described in the following sections, the impacts of not populating memory optimally can be extreme. The key rules for memory configuration are:

• Utilize configurations that enable Hemisphere Mode

• Populate memory on all memory channels

Hemisphere Mode: What is it and how do I enable it?

Hemisphere Mode is an important performance optimization of the Xeon 7500 Series that is enabled automatically by the system when the memory configuration allows. Fundamentally, Hemisphere Mode interleaves memory requests between the two memory controllers inside each processor (shown in Figure 1), reducing latency and increasing throughput. This mode also allows the processor to optimize its internal buffers to maximize memory throughput.

Hemisphere Mode is a global optimization set at the system level. This means that if even one processor's memory is incorrectly configured, the entire system will lose the performance benefits of this optimization: either all processors in the system use Hemisphere Mode, or none do.

Hemisphere Mode is enabled only when the memory configuration for each memory controller in a processor is identical. Because the Xeon 7500 Series memory population rules dictate a minimum of 2 DIMMs on each memory controller at a time (one on each of the attached memory buffers), DIMMs must be installed in quantities of 4 per processor to enable Hemisphere Mode. However, because 8 DIMMs per processor are required to utilize all memory channels, it is

[Figure 16 data — Relative SPECjbb2005 performance by DIMM population per processor: 16x 4GB (4R): 99.6; 8x 4GB (4R): 100.0; 4x 4GB (4R): 99.5; 16x 2GB (2R): 100.0; 8x 2GB (2R): 100.1; 16x 1GB (1R): 99.6]


highly recommended that 8 DIMMs per processor be installed initially for optimized memory performance.

Note: In the x3850/x3950 X5, because one memory card is attached to each memory controller of a processor, Hemisphere Mode is enabled only when the 2 memory cards assigned to each processor are identically populated. Also, because the HX5 blade has only 8 DIMM slots per processor, the best performance is achieved only when all DIMM slots are populated with equivalent capacity and type of DIMMs.

Hemisphere Mode does not require that the memory configuration of both CPUs be identical. For example, Hemisphere Mode will still be enabled if processor 0 is configured with 8 4GB DIMMs and processor 1 is configured with 8 2GB DIMMs. Depending on the application characteristics, however, an unbalanced memory configuration like this can reduce performance by forcing a larger number of remote memory requests to the processors with more memory.

These performance rules can be confusing. To summarize:

• There are two memory channels per memory buffer, two buffers per memory controller, and two controllers per processor. Each memory channel should contain at least one DIMM.

• Within a processor, both memory controllers should contain identical DIMM configurations in order to enable Hemisphere Mode. This means that for best results, at least 8 DIMMs per processor (spread across all memory cards) should be installed.

• Each processor can have a different configuration from the others, but this may not be optimal for some applications.
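The summary rules above can be sketched as a small eligibility check. The data model here (a pair of per-controller DIMM tuples for each processor) is invented for illustration; it is not an IBM or Intel API:

```python
# Hypothetical model: each processor is a pair (controller_1, controller_2),
# where each controller's population is a tuple of (capacity_gb, ranks)
# entries, one per installed DIMM.

def hemisphere_enabled(processors):
    """Hemisphere Mode is global: it is enabled only if every processor's
    two memory controllers carry identical DIMM populations."""
    return all(sorted(mc1) == sorted(mc2) for mc1, mc2 in processors)

cpu0 = (((4, 4),) * 4, ((4, 4),) * 4)  # 8x 4GB quad-rank, balanced
cpu1 = (((2, 2),) * 4, ((2, 2),) * 4)  # 8x 2GB dual-rank, balanced

# Processors need not match each other, only their own two controllers:
print(hemisphere_enabled([cpu0, cpu1]))  # True

# One unbalanced processor disables Hemisphere Mode for the whole system:
cpu0_bad = (((4, 4),) * 4, ((4, 4),) * 2)
print(hemisphere_enabled([cpu0_bad, cpu1]))  # False
```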

Memory Balance Impact on Memory Throughput

Figure 17 shows the impact to peak memory throughput for six different memory configurations, some with and some without Hemisphere Mode.


Figure 17. Memory throughput and Hemisphere Mode

This data indicates that the performance impact from simply losing Hemisphere Mode is approximately 16%, comparing Config-1 and Config-2. The loss of memory channels is more severe: nearly 40% of memory throughput is lost if only half of the memory channels are populated, even with Hemisphere Mode maintained. If only a single memory controller is used (i.e., one memory card per processor on an x3850 X5), the performance loss can range from 34% to 51%, depending on the number of DIMMs.

[Figure 17 data — Relative memory throughput by memory configuration: Config-1: 100; Config-2: 84 (Hemisphere Mode disabled); Config-3: 61; Config-4: 50; Config-5: 49; Config-6: 66. The original figure diagrams the memory buffer (MB-A through MB-D) population of each configuration.]


Note, however, that this data assumes that all memory is allocated and all memory locations are utilized by the application equivalently. For unbalanced or non-Hemisphere configurations, there is no guarantee that both memory controllers within the processor will be equivalently utilized if the application does not use all memory locations equivalently. If an application uses only a portion of the memory space, it may perform worse than this data indicates, and in general shows a high degree of performance variability. It is best to ensure that memory configurations supporting Hemisphere Mode are used, and that memory is balanced across all memory channels, in order to achieve consistently high performance.

Memory Balance Impact on Application Performance

This section once again utilizes the SPEC CPU2006 benchmark as an indicator of application-level performance impacts. Figure 18 shows the aggregate SPECint_rate_base2006 and SPECfp_rate_base2006 results for the memory configurations shown in Figure 17, above.

[Figure 18 data — SPEC CPU2006 relative performance (SPECint_rate_base2006 / SPECfp_rate_base2006), using the memory configurations from Figure 17: Config-1: 100/100; Config-2: 87/90; Config-3: 88/77; Config-4: 83/71; Config-5: 82/71; Config-6: 88/79]

Figure 18. SPEC CPU2006 performance and memory balance

As mentioned earlier in this paper, and per the data above, floating point-based applications like SPECfp_rate generally have higher memory sensitivity than integer-based applications like SPECint_rate. The data in Figure 18 shows that the application impacts range from 10% to 29%, attributed to the loss of Hemisphere Mode and memory channels, depending on application type and memory configuration. While Figure 18 represents the average impact over a number of integer and floating point workloads, Figure 19 and Figure 20 show the performance characteristics of each individual workload for some specific configurations.


[Figure 19 chart — SPECint_rate workload breakout, Config-6 and Config-3 relative to Config-1; per-workload values not reproduced]

Figure 19. SPECint_rate_base2006 workload performance and Hemisphere Mode

Note that approximately half of the integer applications show little to no impact from the loss of Hemisphere Mode, with the most memory-sensitive workload of the mix showing ~40% performance loss, and the rest averaging between 10% and 20% performance loss.

[Figure 20 chart — SPECfp_rate workload breakout, Config-6 and Config-3 relative to Config-1; per-workload values not reproduced]

Figure 20. SPECfp_rate_base2006 workload performance and Hemisphere Mode

While some of the floating point applications of SPECfp_rate also show very low memory sensitivity, this benchmark also has a larger number of workloads that show impacts from loss of Hemisphere Mode in the 30%-40% range. Clearly, the impacts from balanced memory configurations are application-specific. For applications that are extremely processor and cache sensitive, but not memory sensitive, memory


configuration is not critical. However, for application types that do have sensitivity to memory performance, the impacts from non-optimal memory configurations can be significant.

Memory Population Across Processors

Because the Xeon 7500 Series processor uses a Non-Uniform Memory Access (NUMA) architecture, it is important to ensure that all memory controllers in the system are utilized by configuring all processors with memory. Generally speaking, it is optimal to populate all processors identically to provide a balanced system. Using Figure 21 as an example, Processor 0 has DIMMs populated, but no DIMMs are populated on Processor 1. In this case, Processor 0 has access to low-latency local memory and high memory bandwidth, while Processor 1 has access only to remote or "far" memory. Threads executing on Processor 1 will therefore see longer memory latency than threads on Processor 0, due to the penalty incurred traversing the QPI links to access data on the other processor's memory controller. The bandwidth to remote memory is also limited by the capability of the QPI links. The latency to access remote memory is more than 50% higher than local memory access. For these reasons, it is advisable that all processors be populated with memory, keeping in mind the requirements necessary to ensure optimal interleaving and Hemisphere Mode.
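The latency penalty described above can be folded into a simple weighted-average model. This is an illustrative approximation; the 1.5x remote penalty is taken from the "more than 50% higher" figure in the text, and real penalties vary by platform and load:

```python
# Simple weighted-average latency model for NUMA placement. The 1.5x
# remote penalty reflects the ">50% higher" remote latency noted above;
# it is an assumption for illustration, not a measured constant.

def average_latency_ns(local_ns: float, remote_fraction: float,
                       remote_penalty: float = 1.5) -> float:
    """Average latency when remote_fraction of accesses cross the QPI links."""
    remote_ns = local_ns * remote_penalty
    return local_ns * (1.0 - remote_fraction) + remote_ns * remote_fraction

# The unpopulated-socket case: every access from that processor is remote.
print(average_latency_ns(100.0, 1.0))  # 150.0
```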

Figure 21. Diagram showing local and remote memory access

Performance Impact of Using Multiple DIMM Types and Capacities

There are many reasons for considering mixed DIMM configurations:

• Not all applications require the full memory capacity that a homogenous memory population provides

• Cost saving requirements may dictate using a lower memory capacity for some of the platform’s DIMMs

• Some configurations may attempt to use the DIMMs that came with the base platform along with optional DIMM parts of a different type

Regardless of the reason, configurations using mixed DIMM types will deliver lower performance than full populations of equivalent DIMM types. Figure 22 illustrates the relative performance of three mixed memory configurations compared to a baseline fully populated memory configuration. While these configurations use 4GB (4R x8) and 2GB (2R x8) DIMMs as specified, similar trends are expected with other mixed DIMM capacities. In all cases, memory is populated in minimum groups of four, as specified in the following configurations, so that Hemisphere Mode is maintained:



• Config-A: Full population of equivalent capacity DIMMs (2GB). This represents an optimally balanced configuration

• Config-B: Each memory channel is balanced with the same memory capacity, but half of the DIMMs are of one capacity (4GB), and half are of another capacity (2GB)

• Config-C: Eight DIMMs of one capacity (4GB) are populated across the eight memory channels, and four additional DIMMs (2GB) are installed one per memory buffer, so that Hemisphere Mode is maintained

• Config-D: Four DIMMs of one capacity (4GB) are populated across four memory channels, and four DIMMs of another capacity (2GB) are populated on the other four memory channels, with configurations balanced across the memory buffers, so that Hemisphere Mode is maintained

Figure 22. Mixed DIMM type memory performance

As shown, mixing DIMM types can cause a performance loss of up to 18%, even if all channels are occupied and Hemisphere Mode is maintained. Note that the HX5 blade, which provides only 8 DIMM slots per processor, will see the memory performance of Config-D in the best-case mixed memory configuration. For this platform, which uses just 1 DIMM per DDR3 channel, full population with equivalent DIMM capacities is important for any performance-sensitive environment.

[Figure 22 data — Relative memory throughput by mixed DIMM configuration: Config-A: 100; Config-B: 97; Config-C: 92; Config-D: 82]


Memory Redundancy and Performance

In addition to memory speed, DIMM types, and balance, memory performance is also affected by the various memory Reliability, Availability, and Serviceability (RAS) features that can be enabled from the UEFI shell. While these settings can increase the reliability of the system, there are performance trade-offs when they are enabled. The available memory RAS settings, which are mutually exclusive, are normal, mirroring, and sparing. On the X5 platforms, these settings can be found under the Memory option menu in System Settings. This section is not meant to provide a comprehensive overview of the memory RAS features available in the Xeon 7500 processor; rather, it provides a brief introduction to each mode and its corresponding performance impacts. For more information on memory RAS features and platform-specific requirements, please see the systems' documentation.

Normal Memory Mode

When the memory is set to Normal, also known as "flat memory mode", the full capacity of RAM installed in the system is available to the operating system. Normal memory mode generally provides the greatest performance, and most systems use this mode.

Memory Mirroring Mode

Memory mirroring can be thought of as "RAID-1" for system RAM; that is, it is a memory redundancy mechanism. Memory mirroring improves system reliability by maintaining two identical copies of data in memory. In the event of an uncorrectable error in one copy, the system can retrieve the data from the mirrored copy. As a result, when memory mirroring is enabled, only half of the system memory is available at one time: if 64GB of memory is installed in a server, only 32GB will be available to the operating system when mirroring is enabled, with the other 32GB reserved in case of memory failure. Memory mirroring, by definition, requires that each memory write occur in 2 separate locations in memory.
This, of course, adds overhead and causes a performance impact, especially for memory writes.

Memory Sparing

When memory sparing is enabled, some degree of redundancy is provided, though not as much as with mirroring. If a preset threshold of correctable errors occurs on a rank of memory, the memory sparing algorithm allocates a spare rank from another of the installed memory DIMMs to use as active memory. DIMMs must be paired for memory sparing to work. One rank per channel is set aside as spare memory; the size of the rank set aside depends on the DIMMs being used and their capacity. Typically, more memory is available to the operating system with sparing than with mirroring, and sparing involves less overhead than mirroring.

Memory Throughput and Latency for Various Memory Modes

To understand the performance impact of selecting different memory modes, we use a system configured with X7560 processors and populated with 64 4GB quad-rank DIMMs. Figure 23 shows the peak system-level memory throughput for the various memory modes, measured using an IBM-internal memory load generation tool. As shown below, there is a 50% decrease in peak memory throughput when going from normal (non-mirrored) to a mirrored memory configuration, and a 38% decrease when going from normal to a sparing configuration.
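The capacity side of these trade-offs can be sketched as follows. The sparing arithmetic (one reserved rank per channel) is a simplified assumption based on the description above; the actual reserved amount depends on the installed DIMMs:

```python
# Usable-capacity sketch for the three RAS modes described above.
# Sparing math is simplified: one rank per channel is reserved, so the
# reserved amount depends on rank size and channel count (assumption).

def usable_capacity_gb(installed_gb, mode, rank_size_gb=0, channels=0):
    if mode == "normal":
        return installed_gb                    # full capacity exposed
    if mode == "mirroring":
        return installed_gb / 2                # two identical copies kept
    if mode == "sparing":
        return installed_gb - rank_size_gb * channels
    raise ValueError("unknown mode: %s" % mode)

# 64GB installed with mirroring leaves 32GB visible to the OS:
print(usable_capacity_gb(64, "mirroring"))  # 32.0
```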


Figure 23. Relative memory throughput by memory modes on the x3850/x3950 X5

Note that workloads favoring memory write operations can see even higher impacts, with nearly 70% performance loss possible in memory mirroring mode for environments doing only memory writes. Figure 24 suggests that while memory sparing does not seem to impact system memory latency, memory mirroring does, increasing average memory latency by 11%. In the case of latency, of course, lower is better.

Figure 24. Memory latency for various memory modes on the x3850/x3950 X5

[Figure 23 data — Relative memory throughput by memory mode: Normal: 100; Mirroring: 50; Sparing: 62]

[Figure 24 data — Relative memory latency by memory mode: Normal: 100; Mirroring: 111; Sparing: 100]


Best Practices for Optimal Memory Performance

For optimal memory configuration on the Xeon 7500-based platforms, we recommend following these rules:

• When possible, populate memory with 8 or 16 equivalent DIMMs per processor, which allows a perfectly balanced configuration of 1 or 2 DIMMs per DDR3 memory channel. This will guarantee the best performance.

– Non-homogeneous memory configurations will have lower performance. Do not mix DIMM types or capacities if optimal performance is desired.

• Populate memory on all the memory channels. There are 8 memory channels per processor.

• Populate memory equivalently across the two memory controllers of each processor in order to guarantee that Hemisphere Mode is enabled.

– To enable Hemisphere Mode, 4 DIMMs of the same type and capacity must be installed at a time per processor.

• Populate each processor with an equivalent amount of memory to ensure a balanced NUMA system and reduce the amount of remote memory requests.

• Use the highest rank-count DIMMs possible for optimal performance.

– Quad-rank DIMMs are optimal on Xeon 7500-series processors, since memory is not down-clocked when quad-rank DIMMs are used.
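The population rules above can be expressed as a rough per-processor configuration check. This is a sketch of the guidelines, not an IBM validation tool; the function and warning text are ours.

```python
def check_population(dimm_capacities_gb):
    """Check one processor's DIMM population against the guidelines.

    dimm_capacities_gb -- list of installed DIMM capacities (GB) for
                          one processor (8 DDR3 channels per processor)
    Returns a list of warnings; an empty list means the layout follows
    the recommendations.
    """
    warnings = []
    count = len(dimm_capacities_gb)
    if count not in (8, 16):
        # 8 or 16 DIMMs gives 1 or 2 DIMMs on every memory channel.
        warnings.append("prefer 8 or 16 DIMMs per processor "
                        "(1 or 2 per DDR3 channel)")
    if count % 4 != 0:
        # Hemisphere Mode needs matched DIMMs installed 4 at a time.
        warnings.append("Hemisphere Mode requires DIMMs installed "
                        "in groups of 4")
    if len(set(dimm_capacities_gb)) > 1:
        warnings.append("do not mix DIMM capacities")
    return warnings

print(check_population([4] * 8))             # [] -- balanced layout
print(check_population([4, 4, 4, 4, 8, 8]))  # three warnings
```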

Conclusion

The Intel Xeon 6500 and 7500 Series processors have enabled dramatic increases in system performance over the processors they replace, with as much as 3.6x performance improvement in some application benchmarks. However, the processor's performance capability is only one of many important components in the server that must be optimized to enable a real-world application to perform well. To extract the most performance from the server's processors, the memory subsystem must also be configured properly. While many existing architectures rarely saw more than 10-15% performance loss from typical non-optimal memory configurations, we have outlined multiple cases in this paper where impacts of 50% or more result from simple tuning or configuration choices. Using the recommendations and data from this document, we are able to configure the HX5, x3690 X5, and x3850/x3950 X5 systems to deliver maximum performance, or to make educated performance tradeoff decisions where non-optimal configurations are required.


For More Information

IBM System x Servers: http://ibm.com/systems/x
IBM BladeCenter Server and options: http://ibm.com/systems/bladecenter
IBM Systems Director Service and Support Manager: http://ibm.com/support/electronic
IBM System x and BladeCenter Power Configurator: http://ibm.com/systems/bladecenter/resources/powerconfig.html
IBM Standalone Solutions Configuration Tool: http://ibm.com/systems/x/hardware/configtools.html
IBM Configuration and Options Guide: http://ibm.com/systems/x/hardware/configtools.html
IBM ServerProven Program: http://ibm.com/systems/info/x86servers/serverproven/compat/us
Technical Support: http://ibm.com/server/support
Other Technical Support Resources: http://ibm.com/systems/support

Legal Information

© IBM Corporation 2010
IBM Systems and Technology Group
Dept. U2SA
3039 Cornwallis Road
Research Triangle Park, NC 27709

Produced in the USA October 2010 All rights reserved.

For a copy of applicable product warranties, write to: Warranty Information, P.O. Box 12195, RTP, NC 27709, Attn: Dept. JDJA/B203. IBM makes no representation or warranty regarding third-party products or services including those designated as ServerProven® or ClusterProven®. Telephone support may be subject to additional charges. For onsite labor, IBM will attempt to diagnose and resolve the problem remotely before sending a technician.

IBM, the IBM logo, ibm.com, ClusterProven, ServerProven, and System x are trademarks of IBM Corporation in the United States and/or other countries. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. For a list of additional IBM trademarks, please see http://ibm.com/legal/copytrade.shtml.

Intel, the Intel logo, and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other company, product and service names may be trademarks or service marks of others.

IBM reserves the right to change specifications or other product information without notice. References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates. IBM PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

This publication may contain links to third party sites that are not under the control of or maintained by IBM. Access to any such third party site is at the user's own risk and IBM is not responsible for the accuracy or reliability of any information, data, opinions, advice or statements made on these sites. IBM provides these links merely as a convenience and the inclusion of such links does not imply an endorsement.

Information in this presentation concerning non-IBM products was obtained from the suppliers of these products, published announcement material or other publicly available sources. IBM has not tested these products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

MB, GB and TB = 1,000,000, 1,000,000,000 and 1,000,000,000,000 bytes, respectively, when referring to storage capacity. Accessible capacity is less; up to 3GB is used in service partition. Actual storage capacity will vary based upon many factors and may be less than stated.

Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will depend on considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here.

Maximum internal hard disk and memory capacities may require the replacement of any standard hard drives and/or memory and the population of all hard disk bays and memory slots with the largest currently supported drives available. When referring to variable speed CD-ROMs, CD-Rs, CD-RWs and DVDs, actual playback speed will vary and is often less than the maximum possible.

XSL03028-USEN-01