
Memory Scalability and Performance in Intel64 Xeon SMP Platforms

MICHAEL E. THOMADAKIS

Abstract—

cc-NUMA systems based on the Intel Nehalem and the Westmere processors are very popular with the scientific computing communities as they can achieve high floating-point and integer computation rates on HPC workloads.

However, a closer analysis of the performance of their memory subsystem reveals that the per-core and per-thread memory bandwidth of either microprocessor is restricted to almost 1/3 of its ideal value. Multi-threaded memory access bandwidth tops out at 2/3 of the maximum limit. In addition to this, the NUMA effect on latencies increasingly worsens as cores try to access larger memory-resident data structures, and the problem is exacerbated when the regular 4KiB page size is used. Moving from Nehalem to Westmere, read performance for data already owned by the same core scales gracefully with the number of cores and the core clock speed. However, when data is within the L2 cache or beyond, write performance suffers in Westmere, revealing scalability issues in the design when the system moved from 4 to 6 cores. This problem becomes more acute when multiple streams of data progress concurrently. The Westmere memory subsystem, compared to that of the Nehalem, suffers a worse performance degradation when threads on a core are modifying cache blocks owned by different cores within the same or another processor chip. Applications moving from Nehalem to Westmere based platforms could experience unexpected memory access degradation even though Westmere was intended as a “drop-in” replacement for Nehalem.

In this work we attempt to provide an accurate account of the on-chip and system-level bandwidth and latency limitations of Xeons. We study how these two metrics scale as we move from one generation of a platform to subsequent ones where the clock speed, the number of cores and other architecture parameters are different.

Here we also analyze how much the locality, coherence state and virtual memory page size of data blocks affect memory performance. These three factors are traditionally overlooked but, if not given adequate attention, can affect application performance significantly. We believe the performance analysis presented here can be used by application developers who strive to tune their code to use the underlying resources efficiently and avoid unnecessary bottlenecks or surprising slowdowns.


Fig. 1. A 2-socket ccNUMA Westmere-EP platform with a 6-core Xeon 5600 in each socket and a QPI for cache coherent data exchange between them.

I. INTRODUCTION

Xeon 5500 and Xeon 5600 are highly successful processors, based on the recent Intel µ-architectures nicknamed, respectively, “Nehalem” and “Westmere”. Nehalem implements the “Intel64” instruction set architecture (ISA) on a 45nm lithography, using high-k metal gate transistor technology [1]–[3]. Westmere is the follow-on implementation of an almost identical µ-architecture, but on a 32nm lithography with a 2nd-generation high-k metal gate technology [3]–[5]. Xeons are chip multi-processors (CMPs) designed to support varying numbers of cores per chip according to specific packaging. A CMP is the paradigm of architecting several cores and other components on the same silicon die to utilize the higher numbers of transistors as they become available with each new generation of process technology. CMPs (or multi-core processors) were adopted by processor manufacturers in their effort to stay within feasible power and thermal limits [6]. The trend of packaging higher numbers of cores on the same chip is expected to continue in the foreseeable future as IC feature size continues to decrease. Intel co-designed the Westmere chip alongside Nehalem, making provisions for the increased system resources necessary in a chip with a higher number of cores [4].

In this work we focus on the two-socket, 4-core Xeon 5500 (“Nehalem-EP”) and 6-core Xeon 5600 (“Westmere-EP”) platforms. Fig. 1 illustrates the basic system components of the 12-core Westmere-EP platform. Nehalem-EP platforms have an almost identical system architecture but with 4 cores per chip. Other differences between them will be discussed in later sections.

Xeons have been employed extensively in high-performance computing (HPC) platforms¹ as they achieve high floating-point and integer computation rates. Xeons’ high performance is attributed to a number of enabling technologies incorporated in their design. At the µ-architecture level they support, among others, speculative execution and branch prediction, wide decode and issue instruction pipelines, multi-scalar out-of-order execution pipelines, native support for single-instruction multiple-data (SIMD) instructions, simultaneous multi-threading and support for a relatively high degree of instruction level parallelism [3], [9]–[11].

Scientific applications usually benefit from higher numbers of execution resources, such as floating point or integer units. These are available in the out-of-order, back-end execution pipeline of Xeon cores. However, in order to sustain high instruction completion rates, the memory subsystem has to provide each core with data and instructions at rates that will keep the pipeline units utilized most of the time. The demand to feed the pipelines is exacerbated in multi-core systems like the Xeons, since the memory system has to keep up with several cores simultaneously. Memory access is almost always on the critical path of a computation. Clever techniques are being devised on the architecture side to mitigate the memory performance bottleneck.

Xeons rely on a number of modern architectural features to speed up memory access by the cores. These include an on-chip integrated memory controller (IMC), multi-level hardware and software pre-fetching, deep queues in the load and store buffers, store-to-load forwarding, three levels of cache, two levels of Translation Lookaside Buffers, wide data paths, and high-speed cache-coherent inter-chip communication over the QPI fabric [12], [13]. The on-chip integrated memory controller attaches a Xeon chip to its local DRAM through three independent DDR3 memory channels, which for Westmere can run at up to 1.333 GTransfers/s. On the Xeon-EP platform each one of the two processor chips directly connects to a physically distinct DRAM space, forming a cache-coherent Non-Uniform Memory Access (ccNUMA) system.

¹ Xeon processors power 65% and 55% of the HPC systems appearing, respectively, in the June 2011 [7] and the Nov. 2010 “Top-500” lists [8].

Fig. 1 illustrates the cc-NUMA EP organization with two processor sockets, a separate on-chip IMC and DRAM per socket, and the physical connectivity of the two sockets by the QPI. Separate memory controllers per chip support increased scalability and higher access bandwidths than were possible before with older generations of Intel processors, which relied on the (in)famous Front Side Bus architecture.

A. Motivation for this Study

Even though great progress has been achieved with Xeons in speeding up memory access, a closer performance analysis of the memory subsystem reveals that the per-core and per-thread memory bandwidth of either microprocessor is restricted to almost one third of its theoretical value. The aggregate, multi-threaded memory access bandwidth tops out at two thirds of the maximum limit. In addition to this, the NUMA effect on latencies increasingly worsens as cores try to access larger memory-resident data structures and the problem is exacerbated when the regular 4KiB page size is used. The Westmere memory subsystem, compared to that of the Nehalem, suffers a worse performance degradation when threads on a core are writing to cache blocks owned by different cores within the same chip or another processor chip.

Moving from Nehalem to Westmere, read performance for data already owned by the same core scales gracefully with the number of cores and the core clock speed. However, when data is within the L2 cache or beyond, write performance suffers in Westmere, revealing scalability issues in the design when the system moved from 4 to 6 cores. Applications moving from Nehalem to Westmere based platforms could experience unexpected memory access degradation even though Westmere was intended as a “drop-in” replacement for Nehalem.

Application developers are faced with several challenges when trying to write efficient code for modern multi-core cc-NUMA platforms, such as those based on Xeon. Developers now typically have to partition the computation into parallel tasks which should utilize the cores and memory efficiently. The cost of memory access is given special attention since memory may quickly become the bottleneck resource. Developers implement cache-conscious code to maximize reuse of data already cached and avoid costlier accesses to higher levels of the memory hierarchy. Another approach to increase the efficiency of multi-threaded applications, such as OMP code in scientific applications, is to fix the location of a computation thread to a particular core and allocate its data elements from a particular DRAM module. Selecting the right location to place threads and data on particular system resources is a tedious and at times lengthy trial-and-error process. When the memory access cost changes, the code has to be re-tuned.
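To make the placement step concrete, the sketch below pins the calling thread to one core and allocates its working set from that socket's local DRAM using Linux libnuma and the pthread affinity call. It is only an illustration of the technique described above, not the code used in our experiments; the core and node numbers, buffer size and file name are arbitrary examples.

```c
/* pin_and_place.c -- illustrative sketch, not the paper's benchmark code.
 * Build with: gcc -D_GNU_SOURCE pin_and_place.c -lnuma -lpthread */
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy support not available\n");
        return EXIT_FAILURE;
    }

    /* Pin this thread to core 0 ("CPU 0" in Fig. 1). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        perror("pthread_setaffinity_np");
        return EXIT_FAILURE;
    }

    /* Allocate 200 MiB from the DRAM attached to NUMA node 0,
     * i.e. the memory local to the core we just pinned to. */
    size_t bytes = 200UL << 20;
    double *buf = numa_alloc_onnode(bytes, 0);
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return EXIT_FAILURE;
    }
    memset(buf, 0, bytes);   /* touch the pages so they are actually placed */

    /* ... run the memory-access kernel on buf here ... */

    numa_free(buf, bytes);
    return EXIT_SUCCESS;
}
```

An OpenMP code would typically achieve the same effect with an affinity environment setting plus first-touch initialization of its arrays by the threads that will later use them.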

This work attempts to accurately quantify the memory access cost incurred by a core to memory locations which are resident in the different levels of the memory hierarchy and owned by threads running on the same or other cores. We analyze performance scalability limits as we move from one generation of a platform to subsequent ones where the clock speed, the number of cores and other architecture parameters are different. The analysis presented here can be used by application developers who strive to understand resource access cost and tune their code to use the underlying resources efficiently.

B. Related Work

Babka and Tuma [14] attempted to experimentally quantify the operating cost of Translation Lookaside Buffers (TLBs) and cache associativity. Peng et al. [15] used a “ping-pong” methodology to analyze the latency of cache-to-cache transfers and compare the memory performance of older dual-core processors. The well-known STREAM benchmark [16] measures memory bandwidth at the user level, but it disregards the impact on performance of relevant architectural features, such as NUMA memory. Molka and Hackenberg [17], [18] compared the latency and bandwidth of the memory subsystems of the AMD Shanghai and Intel Nehalem when memory blocks are in different cache coherency states.

II. XEON MEMORY IDEAL PERFORMANCE LIMITS

Ideal data transfer figures are obtained by multiplying the transfer rate of each system channel by its data width. Vendors usually publish these and other more intimate design details only partially².

Xeon processor chips consist of two parts, the “Core” and the “Un-core,” which operate on separate clock and power domains. Fig. 2 illustrates a 6-core Westmere chip, the Core and Un-core parts, the intra-chip data paths and some associated ideal data transfer rates. The Un-core consists of the Global Queue, the Integrated Memory Controller, a shared Level 3 cache and the QPI ports connecting to the other processor chip and to I/O. It also contains performance monitoring and power management logic. The Core part houses the processor cores.


Fig. 2. A 6-core Westmere (Xeon 5600) chip illustrating the Core and Un-core parts.


Fig. 3. Detail of the Global Queue, the connectivity to the IMC, L3, the cores and the QPI on a 4-core Nehalem chip, and associated ideal transfer rates.

A. The “Un-Core” Domain

The Un-core clock usually operates at twice the speed of the DDR3 channels, and for our discussion we will assume it is at 2.667 GHz. The L3 on Xeon-EP platforms supports 2MiB per core, which is 8MiB and 12MiB for the Nehalem and Westmere, respectively. The L3 has 32-byte read and write ports and operates on the Un-core clock. The QPI subsystem operates on a separate fixed clock which, for the systems we will be considering, supports 6.4 giga-transfers/s. The “Global Queue” (GQ) structure is the central switching and buffering mechanism that schedules data exchanges among the cores, the L3, the IMC and the QPI. Fig. 3 illustrates GQ details on a Nehalem chip. The GQ buffers requests from the Core for memory reads, for write-backs to local memory and for remote peer operations, with 32, 16 and 12 slot entries, respectively.

² [19] offers a more complete discussion of the memory architecture of Nehalem-EP platforms and ideal performance figures which also apply to Westmere-EP.



Fig. 4. Cache hierarchy and ideal performance limits in a Xeon core.

The GQ plays a central role in the operations and performance of the entire chip [20]. However, few technical details are available concerning the GQ. Westmere increased the peak CPU and I/O bandwidth to DRAM memory by increasing the per-socket Un-core buffers to 88, from 64 in Nehalem [4]. This “deeper” buffering was meant to support more outstanding memory access operations per core than possible in Nehalem-EP.

Ideally, the IMC can transfer data to the locally attached DRAM at the maximum aggregate bandwidth of the DDR3 paths to the memory DIMMs. The three DDR3 channels to local DRAM support a bandwidth of 31.992 GB/s = 3 × 8 bytes × 1.333 giga-transfers/s. Each core in a socket should be able to capture a major portion of this memory bandwidth.

The QPI links are full-duplex and their ideal transfer rate is 12.8 GB/s per direction. When a core accesses memory locations resident in the DRAM attached to the other Xeon chip (see Fig. 1), data is transferred over the QPI link connecting the chips together. The available bandwidth through the QPI link is approximately 40% of the theoretical bandwidth to the local DRAM and is the absolute upper bound for access to remote DRAM. The QPI, L3, GQ and IMC include logic to support the “source-snooping” MESIF-QPI cache-coherence protocol that the Xeon-EP platform [12], [13] employs. The QPI logic uses separate virtual channels to transport data or cache-coherence messages according to their type. It also pre-allocates fixed numbers of buffers for each source-destination QPI pair. This likely exacerbates congestion between end-points with high traffic.
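For reference, the QPI and remote-access bounds quoted above follow from the same kind of arithmetic as the DRAM figure, assuming the 16-bit (2-byte) data payload per QPI transfer:

\[
BW_{QPI} = 2\ \tfrac{\text{bytes}}{\text{transfer}} \times 6.4\ \tfrac{\text{GT}}{\text{s}} = 12.8\ \text{GB/s per direction},
\qquad
\frac{BW_{QPI}}{BW_{DRAM}} = \frac{12.8}{31.992} \approx 0.40 .
\]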

B. The “Core” Domain

In the Core domain, each core supports two levels of cache memory, L1 instruction and data caches and a unified L2. Fig. 4 presents details of the cache memory hierarchy, the associated connectivity and some ideal performance levels in a Nehalem core. The structure of Westmere cores is very similar. The L2s connect to the L3, which is shared by all cores, via the GQ structure. Each core has 2 levels of TLB structures for instructions and one for data. There are separate TLBs for 4KiB and 2MiB page sizes. Each core includes a “Memory Order Buffer” with 48 load, 32 store and 10 fill buffers, respectively. Fill buffers temporarily store new incoming cache blocks. There can be at most ten cache misses in progress at a time, placing an upper bound on the data retrieval rate per core. All components in the Core domain operate on the same clock as the processor. This implies that ideal transfer rates scale with the clock frequency.
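A back-of-the-envelope application of Little's law illustrates why the ten fill buffers alone already cap single-core bandwidth to DRAM; the 60 ns used below is an assumed effective miss latency (hardware prefetching hides part of the raw latencies reported in Section III-C), not a measured figure:

\[
BW_{core} \;\lesssim\; \frac{\text{outstanding misses} \times \text{line size}}{\text{effective miss latency}}
          \;=\; \frac{10 \times 64\ \text{bytes}}{60\ \text{ns}}
          \;\approx\; 10.7\ \text{GB/s}.
\]

This is of the same order as the 10.9 to 11.8 GB/s per-core local-DRAM retrieval rates measured in Section III-B.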

III. XEON MEMORY SYSTEM SCALABILITY ANALYSIS

A. Design of Experiments

In this Section we analyze performance and scalability limits in the memory systems of the Xeon chips and the EP platforms. Conventional performance evaluations measure memory bandwidth and latency regardless of (a) locality or residency, that is, where the data is cached or resides, and (b) the cache coherence state it is in at the time of the access. Another factor which is usually overlooked is (c) the virtual memory page size the system is using to map virtual addresses³ into physical ones.

In our analysis we take all of these aspects into account since, as we show, bandwidth figures vary drastically with locality and cache state. Page size affects mostly latency and, to a smaller extent, bandwidth.

Along with the conventional raw performance and scalability limits, application developers also have to pay increased attention to data locality and coherence state, and to select a proper page size, in order to tune their code accordingly.

We refer to Fig. 1 to illustrate details of the investigation that apply to all of our experiments. We divide the investigation into single-core and aggregate, multi-core performance analysis. The single-core part focuses on the bandwidth and latency figures a single thread, fixed on a particular physical core, experiences while accessing memory. The multi-core part focuses on the aggregate system-level performance figures when threads on every core are all performing the same memory operation.

³ The term “effective address” is used for this address.


The latter evaluates how well contention for common resources is handled by the architecture, and it reveals limitations and opportunities.

A single-core access pattern pins a software thread on Core 1 (or “CPU 0”) and evaluates the bandwidth accessing data in the L1 and L2 cache memories belonging to the same core and in the L1 and L2 memories belonging to each one of the other cores. It also evaluates the bandwidth accessing data in the L3 and the DRAMs attached to the same and to the other processor chip.

In each of the experiments the cache blocks the threads access can be in different coherence states per the MESIF-QPI protocol. We investigate the scalability as we move from 4-core Nehalem to 6-core Westmere processors and as we move from cores running at a certain frequency to cores running at higher frequencies. Wherever possible, we compare the attained performance with the ideal performance numbers for that platform and discuss its dependence on locality and on the particular coherence state.

We have selected three different Xeon platform configurations on two different working systems. All three configurations operate in the so-called “IA-32e, full 64-bit protected sub-mode” [9], which is the fully Intel64-compliant 64-bit mode.

The first system is an IBM iDataPlex cluster called “Eos”, maintained by the Supercomputing Facility at Texas A&M University [21]. Eos consists of a mixture of Nehalem-EP (Xeon-5560) and Westmere-EP (Xeon-5660) nodes, with all cores running at 2.8 GHz. Each node has 24 GiB of DDR3 DRAM operating at 1.333 GT/s.

The second system is a Dell PowerEdge M610 blade cluster, called “Lonestar”, maintained by the Texas Advanced Computing Center (TACC) at the University of Texas at Austin. Lonestar consists only of Westmere-EP (Xeon-5680) nodes, with cores running at 3.33 GHz. Each node has 24 GiB of DDR3 DRAM operating at 1.333 GT/s.

For all experiments we utilize the “BenchIT” [22], [23] open source package⁴, which was built on the target Xeon EP systems. We used a slightly modified version of a collection of benchmarks called “X86membench” [18]. These kernels use Intel64 assembly instructions to read and write memory, and obtain timing measurements using the “cycle-counter” hardware register that is available on each Xeon core.
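As a rough illustration of that measurement style (a minimal sketch, not the X86membench kernels themselves), the fragment below times a read sweep over a buffer with the core's cycle counter via the RDTSC instruction; the buffer size, repetition count and the bytes-per-cycle conversion are arbitrary choices made here for illustration.

```c
/* rdtsc_read.c -- minimal cycle-counter timing sketch.
 * Build with: gcc -O2 rdtsc_read.c */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>           /* __rdtsc() */

int main(void)
{
    size_t n = (1UL << 20) / sizeof(uint64_t);    /* 1 MiB working set */
    uint64_t *buf = malloc(n * sizeof(uint64_t));
    if (!buf) return EXIT_FAILURE;
    for (size_t i = 0; i < n; i++) buf[i] = i;    /* warm the caches */

    volatile uint64_t sink = 0;                   /* keep the reads observable */
    uint64_t start = __rdtsc();
    for (int rep = 0; rep < 100; rep++)
        for (size_t i = 0; i < n; i++)
            sink += buf[i];                       /* sequential read stream */
    uint64_t cycles = __rdtsc() - start;

    double bytes = 100.0 * n * sizeof(uint64_t);
    printf("%.2f bytes/cycle (multiply by the core clock for GB/s)\n",
           bytes / (double)cycles);
    free(buf);
    return EXIT_SUCCESS;
}
```

The production kernels additionally pin the thread, serialize the time-stamp reads and use wide SSE2 loads and stores; the sketch only conveys the basic timing idea.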

B. Xeon Memory Bandwidth Analysis

⁴ Official web site of the BenchIt project at http://www.benchit.org.


Fig. 5. Bandwidth of a Single Reader Thread, Nehalem-EP, 4KiB Pages (EOS)

1) Single Core Data Retrieval: In this experiment we investigate the effective data retrieval rates by a single core from the different levels of the memory hierarchy and all possible data localities in the system. This captures the portion of the system capacity a single core can utilize effectively. A single reader thread is pinned on “CPU0” (core 1 in Fig. 1) and reads memory segments with sizes varying successively from 10 KiB to 200 MiB.

The reader thread retrieves memory blocks from its own data L1 and L2 caches, then from the L3 and the DRAM associated with its own processor chip. It then retrieves data already cached in the L1 and L2 of all other cores on the same chip. Finally, it retrieves data cached in the L1 and L2 of all cores, the L3 and the DRAM associated with the other processor chip.

All memory blocks, if already cached, are in the “Exclusive” MESIF-QPI state in the corresponding owning core. A data block enters this state when it has been read and cached by exactly one core. Under the QPI protocol, a requested block may be retrieved directly out of an L3 instead of its home DRAM, if it is already cached in that L3. As soon as a second core caches a data block, the state of the first copy changes to “Shared” and the state of the newly cached one becomes “Forwarding”. The MESIF protocol allows exactly one copy to be in the latter state, permitting it to be quickly forwarded to the next requestor. This operation avoids accessing the slower home memory and is called a “cache-to-cache intervention”.

Fig. 5, Fig. 7 and Fig. 9 plot data retrieval bandwidths on the Nehalem and Westmere parts of EOS and on Lonestar, respectively. The top curves plot the bandwidth in GB/s when the core retrieves data from its own L1, L2, L3 and the DRAM associated with its own and the remote chip.



Fig. 6. Bandwidth of a Single Reader Thread, Nehalem-EP, 2MiB Pages (EOS)


Fig. 7. Bandwidth for a Single Reader Thread, Westmere-EP, 4KiB Pages (EOS)


Fig. 8. Bandwidth of a Single Reader Thread, Westmere-EP, 2MiB Pages (EOS)


Fig. 9. Bandwidth for a Single Reader Thread, Westmere-EP, 4KiB Pages (LoneStar)

The observed bandwidths of 43.7 GB/s, 43.6 GB/s and 51.8 GB/s when accessing the L1 cache (32 KiB) are very close to the ideal ones. The ideal L1 bandwidth for a 2.8 GHz and a 3.33 GHz Xeon is 44.8 GB/s (44.8 GB/s = 2.8 GHz × 16 bytes) and 53.28 GB/s (53.28 GB/s = 3.33 GHz × 16 bytes), respectively.

The L2 bandwidths are measured at 29.7 GB/s, 29.7 GB/s and 35.3 GB/s, respectively. The vendor does not provide figures on L2 performance apart from the latency to retrieve an L2 block.

Data retrievals scale well when we move from 4 to 6 cores, and they also scale well with the core clock. For instance, we can check that (3.33/2.8) × 29.7 = 35.3, which matches the measured L2 bandwidth on the Westmere running at 3.33 GHz.

L3 data retrieval figures are not provided by the vendor and are measured at 23.8 GB/s, 23.1 GB/s and 25.6 GB/s, respectively. We notice that 3.33/2.8 > 25.6/23.1, implying that L3 access does not scale linearly with the core clock. This is expected since the L3 is in the Un-core, which operates at twice the DDR3 rate.

The local DRAM supports 11.8 GB/s, 10.9 GB/s and 11.1 GB/s, respectively. The remote DRAM supports 7.8 GB/s, 7.7 GB/s and 7.7 GB/s, respectively. Data from the remote DRAM traverses the QPI link, but the QPI ideal rate does not appear to be the limiting factor.

The curves in the middle of Fig. 5, Fig. 7 and Fig. 9 plot the retrieval rates of data items already cached in the L1 or L2 of other cores within the same chip. The L3, which is an inclusive cache, also caches whatever is cached above it in an L2 or L1. Thus the L3 uses a cache intervention to pass a copy of the block up to core 1. This explains why accessing data already cached by other cores has the same performance as accessing data from the L3. Comparing the 4-core and 6-core systems, we see that the performance accessing blocks cached by other cores within the same chip is worse on the 6-core system.



Fig. 10. Bandwidth of a Single Writer Thread, Nehalem-EP, 4KiB Pages (EOS)

Finally, the bottom curves show the rates when core 1 accesses blocks already cached by cores on the other chip. Rates start in all cases at around 9 GB/s, where data is supplied by the remote L3, and drop to around 7.7 GB/s for larger requests, where the remote DRAM has to be accessed.

The same experiment has been carried out with 2MiB large VM pages on the Nehalem and Westmere parts of EOS. Fig. 6 and Fig. 8 plot the respective results. Bandwidth figures using large pages are similar to those of the regular 4KiB pages, with the only difference that performance starts dropping a little later as we cross boundaries in the memory hierarchy.

Overall, measured data retrieval rates are close to the ideal limits and scale well with clock rate and as we move from 4 to 6 cores.

It is clear that a single core cannot utilize the entire available bandwidth to the DRAM. Resource limits along the path from a core’s Memory Order Buffer to the IMC create this artificial upper bandwidth bound. Bandwidth quickly deteriorates once the cache memories cannot absorb the requests. Application developers need to take these large performance disparities into account when they tune their code for the architecture.

2) Single Core Data Updates: This experiment is the data modification counterpart of the previous experiment, where the writer thread is also pinned on Core 1. All blocks are initialized to the “Modified” state before the measurements.

When a core has to write to a data block, the MESIF protocol requires a “Read-for-Ownership” operation, which snoops and invalidates this memory block if it is already stored in other caches.

Fig. 10, Fig. 11 and Fig. 12 plot the measured bandwidths. The L1 rates closely match the retrieval rates.


Fig. 11. Bandwidth of a Single Writer Thread, Westmere-EP, 4KiB Pages (EOS)


Fig. 12. Bandwidth of a Single Writer Thread, Westmere-EP, 4KiB Pages (LoneStar)

The performance of the L2 and L3 is relatively worse than when data is retrieved from these caches. Local and remote DRAM access is even worse. Writing to local DRAM attains 8.8 GB/s, 7.7 GB/s and 8.2 GB/s. Writing to remote DRAM is at around 5.5 GB/s in all three cases.

On 6-core systems, writing to blocks already cached by other cores on the same chip is a less scalable operation than on a 4-core system. For instance, Fig. 10 shows that on the 4-core Nehalem the L3 can gracefully absorb all block updates from within the same chip, with a stable performance of 17.6 GB/s. However, as we see in Fig. 11 and Fig. 12, the same scenario on the 6-core Westmere attains 15.5 GB/s and 16.9 GB/s, respectively. The last figure comes from a system which operates at 3.33 GHz, that is, 1.19 = 3.33/2.8 times faster. However, the 2.8 GHz Nehalem still manages to attain 17.6 GB/s. More importantly, on the 6-core systems the top curve (accessing data cached only by itself) quickly deteriorates to 12.9 GB/s and then to even below 10 GB/s. The same curve on the 4-core Nehalem attains 17.6 GB/s and appears much more stable.

Applications moving from 4-core to 6-core systems will experience unexpected performance degradation.



Fig. 13. Bandwidth of a Single Pair of Reader and Writer Streams, Nehalem-EP, 4KiB Pages (EOS)


Fig. 14. Bandwidth of a Single Pair of Reader and Writer Streams, Westmere-EP, 4KiB Pages (EOS)

The hardware provisions made at design time when moving from 4 to 6 cores per chip restrict scalability and performance.

3) Combined Retrieval and Update – Single Stream Pair: In this experiment, a single thread pinned on a core simultaneously drives a retrieval and an update stream from various localities in the memory hierarchy towards its local DRAM.


Fig. 15. Bandwidth of a Single Pair of Reader and Writer Streams, Westmere-EP, 4KiB Pages (LoneStar)

This investigates the ability of the memory system to simultaneously retrieve and update different locations. The bandwidth figures reflect the fact that each data block is used in two memory operations.

Fig. 13, Fig. 14 and Fig. 15 plot the measured bandwidths from the three experimental configurations. The top curves plot the case where both the read and the write stream are on the same DRAM module. For data that fits in the L1s we attain 75.6 GB/s, 75.8 GB/s and 53.3 GB/s, respectively. This shows that the two ports of the L1 can be used simultaneously and attain respectable rates. The curves in the middle plot the case where the source is cached in the L1 and L2 caches of cores within the same chip. The bottom curves plot the bandwidths attained by moving blocks cached or resident in the other chip’s localities. However, since the cache memories end up holding two copies of each data block, performance drops much faster, as soon as 1/2 of a cache is filled with the inbound blocks. The two EOS configurations have approximately the same performance, with the Westmere one at somewhat lower levels. The LoneStar system behaves in a more unstable fashion with this workload: as soon as the L1 is overwhelmed by inbound and outbound copies, bandwidth drops to 27.5 GB/s. However, as soon as the segments get larger than 120 KiB, bandwidth jumps back to 41.4 GB/s. With the exception of this abnormal bandwidth drop, the relative bandwidth in the L1 and L2s follows the clock frequency ratio. However, the L2 performance drops on Westmere compared to Nehalem as the occupancy of the L2 increases.

Finally, when data is streamed in from the remote chip, while the inbound and outbound blocks fit in the L2 and L3 the apparent bandwidth is ≈ 14–15 GB/s, but it then drops down to ≈ 9 GB/s.

4) Aggregate Data Retrieval Rates: In this experiment, nc threads are split evenly across all nc available cores and simultaneously read, from disjoint locations, memory segments of sizes up to 200 MiB. This investigates the aggregate throughput a system can provide simultaneously to multiple cores.
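A simplified sketch of this access pattern is shown below: a minimal pthread version, not the actual BenchIT/X86membench kernel, with the core count, segment size and repetition count chosen only for illustration. Each thread pins itself to its own core, fills a private segment (which, under the default first-touch policy, also places the pages on its local node), and all threads then read their segments concurrently.

```c
/* multi_reader.c -- illustrative multi-reader sketch (error handling trimmed).
 * Build with: gcc -D_GNU_SOURCE -O2 multi_reader.c -lpthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdlib.h>

#define NCORES   12                      /* e.g. a 2-socket Westmere-EP node */
#define SEG_SIZE (16UL << 20)            /* 16 MiB per thread (arbitrary) */

static pthread_barrier_t barrier;

static void *reader(void *arg)
{
    int core = (int)(intptr_t)arg;

    /* Pin this thread to its own core. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Private, disjoint segment; first touch places it on the local node. */
    size_t n = SEG_SIZE / sizeof(uint64_t);
    uint64_t *seg = malloc(SEG_SIZE);
    for (size_t i = 0; i < n; i++) seg[i] = i;

    pthread_barrier_wait(&barrier);      /* start all readers together */

    volatile uint64_t sink = 0;          /* keep the reads observable */
    for (int rep = 0; rep < 10; rep++)
        for (size_t i = 0; i < n; i++)
            sink += seg[i];              /* concurrent read streams */

    free(seg);
    return NULL;
}

int main(void)
{
    pthread_t tid[NCORES];
    pthread_barrier_init(&barrier, NULL, NCORES);
    for (int c = 0; c < NCORES; c++)
        pthread_create(&tid[c], NULL, reader, (void *)(intptr_t)c);
    for (int c = 0; c < NCORES; c++)
        pthread_join(tid[c], NULL);
    pthread_barrier_destroy(&barrier);
    return EXIT_SUCCESS;
}
```

The update and stream-pair experiments that follow use the same structure, with the inner loop writing or copying instead of only reading.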

Here one thread is pinned on each core. Cached memory blocks are in the Exclusive state. Fig. 16, Fig. 17 and Fig. 18 plot the aggregate retrieval rates on our three systems, with the x-axis being the sum of all blocks at a particular size. L1 rates are 353.3 GB/s, 524.2 GB/s and 620.9 GB/s, respectively, giving 44.2 GB/s, 43.7 GB/s and 51.7 GB/s per core, all close to the corresponding ideal bandwidths. Also, we notice that 620.9/524.2 ≈ 3.33/2.8 for the two Westmere systems, implying performance scales with clock speed.

For the L2s, performance is at 237.5 GB/s, 368.8 GB/s and 420 GB/s, respectively, giving 29.7 GB/s, 30.7 GB/s and 35 GB/s per core, which all match the corresponding single-core rates times the number of cores.



Fig. 16. Aggregate Bandwidth of 8 Reader Threads, Nehalem-EP, 4KiB Pages (EOS)


Fig. 17. Aggregate Bandwidth of 12 Reader Threads, Westmere-EP, 4KiB Pages (EOS)


The L3 supports 161.1 GB/s, 172.1 GB/s and 171.8 GB/s, respectively, giving 20.2 GB/s, 14.3 GB/s and 14.3 GB/s per core.

Finally, when all requests go directly to DRAM, the aggregate read bandwidth settles to approximately 39.8 GB/s, 38.4 GB/s and 38.9 GB/s, respectively, or 19.9 GB/s, 19.2 GB/s and 19.4 GB/s per socket.

This experiment shows that the IMC on a chip delivers data at rates higher than an individual core can attain. The bottleneck thus is not on the DRAM and IMC side but in the Un-core, caused by artificial limits on the resources dedicated to servicing each individual core.

5) Aggregate Data Updates: This experiment is the counterpart of the aggregate retrieval experiment of Subsection III-B4.


Fig. 18. Aggregate Bandwidth of 12 Reader Threads, Westmere-EP, 4KiB Pages (LoneStar)


Fig. 19. Aggregate Bandwidth of 8 Writer Threads, Nehalem-EP, 4KiB Pages (EOS)


Fig. 20. Aggregate Bandwidth of 12 Writer Threads, Westmere-EP, 4KiB Pages (EOS)



Fig. 21. Aggregate Bandwidth of 12 Writer Threads, Westmere-EP, 4KiB Pages (LoneStar)

As before, nc threads are split evenly across all nc available cores and simultaneously update, at disjoint locations, memory segments of sizes up to 200 MiB. One thread is pinned on each core and the memory blocks are in the Modified state.

Fig. 19, Fig. 20 and Fig. 21 plot the aggregate update performance of the three different configurations, with the x-axis being the sum of all blocks at a particular size. Updates to the L1 attain 347.1 GB/s, 527.3 GB/s and 617.3 GB/s, respectively, giving 43.4 GB/s, 43.94 GB/s and 51.4 GB/s per core, all close to the corresponding ideal bandwidths.

The L2s can update their contents at 222.9 GB/s, 325.4 GB/s and 384 GB/s, respectively, giving 27.8 GB/s, 27.1 GB/s and 32 GB/s per core, all closely following the L2 retrieval rates.

However, when we update the L3, we see that the attained rates are 52 GB/s, 51.1 GB/s and 50.6 GB/s, respectively, which are considerably lower than the L3 retrieval rates. The per-core average rate of the aggregate updates is 1/3 to 1/4 that of the single-core update rates of Subsection III-B2. This slowdown is somewhat expected since the L3 is shared among all cores and all updates to the L3 are serialized by the MESIF protocol. It is clear that not enough bandwidth has been provisioned on any of the three configurations to sustain simultaneous updates by all cores.

Finally, update rates for DRAM are much worse than those of the aggregate retrieval case. Here all aggregate rates are at ≈ 20 GB/s, or ≈ 10 GB/s per socket. Looking at the individual update rates of Subsection III-B2, we can see that with more than two individual update streams evenly split across the IMCs, the memory system becomes the bottleneck. Aggregate rates are only a little higher than those a single core attains.

6) Aggregate Combined Retrieval and Update – Multiple Stream Pairs: This experiment investigates the limits of the aggregate ability of the system to retrieve and update multiple data streams concurrently.


Fig. 22. Aggregate Bandwidth of 8 Read and Write Stream Pairs, Nehalem-EP, 4KiB Pages (EOS)


Fig. 23. Aggregate Bandwidth of 12 Read and Write Stream Pairs, Westmere-EP, 4KiB Pages (EOS)


Fig. 24. Aggregate Bandwidth of 12 Read and Write Stream Pairs, Westmere-EP, 4KiB Pages (LoneStar)


A thread is pinned on each one of the nc cores of the system and drives a stream pair. This particular memory access pattern stresses all parts of the memory infrastructure.

Fig. 22, Fig. 23 and Fig. 24 plot the aggregate performance of the three different configurations, with the x-axis being the sum of all blocks at a particular size. Accesses to the L1 attain 615.2 GB/s, 918.7 GB/s and 1076.7 GB/s, respectively, giving 76.9 GB/s, 76.6 GB/s and 89.7 GB/s average per core, all close to the sums reported by the aggregate retrieval and update experiments of Subsections III-B4 and III-B5. Scaling with core count and clock frequency is attained for the L1.

For the aggregate L2 we obtain 264.9 GB/s, 394 GB/s and 472 GB/s, respectively, giving 33.1 GB/s, 32.8 GB/s and 39.3 GB/s average per core. In each of these three cases the attained bi-directional bandwidth is only a little higher than the one from the corresponding aggregate retrieval or update case. Here the L2 cache memory quickly shows its limited ability to handle bi-directional streams of blocks. Making the L2 a dual-ported memory could mitigate this problem.

The performance of the L3 caches with bi-directional traffic is even more disappointing, as the achieved figures are 72.5 GB/s, 69.9 GB/s and 68.1 GB/s, respectively, giving 9 GB/s, 5.8 GB/s and 5.7 GB/s average per core. These figures demonstrate that concurrent bi-directional streams cannot be handled adequately by the L3 subsystem.

Finally, bi-directional block streams are serviced by DRAM access at 26.4 GB/s, 25.4 GB/s and 26.4 GB/s, respectively.

This investigation reveals that Xeons cannot handle bi-directional streams gracefully beyond the L1 cache. Further investigation would require digging into unavailable GQ and QPI details, but there is definitely room for improvement here.

C. Memory Hierarchy Access Latencies

We explore the effect page size has on the latencies to access the memory hierarchy. We focus on the 4 KiB and 2 MiB sizes available on the Xeons, of which 4 KiB is the most widely used. For all experiments a thread is pinned on Core 1 and accesses memory cached or homed at the various localities.

In all subsequent figures, the bottom curve plots the cost to access the L1, L2, L3 and local DRAM. The middle ones plot the latencies to access the L1, L2 and L3 of another core within the same chip.


Fig. 25. Latency to Read a Data Block in Nanoseconds, Nehalem-EP, 4KiB Pages (EOS)


Fig. 26. Latency to Read a Data Block in Nanoseconds, Nehalem-EP, 2MiB Pages (EOS)


Fig. 27. Latency to Read a Data Block in Nanoseconds, Westmere-EP, 4KiB Pages (EOS)



Fig. 28. Latency to Read a Data Block in Nanoseconds, Westmere-EP, 2MiB Pages (EOS)


Fig. 29. Latency to Read a Data Block in Nanoseconds, Westmere-EP, 4KiB Pages (LoneStar)

The top curves plot the latencies to access the L1, L2 and L3 of data already cached by cores on the other chip and, finally, the remote DRAM. All times are in nanoseconds.

Fig. 25, Fig. 27 and Fig. 29 plot the latencies when 4 KiB pages are used. Access to local DRAM can take up to 114.29 ns, 118.58 ns and 114.95 ns, respectively. The latency accessing remote DRAM is, respectively, 170.37 ns, 181.8 ns and 173.3 ns. The difference in core clock frequency does not make any significant difference. The NUMA effect, that is, the disparity in the cost to access local vs. remote DRAM, is 56.08 ns, 63.22 ns and 58.35 ns, respectively. In terms of percentage, remote DRAM latency is higher by 49%, 53.3% and 51%, which is rather significant.

Using the 2 MiB page size, as Fig. 26 and Fig. 28 show, latencies to local DRAM can take up to 70.7 ns and 74.7 ns, respectively, for the two EOS configurations. Accessing remote DRAM takes, respectively, 112.14 ns and 114.31 ns. The NUMA latency disparity is 41.44 ns and 40 ns, or remote access is longer by 59% and 54%, respectively.

The important observation is that by using 2 MiB pages we can shorten the latency to local DRAM by 43.59 ns and 44.28 ns and to the remote DRAM by 58.23 ns and 67.49 ns, respectively. Percentage-wise, we shorten the latencies to local DRAM by 38% and 37% and to the remote one, respectively, by 34% and 37%. There are also smaller improvements when the capacities of the cache memories are reached.

It is clear that the platform does not provide sufficient address translation resources, such as TLB entries, for the regular 4 KiB page size. Applications required to access long lists of memory locations, as in pointer chasing, will definitely suffer performance degradation with 4 KiB pages.
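On Linux, one way to map such data with 2 MiB pages is an explicit huge-page allocation. The sketch below is a minimal illustration of the idea, not necessarily the mechanism the BenchIT large-page runs used; it assumes huge pages have been reserved beforehand (e.g. via /proc/sys/vm/nr_hugepages), and the buffer size is an arbitrary example.

```c
/* huge_alloc.c -- minimal sketch: back a buffer with 2 MiB huge pages.
 * Assumes huge pages are reserved, e.g. echo 512 > /proc/sys/vm/nr_hugepages.
 * Build with: gcc huge_alloc.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t bytes = 512UL << 21;               /* 1 GiB, a multiple of 2 MiB */
    void *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");          /* fails if no huge pages are free */
        return EXIT_FAILURE;
    }
    memset(buf, 0, bytes);                    /* touch the pages */

    /* ... pointer-chasing or streaming kernel over buf ... */

    munmap(buf, bytes);
    return EXIT_SUCCESS;
}
```

Transparent huge pages or a hugetlbfs mount are alternative routes to the same end; the key point is that each 2 MiB mapping covers 512 times the address range of a 4 KiB page per TLB entry.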

IV. CONCLUSIONS

In this work we analyzed and quantified in detail the per-core and system-wide performance and scalability limits of the memory in recent Xeon platforms. We focused on a number of fundamental access patterns, with varying degrees of concurrency, and considered blocks in certain coherence states. Overall, data retrievals scale well with system size and stream count, but when data updates are involved the platforms exhibit behavior which merits system improvements to correct performance and scalability issues. There is a disparity in memory performance according to the locality, concurrency and coherence state of each data block which requires adequate attention by system designers and application developers.

Specifically, for data retrieval of blocks in the Exclusive state, the platforms scale well moving from 4 to 6 cores and when the clock frequency increases, as long as the data fits in the various levels of the cache hierarchy.

The per-core retrieval rates from local DRAM range from 10.9 to 11.8 GB/s and are ≈ 1/2 of those available per socket (19.2 to 19.9 GB/s) and a little more than ≈ 1/4 of the aggregate (38.4 to 39.8 GB/s). This applies to all core counts and clock frequencies and is due to resource scarcity on the DRAM-to-core path, likely in the Un-core. This can be alleviated by providing more resources to service each core, such as deeper GQ or per-core IMC buffers. Single-core retrievals from remote DRAM attain ≈ 64% to 70% of the local DRAM bandwidth, and the QPI is not the bottleneck. Multi-stream data retrievals scale well with core count and clock frequency.

However, when updates are involved with blocks in the Modified state, the results are mixed. The 4-core chips handle updates more gracefully as data sizes increase. The 6-core systems, however, experience unexpected slowdowns and unstable performance as soon as the L3 is involved. Code tuned for a 4-core system will experience a performance drop when it starts using a 6-core system, likely requiring 6-core specific tuning.


The bandwidth available to update local or remote DRAM is significantly lower than when data is retrieved from them.

Multi-stream updates scale well until the L3 is engaged. At that point performance drops significantly, pointing to an inability of the platform to scale with concurrent update streams, likely due to resource scarcity in the Un-Core or QPI. Two streams already suffice to saturate the memory system.

Single or multiple pairs of retrieve-update streams scale well across core counts and clock speeds until the L2 is engaged, at which point performance drops significantly; adding more L2 ports could alleviate this. When the L3 or DRAM is involved performance drops further, pointing to issues with handling single or concurrent bi-directional streams. The platforms will have to include further provisions for this type of access pattern, which is not uncommon in HPC applications.
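The retrieve-update pattern referred to here is essentially one read stream paired with one write stream per thread. The sketch below, in the spirit of a STREAM-style copy kernel [16], illustrates the pattern; it is not the kernel measured in this paper, and the array sizes are illustrative assumptions.

/* Minimal sketch: paired retrieve-update streams (read one array,
 * write another) with one pair of streams per thread. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (32L * 1024 * 1024)           /* 32 Mi doubles per array */

int main(void)
{
    double *src = malloc(N * sizeof *src);
    double *dst = malloc(N * sizeof *dst);
    if (!src || !dst) return 1;

    /* Parallel first touch keeps each thread's pages on its local node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) { src[i] = 1.0; dst[i] = 0.0; }

    double t0 = omp_get_wtime();
    /* One read stream plus one write stream per thread. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        dst[i] = src[i];
    double t1 = omp_get_wtime();

    /* Counts only the explicit read and write traffic; write-allocate
     * fills can add further traffic on the actual platform. */
    printf("copy bandwidth: %.1f GB/s\n",
           2.0 * N * sizeof(double) / (t1 - t0) / 1e9);
    free(src);
    free(dst);
    return 0;
}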

Since updates exhibit problematic performance on 6-core systems, moving from a 4-core to a 6-core system reveals inadequate system provisioning at the design stage. As core counts are expected to increase, this issue has to be addressed in a scalable way. The areas needing improvement include the efficiency of the L3, the QPI coherence protocol and the GQ structures. In particular, their ability to handle concurrent streams should be increased to allow more memory operations to proceed concurrently. This will require reworking the L3 and GQ structures and restricting unnecessary coherence broadcast operations with snoop filters or other mechanisms.

With the widely used 4 KiB pages, access latency to DRAM suffers due to the scarcity of TLB and other address translation resources. The use of large 2 MiB pages can mitigate the latency problem and reduce the cost by 34% to 38%. System designers will have to increase the translation resources for smaller page sizes in future platforms.
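Where explicitly reserved huge pages are impractical, a 2 MiB-aligned allocation combined with an madvise(MADV_HUGEPAGE) hint can achieve a similar effect on Linux kernels with transparent huge page support. The sketch below is illustrative and assumes such a kernel; the hint is advisory and may be ignored.

/* Minimal sketch: request transparent huge pages for a heap allocation.
 * Assumes a Linux kernel with THP support; the region size is an example. */
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define ALIGN_2M (2UL * 1024 * 1024)

int main(void)
{
    size_t bytes = 1UL * 1024 * 1024 * 1024;   /* 1 GiB example region */
    void *buf = NULL;

    /* 2 MiB alignment lets the kernel back the region with huge pages. */
    if (posix_memalign(&buf, ALIGN_2M, bytes) != 0)
        return 1;

    madvise(buf, bytes, MADV_HUGEPAGE);        /* advisory hint only */
    memset(buf, 0, bytes);                     /* touching populates the region */
    /* ... memory-latency-sensitive work over buf ... */

    free(buf);
    return 0;
}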

Conventionally, attention focuses on the cost of remote vs. local memory accesses or on the various levels of the cache hierarchy. A class of "communication-avoiding" algorithms has been devised to take this into consideration and improve performance. However, updates are much costlier than retrievals, especially when multiple streams are in progress concurrently, a situation which is common in HPC workloads. Worse, as applications move to platforms with more cores these disparities may grow, requiring another round of lengthy tuning. Both of these factors have to be considered when tuning for memory performance.

We hope that this work provides application developers with tools to understand the cost of accessing system resources in a more quantifiable way and to tune their code accordingly.

ACKNOWLEDGEMENTS

We are most grateful to the Supercomputing Facility at Texas A&M University and to the TACC center at The University of Texas at Austin for allowing us to use their HPC resources for this investigation.

REFERENCES

[1] S. Gunther and R. Singhal, “Next generation Intel® microarchitecture (Nehalem) family: Architectural insights and power management,” in Intel Developer Forum. San Francisco: Intel, Mar. 2008.

[2] R. Singhal, “Inside Intel next generation Nehalem microarchitecture,” in Intel Developer Forum. San Francisco: Intel, Mar. 2008.

[3] Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture, Intel, May 2011.

[4] D. Hill and M. Chowdhury, “Westmere Xeon 56xx “tick” CPU,” IEEE, Palo Alto, CA, Tech. Rep., Aug. 2010.

[5] N. A. Kurd, S. Bhamidipati, C. Mozak, J. L. Miller, P. Mosalikanti, T. M. Wilson, A. M. El-Husseini, M. Neidengard, R. E. Aly, M. Nemani, M. Chowdhury, and R. Kumar, “A family of 32-nm IA processors,” IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 119–130, Jan. 2011.

[6] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick, “A view of the parallel computing landscape,” Communications of the ACM, vol. 52, pp. 56–67, Oct. 2009. [Online]. Available: http://doi.acm.org/10.1145/1562764.1562783

[7] “Top 500 supercomputers, Jun. 2011,” Jun. 2011. [Online]. Available: http://www.top500.org/lists/2011/06

[8] “Top 500 supercomputers, Nov. 2010,” Nov. 2010. [Online]. Available: http://www.top500.org/lists/2010/11

[9] Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1, Intel, May 2011.

[10] Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B: System Programming Guide, Part 2, Intel, May 2011.

[11] Intel® 64 and IA-32 Architectures Optimization Reference Manual, Intel, May 2010.

[12] R. A. Maddox, G. Singh, and R. J. Safranek, Weaving High Performance Multiprocessor Fabric. Hillsboro, OR: Intel Corporation, 2009.

[13] Intel, “An introduction to the Intel® QuickPath Interconnect,” Intel White Paper, Tech. Rep. Document Number 320412-001US, Jan. 2009.

[14] V. Babka and P. Tuma, “Investigating cache parameters of x86 family of processors,” in SPEC Benchmark Workshop 2009, ser. LNCS, D. Kaeli and K. Sachs, Eds., no. 5419. Heidelberg: Springer-Verlag Berlin, 2009, pp. 77–96.

[15] L. Peng, J.-K. Peir, T. K. Prakash, C. Staelin, Y.-K. Chen, and D. Koppelman, “Memory hierarchy performance measurement of commercial dual-core desktop processors,” J. Syst. Archit., vol. 54, pp. 816–828, Aug. 2008. [Online]. Available: http://portal.acm.org/citation.cfm?id=1399642.1399665

[16] J. D. McCalpin, “STREAM: Sustainable memory bandwidth in high performance computers,” University of Virginia, Charlottesville, Virginia, Tech. Rep., 1991–2007. [Online]. Available: http://www.cs.virginia.edu/stream/

[17] D. Molka, D. Hackenberg, R. Schone, and M. S. Muller, “Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system,” in Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA: IEEE Computer Society, 2009, pp. 261–270. [Online]. Available: http://portal.acm.org/citation.cfm?id=1636712.1637764

[18] D. Hackenberg, D. Molka, and W. E. Nagel, “Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems,” in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 42. New York, NY, USA: ACM, 2009, pp. 413–422. [Online]. Available: http://doi.acm.org/10.1145/1669112.1669165

[19] M. E. Thomadakis, “The architecture of the Nehalem processor and Nehalem-EP SMP platforms,” Texas A&M University, College Station, TX, Tech. Rep., 2011. [Online]. Available: http://sc.tamu.edu/systems/eos/nehalem.pdf

[20] D. Levinthal, “Performance analysis guide for Intel® Core™ i7 processor and Intel Xeon™ 5500 processors,” Intel, Tech. Rep. Version 1.0, 2009.

[21] M. E. Thomadakis, “A High-Performance Nehalem iDataPlex Cluster and DDN S2A9900 Storage for Texas A&M University,” Texas A&M University, College Station, TX, Tech. Rep., 2010. [Online]. Available: http://sc.tamu.edu/systems/eos/

[22] G. Juckeland, S. Borner, M. Kluge, S. Kolling, W. Nagel, S. Pfluger, H. Roding, S. Seidl, T. William, and R. Wloch, “BenchIT – performance measurement and comparison for scientific applications,” in Parallel Computing – Software Technology, Algorithms, Architectures and Applications, ser. Advances in Parallel Computing, G. R. Joubert, W. E. Nagel, F. Peters, and W. Walter, Eds. North-Holland, 2004, vol. 13, pp. 501–508. [Online]. Available: http://www.sciencedirect.com/science/article/B8G4S-4PGPT1D-28/2/ceadfb7c3312956a6b713e0536929408

[23] G. Juckeland, M. Kluge, W. E. Nagel, and S. Pfluger, “Performance analysis with BenchIT: Portable, flexible, easy to use,” in Proc. of the Int’l Conf. on Quantitative Evaluation of Systems (QEST 2004). Los Alamitos, CA, USA: IEEE Computer Society, 2004, pp. 320–321.

[24] P. P. Gelsinger, “Intel architecture press briefing,” in Intel Developer Forum. San Francisco: Intel, Mar. 2008.