
Memory Performance and Scalability of Intel’s and AMD’s Dual-Core Processors: A Case Study

Lu Peng¹, Jih-Kwon Peir², Tribuvan K. Prakash¹, Yen-Kuang Chen³ and David Koppelman¹

¹Louisiana State University, ²University of Florida, ³Intel Corporation

IPCCC’07, 04/11/2007

Motivation

• Dual-core processors are popular.
• Goal: understand the impact of the memory hierarchy on overall performance.
• Which factors are important for memory hierarchy performance?
• What speedups do two threads achieve?


Selected Three Dual-Core Processors

Intel Core 2 Duo Intel Pentium D AMD Athlon 64X2

• Shared Cache vs. Private Cache

• On-chip vs. Off-chip memory controller

• On-chip vs. Off-chip Inter-core communication





• Core 2 Duo:
  • Shared L2 cache: no L2 coherence needed and beneficial when only one core is active, but higher latency and a fairness issue.
  • On an L1 miss, the L2 and the other core's L1 are searched simultaneously, giving fast cache-to-cache transfer and L1 coherence (like a bus).
  • Off-chip memory controller; aggressive memory dependence prediction.




• Pentium D:
  • Two Pentium 4 cores on one chip, using a technology remap approach (SMP).
  • Private L2 caches with MESI coherence: an M-to-S transition requires a memory update, the update goes over the off-chip FSB, and L1 coherence traffic also goes through the FSB.
  • Off-chip memory controller: longer delay, but adaptable to new DRAM.




• Athlon 64X2:
  • Private L2 caches, connected through HyperTransport.
  • Uses a system request queue for internal communication between the two cores.
  • The MOESI coherence protocol allows a shared-modified block in the O (Owned) state, so no memory update is needed when a core reads a remote Modified block.
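The difference between the two coherence protocols on a remote read of a Modified block can be summarized in a few lines. The following is a minimal sketch based only on the behavior stated on these slides; the function name `remote_read_of_modified` and the tuple layout are hypothetical and chosen for illustration.

```python
def remote_read_of_modified(protocol):
    """Model one event: core B reads a block that core A holds in state M.

    Returns (state_in_A, state_in_B, memory_updated).
    Hypothetical helper; it covers only this single transition.
    """
    if protocol == "MESI":
        # Pentium D (per the slides): the M-to-S transition requires
        # the dirty data to be written back to memory.
        return ("S", "S", True)
    if protocol == "MOESI":
        # Athlon 64X2 (per the slides): A keeps the dirty block in the
        # Owned state and supplies it to B; memory is not updated.
        return ("O", "S", False)
    raise ValueError("unknown protocol: " + protocol)


if __name__ == "__main__":
    for name in ("MESI", "MOESI"):
        print(name, remote_read_of_modified(name))
```

The third element of the result is the point of the comparison: under MOESI the off-chip memory write is avoided for this access pattern.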


Specifications of the selected processors


Methodology

• Same platform: SUSE Linux 10.1 with kernel 2.6.16-smp
• Micro-benchmarks
  • Memory bandwidth and latency measured by lmbench
  • A lockless program [19] measuring cache-to-cache latency
• Real workloads
  • Single-threaded: SPEC CPU2000 and CPU2006
  • Multi-threaded: blastp, hmmpfam, SPECjbb2005 and SPLASH2


Memory operations from Lmbench

• Memory read: measures the time to read every 4-byte word from memory.
• Memory write: measures the time to write every 4-byte word to memory.
• Other operations include Memory bzero; refer to the paper for details.
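To make the two operations concrete, here is a rough sketch of how a read and a write bandwidth figure can be derived from "touch every word, divide bytes by time". It is not lmbench's code (lmbench is written in C); the function names are hypothetical, and in Python the interpreter overhead dominates, so the absolute numbers are not comparable to the results in this talk.

```python
import time
from array import array


def read_bandwidth_mb_s(nbytes, repeats=3):
    """Read every word of a buffer once and report MB/s (illustrative only)."""
    words = array("I", [1]) * (nbytes // 4)  # unsigned ints, typically 4 bytes each
    best = float("inf")
    total = 0
    for _ in range(repeats):
        start = time.perf_counter()
        total = sum(words)  # touches each word exactly once
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)  # guard against a zero timer reading
    return len(words) * words.itemsize / best / 1e6, total


def write_bandwidth_mb_s(nbytes, repeats=3):
    """Write every word of a buffer once and report MB/s (illustrative only)."""
    words = array("I", [0]) * (nbytes // 4)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(len(words)):
            words[i] = 1  # stores to each word exactly once
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)
    return len(words) * words.itemsize / best / 1e6, sum(words)


if __name__ == "__main__":
    for size in (4096, 1 << 20):
        print(size, read_bandwidth_mb_s(size)[0], write_bandwidth_mb_s(size)[0])
```

Sweeping `nbytes` across the cache sizes is what produces the step-shaped curves on the bandwidth slides.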


Lockless Program measuring cache-to-cache latency

• Does not employ expensive read-modify-write atomic primitives.
• Maintains a lockless counter for each thread.
• *pPong is in a different cache line from *pPing.
• Cache-to-cache (C2C) latency for Core 2 Duo, Pentium D and Athlon 64X2: 33 ns, 133 ns and 68 ns, respectively.
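The structure of such a ping-pong exchange can be sketched as follows. This is only an illustration of the control flow, not the program from [19]: each counter has a single writer, so no atomic read-modify-write is needed. The names `ping` and `pong` stand in for *pPing and *pPong, the function name is hypothetical, and Python cannot place the counters in chosen cache lines, so the time it reports reflects thread scheduling rather than cache-to-cache latency.

```python
import threading
import time


def ping_pong(rounds=200):
    """Two threads hand a counter value back and forth without locks."""
    ping = [0]  # written only by the main thread
    pong = [0]  # written only by the responder thread

    def responder():
        for i in range(1, rounds + 1):
            while ping[0] != i:  # spin until the new ping value arrives
                time.sleep(0)    # yield so the other thread can run
            pong[0] = i          # answer with the same value

    worker = threading.Thread(target=responder)
    worker.start()
    start = time.perf_counter()
    for i in range(1, rounds + 1):
        ping[0] = i
        while pong[0] != i:      # spin until the answer arrives
            time.sleep(0)
    elapsed = time.perf_counter() - start
    worker.join()
    # Each round consists of two one-way transfers.
    return elapsed / (2 * rounds), pong[0]


if __name__ == "__main__":
    one_way, last = ping_pong()
    print("one-way hand-off: %.1f us, last value %d" % (one_way * 1e6, last))
```

In a native implementation the two threads would be pinned to different cores, and the per-transfer time would approximate the C2C latency quoted above.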


Memory bandwidth collected from the lmbench suite

[Figure: six panels of lmbench memory bandwidth: Intel Core 2 Duo, Intel Pentium D and AMD Athlon 64X2, each with 1 copy and with 2 copies of lmbench. X-axis: Array Size (Bytes), from 512 to 1024M for 1 copy and from 512 to 256M for 2 copies. Y-axis: Bandwidth (MB/s), from 0 to 17500 for 1 copy and from 0 to 35000 for 2 copies. Series: libc bcopy unaligned, libc bcopy aligned, Memory bzero, unrolled bcopy unaligned, Memory read, Memory write. Annotations on the charts: "Doubled!!" and "Private cache is faster!"]

1. In general, Core 2 Duo and Athlon 64X2 have better bandwidth than Pentium D.

2. Pentium D shows the best memory read bandwidth when the array size is smaller than its L2 size.

3. Athlon 64X2 provides doubled memory read bandwidth when two copies of lmbench run, benefiting from its on-chip memory controller.


SPEC CPU2000 and CPU2006 benchmarks’ execution time

[Figure: four bar charts of execution time in seconds for Core 2 Duo, Pentium D and Athlon 64X2: SPEC CPU2000 Execution Time (Single Program), SPEC CPU2000 Average Execution Time (Mixed Program), SPEC CPU2006 Execution Time (Single Program) and SPEC CPU2006 Average Execution Time (Mixed Program). CPU2000 benchmarks: AMMP, ART, BZIP2, CRAFTY, EON, EQUAKE, GAP, GCC, GZIP, MCF, MESA, PARSER, PERL, TWOLF, VPR. CPU2006 benchmarks: ASTAR, BZIP2, GCC, H264REF, HMMER, LIBQUANTUM, OMNETPP, PERL, SJENG.]

1. The Core 2 Duo processor runs fastest for almost all workloads, especially art and mcf.

2. Athlon 64X2 shows the best performance for ammp, which has a large working set that results in a high L2 miss rate.

3. When mixed with another program, a memory-intensive program's execution time increases substantially.

4. When mixed with another program, a CPU-bound program's execution time increases only slightly.


Multi-programmed speedup of mixed SPEC CPU 2000/2006 benchmarks

[Figure: (a) SPEC CPU2000 Speedup and (b) SPEC CPU2006 Speedup. For each benchmark, the MAX, AVG and MIN speedup (%), on an axis from 80 to 200, is shown for C2D (Core 2 Duo), PNT (Pentium D) and ATH (Athlon 64X2). CPU2000 benchmarks: AMMP, ART, BZIP2, CRAFTY, EON, EQUAKE, GAP, GCC, GZIP, MCF, MESA, PARSER, PERL, TWOLF, VPR. CPU2006 benchmarks: ASTAR, BZIP, GCC, H264REF, HMMER, LIBQUANTUM, OMNETPP, PERL, SJENG.]

1. Athlon 64X2 achieves the best speedup for all workloads.

2. CPU-bound programs show the best speedup.

3. Memory-bound programs show the worst speedup.
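The slides do not spell out how the multi-programmed speedup is computed. The sketch below assumes one plausible definition: the time to run two programs back to back on one core, divided by the time until both finish when they run concurrently on the two cores, expressed as a percentage. The function name and the sample timings are hypothetical and are not measurements from this study.

```python
def multiprogrammed_speedup_pct(t_a_alone, t_b_alone, t_a_mixed, t_b_mixed):
    """Speedup (%) of running programs A and B together versus sequentially.

    ASSUMED definition, not taken from the slides:
        100 * (t_a_alone + t_b_alone) / max(t_a_mixed, t_b_mixed)
    """
    sequential = t_a_alone + t_b_alone
    concurrent = max(t_a_mixed, t_b_mixed)  # both programs must have finished
    return 100.0 * sequential / concurrent


if __name__ == "__main__":
    # Made-up timings in seconds.
    print(multiprogrammed_speedup_pct(100, 100, 100, 100))  # no interference
    print(multiprogrammed_speedup_pct(100, 100, 200, 250))  # heavy contention
```

Under this definition, 200% means the two programs do not disturb each other at all, and values below 100% mean that sharing the memory hierarchy made the pair slower than running them one after the other.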


Multithreaded Program Behaviors

[Figure: two charts for Core 2 Duo, Pentium D and Athlon 64X2 over the workloads blastp, hmmpfam, barnes, fmm, ocean, fft, lu-con, lu-non-con and radix. Execution Time (1-Thread): seconds, logarithmic axis from 1.0 to 10000.0. Multithreaded Speedup: axis from 1.0 to 2.0.]

1. Core 2 Duo's single-thread performance benefits from its larger L2 cache.

2. Core 2 Duo shows the best speedup for ocean due to a high cache-to-cache transfer ratio, as verified with the Intel VTune Analyzer.

3. Pentium D shows the best speedup for barnes because of the low cache miss rate.


Conclusions

• Analyzed the memory hierarchy of selected Intel and AMD dual-core processors.
• For the best performance and scalability, the following factors are important:
  • fast cache-to-cache communication
  • large L2 or shared capacity
  • fast front-side bus
  • on-chip memory controller
  • fair resource (cache) sharing


Thank you!

Questions?


Backup Slides (Memory load latency collected from the lmbench suite)

[Figure: six panels of lmbench memory load latency: Intel Core 2 Duo, Intel Pentium D and AMD Athlon 64X2, each with 1 copy and with 2 copies of lmbench. X-axis: Array Size (MB). Y-axis: Latency (ns), from 0 to 150. Recoverable series labels: stride-16, stride-32, stride-64, stride-128 and stride-1024.]


Memory latency collected from the lmbench suite (continued)

• Latencies for all configurations jump once the array size exceeds the L2 size.
• When the stride size is equal to 128 bytes, Pentium D still benefits partially from prefetching, but the L2 prefetchers of Core 2 Duo and Athlon 64X2 are not triggered.
• When the stride size is larger than 128 bytes, Athlon 64X2's on-die memory controller and separate I/O HyperTransport show their advantage.
• Running two copies of the lmbench suite puts more pressure on Pentium D.
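The latency curves come from a dependent-load walk through an array at a fixed stride. The following sketch shows the idea under simplifying assumptions: it is not lmbench's implementation, the function name `chase` is hypothetical, and in Python the interpreter cost per load hides the cache effects, so only the access pattern is meaningful.

```python
import time
from array import array


def chase(array_bytes, stride_bytes, loads):
    """Follow a ring of indices spaced stride_bytes apart.

    Returns (average ns per load, final index). Each load depends on the
    previous one, so loads cannot overlap.
    """
    item = array("q").itemsize              # 8-byte signed integers
    n = array_bytes // item
    step = max(1, stride_bytes // item)     # stride expressed in elements
    slots = list(range(0, n, step))
    ring = array("q", [0]) * n
    for j, i in enumerate(slots):
        ring[i] = slots[(j + 1) % len(slots)]  # each slot points to the next
    p = 0
    start = time.perf_counter()
    for _ in range(loads):
        p = ring[p]                         # dependent load
    elapsed = time.perf_counter() - start
    return elapsed / loads * 1e9, p


if __name__ == "__main__":
    for stride in (16, 64, 128, 1024):
        ns, _ = chase(1 << 20, stride, 100000)
        print("stride-%d: %.1f ns per load (Python overhead included)" % (stride, ns))
```

In a native version, small strides let hardware prefetchers and cache-line reuse hide memory latency, while large strides defeat both, which is the behavior the bullets above describe.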


Backup Slides (Bandwidth for STREAM / STREAM2)

[Figure: two bar charts, Stream Bandwidth (1 Copy) and Stream Bandwidth (2 Copy). X-axis: Operation (copy, scale, add, triad, fill*, copy*, daxpy*, sum*; * means from STREAM2). Y-axis: Bandwidth (MB/s), from 0 to 16000.]

• The add operation is a loop of c[i] = a[i] + b[i], which can easily take advantage of SSE2 packed operations, so it shows higher bandwidth.

• Intel Core 2 Duo shows the best bandwidth for all operations because of its L1 data prefetchers and faster front-side bus.

• Athlon 64X2 has better bandwidth than Pentium D due to its faster on-chip memory controller.
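For reference, the four classic STREAM kernels named on the chart can be written out as below. This is an illustrative sketch, not the STREAM source: the harness `run_stream` and the table `ARRAYS_TOUCHED` are hypothetical names, Python lists are not contiguous arrays of doubles, and the MB/s values it prints mostly measure the interpreter.

```python
import time

WORD = 8  # bytes per double-precision element, as STREAM counts them


def copy(a, b, c, q):
    c[:] = a                                   # c[i] = a[i]


def scale(a, b, c, q):
    b[:] = [q * x for x in c]                  # b[i] = q * c[i]


def add(a, b, c, q):
    c[:] = [x + y for x, y in zip(a, b)]       # c[i] = a[i] + b[i]


def triad(a, b, c, q):
    a[:] = [y + q * z for y, z in zip(b, c)]   # a[i] = b[i] + q * c[i]


# Number of arrays each kernel reads or writes, used to count bytes moved.
ARRAYS_TOUCHED = {copy: 2, scale: 2, add: 3, triad: 3}


def run_stream(n=1000, q=3.0):
    """Run the four kernels once and return the arrays plus MB/s per kernel."""
    a, b, c = [1.0] * n, [2.0] * n, [0.0] * n
    mb_s = {}
    for kernel in (copy, scale, add, triad):
        start = time.perf_counter()
        kernel(a, b, c, q)
        elapsed = max(time.perf_counter() - start, 1e-9)
        mb_s[kernel.__name__] = ARRAYS_TOUCHED[kernel] * WORD * n / elapsed / 1e6
    return a, b, c, mb_s


if __name__ == "__main__":
    print(run_stream(100000)[3])
```

The add and triad kernels move three arrays per iteration instead of two, which is why their byte counts, and therefore their reported bandwidth, are computed with a factor of three.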
