Memory Performance and Scalability of Intel’s and AMD’s Dual-Core Processors: A Case Study
Lu Peng (1), Jih-Kwon Peir (2), Tribuvan K. Prakash (1), Yen-Kuang Chen (3) and David Koppelman (1)
(1) Louisiana State University, (2) University of Florida, (3) Intel Corporation
04/11/2007 IPCCC’07 Peng, Louisiana State University 2
Motivation
• Dual-core processors are popular.
• Goal: understand the impact of the memory hierarchy on overall performance.
• Which factors are important for memory hierarchy performance?
• What speedup do two threads achieve?
Selected Three Dual-Core Processors
Intel Core 2 Duo Intel Pentium D AMD Athlon 64X2
• Shared Cache vs. Private Cache
• On-chip vs. Off-chip memory controller
• On-chip vs. Off-chip Inter-core communication
[Figure: block diagrams of the three processors, with the labels Off-Chip, On-Chip and Shared]
• Core 2 Duo:
  • Shared L2 cache: no L2 coherence needed and beneficial when only one core is active, but higher latency and a fairness issue
  • On an L1 miss, the L2 and the other core's L1 are searched simultaneously, giving fast cache-to-cache transfer and L1 coherence (like a bus)
  • Off-chip memory controller; aggressive memory dependence prediction
• Pentium D:
  • Two Pentium 4 cores on one chip, using a technology-remap approach (SMP)
  • Private L2 caches with MESI coherence; an M-to-S transition requires a memory update, which goes over the off-chip FSB; L1 coherence traffic also goes through the FSB
  • Off-chip memory controller: longer delay, but adaptable to new DRAM
• Athlon 64X2:
  • Private L2 caches, connected through HyperTransport
  • A system request queue handles internal communication between the two cores
  • The MOESI coherence protocol keeps a shared-modified block in the O (Owned) state, so no memory update is needed when a core reads a remote Modified block
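The difference between the MESI behavior listed for Pentium D and the MOESI behavior listed for Athlon 64X2 can be illustrated with a toy model of a single event: one core reads a block that the other core holds in the Modified state. This is only a sketch of the state outcome stated on the slides. The names `ReadResult` and `remote_read_of_modified` are invented for illustration, and no timing or bus behavior is modeled.

```python
from dataclasses import dataclass


@dataclass
class ReadResult:
    owner_state: str      # new state in the cache that held the Modified copy
    requester_state: str  # new state in the cache that issued the read
    memory_updated: bool  # True if the dirty block had to be written to memory


def remote_read_of_modified(protocol: str) -> ReadResult:
    """Model one event: core B reads a block that core A holds in state M."""
    if protocol == "MESI":
        # No Owned state: the dirty data is written back, both copies become S.
        return ReadResult("S", "S", True)
    if protocol == "MOESI":
        # The previous holder keeps the dirty block in O and supplies it
        # directly, so memory does not need to be updated.
        return ReadResult("O", "S", False)
    raise ValueError(f"unknown protocol: {protocol}")


for name in ("MESI", "MOESI"):
    print(name, remote_read_of_modified(name))
```

The extra memory update in the MESI case is the step that, on Pentium D, has to cross the off-chip FSB.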
Specifications of the selected processors
Methodology
• Same platform: SUSE Linux 10.1 with kernel 2.6.16-smp
• Micro-benchmarks:
  • Memory bandwidth and latency measured by lmbench
  • A lockless program [19] measuring cache-to-cache latency
• Real workloads:
  • Single-threaded: SPEC CPU2000 and CPU2006
  • Multi-threaded: blastp, hmmpfam, SPECjbb2005 and SPLASH2
Memory operations from lmbench
• Memory read: measures the time to read every 4-byte word from memory.
• Memory write: measures the time to write every 4-byte word to memory.
• Other operations, such as memory bzero, are described in the paper.
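The read and write measurements above reduce to timing one pass over an array of 4-byte words and dividing the bytes touched by the elapsed time. The sketch below shows only that arithmetic; it is not lmbench code, the function names are invented, and in Python the interpreter overhead dominates, so the numbers it prints say nothing about real memory bandwidth.

```python
import time
from array import array


def read_bandwidth(num_words: int):
    """Read every word once; return (checksum, MB/s), with 1 MB = 10**6 bytes."""
    data = array("i", range(num_words))  # "i" is a C int, 4 bytes on most platforms
    start = time.perf_counter()
    checksum = sum(data)                 # touches every word exactly once
    elapsed = max(time.perf_counter() - start, 1e-9)
    megabytes = num_words * data.itemsize / 1e6
    return checksum, megabytes / elapsed


def write_bandwidth(num_words: int):
    """Write every word once; return (array, MB/s), with 1 MB = 10**6 bytes."""
    data = array("i", [0] * num_words)
    start = time.perf_counter()
    for i in range(num_words):
        data[i] = 1                      # one store per word
    elapsed = max(time.perf_counter() - start, 1e-9)
    megabytes = num_words * data.itemsize / 1e6
    return data, megabytes / elapsed


checksum, read_mb_s = read_bandwidth(100_000)
_, write_mb_s = write_bandwidth(100_000)
print(f"read: {read_mb_s:.1f} MB/s, write: {write_mb_s:.1f} MB/s")
```

Sweeping `num_words` across the cache sizes is what produces the bandwidth-versus-array-size curves in the following slides.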
Lockless Program measuring cache-to-cache latency
• Does not use expensive read-modify-write atomic primitives.
• Maintains a lockless counter for each thread.
• *pPong is in a different cache line from *pPing.
• Cache-to-cache (C2C) latency for Core 2 Duo, Pentium D and Athlon 64X2: 33 ns, 133 ns and 68 ns respectively.
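The structure of such a ping-pong measurement can be sketched as two threads that each own one counter and spin on the other's. This is not the program from [19]: the names `ping`, `pong` and `ping_pong` are invented here, Python cannot place the counters in separate cache lines, and the CPython interpreter lock dominates the timing, so the result is not a cache-to-cache latency. Only the lockless handshake is illustrated.

```python
import threading
import time


def ping_pong(rounds: int):
    """Bounce a counter between two threads.

    Returns (final ping, final pong, seconds per round trip).
    """
    ping = [0]  # written only by the main (ping) thread
    pong = [0]  # written only by the responder (pong) thread

    def responder():
        for i in range(1, rounds + 1):
            while ping[0] != i:  # spin until the ping thread publishes i
                time.sleep(0)    # yield so the other thread can run
            pong[0] = i          # plain store: no lock, no atomic read-modify-write

    worker = threading.Thread(target=responder)
    worker.start()
    start = time.perf_counter()
    for i in range(1, rounds + 1):
        ping[0] = i              # publish the next value
        while pong[0] != i:      # spin until the responder echoes it
            time.sleep(0)
    elapsed = time.perf_counter() - start
    worker.join()
    return ping[0], pong[0], elapsed / rounds


final_ping, final_pong, round_trip = ping_pong(200)
print(final_ping, final_pong, f"{round_trip * 1e6:.1f} us per round trip")
```

Because each counter has exactly one writer, no lock or atomic primitive is needed; in a native implementation every handoff forces the cache line to move between the two cores, which is what the measurement times.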
Memory bandwidth collected from the lmbench suite
[Figures: lmbench memory bandwidth (MB/s) versus array size (bytes), one panel per processor (Intel Core 2 Duo, Intel Pentium D, AMD Athlon 64X2) for 1 copy (array sizes 512 to 1024M, y-axis 0 to 17500) and for 2 copies (array sizes 512 to 256M, y-axis 0 to 35000). Series: libc bcopy unaligned, libc bcopy aligned, Memory bzero, unrolled bcopy unaligned, Memory read, Memory write. Plot callouts: “Doubled!!” and “Private cache is faster!”]
1. In general, Core 2 Duo and Athlon 64X2 have better bandwidth than Pentium D.
2. Pentium D shows the best memory read bandwidth when the array size is smaller than its L2 size.
3. Athlon 64X2 doubles its memory read bandwidth when two copies of lmbench run, benefiting from its on-chip memory controller.
SPEC CPU2000 and CPU2006 benchmarks’ execution time
[Figures: execution time in seconds on Core 2 Duo, Pentium D and Athlon 64X2, in four panels: SPEC CPU2000 Execution Time (Single Program), SPEC CPU2000 Average Execution Time (Mixed Program), SPEC CPU2006 Execution Time (Single Program), SPEC CPU2006 Average Execution Time (Mixed Program). CPU2000 benchmarks: AMMP, ART, BZIP2, CRAFTY, EON, EQUAKE, GAP, GCC, GZIP, MCF, MESA, PARSER, PERL, TWOLF, VPR. CPU2006 benchmarks: ASTAR, BZIP2, GCC, H264REF, HMMER, LIBQUANTUM, OMNETPP, PERL, SJENG.]
1. Core 2 Duo runs fastest for almost all workloads, especially art and mcf.
2. Athlon 64X2 shows the best performance for ammp, which has a large working set and therefore a high L2 miss rate.
3. When mixed with another program, a memory-intensive program's execution time increases substantially.
4. When mixed with another program, a CPU-bound program's execution time increases only slightly.
Multi-programmed speedup of mixed SPEC CPU2000/2006 benchmarks
[Figures: multi-programmed speedup (%), y-axis 80 to 200, showing MAX, AVG and MIN for Core 2 Duo (C2D), Pentium D (PNT) and Athlon 64X2 (ATH) per benchmark. (a) SPEC CPU2000 Speedup: AMMP, ART, BZIP2, CRAFTY, EON, EQUAKE, GAP, GCC, GZIP, MCF, MESA, PARSER, PERL, TWOLF, VPR. (b) SPEC CPU2006 Speedup: ASTAR, BZIP, GCC, H264REF, HMMER, LIBQUANTUM, OMNETPP, PERL, SJENG.]
1. Athlon 64X2 achieves the best speedup for all workloads.
2. CPU-bound programs show the best speedup.
3. Memory-bound programs show the worst speedup.
Multithreaded Program Behaviors
[Figures: (1) Execution Time (1-Thread), in seconds on a logarithmic axis from 1 to 10000; (2) Multithreaded Speedup, axis from 1.0 to 2.0. Both panels compare Core 2 Duo, Pentium D and Athlon 64X2 on blastp, hmmpfam, barnes, fmm, ocean, fft, lu-con, lu-non-con and radix.]
1. Core 2 Duo's single-thread performance benefits from its larger L2 cache.
2. Core 2 Duo shows the best speedup for ocean due to a high cache-to-cache transfer ratio, as verified with the Intel VTune Analyzer.
3. Pentium D shows the best speedup for barnes because of the low cache miss rate.
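The speedup axis from 1.0 to 2.0 is consistent with the usual definition of two-thread speedup: one-thread execution time divided by two-thread execution time. The sketch below assumes that definition. The timings in it are hypothetical values chosen for illustration, not measurements from the study.

```python
def speedup(one_thread_seconds: float, two_thread_seconds: float) -> float:
    """Speedup of the two-thread run over the one-thread run."""
    return one_thread_seconds / two_thread_seconds


def parallel_efficiency(speedup_value: float, cores: int = 2) -> float:
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup_value / cores


# Hypothetical timings: 100 s with one thread, 62.5 s with two threads.
s = speedup(100.0, 62.5)
print(f"speedup = {s:.2f}, efficiency = {parallel_efficiency(s):.0%}")
```

A speedup of 2.0 would mean the second core is fully used; values closer to 1.0 indicate that shared resources such as the memory hierarchy limit scaling.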
Conclusions
• Analyzed the memory hierarchy of selected Intel and AMD dual-core processors.
• For the best performance and scalability, the following factors are important:
  • fast cache-to-cache communication
  • large L2 or shared capacity
  • fast front-side bus
  • on-chip memory controller
  • fair resource (cache) sharing
Thank you!
Questions?
Backup Slides (Memory load latency collected from the lmbench suite)
[Figures: lmbench memory load latency (ns, y-axis 0 to 150) versus array size (MB), one panel per processor (Intel Core 2 Duo, Intel Pentium D, AMD Athlon 64X2) for 1 copy and for 2 copies. Series: stride-16, stride-32, stride-64, stride-128, stride-256, stride-512, stride-1024.]
Memory latency collected from the lmbench suite (continued)
• Latencies for all configurations jump once the array size exceeds the L2 size.
• When the stride is 128 bytes, Pentium D still benefits partially, but the L2 prefetchers of Core 2 Duo and Athlon 64X2 are not triggered.
• When the stride is larger than 128 bytes, Athlon 64X2's on-die memory controller and separate I/O HyperTransport show their advantage.
• Two copies of the lmbench suite put more pressure on Pentium D.
Backup Slides (Bandwidth for STREAM / STREAM2)
[Figures: STREAM bandwidth (MB/s, y-axis 0 to 16000) for Core 2 Duo, Pentium D and Athlon 64X2, in two panels: Stream Bandwidth (1 Copy) and Stream Bandwidth (2 Copy). Operations: copy, scale, add, triad, fill*, copy*, daxpy*, sum* (* means from STREAM2).]
• The add operation is a loop of c[i] = a[i] + b[i], which can easily take advantage of SSE2 packed operations, so it shows higher bandwidth.
• Intel Core 2 Duo shows the best bandwidth for all operations because of its L1 data prefetchers and faster front-side bus.
• Athlon 64X2 has better bandwidth than Pentium D because of its faster on-chip memory controller.
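The add kernel and the way STREAM turns its run time into a bandwidth figure can be sketched as follows. STREAM counts every array touched per iteration (two for copy and scale, three for add and triad) at 8 bytes per double, and reports MB/s with 1 MB = 10^6 bytes. The function names are invented for illustration, and Python lists hold boxed floats, so the printed number does not reflect hardware memory bandwidth or SSE2 execution.

```python
import time

BYTES_PER_DOUBLE = 8
# Arrays touched per loop iteration under STREAM's counting convention.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_add(a, b):
    """The add kernel: c[i] = a[i] + b[i]."""
    return [x + y for x, y in zip(a, b)]


def add_bandwidth_mb_s(n: int):
    """Run the add kernel on n elements; return (c, MB/s as STREAM counts it)."""
    a = [1.0] * n
    b = [2.0] * n
    start = time.perf_counter()
    c = stream_add(a, b)
    elapsed = max(time.perf_counter() - start, 1e-9)
    bytes_moved = ARRAYS_TOUCHED["add"] * BYTES_PER_DOUBLE * n  # read a, read b, write c
    return c, bytes_moved / 1e6 / elapsed


c, bandwidth = add_bandwidth_mb_s(100_000)
print(f"add: {bandwidth:.1f} MB/s over {len(c)} elements")
```

Each iteration is independent of the others, which is why a native compiler can map the loop onto packed SSE2 additions.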