
ZKI AK Supercomputing Herbsttreffen 2016
Intel Xeon Phi in the Cray XC
Stefan [email protected]

Legal Disclaimer
Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.
Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners.
Copyright 2016 Cray Inc.

Agenda
Cray packaging and customers
Xeon Phi Overview
Software
  Memory layout and usage
  CrayPat
Results

Cray XC Series Compute Blade - Intel Xeon Phi(TM) processor, Knights Landing Many-Core

Early XC KNL Customer Adoption - Pioneers
Cray will be shipping 30,000+ KNL processors in the next two months
Argonne National Labs (ANL)
European Centre for Medium-Range Weather Forecasts (ECMWF)
Los Alamos National Labs (LANL) & Sandia National Labs (SNL)
National Energy Research Scientific Computing Center (NERSC) / Lawrence Berkeley National Labs
Kyoto University in Japan
HLRN/ZIB in Germany (ZIB)

KNL Processor Architecture
2 DDR4 memory controllers, 3 channels each
8 Memory controllers for the MCDRAM (16 Gbytes)
Up to 36 tiles, each tile with
  2 cores, each with
    4 SMT
    2 VPUs, each with 512 bit vector registers
  Shared 1 MB L2 cache
  CHA: Caching home agent
Frequency 1.3-1.6 GHz
  Turbo mode: +100 MHz all tiles, +200 MHz single tile
  AVX: -100 MHz, heavy AVX: -200 MHz
Binary compatible with Intel Xeon

Core to Core: Comparing Xeon Phi to Xeon
Feature              Broadwell        Knights Landing   How KNL compares
Number of cores      22               68                A lot more cores (3X)
Core frequency       2.1 to 3.6 GHz   1.3 to 1.6 GHz    Lower frequency (2X)
Serial scalar rate   Lorenz=3048      874               3.5X slower
L1 cache size        32KB             32KB              Same
L1 ld bandwidth      2X 32 bytes      2X 64 bytes       Higher per cycle (2X)
L1 load rate         7 billion/sec    3 billion/sec     Same per cycle, but lower clock
L2 cache size        256KB            1MB/2 cores       Much larger (2X per core)
L2 bandwidth         64 bytes/cyc     64 bytes/cyc      Same per cycle, but lower clock
L3 cache size        2.5 MB/core      N/A               Many kernels bandwidth limited

Node to Node: Comparing Xeon Phi to Xeon
Feature            Broadwell      Knights Landing     How KNL compares
Number of cores    up to 44       68                  More cores (1.5X)
DDR                8 channels     6 channels          25% less bandwidth + capacity
MCDRAM             N/A            8 channels, 16 GB   Unique feature
Memory Bandwidth   ~120 GB/s      490 GB/s            MCDRAM rate (4X)
FP Peak (vector)   ~1.3 TF/s      ~2.6 TF/s           Higher peak (2X)
FP Peak (scalar)   334 GF/s       326 GF/s            Slightly less
Instruction Peak   387 Ginst/s    190 Ginst/s         Half peak rate
Package Power      290 W          215 W               Less power
Single node: 2-socket BDW vs 1-socket Xeon Phi
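The KNL peak figures in the table above can be reproduced, to rounding, from the architecture numbers given earlier (68 cores, 2 VPUs per core, 512 bit registers). The sketch below is a back-of-the-envelope check, not Cray's stated method. It assumes 16 double-precision flops per VPU per cycle (8 lanes times 2 for a fused multiply-add), 2 scalar flops per VPU per cycle, 2 instructions issued per core per cycle, a 1.2 GHz clock under heavy AVX load and 1.4 GHz otherwise. These values are assumptions chosen because they match the slide; the function name is illustrative.

```c
#include <assert.h>

/* Peak rate in giga-operations per second:
 * cores x units per core x operations per unit per cycle x clock (GHz).
 * All inputs are assumptions supplied by the caller, not measured values. */
double peak_giga_ops(int cores, int units_per_core,
                     int ops_per_unit_per_cycle, double clock_ghz)
{
    assert(cores > 0 && units_per_core > 0 && ops_per_unit_per_cycle > 0);
    assert(clock_ghz > 0.0);
    return (double)cores * units_per_core * ops_per_unit_per_cycle * clock_ghz;
}
```

Under these assumptions, peak_giga_ops(68, 2, 16, 1.2) gives about 2611 GF/s (the ~2.6 TF/s vector peak), peak_giga_ops(68, 2, 2, 1.2) gives about 326 GF/s (the scalar peak), and peak_giga_ops(68, 1, 2, 1.4) gives about 190 Ginst/s (the instruction peak).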

KNL Memory Modes
Flat
  MCDRAM is NUMA node 1
  DDR is NUMA node 0
Cache
  MCDRAM acts as memory-side cache for DDR
  DDR is NUMA node 0
Hybrid
  Part of MCDRAM is cache, part is NUMA node 1
  DDR is NUMA node 0
Changing the memory mode requires a BIOS change followed by a node reboot. The Cray XC gives this flexibility to the user at the level of the batch submission script.

Longer Vector/SIMD Floating Point Units
Register length is increased: 128 => 256 bit (today) => 512 bit (KNL + next generation Xeon)
So you should make sure your application vectorizes today, to be prepared for the KNL
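To make the register lengths above concrete, the sketch below computes how many elements fit in one vector register and shows a loop written so that a compiler can vectorize it: unit stride, no dependence between iterations, and `restrict` to promise that the arrays do not overlap. The function names are illustrative, and whether a given compiler actually vectorizes the loop depends on its flags and should be checked in its optimization report.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of the given size that fit in one vector register. */
size_t lanes_per_register(size_t register_bits, size_t element_bytes)
{
    assert(element_bytes > 0);
    return register_bits / (8 * element_bytes);
}

/* y[i] = a * x[i] + y[i] for i in [0, n).
 * Unit-stride accesses and non-aliasing pointers make this loop a
 * straightforward candidate for SIMD code generation. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

With 512 bit registers this gives 16 single-precision or 8 double-precision lanes, twice the count of a 256 bit register, so a loop that stays scalar leaves most of each VPU unused.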

Where is the memory located
And how do we get it somewhere else?

Using numactl to use MCDRAM (flat mode)
To get all memory allocated in MCDRAM, use numactl
  Configure MCDRAM as flat
    MCDRAM will be NUMA node 1
  Per-node memory limit is 16 GB (MCDRAM size)
  Run using numactl
$> aprun -n 320 -N 64 numactl --membind=1 a.out
To use MCDRAM first, but overflow into DDR
  Configure flat as above
  Use --preferred=1 instead of --membind=1
  Per node memory limit includes all memory (MCDRAM+DDR)
  Often does not help much
    Only first allocations will be to MCDRAM
    Later allocations will overflow into DDR
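The difference between --membind=1 and --preferred=1 described above can be illustrated with a toy model. This is a deliberate simplification and not how the kernel is implemented: real placement happens page by page at first touch, so one allocation can end up split across MCDRAM and DDR. All names are hypothetical, and the DDR capacity is a caller-supplied assumption because the slides do not state it.

```c
#include <assert.h>

enum policy { MEMBIND_MCDRAM, PREFERRED_MCDRAM };
enum placement { IN_MCDRAM, IN_DDR, FAILED };

/* Free capacity per memory type on one node, in GB (toy model state). */
struct node_memory {
    double mcdram_free_gb;
    double ddr_free_gb;
};

/* Place one allocation as a whole: MCDRAM first; DDR only if the policy
 * allows overflow (preferred); otherwise the allocation fails (membind). */
enum placement place(struct node_memory *mem, enum policy pol, double size_gb)
{
    assert(size_gb >= 0.0);
    if (size_gb <= mem->mcdram_free_gb) {
        mem->mcdram_free_gb -= size_gb;
        return IN_MCDRAM;
    }
    if (pol == PREFERRED_MCDRAM && size_gb <= mem->ddr_free_gb) {
        mem->ddr_free_gb -= size_gb;
        return IN_DDR;
    }
    return FAILED;
}
```

In this model, two 10 GB allocations on a node with 16 GB of MCDRAM behave as the slide describes: with the preferred policy the first lands in MCDRAM and the second overflows into DDR, while with membind the second one fails. Which arrays arrive first is a matter of allocation order, not of how bandwidth-critical they are, which is why the preferred policy often does not help much.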

Intel Memory Allocation Examples
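Allocate from DDR:
#include <stdlib.h>
float *fv;
fv = (float *) malloc (sizeof(float) * 1000);
Allocate from MCDRAM (hbw_malloc is declared in hbwmalloc.h from the memkind library; release with hbw_free):
#include <hbwmalloc.h>
float *fv;
fv = (float *) hbw_malloc (sizeof(float) * 1000);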
Allocate arrays from MCDRAM & DDR in Intel Fortran:
c     Declare arrays to be dynamic
      REAL, ALLOCATABLE :: A(:), B(:), C(:)
!DIR$ ATTRIBUTES FASTMEM :: A
      NSIZE=1024
c
c     allocate array A from MCDRAM
c
      ALLOCATE (A(NSIZE))
c
c     Allocate arrays that will come from DDR
c
      ALLOCATE (B(NSIZE), C(NSIZE))
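Code that calls hbw_malloc directly only builds where the memkind library is installed. One way to keep a single source for KNL and non-KNL systems is a thin wrapper selected at compile time, sketched below. The macro HAVE_HBWMALLOC and both function names are hypothetical and would be defined by the user's build; hbw_malloc and hbw_free are the memkind calls used in the C example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef HAVE_HBWMALLOC      /* hypothetical macro set by the build system */
#include <hbwmalloc.h>     /* memkind high-bandwidth allocator */
#endif

/* Allocate an array that should live in MCDRAM when that is possible.
 * Without HAVE_HBWMALLOC this is an ordinary heap allocation. */
float *alloc_bandwidth_critical(size_t count)
{
    assert(count > 0);
#ifdef HAVE_HBWMALLOC
    return (float *) hbw_malloc(sizeof(float) * count);
#else
    return (float *) malloc(sizeof(float) * count);
#endif
}

/* Release memory obtained from alloc_bandwidth_critical with the
 * matching deallocator. */
void free_bandwidth_critical(float *p)
{
#ifdef HAVE_HBWMALLOC
    hbw_free(p);
#else
    free(p);
#endif
}
```

Pairing the allocator and deallocator in one place avoids freeing hbw_malloc memory with free, which is an error.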

CCE Memory Allocation Examples
Allocate from MCDRAM in CCE C:
#pragma memory(bandwidth)
float *fv = (float *) malloc (sizeof(float) * 1000);
Allocate arrays from MCDRAM & DDR in CCE Fortran:
c     Declare arrays to be dynamic
      REAL, ALLOCATABLE :: A(:), B(:), C(:)
      NSIZE=1024
c
c     allocate array A from MCDRAM
c
!DIR$ MEMORY(BANDWIDTH)
      ALLOCATE (A(NSIZE))
c
c     Allocate arrays that will come from DDR
c
      ALLOCATE (B(NSIZE), C(NSIZE))
Allocate from MCDRAM in CCE C++:
#pragma memory(bandwidth)
float *fv = new float[1000];

KNL node: What to be aware of
Memory layout and usage
  How do I use the MCDRAM
  Which arrays are located where
    next release of Reveal will help with identifying arrays to put into fast memory
CPU performance
  Longer Vectors
  Avoid slower serial core performance
  Number of (Hyper-)Cores available (

CrayPat

Performance Counters on KNL
Early access version of PAPI available Cray internally
KNL has 2 general purpose counters
  Without multiplexing, which can distort counter data, it is difficult to define groups equivalent to those available on Xeon
  Welcome input on favorite CPU counter events
Uncore counters not yet available
  Must run at higher privilege level (via perf_event_paranoid file)
  Coming in official PAPI soon
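Why multiplexing can distort counter data: with only 2 general purpose counters, a larger set of events has to take turns, so each event is observed for only part of the run and its total is extrapolated. The sketch below uses the linear scaling that Linux perf_event documents (raw count times time enabled over time running). That PAPI or CrayPat apply exactly this formula on KNL is an assumption, as is the even rotation of events; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Fraction of the run during which each event is actually counted when
 * the events rotate evenly over the available hardware counters. */
double counted_fraction(int hardware_counters, int events)
{
    assert(hardware_counters > 0 && events > 0);
    if (events <= hardware_counters) {
        return 1.0;  /* every event has its own counter, no multiplexing */
    }
    return (double)hardware_counters / (double)events;
}

/* Linear extrapolation of a multiplexed count to the full run. */
double scale_multiplexed_count(uint64_t raw_count,
                               uint64_t time_enabled_ns,
                               uint64_t time_running_ns)
{
    assert(time_running_ns > 0 && time_running_ns <= time_enabled_ns);
    return (double)raw_count * (double)time_enabled_ns
           / (double)time_running_ns;
}
```

With 8 events on 2 counters each event is sampled for a quarter of the run, and the reported value is four times what was measured. The extrapolation is only accurate if the program behaves the same while an event is not being counted, which is the distortion referred to above.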

MCDRAM Configuration Information
CrayPat/X: Version 6.4.2.36 Revision 8374f24 08/08/16 14:59:22
Experiment: lite lite/sample_profile
Number of PEs (MPI ranks): 2,048
Numbers of PEs per Node: 32 PEs on each of 64 Nodes
Numbers of Threads per PE: 1
Number of Cores per Socket: 512 PEs on sockets with 34 Cores
                            1,536 PEs on sockets with 68 Cores
MCDRAM: 7.2 GHz, 16 GiB available as snc2, cache (100% cache) for 512 PEs
MCDRAM: 7.2 GHz, 16 GiB available as quad, cache (100% cache) for 1536 PEs
Avg Process Time: 3,251 secs
High Memory: 9,651,837.1 MBytes 4,712.8 MBytes per PE
I/O Read Rate: 23.645208 MBytes/sec
I/O Write Rate: 6.836539 MBytes/sec
Avg CPU Energy: 43,899,476 joules 685,929 joules per node
Avg CPU Power: 13,504 watts 211.00 watts per node
Annotations on the report:
  Number of PEs (MPI ranks): number of PEs in job
  Number of Cores per Socket: PEs in job split this way
  MCDRAM lines: different NUMA node usages in a single job
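The per-PE and per-node columns in the report can be cross-checked from the totals. The sketch below is only a consistency check on the numbers shown; the function names are illustrative, and the small difference in average power is assumed to come from the process time being displayed rounded to whole seconds.

```c
#include <assert.h>

/* Divide a job-wide total evenly over PEs or nodes. */
double per_unit(double total, int units)
{
    assert(units > 0);
    return total / units;
}

/* Average power in watts from energy in joules and elapsed seconds. */
double average_power_watts(double energy_joules, double seconds)
{
    assert(seconds > 0.0);
    return energy_joules / seconds;
}
```

For this job, per_unit(9651837.1, 2048) is about 4712.8 MBytes per PE, per_unit(43899476.0, 64) is about 685,929 joules per node, and average_power_watts(43899476.0, 3251.0) is about 13,503 watts, close to the reported 13,504. The PE counts are also consistent: 32 PEs on each of 64 nodes is 2,048, and 512 + 1,536 = 2,048.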
