
  • ZKI AK Supercomputing Herbsttreffen 2016

    Intel Xeon Phi in the Cray XC

    Stefan [email protected]

  • Legal Disclaimer

    Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners.

    Copyright 2016 Cray Inc.


  • Agenda

    Cray packaging and customers

    Xeon Phi Overview

    Software
        Memory layout and usage
        CrayPat

    Results


  • Cray XC Series Compute Blade - Intel Xeon Phi™ processor, Knights Landing Many-Core


  • Early XC KNL Customer Adoption - Pioneers

    Cray will be shipping 30,000+ KNL processors in the next two months

    Argonne National Labs (ANL)
    European Centre for Medium-Range Weather Forecasts (ECMWF)
    Los Alamos National Labs (LANL) & Sandia National Labs (SNL)
    National Energy Research Scientific Computing Center (NERSC) / Lawrence Berkeley National Labs
    Kyoto University in Japan
    HLRN/ZIB in Germany (ZIB)


  • KNL Processor Architecture


    2 DDR4 memory controllers, 3 channels each
    8 memory controllers for the MCDRAM (16 GB)
    Up to 36 tiles, each tile with
        2 cores, each with 4 SMT threads
        2 VPUs per core, each with 512-bit vector registers
        Shared 1 MB L2 cache
        CHA: caching home agent
    Frequency 1.3-1.6 GHz
        Turbo mode: +100 all tiles, +200 single tile
        -200 heavy AVX, -100 AVX
    Binary compatible with Intel Xeon


  • Core to Core: Comparing Xeon Phi to Xeon

    Feature                Broadwell       Knights Landing   How KNL compares
    Number of cores        22              68                A lot more cores (3X)
    Core frequency (GHz)   2.1 to 3.6      1.3 to 1.6        Lower frequency (2X)
    Serial scalar rate     Lorenz=3048     874               3.5X slower
    L1 cache size          32KB            32KB              Same
    L1 ld bandwidth        2X 32 bytes     2X 64 bytes       Higher per cycle (2X)
    L1 load rate           7 billion/sec   3 billion/sec     Same per cycle, but lower clock
    L2 cache size          256KB           1MB/2 cores       Much larger (2X per core)
    L2 bandwidth           64 bytes/cyc    64 bytes/cyc      Same per cycle, but lower clock
    L3 cache size          2.5 MB/core     N/A               Many kernels bandwidth limited


  • Node to Node: Comparing Xeon Phi to Xeon

    Feature            Broadwell      Knights Landing     How KNL compares
    Number of cores    up to 44       68                  More cores (1.5X)
    DDR                8 channels     6 channels          25% less bandwidth + capacity
    MCDRAM             N/A            8 channels, 16 GB   Unique feature
    Memory Bandwidth   ~120 GB/s      490 GB/s            MCDRAM rate (4X)
    FP Peak (vector)   ~1.3 TF/s      ~2.6 TF/s           Higher peak (2X)
    FP Peak (scalar)   334 GF/s       326 GF/s            Slightly less
    Instruction Peak   387 Ginst/s    190 Ginst/s         Half peak rate
    Package Power      290 W          215 W               Less power


    Single Node : 2 socket BDW vs 1 socket Xeon Phi
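The KNL floating-point peaks in the table can be reproduced with a short calculation. The C sketch below assumes 2 VPUs per core, 8 double-precision lanes per 512-bit register, and one fused multiply-add (2 flops) per lane per cycle. The 1.2 GHz clock is an assumption chosen for illustration (a sustained heavy-AVX frequency); it is not a figure stated on the slide, and `peak_gflops` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>

/* Peak GFLOP/s = cores * clock (GHz) * VPUs * lanes per VPU * flops per lane. */
double peak_gflops(int cores, double ghz, int vpus, int lanes, int flops_per_lane)
{
    assert(cores > 0 && ghz > 0.0 && vpus > 0 && lanes > 0 && flops_per_lane > 0);
    return cores * ghz * vpus * lanes * flops_per_lane;
}

/* Print vector and scalar peaks for the assumed KNL configuration. */
void print_knl_peaks(void)
{
    const int cores = 68;      /* from the table */
    const double ghz = 1.2;    /* assumption: sustained heavy-AVX clock */

    /* Vector: 8 double-precision lanes per 512-bit register. */
    printf("vector peak: %.1f GF/s\n", peak_gflops(cores, ghz, 2, 8, 2));
    /* Scalar: only one lane of each VPU is used. */
    printf("scalar peak: %.1f GF/s\n", peak_gflops(cores, ghz, 2, 1, 2));
}
```

Under these assumptions the vector result lands close to the ~2.6 TF/s in the table and the scalar result close to 326 GF/s. The factor of 8 between them is the vector width, which is why unvectorized code leaves most of the peak unused.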


  • KNL Memory Modes

    Flat
        MCDRAM is NUMA node 1
        DDR is NUMA node 0

    Cache
        MCDRAM acts as memory-side cache for DDR
        DDR is NUMA node 0

    Hybrid
        Part of MCDRAM is cache, part is NUMA node 1
        DDR is NUMA node 0

    Changing memory mode requires a BIOS change followed by a node reboot. The Cray XC gives this flexibility to the user on a batch submission script level.


  • Longer Vector/SIMD Floating Point Units

    Register length is increased: 128 => 256 (today) => 512 (KNL + next generation Xeon)

    So you should make sure your application vectorizes today, to be prepared for the KNL
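As an illustration of what "vectorizes" means in practice, here is a minimal C sketch of a loop that compilers can usually map onto wide SIMD registers: unit-stride accesses, no loop-carried dependence, and `restrict` pointers that rule out aliasing. The function name `saxpy` and the use of the OpenMP `simd` directive are choices made for this example, not something prescribed by the slides.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i]
 * restrict promises the compiler that x and y do not overlap,
 * which removes the main obstacle to vectorizing this loop. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);

    /* Request SIMD code generation when the compiler runs with OpenMP enabled;
     * without OpenMP the loop is still a good auto-vectorization candidate. */
#if defined(_OPENMP)
#pragma omp simd
#endif
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether the loop was actually vectorized is best confirmed from the compiler's optimization report rather than assumed.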


  • Where is the memory located?

    And how do we get it somewhere else?

  • Using numactl to use MCDRAM (flat mode)

    To get all memory allocated in MCDRAM, use numactl
        Configure MCDRAM as flat
            MCDRAM will be NUMA node 1
        Per-node memory limit is 16 GB (MCDRAM size)
        Run using numactl

            $> aprun -n 320 -N 64 numactl --membind=1 a.out

    To use MCDRAM first, but overflow into DDR
        Configure flat as above
        Use --preferred=1 instead of --membind=1
        Per-node memory limit includes all memory (MCDRAM+DDR)
        Often does not help much
            Only first allocations will be to MCDRAM
            Later allocations will overflow into DDR
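The choice between `--membind=1` and `--preferred=1` comes down to whether the per-node footprint fits into the 16 GB of MCDRAM. The following C sketch shows that arithmetic; the helper names are hypothetical, the per-rank footprint is something you would have to measure yourself, and the check ignores memory used by the operating system, so in practice some headroom is needed.

```c
#include <assert.h>
#include <stdio.h>

/* MCDRAM capacity per node, from the slide (treated here as 16 GiB). */
#define MCDRAM_BYTES (16ULL * 1024ULL * 1024ULL * 1024ULL)

/* Returns 1 if all ranks on a node fit into MCDRAM together, else 0. */
int fits_in_mcdram(unsigned ranks_per_node, unsigned long long bytes_per_rank)
{
    assert(ranks_per_node > 0);
    unsigned long long per_node = ranks_per_node * bytes_per_rank;
    return per_node <= MCDRAM_BYTES ? 1 : 0;
}

/* Print the numactl option suggested by the footprint check. */
void print_numactl_option(unsigned ranks_per_node, unsigned long long bytes_per_rank)
{
    if (fits_in_mcdram(ranks_per_node, bytes_per_rank)) {
        printf("numactl --membind=1\n");    /* everything in MCDRAM */
    } else {
        printf("numactl --preferred=1\n");  /* MCDRAM first, overflow to DDR */
    }
}
```

For the `aprun -N 64` example above, 64 ranks at 128 MiB each need 8 GiB per node and fit, while 64 ranks at 512 MiB each need 32 GiB and do not.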


  • Intel Memory Allocation Examples


    #include <stdlib.h>
    float *fv;
    fv = (float *) malloc (sizeof(float) * 1000);

    Allocate from DDR

    #include <hbwmalloc.h>
    float *fv;
    fv = (float *) hbw_malloc (sizeof(float) * 1000);

    Allocate from MCDRAM
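A slightly fuller sketch of the `hbw_malloc` pattern is given below. It wraps the allocation so that the same code also builds on systems without the memkind library: the `USE_MEMKIND` macro, the `fast_buf` type, and the `fast_alloc`/`fast_free` helpers are hypothetical names introduced for this example. With `USE_MEMKIND` defined, the code calls `hbw_check_available`, `hbw_malloc`, and `hbw_free` from `hbwmalloc.h` and needs to be linked against memkind.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef USE_MEMKIND          /* hypothetical build switch for this sketch */
#include <hbwmalloc.h>
#endif

/* A float buffer that remembers which allocator produced it. */
typedef struct {
    float *data;
    int in_hbw;             /* 1 if allocated with hbw_malloc */
} fast_buf;

/* Try MCDRAM first; fall back to ordinary malloc (DDR) otherwise. */
fast_buf fast_alloc(size_t n)
{
    fast_buf b = { NULL, 0 };
    assert(n > 0);
#ifdef USE_MEMKIND
    if (hbw_check_available() == 0) {       /* 0 means high-bandwidth memory exists */
        b.data = (float *) hbw_malloc(n * sizeof(float));
        b.in_hbw = (b.data != NULL);
    }
#endif
    if (b.data == NULL) {
        b.data = (float *) malloc(n * sizeof(float));
    }
    return b;
}

/* Release the buffer with the allocator that created it. */
void fast_free(fast_buf b)
{
#ifdef USE_MEMKIND
    if (b.in_hbw) {
        hbw_free(b.data);
        return;
    }
#endif
    free(b.data);
}
```

Recording the allocator in the buffer matters because memory from `hbw_malloc` must be released with `hbw_free`, not `free`.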

    c     Declare arrays to be dynamic
          REAL, ALLOCATABLE :: A(:), B(:), C(:)
    !DIR$ ATTRIBUTES FASTMEM :: A
          NSIZE=1024
    c
    c     allocate array A from MCDRAM
    c
          ALLOCATE (A(NSIZE))
    c
    c     Allocate arrays that will come from DDR
    c
          ALLOCATE (B(NSIZE), C(NSIZE))

    Allocate arrays from MCDRAM & DDR in Intel Fortran


  • CCE Memory Allocation Examples


    #pragma memory(bandwidth)
    float *fv = (float *) malloc (sizeof(float) * 1000);

    Allocate from MCDRAM in CCE C

    c     Declare arrays to be dynamic
          REAL, ALLOCATABLE :: A(:), B(:), C(:)
          NSIZE=1024
    c
    c     allocate array A from MCDRAM
    c
    !DIR$ MEMORY(BANDWIDTH)
          ALLOCATE (A(NSIZE))
    c
    c     Allocate arrays that will come from DDR
    c
          ALLOCATE (B(NSIZE), C(NSIZE))

    Allocate arrays from MCDRAM & DDR in CCE Fortran

    #pragma memory(bandwidth)
    float *fv = new float[1000];

    Allocate from MCDRAM in CCE C++


  • KNL node : What to be aware of

    Memory layout and usage
        How do I use the MCDRAM
        Which arrays are located where
            Next release of Reveal will help with identifying arrays to put into fast memory

    CPU performance
        Longer vectors
        Avoid slower serial core performance

    Number of (Hyper-)Cores available (

  • CrayPAT

  • Performance Counters on KNL

    Early access version of PAPI available Cray internally

    KNL has 2 general purpose counters

    Without multiplexing, which can distort counter data, it is difficult to define groups equivalent to those available on Xeon

    Welcome input on favorite CPU counter events

    Uncore counters not yet available
        Must run at higher privilege level (via perf_event_paranoid file)
        Coming in official PAPI soon
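One way to avoid multiplexing with only 2 general purpose counters is to split the events of interest into groups of two and collect one group per run. The C sketch below shows that bookkeeping only. The helper names are hypothetical, the event strings passed in are placeholders rather than real KNL event names, and the sketch does not call PAPI itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Number of general purpose counters on KNL, from the slide. */
#define KNL_GP_COUNTERS 2

/* Runs needed to collect n_events without multiplexing (ceiling division). */
size_t groups_needed(size_t n_events, size_t counters)
{
    assert(counters > 0);
    return (n_events + counters - 1) / counters;
}

/* Print which events would be collected together in each run. */
void print_groups(const char *const events[], size_t n_events, size_t counters)
{
    assert(counters > 0);
    for (size_t i = 0; i < n_events; ++i) {
        if (i % counters == 0) {
            printf("%srun %zu:", i == 0 ? "" : "\n", i / counters + 1);
        }
        printf(" %s", events[i]);
    }
    if (n_events > 0) {
        printf("\n");
    }
}
```

With five events and `KNL_GP_COUNTERS` counters this gives three runs, which is the price of undistorted counts: more runs of the same experiment instead of time-sliced counters within one run.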


  • MCDRAM Configuration Information

    CrayPat/X:  Version 6.4.2.36 Revision 8374f24  08/08/16 14:59:22
    Experiment:                  lite  lite/sample_profile
    Number of PEs (MPI ranks):   2,048
    Numbers of PEs per Node:        32  PEs on each of  64  Nodes
    Numbers of Threads per PE:       1
    Number of Cores per Socket:    512 PEs on sockets with 34 Cores
                                 1,536 PEs on sockets with 68 Cores
    MCDRAM: 7.2 GHz, 16 GiB available as snc2, cache (100% cache) for 512 PEs
    MCDRAM: 7.2 GHz, 16 GiB available as quad, cache (100% cache) for 1536 PEs
    Avg Process Time:     3,251 secs
    High Memory:    9,651,837.1 MBytes    4,712.8 MBytes per PE
    I/O Read Rate:    23.645208 MBytes/sec
    I/O Write Rate:    6.836539 MBytes/sec
    Avg CPU Energy:  43,899,476 joules    685,929 joules per node
    Avg CPU Power:       13,504 watts      211.00 watts per node


    Number of PEs in job

    PEs in job split this way

    Different NUMA node usages in a single job
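The per-PE and per-node columns of this report follow directly from the job totals, which is a useful sanity check when reading CrayPat output. The C sketch below redoes that arithmetic with the numbers shown above; `per_unit` and `print_derived` are hypothetical helper names, and the power figure assumes that average power is simply energy divided by the average process time.

```c
#include <assert.h>
#include <stdio.h>

/* Divide a job-wide total evenly over PEs, nodes, or seconds. */
double per_unit(double total, double units)
{
    assert(units > 0.0);
    return total / units;
}

/* Recompute the derived columns from the totals in the report. */
void print_derived(void)
{
    const double pes = 2048.0, nodes = 64.0, pes_per_node = 32.0;
    const double high_mem_mb = 9651837.1;   /* High Memory, MBytes */
    const double energy_j = 43899476.0;     /* Avg CPU Energy, joules */
    const double time_s = 3251.0;           /* Avg Process Time, secs */

    double mb_per_pe = per_unit(high_mem_mb, pes);
    double j_per_node = per_unit(energy_j, nodes);

    printf("memory per PE:    %.1f MBytes\n", mb_per_pe);
    printf("memory per node:  %.1f MBytes\n", mb_per_pe * pes_per_node);
    printf("energy per node:  %.0f joules\n", j_per_node);
    printf("power per node:   %.2f watts\n", per_unit(j_per_node, time_s));
}
```

The per-node memory that results is roughly nine times the 16 GiB of MCDRAM, so this job could not have been bound to MCDRAM in flat mode, which is consistent with the report showing 100% cache.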
