Buddy Bland: Titan SC12


  • Slide 1/12

    Office of Science

    Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

    Oak Ridge Leadership Computing Facility

    November 2012

  • Slide 2/12

    ORNL's Titan Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors

    SYSTEM SPECIFICATIONS:

    • Peak performance of 27.1 PF (24.5 PF GPU + 2.6 PF CPU)
    • 18,688 Compute Nodes, each with a 16-core AMD Opteron CPU and an NVIDIA Tesla K20X GPU
    • 512 Service and I/O nodes
    • 200 Cabinets
    • 710 TB total system memory
    • Cray Gemini 3D Torus Interconnect
    • 8.9 MW peak power, 8.3 MW avg.
    • 4,352 ft² (404 m²)

  • Slide 3/12

    The x86 processor provides fast, single-thread performance for control & communications.

    AMD Opteron 6274:

    • 16 cores
    • 141 GFLOPs peak

  • Slide 4/12

    GPUs are designed for extreme parallelism, performance & power efficiency.

    NVIDIA Tesla K20X:

    • 14 Streaming Multiprocessors (SMX)
    • 2,688 CUDA cores
    • 1.31 TFLOPs peak (double precision)
    • 6 GB GDDR5 memory
    • HPL: >2.0 GFLOPS/W (Titan full-system measurement)

  • Slide 5/12

  • Slide 6/12

    Titan: Cray XK7 System

    • Compute Node: 1.45 TF, 38 GB
    • Board: 4 Compute Nodes, 5.8 TF, 152 GB
    • Cabinet: 24 Boards (96 Nodes), 139 TF, 3.6 TB

    (These levels multiply out consistently; see the check below.)
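    As a quick consistency check (a worked calculation added here, using the per-node CPU/GPU peaks from the adjacent slides and Titan's 32 GB DDR3 + 6 GB GDDR5 node memory split):

    ```latex
    % Per-node peak and memory: one Opteron (141 GF, 32 GB) + one K20X (1.31 TF, 6 GB)
    0.141 + 1.31 \approx 1.45\,\text{TF}, \qquad 32 + 6 = 38\,\text{GB}
    % Board (4 nodes) and cabinet (24 boards = 96 nodes)
    4 \times 1.45 = 5.8\,\text{TF}, \qquad 96 \times 1.45 \approx 139\,\text{TF}, \qquad 96 \times 38\,\text{GB} \approx 3.6\,\text{TB}
    % Full system (18,688 compute nodes)
    18688 \times 1.45\,\text{TF} \approx 27.1\,\text{PF}, \qquad 18688 \times 38\,\text{GB} \approx 710\,\text{TB}
    ```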

  • Slide 7/12

    Why GPUs? High performance and power efficiency on a path to exascale.

    • Hierarchical parallelism improves scalability of applications: expose more parallelism through code refactoring and source code directives.
    • Heterogeneous multi-core processor architecture: use the right processor for each task.
    • Data locality: keep the data near the processing. The GPU has high bandwidth to local memory for rapid access, and a large internal cache.
    • Explicit data management: explicitly manage data movement between CPU and GPU memories (a short sketch follows this list).
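    To make the last two bullets concrete, here is a minimal OpenACC sketch in C (illustrative, not from the talk): the `acc data` region keeps the arrays resident in GPU memory across two kernels, so the programmer decides exactly when host-device transfers happen. The function, array names, and size `N` are invented for the example.

    ```c
    #define N 1000000   /* illustrative problem size */

    /* Explicit data management: a and b cross the CPU-GPU link once each,
     * not once per kernel, because the data region keeps them on the GPU. */
    void scale_then_add(const float *restrict a, float *restrict b, float s)
    {
        #pragma acc data copyin(a[0:N]) copyout(b[0:N])
        {
            #pragma acc parallel loop
            for (int i = 0; i < N; i++)
                b[i] = s * a[i];      /* kernel 1: results stay in GPU memory */

            #pragma acc parallel loop
            for (int i = 0; i < N; i++)
                b[i] += a[i];         /* kernel 2: no intervening transfer */
        }
    }
    ```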

  • Slide 8/12

    Hybrid Programming Model

    • On Jaguar, with 299,008 cores, we were seeing the limits of a single level of MPI scaling for most applications.
    • To take advantage of the vastly larger parallelism in Titan, users need to use hierarchical parallelism in their codes (a combined skeleton follows this list):
      - Distributed memory: MPI, SHMEM, PGAS
      - Node local: OpenMP, Pthreads, local MPI communicators
      - Within threads: vector constructs on the GPU, libraries, OpenACC
    • These are the same types of constructs needed on multi-PFLOPS computers to scale to the full size of the systems!
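    A skeleton of what the three levels look like together in C (an illustrative sketch, not code from the talk; the chunk sizes, array names, and kernel body are invented): MPI for distributed memory, OpenMP across the cores of a node, and an OpenACC vector loop within each thread.

    ```c
    #include <stdio.h>
    #include <mpi.h>

    #define NCHUNK 4        /* illustrative: chunks of work owned by each rank */
    #define CHUNK  100000   /* illustrative: elements per chunk */

    static float u[NCHUNK][CHUNK], unew[NCHUNK][CHUNK];

    /* Level 3 (within threads): a vector loop the compiler offloads to the GPU. */
    static void step(const float *restrict in, float *restrict out)
    {
        #pragma acc parallel loop copyin(in[0:CHUNK]) copyout(out[0:CHUNK])
        for (int i = 0; i < CHUNK; i++)
            out[i] = 0.5f * in[i];   /* stand-in for the real stencil/kernel */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);            /* Level 1: distributed memory */
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel for           /* Level 2: node-local threads */
        for (int c = 0; c < NCHUNK; c++)
            step(u[c], unew[c]);

        printf("rank %d finished %d chunks\n", rank, NCHUNK);
        MPI_Finalize();
        return 0;
    }
    ```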

  • Slide 9/12

    How do you program these nodes?

    Compilers:

    • OpenACC is a set of compiler directives that allows the user to express hierarchical parallelism in the source code, so that the compiler can generate parallel code for the target: GPU, MIC, or vector SIMD on a CPU. (A minimal example follows this slide.)
    • The Cray compiler supports XK7 nodes and is OpenACC compatible.
    • The CAPS HMPP compiler supports C, C++, and Fortran compilation for heterogeneous nodes, with OpenACC support.
    • The PGI compiler supports OpenACC and CUDA Fortran.

    Tools:

    • The Allinea DDT debugger scales to full system size and, with ORNL support, will debug heterogeneous (x86/GPU) apps.
    • ORNL has worked with the Vampir team at TU Dresden to add support for profiling heterogeneous nodes.
    • CrayPAT and Cray Apprentice support XK6 programming.
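    For a feel of the directive style these compilers accept, a minimal standard OpenACC example (a sketch, not from the talk): a single pragma on a SAXPY loop is enough for the compiler to generate the GPU kernel, the data transfers, and the launch. Built without an accelerator flag, the pragma is ignored and the same loop runs on the CPU.

    ```c
    /* saxpy.c -- build with an OpenACC compiler, e.g. "pgcc -acc saxpy.c"
     * (flags vary by vendor); without the flag, the pragma is ignored. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        /* The directive expresses the parallelism; the compiler produces
         * the kernel, the host-device copies, and the kernel launch. */
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
    ```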

  • Slide 10/12

    Early Science Applications on Titan

    • Material Science (WL-LSMS): role of material disorder, statistics, and fluctuations in nanoscale materials and systems.
    • Combustion (S3D): combustion simulations to enable the next generation of diesel/bio-fuels to burn more efficiently.
    • Climate Change (CAM-SE): answer questions about specific climate change adaptation and mitigation scenarios; realistically represent features like precipitation patterns/statistics and tropical storms.
    • Nuclear Energy (Denovo): unprecedented high-fidelity radiation transport calculations that can be used in nuclear energy applications.
    • Biofuels (LAMMPS): a multiple-capability molecular dynamics code.
    • Astrophysics (NRDF): radiation transport, critical to astrophysics, laser fusion, combustion, atmospheric dynamics, and medical imaging.

  • Slide 11/12

    How Effective are GPUs on Scalable Applications?

    OLCF-3 Early Science codes; very early performance measurements on Titan, XK7 (w/ K20X) vs. XE6:

    • Cray XK7: K20X GPU plus AMD Opteron 6274
    • Cray XE6: dual AMD Opteron 6274
    • Cray XK6 w/o GPU: single AMD Opteron 6274

    Application  | Performance Ratio      | Comments
    S3D          | 1.8                    | Turbulent combustion; 6% of Jaguar workload
    Denovo sweep | 3.8                    | Sweep kernel of 3D neutron transport; 2% of Jaguar workload
    LAMMPS       | 7.4* (mixed precision) | High-performance molecular dynamics; 1% of Jaguar workload
    WL-LSMS      | 3.8                    | Statistical mechanics of magnetic materials; 2% of Jaguar workload; 2009 Gordon Bell winner
    CAM-SE       | 1.8* (estimate)        | Community atmosphere model; 1% of Jaguar workload

  • Slide 12/12

    Questions?

    [email protected]

    The research and activities described in this presentation were performed using the resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

    Want to join our team? ORNL is hiring. See http://jobs.ornl.gov