
Page 1:

A Quick HPC Perspective on Accelerators

Vincent Betro Computational Scientist

University of Tennessee National Institute for Computational Sciences

Copyright 2013

Page 2:

Things started looking harder around 2004…

Source: published SPECInt data

Page 3:

Moore’s Law is not at all dead…

Intel process technology capabilities:

High Volume Manufacturing                        2004  2006  2008  2010  2012  2014  2016  2018
Feature Size                                     90nm  65nm  45nm  32nm  22nm  16nm  11nm  8nm
Integration Capacity (Billions of Transistors)   2     4     8     16    32    64    128   256

[Images: a 50nm transistor from the 90nm process (Source: Intel); an influenza virus (Source: CDC)]

Page 4:

The problem in 2004… didn't get better.

[Figure: power density (watts per square cm) vs. year, 1975-2020, for historical single-core parts, historical multi-core parts, and ITRS high-performance projections, with reference lines for a 100W lightbulb, a hot plate, and a nuclear reactor]

[Figure: node processor clock and accelerator core clock (MHz, log scale from 10 to 10,000), high/low/median, 1992-2016]

Source: Kogge and Shalf, IEEE CISE. Courtesy Horst Simon, LBNL.

Page 5:

Not a new problem, just a new scale…

[Figure: CPU power (W) over time; photo of the Cray-2 with its cooling tower in the foreground, circa 1985]

Page 6:

How do we get the same number of transistors to give us more performance without cranking up power?

[Diagram: one big core plus cache vs. four small cores (C1-C4) plus cache, comparing per-core power and performance]

For a small core at one quarter the area of a big core:

  Power = 1/4
  Performance = 1/2

Many core is more power efficient, because:

  Power ~ area
  Single thread performance ~ area^0.5
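A quick back-of-the-envelope check of those two scaling rules (a sketch; the exponents are the slide's rule-of-thumb values, essentially Pollack's Rule) shows where the 1/4 and 1/2 come from, and why four small cores beat one big one at the same power:

```latex
% Slide's scaling rules: P \propto A (power ~ area) and
% \mathrm{perf} \propto \sqrt{A} (single-thread performance ~ area^0.5).
% Normalize the big core to area 1, power 1, performance 1.
\begin{align*}
  \text{one small core } (A = \tfrac{1}{4}):\quad
    P &= \tfrac{1}{4}, & \mathrm{perf} &= \sqrt{\tfrac{1}{4}} = \tfrac{1}{2} \\
  \text{four small cores:}\quad
    P &= 4 \cdot \tfrac{1}{4} = 1, & \mathrm{perf} &= 4 \cdot \tfrac{1}{2} = 2
\end{align*}
% Same power budget as the big core, but twice the aggregate throughput,
% provided the workload actually parallelizes across all four cores.
```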

Page 7:

Example: Dual core with voltage scaling

                SINGLE CORE    DUAL CORE
Area                 1             2
Voltage              1             0.85
Frequency            1             0.85
Power                1             1
Performance          1            ~1.8

RULE OF THUMB: a 15% reduction in voltage yields a 15% frequency reduction, a 45% power reduction, and a 10% performance reduction.
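Where the 45% comes from: dynamic CMOS power scales as P ~ C V^2 f. A minimal worked version of the rule of thumb (the slide rounds the numbers):

```latex
% Dynamic power of CMOS logic: P \propto C V^2 f, where C is switched
% capacitance, V is supply voltage, and f is clock frequency. Cutting
% voltage by 15% forces frequency down roughly in step, also ~15%:
\[
  \frac{P'}{P} = \frac{C\,(0.85V)^2\,(0.85f)}{C\,V^2 f} = 0.85^3 \approx 0.61
\]
% That is a power reduction of roughly 40%, the ballpark of the slide's
% 45% rule of thumb. Two such cores then fit in about the original power
% budget while delivering ~1.8x throughput (2 x 0.9) on parallel work.
```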

Page 8:

GPU Architecture – Kepler: Streaming Multiprocessor (SMX)

• 192 SP CUDA cores per SMX
  - 192 fp32 ops/clock
  - 192 int32 ops/clock
• 64 DP CUDA cores per SMX
  - 64 fp64 ops/clock
• 4 warp schedulers
  - up to 2048 threads resident concurrently
• 32 special-function units
• 64 KB shared memory + L1 cache
• 48 KB read-only data cache
• 64K 32-bit registers
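To connect those numbers to code, here is a minimal CUDA sketch (an illustrative SAXPY, not taken from the slides): each 256-thread block runs as 8 warps of 32 threads, and the SMX's 4 warp schedulers interleave its up-to-2048 resident threads to hide memory latency.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative SAXPY kernel: y = a*x + y. One element per thread; the
// hardware groups every 32 consecutive threads of a block into a warp.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    // Unified (managed) memory keeps this sketch short; needs CUDA 6+.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // 256 threads per block = 8 warps; enough blocks to cover n elements.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f (expect 4.0)\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```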

Page 9:

Intel’s MIC Approach

Since the days of RISC vs. CISC, Intel has mastered the art of figuring out what is important about a new processing technology and asking, "why can't we do this in x86?" The Intel Many Integrated Core (MIC) architecture is about a large die, simpler circuits, and much more parallelism, all within the x86 line.

Courtesy Dan Stanzione, TACC

Page 10:

MIC Architecture

• Many cores on the die
• L1 and L2 cache
• Bidirectional ring network for L2, memory, and PCIe connection

Courtesy Dan Stanzione, TACC

Page 11:

What makes an Accelerator?

[Diagram: a CPU with its own memory and an accelerator (GPU or Intel Xeon Phi) with its own memory, connected by the PCI bus]
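The practical upshot of that diagram: the accelerator's memory is separate from the CPU's, so data must be staged across the PCI bus explicitly. A minimal CUDA sketch of the pattern (illustrative; the Xeon Phi uses its own offload mechanisms, but the data movement is analogous):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Trivial kernel: scale an array in device (GPU) memory in place.
__global__ void scale(int n, float a, float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);                // CPU memory
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);                            // separate GPU memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // across the PCI bus

    scale<<<(n + 255) / 256, 256>>>(n, 3.0f, d);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // and back again
    printf("h[0] = %f (expect 3.0)\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;
}
```

The two cudaMemcpy calls are the price of the architecture: PCIe bandwidth is far lower than on-device memory bandwidth, so real codes keep data resident on the accelerator for as long as possible.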

Page 12:

Top 10 Systems in November 2012 (Just got updated last month)

 #  Site                                                Manufacturer  Computer                                                        Country  Cores      Rmax [Pflops]  Power [MW]
 1  Oak Ridge National Laboratory                       Cray          Titan, Cray XK7, Opteron 16C 2.2GHz, Gemini, NVIDIA K20x        USA      560,640    17.6           8.21
 2  Lawrence Livermore National Laboratory              IBM           Sequoia, BlueGene/Q, Power BQC 16C 1.6GHz, Custom               USA      1,572,864  16.3           7.89
 3  RIKEN Advanced Institute for Computational Science  Fujitsu       K Computer, SPARC64 VIIIfx 2.0GHz, Tofu Interconnect            Japan    795,024    10.5           12.66
 4  Argonne National Laboratory                         IBM           Mira, BlueGene/Q, Power BQC 16C 1.6GHz, Custom                  USA      786,432    8.16           3.95
 5  Forschungszentrum Juelich (FZJ)                     IBM           JuQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz, Custom               Germany  393,216    4.14           1.97
 6  Leibniz Rechenzentrum                               IBM           SuperMUC, iDataPlex DX360M4, Xeon E5 8C 2.7GHz, Infiniband FDR  Germany  147,456    2.90           3.52
 7  Texas Advanced Computing Center/UT                  Dell          Stampede, PowerEdge C8220, Xeon E5 8C 2.7GHz, Intel Xeon Phi    USA      204,900    2.66
 8  National SuperComputer Center in Tianjin            NUDT          Tianhe-1A, NUDT TH MPP, Xeon 6C, NVIDIA, FT-1000 8C             China    186,368    2.57           4.04
 9  CINECA                                              IBM           Fermi, BlueGene/Q, Power BQC 16C 1.6GHz, Custom                 Italy    163,840    1.73           0.82
10  IBM                                                 IBM           DARPA Trial Subset, Power 775, Power7 8C 3.84GHz, Custom        USA      63,360     1.52           3.57

Page 13:

Titan System at Oak Ridge National Laboratory (just displaced by a Chinese system using Xeon Phi)

• Upgrade of Jaguar from Cray XT5 to XK6
• Cray Linux Environment operating system
• Gemini interconnect
  - 3-D torus
  - Globally addressable memory
  - Advanced synchronization features
• AMD Opteron 6274 processors (Interlagos)
• New accelerated node design using NVIDIA multi-core accelerators
  - 2011: 960 NVIDIA X2090 "Fermi" GPUs
  - 2012: 14,592 NVIDIA "Kepler" GPUs
• 20+ PFlops peak system performance
• 600 TB DDR3 memory + 88 TB GDDR5 memory

Page 14:

Power Efficiency over Time

[Figure: Linpack/Power (Gflops/kW) vs. year, 2008-2012, for the TOP10, TOP50, and TOP500, with annotations marking accelerator and BlueGene (BG) systems vs. multicore systems]

Data from: TOP500 November 2012. Courtesy Horst Simon, LBNL.

Page 15:

Projected Performance Development

[Figure: TOP500 projected performance development, 1994-2020, on a log scale from 100 Mflop/s to 1 Eflop/s, with SUM, N=1, and N=500 trend lines]

Data from: TOP500 November 2012. Courtesy Horst Simon, LBNL.

Page 16:

Trends with ends.

[Figure: compound annual growth rate (CAGR), 1996-2012, of Rmax (Gflop/s), total cores, average cycles/sec per core (MHz), and memory per core (GB)]

Source: Kogge and Shalf, IEEE CISE. Courtesy Horst Simon, LBNL.

Page 17:

Summary

Heterogeneous computing is an important piece of the computing future. GPU accelerators are here to stay. Now, let’s see how you can use them…