Supercomputers Special Course of Computer Architecture H.Amano

Page 1: Supercomputers Special Course of Computer Architecture H.Amano

Supercomputers

Special Course of Computer Architecture

H.Amano

Page 2: Supercomputers Special Course of Computer Architecture H.Amano

Contents

• What are supercomputers?

• Architecture of Supercomputers

• Representative supercomputers

• Exa-Scale supercomputer project

Page 3: Supercomputers Special Course of Computer Architecture H.Amano

Defining Supercomputers

• High-performance computers mainly for scientific computation.
– Huge amounts of computation for biochemistry, physics, astronomy, meteorology, etc.
– Very expensive: developed and managed with national funds.
– High-level techniques are required to develop and manage them.
– The USA, Japan and China compete for the top 1 supercomputer.
– Large amounts of national funds are used, so supercomputers tend to become political news.
→ In Japan, the supercomputer project became a target of the budget review in Dec. 2009.

"K" achieved 10 PFLOPS and became the top 1 last year, but Sequoia took the place back last month.

Page 4: Supercomputers Special Course of Computer Architecture H.Amano

FLOPS

• Floating-point operations per second

• Floating-point number
– (mantissa) × 2^(index), where the index is the exponent
– Double precision: 64 bits; single precision: 32 bits.
– The IEEE standard defines the format and rounding.

Field widths in bits:
         sign  index  mantissa
Single   1     8      23
Double   1     11     52
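The field layout on this slide can be checked with a few lines of Python. This is a minimal sketch using only the standard `struct` module; the helper name `split_fields` is made up for illustration.

```python
import struct

def split_fields(value, double=True):
    """Return the (sign, index, mantissa) bit fields of an IEEE 754 number."""
    if double:
        # reinterpret the 64-bit double as an unsigned 64-bit integer
        bits = struct.unpack(">Q", struct.pack(">d", value))[0]
        index_bits, mantissa_bits = 11, 52
    else:
        # reinterpret the 32-bit float as an unsigned 32-bit integer
        bits = struct.unpack(">I", struct.pack(">f", value))[0]
        index_bits, mantissa_bits = 8, 23
    sign = bits >> (index_bits + mantissa_bits)
    index = (bits >> mantissa_bits) & ((1 << index_bits) - 1)
    mantissa = bits & ((1 << mantissa_bits) - 1)
    return sign, index, mantissa

print(split_fields(1.0, double=False))  # (0, 127, 0)
print(split_fields(-2.5, double=True))
```

The stored index is biased (127 for single, 1023 for double), which is why 1.0 shows an index of 127 rather than 0.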

Page 5: Supercomputers Special Course of Computer Architecture H.Amano

The range of performance

10^6    M (Mega)   one million (100 万)
10^9    G (Giga)   one billion (10 億)
10^12   T (Tera)   one trillion (1 兆)
10^15   P (Peta)   1000 兆
10^18   E (Exa)    100 京

10 PFLOPS = 10^16 operations per second, which is 1 kei (京) in Japanese.
→ The name "K" comes from it.

iPhone 4S: 140 MFLOPS
High-end PC: 50-80 GFLOPS
Powerful GPU: Tera-FLOPS
Supercomputers: 10 TFLOPS - 16 PFLOPS

Growth rate: 1.9 times/year
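As a rough illustration of what a growth rate of 1.9 times per year implies, the sketch below estimates how long it would take to go from 16 PFLOPS to 1 EFLOPS. It assumes the rate stays constant, which the slide does not claim; the function name is hypothetical.

```python
import math

def years_to_reach(start_flops, target_flops, growth_per_year=1.9):
    """Years needed to grow from start to target at a fixed yearly factor."""
    return math.log(target_flops / start_flops) / math.log(growth_per_year)

PETA, EXA = 1e15, 1e18
print(round(years_to_reach(16 * PETA, 1 * EXA), 1))
```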

Page 6: Supercomputers Special Course of Computer Architecture H.Amano

How is the top 1 selected?

• Top500/Green500: performance when executing Linpack
– Linpack is a kernel for matrix computation.
– Scale-free
– Performance-centric

• Gordon Bell Prize
– Peak performance, price/performance, special achievement

• HPC Challenge
– Global HPL: matrix computation (computation performance)
– Global RandomAccess: random memory access (communication performance)
– EP STREAM per system: heavily loaded memory access (memory performance)
– Global FFT: a complicated problem requiring both memory and communication performance

• Nov.: ACM/IEEE Supercomputing Conference
– Top500, Gordon Bell Prize, HPC Challenge, Green500

• Jun.: International Supercomputing Conference
– Top500, Green500

Page 7: Supercomputers Special Course of Computer Architecture H.Amano

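Top 5

[Figure: Rmax in PFLOPS (axis up to 10) of the Top500 top 5 for 2010.6, 2010.11, 2011.6 and 2011.11. Machines shown: K (Japan), Tianhe (天河, China), Jaguar (USA), Nebulae (China), Tsubame (Japan), Roadrunner (USA), Kraken (USA), Jugene (Germany). Annotation: Sequoia (USA), 16 PFLOPS.]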

Page 8: Supercomputers Special Course of Computer Architecture H.Amano

From SACSIS2012 Invited Speech.

Page 9: Supercomputers Special Course of Computer Architecture H.Amano

Top500, November 2011

Name | Development | Hardware | Cores | Performance in TFLOPS (peak in parentheses) | Power (kW)
K (京) (Japan) | RIKEN AICS | SPARC64 VIIIfx 2.0GHz, Tofu Interconnect, Fujitsu | 705024 | 10510 (11280) | 12659.9
Tianhe-1A (天河) (China) | National Supercomputer Center in Tianjin | NUDT YH MPP, Xeon X5670 6C 2.93GHz, NVIDIA 2050, NUDT | 186368 | 2566 (4701) | 4040
Jaguar (USA) | DOE/SC/Oak Ridge National Lab. | Cray XT5-HE Opteron 6-Core 2.6GHz, Cray Inc. | 224162 | 1759 (2331) | 6950
Nebulae (China) | National Supercomputing Centre in Shenzhen | Dawning TC3600 Blade, Xeon X5650 6C 2.66GHz, Infiniband QDR, NVIDIA 2050, Dawning | 120640 | 1271 (2974) | 2580
TSUBAME2.0 (Japan) | GSIC, Tokyo Inst. of Technology | HP ProLiant SL390s G7 Xeon 6C X5670, NVIDIA GPU, NEC/HP | 73238 | 1192 (2287) | 1398.6
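The table can be turned into two derived figures: Linpack efficiency (measured divided by peak performance) and energy efficiency in MFLOPS/W. The sketch below copies the numbers from the table and assumes the value in parentheses is the peak performance; the variable and function names are illustrative only.

```python
# (name, cores, Linpack TFLOPS, peak TFLOPS, power in kW), copied from the table.
# Assumption: the parenthesized table value is the theoretical peak.
MACHINES = [
    ("K", 705024, 10510.0, 11280.0, 12659.9),
    ("Tianhe-1A", 186368, 2566.0, 4701.0, 4040.0),
    ("Jaguar", 224162, 1759.0, 2331.0, 6950.0),
    ("Nebulae", 120640, 1271.0, 2974.0, 2580.0),
    ("TSUBAME2.0", 73238, 1192.0, 2287.0, 1398.6),
]

def linpack_efficiency(linpack_tflops, peak_tflops):
    """Fraction of the peak performance actually reached on Linpack."""
    return linpack_tflops / peak_tflops

def mflops_per_watt(linpack_tflops, power_kw):
    """Energy efficiency; 1 TFLOPS = 1e6 MFLOPS and 1 kW = 1e3 W."""
    return linpack_tflops * 1e6 / (power_kw * 1e3)

for name, cores, linpack, peak, kw in MACHINES:
    print(f"{name:12s} efficiency={linpack_efficiency(linpack, peak):.2f} "
          f"MFLOPS/W={mflops_per_watt(linpack, kw):.0f}")
```

The homogeneous machines (K, Jaguar) come out with a higher Linpack efficiency than the GPU-based ones, while TSUBAME2.0 has the best MFLOPS/W of the five.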

Page 10: Supercomputers Special Course of Computer Architecture H.Amano

Green500, November 2011

Rank | Machine | Place | MFLOPS/W | Total kW
1 | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | IBM - Rochester | 2026.48 | 85.12
2 - 5 | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom; BlueGene/Q Prototype | IBM - Thomas J. Watson Research Center / Rochester | 1689.86 - 2026.48 |
6 | DEGIMA Cluster, Intel i5, ATI Radeon GPU, Infiniband QDR | Nagasaki Univ. | 1378.32 | 47.05
7 | Bullx B505, Xeon E5649 6C 2.53GHz, Infiniband QDR, NVIDIA 2090 | Barcelona Supercomputing Center | 1266.26 | 81.50
8 | Curie Hybrid Nodes - Bullx B505, Nvidia M2090, Xeon E5640 2.67 GHz, Infiniband QDR | TGCC / GENCI | 1010.11 | 108.80

10th place: Tsubame-2.0 (Tokyo Inst. of Technology)

IBM BlueGene/Q took places 1-5.

Page 11: Supercomputers Special Course of Computer Architecture H.Amano

Why Top 1?

• Top 1 of the Top500 is just a measure of matrix computation.
• There are also the top 1 of the Green500, the Gordon Bell Prize, and the top 1 of each HPC Challenge program.

→ All of these machines are valuable.

TV and newspapers focus too much on the Top500.
• However, most top 1 computers also won the Gordon Bell Prize and the HPC Challenge top 1.
– K and Sequoia

• The impact of being top 1 is great!

Page 12: Supercomputers Special Course of Computer Architecture H.Amano

Why are supercomputers so fast?
× Because they use a high-frequency clock

[Figure: clock frequency of high-end PCs from 1992 to 2008, on an axis marked 100 MHz and 1 GHz, growing at 40%/year. Points shown: Alpha 21064 150 MHz, Pentium 4 3.2 GHz, Nehalem 3.3 GHz, K 2 GHz, Sequoia 1.6 GHz.]

The speed-up of the clock saturated in 2003 (power and heat dissipation).

The clock frequencies of K and Sequoia are lower than those of common PCs.

Page 13: Supercomputers Special Course of Computer Architecture H.Amano

Three major methods of parallel processing in supercomputers

Supercomputer = massively parallel computer
– SIMD (Single Instruction Stream, Multiple Data Streams)
• Most accelerators

– Pipelined processing
• Vector computers

– MIMD (Multiple Instruction Streams, Multiple Data Streams)
• Homogeneous (vs. accelerators), scalar (vs. vector machines)

– Although all supercomputers use the three methods at various levels, each machine can be classified by the method it mainly relies on.

Key issues other than computational nodes:
Large high-bandwidth memory
Large disk
High-speed interconnection networks

Page 14: Supercomputers Special Course of Computer Architecture H.Amano

SIMD (Single Instruction Stream, Multiple Data Streams)

[Figure: an instruction memory sends one instruction to an array of processing units, each with its own data memory.]

• All processing units execute the same instruction.
• Low degree of flexibility
• Illiac-IV / MMX instructions / ClearSpeed / IMAP / GP-GPU (coarse grain)
• CM-2 (fine grain)
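The SIMD idea can be mimicked in plain Python: one instruction stream, and every processing unit applies the current instruction to its own data. This is only a behavioral sketch with made-up names (`simd_step`, `OPS`), not a model of any of the machines listed.

```python
import operator

# The tiny "instruction set" of this sketch (an assumption for illustration).
OPS = {"add": operator.add, "mul": operator.mul}

def simd_step(op, lanes_a, lanes_b):
    """One broadcast instruction: every processing unit applies `op` to its own data."""
    return [OPS[op](a, b) for a, b in zip(lanes_a, lanes_b)]

data_a = [1, 2, 3, 4]       # one element per processing unit
data_b = [10, 20, 30, 40]
program = ["mul", "add"]    # the single instruction stream

acc = data_a
for instruction in program:
    acc = simd_step(instruction, acc, data_b)
print(acc)  # [20, 60, 120, 200]
```

No unit can branch differently from the others, which is the "low degree of flexibility" mentioned on the slide.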

Page 15: Supercomputers Special Course of Computer Architecture H.Amano

GPGPU (General-Purpose computing on Graphics Processing Units)

– TSUBAME2.0 (Xeon + Tesla, Top500 2010/11, 4th)
– Tianhe-1 (天河一号) (Xeon + FireStream, Top500 2009/11, 5th)

* The development environment is shown in parentheses.

Page 16: Supercomputers Special Course of Computer Architecture H.Amano

GeForce GTX280: 240 cores

[Figure: block diagram of the GeForce GTX280. The host feeds an input assembler and a thread execution manager, which drive groups of thread processors, each group with PBSM blocks; load/store units connect them to the global memory.]

Page 17: Supercomputers Special Course of Computer Architecture H.Amano

GPU (NVIDIA's GTX580)

512 GPU cores (128 × 4)
768 KB L2 cache
40 nm CMOS, 550 mm^2

[Figure: chip layout with blocks labeled "128 Cores" arranged around the L2 cache.]

Page 18: Supercomputers Special Course of Computer Architecture H.Amano

Cell Broadband Engine

[Figure: block diagram of the Cell Broadband Engine. One PPE (PXU with L1 and L2 caches) and eight SPEs (each an SXU with LS and DMA) are connected, together with the MIC, BIF/IOIF0 and IOIF1, by 4 × 16 B data rings running at 1.6 GHz.]

Used in the IBM Roadrunner and the PS3: a common platform for supercomputers and games.

Page 19: Supercomputers Special Course of Computer Architecture H.Amano

Peak performance vs. Linpack performance

[Figure: bar chart in PFLOPS (axis up to 11) comparing peak and Linpack performance of K (Japan), Tianhe (天河, China), Jaguar (USA), Nebulae (China) and Tsubame (Japan). Machines are marked as homogeneous or as using GPUs.]

The difference is large in machines with accelerators.

The accelerator type is energy efficient.

Page 20: Supercomputers Special Course of Computer Architecture H.Amano

Pipeline processing

[Figure: a pipeline of stages 1 to 6.]

Each stage receives its input and sends its result every clock cycle.
N stages = N times the performance.
Data dependencies cause RAW hazards and degrade the performance.
If a large array is processed, many stages can work efficiently.
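The claim that N stages give close to N times the performance for large arrays follows from a simple cycle count. The sketch below assumes an ideal pipeline with no hazards or stalls; the function names are illustrative.

```python
def pipeline_cycles(n_items, n_stages):
    """Cycles to push n_items through n_stages when one item enters per cycle."""
    return n_stages + n_items - 1

def pipeline_speedup(n_items, n_stages):
    """Speedup over an unpipelined unit that needs n_stages cycles per item."""
    return (n_items * n_stages) / pipeline_cycles(n_items, n_stages)

for n in (1, 10, 1000):
    print(n, round(pipeline_speedup(n, 6), 2))
```

With 6 stages, a single item gains nothing, 10 items run 4 times faster, and 1000 items approach the limit of 6.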

Page 21: Supercomputers Special Course of Computer Architecture H.Amano

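Vector computers

[Figure: elements a0, a1, a2, ... and b0, b1, b2, ... wait in vector registers in front of a pipelined multiplier and adder that compute X[i] = A[i] * B[i] and Y = Y + X[i].]

The classic style of supercomputer since the Cray-1.
The Earth Simulator may be the last vector supercomputer.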

Page 22: Supercomputers Special Course of Computer Architecture H.Amano

Vector computers

[Figure, next animation step: a0 and b0 have entered the multiplier; a1, a2, ... and b1, b2, ... wait in the vector registers. X[i] = A[i] * B[i], Y = Y + X[i].]

Page 23: Supercomputers Special Course of Computer Architecture H.Amano

Vector computers

[Figure, next animation step: a1 and b1 follow a0 and b0 through the multiplier; a2, ... and b2, ... wait in the vector registers. X[i] = A[i] * B[i], Y = Y + X[i].]

Page 24: Supercomputers Special Course of Computer Architecture H.Amano

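Vector computers

[Figure, later animation step: a9, b9, a10 and b10 are inside the multiplier, the products x0 and x1 have reached the adder, and a11, ... and b11, ... wait in the vector registers. X[i] = A[i] * B[i], Y = Y + X[i].]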
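The animation on these slides can be replayed in software. The following sketch streams A[i] and B[i] through a pipelined multiplier that is chained to an adder. It assumes a 4-stage multiplier and models the adder as a single-cycle accumulator, both simplifications chosen for illustration.

```python
from collections import deque

def chained_dot_product(a, b, mul_stages=4):
    """Stream a[i], b[i] through a pipelined multiplier chained to an accumulator.

    Returns (Y, cycles). Assumption: the adder takes one cycle per product.
    """
    pipe = deque([None] * mul_stages)   # contents of the multiplier stages
    y = 0
    cycles = 0
    i = 0
    # run until every element has been issued and the pipeline has drained
    while i < len(a) or any(x is not None for x in pipe):
        if i < len(a):
            pipe.appendleft(a[i] * b[i])   # issue one new pair per cycle
            i += 1
        else:
            pipe.appendleft(None)          # nothing left to issue: insert a bubble
        x = pipe.pop()                     # value leaving the last multiplier stage
        if x is not None:
            y += x                         # Y = Y + X[i]
        cycles += 1
    return y, cycles

print(chained_dot_product([1, 2, 3], [4, 5, 6]))  # (32, 7)
```

In this model the cycle count is the vector length plus the multiplier depth, so for long vectors the cost per element approaches one cycle.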

Page 25: Supercomputers Special Course of Computer Architecture H.Amano

MIMD (Multiple-Instruction Streams / Multiple-Data Streams)

• Multiple processors (cores) can work independently.
– Synchronization mechanism
– Data communication: shared memory

• All supercomputers are MIMD with multiple cores.

• However, K and Sequoia (BlueGene/Q) are typical massively parallel MIMD machines.
– Homogeneous computers
– Scalar processors
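A small Python sketch of the MIMD ingredients named on this slide: workers that run different instruction streams, a synchronization mechanism (a lock), and communication through shared memory. Python threads are used only to show the structure; they are not a performance model, and all names are illustrative.

```python
import threading

shared = {"total": 0}        # data visible to all workers (shared memory)
lock = threading.Lock()      # synchronization mechanism

def worker_sum(chunk):
    """One instruction stream: sum a chunk, then update the shared total."""
    partial = sum(chunk)
    with lock:               # only one worker updates the shared value at a time
        shared["total"] += partial

def worker_count_even(chunk, out):
    """A different instruction stream working independently on its own task."""
    out.append(sum(1 for v in chunk if v % 2 == 0))

data = list(range(1, 101))
evens = []
threads = [
    threading.Thread(target=worker_sum, args=(data[:50],)),
    threading.Thread(target=worker_sum, args=(data[50:],)),
    threading.Thread(target=worker_count_even, args=(data, evens)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["total"], evens[0])  # 5050 50
```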

Page 26: Supercomputers Special Course of Computer Architecture H.Amano

MIMD (Multiple-Instruction Streams / Multiple-Data Streams)

[Figure: Node 0 to Node 3, processors which can work independently, connected through an interconnection network to a shared memory.]

Page 27: Supercomputers Special Course of Computer Architecture H.Amano

Multi-Core (Intel’s Nehalem-EX)

8 CPU cores
24 MB L3 cache
45 nm CMOS, 600 mm^2

[Figure: chip layout with CPU cores on both sides of the L3 cache.]

Page 28: Supercomputers Special Course of Computer Architecture H.Amano

Intel 80-Core Chip

Intel 80-core chip [Vangal,ISSCC’07]

Page 29: Supercomputers Special Course of Computer Architecture H.Amano

How to program them?

• Can common PC programs be accelerated on supercomputers?
– Yes, to a certain degree, by parallelizing compilers.

• However, to use many cores efficiently, specialists must optimize the programs.
– Message passing using MPI
– OpenMP
– OpenCL/CUDA → GPU accelerator type
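The tools listed above share one pattern: split the data, let each worker compute on its own block, then combine the partial results. The sketch below shows that pattern with Python's standard `concurrent.futures` module. It is not MPI, OpenMP or CUDA code, only an illustration of the decomposition, and the function names are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_dot(block):
    """Work assigned to one worker: the dot product of its own slice."""
    a_part, b_part = block
    return sum(x * y for x, y in zip(a_part, b_part))

def parallel_dot(a, b, workers=4):
    # split the index range into one contiguous block per worker
    step = max(1, (len(a) + workers - 1) // workers)
    blocks = [(a[i:i + step], b[i:i + step]) for i in range(0, len(a), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_dot, blocks))   # distribute and compute
    return sum(partials)                                 # combine (reduction)

a = list(range(1000))
b = [2] * 1000
print(parallel_dot(a, b))  # 999000
```

Choosing the block size and keeping the final reduction cheap is the kind of tuning that the slide leaves to specialists.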

Page 30: Supercomputers Special Course of Computer Architecture H.Amano

From IBM   web site

The fastest computer. Also a simple NUMA.

Page 31: Supercomputers Special Course of Computer Architecture H.Amano

IBM's BlueGene/Q

• Successor of Blue Gene/L and Blue Gene/P.
• Sequoia consists of BlueGene/Q nodes.
• 18 Power processors (16 computational, 1 control and 1 redundant) and network interfaces are provided on a chip.
• The inner-chip interconnection is a crossbar switch.
• 5-dimensional mesh/torus
• 1.6 GHz clock

Page 32: Supercomputers Special Course of Computer Architecture H.Amano

Japanese supercomputers

• K supercomputer
– Homogeneous scalar-type massively parallel computer.

• Earth Simulator
– Vector computer
– The difference between peak and Linpack performance is small.

• TIT's Tsubame
– Uses many GPUs. An energy-efficient supercomputer.

• Nagasaki University's DEGIMA
– Uses many GPUs. A hand-made supercomputer with high cost-performance. Gordon Bell Prize winner for cost performance.

• GRAPE projects
– Dedicated supercomputers for astronomy. SIMD. Various versions won the Gordon Bell Prize.

Page 33: Supercomputers Special Course of Computer Architecture H.Amano

SACSIS2012 Invited Speech

Page 34: Supercomputers Special Course of Computer Architecture H.Amano

Supercomputer "K"

[Figure: SPARC64 VIIIfx chip with eight cores, an L2 cache, memory, and an interconnect controller attached to the Tofu interconnect (6-D torus/mesh).]

4 nodes/board
24 boards/rack
96 nodes/rack

RDMA mechanism
NUMA or UMA + NORMA

Page 35: Supercomputers Special Course of Computer Architecture H.Amano

SACSIS2012 Invited speech

Page 36: Supercomputers Special Course of Computer Architecture H.Amano

SACSIS2012 invited speech

Page 37: Supercomputers Special Course of Computer Architecture H.Amano

water cooling system

Page 38: Supercomputers Special Course of Computer Architecture H.Amano

Racks of K

Page 39: Supercomputers Special Course of Computer Architecture H.Amano

6-dimensional torus: Tofu

Page 40: Supercomputers Special Course of Computer Architecture H.Amano

[Figure: k-ary n-cube examples with node digits 0, 1, 2 in each dimension: 3-ary 1-cube, 3-ary 2-cube, and 3-ary 3-cube (3-dimensional mesh).]
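The k-ary n-cube naming in the figure can be made concrete: a node is a tuple of n digits in the range 0..k-1, and its neighbors differ by ±1 in exactly one digit. The sketch below assumes torus (wrap-around) links, as in the Tofu and BlueGene/Q networks; a pure mesh would omit the wrap-around. Function names are illustrative.

```python
from itertools import product

def torus_neighbors(node, k):
    """Neighbors of `node` (a tuple of n digits, each 0..k-1) in a k-ary n-cube."""
    result = set()
    for dim in range(len(node)):
        for step in (-1, 1):
            digits = list(node)
            digits[dim] = (digits[dim] + step) % k   # wrap-around link of the torus
            result.add(tuple(digits))
    result.discard(node)   # only matters for the degenerate case k == 1
    return sorted(result)

def all_nodes(k, n):
    """All k**n nodes of a k-ary n-cube."""
    return list(product(range(k), repeat=n))

print(len(all_nodes(3, 3)))             # 27
print(torus_neighbors((0, 1, 2), 3))
```

Each node of a 3-ary 3-cube has 6 neighbors, so the number of links per node grows with the dimension, not with the number of nodes.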

Page 41: Supercomputers Special Course of Computer Architecture H.Amano

4-dimensional mesh

[Figure: three 3-dimensional meshes labeled 0***, 1*** and 2*** linked along the fourth dimension.]

Page 42: Supercomputers Special Course of Computer Architecture H.Amano

Why K could get top 1

• The delay of BlueGene/Q (Sequoia)
– Financial crisis in the USA

• Withdrawal of NEC/Hitachi
– At the start, a complex system combining a vector machine and a scalar machine was planned.
– The whole budget could then be used for the scalar machine alone.

• The budget review made the project famous.
– Sufficient funds were invested in a short period.

• The engineers at Fujitsu did a really good job.

Page 43: Supercomputers Special Course of Computer Architecture H.Amano

SACSIS2012 invited talk

Page 44: Supercomputers Special Course of Computer Architecture H.Amano

The Earth Simulator

[Figure: Node 0, Node 1, ..., Node 639, each with vector processors 0 to 7 and 16 GB of shared memory, connected by an interconnection network (16 GB/s × 2).]

Peak performance: 40 TFLOPS

Page 45: Supercomputers Special Course of Computer Architecture H.Amano

The Earth Simulator (2002): simple NUMA

Page 46: Supercomputers Special Course of Computer Architecture H.Amano

TIT's Tsubame

Well-balanced supercomputer with GPUs

Page 47: Supercomputers Special Course of Computer Architecture H.Amano

Nagasaki Univ.'s DEGIMA

Page 48: Supercomputers Special Course of Computer Architecture H.Amano

GRAPE-DR

Kei Hiraki “GRAPE-DR” http://www.fpl.org (FPL2007)

Page 49: Supercomputers Special Course of Computer Architecture H.Amano

Exa-scale computer

• A Japanese national project for an exa-scale computer has started.
• A feasibility study has started.
– U. Tokyo, Tsukuba Univ., Tohoku Univ. and RIKEN.
• It is difficult to produce supercomputers with original Japanese chips.
• In Japan, a vendor takes a loss when developing supercomputers.
• The vendor may recover the development cost later by selling smaller systems.
• However, Japanese semiconductor companies will not be able to bear large development costs.
• If Intel's CPUs or NVIDIA's GPUs are used, a huge amount of national money will flow to US companies.
• For exa-scale, 70,000,000 cores are needed.
– The budget limit is more severe than the technical limit.

Page 50: Supercomputers Special Course of Computer Architecture H.Amano

Amdahl's law

Serial part: 1%
Parallel part: 99% (accelerated by parallel processing)

Execution time with p cores: 0.01 + 0.99/p

About 50 times faster with 100 cores, 91 times with 1000 cores.

If even a small part of the execution is serial, the performance improvement is limited.
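The numbers on this slide follow directly from the formula 0.01 + 0.99/p. A minimal sketch (the function name is illustrative) that also tries the 70,000,000 cores mentioned for exa-scale:

```python
def amdahl_speedup(serial_fraction, cores):
    """Speedup over one core when only the parallel part scales with the core count."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for p in (100, 1000, 70_000_000):
    print(p, round(amdahl_speedup(0.01, p), 1))
```

With a 1% serial part the speedup can never exceed 100, however many cores are added.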

Page 51: Supercomputers Special Course of Computer Architecture H.Amano

Why Exa-scale supercomputers?

• The ratio of the serial part becomes small for large-scale problems.
– Linpack is a scale-free benchmark.
– Serial execution part 1 day + parallel execution part 10 years
→ 1 day + 1 day: a big impact.

• Are there any big programs which cannot be solved by K but can be solved by exa-scale supercomputers?
– The number of such programs will decrease.
– Can we find new application areas?

• It is important that such large computing power is open to research.

Page 52: Supercomputers Special Course of Computer Architecture H.Amano

Should we develop floating-point-centric supercomputers?

• What do people want big supercomputers to do?
– Finding new medicines: pattern matching.
– Simulation of earthquakes, meteorology for analyzing global warming.
– Big data
– Artificial intelligence

• Most of them are not suitable for floating-point-centric supercomputers.

• "Supercomputers for big data" or "super-cloud computers" might be required.

Page 53: Supercomputers Special Course of Computer Architecture H.Amano

Motivation and limitation

• Integrated computer technologies including architecture, hardware, software, dependability techniques, semiconductors and applications.

• A flagship and a symbol.
• No computers other than supercomputers remain in Japan.
• Super computing power is open to peaceful research.
• It is a tool which makes impossible analyses possible.

• What needs infinite computing power?
• Is it a Japanese supercomputer if all cores and accelerators are made in the USA?
• Does a floating-point-centric supercomputer, built to solve Linpack as fast as possible, really fit the demand?

Look at the exa-scale computer project!

Page 54: Supercomputers Special Course of Computer Architecture H.Amano

Exercise

• A target program:
serial computation part: 1
parallel computation part: N^3

• K: 700,000 cores

• Exa: 70,000,000 cores

• What N makes Exa 10 times faster than K?
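One way to explore the exercise numerically is to write the execution-time model as a function and evaluate the ratio for several N. The sketch assumes the parallel part is divided perfectly among the cores with no communication cost, which the exercise implies but does not state. It deliberately stops short of solving for N.

```python
def run_time(n, cores, serial=1.0):
    """Model time: the serial part plus the N^3 parallel part shared by all cores."""
    return serial + n ** 3 / cores

def exa_over_k(n, k_cores=700_000, exa_cores=70_000_000):
    """How many times faster Exa is than K for problem size n."""
    return run_time(n, k_cores) / run_time(n, exa_cores)

for n in (10, 100, 1000, 10_000):
    print(n, round(exa_over_k(n), 2))
```

The ratio grows with N and approaches the core ratio of 100, so the N asked for can be bracketed from the printed values and then found by solving the equation by hand.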