The Cray X1 Multiprocessor and Roadmap
Cray Inc.
ORNL Workshop at Fall Creek Falls, October 27, 2003
Slide 2
Cray’s Computing Vision
[Roadmap diagram: two product lines converge through product integration toward ‘Cascade’, targeting sustained petaflops in 2010.
– Scalable high-bandwidth computing line: X1 → X1E → ‘Black Widow’ (2006) → ‘Black Widow 2’
– Red Storm (RS) line: Red Storm → ‘Strider X’ → ‘Strider 2’ → ‘Strider 3’ (2006)
The timeline spans 2004–2010.]
Slide 3
Cray PVP
• Powerful vector processors
• Very high memory bandwidth
• Non-unit stride computation
• Special ISA features
• Modernized the ISA
T3E
• Extreme scalability
• Optimized communication
• Memory hierarchy
• Synchronization features
• Improved via vectors
Cray X1
Extreme scalability with high-bandwidth vector processors
Slide 4
Cray X1 Instruction Set Architecture
New ISA
– Much larger register set (32 × 64 vector, 64 + 64 scalar)
– All operations performed under mask
– 64- and 32-bit memory and IEEE arithmetic
– Integrated synchronization features
Advantages of a vector ISA
– Compiler provides useful dependence information to hardware
– Very high single-processor performance with low complexity:
  ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr)
– Localized computation on the processor chip
  – large register state with very regular access patterns
  – registers and functional units grouped into local clusters (pipes): an excellent fit with future IC technology
– Latency tolerance and pipelining to memory/network: very well suited for scalable systems
– Easy extensibility to the next-generation implementation
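To make the ops/sec identity above concrete, here is a back-of-the-envelope check in C. The organization assumed (800 MHz clock, four SSPs per MSP, two vector pipes per SSP, a fused multiply-add counted as two flops per pipe per cycle) is the commonly cited X1 MSP layout, not something stated on this slide.

/* Back-of-the-envelope use of ops/sec = (cycles/sec) * (ops/cycle).
 * Assumed (not from the slide): 800 MHz clock, 4 SSPs per MSP,
 * 2 vector pipes per SSP, fused multiply-add = 2 flops/pipe/cycle. */
#include <stdio.h>

int main(void)
{
    double cycles_per_sec = 800e6;             /* assumed clock rate     */
    double ops_per_cycle  = 4.0 * 2.0 * 2.0;   /* SSPs * pipes * madd    */
    double msp_peak       = cycles_per_sec * ops_per_cycle;
    printf("MSP peak: %.1f Gflops\n", msp_peak / 1e9);   /* 12.8 Gflops */
    return 0;
}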
Slide 5
Not Your Father’s Vector Machine
• New instruction set
• New system architecture
• New processor microarchitecture
• “Classic” vector machines were programmed differently
– Classic vector: optimize for loop length with little regard for locality
– Scalable micro: optimize for locality with little regard for loop length
• The Cray X1 is programmed like other parallel machines
– Rewards locality: register, cache, local memory, remote memory
– Decoupled microarchitecture performs well on short loop nests
– (It does, however, require vectorizable code; see the sketch below.)
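As a generic illustration of the vectorizable-code requirement (illustrative C, not Cray-specific code), the first loop below has independent, unit-stride iterations that a vectorizing compiler can map onto vector pipes even at short trip counts, while the second carries a loop-carried dependence and falls back to scalar execution.

/* Illustrative only; not taken from the slides. */
void axpy(int n, double a, const double *x, double *y)
{
    /* Independent, unit-stride iterations: vectorizes well. */
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

void prefix_sum(int n, double *x)
{
    /* Each iteration reads the previous result (loop-carried
     * dependence), so this cannot be vectorized as written.  */
    for (int i = 1; i < n; i++)
        x[i] += x[i - 1];
}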
Slide 6
Cray X1 Node
[Node diagram: four processors (P), each with a cache ($), share sixteen memory controller chips (M) with their local memory, plus I/O.]
• Four multistream processors (MSPs), each 12.8 Gflops
• High-bandwidth local shared memory (128 Direct Rambus channels)
• 32 network links and four I/O links per node
• Per-node totals: 51 Gflops, 200 GB/s
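Dividing the two per-node figures gives the memory balance the slide is emphasizing; a one-line sketch of the arithmetic, using only the numbers quoted above:

/* Node balance from the figures on this slide:
 * 4 x 12.8 Gflops of peak and ~200 GB/s of local memory bandwidth. */
#include <stdio.h>

int main(void)
{
    double node_gflops = 4 * 12.8;   /* 51.2 Gflops peak            */
    double node_gbs    = 200.0;      /* local memory bandwidth GB/s */
    printf("%.2f bytes of memory bandwidth per peak flop\n",
           node_gbs / node_gflops);  /* ~3.9 B/flop                 */
    return 0;
}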
Slide 7
Interconnection Network
• 16 parallel networks for bandwidth
• Global shared memory across the machine
• NUMA, scalable up to 1024 nodes
Slide 8
Network Topology (16 CPUs)
[Diagram: four nodes (node 0–3), each with four processors (P) and memory controllers M0–M15; the corresponding M chips on each node are wired together through sixteen network sections (Section 0–Section 15).]
Slide 9
Network Topology (128 CPUs)
[Diagram: groups of nodes connected through router (R) chips, with bundles of 16 links between stages.]
Slide 10
Network Topology (512 CPUs)
Slide 11
Designed for Scalability (T3E Heritage)
• Distributed shared memory (DSM) architecture
– Low-latency load/store access to the entire machine (tens of TBs)
• Decoupled vector memory architecture for latency tolerance
– Thousands of outstanding references, flexible addressing
• Very high performance network
– High bandwidth, fine-grained transfers
– Same router as the Origin 3000, but 32 parallel copies of the network
• Architectural features for scalability
– Remote address translation
– Global coherence protocol optimized for distributed memory
– Fast synchronization
• Parallel I/O scales with system size
Slide 12
Decoupled Microarchitecture
• Decoupled access/execute and decoupled scalar/vector
• Scalar unit runs ahead, doing addressing and control
– Scalar and vector loads issued early
– Store addresses computed and saved for later use
– Operations queued and executed later when data arrives
• Hardware dynamically unrolls loops
– Scalar unit goes on to issue the next loop before the current loop has completed
– Full scalar register renaming through 8-deep shadow registers
– Vector loads renamed through load buffers
– Special sync operations keep the pipeline full, even across barriers
This is the key to making the system perform well on short-VL code.
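A minimal C sketch of the kind of short-vector-length loop nest this decoupling is aimed at; the code and trip counts are illustrative, not taken from the slides.

/* Illustrative short-VL loop nest: the inner trip count may be far
 * below a classic vector length of 64, so performance depends on the
 * scalar unit running ahead and issuing one short vector loop per row
 * while earlier rows are still draining through the vector pipes.    */
void scale_rows(int nrows, int ncols, double *a, const double *s)
{
    for (int i = 0; i < nrows; i++)
        for (int j = 0; j < ncols; j++)
            a[(long)i * ncols + j] *= s[i];
}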
Slide 13
Address Translation
• High translation bandwidth: scalar plus four vector translations per cycle per P
• Source translation using 256-entry TLBs with support for multiple page sizes
• Remote (hierarchical) translation:
– allows each node to manage its own memory (eases memory management)
– the TLB only needs to hold translations for one node, so translation scales with machine size
[Address format diagram]
Virtual address (VA), 64 bits: bits 63–62 select the memory region (useg, kseg, kphys); bits 61–48 must be zero (MBZ); the rest splits into virtual page number and page offset, with the boundary anywhere from bit 16 to bit 32 (page sizes of 64 KB up to 4 GB).
Physical address (PA), 48 bits: bits 47–46 select the physical address space (main memory, MMR, I/O); bits 45–36 give the node; bits 35–0 give the offset within the node. Maximum of 64 TB of physical memory.
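As a reading aid for the layout above, here is a small C sketch that decodes the physical-address fields (bits 47:46 address space, 45:36 node, 35:0 offset, as recovered from the diagram). It is illustrative, not code from any Cray header, and the helper names are hypothetical.

/* Illustrative decode of the recovered physical-address layout:
 * bits 47:46 = address space, 45:36 = node, 35:0 = offset.
 * Not from any Cray header.                                      */
#include <stdint.h>
#include <stdio.h>

static unsigned pa_space(uint64_t pa)  { return (unsigned)((pa >> 46) & 0x3); }
static unsigned pa_node(uint64_t pa)   { return (unsigned)((pa >> 36) & 0x3FF); }
static uint64_t pa_offset(uint64_t pa) { return pa & ((1ULL << 36) - 1); }

int main(void)
{
    uint64_t pa = ((uint64_t)5 << 36) | 0x1234;   /* node 5, offset 0x1234 */
    printf("space=%u node=%u offset=0x%llx\n",
           pa_space(pa), pa_node(pa), (unsigned long long)pa_offset(pa));
    return 0;
}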
Slide 14
Cache Coherence
Global coherence, but caching only of memory from the local node (8–32 GB)
– Supports SMP-style codes up to 4 MSPs (16 SSPs)
– References outside this domain are converted to non-allocate
– Scalable codes use explicit communication anyway
– Keeps the directory entry and protocol simple
Explicit cache allocation control
– Per-instruction hint for vector references
– Per-page hint for scalar references
– Use non-allocating references to avoid cache pollution
Coherence directory stored on the M chips (rather than in DRAM)
– Low latency and very high bandwidth to support vectors
• Typical CC system: 1 directory update per processor per 100–200 ns
• Cray X1: 3.2 directory updates per MSP per ns (a factor of several hundred higher!)
Cray Inc.
System Software
Slide 16
Cray X1 I/O Subsystem
[Diagram, per X1 node: two I chips drive SPC channels (2 × 1.2 GB/s) to two I/O boards; IOCA controllers on each board feed 800 MB/s PCI-X buses and cards; these connect to 200 MB/s FC-AL Fibre Channel (FC) links, RS200 storage units (SB and CB modules), and CNS Gigabit Ethernet or HIPPI networks (IP over fiber, CPES). File systems: XFS and XVM, plus ADIC SNFS.]
Slide 17
Cray X1 UNICOS/mp: Single System Image
• A UNIX kernel executes on each application node (somewhat like the Chorus™ microkernel on UNICOS/mk), providing SSI, scaling, and resiliency.
• Commands launch applications as on the T3E (mpprun).
• System Service nodes provide file services, process management, and other basic UNIX functionality (like the /mk servers); user commands execute on the System Service nodes.
• A global resource manager (Psched) schedules applications on the Cray X1 nodes.
[Diagram shows a 256-CPU Cray X1.]
Slide 18
Programming Models
• Shared memory: OpenMP, pthreads
– Single node (51 Gflops, 8–32 GB)
– Fortran and C
• Distributed memory: MPI, shmem(), UPC, Co-array Fortran
• Hierarchical models (DM over OpenMP)
• Both MSP and SSP execution modes are supported for all programming models
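To give a flavor of the one-sided shmem() style listed above, here is a minimal C sketch using modern OpenSHMEM naming (shmem_init, shmem_long_put); the era-appropriate Cray library used start_pes() and <mpp/shmem.h>, so treat the exact calls as an assumption rather than X1-specific API.

#include <stdio.h>
#include <shmem.h>                 /* OpenSHMEM-style naming (assumed) */

int main(void)
{
    static long value = 0;         /* symmetric: exists on every PE    */
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    long mine = me + 1;
    /* One-sided put: write our value into our right-hand neighbor. */
    shmem_long_put(&value, &mine, 1, (me + 1) % npes);
    shmem_barrier_all();

    printf("PE %d of %d received %ld\n", me, npes, value);
    shmem_finalize();
    return 0;
}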
Cray Inc.
Mechanical Design
Slide 20
Node Board
[Photo with callouts: field-replaceable memory daughter cards, network interconnect, spray-cooling caps, air-cooling manifold, 17" × 22" PCB, and the 8-chip CPU MCM.]
Slide 21
Cray X1 Node Module
Slide 22
Cray X1 Chassis
Slide 23
64-Processor Cray X1 System (~820 Gflops)
Cray Inc.
Cray BlackWidow
The Next Generation HPC System From Cray Inc.
Slide 25
BlackWidow Highlights
• Designed as a follow-on to the Cray X1:
– Upward-compatible ISA
– FCS in 2006
• Significant improvement (>> Moore’s Law rate) in:
– Single-thread scalar performance
– Price/performance
• Refine the X1 implementation:
– Continue to support DSM with 4-way SMP nodes
– Further improve sustained memory bandwidth
– Improve scalability up and down
– Improve reliability/fault tolerance
– Enhance the instruction set
• A few BlackWidow features:
– Single-chip vector microprocessor
– Hybrid electrical/optical interconnect
– Configurable memory capacity, memory bandwidth, and network bandwidth
Slide 26
Cascade Project
Cray Inc., Stanford, Caltech/JPL, Notre Dame
Slide 27
High Productivity Computing Systems: HPCS Program Focus Areas
Goals: provide a new generation of economically viable, high-productivity computing systems for the national security and industrial user community (2007–2010).
Impact:
– Performance (efficiency): speed up critical national security applications by a factor of 10X to 40X
– Productivity (time-to-solution)
– Portability (transparency): insulate research and operational application software from the system
– Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
Fill the critical technology and capability gap: from today (late-80s HPC technology) to the future (quantum/bio computing).
Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology.
Slide 28
HPCS Planned Program, Phases 1–3
[Timeline chart, fiscal years 2002–2010: Phase 1 (industry concept study and reviews) leads through an industry BAA/RFP to Phase 2 (R&D, with Phase 2 readiness reviews) and then a Phase 3 readiness review into Phase 3 full-scale development (planned). Supporting tracks: academia research, metrics and benchmarks, application analysis, performance assessment, industry-guided research, technology assessments, and requirements and metrics. Milestones include concept reviews, a system design review, PDR and early prototype, research prototypes and pilot systems, early software tools, early pilot platforms, and research platforms, leading to HPCS capability or products. Critical program milestones are marked along the fiscal-year axis.]
Slide 29
Technology Focus of Cascade
• System design
– Network topology and router design
– Shared memory for low-latency, low-overhead communication
• Processor design
– Exploring clustered vectors, multithreading, and streams
– Enhancing temporal locality via deeper memory hierarchies
– Improving latency tolerance and memory/communication concurrency
• Lightweight threads in memory
– For exploiting spatial locality where there is little temporal locality
– Move threads to data instead of data to threads
– Exploring use of PIMs
• Programming environments and compilers
– High-productivity languages
– Compiler support for heavyweight/lightweight thread decomposition
– Parallel programming performance analysis and debugging
• Operating systems
– Scalability to tens of thousands of processors
– Robustness and fault tolerance
This lets us explore technologies we otherwise couldn’t, with a three-year head start on the typical development cycle.
Slide 30
Cray’s Computing Vision
[Roadmap diagram repeated from Slide 2.]
Cray Inc.
File Name: BWOpsReview081103.ppt
Questions?