The Cray X1 Multiprocessor and Roadmap
Cray Inc.
ORNL Workshop at Fall Creek Falls, October 27, 2003
Slide 2
Cray’s Computing Vision
[Roadmap diagram: two product lines converge through product integration toward ‘Cascade’, targeting sustained petaflops in 2010.
– Scalable high-bandwidth computing line: X1 → X1E → ‘Black Widow’ (2006) → ‘Black Widow 2’
– Red Storm (RS) line: Red Storm → ‘Strider X’ → ‘Strider 2’ → ‘Strider 3’ (2006)
The timeline spans 2004–2010.]
Slide 3
Cray PVP
• Powerful vector processors
• Very high memory bandwidth
• Non-unit stride computation
• Special ISA features
• Modernized the ISA
T3E
• Extreme scalability
• Optimized communication
• Memory hierarchy
• Synchronization features
• Improved via vectors
Cray X1
Extreme scalability with high-bandwidth vector processors
Slide 4
Cray X1 Instruction Set Architecture
New ISA
– Much larger register set (32 × 64 vector, 64 + 64 scalar)
– All operations performed under mask
– 64- and 32-bit memory and IEEE arithmetic
– Integrated synchronization features
Advantages of a vector ISA
– Compiler provides useful dependence information to hardware
– Very high single-processor performance with low complexity:
  ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr)
– Localized computation on the processor chip
  – large register state with very regular access patterns
  – registers and functional units grouped into local clusters (pipes): an excellent fit with future IC technology
– Latency tolerance and pipelining to memory/network: very well suited for scalable systems
– Easy extensibility to the next-generation implementation
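To make the ops/sec identity above concrete, here is a back-of-the-envelope check in C. The organization assumed (800 MHz clock, four SSPs per MSP, two vector pipes per SSP, a fused multiply-add counted as two flops per pipe per cycle) is the commonly cited X1 MSP layout, not something stated on this slide.

/* Back-of-the-envelope use of ops/sec = (cycles/sec) * (ops/cycle).
 * Assumed (not from the slide): 800 MHz clock, 4 SSPs per MSP,
 * 2 vector pipes per SSP, fused multiply-add = 2 flops/pipe/cycle. */
#include <stdio.h>

int main(void)
{
    double cycles_per_sec = 800e6;             /* assumed clock rate     */
    double ops_per_cycle  = 4.0 * 2.0 * 2.0;   /* SSPs * pipes * madd    */
    double msp_peak       = cycles_per_sec * ops_per_cycle;
    printf("MSP peak: %.1f Gflops\n", msp_peak / 1e9);   /* 12.8 Gflops */
    return 0;
}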
Slide 5
Not Your Father’s Vector Machine
• New instruction set
• New system architecture
• New processor microarchitecture
• “Classic” vector machines were programmed differently
– Classic vector: optimize for loop length with little regard for locality
– Scalable micro: optimize for locality with little regard for loop length
• The Cray X1 is programmed like other parallel machines
– Rewards locality: register, cache, local memory, remote memory
– Decoupled microarchitecture performs well on short loop nests
– (It does, however, require vectorizable code; see the sketch below.)
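As a generic illustration of the vectorizable-code requirement (illustrative C, not Cray-specific code), the first loop below has independent, unit-stride iterations that a vectorizing compiler can map onto vector pipes even at short trip counts, while the second carries a loop-carried dependence and falls back to scalar execution.

/* Illustrative only; not taken from the slides. */
void axpy(int n, double a, const double *x, double *y)
{
    /* Independent, unit-stride iterations: vectorizes well. */
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

void prefix_sum(int n, double *x)
{
    /* Each iteration reads the previous result (loop-carried
     * dependence), so this cannot be vectorized as written.  */
    for (int i = 1; i < n; i++)
        x[i] += x[i - 1];
}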
Slide 6
Cray X1 Node
[Node diagram: four processors (P), each with a cache ($), share sixteen memory controller chips (M) with their local memory, plus I/O.]
• Four multistream processors (MSPs), each 12.8 Gflops
• High-bandwidth local shared memory (128 Direct Rambus channels)
• 32 network links and four I/O links per node
• Per-node totals: 51 Gflops, 200 GB/s
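Dividing the two per-node figures gives the memory balance the slide is emphasizing; a one-line sketch of the arithmetic, using only the numbers quoted above:

/* Node balance from the figures on this slide:
 * 4 x 12.8 Gflops of peak and ~200 GB/s of local memory bandwidth. */
#include <stdio.h>

int main(void)
{
    double node_gflops = 4 * 12.8;   /* 51.2 Gflops peak            */
    double node_gbs    = 200.0;      /* local memory bandwidth GB/s */
    printf("%.2f bytes of memory bandwidth per peak flop\n",
           node_gbs / node_gflops);  /* ~3.9 B/flop                 */
    return 0;
}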
Slide 7
Interconnection Network
• 16 parallel networks for bandwidth
• Global shared memory across the machine
• NUMA, scalable up to 1024 nodes
Slide 8
Network Topology (16 CPUs)
[Diagram: four nodes (node 0–3), each with four processors (P) and memory controllers M0–M15; the corresponding M chips on each node are wired together through sixteen network sections (Section 0–Section 15).]
Slide 9
Network Topology (128 CPUs)
[Diagram: groups of nodes connected through router (R) chips, with bundles of 16 links between stages.]
Slide 10
Network Topology (512 CPUs)
Slide 11
Designed for Scalability (T3E Heritage)
• Distributed shared memory (DSM) architecture
– Low-latency load/store access to the entire machine (tens of TBs)
• Decoupled vector memory architecture for latency tolerance
– Thousands of outstanding references, flexible addressing
• Very high performance network
– High bandwidth, fine-grained transfers
– Same router as the Origin 3000, but 32 parallel copies of the network
• Architectural features for scalability
– Remote address translation
– Global coherence protocol optimized for distributed memory
– Fast synchronization
• Parallel I/O scales with system size
Slide 12
Decoupled Microarchitecture
• Decoupled access/execute and decoupled scalar/vector
• Scalar unit runs ahead, doing addressing and control
– Scalar and vector loads issued early
– Store addresses computed and saved for later use
– Operations queued and executed later when data arrives
• Hardware dynamically unrolls loops
– Scalar unit goes on to issue the next loop before the current loop has completed
– Full scalar register renaming through 8-deep shadow registers
– Vector loads renamed through load buffers
– Special sync operations keep the pipeline full, even across barriers
This is the key to making the system perform well on short-VL code.
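A minimal C sketch of the kind of short-vector-length loop nest this decoupling is aimed at; the code and trip counts are illustrative, not taken from the slides.

/* Illustrative short-VL loop nest: the inner trip count may be far
 * below a classic vector length of 64, so performance depends on the
 * scalar unit running ahead and issuing one short vector loop per row
 * while earlier rows are still draining through the vector pipes.    */
void scale_rows(int nrows, int ncols, double *a, const double *s)
{
    for (int i = 0; i < nrows; i++)
        for (int j = 0; j < ncols; j++)
            a[(long)i * ncols + j] *= s[i];
}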
Slide 13
Address Translation
• High translation bandwidth: scalar plus four vector translations per cycle per P
• Source translation using 256-entry TLBs with support for multiple page sizes
• Remote (hierarchical) translation:
– allows each node to manage its own memory (eases memory management)
– the TLB only needs to hold translations for one node, so translation scales with machine size
[Address format diagram]
Virtual address (VA), 64 bits: bits 63–62 select the memory region (useg, kseg, kphys); bits 61–48 must be zero (MBZ); the rest splits into virtual page number and page offset, with the boundary anywhere from bit 16 to bit 32 (page sizes of 64 KB up to 4 GB).
Physical address (PA), 48 bits: bits 47–46 select the physical address space (main memory, MMR, I/O); bits 45–36 give the node; bits 35–0 give the offset within the node. Maximum of 64 TB of physical memory.
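As a reading aid for the layout above, here is a small C sketch that decodes the physical-address fields (bits 47:46 address space, 45:36 node, 35:0 offset, as recovered from the diagram). It is illustrative, not code from any Cray header, and the helper names are hypothetical.

/* Illustrative decode of the recovered physical-address layout:
 * bits 47:46 = address space, 45:36 = node, 35:0 = offset.
 * Not from any Cray header.                                      */
#include <stdint.h>
#include <stdio.h>

static unsigned pa_space(uint64_t pa)  { return (unsigned)((pa >> 46) & 0x3); }
static unsigned pa_node(uint64_t pa)   { return (unsigned)((pa >> 36) & 0x3FF); }
static uint64_t pa_offset(uint64_t pa) { return pa & ((1ULL << 36) - 1); }

int main(void)
{
    uint64_t pa = ((uint64_t)5 << 36) | 0x1234;   /* node 5, offset 0x1234 */
    printf("space=%u node=%u offset=0x%llx\n",
           pa_space(pa), pa_node(pa), (unsigned long long)pa_offset(pa));
    return 0;
}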
Slide 14
Cache Coherence
Global coherence, but caching only of memory from the local node (8–32 GB)
– Supports SMP-style codes up to 4 MSPs (16 SSPs)
– References outside this domain are converted to non-allocate
– Scalable codes use explicit communication anyway
– Keeps the directory entry and protocol simple
Explicit cache allocation control
– Per-instruction hint for vector references
– Per-page hint for scalar references
– Use non-allocating references to avoid cache pollution
Coherence directory stored on the M chips (rather than in DRAM)
– Low latency and very high bandwidth to support vectors
• Typical CC system: 1 directory update per processor per 100–200 ns
• Cray X1: 3.2 directory updates per MSP per ns (a factor of several hundred higher!)
Cray Inc.
System Software
Slide 16
Cray X1 I/O Subsystem
[Diagram, per X1 node: two I chips drive SPC channels (2 × 1.2 GB/s) to two I/O boards; IOCA controllers on each board feed 800 MB/s PCI-X buses and cards; these connect to 200 MB/s FC-AL Fibre Channel (FC) links, RS200 storage units (SB and CB modules), and CNS Gigabit Ethernet or HIPPI networks (IP over fiber, CPES). File systems: XFS and XVM, plus ADIC SNFS.]
Slide 17
Cray X1 UNICOS/mp: Single System Image
• A UNIX kernel executes on each application node (somewhat like the Chorus™ microkernel on UNICOS/mk), providing SSI, scaling, and resiliency.
• Commands launch applications as on the T3E (mpprun).
• System Service nodes provide file services, process management, and other basic UNIX functionality (like the /mk servers); user commands execute on the System Service nodes.
• A global resource manager (Psched) schedules applications on the Cray X1 nodes.
[Diagram shows a 256-CPU Cray X1.]
Slide 18
Programming Models
• Shared memory: OpenMP, pthreads
– Single node (51 Gflops, 8–32 GB)
– Fortran and C
• Distributed memory: MPI, shmem(), UPC, Co-array Fortran
• Hierarchical models (DM over OpenMP)
• Both MSP and SSP execution modes are supported for all programming models
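To give a flavor of the one-sided shmem() style listed above, here is a minimal C sketch using modern OpenSHMEM naming (shmem_init, shmem_long_put); the era-appropriate Cray library used start_pes() and <mpp/shmem.h>, so treat the exact calls as an assumption rather than X1-specific API.

#include <stdio.h>
#include <shmem.h>                 /* OpenSHMEM-style naming (assumed) */

int main(void)
{
    static long value = 0;         /* symmetric: exists on every PE    */
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    long mine = me + 1;
    /* One-sided put: write our value into our right-hand neighbor. */
    shmem_long_put(&value, &mine, 1, (me + 1) % npes);
    shmem_barrier_all();

    printf("PE %d of %d received %ld\n", me, npes, value);
    shmem_finalize();
    return 0;
}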
Cray Inc.
Mechanical Design
Slide 20
Node Board
[Photo with callouts: field-replaceable memory daughter cards, network interconnect, spray-cooling caps, air-cooling manifold, 17" × 22" PCB, and the 8-chip CPU MCM.]
Slide 21
Cray X1 Node Module
Slide 22
Cray X1 Chassis
Slide 23
64-Processor Cray X1 System (~820 Gflops)
Cray Inc.
Cray BlackWidow
The Next Generation HPC System From Cray Inc.
Slide 25
BlackWidow Highlights
• Designed as a follow-on to the Cray X1:
– Upward-compatible ISA
– FCS in 2006
• Significant improvement (>> Moore’s Law rate) in:
– Single-thread scalar performance
– Price/performance
• Refine the X1 implementation:
– Continue to support DSM with 4-way SMP nodes
– Further improve sustained memory bandwidth
– Improve scalability up and down
– Improve reliability/fault tolerance
– Enhance the instruction set
• A few BlackWidow features:
– Single-chip vector microprocessor
– Hybrid electrical/optical interconnect
– Configurable memory capacity, memory bandwidth, and network bandwidth
Slide 26
Cascade Project
Cray Inc., Stanford, Caltech/JPL, Notre Dame
Slide 27
High Productivity Computing Systems: HPCS Program Focus Areas
Goals: provide a new generation of economically viable, high-productivity computing systems for the national security and industrial user community (2007–2010).
Impact:
– Performance (efficiency): speed up critical national security applications by a factor of 10X to 40X
– Productivity (time-to-solution)
– Portability (transparency): insulate research and operational application software from the system
– Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
Fill the critical technology and capability gap: from today (late-80s HPC technology) to the future (quantum/bio computing).
Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology.
Slide 28
HPCS Planned Program, Phases 1–3
[Timeline chart, fiscal years 2002–2010: Phase 1 (industry concept study and reviews) leads through an industry BAA/RFP to Phase 2 (R&D, with Phase 2 readiness reviews) and then a Phase 3 readiness review into Phase 3 full-scale development (planned). Supporting tracks: academia research, metrics and benchmarks, application analysis, performance assessment, industry-guided research, technology assessments, and requirements and metrics. Milestones include concept reviews, a system design review, PDR and early prototype, research prototypes and pilot systems, early software tools, early pilot platforms, and research platforms, leading to HPCS capability or products. Critical program milestones are marked along the fiscal-year axis.]
Slide 29
Technology Focus of Cascade
• System design
– Network topology and router design
– Shared memory for low-latency, low-overhead communication
• Processor design
– Exploring clustered vectors, multithreading, and streams
– Enhancing temporal locality via deeper memory hierarchies
– Improving latency tolerance and memory/communication concurrency
• Lightweight threads in memory
– For exploiting spatial locality where there is little temporal locality
– Move threads to data instead of data to threads
– Exploring use of PIMs
• Programming environments and compilers
– High-productivity languages
– Compiler support for heavyweight/lightweight thread decomposition
– Parallel programming performance analysis and debugging
• Operating systems
– Scalability to tens of thousands of processors
– Robustness and fault tolerance
This lets us explore technologies we otherwise couldn’t, with a three-year head start on the typical development cycle.
Slide 30
Cray’s Computing Vision
[Roadmap diagram repeated from Slide 2.]
Cray Inc.
File Name: BWOpsReview081103.ppt
Questions?