International Conference on Parallel Processing - August 15-18, 2004 - Montreal (Canada)
Architectural Support for System Software on Large-Scale Clusters
email:[email protected]
Juan Fernández (1,2), Eitan Frachtenberg (1), Fabrizio Petrini (1), Kei Davis (1) and José Carlos Sancho (1)
(1) Performance and Architecture Lab (PAL), CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov
(2) Grupo de Arquitectura y Computación Paralelas (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es
Motivation
Independent nodes / OSs glued together by system software:
– Parallel development and debugging tools
– Resource management
– Communications
– Parallel file system
– Fault tolerance
System software is a key factor to maximize usability, performance and scalability on large-scale clusters!!!
Motivation
Why is System Software so important?
Parallel applications rely on services provided by system software
System software quality impacts performance, scalability, usability and responsiveness of parallel applications
Development of system software is a very time- and resource-consuming task
System software affects the cost of ownership of clusters
Motivation
Why is System Software so complex?
Commodity hardware/OSs not designed for clusters
– Hardware conceived for loosely-coupled environments
– Local OSs lack global awareness of parallel applications
Complex global state
Non-deterministic behavior inherent to clusters (local OS scheduling) and applications (MPI_ANY_SOURCE)
Independent design of different system software components leads to redundant/missing functionality
Motivation
How is System Software built?
System software relies on an abstract network interface to move information between nodes
– Performance expressed by latency and bandwidth
– Interface can evolve to exploit new hardware capabilities
What hardware features should the interconnection network provide to simplify system software?
Goals
Target
– Simplifying design and implementation of the system software for large-scale clusters
– Simplicity, performance, scalability, responsiveness
– Backbone to integrate all nodes into a single, global OS
Approach
– Least common denominator of all system software components as a basic set of three network primitives
– Global synchronization/scheduling of all system activities
Vision
– SIMD system running MIMD applications (variable granularity in the order of hundreds of μs)
Outline
Motivation and Goals
Introduction
Core Primitives
System Software Design
Case Studies
Concluding remarks
Introduction
Challenges in the Design of System Software
Characteristic | Workstation | Cluster
Job Launching | OS | Scripts atop the OS
Job Scheduling | Timeshared by OS | Batch queued or gang scheduled by middleware
Communication | Standard IPC and shared memory | MPI
Storage | Standard file system | Custom parallel file system
Debuggability | Standard tools | Custom parallel debugging tools
Fault Tolerance | Little or none | Application-assisted checkpointing
Introduction
Designing a Parallel OS: What we have…
[Figure: layered stack. Hardware at the bottom, then an Abstract Network Interface, then a separate Network Protocol for each independent system software component (Resource Management, MPI, . . ., Fault Tolerance), with Parallel Applications on top.]
Introduction
Designing a Parallel OS: What we want…
[Figure: layered stack. Hardware at the bottom, then the Core Primitives, then Resource Management, MPI, . . ., Fault Tolerance forming a single, global OS, with Parallel Applications on top.]
Core Primitives
System software built atop three primitives
Xfer-And-Signal
– Transfer block of data to a set of nodes
– Optionally signal local/remote event upon completion
Compare-And-Write
– Compare global variable on a set of nodes
– Optionally write global variable on the same set of nodes
Test-Event
– Poll local event
Core Primitives
System software built atop three primitives
Xfer-And-Signal (QsNet):
– Node S transfers block of data to nodes D1, D2, D3 and D4
– Events triggered at source and destinations
[Figure: node S sends to D1, D2, D3 and D4; a source event fires at S and destination events fire at D1, D2, D3 and D4.]
Core Primitives
System software built atop three primitives
Compare-And-Write (QsNet):
– Node S compares variable V on nodes D1, D2, D3 and D4
[Figure: node S queries nodes D1, D2, D3 and D4.]
•Is V {, , >} to Value?
Core Primitives
System software built atop three primitives
Compare-And-Write (QsNet):
– Node S compares variable V on nodes D1, D2, D3 and D4
– Partial results are combined in the switches
[Figure: the answers from D1, D2, D3 and D4 are combined on the way back to node S.]
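Combining partial results inside the switches can be pictured as a tree reduction. The sketch below assumes the per-node answers are booleans combined with a logical AND and that each switch has four downlinks; neither detail is stated on the slide, and the function names are hypothetical.

```python
import operator


def combine_in_switches(leaf_results, radix=4):
    """AND-reduce per-node answers level by level.

    Each pass stands for one stage of switches, every switch merging up to
    `radix` answers into one. Returns (global_result, number_of_stages).
    """
    level = list(leaf_results)
    if not level:
        return True, 0
    stages = 0
    while len(level) > 1:
        level = [all(level[i:i + radix]) for i in range(0, len(level), radix)]
        stages += 1
    return level[0], stages


def compare_on_nodes(values, op, reference, radix=4):
    """Evaluate op(V, reference) on every node, then combine the answers."""
    leaf_results = [op(v, reference) for v in values]
    return combine_in_switches(leaf_results, radix)
```

Under these assumptions the source receives a single combined answer, and the number of combining stages grows with the logarithm of the node count rather than with the node count itself.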
System Software Design
Characteristic | Requirement | Solution
Job Launching | Data Dissemination | Xfer-And-Signal
Job Launching | Flow Control | Compare-And-Write
Job Launching | Termination Detection | Compare-And-Write
Job Scheduling | Heartbeat | Xfer-And-Signal
Job Scheduling | Context Switch | Prioritized messages/multiple rails
Communication | PUT | Xfer-And-Signal
Communication | GET | Xfer-And-Signal
Communication | Barrier | Compare-And-Write
Communication | Broadcast | Compare-And-Write + Xfer-And-Signal
… atop the three core primitives
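As one illustration of the mapping above, a barrier can be sketched on top of Compare-And-Write alone. The variable names `arrived` and `released`, the dictionary-per-node model and the polling master are assumptions for this example, not the authors' implementation.

```python
import operator


def compare_and_write(node_vars, var, op, value, write=None):
    """Minimal stand-in for Compare-And-Write over a list of per-node dicts.

    Assumption: the write happens only if the comparison holds on all nodes.
    """
    ok = all(op(v.get(var, 0), value) for v in node_vars)
    if ok and write is not None:
        for v in node_vars:
            v[write[0]] = write[1]
    return ok


def arrive(node):
    """A process reaches the barrier: bump its local arrival counter."""
    node["arrived"] = node.get("arrived", 0) + 1


def try_release(node_vars, epoch):
    """Master side: release barrier number `epoch` once every node has arrived."""
    return compare_and_write(node_vars, "arrived", operator.ge, epoch,
                             write=("released", epoch))
```

In this sketch the master would call `try_release` repeatedly until it returns True, while each waiting process polls its own `released` variable; a single global query replaces a software tree of point-to-point messages.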
System Software Design
Can System Software really be built atop the Core Primitives?
Case Studies
Experimental Setup
Characteristic | Crescendo Cluster | Wolverine Cluster
Nodes | 32 x Dell 1550 | 64 AlphaServer ES40
CPUs/Node | 2 x 1GHz Pentium-III | 4 x 833MHz EV68
Memory/Node | 1 GB | 8 GB
Network Cards | QM-400 Elan3 | QM-400 Elan3
OS | RH 7.3 + QsNet kernel | RH 7.1 + QsNet kernel
Software | Qsnetlibs v1.5.0-0 + Intel C/Fortran Compiler 5.0.1 | Qsnetlibs v1.5.0-0 + Compaq's C Compiler
Case Studies: Job Launching
Job Launching: send/execute/check completion
40 times faster than the best reported result!!!
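The send/execute/check-completion sequence can be sketched as a small sequential simulation. Everything concrete here is an assumption for illustration: the per-node dictionaries, the chunk size, the number of buffer slots, and the stand-in for actually running the job. Only the three roles (data dissemination, flow control, termination detection) come from the slides.

```python
import operator


def all_nodes(nodes, var, op, value):
    """Compare-And-Write style query: does op(node[var], value) hold everywhere?"""
    return all(op(n[var], value) for n in nodes)


def drain_one(node):
    """A node consumes one buffered chunk (for example, writes it to local disk)."""
    if node["buffer"]:
        node["image"] += node["buffer"].pop(0)
        node["consumed"] += 1


def launch(binary, nodes, chunk_size=4, slots=2):
    """Send the binary in chunks, let every node run it, check completion."""
    chunks = [binary[i:i + chunk_size] for i in range(0, len(binary), chunk_size)]
    for i, chunk in enumerate(chunks):
        # Flow control: wait until every node has a free buffer slot.
        while not all_nodes(nodes, "consumed", operator.ge, i - slots + 1):
            for n in nodes:
                drain_one(n)
        # Data dissemination: one multicast delivers the chunk to all nodes.
        for n in nodes:
            n["buffer"].append(chunk)
    # Wait for every node to finish assembling the image.
    while not all_nodes(nodes, "consumed", operator.eq, len(chunks)):
        for n in nodes:
            drain_one(n)
    for n in nodes:
        n["done"] = True  # stand-in for executing the job to completion
    # Termination detection: a single global query.
    return all_nodes(nodes, "done", operator.eq, True)
```

The flow-control check keeps at most `slots` chunks buffered on any node, so the sender never overruns the slowest receiver.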
Case Studies: Job Scheduling
Job Scheduling: Gang Scheduling
Very small time slices: RESPONSIVENESS !!!
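Gang scheduling can be illustrated with a small scheduling matrix in which each row is one time slice and all processes of a job sit in the same row. The first-fit packing, the job names and the node numbering below are assumptions for the example; the slides only state that jobs are gang scheduled with very small time slices.

```python
def build_matrix(jobs):
    """First-fit packing of jobs into rows; jobs in a row share no node.

    `jobs` maps a job name to the set of nodes it needs. Each row is a dict
    mapping node -> job name for one time slice.
    """
    rows = []
    for name, nodes in jobs.items():
        for row in rows:
            if nodes.isdisjoint(row):
                break
        else:
            row = {}
            rows.append(row)
        for n in nodes:
            row[n] = name
    return rows


def schedule(jobs, num_slices):
    """Round-robin over the rows: one row is active per time slice."""
    rows = build_matrix(jobs)
    if not rows:
        return []
    return [rows[t % len(rows)] for t in range(num_slices)]
```

With jobs `{"A": {0, 1, 2, 3}, "B": {0, 1}, "C": {2, 3}}`, job A gets a row of its own and B and C share the second row, so all processes of a job always run in the same slice. Shorter slices mean each job waits less before its next turn, which is the responsiveness the slide refers to.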
Case Studies: BCS-MPI
Communication Library: BCS-MPI
[Figure: one time slice (hundreds of μs), delimited by a global strobe when the slice starts and a global strobe when it ends. Phases shown within the slice: exchange of communication requirements, communication scheduling and real transmission, separated by global synchronizations.]
Case Studies: BCS-MPI
Global synchronization
– Strobe sent at regular intervals (time slices): Compare-And-Write + Xfer-And-Signal (Master), Test-Event (Slaves)
– All system activities are tightly coupled
– Global information is required to schedule resources; global synchronization facilitates the task but it is not enough
Global scheduling
– Exchange of communication requirements: Xfer-And-Signal + Test-Event
– Communication scheduling
– Real transmission: Xfer-And-Signal + Test-Event
Implementation in Network Interface Card
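The three scheduling steps of a time slice can be sketched as a single function. The descriptor formats, the exact-match rule between sends and receives, and carrying unmatched requests over to the next slice are assumptions made for this example; BCS-MPI itself runs on the network interface card and is not shown here.

```python
def run_time_slice(posted_sends, posted_recvs):
    """Simulate one time slice.

    posted_sends: list of (src, dst, payload) descriptors.
    posted_recvs: list of (dst, src) descriptors.
    Returns (delivered, leftover_sends, leftover_recvs).
    """
    # Step 1, exchange of communication requirements: after this step every
    # node is assumed to know all posted descriptors.
    pending_recvs = list(posted_recvs)

    # Step 2, communication scheduling: a send is scheduled only if the
    # matching receive has already been posted.
    scheduled, leftover_sends = [], []
    for src, dst, payload in posted_sends:
        if (dst, src) in pending_recvs:
            pending_recvs.remove((dst, src))
            scheduled.append((src, dst, payload))
        else:
            leftover_sends.append((src, dst, payload))

    # Step 3, real transmission: move the scheduled payloads.
    delivered = {}
    for src, dst, payload in scheduled:
        delivered.setdefault(dst, []).append((src, payload))
    return delivered, leftover_sends, pending_recvs
```

Requests that find no partner are returned so that a caller can post them again in the following slice, which is how this sketch models a message waiting for its receive.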
Case Studies: BCS-MPI
SWEEP3D and SAGE Performance (IA32) Production-level MPI versus BCS-MPI
0.5% SPEEDUP 2% SPEEDUP
Case Studies
System Software atop three Core Primitives:
Performance/Scalability with a simple design
Concluding Remarks
New abstract network interface for system software
– Three communication primitives in the network hardware
– Implement the basics of most system software components
– Simplicity / Performance / Scalability / Responsiveness
Promising preliminary results demonstrate that scalable resource management and parallel application communication are indeed feasible
Future Work
Kernel-level implementation of the core primitives
– User-level solution is already working
Deterministic replay of MPI programs
– Ordered resource scheduling may enforce reproducibility
Transparent fault tolerance
– Global coordination simplifies the state of the machine
Introduction
Designing a Parallel Operating System
Most system software components need to move information between nodes frequently
– Fast and scalable unicast mechanism
All system activities must be tightly coupled by means of …
– … global synchronization: fast and scalable global synchronization mechanisms
– … and global resource scheduling: fast and scalable global information exchange
Hardware support is key to perform these tasks scalably and at a sub-millisecond granularity
Case Studies
Experimental Setup
STORM (Scalable TOol for Resource Management)
– Architecture: set of dæmons running on the management/compute nodes; network abstraction layer based on the three core primitives
– Functionality: job launching; job scheduling (FCFS, gang scheduling and others), where new scheduling algorithms can be plugged in; resource accounting
System Software Design
Characteristic | Requirement | Solution
Storage | Metadata Transfer, File Data Transfer | Xfer-And-Signal
Debuggability | Debug Data Transfer | Xfer-And-Signal
Debuggability | Debug Synchronization | Compare-And-Write
Fault Tolerance | Fault Detection | Compare-And-Write
Fault Tolerance | Checkpointing Synchronization | Compare-And-Write
Fault Tolerance | Checkpointing Data Transfer | Xfer-And-Signal
… atop the three core primitives
Case Studies: BCS-MPI
Non-blocking primitives: MPI_Isend/Irecv
Case Studies: BCS-MPI
Blocking primitives: MPI_Send/Recv