Architectural Support for System Software on Large-Scale Clusters

Page 1: Architectural Support for System Software on Large-Scale Clusters

International Conference on Parallel Processing - August 15-18, 2004 - Montreal (Canada)

Architectural Support for System Software on Large-Scale Clusters

email:[email protected]

Juan Fernández (1,2), Eitan Frachtenberg (1), Fabrizio Petrini (1), Kei Davis (1) and José Carlos Sancho (1)

(1) Performance and Architecture Lab (PAL), CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov

(2) Grupo de Arquitectura y Computación Paralelas (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es

Page 2: Architectural Support for System Software on Large-Scale Clusters

Motivation

Independent nodes / OSs glued together by system software:
– Parallel development and debugging tools
– Resource management
– Communications
– Parallel file system
– Fault tolerance

[Figure: a set of independent nodes, each running its own OS.]

System software is a key factor in maximizing usability, performance and scalability on large-scale clusters!

Page 3: Architectural Support for System Software on Large-Scale Clusters

Motivation

Why is System Software so important?

Parallel applications rely on services provided by system software
– System software quality impacts the performance, scalability, usability and responsiveness of parallel applications

Development of system software is a very time- and resource-consuming task
– System software affects the cost of ownership of clusters

Page 4: Architectural Support for System Software on Large-Scale Clusters

Motivation

Why is System Software so complex?

Commodity hardware/OSs not designed for clusters
– Hardware conceived for loosely-coupled environments
– Local OSs lack global awareness of parallel applications

Complex global state
– Non-deterministic behavior inherent to clusters (local OS scheduling) and applications (MPI_ANY_SOURCE)

Independent design of different system software components leads to redundant/missing functionality

Page 5: Architectural Support for System Software on Large-Scale Clusters

Motivation

How is System Software built?

System software relies on an abstract network interface to move information between nodes
– Performance expressed by latency and bandwidth
– Interface can evolve to exploit new hardware capabilities

What hardware features should the interconnection network provide to simplify system software?

Page 6: Architectural Support for System Software on Large-Scale Clusters

Goals

Target
– Simplifying design and implementation of the system software for large-scale clusters
– Simplicity, performance, scalability, responsiveness
– Backbone to integrate all nodes into a single, global OS

Approach
– Least common denominator of all system software components as a basic set of three network primitives
– Global synchronization/scheduling of all system activities

Vision
– SIMD system running MIMD applications (variable granularity on the order of hundreds of μs)

Page 7: Architectural Support for System Software on Large-Scale Clusters

Outline

Motivation and Goals

Introduction

Core Primitives

System Software Design

Case Studies

Concluding remarks

Page 8: Architectural Support for System Software on Large-Scale Clusters

Introduction

Challenges in the Design of System Software

Characteristic | Workstation | Cluster
Job Launching | OS | Scripts atop the OS
Job Scheduling | Timeshared by OS | Batch queued or gang scheduled by middleware
Communication | Standard IPC and shared memory | MPI
Storage | Standard file system | Custom parallel file system
Debuggability | Standard tools | Custom parallel debugging tools
Fault Tolerance | Little or none | Application-assisted checkpointing

[Diagram label: SYSTEM SOFTWARE]

Page 9: Architectural Support for System Software on Large-Scale Clusters

Introduction

Designing a Parallel OS: What we have…

[Figure: layered diagram. Parallel applications run on independent system software components (Resource Management, MPI, ..., Fault Tolerance). Each component has its own network protocol on top of the abstract network interface, which sits on the hardware.]

Page 10: Architectural Support for System Software on Large-Scale Clusters

Introduction

Designing a Parallel OS: What we want…

[Figure: layered diagram. Parallel applications run on a single, global OS (Resource Management, MPI, ..., Fault Tolerance) built on the core primitives, which sit on the hardware.]

Page 11: Architectural Support for System Software on Large-Scale Clusters

Outline

Motivation and Goals

Introduction

Core Primitives

System Software Design

Case Studies

Concluding remarks

Page 12: Architectural Support for System Software on Large-Scale Clusters

Core Primitives

System software built atop three primitives

Xfer-And-Signal
– Transfer block of data to a set of nodes
– Optionally signal local/remote event upon completion

Compare-And-Write
– Compare global variable on a set of nodes
– Optionally write global variable on the same set of nodes

Test-Event
– Poll local event
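
The three primitives can be pictured with a small in-memory simulation. This is only a sketch under stated assumptions: the class and parameter names (Cluster, Node, key, local_event, remote_event, write) are invented for illustration and are not the QsNet API, and the sketch assumes that the optional write happens only when the comparison holds on every node in the set.

```python
import operator
from dataclasses import dataclass, field


@dataclass
class Node:
    """One simulated node: global variables, receive memory, pending events."""
    variables: dict = field(default_factory=dict)
    memory: dict = field(default_factory=dict)
    events: set = field(default_factory=set)


class Cluster:
    """Hypothetical stand-in for the network; not a real library interface."""

    def __init__(self, n):
        self.nodes = [Node() for _ in range(n)]

    def xfer_and_signal(self, src, dests, key, data,
                        local_event=None, remote_event=None):
        """Copy `data` to every destination, then optionally raise events."""
        for d in dests:
            self.nodes[d].memory[key] = data
            if remote_event is not None:
                self.nodes[d].events.add(remote_event)
        if local_event is not None:
            self.nodes[src].events.add(local_event)

    def compare_and_write(self, dests, var, op, value, write=None):
        """True iff op(node[var], value) holds on all of `dests`.

        Assumption: the optional write, given as (name, new_value), is
        applied to the same node set only when the comparison succeeds.
        Unset variables read as 0.
        """
        ok = all(op(self.nodes[d].variables.get(var, 0), value) for d in dests)
        if ok and write is not None:
            name, new_value = write
            for d in dests:
                self.nodes[d].variables[name] = new_value
        return ok

    def test_event(self, node, event):
        """Poll a local event; a pending event is consumed by the poll."""
        if event in self.nodes[node].events:
            self.nodes[node].events.remove(event)
            return True
        return False


cluster = Cluster(5)
cluster.xfer_and_signal(0, [1, 2, 3, 4], "buf", b"job",
                        local_event="sent", remote_event="arrived")
ready = cluster.compare_and_write([1, 2, 3, 4], "ready", operator.eq, 0,
                                  write=("ready", 1))
```

Keeping the node set as an explicit argument mirrors the slide's wording ("a set of nodes") and makes the collective nature of the first two primitives visible in the call itself.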

Page 13: Architectural Support for System Software on Large-Scale Clusters

Core Primitives

System software built atop three primitives

Xfer-And-Signal (QsNet):
– Node S transfers block of data to nodes D1, D2, D3 and D4
– Events triggered at source and destinations

[Figure: node S multicasts to D1, D2, D3 and D4; a source event fires at S and destination events fire at D1 to D4.]

Page 14: Architectural Support for System Software on Large-Scale Clusters

Core Primitives

System software built atop three primitives

Compare-And-Write (QsNet):
– Node S compares variable V on nodes D1, D2, D3 and D4

[Figure: node S queries nodes D1, D2, D3 and D4.]

•Is V {, , >} to Value?

Page 15: Architectural Support for System Software on Large-Scale Clusters

Core Primitives

System software built atop three primitives

Compare-And-Write (QsNet):
– Node S compares variable V on nodes D1, D2, D3 and D4
– Partial results are combined in the switches

[Figure: the answers from D1, D2, D3 and D4 are merged on their way back to S.]
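
One way to picture "partial results are combined in the switches" is a level-by-level AND reduction over the per-node answers. The sketch below is an illustration only: the function name, the fan-out of 4 and the choice of AND as the combining operation are assumptions, not details taken from the slides or from the QsNet hardware documentation.

```python
def combine_in_tree(leaf_results, fanout=4):
    """AND-combine per-node comparison results one switch level at a time.

    Returns (combined_result, number_of_switch_levels_traversed).
    `fanout` is a hypothetical number of links merged per switch.
    """
    if not leaf_results:
        raise ValueError("need at least one node result")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = list(leaf_results)
    levels = 0
    while len(level) > 1:
        # Each switch forwards a single bit: did all of its children agree?
        level = [all(level[i:i + fanout]) for i in range(0, len(level), fanout)]
        levels += 1
    return level[0], levels


# 64 nodes each answer "is V >= 3?" locally; the switches merge the answers.
values = [5] * 64
answer, depth = combine_in_tree([v >= 3 for v in values])
```

With this scheme the source receives one combined answer after a number of steps that grows with the logarithm of the node count, rather than one reply per node.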

Page 16: Architectural Support for System Software on Large-Scale Clusters

Outline

Motivation and Goals

Introduction

Core Primitives

System Software Design

Case Studies

Concluding remarks

Page 17: Architectural Support for System Software on Large-Scale Clusters

System Software Design … atop the three core primitives

Characteristic | Requirement | Solution
Job Launching | Data Dissemination | Xfer-And-Signal
Job Launching | Flow Control | Compare-And-Write
Job Launching | Termination Detection | Compare-And-Write
Job Scheduling | Heartbeat | Xfer-And-Signal
Job Scheduling | Context Switch | Prioritized messages/multiple rails
Communication | PUT | Xfer-And-Signal
Communication | GET | Xfer-And-Signal
Communication | Barrier | Compare-And-Write
Communication | Broadcast | Compare-And-Write + Xfer-And-Signal
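
As an illustration of one row of the table, a barrier can be sketched on top of Compare-And-Write alone. The variable names (arrived, released), the epoch counter and the master-side polling loop are assumptions made for this sketch; the slides only state that Barrier maps to Compare-And-Write.

```python
import operator


def compare_and_write(node_vars, var, op, value, write=None):
    """Simulated Compare-And-Write over a list of per-node variable dicts.

    Assumption: the optional write (name, new_value) is applied to all
    nodes only if the comparison holds on all of them.
    """
    ok = all(op(v[var], value) for v in node_vars)
    if ok and write is not None:
        name, new_value = write
        for v in node_vars:
            v[name] = new_value
    return ok


def barrier_poll(node_vars, epoch):
    """Master-side step: release everyone once all nodes reached `epoch`."""
    return compare_and_write(node_vars, "arrived", operator.ge, epoch,
                             write=("released", epoch))


# Four nodes; a node enters the barrier by bumping its own `arrived` counter
# and then waits until its `released` variable catches up.
nodes = [{"arrived": 0, "released": 0} for _ in range(4)]
for v in nodes[:3]:
    v["arrived"] = 1
early = barrier_poll(nodes, 1)   # one node is still missing
nodes[3]["arrived"] = 1
late = barrier_poll(nodes, 1)    # everyone has arrived: release is written
```

The comparison and the release write travel in a single collective operation, which is why no per-node acknowledgement messages appear in the sketch.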

Page 18: Architectural Support for System Software on Large-Scale Clusters

System Software Design

Can System Software really be built atop the Core Primitives?

Page 19: Architectural Support for System Software on Large-Scale Clusters

Outline

Motivation and Goals

Introduction

Core Primitives

System Software Design

Case Studies

Concluding remarks

Page 20: Architectural Support for System Software on Large-Scale Clusters

Case Studies

Experimental Setup

Characteristic | Crescendo Cluster | Wolverine Cluster
Nodes | 32 x Dell 1550 | 64 AlphaServer ES40
CPUs/Node | 2 x 1GHz Pentium-III | 4 x 833MHz EV68
Memory/Node | 1 GB | 8 GB
Network Cards | QM-400 Elan3 | QM-400 Elan3
OS | RH 7.3 + QsNet kernel | RH 7.1 + QsNet kernel
Software | Qsnetlibs v1.5.0-0 + Intel C/Fortran Compiler 5.0.1 | Qsnetlibs v1.5.0-0 + Compaq's C Compiler

Page 21: Architectural Support for System Software on Large-Scale Clusters

Case Studies: Job Launching

Job Launching: send/execute/check completion

40 times faster than the best reported result!!!
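
The send/execute/check-completion sequence can be sketched as a single-process simulation. Everything named here (SLOTS, Node, drain, launch, the chunking scheme) is hypothetical and chosen for illustration; the sketch only follows the mapping given earlier in the deck, where data dissemination uses Xfer-And-Signal and both flow control and termination detection use Compare-And-Write.

```python
import operator

SLOTS = 2  # hypothetical number of receive buffers per node


class Node:
    def __init__(self):
        self.slots = {}     # chunk index -> bytes not yet written out
        self.consumed = 0   # global variable: chunks drained so far
        self.image = b""    # binary reassembled on this node
        self.done = 0       # global variable: 1 once the job has finished
        self.peak = 0       # highest buffer occupancy seen (for checking only)

    def drain(self):
        """Node-side work: move the next buffered chunk into the image."""
        if self.consumed in self.slots:
            self.image += self.slots.pop(self.consumed)
            self.consumed += 1


def compare(nodes, var, op, value):
    """Compare-And-Write used without its optional write."""
    return all(op(getattr(n, var), value) for n in nodes)


def xfer(nodes, index, chunk):
    """Xfer-And-Signal: multicast one chunk to every node."""
    for n in nodes:
        n.slots[index] = chunk
        n.peak = max(n.peak, len(n.slots))


def launch(nodes, binary, chunk_size):
    chunks = [binary[i:i + chunk_size] for i in range(0, len(binary), chunk_size)]
    # Send: multicast chunk i only when every node has a free buffer.
    for i, chunk in enumerate(chunks):
        while not compare(nodes, "consumed", operator.gt, i - SLOTS):
            for n in nodes:      # in a real system nodes drain on their own;
                n.drain()        # the simulation drives them from here
        xfer(nodes, i, chunk)
    # Execute: each node finishes reassembly, then "runs" the job.
    for n in nodes:
        while n.consumed < len(chunks):
            n.drain()
        n.done = 1               # stand-in for running to completion
    # Check completion: one collective query instead of per-node replies.
    return compare(nodes, "done", operator.eq, 1)


cluster = [Node() for _ in range(8)]
finished = launch(cluster, bytes(range(256)) * 3, chunk_size=100)
```

The flow-control test `consumed > i - SLOTS` guarantees that no node ever holds more than SLOTS undrained chunks, which is the property the `peak` counter exists to check.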

Page 22: Architectural Support for System Software on Large-Scale Clusters

Case Studies: Job Scheduling

Job Scheduling: Gang Scheduling

Very small time slices: RESPONSIVENESS !!!

Page 23: Architectural Support for System Software on Large-Scale Clusters

Case Studies: BCS-MPI

Communication Library: BCS-MPI

[Figure: timeline of one time slice (hundreds of μs), delimited by a global strobe when the slice starts and another when it ends. Inside the slice, separated by global synchronizations: exchange of communication requirements, communication scheduling, real transmission.]

Page 24: Architectural Support for System Software on Large-Scale Clusters

Case Studies: BCS-MPI

Global synchronization
– Strobe sent at regular intervals (time slices)
  – Compare-And-Write + Xfer-And-Signal (Master)
  – Test-Event (Slaves)
– All system activities are tightly coupled
– Global information is required to schedule resources; global synchronization facilitates the task but it is not enough

Global scheduling
– Exchange of communication requirements
  – Xfer-And-Signal + Test-Event
– Communication scheduling
– Real transmission
  – Xfer-And-Signal + Test-Event

Implementation in Network Interface Card
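
The three steps of a time slice (exchange requirements, schedule, transmit) can be sketched as a matching pass over posted sends and receives. This is a toy model, not the BCS-MPI scheduler: the descriptor layout, the function name run_time_slice and the per-slice budget max_transfers are all assumptions introduced for the example.

```python
from collections import deque


def run_time_slice(pending_sends, pending_recvs, max_transfers):
    """Simulate one time slice.

    pending_sends: deque of (src, dst, tag, payload) descriptors
    pending_recvs: deque of (dst, src, tag) descriptors
    max_transfers: hypothetical cap on transfers scheduled per slice
    Returns the list of (dst, src, tag, payload) delivered in this slice;
    anything unmatched or over budget stays queued for the next slice.
    """
    delivered = []
    deferred = deque()
    while pending_sends:
        send = pending_sends.popleft()
        src, dst, tag, payload = send
        wanted = (dst, src, tag)
        # Scheduling: a send is eligible only if the matching receive
        # has been posted and the slice still has budget.
        if wanted in pending_recvs and len(delivered) < max_transfers:
            pending_recvs.remove(wanted)
            delivered.append((dst, src, tag, payload))  # real transmission
        else:
            deferred.append(send)
    pending_sends.extend(deferred)
    return delivered


sends = deque([(0, 1, 7, b"a"), (2, 3, 9, b"b")])
recvs = deque([(1, 0, 7)])
first = run_time_slice(sends, recvs, max_transfers=4)   # only tag 7 matches
recvs.append((3, 2, 9))                                 # receive posted later
second = run_time_slice(sends, recvs, max_transfers=4)  # tag 9 goes now
```

Because all matching decisions are taken at slice boundaries, the order of deliveries depends only on the order in which descriptors were posted, which is the kind of global ordering the deck relies on when it later mentions deterministic replay.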

Page 25: Architectural Support for System Software on Large-Scale Clusters

Case Studies: BCS-MPI

SWEEP3D and SAGE Performance (IA32)
Production-level MPI versus BCS-MPI

[Figure: two performance plots, annotated "0.5% SPEEDUP" and "2% SPEEDUP".]

Page 26: Architectural Support for System Software on Large-Scale Clusters

Case Studies

System Software atop three Core Primitives:

Performance/Scalability with a simple design

Page 27: Architectural Support for System Software on Large-Scale Clusters

Outline

Motivation and Goals

Introduction

Core Primitives

System Software Design

Case Studies

Concluding remarks

Page 28: Architectural Support for System Software on Large-Scale Clusters

Concluding Remarks

New abstract network interface for system software
– Three communication primitives in the network hardware
– Implement the basics of most system software components
– Simplicity / Performance / Scalability / Responsiveness

Promising preliminary results demonstrate that scalable resource management and parallel application communication are indeed feasible

Page 29: Architectural Support for System Software on Large-Scale Clusters

Future Work

Kernel-level implementation of the core primitives
– User-level solution is already working

Deterministic replay of MPI programs
– Ordered resource scheduling may enforce reproducibility

Transparent fault tolerance
– Global coordination simplifies the state of the machine

Page 31: Architectural Support for System Software on Large-Scale Clusters

Introduction

Designing a Parallel Operating System

Most system software components need to move information between nodes frequently
– Fast and scalable unicast mechanism

All system activities must be tightly coupled by means of …
– … global synchronization
  – Fast and scalable global synchronization mechanisms
– … and global resource scheduling
  – Fast and scalable global information exchange

Hardware support is key to performing these tasks scalably and at sub-millisecond granularity

Page 32: Architectural Support for System Software on Large-Scale Clusters

Case Studies

Experimental Setup

STORM (Scalable TOol for Resource Management)
– Architecture:
  – Set of dæmons running on the management/compute nodes
  – Network abstract layer based on the three core primitives
– Functionality:
  – Job Launching
  – Job Scheduling (FCFS, gang scheduling and others); new scheduling algorithms can be plugged in
  – Resource Accounting

Page 33: Architectural Support for System Software on Large-Scale Clusters

System Software Design … atop the three core primitives

Characteristic | Requirement | Solution
Storage | Metadata Transfer / File Data Transfer | Xfer-And-Signal
Debuggability | Debug Data Transfer | Xfer-And-Signal
Debuggability | Debug Synchronization | Compare-And-Write
Fault Tolerance | Fault Detection | Compare-And-Write
Fault Tolerance | Checkpointing Synchronization | Compare-And-Write
Fault Tolerance | Checkpointing Data Transfer | Xfer-And-Signal
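
To make the Fault Detection row concrete, the sketch below uses a Compare-And-Write style query on a per-node heartbeat counter and narrows down lagging nodes by halving the queried set. The heartbeat counter, the function names and the halving strategy are assumptions for illustration; the slides only say that fault detection is built on Compare-And-Write.

```python
def all_alive(heartbeats, nodes, expected):
    """Compare-And-Write style query: did every node in `nodes` reach `expected`?"""
    return all(heartbeats[n] >= expected for n in nodes)


def find_failed(heartbeats, nodes, expected):
    """Locate lagging nodes by repeatedly halving the queried node set.

    A healthy subset is cleared with a single collective query, so the
    number of queries stays small when few nodes have failed.
    """
    if all_alive(heartbeats, nodes, expected):
        return []
    if len(nodes) == 1:
        return list(nodes)
    mid = len(nodes) // 2
    return (find_failed(heartbeats, nodes[:mid], expected)
            + find_failed(heartbeats, nodes[mid:], expected))


# 16 nodes should all have reached heartbeat 5; nodes 2 and 11 are behind.
beats = [5] * 16
beats[2] = 4
beats[11] = 3
suspects = find_failed(beats, list(range(16)), expected=5)
```

In the common case where nothing has failed, the whole machine is checked with one query, which is what makes a collective comparison attractive for this requirement.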

Page 34: Architectural Support for System Software on Large-Scale Clusters

Case Studies: BCS-MPI

Non-blocking primitives: MPI_Isend/Irecv

Page 35: Architectural Support for System Software on Large-Scale Clusters

Case Studies: BCS-MPI

Blocking primitives: MPI_Send/Recv