
Page 1: SSI Clusters

1

Single System Image and Cluster Middleware

Approaches, Infrastructure and Technologies

Page 2: SSI Clusters

2

Recap: Cluster Computer Architecture

[Figure: cluster computer architecture. Sequential and parallel applications run on top of a parallel programming environment and the cluster middleware (single system image and availability infrastructure); beneath this, a cluster interconnection network/switch links multiple PC/workstation nodes, each with its own network interface hardware and communications software.]

Page 3: SSI Clusters

3

Recap: Major Issues in Cluster Design

• Enhanced Performance (performance at low cost)

• Enhanced Availability (failure management)

• Single System Image (look-and-feel of one system)

• Size Scalability (physical & application)

• Fast Communication (networks & protocols)

• Load Balancing (CPU, net, memory, disk)

• Security and Encryption (clusters of clusters)

• Distributed Environment (social issues)

• Manageability (administration and control)

• Programmability (simple API if required)

• Applicability (cluster-aware and non-aware applications)

Page 4: SSI Clusters

4

A typical Cluster Computing Environment

[Figure: layer stack, top to bottom: Applications; PVM / MPI / RSH; ??? (the missing layer); Hardware/OS.]

Page 5: SSI Clusters

5

The missing link is provided by cluster middleware/underware

[Figure: the same layer stack with the gap filled, top to bottom: Applications; PVM / MPI / RSH; Middleware; Hardware/OS.]

Page 6: SSI Clusters

Message Passing Interface (MPI)

6

Message Passing Interface (MPI) is an API specification that allows processes to communicate with one another by sending and receiving messages. It is typically used for parallel programs running on computer clusters and supercomputers, where the cost of accessing non-local memory is high. MPI is a language-independent communications protocol used to program parallel computers.
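As a concrete illustration of the message-passing model (not taken from the slides), the sketch below has two MPI processes exchange a single integer: rank 1 sends and rank 0 receives. It assumes any standard MPI implementation (e.g., MPICH or Open MPI), built with the usual mpicc wrapper and launched with mpirun.

    /* Minimal MPI sketch: point-to-point send/receive between two ranks.
     * Build with an MPI compiler wrapper, e.g.  mpicc mpi_hello.c,  and run
     * with  mpirun -np 2 ./a.out  (launcher names vary by implementation). */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);                   /* start the MPI runtime   */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* who am I?               */
        MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes?     */

        if (size >= 2) {
            if (rank == 1) {
                int payload = 42;
                /* Explicit message passing: no shared memory is assumed. */
                MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            } else if (rank == 0) {
                int payload = 0;
                MPI_Recv(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("rank 0 received %d from rank 1 (of %d ranks)\n",
                       payload, size);
            }
        }
        MPI_Finalize();                           /* shut down cleanly       */
        return 0;
    }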

Page 7: SSI Clusters

7

Middleware Design Goals

• Complete Transparency (Manageability): offer a single system view of the cluster (single entry point, ftp, telnet, software loading, etc.).

• Scalable Performance: easy growth of the cluster, with no change of API and automatic load distribution.

• Enhanced Availability: automatic recovery from failures, employing checkpointing and fault-tolerance technologies; handles consistency of data when replicated.

Page 8: SSI Clusters

8

What is Single System Image (SSI)?

SSI is the illusion, created by software or hardware, that presents a collection of computing resources as one whole resource. In other words, it is the property of a system that hides the heterogeneous and distributed nature of the available resources and presents them to users and applications as a single unified computing resource.

SSI makes the cluster appear like a single machine to the user, to applications, and to the network.

Page 9: SSI Clusters

9

Cluster Middleware & SSI

SSI is supported by a middleware layer that resides between the OS and the user-level environment. The middleware consists of essentially two sub-layers of software infrastructure:

• SSI infrastructure: glues together the OSs on all nodes to offer unified access to system resources.

• System availability infrastructure: enables cluster services such as checkpointing, automatic failover, recovery from failure, and fault-tolerant support among all nodes of the cluster.

Page 10: SSI Clusters

10

Functional Relationship Among Middleware SSI Modules

Page 11: SSI Clusters

11

Benefits of SSI

• Transparent use of system resources.

• Transparent process migration and load balancing across nodes.

• Improved reliability and higher availability.

• Improved system response time and performance.

• Simplified system management.

• Reduction in the risk of operator errors.

• No need to be aware of the underlying system architecture to use these machines effectively.

Page 12: SSI Clusters

12

Desired SSI Services/Functions

• Single Entry Point: users connect to the cluster as a whole (e.g., telnet cluster.my_institute.edu) rather than to an individual node (e.g., telnet node1.cluster.my_institute.edu); a minimal name-lookup sketch follows this list.

• Single User Interface: using the cluster through a single GUI window that provides the look and feel of managing a single resource (e.g., PARMON).

• Single File Hierarchy: /proc, NFS, xFS, AFS, etc.

• Single Control Point: management GUI.

• Single Virtual Networking.

• Single Memory Space: network RAM / DSM.

• Single Job Management: GLUnix, SGE, LSF.
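As a rough illustration of what a single entry point can mean in practice, the sketch below resolves a cluster-wide alias and lists the addresses it maps to, as with round-robin DNS in front of several login nodes. The hostname is the hypothetical one from the slide, and DNS aliasing is only one way an entry point can be implemented.

    /* Minimal sketch: a single logical name for the cluster may resolve to
     * several physical nodes. The hostname below is a hypothetical placeholder
     * taken from the slide, not a real cluster. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netdb.h>
    #include <arpa/inet.h>

    int main(void) {
        struct addrinfo hints, *res, *p;
        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_INET;          /* IPv4 only, to keep it short */
        hints.ai_socktype = SOCK_STREAM;

        /* One logical entry point for the whole cluster ... */
        if (getaddrinfo("cluster.my_institute.edu", "ssh", &hints, &res) != 0) {
            fprintf(stderr, "lookup failed (hostname is only illustrative)\n");
            return 1;
        }
        /* ... may map to several candidate nodes. */
        for (p = res; p != NULL; p = p->ai_next) {
            char ip[INET_ADDRSTRLEN];
            struct sockaddr_in *sa = (struct sockaddr_in *)p->ai_addr;
            inet_ntop(AF_INET, &sa->sin_addr, ip, sizeof ip);
            printf("candidate node: %s\n", ip);
        }
        freeaddrinfo(res);
        return 0;
    }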

Page 13: SSI Clusters

13

Availability Support Functions

• Single I/O Space: any node can access any peripheral or disk device without knowledge of its physical location.

• Single Process Space: any process on any node can create processes with cluster-wide process IDs, and processes communicate through signals, pipes, etc., as if they were on a single node.

• Single Global Job Management System.

• Checkpointing and process migration: save process state and intermediate results in memory or to disk, to support rollback recovery when a node fails and RMS load balancing; a minimal checkpointing sketch follows below.
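The sketch below shows the rollback-recovery idea under simple assumptions: a long-running loop periodically writes its state to a (hypothetical) checkpoint file so that a restarted run resumes from the last checkpoint instead of from scratch. Real cluster checkpointing systems capture whole process images transparently; this is only the principle in miniature.

    /* Minimal sketch of application-level checkpointing: periodically save
     * loop state to disk so a restarted run can roll forward from the last
     * checkpoint. The file name and interval are illustrative. */
    #include <stdio.h>

    #define CKPT_FILE "checkpoint.dat"   /* hypothetical checkpoint location */
    #define N_STEPS   1000000L

    int main(void) {
        long step = 0;
        double partial = 0.0;

        /* Recover state from the last checkpoint, if one exists. */
        FILE *f = fopen(CKPT_FILE, "r");
        if (f) {
            if (fscanf(f, "%ld %lf", &step, &partial) != 2) { step = 0; partial = 0.0; }
            fclose(f);
        }

        for (; step < N_STEPS; step++) {
            if (step % 10000 == 0) {         /* checkpoint every 10k steps   */
                f = fopen(CKPT_FILE, "w");
                if (f) {
                    /* The checkpoint holds all work completed before this step. */
                    fprintf(f, "%ld %f\n", step, partial);
                    fclose(f);
                }
            }
            partial += 1.0 / (step + 1);     /* stand-in for real work       */
        }
        printf("result = %f\n", partial);
        remove(CKPT_FILE);                   /* done; discard the checkpoint */
        return 0;
    }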

Page 14: SSI Clusters

14

SSI Levels

SSI levels of abstraction:

Application and Subsystem Level

Operating System Kernel Level

Hardware Level

Page 15: SSI Clusters

15

SSI Characteristics

Every SSI has a boundary. Single system support can exist at different levels within a system, one able to be built on another.

Page 16: SSI Clusters

16

SSI Boundaries

[Figure: an example SSI boundary drawn around a batch system. Source: In Search of Clusters.]

Page 17: SSI Clusters

17

SSI Middleware Implementation: Layered Approach

Page 18: SSI Clusters

18

SSI at Application and Sub-system Levels

Level: Application
  Examples: batch systems and system management; Google Search Engine
  Boundary: an application
  Importance: what a user wants

Level: Sub-system
  Examples: distributed DB (e.g., Oracle 10g), OSF DME, Lotus Notes, MPI, PVM
  Boundary: a sub-system
  Importance: SSI for all applications of the sub-system

Level: File system
  Examples: Sun NFS, OSF DFS, NetWare, and so on
  Boundary: shared portion of the file system
  Importance: implicitly supports many applications and subsystems

Level: Toolkit
  Examples: OSF DCE, Sun ONC+, Apollo Domain
  Boundary: explicit toolkit facilities: user, service name, time
  Importance: best level of support for heterogeneous systems

© Pfister, In search of clusters

Page 19: SSI Clusters

19

SSI at OS Kernel Level

Level: Kernel/OS layer
  Examples: Solaris MC, UnixWare, MOSIX, Sprite, Amoeba/GLUnix
  Boundary: each name space: files, processes, pipes, devices, etc.
  Importance: kernel support for applications and administrative subsystems

Level: Kernel interfaces
  Examples: UNIX (Sun) vnode, Locus (IBM) vproc
  Boundary: type of kernel objects: files, processes, etc.
  Importance: modularizes SSI code within the kernel

Level: Virtual memory
  Examples: none supporting the OS kernel
  Boundary: each distributed virtual memory space
  Importance: may simplify implementation of kernel objects

Level: Microkernel
  Examples: Mach, PARAS, Chorus, OSF/1 AD, Amoeba
  Boundary: each service outside the microkernel
  Importance: implicit SSI for all system services

© Pfister, In search of clusters

Page 20: SSI Clusters

20

SSI at Hardware Level (memory and I/O)

Level: Memory
  Examples: SCI (Scalable Coherent Interface), Stanford DASH
  Boundary: memory space
  Importance: better communication and synchronization

Level: Memory and I/O
  Examples: SCI, SMP techniques
  Boundary: memory and I/O device space
  Importance: lower overhead cluster I/O

[Figure: the hardware level sits beneath the operating system kernel level and the application and subsystem levels.]

© Pfister, In search of clusters

Page 21: SSI Clusters

21

SSI via OS path!

1. Build SSI as a layer on top of the existing OS. Benefits: makes the system quickly portable, tracks vendor software upgrades, and reduces development time; i.e., new systems can be built quickly by mapping new services onto the functionality provided by the layer beneath. E.g., GLUnix.

2. Build SSI at the kernel level, as a true cluster OS. Good, but cannot leverage OS improvements from the vendor. E.g., UnixWare, Solaris MC, and MOSIX.

Page 22: SSI Clusters

22

SSI Systems & Tools

OS level: SCO NSC UnixWare, Solaris MC, MOSIX, ...

Subsystem level: PVM/MPI, TreadMarks (DSM), GLUnix, Condor, SGE, Nimrod, PBS, ..., Aneka

Application level: PARMON, Parallel Oracle, Google, ...

Page 23: SSI Clusters

UnixWare: NonStop Cluster (NSC) OS

[Figure: NonStop Cluster architecture. Each UP or SMP node runs standard SCO UnixWare with clustering hooks; users, applications, and systems management issue standard OS kernel calls, which modular kernel extensions intercept. The nodes and their devices are linked to the other nodes over ServerNet.]

http://www.sco.com/products/clustering/

Page 24: SSI Clusters

How does NonStop Clusters Work?

Modular extensions and hooks provide:

• single cluster-wide filesystem view;
• transparent cluster-wide device access;
• transparent swap-space sharing;
• transparent cluster-wide IPC;
• high-performance internode communications;
• transparent cluster-wide processes, migration, etc.;
• node-down cleanup and resource failover;
• transparent cluster-wide parallel TCP/IP networking;
• application availability;
• cluster-wide membership and cluster time sync;
• cluster system administration;
• load leveling.

Page 25: SSI Clusters

25

Sun Solaris MC (Multi-Computers)

Solaris MC: A High Performance Operating System for Clusters

A distributed OS for a multicomputer, a cluster of computing nodes connected by a high-speed interconnect

Provides a single system image, making the cluster appear like a single machine to the user, to applications, and to the network

Built as a globalization layer on top of the existing Solaris kernel

Interesting features:

• extends the existing Solaris OS
• preserves the existing Solaris ABI/API compliance
• provides support for high availability
• uses C++, IDL, and CORBA in the kernel
• leverages Spring OS technology

Page 26: SSI Clusters

26

Solaris-MC: Solaris for MultiComputers

global file system

globalized process management

globalized networking and I/O

Solaris MC Architecture

[Figure: applications issue calls through the system call interface; the Solaris MC layer (a C++ object framework with file system, process, and networking modules) sits on the existing Solaris 2.5 kernel and reaches other nodes via object invocations.]

http://research.sun.com/techrep/1995/abstract-48.html

Page 27: SSI Clusters

27

Solaris MC components

Object and communication support

High availability support

PXFS global distributed file system

Process management

Networking

[Figure: Solaris MC architecture (same diagram as on the previous slide).]

Page 28: SSI Clusters

28

MOSIX: Multicomputer OS for UNIX

An OS module (layer) that provides the applications with the illusion of working on a single system.

Remote operations are performed like local operations.

Transparent to the application - user interface unchanged.

[Figure: layer stack, top to bottom: Application; PVM / MPI / RSH; MOSIX; Hardware/OS.]

http://www.mosix.cs.huji.ac.il/ || mosix.org

Page 29: SSI Clusters

29

Key Features of MOSIX

Supervised by distributed algorithms that respond on-line to global resource availability, transparently.

Load balancing: migrate processes from over-loaded to under-loaded nodes (a minimal node-selection sketch is shown below).

Memory ushering: migrate processes from a node that has exhausted its memory, to prevent paging/swapping.

Preemptive process migration that can migrate any process, anywhere, anytime.

Download MOSIX: http://www.mosix.cs.huji.ac.il/
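To make the load-balancing idea concrete, here is a minimal sketch of the migration decision itself: pick the least-loaded node as the target. The node names and load values are invented for illustration; MOSIX actually gathers this information on-line with distributed algorithms and migrates processes preemptively in the kernel.

    /* Minimal sketch of a load-balancing decision: given per-node load
     * estimates, pick the least-loaded node as a migration target.
     * The node table is invented; it is not how MOSIX stores load data. */
    #include <stdio.h>

    struct node {
        const char *name;
        double load;        /* e.g., normalized run-queue length */
    };

    /* Return the index of the least-loaded node. */
    static int pick_target(const struct node *nodes, int n) {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (nodes[i].load < nodes[best].load)
                best = i;
        return best;
    }

    int main(void) {
        struct node cluster[] = {
            { "node1", 1.8 }, { "node2", 0.3 }, { "node3", 0.9 }  /* hypothetical */
        };
        int n = (int)(sizeof cluster / sizeof cluster[0]);
        int target = pick_target(cluster, n);
        printf("migrate next process to %s (load %.1f)\n",
               cluster[target].name, cluster[target].load);
        return 0;
    }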

Page 30: SSI Clusters

30

SSI at Subsystem Level

Resource Management and Scheduling

Page 31: SSI Clusters

31

Resource Management and Scheduling (RMS)

The RMS is responsible for distributing applications among cluster nodes. It enables the effective and efficient utilization of the available resources.

Software components:

• Resource manager: locating and allocating computational resources, authentication, process creation and migration.

• Resource scheduler: queuing applications, resource location and assignment; it instructs the resource manager what to do and when (policy).

Reasons for using an RMS:

• Provide increased, and reliable, throughput of user applications on the system.

• Load balancing.

• Utilizing spare CPU cycles.

• Providing fault-tolerant systems.

• Managing access to powerful systems, etc.

Basic architecture of an RMS: a client-server system (a minimal sketch of the scheduler/manager split follows below).
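The sketch below illustrates that split under simplifying assumptions: a FIFO scheduler (policy) walks a job queue and asks a toy resource-manager function (mechanism) to start each job on the first free node. Jobs and node counts are invented; production systems such as PBS, SGE, and LSF implement far richer policies.

    /* Minimal sketch of the scheduler/manager split in an RMS. */
    #include <stdio.h>

    #define N_NODES 3

    struct job { const char *name; };

    static int node_busy[N_NODES];               /* 0 = free, 1 = running a job */

    /* "Resource manager" role: the mechanism that starts a job on a node. */
    static void run_on_node(const struct job *j, int node) {
        node_busy[node] = 1;
        printf("starting %s on node %d\n", j->name, node);
    }

    /* "Resource scheduler" role: the policy - FIFO order, first free node. */
    static void schedule(const struct job *queue, int n_jobs) {
        for (int j = 0; j < n_jobs; j++) {
            int placed = 0;
            for (int node = 0; node < N_NODES && !placed; node++) {
                if (!node_busy[node]) {
                    run_on_node(&queue[j], node);
                    placed = 1;
                }
            }
            if (!placed)
                printf("holding %s in the queue: no free node yet\n",
                       queue[j].name);
        }
    }

    int main(void) {
        struct job queue[] = { {"sim-a"}, {"sim-b"}, {"render"}, {"analysis"} };
        schedule(queue, (int)(sizeof queue / sizeof queue[0]));
        return 0;
    }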

Page 32: SSI Clusters

32

Cluster RMS Architecture

[Figure: cluster RMS architecture. Users in the user population submit jobs to, and receive execution results from, a manager node hosting the job manager, job scheduler, resource manager, and node status monitor; the manager node dispatches work across computation nodes 1..c.]

Page 33: SSI Clusters

33

Services provided by RMS

• Process migration: when a computational resource becomes too heavily loaded, or for fault-tolerance reasons.

• Checkpointing.

• Scavenging idle cycles: most workstations are idle 70% to 90% of the time.

• Fault tolerance.

• Minimization of impact on users.

• Load balancing.

• Multiple application queues.

Page 34: SSI Clusters

34

Some Popular Resource Management Systems

Commercial systems (project - URL):

• LSF - http://www.platform.com/
• SGE - http://www.sun.com/grid/
• NQE - http://www.cray.com/
• LL (LoadLeveler) - http://www.ibm.com/systems/clusters/software/loadleveler/
• PBS - http://www.pbsgridworks.com/

Public-domain systems (project - URL):

• Alchemi (desktop grids) - http://www.alchemi.net
• Condor - http://www.cs.wisc.edu/condor/
• GNQS - http://www.gnqs.org/

Page 35: SSI Clusters

35

Pros and Cons of SSI Approaches

Hardware: offers the highest level of transparency, but has a rigid architecture that is not flexible when extending or enhancing the system.

Operating system: offers full SSI, but is expensive to develop and maintain due to limited market share; it cannot be developed partially, since the full functionality must be built to gain the benefit, so it can be risky. E.g., MOSIX and Solaris MC.

Subsystem level: easy to implement, and benefits the class of applications for which it is designed. E.g., job management systems such as PBS and SGE.

Application level: easy to realise, but requires that each application be developed as SSI-aware separately. E.g., Google.

Page 36: SSI Clusters

36

Additional References

R. Buyya, T. Cortes, and H. Jin, Single System Image, International Journal of High-Performance Computing Applications (IJHPCA), Volume 15, No. 2, Summer 2001.

G. Pfister, In Search of Clusters, Prentice Hall, USA.

B. Walker, Open SSI Linux Cluster Project: http://openssi.org/ssi-intro.pdf