Single System Image and Cluster Middleware
Approaches, Infrastructure and Technologies
Recap: Cluster Computer Architecture

[Figure: A cluster computer architecture. Sequential and parallel applications run on top of a parallel programming environment, which is supported by the cluster middleware (single system image and availability infrastructure). The middleware spans the PC/workstation nodes, each with its own communications software and network interface hardware, all connected by a cluster interconnection network/switch.]
Recap: Major Issues in Cluster Design

• Enhanced Performance (performance at low cost)
• Enhanced Availability (failure management)
• Single System Image (look-and-feel of one system)
• Size Scalability (physical and application)
• Fast Communication (networks and protocols)
• Load Balancing (CPU, network, memory, disk)
• Security and Encryption (clusters of clusters)
• Distributed Environment (social issues)
• Manageability (administration and control)
• Programmability (simple API if required)
• Applicability (cluster-aware and non-aware applications)
A Typical Cluster Computing Environment

[Figure: A layered stack: Applications at the top, running over PVM / MPI / RSH, with the hardware/OS at the bottom. Between PVM / MPI / RSH and the hardware/OS sits an unnamed gap: ???]
The missing link is provided by cluster middleware/underware.

[Figure: The same stack with the gap filled: Applications over PVM / MPI / RSH, over Middleware, over the hardware/OS.]
Message Passing Interface (MPI)
Message Passing Interface (MPI) is an API specification that allows processes to communicate with one another by sending and receiving messages. It is typically used for parallel programs running on computer clusters and supercomputers, where the cost of accessing non-local memory is high. MPI is a language-independent communications protocol used to program parallel computers.
Middleware Design Goals

• Complete Transparency (Manageability): offer a single system view of the cluster: a single entry point, ftp, telnet, software loading, and so on.
• Scalable Performance: allow easy growth of the cluster, with no change to the API and with automatic load distribution.
• Enhanced Availability: recover automatically from failures, employing checkpointing and fault-tolerance technologies, and keep replicated data consistent.
What is Single System Image (SSI)?

SSI is the illusion, created by software or hardware, that presents a collection of computing resources as one, more powerful resource. In other words, it is the property of a system that hides the heterogeneous and distributed nature of the available resources and presents them to users and applications as a single unified computing resource.

SSI makes the cluster appear like a single machine to the user, to applications, and to the network.
Cluster Middleware & SSI

SSI is supported by a middleware layer that resides between the OS and the user-level environment. The middleware consists essentially of two sub-layers of software infrastructure:

• SSI infrastructure: glues together the OSs on all nodes to offer unified access to system resources.
• System availability infrastructure: enables cluster services such as checkpointing, automatic failover, recovery from failure, and fault-tolerant support among all nodes of the cluster.
Functional Relationship Among Middleware SSI Modules
Benefits of SSI

• Use of system resources is transparent.
• Transparent process migration and load balancing across nodes.
• Improved reliability and higher availability.
• Improved system response time and performance.
• Simplified system management.
• Reduction in the risk of operator errors.
• No need to be aware of the underlying system architecture to use the machines effectively.
Desired SSI Services/Functions

• Single Entry Point: telnet cluster.my_institute.edu transparently becomes telnet node1.cluster.my_institute.edu (a resolution sketch appears after this list).
• Single User Interface: using the cluster through a single GUI window that provides the look and feel of managing a single resource (e.g., PARMON).
• Single File Hierarchy: /proc, NFS, xFS, AFS, etc.
• Single Control Point: a management GUI.
• Single Virtual Networking.
• Single Memory Space: network RAM / DSM.
• Single Job Management: GLUnix, SGE, LSF.
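One common way to realize a single entry point, assumed here purely for illustration, is round-robin DNS: the cluster name resolves to one of the nodes, so users always address the cluster, never a specific machine. A minimal C sketch:

    /* Sketch: single entry point via round-robin DNS (one possible
       mechanism; the DNS setup behind the slide's example hostname
       is an assumption). Resolving the cluster name yields the
       address of some node.                                        */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        struct addrinfo hints, *res;
        char addr[INET_ADDRSTRLEN];
        int err;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family   = AF_INET;
        hints.ai_socktype = SOCK_STREAM;

        /* DNS hands back the address of one node in the cluster. */
        err = getaddrinfo("cluster.my_institute.edu", "telnet", &hints, &res);
        if (err != 0) {
            fprintf(stderr, "resolve failed: %s\n", gai_strerror(err));
            return 1;
        }
        inet_ntop(AF_INET, &((struct sockaddr_in *)res->ai_addr)->sin_addr,
                  addr, sizeof(addr));
        printf("connecting to node %s\n", addr);   /* e.g., node1 */
        freeaddrinfo(res);
        return 0;
    }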
Availability Support Functions

• Single I/O Space: any node can access any peripheral or disk device without knowledge of its physical location.
• Single Process Space: any process on any node can create processes with cluster-wide process IDs, and processes communicate through signals, pipes, etc., as if they were on a single node.
• Single Global Job Management System.
• Checkpointing and Process Migration: save the process state and intermediate results from memory to disk to support rollback recovery when a node fails, and to let the RMS migrate processes for load balancing (a minimal sketch follows).
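A minimal sketch of application-level checkpointing, for illustration only (SSI systems typically checkpoint at the kernel or library level): the loop counter and partial result are periodically flushed to disk, so after a node failure the job rolls back to the last checkpoint instead of restarting from scratch. The file name and checkpoint interval are arbitrary.

    /* Sketch: application-level checkpointing for rollback recovery. */
    #include <stdio.h>

    #define CKPT_FILE "job.ckpt"

    int main(void)
    {
        long i = 0, sum = 0;

        /* On (re)start, resume from the last checkpoint if one exists. */
        FILE *f = fopen(CKPT_FILE, "rb");
        if (f) {
            if (fread(&i, sizeof(i), 1, f) != 1 ||
                fread(&sum, sizeof(sum), 1, f) != 1)
                i = sum = 0;            /* unreadable checkpoint: start over */
            fclose(f);
        }

        for (; i < 1000000000L; i++) {
            if (i % 100000000L == 0) {  /* periodic checkpoint */
                f = fopen(CKPT_FILE, "wb");
                if (f) {
                    fwrite(&i, sizeof(i), 1, f);
                    fwrite(&sum, sizeof(sum), 1, f);
                    fclose(f);
                }
            }
            sum += i;
        }
        printf("sum = %ld\n", sum);
        return 0;
    }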
SSI Levels

SSI can be provided at different levels of abstraction:

• Application and subsystem level
• Operating system kernel level
• Hardware level
SSI Characteristics

Every SSI has a boundary. Single-system support can exist at different levels within a system, one able to be built on another.
SSI Boundaries

[Figure: an SSI boundary drawn around a batch system. Source: Pfister, In Search of Clusters.]
SSI Middleware Implementation: Layered Approach
SSI at Application and Sub-system Levels

Level: Application
  Examples: batch systems and system management; Google Search Engine
  Boundary: an application
  Importance: what a user wants

Level: Sub-system
  Examples: distributed DB (e.g., Oracle 10g), OSF DME, Lotus Notes, MPI, PVM
  Boundary: a sub-system
  Importance: SSI for all applications of the sub-system

Level: File system
  Examples: Sun NFS, OSF DFS, NetWare, and so on
  Boundary: shared portion of the file system
  Importance: implicitly supports many applications and subsystems

Level: Toolkit
  Examples: OSF DCE, Sun ONC+, Apollo Domain
  Boundary: explicit toolkit facilities: user, service name, time
  Importance: best level of support for heterogeneous systems

© Pfister, In Search of Clusters
SSI at OS Kernel Level

Level: Kernel/OS layer
  Examples: Solaris MC, UnixWare, MOSIX, Sprite, Amoeba/GLUnix
  Boundary: each name space: files, processes, pipes, devices, etc.
  Importance: kernel support for applications and administrative subsystems

Level: Kernel interfaces
  Examples: UNIX (Sun) vnode, Locus (IBM) vproc
  Boundary: each type of kernel object: files, processes, etc.
  Importance: modularizes SSI code within the kernel

Level: Virtual memory
  Examples: none supporting the OS kernel
  Boundary: each distributed virtual memory space
  Importance: may simplify implementation of kernel objects

Level: Microkernel
  Examples: Mach, PARAS, Chorus, OSF/1 AD, Amoeba
  Boundary: each service outside the microkernel
  Importance: implicit SSI for all system services

© Pfister, In Search of Clusters
SSI at Hardware Level

Level: Memory
  Examples: SCI (Scalable Coherent Interface), Stanford DASH
  Boundary: memory space
  Importance: better communication and synchronization

Level: Memory and I/O
  Examples: SCI, SMP techniques
  Boundary: memory and I/O device space
  Importance: lower-overhead cluster I/O

© Pfister, In Search of Clusters
SSI via the OS Path

1. Build SSI as a layer on top of the existing OS.
Benefits: makes the system quickly portable, tracks vendor software upgrades, and reduces development time; new systems can be built quickly by mapping new services onto the functionality provided by the layer beneath. E.g., GLUnix.

2. Build SSI at the kernel level: a true cluster OS.
Good, but it cannot leverage OS improvements from the vendor. E.g., UnixWare, Solaris MC, and MOSIX.
SSI Systems & Tools

• OS level: SCO NSC UnixWare, Solaris MC, MOSIX, ...
• Subsystem level: PVM/MPI, TreadMarks (DSM), GLUnix, Condor, SGE, Nimrod, PBS, ..., Aneka
• Application level: PARMON, Parallel Oracle, Google, ...
UnixWare: NonStop Cluster (NSC) OS

[Figure: NSC architecture. Each UP or SMP node runs standard SCO UnixWare with clustering hooks. On every node, users, applications, and systems management issue standard OS kernel calls, which modular kernel extensions intercept and extend across nodes; nodes and their devices are linked to the other nodes over a ServerNet interconnect.]

http://www.sco.com/products/clustering/
How Does NonStop Clusters Work?

Modular extensions and hooks provide:
• a single cluster-wide filesystem view
• transparent cluster-wide device access
• transparent swap-space sharing
• transparent cluster-wide IPC
• high-performance internode communications
• transparent cluster-wide processes, migration, etc.
• node-down cleanup and resource failover
• transparent cluster-wide parallel TCP/IP networking
• application availability
• cluster-wide membership and cluster time synchronization
• cluster system administration
• load leveling
Sun Solaris MC (Multi-Computers)

Solaris MC: a high-performance operating system for clusters.

• A distributed OS for a multicomputer: a cluster of computing nodes connected by a high-speed interconnect.
• Provides a single system image, making the cluster appear like a single machine to the user, to applications, and to the network.
• Built as a globalization layer on top of the existing Solaris kernel.
• Interesting features: extends the existing Solaris OS; preserves the existing Solaris ABI/API compliance; provides support for high availability; uses C++, IDL, and CORBA in the kernel; leverages Spring OS technology.
Solaris MC: Solaris for MultiComputers

• Global file system
• Globalized process management
• Globalized networking and I/O

[Figure: Solaris MC architecture. Applications enter through the system call interface; the Solaris MC layer (file system, processes, network, and a C++ object framework) sits on top of the existing Solaris 2.5 kernel and reaches other nodes through object invocations.]

http://research.sun.com/techrep/1995/abstract-48.html
Solaris MC Components

• Object and communication support
• High availability support
• PXFS global distributed file system
• Process management
• Networking
MOSIX: Multicomputer OS for UNIX

• An OS module (layer) that provides applications with the illusion of working on a single system.
• Remote operations are performed like local operations.
• Transparent to the application: the user interface is unchanged.

[Figure: the layered stack with MOSIX added: Applications over PVM / MPI / RSH, over MOSIX, over the hardware/OS.]

http://www.mosix.cs.huji.ac.il/ || mosix.org
Key Features of MOSIX

• Supervised by distributed algorithms that respond on-line to global resource availability, transparently.
• Load balancing: migrates processes from over-loaded to under-loaded nodes (see the sketch after this list).
• Memory ushering: migrates processes from a node that has exhausted its memory, to prevent paging/swapping.
• Preemptive process migration that can migrate any process, anywhere, anytime.

Download MOSIX: http://www.mosix.cs.huji.ac.il/
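For illustration only (MOSIX's real algorithms are distributed, online, and far more sophisticated), a sketch of the core load-balancing decision: compare the local node's load against the others and pick a migration target when the imbalance exceeds a threshold. Node count, loads, and the threshold are all made up.

    /* Sketch: a naive centralized load-balancing decision. */
    #include <stdio.h>

    #define NODES     4
    #define THRESHOLD 0.25   /* migrate only if it clearly pays off */

    /* Return the index of the least-loaded node, or -1 to stay put. */
    int pick_migration_target(const double load[], int local)
    {
        int best = local;
        for (int n = 0; n < NODES; n++)
            if (load[n] < load[best])
                best = n;
        return (load[local] - load[best] > THRESHOLD) ? best : -1;
    }

    int main(void)
    {
        double load[NODES] = {0.9, 0.2, 0.5, 0.4};   /* sampled node loads */
        int target = pick_migration_target(load, 0);

        if (target >= 0)
            printf("migrate a process from node 0 to node %d\n", target);
        else
            printf("node 0: load is balanced, no migration\n");
        return 0;
    }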
SSI at Subsystem Level: Resource Management and Scheduling
Resource Management and Scheduling (RMS)

An RMS system is responsible for distributing applications among cluster nodes, enabling effective and efficient utilization of the available resources.

Software components:
• Resource manager: locating and allocating computational resources, authentication, process creation and migration.
• Resource scheduler: queuing applications, resource location and assignment; it instructs the resource manager what to do when (policy).

Reasons for using an RMS:
• Provides increased, reliable throughput of user applications on the system.
• Load balancing.
• Utilizing spare CPU cycles.
• Providing fault-tolerant systems.
• Managing access to powerful systems, etc.

The basic architecture of an RMS is a client-server system (a minimal scheduler sketch appears below).
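To make the scheduler's role concrete, a hypothetical FIFO policy in miniature (not the design of any system listed here): queued jobs are dispatched in arrival order to the first free node.

    /* Sketch: a FIFO resource scheduler. Real RMSs add priorities,
       authentication, accounting, and fault handling.              */
    #include <stdio.h>

    #define NODES 3
    #define JOBS  5

    int main(void)
    {
        int node_busy[NODES] = {0};

        for (int job = 0; job < JOBS; job++) {
            int placed = 0;
            for (int n = 0; n < NODES && !placed; n++) {
                if (!node_busy[n]) {
                    node_busy[n] = 1;   /* resource manager: allocate node */
                    printf("job %d -> node %d\n", job, n);
                    placed = 1;
                }
            }
            if (!placed)
                printf("job %d stays queued (all nodes busy)\n", job);
        }
        return 0;
    }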
Cluster RMS Architecture

[Figure: The user population (User 1 ... User u) submits jobs to a manager node and receives execution results. The manager node runs the job manager, job scheduler, resource manager, and node status monitor, and dispatches work to the computation nodes (Node 1 ... Node c).]
Services Provided by an RMS

• Process migration: when a computational resource becomes too heavily loaded, or out of fault-tolerance concerns.
• Checkpointing.
• Scavenging idle cycles: most workstations are idle 70% to 90% of the time (see the sketch after this list).
• Fault tolerance.
• Minimization of impact on users.
• Load balancing.
• Multiple application queues.
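A sketch of the idea behind cycle scavenging, in the spirit of systems such as Condor (the threshold is hypothetical and the load source is Linux-specific): a workstation accepts guest jobs only while its own load is low.

    /* Sketch: is this workstation idle enough to donate cycles?
       Reads the 1-minute load average from Linux's /proc/loadavg. */
    #include <stdio.h>

    int main(void)
    {
        double load1;
        FILE *f = fopen("/proc/loadavg", "r");

        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fscanf(f, "%lf", &load1) != 1) {
            fclose(f);
            return 1;
        }
        fclose(f);

        if (load1 < 0.3)   /* hypothetical idleness threshold */
            printf("node idle (load %.2f): accept guest jobs\n", load1);
        else
            printf("node busy (load %.2f): suspend/migrate guest jobs\n", load1);
        return 0;
    }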
Some Popular Resource Management Systems

Commercial systems:
• LSF: http://www.platform.com/
• SGE: http://www.sun.com/grid/
• NQE: http://www.cray.com/
• LL (LoadLeveler): http://www.ibm.com/systems/clusters/software/loadleveler/
• PBS: http://www.pbsgridworks.com/

Public-domain systems:
• Alchemi (desktop grids): http://www.alchemi.net
• Condor: http://www.cs.wisc.edu/condor/
• GNQS: http://www.gnqs.org/
Pros and Cons of SSI Approaches

• Hardware: offers the highest level of transparency, but its rigid architecture is not flexible for extending or enhancing the system.
• Operating system: offers full SSI, but is expensive to develop and maintain because of its limited market share. It cannot be built partially: the full functionality must be developed before there is any benefit, so it is risky. E.g., MOSIX and Solaris MC.
• Subsystem level: easy to implement, and benefits the class of applications for which it is designed. E.g., job management systems such as PBS and SGE.
• Application level: easy to realise, but requires each application to be developed as SSI-aware separately. E.g., Google.
Additional References

• R. Buyya, T. Cortes, and H. Jin, "Single System Image", International Journal of High Performance Computing Applications (IJHPCA), Vol. 15, No. 2, Summer 2001.
• G. Pfister, In Search of Clusters, Prentice Hall, USA.
• B. Walker, "Open SSI Linux Cluster Project", http://openssi.org/ssi-intro.pdf