Lecture 11: Unix Clusters. Assoc. Prof. Guntis Barzdins, Assist. Girts Folkmanis. University of Latvia, Dec 10, 2004.


Page 1: Lecture 11: Unix Clusters

Lecture 11: Unix Clusters

Assoc. Prof. Guntis Barzdins, Assist. Girts Folkmanis

University of Latvia, Dec 10, 2004

Page 2: Lecture 11: Unix Clusters

Moore’s Law - Density

Page 3: Lecture 11: Unix Clusters

Moore's Law and Performance

• The performance of computers is determined by architecture and clock speed.
• Clock speed doubles over a 3-year period due to on-chip scaling laws.
• Processors using identical or similar architectures gain performance directly as a function of Moore's Law.
• Improvements in internal architecture can yield gains beyond what Moore's Law alone provides.

Page 4: Lecture 11: Unix Clusters

Future of Moore's Law

Short-term (1-5 years):
• Will keep operating (prototypes already exist in the lab)
• Fabrication cost will go up rapidly

Medium-term (5-15 years):
• Exponential growth rate will likely slow
• A trillion-dollar industry is motivated to keep it going

Long-term (>15 years):
• May need new technology (chemical or quantum)
• We can do better (e.g., the human brain)
• "I would not close the patent office"

Page 5: Lecture 11: Unix Clusters

Different kinds of PC cluster

• High Performance Computing cluster
• Load Balancing cluster
• High Availability cluster

Page 6: Lecture 11: Unix Clusters

High Performance Computing Cluster (Beowulf)

• Started in 1994, when Donald Becker of NASA assembled the world's first cluster from 16 DX4 PCs and 10 Mb/s Ethernet
• Also called a Beowulf cluster
• Built from commodity off-the-shelf hardware
• Applications: data mining, simulations, parallel processing, weather modelling, computer graphics rendering, etc.

Page 7: Lecture 11: Unix Clusters

Examples of Beowulf cluster

• Scyld Cluster O.S. by Donald Becker: http://www.scyld.com
• ROCKS from NPACI: http://www.rocksclusters.org
• OSCAR from the open cluster group: http://oscar.sourceforge.net
• OpenSCE from Thailand: http://www.opensce.org

Page 8: Lecture 11: Unix Clusters

Cluster Sizing Rule of Thumb

System software (Linux, MPI, filesystems, etc.) scales from 64 nodes up to at most about 2048 nodes for most HPC applications, limited by:
• Maximum socket connections
• Direct-access message tag lists and buffers
• NFS / storage system clients
• Debugging
• Etc.

It is probably hard to rewrite MPI and all Linux system software for O(100,000)-node clusters.

Page 9: Lecture 11: Unix Clusters
Page 10: Lecture 11: Unix Clusters

Apple Xserve G5 with Xgrid Environment

• Alternative to a Beowulf PC cluster
• Server node + 10 compute nodes
• Dual-CPU G5 processors (2 GHz, 1 GB memory)
• Gigabit Ethernet interconnect
• 3 TB Xserve RAID array
• Xgrid offers an 'easy' pool-of-processors computing model
• MPI available for heritage code


Page 11: Lecture 11: Unix Clusters

Xgrid Computing Environment

Suitable for loosely coupled distributed computing:
• The controller distributes tasks to agent processors (tasks include data and code)
• Collects results when agents finish
• Distributes more chunks to agents as they become free and join the cluster/grid (a generic master/worker sketch follows below)

Diagram components: Xgrid client, Xgrid controller, Xgrid agents, server storage.
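
Xgrid's actual protocol is not shown in the slides, so the following is only a rough, hypothetical sketch in C of the controller/agent pattern described above: a parent process (the "controller") keeps a fixed pool of worker processes (the "agents") busy, handing out the next work chunk as soon as any worker finishes. The pool size, chunk count, and process_chunk() are made-up placeholders, not Xgrid APIs.

/* Minimal master/worker sketch (not Xgrid): the parent hands out one
 * chunk per worker process and assigns the next chunk as soon as any
 * worker finishes, similar in spirit to a controller feeding agents. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_WORKERS 4   /* illustrative pool size */
#define NUM_CHUNKS  16  /* illustrative number of work chunks */

static void process_chunk(int chunk)
{
    /* Placeholder for real work (data + code would come from the controller). */
    printf("worker %d processing chunk %d\n", (int)getpid(), chunk);
    sleep(1);
}

int main(void)
{
    int next_chunk = 0, running = 0;

    /* Keep up to NUM_WORKERS chunks in flight; refill as workers exit. */
    while (next_chunk < NUM_CHUNKS || running > 0) {
        while (running < NUM_WORKERS && next_chunk < NUM_CHUNKS) {
            pid_t pid = fork();
            if (pid == 0) {             /* child: acts as an "agent" */
                process_chunk(next_chunk);
                _exit(0);
            } else if (pid > 0) {
                running++;
                next_chunk++;
            } else {
                perror("fork");
                exit(1);
            }
        }
        if (running > 0) {
            wait(NULL);                 /* collect a finished "agent" */
            running--;
        }
    }
    return 0;
}

In the real Xgrid environment the controller also ships the data and code for each task over the network to the agents; this sketch only reproduces the scheduling pattern.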

Page 12: Lecture 11: Unix Clusters

Xgrid Work Flow

Page 13: Lecture 11: Unix Clusters

Cluster Status

• Offline: turned off
• Unavailable: turned on, but busy with other non-cluster tasks
• Working: computing on this cluster job
• Available: waiting to be assigned cluster work

Page 14: Lecture 11: Unix Clusters

Cluster Status Displays

The tachometer illustrates the total processing power available to the cluster at any time.

The level will change when running on a cluster of desktop workstations, but will stay steady when monitoring a dedicated cluster.

Rocky's Tachy Tach

Page 15: Lecture 11: Unix Clusters

Load Balancing Cluster

• A PC cluster can deliver load-balancing performance
• Commonly used for busy FTP and web servers with a large client base
• A large number of nodes share the load

Page 16: Lecture 11: Unix Clusters
Page 17: Lecture 11: Unix Clusters

High Availability Cluster

• Avoid downtime of services
• Avoid single points of failure
• Always built with redundancy
• Almost all load-balancing clusters also have HA capability

Page 18: Lecture 11: Unix Clusters

Examples of Load Balancing and High Availability Cluster

• RedHat HA cluster: http://ha.redhat.com
• Turbolinux Cluster Server: http://www.turbolinux.com/products/tcs
• Linux Virtual Server Project: http://www.linuxvirtualserver.org/

Page 19: Lecture 11: Unix Clusters

High Availability Approach: Redundancy + Failover

• Redundancy eliminates single points of failure (SPOF)
• Automatic detection of failures (hardware, network, applications)
• Automatic recovery from failures (no human intervention)
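
As a hedged illustration of the auto-detect and auto-recover idea above (not the Linux-HA Heartbeat implementation), here is a toy monitor in C: the standby node checks the active node with a single ping and, after three consecutive misses, runs a takeover action. The peer address, interval, threshold, and take_over_services() are all hypothetical.

/* Toy failover monitor (illustration only, not Heartbeat/Linux-HA code):
 * the standby node periodically checks the active node and, after
 * MAX_MISSES consecutive failures, runs a takeover action. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHECK_INTERVAL 2   /* seconds between heartbeat checks */
#define MAX_MISSES     3   /* consecutive misses before failover */

/* Returns nonzero if the peer answered; one ICMP echo serves as the "heartbeat". */
static int peer_alive(const char *peer)
{
    char cmd[256];
    snprintf(cmd, sizeof cmd, "ping -c 1 -W 1 %s > /dev/null 2>&1", peer);
    return system(cmd) == 0;
}

static void take_over_services(void)
{
    /* Placeholder: acquire the virtual IP, mount shared storage,
     * start the services that were running on the failed node. */
    printf("peer is down: taking over services\n");
}

int main(void)
{
    const char *peer = "10.0.0.1";     /* hypothetical active node */
    int misses = 0;

    while (1) {
        if (peer_alive(peer)) {
            misses = 0;                /* healthy: reset the counter */
        } else if (++misses >= MAX_MISSES) {
            take_over_services();      /* automatic recovery, no human step */
            break;
        }
        sleep(CHECK_INTERVAL);
    }
    return 0;
}

A production HA stack would use dedicated heartbeat channels and guard against split-brain situations, which this sketch ignores.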

Page 20: Lecture 11: Unix Clusters

Real-Time Disk Replication: DRBD (Distributed Replicated Block Device)

Page 21: Lecture 11: Unix Clusters

IBM Supported Solutions

Linux-HA (Heartbeat)
• Open-source project
• Multi-platform solution for IBM eServers, Solaris, BSD
• Packaged with several Linux distributions
• Strong focus on ease of use, security, simplicity, low cost
• > 10K clusters in production since 1999

Tivoli System Automation (TSA) for Multi-Platform
• Proprietary IBM solution
• Used across all eServers, and on ia32 from any vendor
• Available on Linux, AIX, OS/400
• Rules-based recovery system
• Over 1000 licenses since 2003

Page 22: Lecture 11: Unix Clusters
Page 23: Lecture 11: Unix Clusters

HPCC Cluster and parallel computing applications

• Message passing: MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/), LAM/MPI (http://lam-mpi.org) (see the MPI sketch after this list)
• Mathematical: fftw (fast Fourier transform), pblas (parallel basic linear algebra software), atlas (a collection of mathematical libraries), sprng (scalable parallel random number generator), MPITB (MPI toolbox for MATLAB)
• Quantum chemistry: Gaussian, Q-Chem
• Molecular dynamics: NAMD, GROMACS, GAMESS
• Weather modelling: MM5 (http://www.mmm.ucar.edu/mm5/mm5-home.html)
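
To give a feel for the message-passing model that MPICH and LAM/MPI implement, here is a minimal MPI sketch in C (standard MPI calls; the problem size is an arbitrary example): every rank sums its share of 1..N and rank 0 collects the total with MPI_Reduce.

/* Minimal MPI sketch: each process sums a slice of 1..N and rank 0
 * gathers the grand total with MPI_Reduce. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const long N = 1000000;            /* illustrative problem size */
    int rank, size;
    long i, local = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank sums every size-th number starting at rank+1. */
    for (i = rank + 1; i <= N; i += size)
        local += i;

    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum(1..%ld) = %ld (computed on %d processes)\n", N, total, size);

    MPI_Finalize();
    return 0;
}

With either MPI implementation this would typically be built with mpicc and started across the cluster with something like mpirun -np 8 ./sum.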

Page 24: Lecture 11: Unix Clusters

MOSIX and openMosix

MOSIX: MOSIX is a software package that enhances the Linux kernel with cluster capabilities. The enhanced kernel supports clusters of any size built from x86/Pentium-based boxes. MOSIX allows the automatic and transparent migration of processes to other nodes in the cluster, while standard Linux process-control utilities, such as 'ps', show all processes as if they were running on the node the process originated from.

openMosix: openMosix is a spin-off of the original MOSIX. The first version of openMosix is fully compatible with the last version of MOSIX, but it is going to go in its own direction.

Page 25: Lecture 11: Unix Clusters

MOSIX architecture (3/9)

Preemptive process migration:
• Any user's process, transparently and at any time, can migrate to any available node.
• The migrating process is divided into two contexts:
  • the system context (deputy), which may not be migrated from the "home" workstation (UHN);
  • the user context (remote), which can be migrated to a diskless node.

Page 26: Lecture 11: Unix Clusters

MOSIX architecture (4/9)

Preemptive process migration

(Diagram: a process migrating between the master node and a diskless node.)

Page 27: Lecture 11: Unix Clusters

Multi-CPU Servers

Page 28: Lecture 11: Unix Clusters

Benchmark - Memory (STREAM)

                                        1x Stream:         2x Stream:        4x Stream:
2x Opteron, 1.8 GHz, HyperTransport:    1006 - 1671 MB/s   975 - 1178 MB/s   924 - 1133 MB/s
2x Xeon, 2.4 GHz, 400 MHz FSB:          1202 - 1404 MB/s   561 - 785 MB/s    365 - 753 MB/s

(Both systems: 4x DIMM, 1 GB DDR266; Avent Techn.)
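
The bandwidth numbers above come from the STREAM benchmark; purely as an illustration of what such a test measures, here is a simplified triad-style loop in C (array size and repetition count are arbitrary, and this toy does not follow the official STREAM rules).

/* Simplified STREAM-triad-style memory bandwidth estimate:
 * a[i] = b[i] + scalar * c[i], timed over several repetitions. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    (8 * 1000 * 1000)   /* about 64 MB per double array, well past cache */
#define REPS 10

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    double scalar = 3.0;
    long i;
    int r;

    if (!a || !b || !c) { fprintf(stderr, "out of memory\n"); return 1; }

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t start = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];     /* triad kernel: 2 reads + 1 write */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* 3 arrays of N doubles move per repetition. */
    double mbytes = (double)REPS * 3.0 * N * sizeof(double) / 1e6;
    printf("approx. bandwidth: %.0f MB/s\n", mbytes / secs);

    free(a); free(b); free(c);
    return 0;
}

Running 1, 2, or 4 such loops concurrently corresponds to the 1x/2x/4x columns; the Opteron degrades less because each CPU has its own memory controller and HyperTransport links, whereas the Xeons share a single front-side bus.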

Page 29: Lecture 11: Unix Clusters

Sybase DBMS Performance

Page 30: Lecture 11: Unix Clusters

Multi-CPU Hardware and Software

Page 31: Lecture 11: Unix Clusters

Service Processor (SP)

• Dedicated on-board SP, PowerPC-based
• Own IP name/address
• Front panel
• Command-line interface
• Web server

Remote administration:
• System status
• Boot / reset / shutdown
• Flash the BIOS

Page 32: Lecture 11: Unix Clusters

Unix Scheduling

Page 33: Lecture 11: Unix Clusters

Process Scheduling

When to run the scheduler:
1. Process creation
2. Process exit
3. Process blocks
4. System interrupt

• Non-preemptive: a process runs until it blocks or gives up the CPU (cases 1-3)
• Preemptive: a process runs for some time unit, then the scheduler selects the next process to run (cases 1-4); see the sketch below
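
To make the preemptive case concrete, here is a tiny round-robin simulation in C (process names, burst lengths, and the quantum are made-up values): each process runs for at most one quantum before the scheduler moves on to the next runnable process.

/* Toy round-robin (preemptive) scheduling simulation: each process gets
 * at most QUANTUM time units before the scheduler picks the next one. */
#include <stdio.h>

#define QUANTUM 2
#define NPROCS  3

int main(void)
{
    const char *name[NPROCS] = { "A", "B", "C" };
    int remaining[NPROCS]    = { 5, 3, 4 };   /* made-up CPU bursts */
    int time = 0, done = 0, p = 0;

    while (done < NPROCS) {
        if (remaining[p] > 0) {
            int run = remaining[p] < QUANTUM ? remaining[p] : QUANTUM;
            printf("t=%2d: process %s runs for %d unit(s)\n", time, name[p], run);
            time += run;
            remaining[p] -= run;
            if (remaining[p] == 0) {
                printf("t=%2d: process %s exits\n", time, name[p]);
                done++;
            }
        }
        p = (p + 1) % NPROCS;   /* preempt: move to the next runnable process */
    }
    return 0;
}

With these values the output interleaves A, B, and C until each burst is used up, which is exactly the "runs for some time unit, then the scheduler selects a process" behaviour described above.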

Page 34: Lecture 11: Unix Clusters

Solaris Overview

• Multithreaded, symmetric multi-processing (SMP)
• Preemptive kernel with protected data structures
• Interrupts handled using threads
• MP support: per-CPU dispatch queues, one global kernel preempt queue
• System threads
• Priority inheritance
• Turnstiles rather than wait queues

Page 35: Lecture 11: Unix Clusters

Linux Today

• Linux scales very well in SMP systems with up to 4 CPUs.
• Linux on 8 CPUs is still competitive, but between 4-way and 8-way systems the price per CPU increases significantly.
• For SMP systems with more than 8 CPUs, classic Unix systems are the best choice.
• With Oracle Real Application Clusters (RAC), small 4-way or 8-way systems can be clustered to get past today's Linux limitations.
• Commodity, inexpensive 4-way Intel boxes, clustered with Oracle 9i RAC, help to reduce TCO.