The cluster project started through a discussion between the
Principal of ANDC, Dr. Savithri Singh, and the Director of OpenLX,
Mr. Sudhir Gandotra, during a Linux workshop in 2007.
Dr. Sanjay Chauhan's recruitment:
Dr. Savithri Singh inducted Dr. Sanjay Chauhan from the Physics
department into the cluster project.
Clueless students' involvement:
Arjun, Animesh, Ankit and Sudhang.
4. Chapter 2
Initially the project was very challenging, the challenges
being of two sorts:
Technical:
Especially the reclamation of the to-be-junked hardware.
Human:
Mostly relating to the players' lack of experience and know-how.
This was especially hurtful, since it cost significant man-hours
spent on suboptimal and downright incorrect 'solutions' that could
have been avoided had the team been slightly more knowledgeable.
6. Chapter 3
Not everything that can be counted counts, and not everything
that counts can be counted.
7. Junkyard Reclamation.
The project officially started when the team was "presented"
with 18-20 decrepit machines, of which barely 5 worked.
The junk consisted of a gallery of PIs, PIIs and PIIIs at the
end of their life, most of them not working, which required some:
Upgradation:
Some of those that did work required significant upgrades to be
worth deploying in the cluster.
Scavenging:
Over time, a few could be repaired, while the rest were
discarded after "scavenging" useful parts from them for future use
in the salvageable machines.
Arjun's knowledge of hardware served as a great foundation and
learning experience.
8. Experiences don't come cheap...
The first investment: Since a fairly "impressive" cluster
needed to be at least visibly fast to the lay observer, the
machines had to be upgraded with more RAM. 25 x 256 MB SDRAM
modules were purchased, and multiples of these were put into all
the working machines.
Finally, the 6 computers in the best state were chosen as
follows:
Specs:
4 x PII with 512 MB RAM.
2 x PIII with 512 MB RAM.
These were connected via a 100 Mbps switch.
10. Chapter 4
Wisdom Through Failure.
11. Our first mistake...
ClusterKnoppix is chosen
Based on thorough research by Dr. Chauhan on the topic, we
chose:
ClusterKnoppix, a specialized Linux distribution based on the
Knoppix distribution, but which uses the openMosix kernel.
openMosix, developed by the Israeli technologist, author,
investor and entrepreneur Moshe Bar, was a fork of the once-open,
then-proprietary MOSIX cluster system.
12. Why ClusterKnoppix?
Lack of requisite knowledge to remaster or implement changes at
the kernel level.
ClusterKnoppix aims to provide the same core features and
software as Knoppix, but also adds the openMosix clustering
capabilities.
Specifically designed to be a good master node.
openMosix has the ability to build a cluster out of inexpensive
hardware, giving you a traditional supercomputer. As long as you
use processors of the same architecture, any configuration of your
nodes is possible.
No CD-ROM drive/hard disk/floppy needed for the clients.
openMosix autodiscovery:
New nodes automatically join the cluster (no configuration
needed).
Cluster management tools:
openMosix userland / openMosixview.
Every node can run full-blown X (PC-room/demo setup) or console
only: more memory available for user applications.
14. What Could Have Been
15. Problems up there
Development of both ClusterKnoppix and openMosix had stopped,
so not much support was available.
The openMosix terminal server uses PXE, DHCP and TFTP to boot
Linux clients via the network:
So it wasn't compatible with the older cards in our fixed
machines, which weren't PXE-enabled.
Wouldn't work on the WFC machines' LAN cards:
No support for post-2.4.x kernels, hence it couldn't be deployed
in any of the other labs in the college, as the machines there had
network cards that were incompatible with the Linux kernel
versions with which openMosix worked.
17. Problems down under
On the master node we executed the following commands:
ifconfig eth0 192.168.1.10
route add -net 0.0.0.0 gw 192.168.1.1
tyd -f init
tyd
And on the drone node we executed:
ifconfig eth0 192.168.1.20
route add -net 0.0.0.0 gw 192.168.1.1
tyd -f init
tyd -m 192.168.1.10
The error we got was: SIOCSIFFLAGS: No such device
18. Chapter 5
Any port in a storm
19. Other solutions tried
The 'educational' BCCD from the University of Iowa:
The BCCD was created to facilitate instruction of parallel
computing aspects and paradigms.
The BCCD is a bootable CD image that boots up into a
pre-configured distributed computing environment.
The focus is on the educational aspects of High-Performance
Computing (HPC) instead of the HPC core.
Problem:
It asked for a password even from a live CD, due to the
hardware incompatibility!
CHAOS:
A small (6 MB) Linux distribution designed for creating ad hoc
computer clusters.
This tiny disc will boot any i586-class PC (that supports CD
booting) into a working openMosix node, without disturbing (or even
touching) the contents of any local hard disk.
Quantian OS:
A remastering of ClusterKnoppix for the computational sciences.
The environment is self-configuring and directly bootable.
21. Chapter 6.
First taste of success.
22. Paralledigm Shift!
After a lot of frustrating trials, the ClusterKnoppix idea was
dropped.
ParallelKnoppix (later upgraded to PelicanHPC) is chosen:
ParallelKnoppix is a live CD image that lets you set up a
high-performance computing cluster in a few minutes.
A ParallelKnoppix cluster allows you to do parallel computing
using MPI.
Advantages:
The frontend node (either a real computer or a virtual machine)
boots from the CD image. The compute nodes boot by PXE, using the
frontend node as the server.
The LAM-MPI and OpenMPI implementations of MPI are
installed.
Contains extensive example programs.
Very easy to add packages.
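MPI programs on such a cluster typically follow a scatter/compute/gather pattern: the frontend splits the work, each node computes on its own share, and the partial results are combined at the end. As a minimal single-machine sketch of that pattern (using Python's multiprocessing module as a stand-in for MPI ranks, since a real MPI example would need mpirun and an installed MPI library):

```python
# Sketch of the scatter/compute/gather pattern followed by MPI programs,
# approximated here with Python's multiprocessing module standing in for
# MPI ranks (an analogy, not actual ParallelKnoppix/MPI code).
from multiprocessing import Pool

def partial_sum(chunk):
    # Each "rank" computes over its own slice of the data.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, nprocs=4):
    # "Scatter": split the data into one chunk per worker.
    chunks = [data[i::nprocs] for i in range(nprocs)]
    with Pool(nprocs) as pool:
        # "Compute": each worker processes its chunk independently.
        partials = pool.map(partial_sum, chunks)
    # "Gather"/reduce: combine the partial results on the frontend.
    return sum(partials)

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

On the real cluster the same roles would be played by MPI_Scatter, per-node computation, and MPI_Reduce/MPI_Gather, with the frontend node acting as rank 0.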
Didn't work immediately:
ParallelKnoppix needs LAN-booting support and our network cards
didn't support it. We added the 'noacpi' boot option and,
accidentally, it worked. ;)
Etherboot is used:
gPXE/Etherboot is an open-source (GPL) network bootloader. It
provides a direct replacement for proprietary PXE ROMs, with many
extra features such as DNS, HTTP, iSCSI, etc.
This solution, thus, gave us our first cluster.
24. What the future holds
A more permanent solution instead of the temporary one, e.g.
ROCKS, Hadoop, Disco...
Implementing key parallel algorithms.
Developing a guide for future cluster administrators (who
should be students :) ).
Familiarizing other departments with the applications of the
cluster for their research.