The ANDC Cluster

  • 1. The ANDC Cluster Story.
      • Animesh Kumar
      • &
      • Ankit Bhattacharjee

2. Chapter 1.

  • In the beginning there was...

3. Then it exploded..

    • The Idea:
    • The cluster project started through a discussion between the Principal of ANDC, Dr Savithri Singh, and the Director of OpenLX, Mr Sudhir Gandotra, during a Linux workshop in 2007.
    • Dr Sanjay Chauhan's recruitment:
    • Dr. Savithri Singh inducted Dr Sanjay Chauhan from the Physics department into the cluster project.
    • Clueless students' involvement:
    • Arjun, Animesh, Ankit and Sudhang.

4. Chapter 2

5.

  • Initially the project was very challenging, the challenges being of two sorts:
      • Technical:
      • Especially the reclamation of the to-be-junked hardware, and
      • Human:
      • Mostly relating to the lack of experience and know-how of the players. This was especially hurtful, since it cost significant man-hours spent on suboptimal and downright incorrect 'solutions' that could have been avoided had the team been slightly more knowledgeable.

6. Chapter 3

  • Not everything that can be counted counts, and not everything that counts can be counted.

7. Junkyard Reclamation.

    • The project officially started when the team was "presented" with 18-20 decrepit machines, of which barely 5 worked.
    • The junk consisted of a gallery of PIs, PIIs and PIIIs at the end of their life, most of them not working, requiring us to implement some:
    • Upgradation:
    • Some of those that did work required significant upgrades to be worth deployment in the cluster.
    • Scavenging:
    • Over a certain length of time, a few could be repaired, while the rest were discarded after "scavenging" useful parts from them for future use in salvageable machines.
    • Arjun's knowledge of hardware acted as a great foundation and learning experience.

8. Experiences don't come cheap..

  • The first investment: Since a fairly "impressive" cluster needed to be at least visibly fast to the lay observer, the machines had to be upgraded in RAM. 25 x 256 MB SDRAM modules were purchased and multiples of these were put in all working machines.

9.

    • Finally, the 6 computers that were in the best state were chosen as follows:
    • Specs:
    • 4 x PII with 512 MB RAM
    • 2 x PIII with 512 MB RAM
    • These were connected via a 100 Mbps switch.

10. Chapter 4

  • Wisdom Through Failure.

11. Our first mistake..

  • ClusterKnoppix is chosen
  • Based on thorough research by Dr. Chauhan on the topic, we chose ClusterKnoppix:
  • ClusterKnoppix is a specialized Linux distribution based on the Knoppix distribution, but which uses the openMosix kernel.
  • openMosix, developed by the Israeli technologist, author, investor and entrepreneur Moshe Bar, was a fork of the once-open, then-proprietary MOSIX cluster system.

12. Why ClusterKnoppix?

    • Lack of requisite knowledge to remaster or implement changes at the kernel level.
    • ClusterKnoppix aims to provide the same core features and software as Knoppix, but adds the openMosix clustering capabilities as well.
    • Specifically designed to be a good master node.
    • openMosix has the ability to build a cluster out of inexpensive hardware, giving you a traditional supercomputer. As long as you use processors of the same architecture, any node configuration is possible (a small illustrative sketch follows below).
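
    • To make the idea concrete, here is a small, purely illustrative C sketch (not code from the project) of the kind of CPU-bound workload openMosix clusters were typically demonstrated with: a few forked worker processes doing nothing but arithmetic, which the openMosix kernel could migrate to less loaded nodes transparently. The worker count and loop length are arbitrary.

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <unistd.h>

        #define WORKERS 4   /* arbitrary: roughly one per spare node */

        /* Purely CPU-bound work; openMosix migrates ordinary Linux
         * processes like this one to other nodes without code changes. */
        static double burn_cpu(void)
        {
            double acc = 0.0;
            long i;
            for (i = 0; i < 200000000L; i++)
                acc += (double)i / (double)(i + 1);
            return acc;
        }

        int main(void)
        {
            int w;
            for (w = 0; w < WORKERS; w++) {
                if (fork() == 0) {            /* child: do the work and exit */
                    printf("worker %d (pid %d) result %f\n",
                           w, (int)getpid(), burn_cpu());
                    exit(0);
                }
            }
            for (w = 0; w < WORKERS; w++)     /* parent: wait for all workers */
                wait(NULL);
            return 0;
        }

    • Compiled with gcc and started on the master node, the extra worker processes should spread across the cluster, something that can be watched in openMosixview (see the next slide).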

13.

    • No CD-ROM drive/hard disk/floppy needed for the clients.
    • openMosix autodiscovery:
      • New nodes automatically join the cluster (no configuration needed).
    • Cluster management tools:
      • openMosix userland / openMosixview
    • Every node can run full-blown X (PC-room/demo setup) or console only: more memory available for user applications.

14. What Could Have Been

15. Problems up there

  • Both ClusterKnoppix and openMosix development had stopped, so not much support was available.

16.

  • The openMosix terminal server uses PXE, DHCP and TFTP to boot Linux clients via the network:
      • So it wasn't compatible with the older cards in our fixed machines, which weren't PXE-enabled.
  • Wouldn't work on the WFC machines' LAN cards:
      • No support for post-2.4.x kernels, hence it couldn't be deployed in any of the other labs in the college, as the machines there had network cards that were incompatible with the GNU/Linux kernel versions with which openMosix worked.

17. Problems down under

  • On the master node we executed the following commands:
      1) ifconfig eth0 192.168.1.10
      2) route add -net 0.0.0.0 gw 192.168.1.1
      3) tyd -f init
      4) tyd
  • And on the drone node we executed:
      1) ifconfig eth0 192.168.1.20
      2) route add -net 0.0.0.0 gw 192.168.1.1
      3) tyd -f init
      4) tyd -m 192.168.1.10
  • The error we got was: "SIOCSIFFLAGS: no such device", i.e. the kernel had no working driver for the network interface.

18. Chapter 5

  • Any port in a storm

19. Other solutions tried.

  • The 'educational' BCCD from the University of Iowa:
      • The BCCD was created to facilitate instruction of parallel computing aspects and paradigms.
      • The BCCD is a bootable CD image that boots up into a pre-configured distributed computing environment.
      • The focus is on the educational aspects of High-Performance Computing (HPC) instead of the HPC core.
    • Problem:
      • It asked for a password even from a live CD, due to the hardware incompatibility!

20.

    • CHAOS:
      • Small (6 MB) Linux distribution designed for creating ad hoc computer clusters.
      • This tiny disc will boot any i586-class PC (that supports CD booting) into a working openMosix node, without disturbing (or even touching) the contents of any local hard disk.
    • Quantian OS:
      • A re-mastering of ClusterKnoppix for the computational sciences.
      • The environment is self-configuring and directly bootable.

21. Chapter 6.

  • First taste of success.

22. Paralledigm Shift!!!

  • It was after a lot of frustrating trials that the ClusterKnoppix idea was dropped.
  • ParallelKnoppix (since upgraded to PelicanHPC) is chosen:
      • ParallelKnoppix is a live CD image that lets you set up a high-performance computing cluster in a few minutes.
      • A ParallelKnoppix cluster allows you to do parallel computing using MPI (a minimal example follows at the end of this slide).
  • Advantages:
      • The frontend node (either a real computer or a virtual machine) boots from the CD image. The compute nodes boot by PXE, using the frontend node as the server.
      • The LAM-MPI and OpenMPI implementations of MPI are installed.
      • Contains extensive example programs.
      • Very easy to add packages.
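
  • A minimal MPI example (illustrative only, not code from the project; it is the standard rank/size hello-world and works with either of the installed MPI implementations):

        #include <stdio.h>
        #include <mpi.h>

        int main(int argc, char *argv[])
        {
            int rank, size;

            MPI_Init(&argc, &argv);                 /* start the MPI runtime        */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id in the job */
            MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes    */

            printf("Hello from rank %d of %d\n", rank, size);

            MPI_Finalize();                         /* shut the runtime down cleanly */
            return 0;
        }

  • Compiled with mpicc and launched with mpirun -np <processes>, every process prints its own rank and the total process count, one line per node process.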

23.

    • Didn't work immediately:
      • PK needs LAN-booting support and our network cards didn't support it. We added "no acpi" to the boot options and, accidentally, it worked. ;)
    • Etherboot is used:
      • gPXE/Etherboot is an open-source (GPL) network bootloader. It provides a direct replacement for proprietary PXE ROMs, with many extra features such as DNS, HTTP, iSCSI, etc.
    • This solution, thus, gave us our first cluster.

24. What the future holds

  • A more permanent solution instead of a temporary one, e.g. ROCKS, HADOOP, DISCO...
  • Implementing key parallel algorithms.
  • Developing a guide for future cluster administrators (who should be students :) ).
  • Familiarizing other departments with the applications of the cluster for their research.
