
Cluster Computing Applications for Bioinformatics

Thurs., Aug. 9, 2007

Introduction to cluster computing

Working with Linux operating systems

Overview of bioinformatics applications

Introduction

Damian Christey

Professional Technologist

Departments of Mathematics and Biology

[email protected]


Cluster Computing

High Availability (HA)

High Performance (HPC)

Specialized software

Highly parallel

Beowulf: commodity hardware, Open Source software

Biology Cluster Hardware

12 nodes, 2 processors per node

Dual core 1GHz Opteron

8 GB RAM each

Gigabit ethernet

2TB RAID storage

GNU/Linux

Free, Open Source, Unix-based operating system

Rocks cluster management system: http://www.rocksclusters.org/

CentOS: http://centos.org/ (derived from Red Hat: http://www.redhat.com/)


Why Linux?

Cheap

Reliable and Scalable

Customizable

Unix philosophy

Text processing

Accessing the Cluster

Monitoring - http://alba.as.wvu.edu/ganglia

Secure Shell: ssh -X [email protected] on Mac OS or Linux

Windows users can download SSH and an X server from: http://cygwin.com/

File transfer – SFTP: http://www.winscp.com/ for Windows, http://cyberduck.ch/ for Mac

qrsh – command to get a shell on a node


Unix Filesystem

Tree with a single root: /

Folders may be physically stored on separate devices or on different machines

/home/bob : Bob’s files

/opt/Bio : Bioinformatics programs

/share/bio : shared data, genome libraries
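A minimal sketch of how to see which device or remote server backs a directory. It assumes a POSIX `df`; the three paths are the ones listed above and may not exist on your own machine, so the loop checks for them first.

```shell
# "/" is the single root of the tree; every path starts from it.
ls /

# df -P prints one line per filesystem; its first column names the
# device (or remote server) that actually stores the directory.
root_line=$(df -P / | tail -n 1)
echo "$root_line"

# The cluster paths from the list above; skipped where they do not exist.
for dir in /home /opt/Bio /share/bio; do
    if [ -d "$dir" ]; then
        df -P "$dir" | tail -n 1
    else
        echo "$dir: not present on this machine"
    fi
done
```

If two directories show different values in the first column, they live on different devices or machines even though they sit in the same tree.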

Unix Permissions

3x3 matrix: (owner, group, other) x (read, write, execute)

chgrp biouser file : change the group to which the file belongs

chmod g+w file : give the group write permission to your file
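The two commands above can be tried safely in a scratch directory. This is a sketch, not a site procedure: the file name results.txt is made up for illustration, and the biouser group is assumed to exist only on the cluster, so the chgrp step reports failure instead of stopping.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create an example file (hypothetical name) with a known starting mode:
# owner read/write, group read, other nothing.
touch results.txt
chmod 640 results.txt
before=$(ls -l results.txt | cut -c1-10)
echo "before: $before"    # -rw-r-----

# Give the group write permission.
chmod g+w results.txt
after=$(ls -l results.txt | cut -c1-10)
echo "after:  $after"     # -rw-rw----

# chgrp only succeeds for a group you belong to; biouser may not exist here.
chgrp biouser results.txt 2>/dev/null || echo "chgrp skipped: biouser not available"
```

The ten characters shown by ls -l are the file type followed by the three permission triples for owner, group, and other.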

Text Processing

cat file : dump the contents of file to standard output

head, tail : output the first / last n lines of file

grep : return lines matching pattern in input or file

grep -v : invert match

| : pipe output of one program to another

> : redirect output to a file (overwrites the file)

>> : append output to the end of a file
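A short sketch that combines these commands on a small FASTA file. The file example.fa and its three sequences are invented for illustration; on the cluster you would point the same commands at your own data.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# ">" creates (or overwrites) the file; ">>" adds to the end of it.
printf '>seq1 Homo sapiens\nACGTACGT\n' > example.fa
printf '>seq2 Mus musculus\nGGCCTTAA\n' >> example.fa
printf '>seq3 Homo sapiens\nTTTTAAAA\n' >> example.fa

cat example.fa          # whole file
head -n 2 example.fa    # first record
tail -n 2 example.fa    # last record

# FASTA header lines start with ">", so counting them counts sequences.
n_seqs=$(grep -c '^>' example.fa)
echo "sequences: $n_seqs"

# grep -v inverts the match: keep only the sequence lines.
grep -v '^>' example.fa > bases.txt

# Pipe: select headers, keep those naming Homo sapiens, count the lines.
# tr strips the padding that some wc implementations add.
n_human=$(grep '^>' example.fa | grep 'Homo sapiens' | wc -l | tr -d ' ')
echo "human sequences: $n_human"
```

Each program does one small job and the pipe connects them, which is the Unix philosophy mentioned earlier.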

Sequencing and Assembly Software

Phred - reads DNA sequencing trace files, calls bases, and assigns quality values

Phrap - assembles shotgun DNA sequence data

Consed - tool for viewing, editing, and finishing sequence assemblies created with Phrap

Artemis - genome viewer and annotation tool

Sequence Analysis and Screening Software

(WU, NCBI, MPI) BLAST - find regions of local similarity between sequences

ClustalW, T_Coffee, MUSCLE - multiple sequence alignment

RepeatMasker - screens for interspersed repeats and low complexity sequences

RepeatScout, PILER - de novo repeat finders

EMBOSS – assorted analysis tools

Phylogenetics Software

PHYLIP, PAUP - packages for inferring phylogenies (evolutionary trees)

MrBayes - Bayesian inference of phylogeny

Structure - model-based clustering method for inferring population structure
