Cluster Computing Applications for Bioinformatics Thurs., Aug. 9, 2007 Introduction to cluster computing Working with Linux operating systems Overview of bioinformatics applications


Cluster Computing Applications for Bioinformatics

Thurs., Aug. 9, 2007

Introduction to cluster computing

Working with Linux operating systems

Overview of bioinformatics applications


Introduction

Damian Christey

Professional Technologist

Departments of Mathematics and Biology

[email protected]


Cluster Computing

High Availability (HA)

High Performance (HPC): specialized software, highly parallel

Beowulf: commodity hardware, Open Source software


Biology Cluster Hardware

12 nodes, 2 processors per node

Dual core 1GHz Opteron

8 GB RAM each

Gigabit ethernet

2TB RAID storage


GNU/Linux

Free, Open Source, Unix-like operating system

Rocks cluster management system: http://www.rocksclusters.org/

CentOS: http://centos.org/ (derived from Red Hat: http://www.redhat.com/)


Why Linux?

Cheap

Reliable and Scalable

Customizable

Unix philosophy

Text processing


Accessing the Cluster

Monitoring - http://alba.as.wvu.edu/ganglia

Secure Shell: ssh -X [email protected] on Mac OS or Linux

Windows users can download SSH and an X server from: http://cygwin.com/

File transfer (SFTP): http://www.winscp.com/ for Windows, http://cyberduck.ch/ for Mac

qrsh – command to get a shell on a node


Unix Filesystem

Tree with a single root: /

Folders may be physically stored on separate devices or on different machines

/home/bob : Bob’s files

/opt/Bio : Bioinformatics programs

/share/bio : shared data, genome libraries
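To see that folders in the single tree can live on different devices, one option is to ask `df` which filesystem backs a given path. This is only a sketch: it queries `/` and `/tmp` because they exist on almost any Linux system; on the cluster you could substitute paths such as /home/bob or /share/bio from the slide.

```shell
#!/bin/sh
# df -P prints one POSIX-format line per filesystem; the first column is
# the device (or remote host:/export) and the last is the mount point.
root_fs=$(df -P / | tail -n 1)
tmp_fs=$(df -P /tmp | tail -n 1)

echo "/    is stored on: $root_fs"
echo "/tmp is stored on: $tmp_fs"
```

If the first column differs between the two lines, the directories sit on separate devices even though they appear in the same tree.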


Unix Permissions

3x3 matrix: (owner, group, other) x (read, write, execute)

chgrp biouser file : change the group to which the file belongs

chmod g+w file : give the group write permission to your file
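The effect of chmod g+w can be checked on a scratch file, as in the sketch below. The file name results.txt is made up for illustration, and the chgrp step is left as a comment because it only works if a group called biouser exists and you belong to it.

```shell
#!/bin/sh
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
demo="$workdir/results.txt"   # hypothetical file name
echo "sample data" > "$demo"

# Start from a known state: owner read/write, group and other read-only.
chmod 644 "$demo"
before=$(ls -l "$demo" | cut -c1-10)

# chgrp biouser "$demo"   # needs the biouser group; skipped in this sketch

# Give the group write permission, as on the slide.
chmod g+w "$demo"
after=$(ls -l "$demo" | cut -c1-10)

echo "before: $before"   # -rw-r--r--
echo "after:  $after"    # -rw-rw-r--

rm -rf "$workdir"
```

The ten characters read as a file-type flag followed by the three read/write/execute triples for owner, group, and other; only the middle triple changes.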


Text Processing

cat file : dump the contents of file to standard output

head, tail : output the first / last n lines of file

grep : return lines matching pattern in input or file

grep -v : invert match

| : pipe output of one program to another

> : redirect output to a file (overwriting it)

>> : append output to the end of a file
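These commands combine naturally on sequence files. The sketch below builds a tiny FASTA file (the name reads.fasta and its contents are invented for the example) and then uses the tools from this slide to inspect it.

```shell
#!/bin/sh
workdir=$(mktemp -d)
fasta="$workdir/reads.fasta"   # hypothetical file name

# > creates (or overwrites) the file; >> appends to it.
printf '>seq1\nACGTACGT\n' > "$fasta"
printf '>seq2\nTTGACA\n' >> "$fasta"
printf '>seq3\nGGCC\n' >> "$fasta"

# cat dumps the whole file to standard output.
cat "$fasta"

# FASTA header lines contain ">", so grep plus a pipe to wc counts sequences.
n_seqs=$(grep '>' "$fasta" | wc -l | tr -d ' ')

# head keeps only the first matching header.
first_header=$(grep '>' "$fasta" | head -n 1)

# grep -v inverts the match, leaving sequence lines; tail keeps the last.
last_seq=$(grep -v '>' "$fasta" | tail -n 1)

echo "sequences:    $n_seqs"         # 3
echo "first header: $first_header"   # >seq1
echo "last sequence: $last_seq"      # GGCC

rm -rf "$workdir"
```

Quoting the '>' pattern matters: unquoted, the shell would treat it as a redirection rather than pass it to grep.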


Sequencing and Assembly Software

Phred - reads DNA sequencing trace files, calls bases, and assigns quality values

Phrap - assembling shotgun DNA sequence data

Consed - viewing, editing, and finishing sequence assemblies created with phrap

Artemis - genome viewer and annotation tool


Sequence Analysis and Screening Software

(WU, NCBI, MPI) BLAST - find regions of local similarity between sequences

ClustalW, T_Coffee, MUSCLE - multiple sequence alignment

RepeatMasker - screens for interspersed repeats and low complexity sequences

RepeatScout, PILER - de novo repeat finders

EMBOSS – assorted analysis tools


Phylogenetics Software

PHYLIP, PAUP* - packages for inferring phylogenies or evolutionary trees

MrBayes - Bayesian inference of phylogeny

Structure - model-based clustering method for inferring population structure