AGENDA
• Introduction to Linux
• How to request an HPC account
• How to log in to HPC
• Basic Linux commands
• Available resources
• How to submit a job to the cluster
What is UNIX
• Unix is an operating system (OS), just as Microsoft Windows is an OS
– Runs on many computer “servers” and provides a multi-user, multi-tasking environment
– Orchestrates the various parts of the computer (the processor, the on-board memory, the disk drives, keyboards, video monitors, etc.) to perform useful tasks
• The Unix operating system comprises three parts: the kernel (with commands to interact with it), standard utility programs/services, and system configuration files
What is Linux
• Linux is a “souped-up”, Unix-like operating system that provides additional user-friendly programs
– A command line interface (CLI) and a graphical user interface (GUI) are available to execute commands
• What exactly does this mean?
– It means we can install and run scientific software as well as business applications
Why Unix/Linux?
• UNIX is good for automating computer tasks:
– performing complex operations with very few keystrokes
– operating on a large number of objects, e.g.:
  • Parsing file contents (pattern matching)
  • Manipulating text files containing scientific data
• UNIX is fast
• Linux (≈ UNIX) is free and runs on PCs and Macs, plus specialty hardware for mobile devices
• Much scientific software is freely available on Linux
Getting an account
• To get started on the UMass Linux servers, you need an account. Fill out this form: https://ghpcc06.umassrc.org/hpc/index.php
• Your PI has to authorize the request
• To connect to the HPC server from Windows, use the PuTTY client; from a Mac, use SSH: http://wiki.umassrc.org/wiki/index.php/Connecting_to_the_Cluster
Working on a Linux Computer
• Linux as a personal workstation
• Linux/Unix as a central “server” (multi-user)
– Connecting requires three pieces of information: user name, password, and server name or IP address
• “PuTTY” on Windows can be used to connect to UMass Research Computing servers
– Remote login may not allow for displaying graphics, leaving text mode interaction only
– Graphics (“X”) can be displayed using special tools (Xming)
Logging into Linux
• Why do we need to log in?
– To track who can log in and what access they have
• Logging in
– Use SSH client software
– Log in to a particular server, which has a designated name:
  • Ex: hpcc01.umassmed.edu, ghpcc06.umassrc.org
  • User credentials: user name, password
– SSH client for Windows: PuTTY
– SSH client for Mac/Linux: Terminal
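As a rough sketch of what logging in from a Mac or Linux terminal looks like, the snippet below assembles the ssh command and prints it rather than running it, since a real connection needs a valid account. The user name "jdoe" is hypothetical; the server name is one of the examples on the slide.

```shell
# Hypothetical user name; replace it with your own account name.
user="jdoe"
# Server name taken from the examples on the slide.
server="ghpcc06.umassrc.org"

# Build the login command. It is printed here instead of executed,
# because an actual connection needs a valid account and network access.
login_cmd="ssh ${user}@${server}"
echo "$login_cmd"
```

Typing the printed command at a terminal prompt would start the login and ask for your password.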
How do I interact with Linux?
• Using a command line interface (CLI), where we explicitly type commands and have Linux execute them (using a command shell)
• What is a command shell?
– A program that interprets the commands we wish to have executed by Linux
• Enter “bash”
– Bourne Again Shell
Logging out of Linux
• Logging out of Linux:
– To end your session, use the “exit” command from the command prompt:
[username@hpcc02 ~]$ exit
Connection to hpcc02 closed
• You can also use the key sequence <Ctrl>+D to close a session
Before we begin learning…
• We will use the terms Linux and UNIX interchangeably
• Many variants of Linux exist: Red Hat, Ubuntu, CentOS, Debian, etc.
• Commands are exactly (or almost exactly) the same across Linux distributions
• Most of the commands we will be covering are applicable to other *NIX-based operating systems
Files and Linux
• Linux users work with
– Applications
– Files
• There are several file types defined for different kinds of usage in Linux
– Basic files: text or binary files (sequence files, etc.)
– Executable files (programs), such as bowtie, gate, ls, cp, cd, etc.
Things we need to do on a shell
• Just like with a Windows PC, users need to:
– Create, edit, move, rename and delete files
– Organize files into folders and navigate the filesystem
– Organize users and control permissions of what they can see and do
– View and manage processes and services
– Install and run programs and work with their output
• In Linux, you learn “commands” to get these things done, and you run them in the “shell” or on the “command line”
Filesystem: Relative and Absolute Path
• The Linux file system is hierarchical and resembles a tree structure
• A user in the “admin” directory can access the “steve” directory by specifying the relative path “steve” or the absolute path “/users/admin/steve”. Similarly, the parent directory “users” can be accessed by specifying “..” or “/users”
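The difference between relative and absolute paths can be tried out safely with a sketch like the one below. It rebuilds the users/admin/steve branch under a temporary directory (an assumption made so that nothing on the real filesystem is touched) and reaches “steve” both ways.

```shell
# Build a throwaway copy of the tree under a temporary root,
# so nothing on the real filesystem is touched.
root=$(mktemp -d)
mkdir -p "$root/users/admin/steve"

# Start in "admin", then reach "steve" with a relative path.
cd "$root/users/admin"
cd steve
via_relative=$(pwd -P)

# Reach the same directory with an absolute path.
cd "$root/users/admin/steve"
via_absolute=$(pwd -P)

# ".." names the parent directory, so from "steve" it leads back to "admin".
cd ..
parent=$(basename "$(pwd -P)")

echo "relative: $via_relative"
echo "absolute: $via_absolute"
echo "parent of steve: $parent"
```

Both routes end in the same directory; the relative one only works from the right starting point, while the absolute one works from anywhere.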
Linux Layout
• Linux commands are typically installed under:
– /bin Linux commands
– /sbin Typical system commands
– /usr/bin User level commands (editors, etc.)
– /share/bin Specific cluster software
– /share/apps Specific genome based cluster tools
Basic command structure
• The basic form of a Unix command is: command [-options] [arguments]
• Example: ls -l /tmp
– “ls” is the command. It lists the contents of a directory
– “-l” is the option (also called a flag), which modifies the default behavior of the command. Try plain “ls” to compare.
– “/tmp” is the argument. The contents of this directory are shown
• Aborting a shell command
– Most Unix systems let you abort the current command by typing Control-C
Note on Linux and commands
• Linux commands are case sensitive, so:
– Exit is not the same as exit
– Bowtie is not the same as BOWTIE
– Gate is not the same as gatE
• In Linux we use / as the directory separator
– In Windows we use \ as the directory separator
• Linux file names can be descriptive and do not require a file extension
Basic Linux commands (List 1)
ls      List the contents of a directory
cp      Copy file(s)
rm      Remove file(s)
mv      Move file(s)
cd      Change location to another directory
mkdir   Make a new directory
pwd     Display the path of the current directory
rmdir   Remove a directory
cat     Display the contents of a file
Basic Linux commands (List 1 ..contd)
head     Display the beginning of a file
tail     Display the end of a file
clear    Clear the shell window
vi       Open a file for editing in the vi editor
passwd   Change the password
less     Display the contents of a file with scrolling
more     Display the contents of a file with scrolling
history  Display the history of commands executed
Basic Linux commands (List 1 ..contd)
date    Display the current date and time
who     Display who is currently logged in
whoami  Display your username
last    Display recent login activity
exit    Exit the shell
wc      Count words and lines in a file
grep    Search for a string pattern in a file
man     Display the “manual” page for a chosen command
Determining Present Working Directory with “pwd”
• When users log in, they are placed in their HOME directory, which is usually under the “/home” directory
• The Linux account name and the home directory name are usually the same, so “/home/snagpal” would be the home directory for user “snagpal”
• As users navigate the filesystem, they can check where they currently are by running the “pwd” command
[snagpal@u15982204 ~]$ pwd
/home/snagpal
• In Windows, you can see the same information in the Windows Explorer address bar
Changing directories with “cd”
• Often, users need to go to another directory that is:
– a sub-directory, which is reached below the present working directory in the tree hierarchy
– a super-directory, which is reached through the parent of the present working directory
• In both cases, absolute and relative paths can be used. Let's say the user is currently in “/home/snagpal” and needs to access
– A sub-directory of the home directory
cd linuxcourse
cd /home/snagpal/linuxcourse
– A directory reached through the super-directories of the home directory
cd ../../usr/local
cd /usr/local
Listing files and directories with “ls”
• “ls” lists files and sub-directories in a chosen directory. Windows Explorer offers a rich, graphical equivalent
– To list files in the current directory
ls
– To list files in another directory (absolute path)
ls /usr/local
– To change the default view of the output to a long list
ls -l /usr/local
Making Directories with “mkdir”
• To create new sub-directories in the home folder or elsewhere on the filesystem, use the “mkdir” command
• Absolute or relative paths can be specified
mkdir linuxcourse
mkdir /home/snagpal/linuxcourse
Removing Directories with “rmdir”
• To remove directories in the home folder or elsewhere on the filesystem, use the “rmdir” command
• Absolute or relative paths can be specified (the directory must be empty)
rmdir linuxcourse
rmdir /home/snagpal/linuxcourse
Copying, Moving and Removing files
• Users who need to duplicate a file can do so with the “cp” command. It requires a source and a destination (absolute or relative path)
cp /share/training/linux/test.txt /home/snagpal
cp /share/training/linux/test.txt .
• The dot “.” represents the current working directory. Copying leaves the file in its original location. Moving (“mv”) removes it from the original location, and can also rename the file
mv /share/training/linux/test.txt /home/snagpal/file.txt
mv /share/training/linux/test.txt file.txt
• To remove a file, use “rm”
rm test.txt
rm /home/snagpal/file.txt
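To see how copy, move and remove differ, here is a minimal sketch that works entirely inside a temporary directory. The directory name “source_dir” and the file contents are made up for illustration; only cp, mv and rm themselves come from the slide.

```shell
# Work in a throwaway directory so no real files are affected.
work=$(mktemp -d)
cd "$work"
mkdir source_dir

# Create a small file to play with (contents are made up).
echo "hello" > source_dir/test.txt

# Copy into the current directory ("."): the original stays in place.
cp source_dir/test.txt .
[ -f source_dir/test.txt ] && [ -f test.txt ] && echo "after cp: both copies exist"

# Move and rename in one step: test.txt disappears, file.txt appears.
mv test.txt file.txt
[ ! -e test.txt ] && [ -f file.txt ] && echo "after mv: only file.txt remains here"

# Remove the renamed copy.
rm file.txt
[ ! -e file.txt ] && echo "after rm: file.txt is gone"
```

Note that rm deletes immediately; there is no recycle bin to recover from.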
File Naming conventions in Linux
• To name files and directories, use:
– characters A-Z, a-z
– numbers 0-9
– period .
– dash -
– underscore _
• Avoid shell metacharacters in file and directory names, such as: \ / < > ! $ % ^ & * | { } [ ] " ' ` ; ~
Creating and editing files
• Linux has many text editors. “vi” is the most common, but “emacs”, “pico” and “nano” can also be installed
• The most common syntax is:
vi newfile.txt # Creates new file
vi existingfile.txt # Opens existing file
• vi checks whether the named file exists. If it does, its contents are displayed. If not, a new file with that name is started (it is written to disk when you save)
• By default, “vi” opens in command mode. Users can scroll in the file (up, down, page up, page down), move the cursor, delete lines, undo, etc.
• To enter the “write” or “insert” mode for adding text, press the “i” or “a” key. To leave insert mode, press the “Esc” key
The “vi” editor (…contd)
• To exit the “vi” editor and return to the Linux prompt, first return to command mode by pressing the “Esc” key. Then press the “:” key to enter command line mode
wq saves the current changes and exits vi
w! saves the current changes but doesn’t exit vi
q! exits vi without saving any changes
• There are many more commands available in command mode and command line mode. A vi tutorial is suggested
Searching for patterns in text with “grep”
• Grep searches line-by-line for a specified pattern, and outputs any line that matches the pattern. Basic syntax for the grep command is: grep [options] pattern [files]
cp /share/training/linux/seq.fasta .
grep ">" seq.fasta
grep TCGAAGA seq.fasta
• grep has many “options” and can also search using regular expressions (patterns that describe the characteristics of one or more strings), e.g.: te?xt, *omics
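The grep usage above can be tried without access to the training files. The sketch below writes a tiny FASTA file with made-up sequences into a temporary directory (both are assumptions for illustration), then prints the headers and counts matching lines with the standard -c option.

```shell
# Create a tiny FASTA file with made-up sequences for illustration.
work=$(mktemp -d)
cat > "$work/seq.fasta" <<'EOF'
>seq1 example
ACGTTCGAAGATT
>seq2 example
GGGCCCAAATTT
>seq3 example
TTTCGAAGACC
EOF

# Print the header lines; the quotes stop the shell from
# treating > as output redirection.
grep ">" "$work/seq.fasta"

# -c counts matching lines instead of printing them.
n_headers=$(grep -c "^>" "$work/seq.fasta")
n_hits=$(grep -c "TCGAAGA" "$work/seq.fasta")
echo "headers: $n_headers, lines containing TCGAAGA: $n_hits"
```

grep works line by line, so a pattern that spans a line break in the sequence would not be found this way.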
Counting words in file with “wc”
• The “wc” command counts lines, words and characters in a file
cp /share/training/linux/abstract.txt .
cat abstract.txt
wc abstract.txt
wc -l abstract.txt
Text processing Linux Commands
$ head -2 file_name                     List the first two lines
$ tail -2 file_name                     List the last two lines
$ head -5 file_name | tail -1           List the fifth line
$ cat file_name | head -50 | tail -1    List the 50th line
$ cat file | sort -rn | tail -5         List the last 5 items (sorted in reverse numerical order)
$ sort -rn file | uniq -c               Sort a file, and count the number of occurrences of each line
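The combinations in the table above can be checked on a small file. This sketch uses a made-up list of numbers in a temporary directory (an assumption for illustration) and applies the same head, tail, sort and uniq patterns.

```shell
work=$(mktemp -d)
# Made-up numeric data, one value per line, with some repeats.
printf '%s\n' 7 3 9 3 1 7 3 > "$work/numbers.txt"

# Third line of the file: take the first 3 lines, then the last of those.
third=$(head -3 "$work/numbers.txt" | tail -1)

# Largest value: sort in reverse numerical order and keep the first line.
largest=$(sort -rn "$work/numbers.txt" | head -1)

# uniq -c only merges adjacent duplicates, which is why the input is sorted first.
sort -rn "$work/numbers.txt" | uniq -c

# Number of distinct values.
distinct=$(sort -rn "$work/numbers.txt" | uniq | wc -l | tr -d ' ')
echo "third line: $third, largest: $largest, distinct values: $distinct"
```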
Miscellaneous commands
• Displaying the current date and time with “date”
date
• Clearing the terminal with “clear”
clear
• Displaying the history of commands with “history”
history
Getting Help in Unix
• Use the man command, followed by the name of the command you need help with
– Type “man ls” to see the manual page for the “ls” command
man ls
User convenience features
• Shell tab completion with suggestions
• Shell expansion of wildcards for specifying multiple arguments
ls -l *.txt
• Combining options/flags
ls -la *.txt
• Using long flag names with “--”
• Copying and pasting via the clipboard with left and right mouse clicks
Tying Linux commands together
• All commands are executed left to right
– Output is passed along in the same direction
• Linux pipes ‘|’ and commands
• Ex: determine how many sequences we have
$ cat sequence.fastq | wc
• There are 4 lines per sequence in a FASTQ file, so how can we determine the number of sequences (lines/4)?
$ wc -l sequence.fastq | awk '{print $1}' | xargs -I{} echo "scale=0; {}/4" | bc -l
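A shorter way to get the same number is to let the shell do the division. The sketch below is one possible variant, not the slide's own method; it uses a made-up two-read FASTQ file in a temporary directory so it can run anywhere.

```shell
work=$(mktemp -d)
# Two made-up reads in FASTQ layout: header, bases, separator, qualities.
cat > "$work/sequence.fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGGCCAA
+
IIIIIIII
EOF

# Feed the file on stdin so wc prints only the number, then divide by 4.
lines=$(wc -l < "$work/sequence.fastq")
reads=$(( lines / 4 ))
echo "$reads reads"
```

Shell arithmetic is integer only, which is fine here because a well-formed FASTQ file always has a multiple of 4 lines.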
Linux/UNIX Redirection
• What is redirection?
– Linux uses < and > to redirect input and output respectively.
– Redirecting with > lets the user save a command's output to a file, for example. In the same way, < makes a command read its input from a file instead of the keyboard.
– Ex: echo "test" > file1 # writes "test" to file1
– Ex: cat < file1 # outputs the contents of file1
Redirection (..contd)
• A word of caution: a single > (redirect stdout to a file) will create the file or overwrite an existing one, whereas >> (two > signs in a row) appends to the file, preserving its existing contents.
Redirection (…contd)
• If we create two files (file1 and file2) containing Line1 and Line2 respectively
• We can then create a new file using the > redirection operator
$ cat file1 file2 > file3
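The overwrite and append behaviour of the redirection operators is easy to verify with a sketch like this one, which runs inside a temporary directory (an assumption so that no existing file is overwritten).

```shell
work=$(mktemp -d)
cd "$work"

# > creates (or overwrites) a file with the command's output.
echo "Line1" > file1
echo "Line2" > file2

# Concatenate both files into a new one.
cat file1 file2 > file3

# >> appends, so file3 grows to three lines.
echo "Line3" >> file3
appended=$(wc -l < file3 | tr -d ' ')

# A single > on an existing file replaces its contents.
echo "only line" > file3
overwritten=$(wc -l < file3 | tr -d ' ')

# < feeds a file to a command's standard input.
first=$(head -1 < file1)
echo "after >>: $appended lines, after >: $overwritten line, first line of file1: $first"
```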
Redirection (…contd)
• Using bowtie with redirection
– Ex: analyze FASTQ files to look for all alignments per read (-a), with hits guaranteed to be in the best stratum and ties broken by quality (--best), reporting end-to-end hits with at most 2 mismatches (-v 2)
• In the bowtie example, we redirect the alignment output to a file named ‘output_file’ in your scratch directory.
$ bowtie -a --best -v 2 upstream_mate downstream_mate.fastq > ~/scratch/output_file
Shortcut BASH keystrokes
• Keyboard shortcut timesavers in BASH
– CTRL + A Move cursor to start of line
– CTRL + C Stop a program
– CTRL + D Log out (same as the ‘exit’ command)
– CTRL + E Move cursor to end of line
– CTRL + Z Suspend program
– TAB Command completion (type part of a command and hit TAB to complete it)
– TAB TAB Shows all commands available
Executing Commands
• PATH
– Commands are found through your shell’s PATH
  • For example: when we type a command such as ‘ls’, it runs because its directory is part of the search PATH
– An example PATH is
$ echo $PATH
/bin:/sbin/:/home/ritaccoa
– Commands which are not in your PATH will not be found and therefore not executed, unless you type their full path
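To inspect how the shell resolves a command, something along these lines can help. It lists the PATH directories one per line and uses the standard `command -v` lookup; the name "no_such_tool_xyz" is hypothetical and assumed not to exist on the system.

```shell
# Show each directory on the search path, one per line.
echo "$PATH" | tr ':' '\n'

# command -v reports where the shell finds a command.
ls_location=$(command -v ls)
echo "ls is found at: $ls_location"

# A name that is not on the PATH (hypothetical, assumed not to exist) is not found.
if command -v no_such_tool_xyz >/dev/null 2>&1; then
    lookup="found"
else
    lookup="not found"
fi
echo "no_such_tool_xyz: $lookup"
```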
Calling external bioinformatics programs
• On our server, several bioinformatics software packages are installed
$ module avail
• The general method for using a software package is to load its module
$ module load bowtie/1.0.2
$ bowtie --help
HPC infrastructure at UMass RC
• Massachusetts Green High Performance Computing Cluster
– 10264 cores available; each node has 196 - 512 GB RAM. 12 GPU nodes available
– 400 TB of high performance EMC Isilon X series storage
– FDR based Infiniband (IB) network and a 10GE network for the storage environment
• Software related to research installed:
– Physics, Medical Physics, Genomics, Chemistry…
Basic terminology
• What is a node?
• What is a CPU?
• What is a core?
• What is an operating system?
– What is a kernel?
• What is a process?
– Single process OS and processes
– Concurrent (multi-tasking) OS and processes
– Multiple cores (SMP) and Linux processes
Basic Terminology
• What is a node?
– A single computer/blade which contains X CPUs and Y cores per CPU
• What is a CPU?
– The central processing unit (CPU) carries out all of the instructions that a computer system requires to perform a given task
• What is a core?
– A core is a processor within a CPU chip (there can be many cores on a given CPU)
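The relationship between nodes, CPUs and cores comes down to one multiplication. The sketch below assumes a hypothetical node with 2 CPUs of 6 cores each; the optional nproc call (part of GNU coreutils, so it may be missing on other systems) shows what the current machine reports.

```shell
# Hypothetical node layout, chosen only for illustration.
cpus_per_node=2
cores_per_cpu=6

# Total cores on the node = CPUs x cores per CPU.
total_cores=$(( cpus_per_node * cores_per_cpu ))
echo "total cores on this node: $total_cores"

# On a Linux machine, nproc reports the processing units available
# to the current process; skip it quietly where it is not installed.
if command -v nproc >/dev/null 2>&1; then
    echo "processing units available here: $(nproc)"
fi
```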
Basic Terminology
• What is a process?
– A process is a program that is executing (ex: iTunes)
• What is a kernel?
– The kernel is the glue between the hardware and the user. The kernel schedules processes.
– The kernel can be thought of as a crossing guard directing traffic for optimal performance
Basic Terminology
• Processes and tasks
– Single process OS and processes
  • Single processing OSs can run only one user process, a single task, at a given time
  • Each task runs to completion before another task is started
  • MS-DOS is an example of this type of single-user OS
• Linux processes and cores
– A one-to-one relationship is optimal for performance
Basic Terminology
• Processes and tasks, cont.
– Concurrent (multi-tasking) OS and processes
  • A concurrent OS lets users execute many programs simultaneously
  • Linux lets users run an editor, a music player, and other tasks simultaneously, thus allowing for multi-tasking
– Multiple cores (SMP) and Linux processes
  • A process can take advantage of more than one core while running. Such processes are typically called multi-threaded.
Short Review
• If a node has four CPUs and two cores per CPU, how many total cores are there?
• In Linux can we execute an editor and a program to search a genome at the same time?
• How many processes should we execute on a node which has two CPUs with 8 cores each?
What is HPC?
• HPC = High Performance Computing
– Infrastructure where hundreds or thousands of computers are networked together with shared common storage
– Multiple users can log in and use the infrastructure
– More than one computer can be used to complete a computing task
– Special tools/skills are required to leverage an HPC environment: Linux, LSF commands
Definitions
HPC Term    Definition
Node        A single computer available to perform computing tasks
Rack        A cabinet in which multiple nodes can be stacked vertically and/or horizontally, allowing for efficient housing, networking and power management
Cluster     A collection of computer “nodes” that are on the same network for inter-node communication, shared storage and to execute jobs
CPU         The electronic circuitry (microprocessor) within a computer that carries out the instructions of a computer program
Core        An independent processing unit within a CPU that can execute program instructions. A modern CPU can have multiple cores
Head node   In a cluster, one or a few nodes can be designated as head nodes, where users typically log in and create/monitor jobs
Definitions (…contd)
HPC Term       Definition
Compute node   Compute nodes in a cluster execute jobs created on a head node. Users cannot log in directly to a compute node
Process        An instance of a computer program that is being executed. It contains the program code and its current activity
Thread         The smallest sequence of programmed instructions that can be managed independently by the scheduler of an OS
Job            A Linux command that is designated to be executed on a compute node rather than the head node
Job array      Identical jobs that have a different iterator variable
Parallel job   A job that breaks a complex computing task into smaller tasks, such that each task is executed on a different node simultaneously
Queue          Designated “lanes” for submitting different types of jobs depending on priority, resources required or expected duration of execution
Definitions (…contd)
HPC Term                Definition
Scheduler               HPC software that allows for efficient utilization of cluster resources based on submitted job types
Job management          HPC software that keeps track of jobs submitted
Research computing      One of the departments within UMass Medical School responsible for supporting the HPC infrastructure on campus. Not related to “IT”
Cloud computing         A variant of HPC infrastructure which is not limited to a particular organization, where computing resources are requested on demand
Distributed computing   Buzzword similar to High Performance Computing
Parallel computing      Buzzword similar to High Performance Computing
Why do you need HPC?
• Needs assessment:
– Use software that’s only available on Linux
  • Install it yourself on your own Linux PC?
  • RC already has it installed?
– Automate data crunching tasks
  • Routine incoming data that needs to be crunched?
  • Workflow available within RC to handle it?
– Simulations
  • Molecular dynamics simulations taking too much time?
HPC is not for these!
• Running Windows software with point-and-click interfaces
• Working with office documents: spreadsheets, slides, etc.
• Video games, music or general video
• Web browsing
• Email
Policies for HPC use
• If you have a “need” to use HPC, the RC group can help, but there are expectations:
– Understanding of the constraints of our HPC implementation: CPUs, memory, local and shared storage, networking, etc.
– Good knowledge of the tasks/jobs that you are going to run: expected run times, utilization of memory, disk space and network bandwidth
– Fair share policies
Typical HPC environment
[Figure: the cluster, with a cluster head node, slave nodes and a NAS/SAN storage unit. Connections: internal cluster traffic (Ethernet, 1 Gb/s), NAS storage (Ethernet, 1 Gb/s), public network (Ethernet, 100 Mb/s)]
What is a computing “Job”?
• A computing “job” is an instruction to the HPC system to execute a command or script
– Simple Linux commands that execute within milliseconds would probably not qualify to be submitted as a “job”
– Any command that is expected to take up a large share of CPU or memory for more than a few seconds on a node would qualify to be submitted as a “job”. Why? (Hint: multi-user environment)
How to submit a “job”
• The basic syntax is:
bsub <valid linux command>
• bsub: LSF command for submitting a job
• Let's say a user wants to count the number of lines in a FASTQ file. On a Linux PC, the command is
wc -l reads.fastq
• To submit a job to do the work, do
bsub wc -l reads.fastq
Specifying more “job” options
• Jobs can be marked with options for better job tracking and resource management
– A job should be submitted with parameters such as queue name, estimated runtime, job name, memory required, output and error files, etc.
• These can be passed in the bsub command
bsub -q short -W 1:00 -R "rusage[mem=2048]" -J "Myjob" -o hpc.out -e hpc.err wc -l reads.fastq
Job submission “options”
Option   Description
-q       Name of queue to use. On our systems, possible values are “short” (<=4 hrs execution time), “long” and “interactive”
-W       Allocation of node time. Specify hours and minutes as HH:MM
-J       Job name. E.g. “Myjob”
-o       Output file. E.g. “hpc.out”
-e       Error file. E.g. “hpc.err”
-R       Resources requested from assigned node. E.g. “-R rusage[mem=1024]”, “-R span[hosts=1]”
-n       Number of cores to use on assigned node. E.g. “-n 8”
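One way to keep these options manageable is to hold them in variables and assemble the submission line from them. The sketch below only builds and prints the command, because bsub exists only where LSF is installed; the variable names are made up for illustration, and the option values mirror the examples in the table.

```shell
# Values mirror the examples in the table; adjust them for your job.
queue="short"
walltime="1:00"
mem_mb=2048
job_name="Myjob"

# Assemble the submission line. The resource string is quoted so that
# the shell does not try to expand the square brackets as a wildcard.
submit_cmd="bsub -q $queue -W $walltime -R \"rusage[mem=${mem_mb}]\" -J \"$job_name\" -o hpc.out -e hpc.err wc -l reads.fastq"

# Print the command for review before submitting it on the cluster.
echo "$submit_cmd"
```

Reviewing the printed line before pasting it on the head node is a cheap way to catch a wrong queue name or memory value.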
Why use the correct queue?
• Match requirements to resources
• Jobs dispatch quicker
• Better for the entire cluster
• Helps GHPCC staff determine when new resources are needed
Demo
Create a script “hello-job-array.sh”:
#!/bin/bash
#BSUB -q short
#BSUB -W 00:10
#BSUB -n 1
#BSUB -R "rusage[mem=1024]"
#BSUB -J "myTask[1-80]"
#BSUB -o logs/out.%J.%I

echo "Hello Job $LSB_JOBID Task $LSB_JOBINDEX"
To execute from the shell, run: bsub < hello-job-array.sh
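A common use of the array index is to give each task its own input. The sketch below is one possible pattern, not part of the demo: it assumes a hypothetical list file “inputs.txt” with one file name per line, and falls back to index 1 so it can also be tried outside LSF, where LSB_JOBINDEX is not set.

```shell
work=$(mktemp -d)
# Hypothetical list of input files, one per line.
printf '%s\n' sampleA.fastq sampleB.fastq sampleC.fastq > "$work/inputs.txt"

# LSF sets LSB_JOBINDEX for each array task; default to 1 so the
# sketch also runs outside the scheduler.
index=${LSB_JOBINDEX:-1}

# Select the line of the list that matches this task's index.
input=$(sed -n "${index}p" "$work/inputs.txt")
echo "Task $index would process $input"
```

With a job name such as "myTask[1-3]", each of the three tasks would pick a different line of the list.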
Learning to use HPC
• Linux is a prerequisite for using any HPC system
– Plenty of Linux tutorials are available on the internet
– Attend our “Intro to Linux” sessions when offered
• Our website is a good resource for learning to use HPC; visit www.umassrc.org
• Lots of examples provided
Disk usage best practices
• Archive your data
– Make backups of your data on mid- to long-term storage
• Use local storage if possible
– Local storage is always faster than network storage
• Don’t use farline for cluster processing
HPC Best practices
• When submitting a large number of jobs, please consider:
– Single-CPU jobs versus multi-CPU jobs
– Correct amount of memory for your job
– Job arrays
– Job dependencies
HPC Best practices cont.
• The earlier your jobs are submitted, the earlier they will gain the needed LSF resources.
• Redirect all LSF output to one directory for convenience
• Add one of the following to your LSF job directives (redirects stdout/stderr); the second form, with %I, is for job arrays:
#BSUB -o $HOME/LSF_jobs_output/LSF_job.%J.out
#BSUB -o $HOME/LSF_jobs_output/LSF_job.%J.%I.out
HPC Best practices cont.
LSF queues and policies
• Fair share attempts to equalize CPU (slot) resources among labs and users at job submission.
• The priority of a job is calculated in relation to other submitted jobs. The priority of jobs will change as jobs complete and job slots become available
• All labs start with an equal weight
• Each lab member shares in this weight when submitting jobs
• Weights are measured from job submissions per user and per lab
• Weights are based on CPU time used and a decay time
Working with bioinformatics data files: A demo
• Log on to the UMass server using PuTTY on Windows or Terminal on Mac
• Request an interactive shell session on one of the compute nodes for this demo
$ bsub -q interactive -W 4:00 -Is bash
• Navigate to the training directory or copy the examples to your local directory
$ cd /share/training/linux-bioinformatics
$ cp /share/training/linux-bioinformatics/* ~
Working with bioinformatics data files: A demo (…contd)
• We have a file with a genomic sequence, called “sequence.fa”, and a file with NGS reads, “reads.fq”. Confirm that they are present
$ ls
• We can examine a file's type using this Linux command
$ file sequence.fa
sequence.fa: ASCII text
• Let's look at the attributes of the files in this directory
$ ls -l
Working with bioinformatics data files: A demo (…contd)
• The “cat” command can be used to display the contents of one or more files to the screen
$ cat sequence.fa
• It may be better to scroll through the file page by page
$ less sequence.fa
• Display just the first line of the file (header)
$ head -1 sequence.fa
• Display the last 3 lines of the file
$ tail -3 sequence.fa
Working with bioinformatics data files: A demo (…contd)
• Determine the number of lines in the FASTQ file
$ wc -l reads.fq
• Count the number of reads in the FASTQ file
$ x=`wc -l reads.fq | cut -f 1 -d ' '`
$ echo "$((x/4)) reads"
• Search for a pattern in the sequence file and count the matching lines
$ grep -c ACGTCA sequence.fa
• Search for an adapter at the start of a line and count the reads containing it
$ grep ^ACGTCA reads.fq | wc -l
Innovagene Informatics. All rights
reserved
Working with bioinformatics data files: A demo (…contd)
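• Case-insensitive search and count
$ grep -i ^ACGTCA reads.fq
$ grep -i ^ACGTCA reads.fq | wc -l
• Display all headers in the sequence file (the quotes keep the shell from treating > as redirection, which would overwrite the file)
$ grep '^>' sequence.fa
• Count the number of bases in a single-sequence FASTA file (this count also includes the newline characters)
$ more +2 sequence.fa | wc -m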
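Because wc -m counts every character, including the newline at the end of each line, a count of bases alone needs the line breaks removed first. The following is a sketch of one alternative, using a made-up FASTA file in a temporary directory; it is not the slide's own command.

```shell
work=$(mktemp -d)
# Made-up single-sequence FASTA file: one header, two lines of bases.
cat > "$work/sequence.fa" <<'EOF'
>chrTest example
ACGTACGTAC
GGTTAA
EOF

# Drop the header, strip the line breaks, then count what is left.
bases=$(grep -v '^>' "$work/sequence.fa" | tr -d '\n' | wc -c | tr -d ' ')
echo "$bases bases"
```

Here the two sequence lines hold 10 and 6 bases, so the count is 16 regardless of how the sequence is wrapped.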
Working with bioinformatics data files: A demo (…contd)
• Now let's align the reads to the sequence file (chr19)
module load bowtie2/2-2.1.0
module load samtools/1.2
• If you still have enough time remaining on this compute node (interactive sessions can be requested for up to 8 hours), build the index and run bowtie2
bowtie2-build sequence.fa reference
bowtie2 -p 1 -x reference -U reads.fq -S reads.fq.sam
• You can also submit this alignment as a compute job
Working with bioinformatics data files: A demo (…contd)
• Create a bowtie script, “bowtie-align.sh”, with the following content
#!/bin/bash
module load bowtie2/2-2.1.0
module load samtools/1.2
bowtie2-build sequence.fa reference
bowtie2 -p 8 -x reference -U reads.fq -S reads.fq.sam
samtools view -b -o reads.fq.bam reads.fq.sam
Working with bioinformatics data files: A demo (…contd)
• Now submit this script as a compute job
bsub -W 4:00 -q short -R "rusage[mem=4096]" -J "bowtie-job" -o ngs.out -e ngs.err ./bowtie-align.sh
• Another way of writing the script is to include all of the command line options in the script itself (next slide)
• Then submit the compute job as
bsub < bowtie-align2.sh
Working with bioinformatics data files: A demo (…contd)
#!/bin/bash
#BSUB -J "SeqAlignJob"
#BSUB -R "rusage[mem=4096]"
#BSUB -q short
#BSUB -W 4:00
#BSUB -o ngs.out
#BSUB -e ngs.err
module load bowtie2/2-2.1.0
module load samtools/1.2
bowtie2-build sequence.fa reference
bowtie2 -p 8 -x reference -U reads.fq -S reads.fq.sam
samtools view -b -o reads.fq.bam reads.fq.sam