Subroutines and Files Bioinformatics Ellen Walker Hiram College


Why Subroutines?

• Saves typing

• Saves potential copy/paste errors

• Collects a common algorithm in one place for reuse

Built-In Subroutines

• Provide common useful functions, e.g.
  – index
  – length
  – substr

• Call with arguments
  – index($string, $pat)  # $string and $pat are arguments

• Different arguments produce different results
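The three built-ins named above can be tried on a short string. This is a minimal sketch; the sample sequence and pattern are made up for illustration.

```perl
use strict;
use warnings;

# Sample data, chosen only for illustration.
my $string = "ACGTTGCA";
my $pat    = "TT";

my $len   = length($string);        # number of characters in $string
my $pos   = index($string, $pat);   # 0-based position of the first match, or -1 if none
my $piece = substr($string, 2, 4);  # 4 characters starting at offset 2

print "length=$len index=$pos substr=$piece\n";
# prints: length=8 index=3 substr=GTTG
```

Changing either argument to index changes the result, which is the point of the last bullet.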

Finding Predefined Subroutines

• Textbooks (Safari Online has several)

• Google (include “Perl” in your search string)

• Online documentation
  – http://www.gotapi.com/perl is nicely searchable


How a Subroutine Works

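Calling code:

    my $code = "ACA";
    print length($code);
    print "goodbye\n";

Inside the subroutine (conceptually):

    sub length {
        my $string = shift(@_);
        my $length = 0;
        …code to count…
        return $length;
    }

(Diagram: the argument "ACA" is passed to the subroutine, and the value 3 is returned to the caller.)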

Key Components

• sub name
  – Declares this as a subroutine and names it

• shift @_
  – Pulls the arguments out of the list (in parentheses), one at a time, left to right
  – Example: somesub("ACT", 1)
  – $a = shift @_ ($a is "ACT")
  – $b = shift @_ ($b is 1)

• return value
  – Ends the subroutine and gives it a value
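The three components can be seen together in one small subroutine. This is a sketch only; repeat_seq is a hypothetical name invented for this example and is not from the textbook.

```perl
use strict;
use warnings;

# Hypothetical subroutine: repeat a sequence a given number of times.
sub repeat_seq {
    my $seq   = shift @_;    # first argument, e.g. "ACT"
    my $times = shift @_;    # second argument, e.g. 2
    return $seq x $times;    # ends the subroutine and gives it a value
}

my $result = repeat_seq("ACT", 2);
print "$result\n";
# prints: ACTACT
```

The example uses $seq and $times rather than $a and $b because Perl reserves $a and $b for use inside sort.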

Example (p. 122)

# find all GC-rich 4-7mers and determine their complements
my $GCmatch;
my $compl;
while ($someDNA =~ m/([GC]{4,7})/g) {
    $GCmatch = $1;
    print "5' $GCmatch 3'\n";
    $compl = complement($GCmatch);
    print "3' $compl 5'\n\n";
}

Subroutine (p. 123)

#book version has good documentation

sub complement
{
    my $dna = shift(@_);   # get first arg
    my $anti = $dna;
    $anti =~ tr/ACGTacgt/TGCAtgca/;
    return $anti;
}

Download These (Ch. 7)

• Counting nucleotides
  – countNucleotides($str, "C");
  – countNucleotides($str, "[CG]");

• Printing sequences with fixed line width
  – printSequence($str, 80);
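To give a feel for what such helpers might do, here is a rough sketch. The names count_pattern_sketch and wrap_sequence_sketch are hypothetical stand-ins written for this example; the textbook's countNucleotides and printSequence may be implemented differently, so download and use the book's versions for the course.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a nucleotide counter:
# counts how many times a pattern matches in the sequence.
sub count_pattern_sketch {
    my $seq     = shift @_;
    my $pattern = shift @_;
    my $count   = 0;
    $count++ while $seq =~ /$pattern/g;
    return $count;
}

# Hypothetical stand-in for a fixed-width printer:
# returns the sequence split into pieces of at most $width characters.
sub wrap_sequence_sketch {
    my $seq   = shift @_;
    my $width = shift @_;
    my @lines;
    for (my $i = 0; $i < length($seq); $i += $width) {
        push @lines, substr($seq, $i, $width);
    }
    return @lines;
}

my $str      = "ACGGCCTA";                            # made-up sample sequence
my $c_count  = count_pattern_sketch($str, "C");       # counts C only
my $gc_count = count_pattern_sketch($str, "[CG]");    # counts C or G
my @wrapped  = wrap_sequence_sketch($str, 3);

print "C: $c_count, C or G: $gc_count\n";
print "$_\n" for @wrapped;
```

With this sample string the counts are 3 and 5, and the wrapped lines are ACG, GCC, and TA.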

Variable Scope

• Variables exist from when they are declared (“my”) until the end of the block (closing brace).

• Variables in subroutines exist only during the subroutine

• Each call to a subroutine re-initializes the variables
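The scope rules above can be checked with a short experiment. This is a minimal sketch; count_call and the variable names are invented for illustration.

```perl
use strict;
use warnings;

sub count_call {
    my $calls = 0;   # created fresh on every call
    $calls++;
    return $calls;
}

my $first  = count_call();
my $second = count_call();   # $calls started at 0 again, so this is also 1

{
    my $inner = "visible only inside this block";
    print "$inner\n";
}
# $inner no longer exists here; under "use strict", using it would be a compile error.

print "$first $second\n";
# prints: 1 1
```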

Files and Programs

• Files are stored on the computer’s hard drive and maintained by the operating system.

• Programs are connected to files via special subroutines
  – “open” creates a file handle
  – “close” releases the file (important!)

Basic File Manipulation

• Open a file and read
  – my $HANDLE;
  – open($HANDLE, '<', $filename);
  – $line = <$HANDLE>;

• Open a file and write
  – my $HANDLE;
  – open($HANDLE, '>', $filename);
  – print $HANDLE "Hello world!";

• Close a file
  – close($HANDLE);

Allowing for Errors

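• If you try to read a file that doesn’t exist, or write a file you don’t have permission to create or change, the open() command will return false. (Opening an existing file with '>' does not fail; it overwrites the file.)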

• The rest of your program won’t work.

• To fix this, add

    or die("some message $file: $!")

  to the end of the command ($! contains the system error message)

Complete Open Examples

open($HANDLE, '<', $filename) or
    die("Cannot open file: $filename: $!");

open($HANDLE, '>', $filename) or
    die("Cannot write file: $filename: $!");

Reading lines

• Subroutine chomp removes the '\n' character at the end of a line

• $line = <$HANDLE> puts the next line in $line

• When there are no more lines, the result is false

• Example: put the whole file in one sequence

    while ($line = <$HANDLE>) {
        chomp $line;
        $seq = $seq . $line;
    }
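The reading loop can be run end to end as follows. This is a sketch under one assumption: instead of a real data file, it first writes a small temporary file using the core File::Temp module, so the example is self-contained. The file contents are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small temporary file so there is something to read.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print $tmp_fh "ACGT\nTTGA\n";
close($tmp_fh);

my $HANDLE;
open($HANDLE, '<', $filename) or die("Cannot open file: $filename: $!");

my $seq = "";
while (my $line = <$HANDLE>) {
    chomp $line;            # drop the trailing newline
    $seq = $seq . $line;    # append this line to the sequence
}
close($HANDLE);

print "$seq\n";
# prints: ACGTTTGA
```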

Printing to a file

• The print commands (print and printf) can optionally be followed by a file handle before the string to print

• Examples:
  – print $HANDLE "Hello\n";
  – printf $HANDLE "GC percent is %.1f\n", $GCcount * 100.0 / $total;
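A complete write-then-read round trip might look like the sketch below. The counts are invented values, and the temporary file (created with the core File::Temp module) stands in for whatever output file you would really use.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Get a temporary file name to write to (an assumption for this sketch).
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close($tmp_fh);

my $GCcount = 5;   # made-up counts for illustration
my $total   = 8;

# Write one formatted line to the file.
open(my $OUT, '>', $filename) or die("Cannot write file: $filename: $!");
printf $OUT "GC percent is %.1f\n", $GCcount * 100.0 / $total;
close($OUT);

# Read the line back to confirm what was written.
open(my $IN, '<', $filename) or die("Cannot open file: $filename: $!");
my $report = <$IN>;
close($IN);
chomp $report;

print "$report\n";
# prints: GC percent is 62.5
```

Note that there is no comma between the file handle and the string in print or printf.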

ReadInDNA

• Subroutine to read FASTA formatted file (p. 141)

• Returns sequence as one long string

• Removes whitespace, lines that begin with # (comments), and all digits

FASTA File Format

• One header line, begins with >

• Many lines of text, sometimes capitalized, sometimes with spaces after every n characters

• (ReadInDNA handles these variations)
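As a rough illustration of the kind of cleanup involved, here is a sketch of a FASTA reader. read_fasta_sketch is a hypothetical name written for this example, not the textbook's ReadInDNA, which may differ in its details. Converting to uppercase is an assumption of this sketch, and the sample file contents are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sketch; the textbook's ReadInDNA may work differently.
sub read_fasta_sketch {
    my $filename = shift @_;
    open(my $fh, '<', $filename) or die("Cannot open file: $filename: $!");
    my $seq = "";
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^>/;    # skip the header line
        next if $line =~ /^#/;    # skip comment lines
        $line =~ s/[\s\d]//g;     # remove whitespace and digits
        $seq .= uc($line);        # assumption: normalize to uppercase
    }
    close($fh);
    return $seq;
}

# Build a tiny made-up FASTA file so the sketch is self-contained.
my ($tmp_fh, $fasta_file) = tempfile(UNLINK => 1);
print $tmp_fh ">example header (made up)\n";
print $tmp_fh "1 acgtac gtacgt\n";
print $tmp_fh "# a comment line\n";
print $tmp_fh "13 GGCC\n";
close($tmp_fh);

my $dna = read_fasta_sketch($fasta_file);
print "$dna\n";
# prints: ACGTACGTACGTGGCC
```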

Getting a FASTA File

• Go to NCBI http://www.ncbi.nlm.nih.gov/

• Search for what you want and download the file to your current machine

• Send the file to your directory on cs.hiram.edu

(Demo to be provided)


Assignment

• Using subroutines from your text, determine the GC content of the given genomes.

(Examples to be provided)
