Subroutines and Files
Bioinformatics
Ellen Walker
Hiram College
Why Subroutines?
• Saves typing
• Saves potential copy/paste errors
• Collect common algorithm in one place for reuse
Built-In Subroutines
• Provide common useful functions, e.g.
  – index
  – length
  – substr
• Call with arguments
  – index($string, $pat)  # $string and $pat are arguments
• Different arguments produce different results
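The three built-ins named above can be tried in a few lines. This is a minimal sketch; the sequence and pattern values are made up for illustration.

```perl
use strict;
use warnings;

my $string = "GATTACA";   # made-up example sequence
my $pat    = "TTA";       # made-up example pattern

# index returns the 0-based position of the first match, or -1 if there is none
my $pos = index($string, $pat);

# length returns the number of characters in the string
my $len = length($string);

# substr extracts a piece: start at $pos and take length($pat) characters
my $piece = substr($string, $pos, length($pat));

print "position: $pos\n";
print "length: $len\n";
print "piece: $piece\n";
```

Changing $pat to a string that does not occur in $string makes index return -1, which is worth checking before passing the position to substr.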
Finding Predefined Subroutines
• Textbooks (Safari Online has several)
• Google (include “Perl” in your string)
• Online documentation
  – http://www.gotapi.com/perl is nicely searchable
How a Subroutine Works
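• In the calling program:
  my $code = "ACA";
  print length($code);
  print "goodbye\n";
• Inside the subroutine:
  sub length {
    my $string = shift(@_);
    my $length = 0;
    …code to count…
    return $length;
  }
• (Diagram: $code holds ACA; the argument "ACA" is passed to length, which returns 3.)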
Key Components
• sub name
  – Declares this as a subroutine and names it
• shift @_
  – Pulls the arguments out of the list (in parentheses), one at a time, left to right
  – Example: somesub("ACT", 1)
  – $a = shift @_  ($a is "ACT")
  – $b = shift @_  ($b is 1)
• return value
  – Ends the subroutine and gives it a value
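The three components can be seen together in one short subroutine. This is only an illustrative sketch: first_bases is a hypothetical name, not a subroutine from the textbook, and the variables are called $dna and $count rather than $a and $b because Perl reserves $a and $b for use by sort.

```perl
use strict;
use warnings;

# Hypothetical subroutine: return the first $count bases of $dna
sub first_bases {
    my $dna   = shift(@_);   # first argument, here "ACT"
    my $count = shift(@_);   # second argument, here 1
    return substr($dna, 0, $count);   # ends the subroutine and gives it a value
}

my $result = first_bases("ACT", 1);
print "$result\n";
```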
Example (p. 122)
# find all GC-rich 4-7mers and determine their complements
my $GCmatch;
my $compl;
while ($someDNA =~ m/([GC]{4,7})/g) {
    $GCmatch = $1;
    print "5' $GCmatch 3'\n\n";
    $compl = complement($GCmatch);
    print "3' $compl 5'\n";
}
Subroutine (p. 123)
#book version has good documentation
sub complement
{
my $dna = shift(@_); #get first arg
my $anti = $dna;
$anti =~ tr/ACGTacgt/TGCAtgca/;
return $anti;
}
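To run the example and the subroutine together, both need to be in one file along with a value for $someDNA. The sketch below assumes a made-up sequence and collects the matches in an array (@pairs, a name introduced here) so the results can be inspected afterwards.

```perl
use strict;
use warnings;

sub complement {
    my $dna  = shift(@_);             # get first arg
    my $anti = $dna;
    $anti =~ tr/ACGTacgt/TGCAtgca/;   # swap each base for its complement
    return $anti;
}

my $someDNA = "ATGGCCGATTGCGCGCAT";   # made-up sequence for illustration
my @pairs;                            # each entry holds [match, complement]

# find all GC-rich 4-7mers and determine their complements
while ($someDNA =~ m/([GC]{4,7})/g) {
    my $GCmatch = $1;
    my $compl   = complement($GCmatch);
    push @pairs, [$GCmatch, $compl];
    print "5' $GCmatch 3'\n";
    print "3' $compl 5'\n\n";
}
```

Note that complement swaps bases in place without reversing the string, so the second printed line reads 3' to 5'.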
Download These (Ch. 7)
• Counting nucleotides
  – countNucleotides($str, "C");
  – countNucleotides($str, "[CG]");
• Printing sequences with fixed line width
  – printSequence($str, 80);
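If the downloaded files are not at hand, the calls above can be tried with stand-ins. The bodies below are assumptions written to match the calls shown on the slide (a sequence plus a pattern, and a sequence plus a line width); the textbook's versions may well differ.

```perl
use strict;
use warnings;

# Assumed behavior: count how many times $pattern matches in $seq
sub countNucleotides {
    my $seq     = shift(@_);
    my $pattern = shift(@_);
    my $count   = 0;
    while ($seq =~ m/$pattern/g) {
        $count++;
    }
    return $count;
}

# Assumed behavior: print $seq in lines of at most $width characters
sub printSequence {
    my $seq   = shift(@_);
    my $width = shift(@_);
    for (my $pos = 0; $pos < length($seq); $pos += $width) {
        print substr($seq, $pos, $width), "\n";
    }
}

my $str = "ACGTCCGGAT";   # made-up sequence
print countNucleotides($str, "C"), "\n";
print countNucleotides($str, "[CG]"), "\n";
printSequence($str, 4);
```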
Variable Scope
• Variables exist from when they are declared (“my”) until the end of the block (closing brace).
• Variables in subroutines exist only during the subroutine
• Each call to a subroutine re-initializes the variables
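A small experiment makes these scope rules concrete. The subroutine name count_call and the variable names are hypothetical, chosen only for this sketch.

```perl
use strict;
use warnings;

sub count_call {
    my $calls = 0;   # created fresh and set to 0 on every call
    $calls++;
    return $calls;
}

my $first  = count_call();
my $second = count_call();
print "$first $second\n";   # both are 1, because $calls does not survive between calls

my $outer = "declared outside the block";
{
    my $inner = "declared inside the block";
    print "$inner\n";       # fine: still inside the braces
    print "$outer\n";       # fine: $outer's block has not ended yet
}
# $inner no longer exists here; with "use strict", mentioning it would be a compile error
```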
Files and Programs
• Files are stored on the computer’s hard drive and maintained by the operating system.
• Programs are connected to files via special subroutines
  – "open" creates a file handle
  – "close" releases the file (important!)
Basic File Manipulation
• Open a file and read
  – my $HANDLE;
  – open($HANDLE, '<', $filename);
  – $line = <$HANDLE>;
• Open a file and write
  – my $HANDLE;
  – open($HANDLE, '>', $filename);
  – print $HANDLE "Hello world!";
• Close a file
  – close($HANDLE);
Allowing for Errors
• If you try to read a file that doesn't exist, or write a file that you are not allowed to create or change, the open() command will return false
  – Opening an existing file with '>' does not fail; it erases the file's old contents
• The rest of your program won't work.
• To fix this, add:
  or die("some message $file: $!")
  to the end of the command
  ($! contains the system error message)
Complete Open Examples
open($HANDLE, '<', $filename) or
    die("Cannot open file: $filename: $!");
open($HANDLE, '>', $filename) or
    die("Cannot write file: $filename: $!");
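The open, print, read and close steps can be exercised end to end as follows. This sketch assumes a scratch file created with the core File::Temp module stands in for a real data file; the final part shows what a failed open looks like without stopping the program.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption: a temporary scratch file stands in for your own data file
my ($tmp, $filename) = tempfile(UNLINK => 1);
close($tmp);

# Write one line to the file
my $HANDLE;
open($HANDLE, '>', $filename) or
    die("Cannot write file: $filename: $!");
print $HANDLE "Hello world!\n";
close($HANDLE);

# Read the line back
open($HANDLE, '<', $filename) or
    die("Cannot open file: $filename: $!");
my $line = <$HANDLE>;
close($HANDLE);
chomp $line;
print "read back: $line\n";

# A failed open returns false and puts the reason in $!
my $missing = "$filename.does_not_exist";
my $opened  = open(my $BAD, '<', $missing);
if (!$opened) {
    print "Cannot open file: $missing: $!\n";
}
```

In a real program the last open would normally use "or die" as well; it is tested with "if" here only so the message can be seen while the script keeps running.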
Reading lines
• Subroutine chomp removes the '\n' character at the end of a line
• $line = <$HANDLE> puts the next line in $line
• When there are no more lines, the result is false
• Example: put the whole file in one sequence
  while ($line = <$HANDLE>) {
    chomp $line;
    $seq = $seq . $line;
  }
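The loop can be tried without creating a file first, because Perl can open a file handle on a string. The sketch below uses that feature as a stand-in for a file on disk; the three lines of sequence are made up.

```perl
use strict;
use warnings;

# Stand-in for a file on disk: a string opened as an in-memory file
my $text = "ACGT\nGGCC\nTTAA\n";
open(my $HANDLE, '<', \$text) or
    die("Cannot open in-memory file: $!");

my $seq = "";
while (my $line = <$HANDLE>) {
    chomp $line;            # remove the trailing "\n"
    $seq = $seq . $line;    # append this line to the sequence
}
close($HANDLE);

print "$seq\n";
```

With a real file, only the open line changes: pass the file name instead of \$text.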
Printing to a file
• The print commands (print and printf) can optionally be followed with a file handle before the string to print
• Examples:
  – print $HANDLE "Hello\n";
  – printf $HANDLE "GC percent is %.1f\n", $GCcount * 100.0 / $total;
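The printf example needs values for $GCcount and $total. The sketch below assumes a short made-up sequence, counts G and C with tr (which returns the number of matching characters), and writes to an in-memory file handle so the output can be examined; a handle opened on a real file name works the same way.

```perl
use strict;
use warnings;

my $seq     = "GGCCATAT";            # made-up sequence
my $GCcount = ($seq =~ tr/GCgc//);   # tr with an empty replacement only counts
my $total   = length($seq);

# Stand-in for an output file: a string opened for writing
my $report = "";
open(my $HANDLE, '>', \$report) or
    die("Cannot write in-memory file: $!");
printf $HANDLE "GC percent is %.1f\n", $GCcount * 100.0 / $total;
close($HANDLE);

print $report;
```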
ReadInDNA
• Subroutine to read FASTA formatted file (p. 141)
• Returns sequence as one long string
• Removes whitespace, lines that begin with # (comments), and all digits
FASTA File Format
• One header line, begins with >
• Many lines of text, sometimes capitalized, sometimes with spaces after every n characters
• (ReadInDNA handles these variations)
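The cleanup steps described for ReadInDNA can be sketched as follows. This is not the textbook's code: read_fasta_sketch is a hypothetical name, it takes an already opened file handle (an assumption), and it also skips the > header line. The input is a made-up record read from a string instead of a downloaded file.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ReadInDNA: returns the sequence as one long string
sub read_fasta_sketch {
    my $handle = shift(@_);
    my $seq    = "";
    while (my $line = <$handle>) {
        chomp $line;
        next if $line =~ m/^>/;    # skip the header line
        next if $line =~ m/^#/;    # skip comment lines
        $line =~ s/[\s\d]//g;      # remove whitespace and digits
        $seq = $seq . $line;
    }
    return $seq;
}

# Made-up FASTA-style text with a header, a comment, digits and spaces
my $fasta = ">demo sequence (made up)\n# a comment line\n1 acgt acgt\n9 GGCC\n";
open(my $HANDLE, '<', \$fasta) or
    die("Cannot open in-memory file: $!");
my $dna = read_fasta_sketch($HANDLE);
close($HANDLE);

print "$dna\n";
```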
Getting a FASTA File
• Go to NCBI http://www.ncbi.nlm.nih.gov/
• Search for what you want and download the file to your current machine
• Send the file to your directory on cs.hiram.edu
(Demo to be provided)
Assignment
• Using subroutines from your text, determine the GC content of the given genomes.
(Examples to be provided)