Perl for Bioinformatics Part 2

Stuart Brown

NYU School of Medicine

Sources

• Beginning Perl for Bioinformatics – James Tisdall, O’Reilly Press, 2000

• Using Perl to Facilitate Biological Analysis in Bioinformatics: A Practical Guide (2nd Ed.) – Lincoln Stein, Wiley-Interscience, 2001

• Introduction to Programming and Perl – Alan M. Durham, Computer Science Dept., Univ. of São Paulo, Brazil

Debugging

• Hopefully you were lucky enough to have some bugs in your programs from the first Perl exercise.

• Test each line as you write – insert extra print statements to check the values of your variables

Perl Debugging Help

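• Add -w on the first line of your programs:

    #!/usr/local/bin/perl -w

– turns on warnings (for example, when a variable is used before it has a value)
– the path after #! must match where perl is installed on your system

• Add use strict as the 2nd line of your programs:

    use strict;

– requires every variable to be declared (with my) before it is used, which catches misspelled variable names
– it is still good practice to initialize variables before using them (set to some initial value such as 0 or empty)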
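A minimal sketch of what these two settings catch. The variable names are made up for illustration, and use warnings is the in-script equivalent of putting -w on the first line.

```perl
#!/usr/bin/perl
use strict;      # every variable must be declared with 'my' before use
use warnings;    # same effect as -w on the first line

my $count = 0;               # declared and initialized
$count = $count + 1;
print "count is $count\n";

# With 'use strict', a misspelled name stops the program before it runs:
# $coutn = $count + 1;   # error: Global symbol "$coutn" requires explicit package name

# With warnings on, arithmetic on a variable that has no value is reported:
my $total;                   # declared, but no value yet
# print $total + 5;          # warning: Use of uninitialized value $total in addition (+)
$total = 0;                  # give it a starting value first
print "total is ", $total + 5, "\n";
```

Removing the # in front of either commented-out line shows the corresponding message.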

Variable “Interpolation”

• A variable holds a value:

    $value = 6;

• When you print the variable, Perl gives the value rather than the name of the variable:

    print $value;

    6

• If you put a variable inside double quotes, Perl substitutes the value (this is called variable interpolation):

    print "The result is $value\n";

    The result is 6

• If you use single quotes, the text is printed exactly as typed (no interpolation, and \n is not turned into a newline):

    print 'The result is $value\n';

    The result is $value\n

Input

• A Perl program can take input from the keyboard
– The angle bracket operator (<>) takes input (it reads from the keyboard when no file names are given on the command line)
– Usually this is assigned to a variable

    print "Please type a number: ";
    $num = <>;
    print "Your number is $num\n";

chomp

• When data is entered from the keyboard, Perl waits for the Enter key to be typed

• But the string that is captured includes a newline character (from the Enter key) at its end

• Perl uses the function chomp to remove the newline character:

    print "Enter your name: ";
    $name = <>;
    print "Hello $name, happy to meet you!\n";    # newline still attached: the line breaks after the name
    chomp $name;
    print "Hello $name, happy to meet you!\n";    # prints on one line

Working with Text Files

• To do real work, Perl has to read data out of text files and write results into output files

• This is done in two steps

• First, you must give the file a name within the script - this is known as a filehandle

• Use the open command:

    open FILE1, '/u/schmoj01/Seqs/protein1.seq';

Read From the File

• Once the file is open, you can read from it using the <> operator (put the filehandle between the angle brackets)

• Perl reads files one line at a time. Each time you read from the filehandle, the next line is returned:

    open FILE1, '/u/prot1.seq';
    $line1 = <FILE1>;
    chomp $line1;
    $line2 = <FILE1>;
    …etc

Write to a File

• Writing to a file is similar to reading from it

• Put > in front of the file name to open a file for writing:

    open FILE1, '>/u/prot1.seq';

• This creates a new file with that name, or overwrites an existing file

• Use >> to append text to an existing file

• print to the file using the filehandle:

    print FILE1 $data1;
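The difference between > and >> can be sketched as follows. This is an illustration under assumptions, not the slides' own code: it uses the three-argument form of open with a filehandle stored in a variable (a style that is often recommended), and a temporary file from the core File::Temp module stands in for a real output path.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical output file: a temporary file stands in for a real path.
my ($tmp_fh, $outfile) = tempfile();
close $tmp_fh;

# '>' creates the file, or overwrites it if it already exists
open(my $out, '>', $outfile) or die "Cannot write to $outfile: $!";
print $out "first run\n";
close $out;

# '>>' adds to the end of the existing file
open($out, '>>', $outfile) or die "Cannot append to $outfile: $!";
print $out "second run\n";
close $out;

# Read the file back to confirm that both lines are there
open(my $in, '<', $outfile) or die "Cannot read $outfile: $!";
my @lines = <$in>;
close $in;
print @lines;

unlink $outfile;    # remove the temporary file
```

If the second open used '>' instead of '>>', only "second run" would be left in the file.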

Making Decisions

• Useful programs must be able to make some decisions on their own

• The if statement is very powerful

• It is generally used together with numerical or string comparison operators

numerical: ==, !=, >, <, >=, <=

strings: eq, ne, gt, lt, ge, le
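The two sets of operators can give different answers for the same values, which is why it matters which one you pick. A small sketch with made-up values:

```perl
use strict;
use warnings;

my $x = 10;
my $y = 9;

# Numerical comparison: 10 is greater than 9
print "10 >  9 is true\n" if $x > $y;     # prints

# String comparison works character by character, like alphabetical order:
# "10" comes before "9" because "1" comes before "9"
print "10 gt 9 is true\n" if $x gt $y;    # does not print
print "10 lt 9 is true\n" if $x lt $y;    # prints

# Bases and sequence names are text, so compare them with eq / ne
my $base = "G";
print "found a G\n" if $base eq "G";      # prints
```

As a rule of thumb, use the symbol operators for numbers and the letter operators for text.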

True/False

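• Perl relies on the concept of True/False decisions.

• A comparison is true if the relationship it states holds: 5 > 3 is true, 5 < 3 is false.

• For a single value, the number 0, the string "0", the empty string, and an undefined value count as false; other values are true.

• The not operator ! reverses it

    print "non-negative number" if ! ($a < 0);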
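A short sketch of how Perl treats plain values as true or false in a test. The variable names and the gene name are made up for illustration.

```perl
use strict;
use warnings;

my $sequence = "";
print "sequence is empty\n" if ! $sequence;    # the empty string is false

my $count = 0;
print "count is zero\n" if ! $count;           # 0 is false

my $name = "BRCA1";
print "name is set\n" if $name;                # a non-empty string is true

my $length = 5 - 5 + 3;
print "length is non-zero\n" if $length;       # any non-zero number is true
```

All four lines print, because each test comes out true once ! has reversed the false values.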

Conditional Blocks

• An if test can be used to control multiple lines of commands:

    print "Enter your age: ";
    $age = <>;
    chomp $age;
    if ($age < 21) {
        print "You are too young for this kind of work!\n";
        die "too young";
    }
    print "You are old enough to know better!\n";

• If the test is true, execute all the command lines inside the {} brackets. If not, then go on past the closing } to the statements below.

• if evaluates the expression in parentheses (it must come out true or false)

• Note: the conditional block is indented
– Perl doesn’t care about indents, but they make your code more human readable

• die is a special function: it stops your script and prints its message
– Often used to test if keyboard input data is valid or if an input file exists.
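Checking for an input file before reading it might look like the sketch below. The file name is hypothetical and chosen so that it should not exist. The -e file test and the $! variable are standard Perl, but this is only one way to do the check.

```perl
use strict;
use warnings;

# Hypothetical file name, chosen so that it should not exist.
my $file = 'no_such_file_for_this_example.seq';

if (-e $file) {                       # -e is true if the file exists
    # open returns false on failure, so 'or die' stops the script there;
    # $! holds the system's reason for the failure
    open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
    print "Opened $file\n";
    close $fh;
} else {
    print "Cannot find $file, nothing to read\n";
}
```

Many scripts skip the -e test and rely on open ... or die alone, since that also covers files that exist but cannot be read.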

Else & Elsif

• Instead of just letting the script go on if it fails the if test, you can designate a second block of code for the “or else” condition

• You can also perform multiple tests using elsif (note the spelling: elsif, not elseif)

    if ($A == 10) {
        print "yadda yadda";             # do some stuff
    } elsif ($A > 10) {
        print "yowsa yowsa";             # do different stuff
    } elsif ($A < 10) {
        print "do this other stuff";
    } else {
        print "if it ain't =, >, or <, then I'm stumped";
        die "not a number";
    }

• Each test goes in parentheses, and == (not =) tests for equality; a single = would assign 10 to $A

Loops

• OK, we’ve got variables, input & output and decisions. Now we need Loops.

• Loops test a condition and repeat a block of code based on the result
– while loops repeat while the condition is true

    $count = 1;
    while ($count <= 10) {
        print "$count bottles of pop\n";
        $count = $count + 1;
    }
    print "POP!\n";

[Try this program yourself]

Read a File: line by line

    open FILE1, '/u/doej01/prot1.seq';
    while ($line = <FILE1>) {
        chomp($line);
        $my_sequence = $my_sequence . $line;
    }
    close FILE1;

• Reads the whole file into the variable $my_sequence, one line at a time, with the newlines removed
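The same loop can be tried without a sequence file on disk. In this sketch, a string with made-up sequence lines stands in for the file (Perl can open a filehandle on a string), and a line counter is added for illustration.

```perl
use strict;
use warnings;

# Stand-in for a sequence file: a filehandle opened on a string keeps
# this example self-contained. The sequence is made up.
my $file_contents = "ATGGCC\nTTAGGC\nCCGTAA\n";
open(my $fh, '<', \$file_contents) or die "Cannot open in-memory file: $!";

my $my_sequence = '';          # start with an empty string
my $line_count  = 0;
while (my $line = <$fh>) {
    chomp $line;               # drop the newline before joining
    $my_sequence = $my_sequence . $line;
    $line_count  = $line_count + 1;
}
close $fh;

print "Read $line_count lines\n";
print "Sequence: $my_sequence\n";
print "Length: ", length($my_sequence), "\n";
```

The loop ends on its own when there are no more lines to read, because the read then returns an undefined value and the while test fails.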

Arrays

• It is awkward to store a large DNA sequence in one variable, or to create many variables for a list of numbers

• Perl has a type of variable called an “array” that can store a list of data
– multiple lines of a text file
– a list of numbers
– a list of words

• Array variables are referred to with an “@” symbol

    @numbers = (1,2,45,234,11);

Bioinformatics Uses Arrays

• Bioinformatics data often comes in the form of arrays
– tab delimited lists
– multi-line text files

• Arrays are handy because the entries are indexed
– You can grab the fourth number directly

    @numbers = (1, 2, 45, 234, 11);
    print "$numbers[3]\n";

    234

    # Note - the index starts with zero, so $numbers[3] is the fourth entry!
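A tab delimited line can be turned into an array with Perl's built-in split function, which these slides do not cover. The record below is made up for illustration: name, chromosome, start, end.

```perl
use strict;
use warnings;

# A made-up tab-delimited record: gene name, chromosome, start, end
my $record = "geneA\tchr1\t1500\t2300";

my @fields = split(/\t/, $record);   # one array element per column

print "Name:  $fields[0]\n";         # first element has index 0
print "Start: $fields[2]\n";         # third element has index 2

my $gene_length = $fields[3] - $fields[2];
print "Length: $gene_length\n";

my $how_many = scalar(@fields);      # number of elements in the array
print "Columns: $how_many\n";
```

Because the entries are indexed, each column can be picked out by position without looping over the whole line.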

Read a File into an Array

• Rather than read a file one line at a time into a scalar variable, it is often helpful to read the entire file into an array (one line per element)

    open FILE1, '/u/doej01/prot1.seq';
    @DNA = <FILE1>;

• join combines the elements of an array into a single scalar variable (a string)

    $DNA = join('', @DNA);

• substr takes characters out of a string

    $letter = substr($DNA, $position, 1);

join & substr

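    $letter = substr($DNA, $position, 1);

– $DNA: which string
– $position: where in the string
– 1: how many letters to take

    $DNA = join('', @DNA);

– '': spacer (empty here)
– @DNA: which array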
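A worked example of the two functions together, using two made-up lines in place of a real file. The chomp step is an addition here: it removes the newlines so that they do not end up inside the joined string.

```perl
use strict;
use warnings;

# Made-up lines, as they would look after reading a file into an array
my @DNA = ("ATGC\n", "GGTA\n");
chomp @DNA;                          # chomp works on every element of an array

my $DNA = join('', @DNA);            # "ATGCGGTA"

my $position = 2;                    # positions count from 0, like array indexes
my $letter   = substr($DNA, $position, 1);
print "Letter at position $position: $letter\n";

my $first_three = substr($DNA, 0, 3);
print "First three letters: $first_three\n";
```

Note that @DNA (the array) and $DNA (the string) are two separate variables, even though they share a name.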

Exercise

• Read a DNA sequence from a text file

• Calculate the %GC content

• What about non-DNA characters in the file?
– carriage returns and blank spaces
– N’s or X’s or unexpected letters

• Write the output to the screen and to a file
– use append so that the file will grow as you run this program on additional sequences
