Page 1: Perl for Bioinformatics Part 2

Perl for Bioinformatics Part 2

Stuart Brown

NYU School of Medicine

Page 2: Perl for Bioinformatics Part 2

Sources

• Beginning Perl for Bioinformatics
  – James Tisdall, O’Reilly Press, 2000

• Using Perl to Facilitate Biological Analysis, in Bioinformatics: A Practical Guide (2nd Ed.)
  – Lincoln Stein, Wiley-Interscience, 2001

• Introduction to Programming and Perl
  – Alan M. Durham, Computer Science Dept., Univ. of São Paulo, Brazil

Page 3: Perl for Bioinformatics Part 2

Debugging

• Hopefully you were lucky enough to have some bugs in your programs from the first Perl exercise.

• Test each line as you write: insert extra print statements to check the values of your variables

Page 4: Perl for Bioinformatics Part 2

Perl Debugging Help

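• Add -w on the first line of your programs:
    #!/usr/bin/perl -w
  – provides ‘warnings’ (the path after #! must be wherever perl is installed on your system)

• Add use strict; as the 2nd line of your programs
  – requires every variable to be declared (with my) before it is used, which catches misspelled variable names
  – with warnings on, Perl also reports variables that are used before they hold a value, so set them to some initial value such as 0 or empty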
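A minimal sketch of a script header that follows this advice. The variable names are made up for illustration, and use warnings is shown as the in-file alternative to the -w switch.

```perl
#!/usr/bin/perl
use strict;      # every variable must be declared with "my" before use
use warnings;    # in-file alternative to the -w switch

my $count = 0;               # declared and given an initial value
$count = $count + 1;

# Misspelling the name, e.g. "$cuont = 5;", would stop the script at
# compile time with a 'Global symbol "$cuont" requires explicit
# package name' error instead of silently creating a new variable.

my $name;                    # declared, but still undefined
my $label = defined $name ? $name : 'unknown';
print "count=$count label=$label\n";
```

The defined test is one way to handle a variable that has not been given a value yet without triggering an "uninitialized value" warning.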

Page 5: Perl for Bioinformatics Part 2

Variable “Interpolation”

• A variable holds a value:
    my $value = 6;

• When you print the variable, Perl gives the value rather than the name of the variable:
    print $value;
    6

• If you put a variable inside double quotes, Perl substitutes the value (this is called variable interpolation):
    print "The result is $value\n";
    The result is 6

• If you use single quotes, the text is printed exactly as typed (no interpolation, and \n is not turned into a newline):
    print 'The result is $value\n';
    The result is $value\n

Page 6: Perl for Bioinformatics Part 2

Input

• A Perl program can take input from the keyboard
  – The angle bracket operator (<>) takes input
  – Usually this is assigned to a variable

    print "Please type a number: ";
    my $num = <>;
    print "Your number is $num\n";

Page 7: Perl for Bioinformatics Part 2

chomp

• When data is entered from the keyboard, Perl waits for the Enter key to be typed

• But the string which is captured includes a newline character at its end

• Perl uses the function chomp to remove the newline character:

    print "Enter your name: ";
    my $name = <>;
    print "Hello $name, happy to meet you!\n";    # newline still attached: the line breaks after the name
    chomp $name;
    print "Hello $name, happy to meet you!\n";    # now prints on one line
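The effect of chomp can be checked without typing anything by starting from a string that already ends in a newline. This is a small sketch; the name and variables are invented for the example.

```perl
use strict;
use warnings;

my $name    = "Rosalind\n";     # simulates a line typed at the keyboard
my $before  = length $name;     # 9: eight letters plus the newline
my $removed = chomp $name;      # chomp returns how many characters it removed
my $after   = length $name;     # 8

chomp $name;                    # a second chomp does nothing: no newline is left
print "Hello $name, happy to meet you! (length $before -> $after)\n";
```

Because chomp only removes a trailing newline, calling it on a string that has already been chomped is harmless.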

Page 8: Perl for Bioinformatics Part 2

Working with Text Files

• To do real work, Perl has to read data out of text files and write results into output files

• This is done in two steps

• First, you must open the file and give it a name within the script; this name is known as a filehandle

• Use the open command:

    open FILE1, '/u/schmoj01/Seqs/protein1.seq';

Page 9: Perl for Bioinformatics Part 2

Read From the File

• Once the file is open, you can read from it using the <> operator (put the filehandle between the angle brackets)

• Perl reads files one line at a time. Each time you input data from the file, the next line is read:

    open FILE1, '/u/prot1.seq';
    my $line1 = <FILE1>;
    chomp $line1;
    my $line2 = <FILE1>;
    …etc

Page 10: Perl for Bioinformatics Part 2

Write to a File

• Writing to a file is similar to reading from it

• Use the > operator to open a file for writing:

    open FILE1, '>/u/prot1.seq';

• This creates a new file with that name, or overwrites an existing file

• Use >> to append text to an existing file

• print to the file using the filehandle:

    print FILE1 $data1;
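The open, write, append and read steps can be tried end to end on a scratch file. The sketch below makes several choices that are not in the slides: it uses the three-argument form of open with a lexical filehandle (a common modern style), checks each open with die, and takes its scratch file from the core File::Temp module. The sequence strings are placeholder data.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration; substitute your own path.
my (undef, $path) = tempfile(UNLINK => 1);

# '>' creates the file or overwrites an existing one.
open(my $out, '>', $path) or die "Cannot write $path: $!";
print $out "MKTAYIAKQR\n";
close $out;

# '>>' appends to the end of the existing file.
open($out, '>>', $path) or die "Cannot append to $path: $!";
print $out "QISFVKSHFS\n";
close $out;

# '<' opens for reading; each <$in> returns the next line.
open(my $in, '<', $path) or die "Cannot read $path: $!";
my $line1 = <$in>;
chomp $line1;
my $line2 = <$in>;
chomp $line2;
close $in;

print "$line1 / $line2\n";
```

Checking the result of open matters because a failed open is otherwise silent: later reads simply return nothing.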

Page 11: Perl for Bioinformatics Part 2

Making Decisions

• Useful programs must be able to make some decisions on their own

• The if statement is very powerful

• It is generally used together with numerical or string comparison operators

numerical: ==, !=, >, <, >=, <=

strings: eq, ne, gt, lt, ge, le
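The two operator families give different answers on the same data, which is a common source of bugs. A short sketch with made-up values:

```perl
use strict;
use warnings;

# Numeric operators compare values; string operators compare characters.
my $num_equal = ('10' == '10.0') ? 1 : 0;   # 1: both are the number ten
my $str_equal = ('10' eq '10.0') ? 1 : 0;   # 0: the characters differ
my $num_less  = (9 < 10)         ? 1 : 0;   # 1: nine is less than ten
my $str_less  = ('9' lt '10')    ? 1 : 0;   # 0: the character "9" sorts after "1"
my $dna_equal = ('ATG' eq 'ATG') ? 1 : 0;   # 1: identical strings

print "== $num_equal, eq $str_equal, < $num_less, lt $str_less, codon $dna_equal\n";
```

As a rule of thumb, use the symbol operators for numbers and the letter operators for text such as sequences and names.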

Page 12: Perl for Bioinformatics Part 2

True/False

• Perl relies on the concept of True/False decisions.

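• A comparison is true if it holds (3 < 5 is true, 3 > 5 is false). For other values, the number 0, the strings '' and '0', and undefined values are false; everything else is true.

• The not operator ! reverses it

print "not a negative number" if ! ($a < 0);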
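A small sketch for exploring which values Perl treats as true. The helper name truth is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: reports how Perl treats a value in a true/false test.
sub truth {
    my ($value) = @_;
    return $value ? 'true' : 'false';
}

print "0     is ", truth(0),     "\n";   # false
print "''    is ", truth(''),    "\n";   # false
print "'0'   is ", truth('0'),   "\n";   # false
print "'0.0' is ", truth('0.0'), "\n";   # true: a non-empty string other than '0'
print "-1    is ", truth(-1),    "\n";   # true: any non-zero number
print "undef is ", truth(undef), "\n";   # false
print "!0    is ", truth(!0),    "\n";   # true: ! reverses the result
```

The '0.0' case is worth remembering when testing text read from a file: compare it numerically (== 0) if you mean the number zero.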

Page 13: Perl for Bioinformatics Part 2

Conditional Blocks

• An if test can be used to control multiple lines of commands:

    print "Enter your age: ";
    my $age = <>;
    chomp $age;
    if ($age < 21) {
        print "You are too young for this kind of work!\n";
        die "too young";
    }
    print "You are old enough to know better!\n";

• If the test is true, execute all the command lines inside the {} brackets. If not, then go on past the closing } to the statements below.

Page 14: Perl for Bioinformatics Part 2

• if evaluates an expression in parentheses (the result is treated as true or false)

• Note: the conditional block is indented
  – Perl doesn’t care about indents, but they make your code more human readable

• die is a special function: it stops your script and prints its message
  – Often used to test if keyboard input data is valid or if an input file exists

Page 15: Perl for Bioinformatics Part 2

Else & Elsif

• Instead of just letting the script go on if it fails the if test, you can designate a second block of code for the “or else” condition

• You can also perform multiple tests using elsif (note the spelling: Perl has no elseif). Each test goes in parentheses, == tests equality (= would assign), and else takes no test:

    if ($A == 10) {
        print "yadda yadda";            # do some stuff
    } elsif ($A > 10) {
        print "yowsa yowsa";            # do different stuff
    } elsif ($A < 10) {
        print "do this other stuff";
    } else {
        print "if it ain't =, >, or <, then I'm stumped";
        die "not a number";
    }

Page 16: Perl for Bioinformatics Part 2

Loops

• OK, we’ve got variables, input & output and decisions. Now we need Loops.

• Loops test a condition and repeat a block of code based on the result
  – while loops repeat while the condition is true

    my $count = 1;
    while ($count <= 10) {
        print "$count bottles of pop\n";
        $count = $count + 1;
    }
    print "POP!\n";

[Try this program yourself]

Page 17: Perl for Bioinformatics Part 2

Read a File: line by line

    open FILE1, '/u/doej01/prot1.seq';
    my $my_sequence = '';
    while (my $line = <FILE1>) {
        chomp($line);
        $my_sequence = $my_sequence . $line;
    }
    close FILE1;

• Reads the whole file into the variable $my_sequence as one long string, with the newlines removed

Page 18: Perl for Bioinformatics Part 2

Arrays

• It is awkward to store a large DNA sequence in one variable, or to create many variables for a list of numbers

• Perl has a type of variable called an “array” that can store a list of data
  – multiple lines of a text file
  – a list of numbers
  – a list of words

• Array variables are referred to with an “@” symbol

    my @numbers = (1, 2, 45, 234, 11);

Page 19: Perl for Bioinformatics Part 2

Bioinformatics Uses Arrays

• Bioinformatics data often comes in the form of arrays
  – tab delimited lists
  – multi-line text files

• Arrays are handy because the entries are indexed
  – You can grab any entry directly by its position

    my @numbers = (1, 2, 45, 234, 11);
    print "$numbers[3]\n";
    234

  # Note - the index starts with zero, so $numbers[3] is the fourth entry and $numbers[2] (45) is the third
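A short sketch of array indexing using the same list of numbers. The variable names are illustrative; negative indexes and $#numbers are standard Perl features that the slides do not cover.

```perl
use strict;
use warnings;

my @numbers = (1, 2, 45, 234, 11);

my $first      = $numbers[0];        # 1: indexes start at zero
my $third      = $numbers[2];        # 45
my $last       = $numbers[-1];       # 11: negative indexes count from the end
my $count      = scalar @numbers;    # 5: number of entries
my $last_index = $#numbers;          # 4: always one less than the count

# Visit every entry with a while loop and an index.
my $total = 0;
my $i     = 0;
while ($i < $count) {
    $total = $total + $numbers[$i];
    $i = $i + 1;
}
print "first=$first third=$third last=$last count=$count total=$total\n";
```

Note that a single entry is written with $ (it is one scalar value) even though the whole array is written with @.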

Page 20: Perl for Bioinformatics Part 2

Read a File into an Array

• Rather than read a file one line at time into a scalar variable, it is often helpful to read the entire file into an array

    open FILE1, '/u/doej01/prot1.seq';
    my @DNA = <FILE1>;    # each line of the file becomes one element of @DNA

Page 21: Perl for Bioinformatics Part 2

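join & substr

• join combines the elements of an array into a single scalar variable (a string)

    my $DNA = join('', @DNA);

  – first argument: the spacer placed between the elements (empty here)
  – second argument: which array

• substr takes characters out of a string

    my $letter = substr($DNA, $position, 1);

  – first argument: which string
  – second argument: where in the string (positions count from zero)
  – third argument: how many letters to take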
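A minimal sketch that combines join and substr on two made-up sequence lines standing in for the contents of a file. Calling chomp on the whole array is an addition here: without it, the newlines at the end of each line would end up inside the joined string.

```perl
use strict;
use warnings;

my @DNA = ("ATGC\n", "GGTA\n");     # stand-in for lines read from a file
chomp @DNA;                          # chomp works on every element of an array

my $DNA    = join('', @DNA);         # "ATGCGGTA": empty spacer
my $tabbed = join("\t", @DNA);       # "ATGC<tab>GGTA": tab as the spacer
my $codon  = substr($DNA, 0, 3);     # "ATG": 3 letters starting at position 0

# Walk along the string one letter at a time and count the G's.
my $g_count  = 0;
my $position = 0;
while ($position < length $DNA) {
    my $letter = substr($DNA, $position, 1);
    if ($letter eq 'G') {
        $g_count = $g_count + 1;
    }
    $position = $position + 1;
}
print "$DNA has $g_count G's; first codon is $codon\n";
```

The same loop pattern works for any per-letter test, since substr with a length of 1 returns exactly one character.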

Page 22: Perl for Bioinformatics Part 2

Exercise

• Read a DNA sequence from a text file

• Calculate the %GC content

• What about non-DNA characters in the file?
  – carriage returns and blank spaces
  – N’s or X’s or unexpected letters

• Write the output to the screen and to a file
  – use append so that the file will grow as you run this program on additional sequences
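One possible starting point for the counting step only, leaving the file reading and the append-to-file output as the exercise. This is a sketch under assumptions, not the intended solution: the helper name gc_percent is invented, it uses the tr/// counting operator (not covered in these slides), and it chooses to ignore every character other than A, C, G and T when computing the percentage.

```perl
use strict;
use warnings;

# Hypothetical helper: percent G+C among the A, C, G and T letters of a
# string. Spaces, newlines, N's, X's and other letters are ignored.
# Returns undef if the string contains no A, C, G or T at all.
sub gc_percent {
    my ($seq) = @_;
    $seq = uc $seq;                  # accept lower-case sequence too
    my $gc = ($seq =~ tr/GC//);      # tr with an empty replacement only counts
    my $at = ($seq =~ tr/AT//);
    my $total = $gc + $at;
    return undef if $total == 0;     # avoid dividing by zero
    return 100 * $gc / $total;
}

my $raw = "atgc GGNN\nxTA\n";        # made-up test data with junk characters
my $pct = gc_percent($raw);
printf "GC content: %.1f%%\n", $pct;
```

Whether ambiguous bases such as N should be ignored or counted in the total is a design decision; ignoring them, as here, is only one reasonable option.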