
Bioinformatica 29-09-2011-p1-introduction


Bioinformatica Practicum I



FBW 30-09-2010

Wim Van Criekinge


Bioinformatics Practical

• Practical
  – Introduction to Perl
  – Write your first Perl program!
  – Execute your first.pl


What is Perl?

• Perl is a high-level scripting language
• Larry Wall created Perl in 1987
  – Practical Extraction and Report Language
  – (or Pathologically Eclectic Rubbish Lister)
• Born from a system administration tool
• Faster than sh or csh
• Slower than C
• No need for sed, awk, tr, wc, cut, …
• Perl is open and free
• http://conferences.oreillynet.com/eurooscon/


What is Perl? (continued)

• Perl is available for most computing platforms: all flavors of UNIX (Linux), MS-DOS/Win32, Macintosh, VMS, OS/2, Amiga, AS/400, Atari
• Perl is a computer language that is:
  – Interpreted: compiled at run time (so perl.exe is needed!)
  – Loosely "typed"
  – String/text oriented
  – Capable of using multiple syntax formats
• In Perl, "there's more than one way to do it"


Why use Perl for bioinformatics?

• Ease of use by novice programmers
• Flexible language: fast software prototyping (quick and dirty creation of small analysis programs)
• Expressiveness: compact code, "Perl poetry": @{$_[$#_]||[]}
• Glutility: read disparate files and parse the relevant data into a new format
• Powerful pattern matching via "regular expressions" (Best Regular Expressions on Earth)
• With the advent of the WWW, Perl has become the language of choice for Common Gateway Interface (CGI) scripts that handle form submissions and for compute servers on the WWW
• Open source and free. Perl modules are available for bioinformatics and the Internet


Why NOT use Perl for bioinformatics?

• Some tasks are still better done with other languages (heavy computations, graphics)
  – C(++), C#, Fortran, Java (Pascal, Visual Basic)
• With Perl you can write simple programs fast, and it is also suitable for large and complex programs, yet it is not adequate for very large projects
  – Python
• Larry Wall: "For programmers, laziness is a virtue"


What bioinformatics tasks are suited to Perl?

• Sequence manipulation and analysis
• Parsing results of sequence analysis programs (Blast, Genscan, Hmmer etc.)
• Parsing database (e.g. Genbank) files
• Obtaining multiple database entries over the Internet
• …


Examples of problems we will be solving

• Primary sequence analysis
• Performing alignments
• Simulation experiments to explain Blast statistics
• Predicting protein topology
• Predicting secondary structures
• "Real-life" problems
  – Proteomics: given amino acid masses, find the protein in the database
  – …


Perl installation

• Perl (on CD-ROM):
  – Perl is available for various operating systems. To download Perl and install it on your computer, have a look at the following resources:
  – www.perl.com (O'Reilly)
    • Downloading Perl Software
  – ActiveState: ActivePerl for Windows, as well as for Linux and Solaris
    • ActivePerl binary packages
  – CPAN
• PHPTriad:
  – contains Apache/PHP and MySQL: http://sourceforge.net/projects/phptriad


Check installation

• Command-line flags for perl
  – perl -v
    • Gives the current version of Perl
  – perl -e
    • Executes Perl statements from the command line
    • perl -e "print 42;"
    • perl -e "print \"Two\nlines\n\";"
  – perl -we
    • Executes the statements and prints warnings
    • perl -we "print 'hello'; $x++;"


How to enter your first program?

• Use an editor
  – DOS: EDIT
  – Windows:
    • NOTEPAD (careful!)
    • Word(Pad) -> save as TEXT FILE
  – TextPad and/or VIM
  – SciTE: http://www.scintilla.org/SciTE.html


The Command Prompt Text Editor

To start the DOS editor, type EDIT at the command prompt.

The Edit text editor:
• Is a command-line interface text editor
• Is not a word processor
• Cannot format data in documents
• Cannot manipulate the environment


Some MS-DOS commands

CD - change directory

DIR myfile.* - lists every file named myfile, with ANY extension

DIR *file.dat - lists files whose names begin with any characters, end in "file", and have a .dat extension

DIR *.* - lists ALL files in the current directory


Review of File-Naming Rules

Program files:
• Are named by the programmer
• Commonly have .COM, .EXE, or .BAT extensions
• Are the files that can be executed without typing the extension


The COPY Command

Conceptually the syntax is:

COPY source destination

For example:

COPY myfile.doc yourfile.doc

This makes a duplicate of the source file, myfile.doc, under the name yourfile.doc.


DOSKEY

DOSKEY:
• Recalls and edits command lines
• Keeps a command history
• Can be used to write a macro: record the keystrokes that perform a series of operations, then copy the "history" to a file and execute it at a later date


Brief Introduction to Subdirectories: The Path

Path: the route followed by the OS to locate, save, and/or retrieve a file


The absolute path problem …

• Problem
  – Either you cannot start perl
  – Or the script cannot be found
  – Or a file needed by the script cannot be found
• Solution
  – Don't panic!
  – Use absolute path names
    • D:\Perl\bin\perl.exe D:\temp\Test.pl
  – Note that inside your script you must "escape" the backslash
    • $filename = "d:\\Temp\\pdb.fasta";
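The escaping rule for backslashes in path names can be tried out with a short script. This is only a minimal sketch: the variable names $double and $single are made up for the illustration, and the path is the one from the slide. It shows that a doubled backslash in double quotes and a single backslash in single quotes give the same string.

```perl
use strict;
use warnings;

# In double quotes a backslash starts an escape sequence, so it must be doubled.
my $double = "d:\\Temp\\pdb.fasta";

# In single quotes a backslash is literal (unless it precedes ' or \).
my $single = 'd:\Temp\pdb.fasta';

print "$double\n";                          # d:\Temp\pdb.fasta
print "same path\n" if $double eq $single;  # same path
```

Single quotes are often the more readable choice for Windows paths, as long as no variable needs to be interpolated into the string.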


The absolute path problem … (continued)

• Solutions (II)
  – Copy all the files into the same directory!
  – If you start perl from D:\Perl\bin you can refer to D:\Temp\test.pl, but then $filename must also use an absolute path, or else you must copy pdb.fasta to D:\Perl\bin
  – Adjust the search path so that you can start perl from anywhere
    • path (shows the search path)
    • set path (changes the path, careful!). Use the DOS environment variable %path% to append a directory
    • set path=%path%;d:\Perl\bin
    • (afterwards you can check the change by running "path")


Redirection

Keyboard: standard input device

Screen: standard output device

Redirection changes the output from the monitor to somewhere else (usually a file or printer).


Redirection: redirecting output to a file

The command dir > directfile.txt sends the output of the dir command to a text file, NOT to the screen. There is NO response on the screen. You can then print the contents of the file.


Perl


General Remarks

• Perl is mostly a free-format language: add spaces, tabs or new lines wherever you want.
• For clarity, it is recommended to write each statement on a separate line, and to use indentation in nested structures.
• Comments: anything from the # sign to the end of the line is a comment. (There are no multi-line comments.)
• A Perl program consists of all of the Perl statements of the file, taken collectively as one big routine to execute.


What does a real Perl program look like?

#!/usr/local/bin/perl
print "Hello everyone\n";

The #! line is the mandatory first line (on UNIX).

How to run it:

1. Save the text of your code as a file, e.g. program.pl

2. Execute it:

perl program.pl

Output:

Hello everyone


Three Basic Data Types

• Scalars - $

• Arrays of scalars - @

• Associative arrays of scalars, or hashes - %
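As an illustration of the three data types, here is a minimal sketch. The variable names ($gene, @bases, %complement) and their contents are invented for the example and do not come from the course material.

```perl
use strict;
use warnings;

my $gene       = "BRCA1";                                    # scalar: one value
my @bases      = ("A", "C", "G", "T");                       # array: ordered list of scalars
my %complement = (A => "T", C => "G", G => "C", T => "A");   # hash: key => value pairs

print "Gene: $gene\n";
print "Second base: $bases[1]\n";                  # arrays count from 0, so this is C
print "Number of bases: ", scalar(@bases), "\n";   # 4
print "Complement of G: $complement{G}\n";         # C
```

Note that a single element of an array or hash is itself a scalar, which is why it is written with $ rather than @ or %.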


2 + 2 = ?

$a = 2;
$b = 2;
$c = $a + $b;

$ - indicates a variable
; - ends every command
= - assigns a value to a variable

or:

$c = 2 + 2;
$c = 2 * 2;
$c = 2 / 2;
$c = 2 ** 4;    # 2 to the power 4 = 16 (in Perl, ^ is bitwise XOR, not a power)
$c = 1.35 * 2 - 3 / (0.12 + 1);
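A short script makes the difference between a power and the ^ operator visible. In Perl, exponentiation is written ** while ^ is the bitwise XOR operator. The variable names $power and $xor are chosen only for this sketch.

```perl
use strict;
use warnings;

my $power = 2 ** 4;   # exponentiation: 2 * 2 * 2 * 2
my $xor   = 2 ^ 4;    # bitwise XOR: 010 XOR 100 = 110

print "2 ** 4 = $power\n";   # 16
print "2 ^ 4  = $xor\n";     # 6

# Multiplication and division bind tighter than subtraction;
# parentheses force the addition to happen first.
my $c = 1.35 * 2 - 3 / (0.12 + 1);
print "c = $c\n";
```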


Ok, $c is 4. How do we know it?

The print command:

print "Hello \n";

$c = 4;
print "$c";

" " - enclose the output expression
\n - prints an end-of-line character (equivalent to pressing 'Enter')

String concatenation:

print "Hello everyone\n";
print "Hello" . " everyone" . "\n";

Expressions and strings together (here (2+2) is the expression):

print "2 + 2 = " . (2+2) . "\n";

Output:

2 + 2 = 4


Loops and cycles (for statement):

# Output all the numbers from 1 to 100
for ($n=1; $n<=100; $n+=1) {
    print "$n \n";
}

1. Initialization:
   for ( $n=1 ; ; ) { … }

2. Increment:
   for ( ; ; $n+=1 ) { … }

3. Termination (the loop keeps running while the condition is true):
   for ( ; $n<=100 ; ) { … }

4. Body of the loop, the commands inside the curly brackets:
   for ( ; ; ) { … }


FOR & IF -- all the even numbers from 1 to 100:

for ($n=1; $n<=100; $n+=1) {
    if (($n % 2) == 0) {
        print "$n\n";
    }
}

Note: $a % $b -- Modulus -- Remainder when $a is divided by $b
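Since in Perl "there's more than one way to do it", the same selection of even numbers can also be sketched with a range-based loop that collects the results in an array. The array name @even is an assumption made for this example.

```perl
use strict;
use warnings;

my @even;
for my $n (1 .. 100) {
    # $n % 2 is the remainder after division by 2: 0 for even numbers
    push @even, $n if $n % 2 == 0;
}

print scalar(@even), " even numbers, from $even[0] to $even[-1]\n";
# 50 even numbers, from 2 to 100
```

Collecting the numbers first, instead of printing inside the loop, makes it easy to reuse or count them afterwards.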


Two brief diversions (warnings & strict)

• use warnings;
• strict – forces you to 'declare' a variable the first time you use it
  – usage: use strict; (somewhere near the top of your script)
• declare variables with 'my'
  – usage: my $variable;
  – or: my $variable = 'value';
• my sets the 'scope' of the variable: the variable exists only within the current block of code
• use strict and my both help you to debug errors and help prevent mistakes
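The effect of my on the scope of a variable can be seen in a small sketch. The variable name $count is hypothetical. The inner block declares its own $count, which disappears when the block ends.

```perl
use strict;
use warnings;

my $count = 1;        # visible in the rest of the file

{
    my $count = 10;   # a new variable, visible only inside this block
    $count++;
    print "inside block: $count\n";    # 11
}

print "outside block: $count\n";       # 1, the outer variable was never changed
```

With use strict in place, misspelling the name (for example $cuont) would stop the program at compile time instead of silently creating a new, empty variable.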


Grabbing user input

#!...
use strict;
print "Enter a greeting: ";
my $greeting = <>;
print $greeting;

The <> operator is also called the "diamond operator". It reads what the user types at the keyboard and brings it into the program for use.


Example program: DNA-invoer.pl

#!e:\perl\bin\perl.exe -w
use strict;

print "Enter DNA:\n";
while (my $dna = <>) {
    chomp($dna);                   # remove the trailing newline
    my $l = length($dna);
    print "DNA: " . $dna . "\n";
    $dna =~ s/[^atcgATCG]//g;      # delete everything that is not a, t, c or g
    my $l2 = length($dna);
    if ($l2 < $l) {
        print "removed " . ($l - $l2) . " illegal characters\n";
    }
    else {
        print "OK\n";
    }
    print "Length of the DNA: " . $l2 . "\n";
}
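The cleaning step of this program can also be tried without typing input, by wrapping it in a subroutine and feeding it fixed strings. This is only a sketch: the subroutine name clean_dna and the two test sequences are invented for the illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the cleaned sequence and the number of
# characters that were removed because they are not a, t, c or g.
sub clean_dna {
    my ($dna) = @_;
    my $before = length($dna);
    (my $clean = $dna) =~ s/[^atcgATCG]//g;   # copy first, then substitute on the copy
    return ($clean, $before - length($clean));
}

for my $seq ("ATGCCGTA", "ATG-NNX cgta") {
    my ($clean, $removed) = clean_dna($seq);
    print "$seq -> $clean ($removed illegal characters removed)\n";
}
```

For the second sequence the dash, the two N's, the X and the space are dropped, so 5 characters are removed and ATGcgta remains.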


Unary Arithmetic Operators, e.g. Autoincrement ++

• If you place one of the auto operators before the variable, it is known as a pre-incremented (pre-decremented) variable: its value is changed before it is used. If the operator is placed after the variable, it is known as a post-incremented (post-decremented) variable: its value is changed after it is used.

For example:

$a = 5;      # $a is assigned 5
$b = ++$a;   # $b is assigned the incremented value of $a, 6
$c = $a--;   # $c is assigned 6, then $a is decremented to 5

#!e:\perl\bin\perl.exe
$getal1 = 5;
print $getal1."\n";      # prints 5
print $getal1++."\n";    # prints 5, then $getal1 becomes 6
print ++$getal1."\n";    # $getal1 becomes 7, then prints 7


Logical and Comparison operators

• Equal (true if $a is equal to $b)
  – Numeric: ==
  – String: eq

• And: &&

• Or: ||
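The difference between the numeric and the string comparison is easiest to see with two values that are the same number but different text. The following is a minimal sketch: the variable names $x and $y are arbitrary, and the ne operator (string "not equal") is standard Perl but does not appear on the slide.

```perl
use strict;
use warnings;

my ($x, $y) = ("1.0", "1");

print "numerically equal\n" if $x == $y;   # printed: 1.0 and 1 are the same number
print "equal as strings\n"  if $x eq $y;   # not printed: the texts differ

# && is true only when both sides are true, || when at least one side is true
print "same number, different text\n" if $x == $y && $x ne $y;
print "equal in at least one sense\n" if $x eq $y || $x == $y;
```

Using == on text such as sequence names, or eq on numbers, is a common source of bugs; use warnings reports the first case when a string does not look like a number.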


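Shift operators

• Shift operators are handy for manipulations at the bit level, e.g. for 40:

            256 128  64  32  16   8   4   2   1
  40          0   0   0   1   0   1   0   0   0
  40 >> 2     0   0   0   0   0   1   0   1   0    (= 10)
  40 << 3     1   0   1   0   0   0   0   0   0    (= 320)

Program:

$getal1 = 40;
print "/4 ".($getal1 >> 2)."\n";    # shifting right by 2 bits divides by 4: prints /4 10
print "*8 ".($getal1 << 3)."\n";    # shifting left by 3 bits multiplies by 8: prints *8 320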
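The bit patterns can be printed directly with the %b format of printf, which shows a number in binary. This is a sketch under the assumption that 9 binary digits (up to 256) are enough for the example; the labels in the first column are only there for readability.

```perl
use strict;
use warnings;

my $getal1 = 40;

# %09b prints the number in binary, padded with zeros to 9 digits
printf "%-8s %09b = %d\n", "40",      $getal1,      $getal1;        # 000101000 = 40
printf "%-8s %09b = %d\n", "40 >> 2", $getal1 >> 2, $getal1 >> 2;   # 000001010 = 10
printf "%-8s %09b = %d\n", "40 << 3", $getal1 << 3, $getal1 << 3;   # 101000000 = 320
```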


Text Processing Functions

The substr function

• Definition: the substr function extracts a substring out of a string and returns it. The function receives 3 arguments: a string value, a position in the string (counting from 0) and a length.

• Example:
  $a = "university";
  $k = substr($a, 3, 5);
  $k is now "versi"; $a remains unchanged.

• If the length is omitted, everything to the end of the string is returned.
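The description of substr above can be checked with a few lines of Perl. The second half of the sketch applies it to a made-up DNA string, cutting it into pieces of three letters (codons). The names $word, $dna and @codons and the sequence itself are assumptions for illustration only.

```perl
use strict;
use warnings;

my $word = "university";
print substr($word, 3, 5), "\n";   # versi   (5 characters starting at position 3)
print substr($word, 3), "\n";      # versity (no length: everything to the end)

# Walk over a DNA string three letters at a time
my $dna = "ATGGCCTAA";
my @codons;
for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
    push @codons, substr($dna, $i, 3);
}
print "@codons\n";                 # ATG GCC TAA
```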


Random

#!c:\perl\bin\perl.exe -w
#srand(time|$$);
$x = rand(1);

• srand
  – The default seed for srand, which used to be time, has been changed. Now it's a heady mix of difficult-to-predict system-dependent values, which should be sufficient for most everyday purposes.
  – Previous to version 5.004, calling rand without first calling srand would yield the same sequence of random numbers on most or all machines. Now, when perl sees that you're calling rand and haven't yet called srand, it calls srand with the default seed.
  – You should still call srand manually if your code might ever be run on a pre-5.004 system, of course, or if you want a seed other than the default.


• Exercise: how good are the random numbers?

• If they are good, you can calculate Pi with them …

• A good random generator is important for good random sequences, which we can later use in simulations


Calculate Pi using two random numbers

[Figure: plot with axes x and y, scaled to 1]
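One common way to do this exercise, assumed here because the slide does not spell out the method, is the Monte Carlo approach: draw random points (x, y) in the unit square and count how many fall within distance 1 of the origin. That fraction approaches Pi/4. The subroutine name estimate_pi, the number of points and the fixed seed are choices made for this sketch.

```perl
use strict;
use warnings;

# Estimate Pi from $n random points in the unit square.
sub estimate_pi {
    my ($n) = @_;
    my $inside = 0;
    for (1 .. $n) {
        my $x = rand(1);
        my $y = rand(1);
        # the point lies inside the quarter circle of radius 1
        $inside++ if $x * $x + $y * $y <= 1;
    }
    return 4 * $inside / $n;
}

srand(42);   # fixed seed, so repeated runs on the same system give the same estimate
printf "Pi is approximately %.4f\n", estimate_pi(100_000);
```

The estimate improves only slowly with the number of points, so a poor random generator, or too few points, shows up as a value that stays visibly away from 3.1416.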


TextPad

• Debugging
  – Tools
• Syntax highlighting
  – Document Class