View
230
Download
2
Embed Size (px)
Citation preview
1
Introduction to Perl
Part II
2
Associative arrays or Hashes
Like arrays, but instead of numbers as indices can use strings
@array = (‘john’, ‘steve’, ‘aaron’, ‘max’, ‘juan’, ‘sue’)
%hash = ( ‘apple’ => 12, ‘pear’ => 3, ‘cherry’ =>30, ‘lemon’ => 2, ‘peach’ => 6, ‘kiwi’ => 3);
10 2 3 4 5
Array Hashpear
apple
cherry
lemon
peach
kiwi‘jo
hn
’
‘st
eve’
‘aaro
n’
‘m
ax’
‘ju
an
’
‘su
e’
123
30263
3
Using hashes
{ } operator
Set a value– $hash{‘cherry’} = 10;
Access a value– print $hash{‘cherry’}, “\n”;
Remove an entry– delete $hash{‘cherry’};
4
Get the Keys
keys function will return a list of the hash keys
@keys = keys %fruit;
for my $key ( keys %fruit ) { print “$key => $hash{$key}\n”;}
Would be ‘apple’, ‘pear’, ...
Order of keys is NOT guaranteed!
5
Get just the values
@values = values %hash;
for my $val ( @values ) { print “val is $val\n”;}
6
Iterate through a set
Order is not guaranteed!while( my ($key,$value) = each %hash){ print “$key => $value\n”;}
7
Subroutines
Set of code that can be reused
Can also be referred to as procedures and functions
8
Defining a subroutine
sub routine_name { }
Calling the routine– routine_name;
– &routine_name; (& is optional)
9
Passing data to a subroutine
Pass in a list of data– &dosomething($var1,$var2);
– sub dosomething { my ($v1,$v2) = @_;}sub do2 { my $v1 = shift @_; my $v2 = shift;}
10
Returning data from a subroutine
The last line of the routine set the return valuesub dothis { my $c = 10 + 20;}print dothis(), “\n”;
Can also use return specify return value and/or leave routine early
sub is_stopcodon { my $val = shift @_; if( length($val) != 3 ) { return -1; } elsif( $val eq ‘TAA’ || $val eq ‘TAG’ || $val eq ‘TGA’ ) { return 1; } else { return 0; }
Write subroutine which returns true if codon is a stop codon (for standard
genetic code).-1 on error, 1 on true, 0 on false
12
Context
array versus scalar context my $length = @lst;
my ($first) = @lst;
Want array used to report context subroutines are called in
Can force scalar context with scalarmy $len = scalar @lst;
13
subroutine context
sub dostuff { if( wantarray ) { print “array/list context\n”; } else { print “scalar context\n”; }}
dostuff(); # scalarmy @a = dostuff(); # arraymy %h = dostuff(); # arraymy $s = dostuff(); # scalar
14
Why do you care about context?
sub dostuff { my @r = (10,20); return @r;}
my @a = dostuff(); # arraymy %h = dostuff(); # arraymy $s = dostuff(); # scalar
print “@a\n”; # 10 20print join(“ “, keys %h),”\n”; # 10print “$s\n”; # 2
15
References
Are “pointers” the data object instead of object itsself
Allows us to have a shorthand to refer to something and pass it around
Must “dereference” something to get its actual value, the “reference” is just a location in memory
16
Reference Operators
\ in front of variable to get its memory location– my $ptr = \@vals;
[ ] for arrays, { } for hashes
Can assign a pointer directly– my $ptr = [ (‘owlmonkey’, ‘lemur’)];
my $hashptr = { ‘chrom’ => ‘III’, ‘start’ => 23};
17
Dereferencing
Need to cast reference back to datatype
my @list = @$ptr;
my %hash = %$hashref;
Can also use ‘{ }’ to clarify– my @list = @{$ptr};
– my %hash = %{$hashref};
18
Really they are not so hard...
my @list = (‘fugu’, ‘human’, ‘worm’, ‘fly’);
my $list_ref = \@list;
my $list_ref_copy = [@list];
for my $item ( @$list_ref ) { print “$item\n”;}
19
Why use references?
Simplify argument passing to subroutines
Allows updating of data without making multiple copies
What if we wanted to pass in 2 arrays to a subroutine?
sub func { my (@v1,@v2) = @_; }
How do we know when one stops and another starts?
20
Why use references?
Passing in two arrays to intermix.sub func {
my ($v1,$v2) = @_; my @mixed; while( @$v1 || @$v2 ) { push @mixed, shift @$v1 if @$v1; push @mixed, shift @$v2 if @$v2; } return \@mixed;}
21
References also allow Arrays of Arrays
my @lst;push @lst, [‘milk’, ‘butter’, ‘cheese’];push @lst, [‘wine’, ‘sherry’, ‘port’];push @lst, [‘bread’, ‘bagels’, ‘croissants’];
my @matrix = [ [1, 0, 0], [0, 1, 0], [0, 0, 1] ];
22
Hashes of arrays
$hash{‘dogs’} = [‘beagle’, ‘shepherd’, ‘lab’];$hash{‘cats’} = [‘calico’, ‘tabby’, ‘siamese’];$hash{‘fish’} = [‘gold’,’beta’,’tuna’];
for my $key (keys %hash ) { print “$key => “, join(“\t”, @{$hash{$key}}), “\n”;}
23
More matrix use
my @matrix;open(IN, $file) || die $!;# read in the matrixwhile(<IN>) { push @matrix, [split];}# data looks like# GENENAME EXPVALUE STATUS# sort by 2nd columnfor my $row ( sort { $a->[1] <=> $b->[1] } @matrix ) { print join(“\t”, @$row), “\n”;}
24
Funny operators
my @bases = qw(C A G T)
my $msg = <<EOFThis is the message I wanted to tell you aboutEOF;
25
Part of “amazing power” of Perl
Allow matching of patterns
Syntax can be tricky
Worth the effort to learn!
Regular Expressions
26
if( $fruit eq ‘apple’ || $fruit eq ‘Apple’ || $fruit eq ‘pear’) { print “got a fruit $fruit\n”;}if( $fruit =~ /[Aa]pple|pear/ ){ print “matched fruit $fruit\n”;}
A simple regexp
27
use the =~ operator to match
if( $var =~ /pattern/ ) {} - scalar context
my ($a,$b) = ( $var =~ /(\S+)\s+(\S+)/ );
if( $var !~ m// ) { } - true if pattern doesn’t
m/REGEXPHERE/ - match
s/REGEXP/REPLACE/ - substitute
tr/VALUES/NEWVALUES/ - translate
Regular Expression syntax
28
Search a string for a pattern match
If no string is specified, will match $_
Pattern can contain variables which will be interpolated (and pattern recompiled)while( <DATA> ) { if( /A$num/ ) { $num++ }}while( <DATA> ) { if( /A$num/o ) { $num++ }}
m// operator (match)
29
m// -if specify m, can replace / with anything e.g. m##, m[], m!!
/i - case insensitive
/g - global match (more than one)
/x - extended regexps (allows comments and whitespace)
/o - compile regexp once
Pattern extras
30
\s - whitespace (tab,space,newline, etc)
\S - NOT whitespace
\d - numerics ([0-9])
\D - NOT numerics
\t, \n - tab, newline
. - anything
Shortcuts
31
+ - 1 -> many (match 1,2,3,4,... instances )/a+/ will match ‘a’, ‘aa’, ‘aaaaa’
* - 0 -> many
? - 0 or 1
{N}, {M,N} - match exactly N, or M to N
[], [^] - anything in the brackets, anything but what is in the brackets
Regexp Operators
32
Things in parentheses can be retrieved via variables $1, $2, $3, etc for 1st,2nd,3rd matches
if( /(\S+)\s+([\d\.\+\-]+)/) { print “$1 --> $2\n”;}
my ($name,$score) = ($var =~ /(\S+)\s+([\d\.\+\-]+)/);
Saving what you matched
33
Simple Regexp
my $line = “aardvark”;if( $line =~ /aa/ ) { print “has double a\n” }if( $line =~ /(a{2})/ ) { print “has double a\n” }if( $line =~ /(a+)/ ) { print “has 1 or more a\n” }
34
Matching gene names
# File contains lots of gene names# YFL001C YAR102W - yeast ORF names# let-1, unc-7 - worm names http://biosci.umn.edu/CGC/Nomenclature/nomenguid.htm# ENSG000000101 - human Ensembl gene nameswhile(<IN>) { if( /^(Y([A-P])(R|L)(\d{3})(W|C)(\-\w)?)/ ) { printf “yeast gene %s, chrom %d,%s arm, %d %s strand\n”, $1, (ord($2)-ord(‘A’))+1, $3, $4; } elsif( /^(ENSG\d+)/ ) { print “human gene $1\n” } elsif( /^(\w{3,4}\-\d+)/ ) { print “worm gene $1\n”; }}
35
A parser for output from a gene prediction program
Putting it all together
GlimmerM (Version 3.0)Sequence name: BAC1Contig11Sequence length: 31797 bp
Predicted genes/exons
Gene Exon Strand Exon Exon Range Exon # # Type Length
1 1 + Initial 13907 13985 79 1 2 + Internal 14117 14594 478 1 3 + Internal 14635 14665 31 1 4 + Internal 14746 15463 718 1 5 + Terminal 15497 15606 110
2 1 + Initial 20662 21143 482 2 2 + Internal 21190 21618 429 2 3 + Terminal 21624 21990 367
3 1 - Single 25351 25485 135
4 1 + Initial 27744 27804 61 4 2 + Internal 27858 27952 95 4 3 + Internal 28091 28576 486 4 4 + Internal 28636 28647 12 4 5 + Internal 28746 28792 47 4 6 + Terminal 28852 28954 103
5 3 - Terminal 29953 30037 85 5 2 - Internal 30152 30235 84 5 1 - Initial 30302 30318 17
37
while(<>) { if(/^(Glimmer\S*)\s+\((.+)\)/ { $method = $1; $version = $2; } elsif( /^(Predicted genes)|(Gene)|(\s+\#)/ || /^\s+$/ ) { next } elsif( # glimmer 3.0 output /^\s+(\d+)\s+ # gene num (\d+)\s+ # exon num ([\+\-])\s+ # strand (\S+)\s+ # exon type (\d+)\s+(\d+) # exon start, end \s+(\d+) # exon length /ox ) {my ($genenum,$exonnum,$strand,$type,$start,$end, $len) = ( $1,$2,$3,$4,$5,$6,$7); }}
Putting it together
38
Same as m// but will allow you to substitute whatever is matched in first section with value in the second section
$sport =~ s/soccer/football/
$addto =~ s/(Gene)/$1-$genenum/;
s/// operator (substitute)
39
Match and replace what is in the first section, in order, with what is in the second.
lowercase - tr/[A-Z]/[a-z]/
shift cipher - tr/[A-Z]/[B-ZA]/
revcom - $dna =~ tr/[ACGT]/[TGCA]/; $dna = reverse($dna);
The tr/// operator (translate)
40
aMino - {A,C}, Keto - {G,T}
puRines - {A,G}, prYmidines - {C,T}
Strong - {G,C}, Weak - {A,T}
H (Not G)- {ACT}, B (Not A), V (Not T), D(Not C)
$str =~ tr/acgtrymkswhbvdnxACGTRYMKSWHBVDNX/tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX/;
(aside) DNA ambiguity chars