Computer Programming for Biologists
Class 5
Nov 20th, 2014
Karsten Hokamp
http://bioinf.gen.tcd.ie/GE3M25/programming
Project
Program Exit
Random numbers
Regular Expressions
Overview
Task 1: Report length of a sequence in Fasta format
Understand the problem, consider input/output:
>Tmsb10
ATGGCAGACAAGCCGGACATGGGGGAAATCGCCAGCTTCGATAAGGCCAAGCTGAAGAAAACCGAGACGCAGGAGAAGAACACCCTGCCGACCAAAGAGACCATTGAACAGGAAAAGAGGAGTGAAATCTCCTAA
Sequence length is 135 bp.
Project
Problems:
1. File contains header line
2. Sequence contains line-breaks
>Tmsb10
ATGGCAGACAAGCCGGACATGGGGGAAATCGCCAGCTTCGATAAGGCCAAGCTGAAGAAAACCGAGACGCAGGAGAAGAACACCCTGCCGACCAAAGAGACCATTGAACAGGAAAAGAGGAGTGAAATCTCCTAA
Project
Steps:
1. Read in file content (line-by-line)
2. Remove line-breaks
3. Skip header line
4. Concatenate sequence into one long string
5. Calculate and report length
Project
# 1. Read in file content (line-by-line)
while ($input = <>) {
    # 2. Remove line-breaks
    chomp $input;
    # 3. Skip header line
    # 4. Concatenate sequence into one long string
    $sequence .= $input;
}
# 5. Calculate and report length
$length = length($sequence);
print "Sequence length: $length bp\n";
Project
# 1. Read in file content (line-by-line)
while ($input = <>) {
    # 2. Remove line-breaks
    chomp $input;
    # 3. Skip header line (check for '>' in first position)
    # extract first character:
    $first = substr $input, 0, 1;
    # is it a '>'?
    if ($first eq '>') {
        # skip this line
        next;
    }
    # 4. Concatenate sequence into one long string
    $sequence .= $input;
}
Project
# 1. Read in file content (line-by-line)
while ($input = <>) {
    # 2. Remove line-breaks
    chomp $input;
    # 3. Skip header line (check for '>' in first position)
    # extract first character:
    $first = substr $input, 0, 1;
    # is it a '>'?
    unless ($first eq '>') {
        # this must be part of the sequence
        $sequence .= $input;
    }
}
Project
(alternative version)
# 1. Read in file content (line-by-line)
while ($input = <>) {
    # 2. Remove line-breaks
    chomp $input;
    # 3. Skip header line (check for '>' in first position)
    # extract first character:
    $first = substr $input, 0, 1;
    # is it a '>'?
    if ($first eq '>') {
        # skip this line
        next;
    }
    # 4. Concatenate sequence into one long string
    $sequence .= $input;
}
# 5. Calculate and report length
$length = length($sequence);
print "Sequence length: $length bp\n";
Project
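The five steps can also be tried without a file on disk. The following is only a sketch under stated assumptions: the helper name fasta_length and the short test record are made up for illustration and are not part of the course script. It opens a filehandle on a string, so the loop looks the same as when reading a real file.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): returns the length of the
# sequence in a single-record Fasta text that is passed in as a string.
sub fasta_length {
    my ($fasta_text) = @_;
    my $sequence = '';

    # open a read-only filehandle on the string itself
    open(my $fh, '<', \$fasta_text) or die "Cannot read Fasta text: $!\n";
    while (my $input = <$fh>) {
        # remove line-break
        chomp $input;
        # skip header line
        next if substr($input, 0, 1) eq '>';
        # concatenate sequence into one long string
        $sequence .= $input;
    }
    close $fh;

    return length($sequence);
}

# made-up two-line record: 10 + 6 bases
my $example = ">Tmsb10\nATGGCAGACA\nAGCCGG\n";
print "Sequence length: ", fasta_length($example), " bp\n";   # prints 16
```

With 'use strict' every variable has to be declared with 'my', which catches misspelled variable names early.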
# Suggestions for the start of the script:
# make sure a file has been provided
unless (@ARGV) {
    die "Please specify file name on command line!";
}
# initialise sequence variable
$sequence = '';
# 1. Read in file content (line-by-line)
while ($input = <>) {
…
Project
1. automatic exit at end of script
2. explicit exit with value:
exit 0; # default
or
exit 1; # normally indicates an error
3. exit on failure:
die "error message";
(a trailing "\n" in the message suppresses the line number)
Exiting a program
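The effect of the trailing "\n" on a die message can be seen in a small sketch. The helper check_input and the file name no_such_file.fa are assumptions made for this illustration (the file is assumed not to exist); eval is used here only to trap die so that both messages can be printed.

```perl
use strict;
use warnings;

# Hypothetical helper: dies unless a readable file name was given.
sub check_input {
    my (@args) = @_;
    # message ends in "\n": printed exactly as written
    die "Please specify file name on command line!\n" unless @args;
    # no "\n": Perl appends " at <script> line <number>."
    die "Cannot read file '$args[0]'" unless -r $args[0];
    return $args[0];
}

# eval traps the die, the message ends up in $@
eval { check_input() };
print "With newline:    $@";
eval { check_input('no_such_file.fa') };
print "Without newline: $@";
```

Outside of eval, die prints the message to STDERR and ends the program with a non-zero exit value.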
Practical:
Project
http://bioinf.gen.tcd.ie/GE3M25/programming/class5
• constructs that describe patterns
• powerful methods for text processing
• search for patterns in a string
• search and extract patterns
• search and replace patterns
• pattern at which to split a string
Regular Expressions
Examples:
• Look for a motif in a DNA/protein sequence
• Find low complexity repeats and mask with x's
• Find start of sequence string in GenBank record
• Extract e-mail addresses from a web-page
• Replace strings, e.g.: '@tcd.ie' with '@gmail.com'
Regular Expressions
Find a pattern in a string (stored in a variable):
$sequence = 'ataggctagctaga';
if ( $sequence =~ /ctag/ ) { print 'Found!'; }
Regular Expressions
$sequence = string in which to search
=~ = binding operator
ctag = pattern
/ / = delimiters
Find a pattern in a string (stored in a variable):
$_ = 'ataggctagctaga';
if ( /ctag/ ) { print 'Found!';}
Regular Expressions
Without a binding operator (=~), the pattern between the // delimiters is matched against $_
Search modifier:
i = make search case-insensitive
$sequence = 'ataggctagctaga';
if ( $sequence =~ /TAG/i ) {
print 'Found!';
}
Regular Expressions
Metacharacters:
^ = match at the beginning of a line
$ = match at the end of the line
. = match any character (except newline)
\ = escape the next metacharacter
$sequence = ">sequence1\natgacctggaataggat";
if ( $sequence =~ /^>/ ) { # line starts with '>'
print 'Found Fasta header!';
}
Regular Expressions
/\.$/ matches dot at end of line
Exercise:
Modify your course project (sequanto.pl) to use a
regular expression for detection of a header line
instead of 'substr' and 'eq' to check first character.
Project
Matching repetition:
a? = match 'a' 1 or 0 times
a* = match 'a' 0 or more times, i.e., any number of times
a+ = match 'a' 1 or more times, i.e., at least once
a{n,m} = match at least "n" times, but not more than "m" times.
a{n,} = match "n" or more times
a{n} = match exactly "n" times
$sequence =~ /a{5,}/; # finds repeats of 5 or more 'a's
Regular Expressions
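The difference between the repetition forms is easiest to see by running them side by side. This is a minimal sketch; the helper name has_a_run and the test sequences are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the sequence contains a run of
# 5 or more 'a's, otherwise 0.
sub has_a_run {
    my ($sequence) = @_;
    return $sequence =~ /a{5,}/ ? 1 : 0;
}

print has_a_run('ggctaaaaaaatcg'), "\n";   # 7 a's in a row: prints 1
print has_a_run('ggctaaaatcg'), "\n";      # only 4 a's:     prints 0

# 'ca?t' allows zero or one 'a' between 'c' and 't'
print "matches\n" if 'ggcttag' =~ /ca?t/;  # 'ct' is enough
```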
Search for classes of characters
\d = match a digit character
\w = match a word character (alphanumeric and '_')
\D = match a non-digit character
\W = match a non-word character
\s = whitespace
\S = match a non-whitespace character
$date = '30 Jan 2009';
if ( $date =~ /\d{1,2} \w+ \d{2,4}/ ) {
print 'Correct date format!';
}
Regular Expressions
also matches '1 February 09'
Match special characters
\t = matches a tabulator (tab)
\b = matches a word boundary
\r = matches return
\n = matches UNIX newline
\cM = matches Control-M (line-ending in Windows)
while (my $line = <>) {
if ($line =~ /\cM/) {
warn "Windows line-ending detected!";
}
}
Regular Expressions
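On a Unix system, chomp removes only the "\n", so a Control-M ("\r") from a Windows file would stay attached to the line. One possible way to deal with both kinds of line ending is sketched below; the helper name strip_line_ending is an assumption for this example and not part of the course material.

```perl
use strict;
use warnings;

# Hypothetical helper: removes a trailing Unix ("\n") or
# Windows ("\r\n") line ending and returns the cleaned line.
sub strip_line_ending {
    my ($line) = @_;
    # optional \r followed by \n at the end of the string
    $line =~ s/\r?\n$//;
    return $line;
}

print length(strip_line_ending("acgt\r\n")), "\n";   # prints 4
print length(strip_line_ending("acgt\n")), "\n";     # prints 4
```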
Search for range of characters
[ ] = match one character out of those specified within the brackets
- = specifies a range, e.g. [a-z], or [0-9]
^ = as first character inside the brackets: match any character not in the list, e.g. [^A-Z]
$sequence = 'ataggctapgctaga';
if ( $sequence =~ /[^acgt]/ ) {
print "Sequence contains non-DNA character: $&";
}
Regular Expressions
$& is a special variable containing the last pattern match
$` and $' contain strings before and after match
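The three match variables can be inspected together. The sketch below reuses the sequence from the slide; the helper name split_at_non_dna is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text before, at, and after the
# first non-DNA character, or an empty list if there is none.
sub split_at_non_dna {
    my ($sequence) = @_;
    return ($`, $&, $') if $sequence =~ /[^acgt]/;
    return;
}

my ($before, $match, $after) = split_at_non_dna('ataggctapgctaga');
print "Before: $before\n";   # ataggcta
print "Match:  $match\n";    # p
print "After:  $after\n";    # gctaga
```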
Search and replace (substitute):
s/pattern/replacement/
$sequence = 'ataggctagctaga';
$rna = $sequence;
$rna =~ s/t/u/;
-> 'auaggctagctaga'
Regular Expressions
Only the first match will be replaced!
Modifiers for substitution:
i = case-insensitive
g = global (replace all matches)
s = '.' also matches newline
$sequence = 'ataggctagctaga';
$rna = $sequence;
$rna =~ s/t/u/g;
-> 'auaggcuagcuaga'
Regular Expressions
replaces all 't' in the line with 'u'
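A substitution also returns the number of replacements it made, which can serve as a quick check. A small sketch, with a helper name (transcribe) chosen only for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: replaces every 't' with 'u' and returns the
# new string together with the number of replacements.
sub transcribe {
    my ($dna) = @_;
    my $rna = $dna;
    # s///g returns the number of substitutions (empty string if none)
    my $count = ($rna =~ s/t/u/g) || 0;
    return ($rna, $count);
}

my ($rna, $count) = transcribe('ataggctagctaga');
print "$rna ($count replacements)\n";   # auaggcuagcuaga (3 replacements)
```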
Example: Clean up a sequence string:
$sequence = "
1 ataggctagctagat
16 ttagagctagta
";
$sequence =~ s/[^actg]//g;
-> 'ataggctagctagatttagagctagta'
Regular Expressions
Deletes everything that is not a, c, t, or g.
Extract matched patterns:
- put patterns in parentheses
- \1, \2, \3, … refers back to ()'s within pattern match
- $1, $2, $3, … refers back to ()'s after pattern match
$sequence = ">test\natgtagagctagta";
if ($sequence =~ /^>(.*)/) { $id = $1; }
or
$email =~ s/(.*)\@(.*)\.(.*)/$1 at $2 dot $3/;
print "Changed address to $1 at $2 dot $3\n";
Regular Expressions
changes 'kahokamp@tcd.ie' to 'kahokamp at tcd dot ie'
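$1, $2, $3 keep their values from the last successful match, so it is safer to use them only after checking that the match worked. A minimal sketch of that check; the helper name parse_email and the address someone@example.org are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: splits an address into user, domain and
# ending; returns an empty list if the pattern does not match.
sub parse_email {
    my ($email) = @_;
    if ($email =~ /^(.*)\@(.*)\.(.*)$/) {
        return ($1, $2, $3);
    }
    return;
}

my @parts = parse_email('someone@example.org');
if (@parts) {
    print "$parts[0] at $parts[1] dot $parts[2]\n";   # someone at example dot org
}
```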
Practical:
Project
http://bioinf.gen.tcd.ie/GE3M25/programming/class5
Change a string into an array of characters:
@array = split //, $string;
Split input line at tabs:
@columns = split /\t/, $input_line;
Default splits $_ on whitespace:
while (<>) {
@columns = split;
…
}
Regular Expressions in split
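Splitting a string into single characters makes it possible to look at each base in turn. The following sketch counts bases that way; the helper name base_counts and the tab-separated example line are assumptions for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: counts how often each character occurs.
sub base_counts {
    my ($sequence) = @_;
    my %count;
    # split // turns the string into a list of single characters
    $count{$_}++ for split //, $sequence;
    return %count;
}

my %count = base_counts('ataggctagctaga');
print "$_: $count{$_}\n" for sort keys %count;

# split a made-up tab-separated line into columns
my @columns = split /\t/, "Tmsb10\t135\tbp";
print "Second column: $columns[1]\n";   # prints 135
```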