Regular Expressions & Pattern Matching
James Wasmuth, University of Edinburgh
Definitions
Pattern Match – searching for a specified pattern within a string.
For example:
Finding a sequence motif,
Finding the accession number of a sequence,
Parsing HTML,
Validating user input.
Regular Expression (regex) – the notation used to describe the pattern to match.
Regular Expressions
A separate pattern-matching language,
Utilised in most popular languages – usually as a separate library,
Perl – fully incorporated into the language syntax.
How Regex Work
[Diagram: regex code, input data (e.g. sequence file), Perl compiler, regex engine, output]
Overview:
how to create regular expressions
how to use them to match and extract data
biological context
Simple Patterns
Place the regex between a pair of forward slashes ( / / ).
try:
#!/usr/bin/perl
while (<STDIN>) {
    if (/abc/) {
        print ">> found 'abc' in $_\n";
    }
}
Save then run the program. Type something on the terminal then press return. Press Ctrl+C to exit the script.
If you type anything containing 'abc', the message is printed.
Binding Operator
The previous example matched against the default variable $_
Want to match against a scalar variable?
The binding operator “=~” matches the pattern on the right against the string on the left.
Usually add the m (match) operator for clarity of code.
$string =~ m/pattern/
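A minimal sketch of the binding operator in use. The variable names and the sample string are illustrative assumptions, and the negated form !~ is standard Perl but is not covered in these slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data; any scalar can go on the left of =~
my $species = 'Caenorhabditis elegans';

# =~ binds the pattern on the right to the string on the left
my $has_elegans = ($species =~ m/elegans/) ? 1 : 0;

# !~ is the negated form: true when the pattern does NOT match
my $lacks_briggsae = ($species !~ m/briggsae/) ? 1 : 0;

print "has elegans: $has_elegans\n";
print "lacks briggsae: $lacks_briggsae\n";
```

Both flags end up as 1 for this sample string.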
Simple Patterns (2)
You can also read a file and pattern match against each line.
try:
#!/usr/bin/perl
open(my $in, '<', 'genomes_desc.txt') or die "Cannot open genomes_desc.txt: $!";
while (my $line = <$in>) {
    if ($line =~ m/elegans/) {    # true if the line contains 'elegans'
        print $line;
    }
}
close($in);
Flexible matching
Within a regex there are many characters with special meanings – metacharacters
star (*) matches zero or more instances of the preceding character
/ab*c/ => ‘a’ followed by zero or more ‘b’ followed by ‘c’
plus (+) matches at least one instance
/ab+c/ => ‘a’ followed by 1 or more ‘b’ followed by ‘c’
question mark (?) matches zero or one instance
/ab?c/ => ‘a’ followed by 0 or 1 ‘b’ followed by ‘c’
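The three quantifiers can be compared side by side. This is a small sketch; the test strings and the %matched hash are hypothetical names chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical test strings with zero, one and three 'b' characters
my @strings = ('ac', 'abc', 'abbbc');

my %matched;
for my $s (@strings) {
    $matched{star}{$s}     = ($s =~ /ab*c/) ? 1 : 0;   # zero or more 'b'
    $matched{plus}{$s}     = ($s =~ /ab+c/) ? 1 : 0;   # one or more 'b'
    $matched{question}{$s} = ($s =~ /ab?c/) ? 1 : 0;   # zero or one 'b'
    print "$s: * $matched{star}{$s}  + $matched{plus}{$s}  ? $matched{question}{$s}\n";
}
```

Here /ab*c/ accepts all three strings, /ab+c/ rejects 'ac', and /ab?c/ rejects 'abbbc'.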
More Flexibility
Match a character a specific number of times, or within a range of times
{x} will match exactly x instances.
/ab{3}c/ => abbbc
{x,y} will match between x and y instances.
/a{2,4}bc/ => aabc or aaabc or aaaabc
{x,} will match x or more instances.
/abc{3,}/ => abccc or abccccccc or abcccccccc
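A short sketch of the three brace forms, using grep to keep the strings each pattern accepts. The list of strings is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical strings containing one to five 'b' characters
my @strings = ('abc', 'abbc', 'abbbc', 'abbbbc', 'abbbbbc');

my @exactly_three = grep { /ab{3}c/ }   @strings;   # 'b' exactly 3 times
my @two_to_four   = grep { /ab{2,4}c/ } @strings;   # 'b' 2 to 4 times
my @three_or_more = grep { /ab{3,}c/ }  @strings;   # 'b' 3 or more times

print "{3}:   @exactly_three\n";
print "{2,4}: @two_to_four\n";
print "{3,}:  @three_or_more\n";
```

Because 'a' and 'c' fix both ends of the run of 'b', {3} accepts only abbbc here.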
More Flexibility (2)
dot (.) is a wildcard character – matches any character except newline (\n)
/a.c/ => ‘a’ followed by any character followed by ‘c’
Combine metacharacters
/a.{4}c/ => ‘a’ followed by 4 instances of any character followed by ‘c’
so will match addddc , afgthc , ab569c
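The newline exception is easy to check directly. This is a small sketch with made-up strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $with_letter  = 'abc';
my $with_newline = "a\nc";    # a newline sits between 'a' and 'c'

# The dot matches the 'b' but not the newline
my $dot_matches_letter  = ($with_letter  =~ /a.c/) ? 1 : 0;
my $dot_matches_newline = ($with_newline =~ /a.c/) ? 1 : 0;

print "letter: $dot_matches_letter, newline: $dot_matches_newline\n";
```

Perl also has an /s modifier that lets the dot match a newline; it is standard Perl but is not covered in these slides.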
Escaping Metacharacters
To use a * , + , ? or . in the pattern as a literal character rather than a metacharacter, 'escape' it with a backslash.
/C\. elegans/ => C. elegans only
/C. elegans/ => also matches Ca elegans , Cb elegans , C3 elegans , C> elegans , etc.
The 'delimiter' of the regex, forward slash '/', and the 'escape' character, backslash '\', are also metacharacters. These need to be escaped if required in the regex.
Important when trying to match URLs and email addresses.
/joe\.bloggs\@darwin\.co\.uk/
/www\.envgen\.nox\.ac\.uk\/biolinux\.html/
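A sketch contrasting an escaped and an unescaped dot. It also uses Perl's built-in quotemeta function, which is not mentioned in the slides, as one way to escape a whole string automatically. The names and the placeholder URL are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @names = ('C. elegans', 'Cx elegans');

my @unescaped = grep { /C. elegans/ }  @names;   # dot is a wildcard: both match
my @escaped   = grep { /C\. elegans/ } @names;   # literal dot: only the first matches

# quotemeta backslash-escapes every metacharacter in a string,
# including '.' and the '/' delimiter
my $url       = 'www.example.org/index.html';    # placeholder URL (assumption)
my $pattern   = quotemeta($url);
my $url_found = ('see www.example.org/index.html' =~ /$pattern/) ? 1 : 0;

print "unescaped: ", scalar(@unescaped), ", escaped: ", scalar(@escaped), "\n";
print "url found: $url_found\n";
```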
Finding Sequence Identifiers
The file nemaglobins contains EMBL database entries for globins of the phylum Nematoda. Write a script that counts the number of entries.
try:
#!/usr/bin/perl
my $count = 0;
open(my $in, '<', 'nemaglobins.embl') or die "Cannot open nemaglobins.embl: $!";
while (my $line = <$in>) {
    if ($line =~ m/AC   .*/) {    # that's three spaces
        $count++;
    }
}
print "total=$count\n";
Grouping Patterns
So far, each quantifier has applied to a single character.
Can group patterns – place them within parentheses “()”.
Powerful when coupled with quantifiers.
/MLSTSTG+/ => MLSTSTGGGGGGGGG…
/MLS(TSTG)+/ => MLSTSTGTSTGTSTG…TSTG
/ML(ST){2}G/ => MLSTSTG
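The difference between quantifying one character and quantifying a group can be sketched as follows. The peptide strings are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical peptide fragments
my @peptides = ('MLSTSTGGG', 'MLSTSTGTSTG', 'MLSTSTG');

# With parentheses the quantifier applies to the whole unit TSTG
my @grouped = grep { /MLS(TSTG){2}/ } @peptides;

# Without parentheses it applies only to the final G
my @ungrouped = grep { /MLSTSTG{2}/ } @peptides;

print "grouped:   @grouped\n";
print "ungrouped: @ungrouped\n";
```

The grouped pattern needs TSTG twice in a row, while the ungrouped one needs two consecutive G characters.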
Alternative Matching
Match this or that.
Two ways, which depend on the nature of the pattern
1) use a vertical bar ‘|’
matches if either the left side or the right side matches,
/(human|mouse|rat)/ => any string with human or mouse or rat.
2) a character class is a list of characters within '[]'. It will match any single character within the class.
/[wxyz1234\t]/ => any of the nine.
a range can be specified with '-'
/[w-z1-4\t]/ => as above
to match a literal hyphen, place it first (or last) in the class, or escape it
/[-a-zA-Z]/ => any letter character or a hyphen
negate a class by placing '^' as its first character
/[^z]/ => any character except z
/[^abc]/ => any character except a or b or c
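A brief sketch of alternation and character classes applied to single-item lists. The sample lines and tokens are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Alternation: keep lines mentioning any of three organisms
my @lines   = ('human globin', 'mouse globin', 'yeast globin');
my @mammals = grep { /(human|mouse|rat)/ } @lines;

# Character classes tested against single-character tokens
my @tokens      = ('w', 'z', '3', '-', 'q');
my @in_class    = grep { /[w-z1-4]/ }  @tokens;   # ranges w-z and 1-4
my @with_hyphen = grep { /[-w-z1-4]/ } @tokens;   # leading '-' is a literal hyphen
my @negated     = grep { /[^w-z1-4]/ } @tokens;   # anything NOT in the class

print "mammals:     @mammals\n";
print "in class:    @in_class\n";
print "with hyphen: @with_hyphen\n";
print "negated:     @negated\n";
```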
Revisiting EMBL file
Want to find the number of globins from Ascaris and Toxocara.
#!/usr/bin/perl
my $count = 0;
open(my $in, '<', 'nemaglobins.embl') or die "Cannot open nemaglobins.embl: $!";
while (my $line = <$in>) {
    if ($line =~ m/OS   (Ascaris|Toxocara)/) {    # three spaces after OS
        $count++;
    }
}
print "Found $count globins from Ascaris or Toxocara\n";
Shortcuts
\d => any digit [0-9]
\w => any “word” character [A-Za-z0-9_]
\s => any white space [\t\n\r\f ]
\D => any character except a digit [^\d]
\W => any character except a “word” character [^\w]
\S => any character except a white space [^\s]
Can use any of these in conjunction with quantifiers,
/\s*/ => any amount of white space
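The shortcuts combine with quantifiers just like ordinary characters. A minimal sketch, using an invented line of text:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical line: an identifier, a space, a number, a tab and a word
my $line = "ID 42\tglobin";

my $has_digit   = ($line =~ /\d/)     ? 1 : 0;   # contains at least one digit
my $two_digits  = ($line =~ /\d{2}/)  ? 1 : 0;   # two digits in a row
my $has_nonword = ($line =~ /\W/)     ? 1 : 0;   # the space and tab are non-word characters
my $only_word   = ('globin_2' =~ /\W/) ? 0 : 1;  # letters, digits and '_' only

# Shortcuts and quantifiers combined into one pattern
my $structured = ($line =~ /ID\s*\d+\s+\w+/) ? 1 : 0;

print "$has_digit $two_digits $has_nonword $only_word $structured\n";
```

All five checks succeed for this sample line.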
Anchoring a Pattern
/pattern/ will match anywhere in the string
Anchors hold the pattern to a point in the string.
Caret “^” (shift 6) marks the beginning of the string, while dollar “$” marks the end of the string.
/^elegans/ => elegans only at start of string. Not C. elegans.
/Canis$/ => Canis only at end of string. Not Canis lupus.
/^\s*$/ => a blank line.
‘$’ also matches just before a trailing newline ‘\n’, so a line read from a file does not need the newline removed first.
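The anchor examples above can be checked with a short sketch. The list of strings is made up, and two entries deliberately end in a newline to show how ‘$’ treats it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical strings; two of them end with a newline
my @names = ("elegans", "C. elegans", "Canis", "Canis lupus", "Canis\n", "   \n");

my @starts = grep { /^elegans/ } @names;   # pattern must begin the string
my @ends   = grep { /Canis$/ }   @names;   # pattern must end the string (a trailing \n is allowed)
my @blank  = grep { /^\s*$/ }    @names;   # nothing but white space

print "starts: ", scalar(@starts), "\n";
print "ends:   ", scalar(@ends), "\n";
print "blank:  ", scalar(@blank), "\n";
```

Only "elegans" passes the first test, both "Canis" and "Canis\n" pass the second, and only the white-space string passes the third.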
Memory Variables
Can extract sections of the matched text and store them in variables.
The text matched by each part of the pattern within parentheses ‘()’ is stored in a special variable.
First instance is $1, second $2, the fourth $4…
Extract from file:
Organism: Homo sapiens
From Perl script:
if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
    $genus = $1;
    $species = $2;
}
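A self-contained version of the extraction above, offered as a sketch. The second form, which assigns the captures directly from a match in list context, is standard Perl but does not appear in the slides; $genus2 and $species2 are hypothetical names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = 'Organism: Homo sapiens';

# Form 1: test the match, then copy $1 and $2.
# Only read $1 and $2 after a successful match; otherwise they
# still hold values from an earlier match.
my ($genus, $species) = ('', '');
if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
    ($genus, $species) = ($1, $2);
}

# Form 2: in list context a match returns its captured groups
my ($genus2, $species2) = $line =~ m/Organism:\s(\w+)\s(\w+)/;

print "genus=$genus species=$species\n";
print "genus=$genus2 species=$species2\n";
```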
Revisiting EMBL File (again)
Use shortcuts and anchors to find what you want.
if ($line =~ m/AC   .*/) {    # found lots
Try:
if ($line =~ m/^AC\s{3}([.\w]+)\s*/) {
    $accession = $1;    # info stored to use later
}
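To see why the anchored pattern is more selective, here is a sketch run on a few in-memory lines. The records are invented and only imitate the two-letter line codes of an EMBL entry; they are not taken from nemaglobins.embl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up lines in the style of an EMBL entry (assumption)
my @records = (
    "ID   HYPOTHETICAL1\n",
    "AC   X00001;\n",
    "DE   contains AC   in the description\n",
    "AC   X00002.1;\n",
);

# Loose pattern: also counts the DE line, which merely contains 'AC   '
my $loose_count = grep { m/AC   .*/ } @records;

# Anchored pattern: only true AC lines, with the accession captured
my @accessions;
for my $line (@records) {
    if ($line =~ m/^AC\s{3}([.\w]+)\s*/) {
        push @accessions, $1;
    }
}

print "loose matches: $loose_count\n";
print "accessions: @accessions\n";
```

The loose pattern counts three lines, while the anchored one finds only the two real accession lines.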
Substitutions
Match a pattern within a string and replace it with another string.
Uses the ‘s’ operator
s/abc/xyz/ => find abc and replace with xyz
Only replaces the first instance of a match. Using the ‘g’ modifier will find and replace all.
$line = 'abcaabbcabca';
$line =~ s/abc/xyz/g;
print $line;    # xyzaabbcxyza
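The effect of the ‘g’ modifier can be compared directly. This sketch also uses the return value of s///, which is the number of substitutions made; that detail is standard Perl but is not stated in the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $first_only = 'abcaabbcabca';
my $all        = 'abcaabbcabca';

# Without 'g' only the first match is replaced
my $n_first = ($first_only =~ s/abc/xyz/);

# With 'g' every match is replaced
my $n_all = ($all =~ s/abc/xyz/g);

print "$first_only ($n_first replaced)\n";
print "$all ($n_all replaced)\n";
```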
More Substitutions
Remove all gap characters from a multiple sequence alignment:
$aln = 'AADG--ASD--P-GSTST';
$aln =~ s/-//g;
print $aln;    # AADGASDPGSTST
Inserting information:
$line = 'vector:';
$line =~ s/(vector:)/$1 M13MP7/;    # vector: M13MP7
$name = 'Daniel';
$name =~ s/(Daniel)/Jack $1/;    # Jack Daniel
Resources
Learning Perl (O'Reilly) Ch. 7-9
Regular Expression Pocket Reference (O'Reilly)
perldoc perlre
http://etext.lib.virginia.edu/helpsheets/regex.html
http://www.nematodes.org/~jamesw/Perl/regex
Mastering Regular Expressions (O'Reilly)