Regular Expressions & Pattern Matching

James Wasmuth

University of Edinburgh

Definitions

Pattern Match – searching for a specified pattern within a string.

For example:

A sequence motif,

Accession number of a sequence,

Parsing HTML,

Validating user input.

Regular Expression (regex) – the notation used to describe the pattern to match.

Regular Expressions

A separate programming language,

Utilised in most popular languages, usually as a separate library.

Perl: fully incorporated into the language itself.

How Regexes Work

[Diagram: regex code and input data (e.g. sequence file) are passed to the Perl compiler and its regex engine, which produce the output.]

Overview:

how to create regular expressions

how to use them to match and extract data

biological context

Simple Patterns

Place the regex between a pair of forward slashes ( / / ).

try:

#!/usr/bin/perl

while (<STDIN>) {
    if (/abc/) {
        print ">> found 'abc' in $_\n";
    }
}

Save then run the program. Type something on the terminal then press return. Ctrl+C to exit script.

If you type anything containing ‘abc’, the message from the print statement is displayed.

Binding Operator

Previous example matched against $_

Want to match against a scalar variable?

Binding Operator “=~” matches pattern on right against string on left.

The m (match) operator is optional with slash delimiters, but is usually added for clarity of code.

$string =~ m/pattern/
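A minimal sketch of the binding operator in use. The string and variable names are made up for illustration, and the negated form !~ is not covered on the slide; it is included only as a related standard Perl operator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical example string; any scalar can sit on the left of =~.
my $species = 'Caenorhabditis elegans';

# =~ applies the pattern on the right to the string on the left.
my $found = ($species =~ m/elegans/) ? 1 : 0;

# !~ is the negated form: true when the pattern does NOT match.
my $not_found = ($species !~ m/briggsae/) ? 1 : 0;

print "matched elegans: $found\n";
print "no briggsae: $not_found\n";
```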

Simple Patterns (2)

Pattern matching also works on lines read from a file.

try:

#!/usr/bin/perl

open IN, "<genomes_desc.txt" or die "Cannot open genomes_desc.txt: $!";

while ($line = <IN>) {
    if ($line=~m/elegans/) { #true if finds 'elegans'
        print $line;
    }
}

Flexible matching

Within regex there are many characters with special meanings – metacharacters

star (*) matches any number of instances

/ab*c/ => ‘a’ followed by zero or more ‘b’ followed by ‘c’

plus (+) matches at least one instance

/ab+c/ => ‘a’ followed by 1 or more ‘b’ followed by ‘c’

question mark (?) matches zero or one instance

/ab?c/ => ‘a’ followed by 0 or 1 ‘b’ followed by ‘c’
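The three quantifiers can be compared side by side. The sketch below runs each pattern against three made-up strings and records which ones match; the variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up test strings with zero, one and three 'b' characters.
my @candidates = ('ac', 'abc', 'abbbc');
my %result;

for my $s (@candidates) {
    # Collect the name of every quantifier whose pattern matches $s.
    $result{$s} = join ',',
        ($s =~ /ab*c/ ? 'star'     : ()),
        ($s =~ /ab+c/ ? 'plus'     : ()),
        ($s =~ /ab?c/ ? 'question' : ());
    print "$s: $result{$s}\n";
}
```

Only the star and plus patterns accept 'abbbc', because the question mark allows at most one 'b'.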

More Flexibility

Match a character a specific number or range of instances

{x} will match exactly x instances.

/ab{3}c/ => abbbc

{x,y} will match between x and y instances.

/a{2,4}bc/ => aabc or aaabc or aaaabc

{x,} will match x or more instances.

/abc{3,}/ => abccc or abccccccc or abcccccccc
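A short sketch of counted quantifiers, using made-up strings. One point worth noting: without anchors, /a{2,4}bc/ still matches a string with five 'a' characters, because the pattern can match the last four of them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Exactly three 'b' characters between 'a' and 'c'.
my $exact    = ('abbbc'  =~ /ab{3}c/) ? 1 : 0;   # matches
my $too_many = ('abbbbc' =~ /ab{3}c/) ? 1 : 0;   # four 'b': no match

# Between two and four 'a' characters before 'bc'.
my $in_range = ('aaabc'   =~ /a{2,4}bc/) ? 1 : 0;   # matches
my $too_few  = ('abc'     =~ /a{2,4}bc/) ? 1 : 0;   # one 'a': no match
my $five     = ('aaaaabc' =~ /a{2,4}bc/) ? 1 : 0;   # matches inside the string

# Three or more 'c' characters after 'ab'.
my $open_end = ('abcc' =~ /abc{3,}/) ? 1 : 0;       # two 'c': no match

print "$exact $too_many $in_range $too_few $five $open_end\n";
```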

More Flexibility

dot (.) is a wildcard character – matches any character except new line (\n)

/a.c/ => ‘a’ followed by any character followed by ‘c’

Combine metacharacters

/a.{4}c/ => ‘a’ followed by 4 instances of any character followed by ‘c’

so will match addddc , afgthc , ab569c
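As a quick check of the wildcard, the sketch below filters a list with /a.{4}c/. The last two strings are additions for illustration: one is too short, and one has new lines between 'a' and 'c', which the dot does not match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The first three strings come from the slide; the last two are made up.
my @strings = ('addddc', 'afgthc', 'ab569c', 'abc', "a\n\n\n\nc");

# Keep only the strings where 'a', any four characters, then 'c' is found.
my @hits = grep { /a.{4}c/ } @strings;

print scalar(@hits), " of ", scalar(@strings), " strings matched\n";
```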

Escaping Metacharacters

To match a literal * , + , ? or . rather than use it as a metacharacter, 'escape' it with a backslash.

/C\. elegans/ => C. elegans only

/C. elegans/ => 'C', then any character, then ' elegans': matches Ca elegans, Cb elegans, C3 elegans, C> elegans, C. elegans, etc.

The 'delimiter' of the regex, forward slash '/', and the 'escape' character, backslash '\', are also metacharacters. These need to be escaped if required in the regex.

Important when trying to match URLs and email addresses.

/joe\.bloggs\@darwin\.co\.uk/

/www\.envgen\.nox\.ac\.uk\/biolinux\.html/
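The effect of escaping can be seen by running both forms of the pattern over the same list. The strings below are made up, and the m{...} form with brace delimiters is not shown on the slide; it is a standard Perl alternative that avoids escaping the forward slash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up names for illustration.
my @names = ('C. elegans', 'Ca elegans', 'C3 elegans');

# Escaped dot: only a literal full stop after the 'C' matches.
my @literal = grep { /C\. elegans/ } @names;

# Unescaped dot: any character after the 'C' matches.
my @wildcard = grep { /C. elegans/ } @names;

# With m{...} the delimiter is a brace, so '/' needs no backslash.
# The 'http://' prefix is an assumption added for this example.
my $url    = 'http://www.envgen.nox.ac.uk/biolinux.html';
my $url_ok = ($url =~ m{www\.envgen\.nox\.ac\.uk/biolinux\.html}) ? 1 : 0;

print scalar(@literal), " literal, ", scalar(@wildcard), " wildcard, url=$url_ok\n";
```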

Finding Sequence Identifiers

The file nemaglobins.embl contains EMBL database entries for globins of the phylum Nematoda. Write a script that counts the number of entries.

try:

#!/usr/bin/perl

$count = 0;

open IN, "<nemaglobins.embl" or die;

while ($line = <IN>) {
    if ($line=~m/AC   .*/) { #that's three spaces
        $count++;
    }
}

print "total=$count\n";

Grouping Patterns

So far each quantifier has applied to a single character.

Can group patterns – place within parentheses “()”.

Powerful when coupled with quantifiers.

/MLSTSTG+/ => MLSTSTGGGGGGGGG…

/MLS(TSTG)+/ => MLSTSTGTSTGTSTG…TSTG

/ML(ST){2}G/ => MLSTSTG
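The difference between quantifying one character and quantifying a group shows up in how much of the string is matched. This sketch uses a made-up sequence and the special variable $& (the text of the last successful match), which is not introduced on the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sequence: 'MLS' followed by three copies of 'TSTG'.
my $seq = 'MLSTSTGTSTGTSTG';

my ($char_repeat, $group_repeat) = ('', '');

# The + applies only to the final 'G', so the match stops after one 'G'.
if ($seq =~ /MLSTSTG+/) {
    $char_repeat = $&;
}

# The + applies to the whole group 'TSTG', so every repeat is consumed.
if ($seq =~ /MLS(TSTG)+/) {
    $group_repeat = $&;
}

print "single character: $char_repeat\n";
print "grouped:          $group_repeat\n";
```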

Alternative Matching

Match this or this.

Two ways which depend on nature of pattern

1) use a vertical bar ‘|’

matches if either left side or right side matches,

/(human|mouse|rat)/ => any string with human or mouse or rat.

2) character class is a list of characters within '[]'. It will match any single character within the class.

/[wxyz1234\t]/ => any of the nine.

a range can be specified with '-'

/[w-z1-4\t]/ => as above

to match a literal hyphen, place it first or last in the class (or escape it)

/[-a-zA-Z]/ => any letter character or a hyphen

negate a character class with '^' as its first character

/[^z]/ => any character except z

/[^abc]/ => any character except a or b or c
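Both forms of alternative matching can be tried on small lists. All strings below are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Alternation: keep lines mentioning human, mouse or rat.
my @lines   = ('human globin', 'mouse globin', 'zebrafish globin');
my @mammals = grep { /(human|mouse|rat)/ } @lines;

# Character classes with ranges: one of w-z followed by one of 1-4.
my @ids      = ('w1', 'a1', 'z4', 'z9');
my @in_range = grep { /[w-z][1-4]/ } @ids;

# Negated class: keep strings holding at least one character
# that is not a, b or c.
my @words     = ('abc', 'abcd', 'cab');
my @has_other = grep { /[^abc]/ } @words;

print "@mammals\n@in_range\n@has_other\n";
```

Note that /[^abc]/ does not mean "contains no a, b or c"; it needs one character outside the class to be present.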

Revisiting EMBL File

Want to find the number of globins from Ascaris and ?????.

#!/usr/bin/perl

$count = 0;

open IN, "<nemaglobins.embl" or die;

while ($line=<IN>) {
    if ($line=~m/OS   (Ascaris|Toxocara)/) { #three spaces, as for AC
        $count++;
    }
}

print "Found $count globins from Ascaris or Toxocara\n";

Shortcuts

\d => any digit [0-9]

\w => any “word” character [A-Za-z0-9_]

\s => any white space [\t\n\r\f ]

\D => any character except a digit [^\d]

\W => any character except a “word” character [^\w]

\S => any character except a white space [^\s]

Can use any of these in conjunction with quantifiers,

/\s*/ => any amount of white space, including none
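A small sketch of the shortcuts combined with quantifiers. The record strings are made up; note that the underscore counts as a "word" character, so only the record containing a space matches \W.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up identifiers for illustration.
my @records = ('AB123', 'AB 123', 'AB_123', 'ABCDE');

# Capital letters followed directly by digits.
my @letters_digits = grep { /[A-Z]+\d+/ } @records;

# Word characters, then white space, then digits.
my @with_space = grep { /\w+\s+\d+/ } @records;

# Any character that is not a "word" character.
my @non_word = grep { /\W/ } @records;

print "letters+digits: @letters_digits\n";
print "with space:     @with_space\n";
print "non-word:       @non_word\n";
```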

Anchoring a Pattern

/pattern/ will match anywhere in the string

Anchors hold the pattern to a point in the string.

caret “^” (shift 6) marks the beginning of string while dollar “$” marks end of a string.

/^elegans/ => elegans only at start of string. Not C. elegans.

/Canis$/ => Canis only at end of string. Not Canis lupus.

/^\s*$/ => a blank line.

‘$’ also matches just before a trailing new line character ‘\n’, so a new line at the end of the string does not prevent the match.
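The anchors can be checked against a handful of strings. The list below is made up; "Canis\n" is included to show that '$' still matches when the string ends in a new line, as a line read from a file would.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up strings; some end in a new line, one is empty.
my @strings = ('elegans', 'C. elegans', "Canis\n", 'Canis lupus', "  \n", '');

# Pattern must start at the beginning of the string.
my @at_start = grep { /^elegans/ } @strings;

# Pattern must finish at the end of the string (a trailing \n is allowed).
my @at_end = grep { /Canis$/ } @strings;

# Nothing but optional white space between start and end: a blank line.
my @blank = grep { /^\s*$/ } @strings;

print scalar(@at_start), " at start, ",
      scalar(@at_end),   " at end, ",
      scalar(@blank),    " blank\n";
```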

Memory Variables

Able to extract sections of the pattern and store in a variable.

Part of the pattern within parentheses ‘()’ is stored in special variable.

First instance is $1, second $2, the fourth $4…

Extract from file:

Organism: Homo sapiens

From Perl script:

if ($line=~m/Organism:\s(\w+)\s(\w+)/) {
    $genus = $1;
    $species = $2;
}
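A self-contained version of the extraction above, with the input line held in a variable instead of being read from a file. The second form, assigning the match to a list, is not on the slide; it is a standard Perl idiom that returns the captured groups directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = 'Organism: Homo sapiens';

# Form 1: test the match, then copy $1 and $2 before they are overwritten.
my ($genus, $species) = ('', '');
if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
    $genus   = $1;
    $species = $2;
}

# Form 2: in list context a match returns its captured groups.
my ($g, $s) = $line =~ m/Organism:\s(\w+)\s(\w+)/;

print "genus=$genus species=$species\n";
print "genus=$g species=$s\n";
```

Copying $1 and $2 straight away matters because the next successful match replaces them.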

Revisiting EMBL File (again)

Use shortcuts and anchors to find what you want.

if ($line=~m/AC   .*/) { #found lots

Try:

if ($line=~m/^AC\s{3}([.\w]+)\s*/) {
    $accession=$1; #info stored to use later
}
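To see why the anchored pattern is more selective, the sketch below runs both patterns over a few EMBL-style lines held in an array. The lines and accession numbers are invented for illustration and are not taken from nemaglobins.embl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented EMBL-style lines; the DE line happens to contain 'AC   '.
my @lines = (
    "ID   EXAMPLE1\n",
    "AC   X00001;\n",
    "DE   contains AC   in text\n",
    "AC   X00002;\n",
);

my $loose = 0;       # lines matched by the unanchored pattern
my @accessions;      # accession numbers captured by the anchored pattern

for my $line (@lines) {
    $loose++ if $line =~ m/AC   .*/;

    # ^ pins 'AC' to the start of the line; the parentheses store the ID.
    if ($line =~ m/^AC\s{3}([.\w]+)\s*/) {
        push @accessions, $1;
    }
}

print "unanchored matches: $loose\n";
print "accessions: @accessions\n";
```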

Substitutions

Match a pattern within a string and replace it with another string.

Uses the ‘s’ operator

s/abc/xyz/ => find abc and replace with xyz

Only replaces the first instance of a match. Using the ‘g’ modifier will find and replace all.

$line = 'abcaabbcabca';

$line =~ s/abc/xyz/g;

print $line; # xyzaabbcxyza
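The effect of the 'g' modifier is easiest to see by running the same substitution with and without it. This sketch also uses the return value of s///, which is the number of replacements made; that detail is standard Perl but is not mentioned on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same starting string as the slide, copied twice.
my $once = 'abcaabbcabca';
my $all  = 'abcaabbcabca';

# Without g: only the first 'abc' is replaced.
$once =~ s/abc/xyz/;

# With g: every 'abc' is replaced; s///g returns how many were changed.
my $changed = ($all =~ s/abc/xyz/g);

print "first only: $once\n";
print "global:     $all ($changed replacements)\n";
```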

More Substitutions

Remove all gap characters from a multiple sequence alignment:

$aln = 'AADG--ASD--P-GSTST';

$aln =~ s/-//g;

print $aln; # AADGASDPGSTST

Inserting information:

$line = 'vector:';

$line =~ s/(vector:)/$1 M13MP7/; # vector: M13MP7

$name = 'Daniel';

$name =~ s/(Daniel)/Jack $1/; # Jack Daniel
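Memory variables in the replacement can also reorder text, not just add to it. The sketch below repeats the vector example and then swaps two captured words; the swap and its variable names are illustrative additions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Insert text after the matched label, as on the slide.
my $line = 'vector:';
$line =~ s/(vector:)/$1 M13MP7/;

# Swap two captured words by using $2 before $1 in the replacement.
my $name = 'Homo sapiens';
$name =~ s/(\w+)\s(\w+)/$2, $1/;

print "$line\n";
print "$name\n";
```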

Resources

Learning Perl (O'Reilly), Ch. 7-9

Regular Expression Pocket Reference (O'Reilly)

perldoc perlre

http://etext.lib.virginia.edu/helpsheets/regex.html

http://www.nematodes.org/~jamesw/Perl/regex

Mastering Regular Expressions (O'Reilly)
