Regular Expressions & Pattern Matching James Wasmuth University of Edinburgh [email protected]

Page 1: Regular Expressions  & Pattern Matching

Regular Expressions &

Pattern Matching

James Wasmuth

University of Edinburgh

[email protected]

Page 2: Regular Expressions  & Pattern Matching

Definitions

Pattern Match – searching for a specified pattern within a string.

For example:

A sequence motif,

Accession number of a sequence,

Parsing HTML,

Validating user input.

Regular Expression (regex) – a notation for describing the pattern to match.

Page 3: Regular Expressions  & Pattern Matching

Regular Expressions

A separate mini-language for describing patterns.

Available in most popular languages, usually as a separate library.

Perl - fully incorporated into the language itself.

Page 4: Regular Expressions  & Pattern Matching

How Regex work

[Diagram with the labels: Regex code, Perl compiler, regex engine, Input data (e.g. sequence file), output]

Overview:

how to create regular expressions

how to use them to match and extract data

biological context

Page 5: Regular Expressions  & Pattern Matching

Simple Patterns

Place the regex between a pair of forward slashes ( / / ).

try:

#!/usr/bin/perl

# read the terminal one line at a time; each line is held in $_
while (<STDIN>) {
    if (/abc/) {
        print ">> found 'abc' in $_\n";
    }
}

Save then run the program. Type something on the terminal then press return. Press Ctrl+C to exit the script.

If you type anything containing ‘abc’, the message is printed.

Page 6: Regular Expressions  & Pattern Matching

Binding Operator

Previous example matched against $_

Want to match against a scalar variable?

Binding Operator “=~” matches pattern on right against string on left.

Usually add the m (match) operator for clarity of code.

$string =~ m/pattern/
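
A minimal sketch of the binding operator applied to a scalar rather than $_. The helper name mentions_elegans and the sample value are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string contains 'elegans', else 0.
sub mentions_elegans {
    my ($string) = @_;
    return $string =~ m/elegans/ ? 1 : 0;
}

my $species = 'Caenorhabditis elegans';   # illustrative value
if (mentions_elegans($species)) {
    print "matched: $species\n";
}
```

The pattern sits on the right of =~ and the string being tested on the left, so any scalar can be checked this way.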

Page 7: Regular Expressions  & Pattern Matching

Simple Patterns (2)

Can also read a file and pattern match against each line.

try:

#!/usr/bin/perl

open IN, "<genomes_desc.txt" or die "Cannot open genomes_desc.txt: $!";

while ($line = <IN>) {
    if ($line =~ m/elegans/) {   # true if it finds 'elegans'
        print $line;
    }
}

close IN;

Page 8: Regular Expressions  & Pattern Matching

Flexible matching

Within regex there are many characters with special meanings – metacharacters

star (*) matches zero or more instances of the preceding character

/ab*c/ => ‘a’ followed by zero or more ‘b’ followed by ‘c’

plus (+) matches at least one instance

/ab+c/ => ‘a’ followed by 1 or more ‘b’ followed by ‘c’

question mark (?) matches zero or one instance

/ab?c/ => ‘a’ followed by 0 or 1 ‘b’ followed by ‘c’
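
The three quantifiers above can be compared side by side. This is only a sketch; the helper name matching_quantifiers and the test strings are illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: lists which of the three patterns match the string.
sub matching_quantifiers {
    my ($string) = @_;
    my @hits;
    push @hits, 'ab*c' if $string =~ m/ab*c/;   # zero or more 'b'
    push @hits, 'ab+c' if $string =~ m/ab+c/;   # one or more 'b'
    push @hits, 'ab?c' if $string =~ m/ab?c/;   # zero or one 'b'
    return join ',', @hits;
}

for my $string (qw(ac abc abbbc)) {
    print "$string: ", matching_quantifiers($string), "\n";
}
```

Here 'ac' is matched by * and ? but not +, while 'abbbc' is matched by * and + but not ?.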

Page 9: Regular Expressions  & Pattern Matching

More Flexibility

Match a character a specific number of times, or within a range of times

{x} will match exactly x instances.

/ab{3}c/ => abbbc

{x,y} will match between x and y instances.

/a{2,4}bc/ => aabc or aaabc or aaaabc

{x,} will match x or more instances.

/abc{3,}/ => abccc or abccccccc or abcccccccc
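
A short sketch of the counted quantifiers. The helper names are made up. Note that a pattern may match part of a longer string, so /a{2,4}bc/ still succeeds on 'aaaaabc' by starting at the second 'a'.

```perl
use strict;
use warnings;

# Hypothetical helpers: return 1 on a match, 0 otherwise.
sub exactly_three_b { my ($s) = @_; return $s =~ m/ab{3}c/   ? 1 : 0; }
sub two_to_four_a   { my ($s) = @_; return $s =~ m/a{2,4}bc/ ? 1 : 0; }

for my $s (qw(abbc abbbc abbbbc)) {
    print "$s: ", exactly_three_b($s), "\n";
}
for my $s (qw(abc aabc aaaabc aaaaabc)) {
    print "$s: ", two_to_four_a($s), "\n";
}
```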

Page 10: Regular Expressions  & Pattern Matching

More Flexibility

dot (.) is a wildcard character – matches any character except new line (\n)

/a.c/ => ‘a’ followed by any character followed by ‘c’

Combine metacharacters

/a.{4}c/ => ‘a’ followed by 4 instances of any character followed by ‘c’

so will match addddc , afgthc , ab569c
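
A small sketch of the wildcard combined with a count, including the newline exception. The helper name a_four_any_c is illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if 'a', any four characters, then 'c' occur in the string.
sub a_four_any_c {
    my ($s) = @_;
    return $s =~ m/a.{4}c/ ? 1 : 0;
}

print a_four_any_c('addddc'), "\n";
print a_four_any_c('ab569c'), "\n";
print a_four_any_c('abc'), "\n";          # too few characters between 'a' and 'c'
print a_four_any_c("ab\n12c"), "\n";      # the dot does not match the newline
```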

Page 11: Regular Expressions  & Pattern Matching

Escaping Metacharacters

To use * , + , ? or . as a literal character rather than a metacharacter, 'escape' it with a backslash.

/C\. elegans/ => C. elegans only

/C. elegans/ => ‘C’, then any character, then ‘ elegans’, so will match Ca elegans , C3 elegans , C> elegans , C. elegans , etc...

The 'delimiter' of the regex, forward slash '/', and the 'escape' character, backslash '\', also have special meaning. These need to be escaped if required in the regex.

Important when trying to match URLs and email addresses.

/joe\.bloggs\@darwin\.co\.uk/

/www\.envgen\.nox\.ac\.uk\/biolinux\.html/
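
The effect of escaping can be checked directly. In this sketch the helper names and the test strings are made up; the patterns are the ones from the slide.

```perl
use strict;
use warnings;

# Hypothetical helpers: return 1 on a match, 0 otherwise.
sub strict_c_elegans { my ($s) = @_; return $s =~ m/C\. elegans/ ? 1 : 0; }
sub loose_c_elegans  { my ($s) = @_; return $s =~ m/C. elegans/  ? 1 : 0; }

# The forward slash in the URL is escaped because '/' is the delimiter.
sub is_biolinux_page {
    my ($s) = @_;
    return $s =~ m/www\.envgen\.nox\.ac\.uk\/biolinux\.html/ ? 1 : 0;
}

print strict_c_elegans('Cx elegans'), "\n";   # escaped dot must be a real '.'
print loose_c_elegans('Cx elegans'), "\n";    # unescaped dot accepts 'x'
print is_biolinux_page('http://www.envgen.nox.ac.uk/biolinux.html'), "\n";
```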

Page 12: Regular Expressions  & Pattern Matching

Finding Sequence Identifiers

The file nemaglobins.embl contains EMBL database entries for globins of the phylum Nematoda. Write a script that counts the number of entries.

try:

#!/usr/bin/perl

$count = 0;
open IN, "<nemaglobins.embl" or die "Cannot open nemaglobins.embl: $!";
while ($line = <IN>) {
    if ($line =~ m/AC   .*/) {   # that's three spaces after AC
        $count++;
    }
}
close IN;
print "total=$count\n";

Page 13: Regular Expressions  & Pattern Matching

Grouping Patterns

So far, each quantifier has applied to a single character.

Can group patterns – place them within parentheses “()”.

Powerful when coupled with quantifiers.

/MLSTSTG+/ => MLSTSTGGGGGGGGG…

/MLS(TSTG)+/ => MLSTSTGTSTGTSTG…TSTG

/ML(ST){2}G/ => MLSTSTG
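
To see what grouping changes, the sketch below returns the text each pattern actually matched. It wraps the whole pattern in an extra pair of parentheses and reads $1, Perl's variable for the first parenthesised group. The helper names and the test sequence are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: '+' applies only to the final 'G'.
sub single_char_repeat {
    my ($s) = @_;
    return $s =~ m/(MLSTSTG+)/ ? $1 : '';
}

# Hypothetical helper: '+' applies to the whole group 'TSTG'.
sub group_repeat {
    my ($s) = @_;
    return $s =~ m/(MLS(TSTG)+)/ ? $1 : '';
}

my $seq = 'MLSTSTGTSTGGG';   # illustrative sequence
print single_char_repeat($seq), "\n";   # stops after the first 'G'
print group_repeat($seq), "\n";         # takes both copies of 'TSTG'
```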

Page 14: Regular Expressions  & Pattern Matching

Alternative Matching

Match this or that.

Two ways, depending on the nature of the pattern:

1) use a vertical bar ‘|’

matches if either left side or right side matches,

/(human|mouse|rat)/ => any string with human or mouse or rat.

Page 15: Regular Expressions  & Pattern Matching

2) a character class is a list of characters within '[]'. It will match any single character within the class.

/[wxyz1234\t]/ => any one of the nine.

a range can be specified with '-'

/[w-z1-4\t]/ => as above

to match a literal hyphen, place it first (or last) in the class

/[-a-zA-Z]/ => any letter or a hyphen

negate a class by placing '^' first

/[^z]/ => any character except z

/[^abc]/ => any character except a or b or c
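
Both kinds of alternative matching in one sketch. The helper names and the test strings are made up; checking a sequence for characters outside A, C, G and T is just one possible use of a negated class.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the text contains human, mouse or rat.
sub mentions_mammal {
    my ($text) = @_;
    return $text =~ m/(human|mouse|rat)/ ? 1 : 0;
}

# Hypothetical helper: 1 if the sequence holds any character other than A, C, G or T.
sub has_non_dna {
    my ($seq) = @_;
    return $seq =~ m/[^ACGT]/ ? 1 : 0;
}

print mentions_mammal('mouse liver cDNA'), "\n";
print mentions_mammal('substrate binding'), "\n";   # 'rat' hides inside 'substrate'
print has_non_dna('ACGTTGCA'), "\n";
print has_non_dna('ACGNTGCA'), "\n";                # 'N' is outside the class
```

The second call shows that an alternative matches anywhere in the string, including inside a longer word.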

Page 16: Regular Expressions  & Pattern Matching

Revisiting EMBL file

Want to find the number of globins from Ascaris and ?????.

#!/usr/bin/perl

$count = 0;

open IN, "<nemaglobins.embl" or die "Cannot open nemaglobins.embl: $!";

while ($line = <IN>) {
    if ($line =~ m/OS   (Ascaris|Toxocara)/) {   # three spaces after OS
        $count++;
    }
}

close IN;

print "Found $count globins from Ascaris or Toxocara\n";

Page 17: Regular Expressions  & Pattern Matching

Shortcuts

\d => any digit [0-9]

\w => any “word” character [A-Za-z0-9_]

\s => any white space [\t\n\r\f ]

\D => any character except a digit [^\d]

\W => any character except a “word” character [^\w]

\S => any character except a white space [^\s]

Can use any of these in conjunction with quantifiers,

/\s*/ => any amount of white space (including none)
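
A quick sketch of three of the shortcuts. The helper names and the sample strings are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helpers: return 1 on a match, 0 otherwise.
sub has_digit      { my ($s) = @_; return $s =~ m/\d/ ? 1 : 0; }
sub has_whitespace { my ($s) = @_; return $s =~ m/\s/ ? 1 : 0; }
sub has_non_word   { my ($s) = @_; return $s =~ m/\W/ ? 1 : 0; }

for my $s ('AB012345', 'C. elegans') {
    print "$s: digit=", has_digit($s),
          " space=", has_whitespace($s),
          " nonword=", has_non_word($s), "\n";
}
```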

Page 18: Regular Expressions  & Pattern Matching

Anchoring a Pattern

/pattern/ will match anywhere in the string

Anchors hold the pattern to a point in the string.

caret “^” (shift 6) marks the beginning of string while dollar “$” marks end of a string.

/^elegans/ => elegans only at start of string. Not C. elegans.

/Canis$/ => Canis only at end of string. Not Canis lupus.

/^\s*$/ => a blank line.

‘$’ ignores a new line character ‘\n’ at the very end of the string, so lines read from a file still match.
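
The anchor examples above, as a sketch that can be run. The helper names are made up.

```perl
use strict;
use warnings;

# Hypothetical helpers: return 1 on a match, 0 otherwise.
sub starts_with_elegans { my ($s) = @_; return $s =~ m/^elegans/ ? 1 : 0; }
sub ends_with_canis     { my ($s) = @_; return $s =~ m/Canis$/   ? 1 : 0; }
sub is_blank_line       { my ($s) = @_; return $s =~ m/^\s*$/    ? 1 : 0; }

print starts_with_elegans('C. elegans'), "\n";   # 'elegans' is not at the start
print ends_with_canis("Canis\n"), "\n";          # trailing newline is ignored by $
print ends_with_canis('Canis lupus'), "\n";
print is_blank_line("   \n"), "\n";
```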

Page 19: Regular Expressions  & Pattern Matching

Memory Variables

Able to extract sections of the pattern and store in a variable.

Part of the pattern within parentheses ‘()’ is stored in special variable.

The first group is stored in $1, the second in $2, the fourth in $4…

Extract from file:

Organism: Homo sapiens

From Perl script:

if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
    $genus = $1;
    $species = $2;
}
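
The same extraction wrapped in a function, as a sketch. The name parse_organism is made up. The memory variables are only read after a successful match, because they are not reset when a match fails.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (genus, species), or an empty list if no match.
sub parse_organism {
    my ($line) = @_;
    if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
        return ($1, $2);
    }
    return;   # empty list when the line does not match
}

my ($genus, $species) = parse_organism('Organism: Homo sapiens');
print "genus=$genus species=$species\n";
```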

Page 20: Regular Expressions  & Pattern Matching

Revisiting EMBL File (again)

Use shortcuts and anchors to find what you want.

if ($line =~ m/AC   .*/) {   # three spaces; found lots

Try:

if ($line =~ m/^AC\s{3}([.\w]+)\s*/) {
    $accession = $1;   # info stored to use later
}
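
A sketch of the anchored pattern at work. The sample lines and the accession X12345 are invented for illustration and are not taken from nemaglobins.embl; the helper name is also made up.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the accession from an AC line, or undef.
sub accession_from_line {
    my ($line) = @_;
    return $line =~ m/^AC\s{3}([.\w]+)\s*/ ? $1 : undef;
}

my $acc = accession_from_line("AC   X12345;\n");   # made-up AC line
print "accession=$acc\n";

# 'AC   ' in the middle of a line is rejected because of the ^ anchor.
my $other = accession_from_line("DE   globin AC   X12345;\n");
print defined $other ? "matched\n" : "no match\n";
```

The semicolon is left out of the captured text because it is neither a word character nor a dot.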

Page 21: Regular Expressions  & Pattern Matching

Substitutions

Match a pattern within a string and replace it with another string.

Uses the ‘s’ operator

s/abc/xyz/ => find abc and replace with xyz

Only replaces the first instance of a match. Using the ‘g’ modifier will find and replace all.

$line = 'abcaabbcabca';

$line =~ s/abc/xyz/g;

print $line;   # xyzaabbcxyza
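
To compare the behaviour with and without the ‘g’ modifier, a minimal sketch follows. The helper names are made up; each works on its own copy of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: replaces only the first 'abc'.
sub replace_first {
    my ($text) = @_;
    $text =~ s/abc/xyz/;
    return $text;
}

# Hypothetical helper: replaces every 'abc'.
sub replace_all {
    my ($text) = @_;
    $text =~ s/abc/xyz/g;
    return $text;
}

my $line = 'abcaabbcabca';
print replace_first($line), "\n";   # xyzaabbcabca
print replace_all($line), "\n";     # xyzaabbcxyza
```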

Page 22: Regular Expressions  & Pattern Matching

More Substitutions

Remove all gap characters from a multiple sequence alignment:

$aln = 'AADG--ASD--P-GSTST';

$aln =~ s/-//g;

print $aln;   # AADGASDPGSTST

Inserting information:

$line = 'vector:';

$line =~ s/(vector:)/$1 M13MP7/;   # $line is now 'vector: M13MP7'

$name = 'Daniel';

$name =~ s/(Daniel)/Jack $1/;   # $name is now 'Jack Daniel'
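
Both substitutions can be wrapped in small functions, as sketched here. The helper names are made up. The sketch also relies on s/// returning the number of replacements it made, which gives a gap count for free.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the ungapped sequence and the number of gaps removed.
sub strip_gaps {
    my ($aln) = @_;
    my $removed = ($aln =~ s/-//g) || 0;   # s///g returns the number of replacements
    return ($aln, $removed);
}

# Hypothetical helper: appends a vector name after the 'vector:' label.
sub add_vector_name {
    my ($line, $vector) = @_;
    $line =~ s/(vector:)/$1 $vector/;      # $1 puts the matched label back
    return $line;
}

my ($seq, $gaps) = strip_gaps('AADG--ASD--P-GSTST');
print "$seq ($gaps gaps removed)\n";
print add_vector_name('vector:', 'M13MP7'), "\n";
```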

Page 23: Regular Expressions  & Pattern Matching

Resources

Learning Perl (O'Reilly) Ch. 7-9

Regular Expression Pocket Reference (O'Reilly)

perldoc perlre

http://etext.lib.virginia.edu/helpsheets/regex.html

http://www.nematodes.org/~jamesw/Perl/regex

Mastering Regular Expressions (O'Reilly)