Topic 6: Regular expressions

CSE2395/CSE3395 Perl Programming

Llama3 chapters 7-9, pages 98-127

Camel3 pages 139-195

perlre manpage

Original slides by Debbie Pickett, modified by David Abramson, 2006. Copyright Monash University

In this topic

Regular expressions
► performing pattern matching


Matching strings

Can find one string within another using index function
► returns position of start of substring, or -1 on failure
► $needle = "tac";
► print index "haystack", $needle; # 4

Only works for constant substrings
► not usually sufficient for common pattern-matching uses

Llama3 pages 208-209; Camel3 page 731; perlfunc manpage
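
A minimal sketch of index in use, assuming only core Perl. The variable names are illustrative and not from the slides, and the optional third argument (a starting position) is an addition the slide does not mention.

```perl
use strict;
use warnings;

my $haystack = "haystack";

# index returns the zero-based position of the first occurrence.
my $found   = index $haystack, "tac";
my $missing = index $haystack, "xyz";    # -1 means "not found"

# An optional third argument says where to start searching,
# which lets us find a later occurrence of the same substring.
my $first_a  = index $haystack, "a";
my $second_a = index $haystack, "a", $first_a + 1;

print "tac at $found, xyz at $missing, a at $first_a and $second_a\n";
```

Because -1 is a true value in Perl, a test such as `if (index(...) != -1)` is needed rather than a bare `if (index(...))`.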


Regular expressions

Regular expressions are a mini-language used to describe patterns of characters
► e.g., look for a “t”, followed by any vowel, followed by any letter

Some strings satisfy a given regular expression
► haystack
► taciturn (twice)
► settee
► top

Some strings can’t satisfy it
► mouse
► cattle
► bite me (has space where a letter needed to be)
► empty string

Llama3 pages 98-99
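
The example strings can be checked mechanically. This is only a sketch: it writes the pattern as t[aeiou][a-z] and uses the =~ binding operator, both of which are introduced properly later in this topic, and the names @candidates and %matched are made up for the illustration.

```perl
use strict;
use warnings;

my @candidates = ("haystack", "taciturn", "settee", "top",
                  "mouse", "cattle", "bite me", "");

# Record, for each string, whether "t, vowel, letter" occurs anywhere in it.
my %matched;
foreach my $string (@candidates) {
    $matched{$string} = ($string =~ /t[aeiou][a-z]/) ? 1 : 0;
    printf "%-12s %s\n", "'$string'",
        $matched{$string} ? "matches" : "does not match";
}
```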


Regular expressions

Several Unix programs have support for regular expressions
► usually programs which manipulate text
► grep (print lines matching a pattern)
► sed and awk (stream editors)
► vi and emacs (text editors)
► lex (tokenizer)
► procmail (mail filter)
► perl (some programming language)

Share a (reasonably) common format
► some minor differences in capabilities and dialects
► previous slide’s example written t[aeiou][a-z]


Unix grep program

grep prints out any line in its input that matches a regular expression
► only distantly related to Perl’s grep function

% grep 't[aeiou][a-z]' /usr/dict/words
abated
abetted
abolition
... lots more words here ...
yesterday
youngster
ytterbium

Llama3 page 99


Regular expressions in Perl

Perl tries to match regular expression patterns to the string in the variable $_
► if successful anywhere inside string, result is true
► otherwise (unsuccessful everywhere), result is false

Pattern is written between two forward slashes
► /t[aeiou][a-z]/
► /.../ called match operator
► boolean value returned
  – usually used inside if or while condition
  – if (/t[aeiou][a-z]/) { ... }

Llama3 page 100; Camel3 pages 140, 145-150, 218; perldoc manpage


Timeout

# Find occurrences of a pattern in the named files.

# Read lines of input into $_, one at a time.
while (<>)
{
    # Check for the pattern in $_.
    if (/t[aeiou][a-z]/)
    {
        # Success. Print out this line.
        print;
    }
}


Patterns: literal characters

Alphanumeric characters match themselves
► /abc/ matches substring "abc"
► /123/ matches substring "123"

Many other characters are special and require a backslash in order to match themselves
► /\[a\]/ matches substring "[a]"
► /\/usr\/bin/ matches substring "/usr/bin"
► if in doubt, backslash all non-alphanumerics

Backslashes before alphanumerics are special
► /\n/ matches newline character
► /\b/ matches word boundary
► /\d/ is shorthand for /[0-9]/
► /\1/ is a backreference

Llama3 page 100; Camel3 page 158; perlre manpage
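
A short sketch of escaping in practice. The built-in quotemeta function applies the "backslash all non-alphanumerics" advice automatically; the slides do not mention it, and interpolating a variable into a pattern is only covered later in this topic. Variable names are illustrative.

```perl
use strict;
use warnings;

my $path = "/usr/bin/perl";

# Escaping by hand: each slash needs a backslash inside /.../.
my $by_hand = ($path =~ /\/usr\/bin/) ? 1 : 0;

# quotemeta backslashes every non-word character in a string.
my $escaped = quotemeta "[a]";
my $literal = ("x[a]y" =~ /$escaped/) ? 1 : 0;

# Unescaped, [a] is a character class and matches a lone "a".
my $as_class = ("banana" =~ /[a]/) ? 1 : 0;

print "escaped form: $escaped\n";
```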


Patterns: character classes

[letters] matches exactly one of the enclosed letters
► /[abc]/ matches substrings "a" or "b" or "c"
► can specify ranges with hyphen
► /[0-9]/ matches any single digit

Inverted classes: [^letters] matches any one character except any of those enclosed
► /[^abc]/ matches substring "x" but not "a"
► /[^0-9]/ matches any one non-digit

Some common character classes have shorthand forms
► /\d/ (digit) same as /[0-9]/
► /\s/ (space) same as /[ \t\n\r\f]/
► /\w/ (“word letter”) same as /[a-zA-Z0-9_]/
► inverted shortcuts /\D/ (non-digit), /\S/ (non-space), /\W/ (non-word)

Llama3 pages 105-107; Camel3 pages 159, 165-167; perlre manpage
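
To see which character each class picks out, the sketch below wraps each class in capturing parentheses (covered later in this topic) purely so the matched character can be inspected. The sample string and variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $text = "Room 42, Level_3";

# Each match returns the first character in $text that fits the class.
my ($digit)     = $text =~ /(\d)/;        # first digit
my ($nonword)   = $text =~ /(\W)/;        # first character that is not \w
my ($not_lower) = $text =~ /([^a-z])/;    # first character outside a-z

# An inverted class is a quick way to test "contains only digits".
my $only_digits = ("2006" =~ /[^0-9]/) ? 0 : 1;

print "digit '$digit', non-word '$nonword', not lowercase '$not_lower'\n";
```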


Patterns: any character

. (full stop) shorthand for [^\n] (any character but newline)
► effectively “any character” because $_ seldom contains newline
  – except perhaps unchomped one at very end
► /d.g/ matches substrings "dog", "dig", "d g", "d!g"
► /...../ matches substring containing any five characters
  – true when $_ contains at least five characters
► /.\../ matches any character, a dot, then any character
  – true when $_ contains a dot that isn’t the first or last character of the line

Llama3 page 100; Camel3 page 159; perlre manpage


Timeout

Write regular expressions to match strings containing:
► the word “dog” in any form of capitalization
► a car’s number plate
► a phone number
► a four-letter word beginning with “s”
► “s” at the beginning of the line
► no text at all (an empty line)
► a double letter



Multipliers

Multipliers allow the previous part of the pattern to repeat
► by default, applies to previous letter or character class
  – can group using parentheses
► write multiplier after part of pattern to repeat
► * (asterisk) means “0 or more times”
  – /at*e/ matches strings "Caesar", "fate", "matter"
  – /.*/ matches zero or more of any character
  – by itself, matches any string
► + (plus) means “one or more times”
  – /at+e/ matches "fate", "matter" but not "Caesar"
► ? (question mark) means “0 or 1 times”
  – /colou?r/ matches substrings "color" and "colour"

Llama3 page 100; Camel3 pages 176-178; perlre manpage
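
The three multipliers can be compared side by side on the slide's own words. This sketch uses the =~ binding operator and Perl's grep function, neither of which has been introduced at this point; the hash and array names, and the misspelling "colouur", are made up for the illustration.

```perl
use strict;
use warnings;

# * allows zero t's, + demands at least one.
my (%star, %plus);
foreach my $word ("Caesar", "fate", "matter") {
    $star{$word} = ($word =~ /at*e/) ? 1 : 0;
    $plus{$word} = ($word =~ /at+e/) ? 1 : 0;
}

# ? allows the "u" to appear once or not at all, but not twice.
my @spellings = grep { /colou?r/ } ("color", "colour", "colouur");

print "Accepted spellings: @spellings\n";
```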


Alternation and grouping

| (vertical bar) separates alternatives
► more flexible than character classes
► /cat|dog/ matches substrings "cat" and "dog"
► /a|b|c/ means same as /[abc]/

( parentheses ) used to group part of pattern
► to apply multiplier to more than one character
  – /c(er)+s/ matches strings "saucers" and "sorcerers"
► to factor out common parts of a pattern
  – /(cat|sel)fish/ matches substrings "catfish" and "selfish"
► to use backreferences and capture strings
  – see later

Llama3 page 102; Camel3 pages 187-188, 182-185; perlre manpage
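
A small sketch of why the parentheses matter: without them, | splits the whole pattern into alternatives. The test words beyond those on the slide ("dogfish", "cat") and the variable names are assumptions chosen for the illustration.

```perl
use strict;
use warnings;

# Grouped: "cat" or "sel", and then "fish" in either case.
my %fish;
foreach my $word ("catfish", "selfish", "dogfish") {
    $fish{$word} = ($word =~ /(cat|sel)fish/) ? 1 : 0;
}

# Ungrouped: the alternatives are "cat" and "selfish", so plain "cat" matches.
my $ungrouped = ("cat" =~ /cat|selfish/)   ? 1 : 0;
my $grouped   = ("cat" =~ /(cat|sel)fish/) ? 1 : 0;

# A multiplier after a group repeats the whole group.
my $repeated = ("sorcerers" =~ /c(er)+s/) ? 1 : 0;

print "ungrouped=$ungrouped grouped=$grouped repeated=$repeated\n";
```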


Anchors

Sometimes want a pattern to match only at beginning or end of string
► called “anchoring” a pattern

^ (caret) means “beginning of string”
► /^s/ matches beginning of string followed by “s”
  – i.e., any string that starts with “s”

$ (dollar) means “end of string”
► /r$/ matches “r” followed by end of string
  – i.e., any string that ends with “r”
► works even if string has not been chomped

Both can be used in same regular expression
► /^dog$/ matches only if entire string is "dog"

\b means “boundary between word (\w) and non-word (\W) characters”

Llama3 pages 108-109; Camel3 pages 178-180; perlre manpage
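
The anchors can be tried on a few strings, including one that still has its trailing newline. The sample strings and variable names are assumptions for the illustration, and the =~ operator is covered later in the topic.

```perl
use strict;
use warnings;

my $line = "dog\n";    # as read from input, not chomped

# $ also matches just before a trailing newline.
my $whole    = ($line =~ /^dog$/)    ? 1 : 0;

# Anchored, "hotdog" fails; unanchored, it succeeds.
my $anchored = ("hotdog" =~ /^dog$/) ? 1 : 0;
my $anywhere = ("hotdog" =~ /dog/)   ? 1 : 0;

# \b requires "dog" to stand alone as a word.
my $as_word = ("hot dog stand" =~ /\bdog\b/) ? 1 : 0;
my $in_word = ("hotdog stand"  =~ /\bdog\b/) ? 1 : 0;

print "whole=$whole anchored=$anchored anywhere=$anywhere ",
      "as_word=$as_word in_word=$in_word\n";
```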


Timeout

# Mail headers revisited: verify mail header format.

# Mail headers look like either of these lines:
# word: anything after the colon
#     continuation lines are indented

while (<>)
{
    # Stop when blank line reached; end of headers.
    last if /^$/;

    # Pattern matches if line starts with either
    # - at least one non-space, then colon, or
    # - a space
    unless (/^(\S+:|\s)/)
    {
        print "Bad header line:\n$_";
    }
}


split and join

split function breaks a string up into pieces
► takes regular expression to specify how pieces are to be separated; returns the pieces as a list
► @threeparts = split / /, "cat and mouse";
► foreach (split /\s+/, $line) { ... }
► @fields = split /,/, $record; # CSV

join function joins a list into a string
► takes string to specify what goes between pieces; returns the pieces glued together into a string
► $phrase = join " and ", "cat", "mouse", "fish";
► print join " ", @words;
► $record = join ",", @fields; # CSV

Llama3 pages 125-127; Camel3 pages 794-796, 733; perlfunc manpage
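
A sketch of split and join working as a pair. It also shows one behavior the slide does not mention: splitting on /\s+/ yields an empty first field when the string starts with whitespace, whereas the special separator ' ' (a single-space string) skips leading whitespace. Variable names are illustrative.

```perl
use strict;
use warnings;

# Split a comma-separated record, then glue the pieces back differently.
my $record = "cat,mouse,fish";
my @fields = split /,/, $record;
my $phrase = join " and ", @fields;

# Leading whitespace produces an empty leading field with /\s+/ ...
my @raw = split /\s+/, "  cat and mouse";

# ... but not with the special-case separator ' '.
my @words = split ' ', "  cat and mouse";

print "$phrase\n";
print scalar(@raw), " raw fields, ", scalar(@words), " words\n";
```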


Timeout

# Iterate over every word in an input stream.

# Read each line of input
while (<STDIN>)
{
    foreach (split /\s+/, $_)
    {
        next if /^$/; # Skip blank words.

        do_something($_);
    }
}

sub do_something
{
    print "Saw word ", shift, "\n";
}



Advanced regular expressions

Most languages can process regular expressions of complexity seen so far

Perl has many more advanced features which use regular expressions
► case-insensitive matching
► interpolating patterns
► backreferences
► capturing matched strings
► substitution
► matching variables other than $_
► greedy and lazy multipliers


Case-insensitive matches

Regular expressions normally sensitive to case
► /a/ doesn’t match substring "A"

Can make pattern case-insensitive using i modifier
► put i character immediately after end of match operator
► /a/i matches substrings "a" or "A"



Interpolating into patterns

Variables can be interpolated into regular expressions
► like double-quoted strings
► $pattern = 'fish(es)?'; /cat$pattern/
  – same as /catfish(es)?/
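
Because the interpolated text is treated as a pattern, any metacharacters in it keep their special meaning. The sketch below contrasts that with the \Q...\E escape, which quotes the interpolated text so it matches literally; \Q...\E is not on the slide, and the variable names are made up for the illustration.

```perl
use strict;
use warnings;

# Interpolated text is treated as part of the pattern.
my $suffix = 'fish(es)?';
my $plural = ("catfishes" =~ /cat$suffix/) ? 1 : 0;

# Here the dot in $needle acts as "any character" ...
my $needle = 'a.c';
my $loose  = ("abc" =~ /$needle/) ? 1 : 0;

# ... unless \Q...\E quotes it, so only a literal dot will do.
my $strict = ("abc" =~ /\Q$needle\E/) ? 1 : 0;
my $exact  = ("a.c" =~ /\Q$needle\E/) ? 1 : 0;

print "plural=$plural loose=$loose strict=$strict exact=$exact\n";
```

Quoting matters most when the interpolated text comes from user input that is meant to be searched for literally.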



Timeout

# Perl implementation of Unix grep program

# Pattern is first command-line argument
$pattern = shift;

while (<>)
{
    # Print the line if it matches the pattern.
    # o ("once") modifier tells Perl to assume that
    # the pattern never changes; this allows Perl
    # to re-use the compiled regular expression,
    # making the program run faster.
    print if /$pattern/o;
}


Backreferences

So far, cannot write pattern to match double letter
► /[a-z][a-z]/ matches any two letters, even if different

Need pattern that says: “match any letter, calling the matched string ‘1’, then match string ‘1’ again”

Backreferences refer to the substrings matched by previous parts of the pattern
► put parentheses around part of pattern to remember
  – first ( and its matching ) become string 1
  – second ( and its matching ) become string 2
► write backreference as \1, \2, etc.
► /([a-z])\1/ matches substring composed of any double letter
► /\b(\w+)\b.*\b\1\b/ matches any string containing the same word twice

Llama3 pages 109-111; Camel3 pages 182-184; perlre manpage
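
Both backreference patterns from the slide can be exercised directly. The test strings and variable names are assumptions chosen for the illustration, and the =~ operator is covered later in the topic.

```perl
use strict;
use warnings;

# \1 must repeat exactly what the first group matched.
my %double;
foreach my $word ("settee", "mouse", "mississippi") {
    $double{$word} = ($word =~ /([a-z])\1/) ? 1 : 0;
}

# The same whole word appearing twice in a sentence.
my $repeat    = ("the cat saw the dog" =~ /\b(\w+)\b.*\b\1\b/) ? 1 : 0;
my $no_repeat = ("the cat saw a dog"   =~ /\b(\w+)\b.*\b\1\b/) ? 1 : 0;

print "repeat=$repeat no_repeat=$no_repeat\n";
```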


Capturing strings

Matched backreference substrings are available after the match succeeds
► backreference \1 is available in special variable $1
► backreference \2 is available in special variable $2
► etc.

Allows code to find out what strings matched which parts of a pattern
► $_ = "haystack"; /(t([aeiou])[a-z])/ puts "tac" in $1 and "a" in $2

Captured strings are available until next match succeeds
► if match fails, variables are not set; they keep values from the last successful match

Llama3 pages 109-111; Camel3 pages 182-185; perlre, perlvar manpages
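
A sketch of capturing in practice. Besides the slide's own haystack example, it shows two things the slide does not state: a match in list context returns the captured strings directly, and $1 still holds the old value after a failed match, which is why the match result should be tested first. Variable names are illustrative, and the =~ operator is covered later in the topic.

```perl
use strict;
use warnings;

# Nested groups are numbered by their opening parenthesis.
$_ = "haystack";
my ($outer, $inner);
if (/(t([aeiou])[a-z])/) {
    ($outer, $inner) = ($1, $2);
}

# In list context a match returns its captures.
my ($name, $value) = "Subject: hello" =~ /^(\S+):\s?(.*)$/;

# A failed match does not reset $1.
my $first_ok  = "haystack" =~ /(hay)/;
my $second_ok = "haystack" =~ /(needle)/;
my $stale     = $1;    # left over from the first match

print "outer=$outer inner=$inner name=$name value=$value stale=$stale\n";
```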


Timeout

# Identify mail headers

while (<STDIN>)
{
    last if /^$/;

    # Extract name of header (before colon) into $1
    # and content of header (after colon to end of
    # line) into $2.
    # Match fails on continuation lines, so
    # $1 and $2 variables not set.
    if (/^(\S+):\s?(.*)$/)
    {
        print "Header name is $1, contains $2\n";
    }
}


Timeout

# Fancy Unix grep that identifies where a match was.

$pattern = shift;

# ANSI terminal escapes
$bold = "\033[1m"; $norm = "\033[0m";

while (<>)
{
    # Look for pattern, capture it into $2.
    # Also capture all previous text on line into $1
    # and all following text to $3.
    if (/^(.*)($pattern)(.*)$/o)
    {
        # . never matches the newline, so $3 lacks it; put it back.
        print "$1$bold$2$norm$3\n";
    }
}


Multiplier greediness

Multipliers *, + and ? are normally greedy
► if there are two ways to successfully match a string, they will try to match the longest substring
► $_ = "mississippi"; /m.*ss/ matches up to second “ss”
  – because /.*/ would rather match “issi” than just “i”

Non-greedy (lazy) multipliers *?, +? and ?? exist
► will try to match the shortest substring
► $_ = "mississippi"; /m.*?ss/ matches up to first “ss”

If only one way to match, greedy and lazy multipliers match same way

Greediness only important if need to know which part of string matched a pattern
► if using \1, \2, $1, $2, etc.
► if using s/.../.../

Camel3 pages 177-178; perlre manpage
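
Capturing the match makes the difference between greedy and lazy visible. The mississippi example is the slide's; the tag example and the variable names are assumptions added for the illustration.

```perl
use strict;
use warnings;

$_ = "mississippi";

# Greedy .* runs to the last "ss"; lazy .*? stops at the first.
my ($greedy) = /(m.*ss)/;
my ($lazy)   = /(m.*?ss)/;

# The same effect when pulling text out from between delimiters.
my $markup = "<b>bold</b>";
my ($tag_greedy) = $markup =~ /<(.+)>/;
my ($tag_lazy)   = $markup =~ /<(.+?)>/;

print "greedy=$greedy lazy=$lazy\n";
print "tag_greedy=$tag_greedy tag_lazy=$tag_lazy\n";
```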


Substitution

To replace a matched substring with a new substring, use s/pattern/replacement/ operator

pattern is a regular expression to find in the $_ variable

replacement is the string to replace the matching part of $_
► not a regular expression
► may contain $1, $2, etc. captured strings

If pattern not found, no change is made to $_

s/colou?r/hue/; # Make a synonym

Llama3 pages 122-123; Camel3 pages 152-155; perlop manpage


Substitution

Variables are interpolated into both pattern and replacement
► s/$regex/$new/;

Substitution normally only occurs for the first match in a string
► use g (“global”) modifier to make substitution repeat as often as possible on the string
  – s/cat/dog/g;
► substitution also takes i (case-insensitive) modifier

Llama3 pages 123, 124; Camel3 page 153; perlop manpage
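
A sketch of substitution on $_, showing the effect of the g and i modifiers, the count that s/// returns, and captured strings in the replacement. The sample sentences and variable names are made up for the illustration.

```perl
use strict;
use warnings;

# Without g, only the first match is replaced (and "Cat" differs in case).
$_ = "The cat sat near another Cat.";
s/cat/dog/;
my $once = $_;

# With g and i, every match is replaced regardless of case;
# s/// returns how many replacements were made.
$_ = "The cat sat near another Cat.";
my $count = s/cat/dog/gi;
my $all   = $_;

# Captured strings can be rearranged in the replacement.
$_ = "Wall, Larry";
s/^(\w+), (\w+)$/$2 $1/;
my $swapped = $_;

print "$once\n$all ($count changes)\n$swapped\n";
```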


Timeout

# Censor: change some words in input to others.

%swearwords = (
    'Micro[s\$]oft'         => 'M.......t',
    'Windows( (95|98|ME))?' => 'Windoze',
    'Python'                => 'anti-Perl'
);

# Start at zero so the final report prints a number
# even when nothing was changed.
$count = 0;

while (<>)
{
    while (($bad, $euphemism) = each %swearwords)
    {
        # s/// returns number of times succeeded
        $count += s/$bad/$euphemism/gi;
    }
    print;
}
print "$count words changed\n";


Binding operator =~

Match /.../ and substitution s/.../.../ operators match against $_ variable by default

Can match against any variable with binding operator =~
► put variable on left of operator
► put match/substitution on right of operator
► if ($string =~ /pattern/) { ... }
► $changeme =~ s/cat/dog/g;
► ($copy = $orig) =~ s/cat/dog/g;
► if ($_ =~ /pattern/) # Redundant

Llama3 pages 117-118; Camel3 pages 93-94; perlop manpage
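
A sketch of the binding operator with lexical variables, including the copy-then-substitute idiom from the slide. The negated form !~ is an addition not shown on the slide; the variable names are illustrative.

```perl
use strict;
use warnings;

my $orig = "cat and mouse";

# Copy first, then substitute in the copy; $orig is left alone.
(my $copy = $orig) =~ s/cat/dog/g;

# =~ is true when the pattern matches; !~ is true when it does not.
my $has_mouse = ($orig =~ /mouse/) ? 1 : 0;
my $no_fish   = ($orig !~ /fish/)  ? 1 : 0;

print "orig: $orig\ncopy: $copy\n";
```

The parentheses around the assignment are what make the substitution apply to $copy rather than to $orig.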


Covered in this topic

Regular expressions
► Character classes
  – [...], ., \s, \S, \d, etc.
► Multipliers
  – *, +, ?, non-greedy versions *?, +?, ??
► Anchors
  – ^, $

Match operator /.../

Interpolation

split and join

Alternation and grouping

Backreferences and capturing substrings
► \1, \2, $1, $2, etc.

Substitution operator s/.../.../

Binding operator =~


Going further

Advanced regular expressions
► look-ahead, look-behind, evaluating Perl expressions as regular expressions, etc.
► Camel3 pages 195-216
► Mastering Regular Expressions, by Jeffrey Friedl, O’Reilly & Associates

tr/.../.../
► transliteration operator, like Unix tr program

sed, awk, grep, vi, ...
► some of Unix’s more powerful pattern-matching tools
► man sed, man awk, ...


Next topic

File I/O

Opening and closing files

Reading from and writing to files

Manipulating files and directories

Communicating with processes

Llama3 chapter 6, pages 86-97, chapters 11-14, pages 148-207
Camel3 pages 20-22, 28-29, 97-100, 426-428, 747-755, 770
perlfunc, perlopentut manpages
