Topic 6: Regular expressions

CSE2395/CSE3395 Perl Programming

Llama3 chapters 7-9, pages 98-127

Camel3 pages 139-195

perlre manpage

Original slides by Debbie Pickett, modified by David Abramson, 2006. Copyright Monash University.

In this topic

Regular expressions
► performing pattern matching

Matching strings

Can find one string within another using the index function
► returns position of start of substring, or -1 on failure
► $needle = "tac";
► print index "haystack", $needle; # 4

Only works for constant substrings
► not usually sufficient for common pattern-matching uses

Llama3 pages 208-209; Camel3 page 731; perlfunc manpage
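A minimal sketch of index on the slide's example, showing both the success and the failure return values. The variable names $haystack, $found and $missing are illustrative and not from the slides.

```perl
use strict;
use warnings;

my $haystack = "haystack";

# index returns the zero-based position of the first occurrence,
# or -1 when the substring does not occur at all.
my $found   = index $haystack, "tac";    # 4
my $missing = index $haystack, "cat";    # -1

print "tac found at position $found\n" if $found >= 0;
print "cat not found\n"                if $missing == -1;
```

Comparing against -1 (rather than testing for truth) matters because a match at position 0 is a valid result that is false in a boolean test.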

Regular expressions

Regular expressions are a mini-language used to describe patterns of characters
► e.g., look for a “t”, followed by any vowel, followed by any letter

Some strings satisfy a given regular expression
► haystack
► taciturn (twice)
► settee
► top

Some strings can’t satisfy it
► mouse
► cattle
► bite me (has a space where a letter is needed)
► empty string

Llama3 pages 98-99
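The example strings above can be checked mechanically. This sketch assumes the pattern is written t[aeiou][a-z], the notation these slides use for it, and sorts the slide's strings into those that match and those that do not. The array names are illustrative.

```perl
use strict;
use warnings;

my @words = ("haystack", "taciturn", "settee", "top",
             "mouse", "cattle", "bite me", "");

my (@matched, @unmatched);
foreach (@words) {
    # The match operator tests $_, which foreach sets to each word in turn.
    if (/t[aeiou][a-z]/) {
        push @matched, $_;
    } else {
        push @unmatched, $_;
    }
}

print "matched: ", join(", ", @matched), "\n";
print "not matched: ", scalar(@unmatched), " strings\n";
```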

Regular expressions

Several Unix programs have support for regular expressions
► usually programs which manipulate text
► grep (print lines matching a pattern)
► sed and awk (stream editors)
► vi and emacs (text editors)
► lex (tokenizer)
► procmail (mail filter)
► perl (some programming language)

Share a (reasonably) common format
► some minor differences in capabilities and dialects
► previous slide’s example written t[aeiou][a-z]

Unix grep program

grep prints out any line in its input that matches a regular expression
► only distantly related to Perl’s grep function

    % grep 't[aeiou][a-z]' /usr/dict/words
    abated
    abetted
    abolition
    ... lots more words here ...
    yesterday
    youngster
    ytterbium

Llama3 page 99
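For comparison, a short sketch of Perl's own grep function, which filters a list rather than lines of a file. The word list here is made up for illustration; only the pattern comes from the slides.

```perl
use strict;
use warnings;

my @words = qw(abated mouse yesterday cattle ytterbium);

# grep sets $_ to each element in turn and returns the elements
# for which the block evaluates to true.
my @hits = grep { /t[aeiou][a-z]/ } @words;

print "$_\n" foreach @hits;
```

When the block is a pattern match, the effect resembles the Unix program, which is presumably where the function gets its name.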

Regular expressions in Perl

Perl tries to match regular expression patterns to the string in the variable $_
► if successful anywhere inside string, result is true
► otherwise (unsuccessful everywhere), result is false

Pattern is written between two forward slashes
► /t[aeiou][a-z]/
► /.../ called match operator
► boolean value returned
    – usually used inside if or while condition
    – if (/t[aeiou][a-z]/) { ... }

Llama3 page 100; Camel3 pages 140, 145-150, 218; perldoc manpage
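A small sketch of the match operator used as a boolean against $_. The two test strings and the @results array are illustrative choices.

```perl
use strict;
use warnings;

my @results;
foreach ("haystack", "mouse") {
    # foreach places each string in $_; the match operator tests $_.
    if (/t[aeiou][a-z]/) {
        push @results, "$_: match";
    } else {
        push @results, "$_: no match";
    }
}

print "$_\n" foreach @results;
```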

Timeout

    # Find occurrences of a pattern in the named files.

    # Read lines of input into $_, one at a time.
    while (<>)
    {
        # Check for the pattern in $_.
        if (/t[aeiou][a-z]/)
        {
            # Success. Print out this line.
            print;
        }
    }

Patterns: literal characters

Alphanumeric characters match themselves
► /abc/ matches substring "abc"
► /123/ matches substring "123"

Many other characters have a special meaning and require a backslash in order to match themselves
► /\[a\]/ matches substring "[a]"
► /\/usr\/bin/ matches substring "/usr/bin"
► if in doubt, backslash all non-alphanumerics

Backslashes before alphanumerics are special
► /\n/ matches newline character
► /\b/ matches word boundary
► /\d/ is shorthand for /[0-9]/
► /\1/ is a backreference

Llama3 page 100; Camel3 page 158; perlre manpage
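A sketch of matching punctuation literally, using the two escaped patterns from the slide. The quotemeta call is an addition not mentioned in the slides: it is Perl's built-in function for backslashing every non-word character in a string, which automates the "if in doubt" advice. The sample text is made up.

```perl
use strict;
use warnings;

$_ = "look in /usr/bin for the [a] tool";

# Slashes and brackets are escaped so they are treated as literal text.
my $has_path    = (/\/usr\/bin/) ? 1 : 0;    # 1
my $has_bracket = (/\[a\]/)      ? 1 : 0;    # 1

# quotemeta adds the backslashes for you.
my $escaped = quotemeta "[a]";               # the string \[a\]

print "path: $has_path, bracket: $has_bracket, escaped: $escaped\n";
```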

Patterns: character classes

[letters] matches exactly one of the enclosed letters
► /[abc]/ matches substrings "a" or "b" or "c"
► can specify ranges with hyphen
► /[0-9]/ matches any single digit

Inverted classes: [^letters] matches any one character except any of those enclosed
► /[^abc]/ matches substring "x" but not "a"
► /[^0-9]/ matches any one non-digit

Some common character classes have shorthand forms
► /\d/ (digit) same as /[0-9]/
► /\s/ (space) same as /[ \t\n\r\f]/
► /\w/ (“word letter”) same as /[a-zA-Z0-9_]/
► inverted shortcuts /\D/ (non-digit), /\S/ (non-space), /\W/ (non-word)

Llama3 pages 105-107; Camel3 pages 159, 165-167; perlre manpage
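A sketch that applies the classes above to a handful of one-character strings, so the membership of each class is visible. The sample characters and array names are illustrative.

```perl
use strict;
use warnings;

my @chars = ("7", "x", " ", "_", "!");

my @digits     = grep { /\d/ } @chars;    # "7"
my @spaces     = grep { /\s/ } @chars;    # " "
my @word_chars = grep { /\w/ } @chars;    # "7", "x", "_"
my @non_digits = grep { /\D/ } @chars;    # "x", " ", "_", "!"

# An inverted class needs only ONE character outside the set:
# "cab" consists solely of a, b and c, so it does not match.
my @not_abc = grep { /[^abc]/ } ("a", "x", "cab", "cat");    # "x", "cat"

print "word characters: @word_chars\n";
print "strings with a non-abc character: @not_abc\n";
```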

Patterns: any character

. (full stop) is shorthand for [^\n] (any character but newline)
► effectively “any character” because $_ seldom contains newline
    – except perhaps unchomped one at very end
► /d.g/ matches substrings "dog", "dig", "d g", "d!g"
► /...../ matches substring containing any five characters
    – true when $_ contains at least five characters
► /.\../ matches any character, a dot, then any character
    – true when $_ contains a dot that isn’t the first or last character of the line

Llama3 page 100; Camel3 page 159; perlre manpage
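A sketch of the three dot patterns from the slide, each run over a few test strings chosen for illustration, including one with an embedded newline to show what the dot refuses to match.

```perl
use strict;
use warnings;

# The dot matches any single character except newline.
my @dg = grep { /d.g/ } ("dog", "d g", "d!g", "dg", "d\ng");    # first three only

# Five dots need five consecutive characters.
my @five = grep { /...../ } ("abcd", "abcde", "abcdef");         # last two

# An escaped dot is a literal dot; it needs a character on each side here.
my @inner = grep { /.\../ } ("a.b", ".ab", "ab.", "x.y.z");      # "a.b", "x.y.z"

print scalar(@dg), " strings matched /d.g/\n";
print "five or more characters: @five\n";
print "dot in the middle: @inner\n";
```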

Timeout

Write regular expressions to match strings containing:
► the word “dog” in any form of capitalization
► a car’s number plate
► a phone number
► a four-letter word beginning with “s”
► “s” at the beginning of the line
► no text at all (an empty line)
► a double letter

Multipliers

Multipliers allow the previous part of the pattern to repeat
► by default, applies to previous letter or character class
    – can group using parentheses
► write multiplier after part of pattern to repeat
► * (asterisk) means “0 or more times”
    – /at*e/ matches strings "Caesar", "fate", "matter"
    – /.*/ matches zero or more of any character
    – by itself, matches any string
► + (plus) means “one or more times”
    – /at+e/ matches "fate", "matter" but not "Caesar"
► ? (question mark) means “0 or 1 times”
    – /colou?r/ matches substrings "color" and "colour"

Llama3 page 100; Camel3 pages 176-178; perlre manpage
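A sketch of the three multipliers on the slide's own words. The extra string "colouur" is an added counter-example showing that ? allows at most one repetition.

```perl
use strict;
use warnings;

my @words = ("Caesar", "fate", "matter", "color", "colour", "colouur");

my @star = grep { /at*e/ }    @words;    # Caesar, fate, matter
my @plus = grep { /at+e/ }    @words;    # fate, matter
my @opt  = grep { /colou?r/ } @words;    # color, colour

print "at*e:    @star\n";
print "at+e:    @plus\n";
print "colou?r: @opt\n";
```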

Alternation and grouping

| (vertical bar) separates alternatives
► more flexible than character classes
► /cat|dog/ matches substrings "cat" and "dog"
► /a|b|c/ means same as /[abc]/

( parentheses ) used to group part of pattern
► to apply multiplier to more than one character
    – /c(er)+s/ matches strings "saucers" and "sorcerers"
► to factor out common parts of a pattern
    – /(cat|sel)fish/ matches substrings "catfish" and "selfish"
► to use backreferences and capture strings
    – see later

Llama3 page 102; Camel3 pages 187-188, 182-185; perlre manpage
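A sketch of alternation and grouping using the slide's patterns. The non-matching strings ("bird", "cars", "dogfish") are added for contrast and are not from the slides.

```perl
use strict;
use warnings;

# Alternation: either whole alternative may match.
my @pets = grep { /cat|dog/ } ("catalog", "hotdog", "bird");

# Grouping lets + repeat the two-character unit "er".
my @cers = grep { /c(er)+s/ } ("saucers", "sorcerers", "cars");

# Grouping limits the alternation to the first part of the word.
my @fish = grep { /(cat|sel)fish/ } ("catfish", "selfish", "dogfish");

print "cat|dog:       @pets\n";
print "c(er)+s:       @cers\n";
print "(cat|sel)fish: @fish\n";
```

Without the parentheses, /cat|selfish/ would mean "cat" or "selfish", which is a different pattern.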

Anchors

Sometimes want a pattern to match only at beginning or end of string
► called “anchoring” a pattern

^ (caret) means “beginning of string”
► /^s/ matches beginning of string followed by “s”
    – i.e., any string that starts with “s”

$ (dollar) means “end of string”
► /r$/ matches “r” followed by end of string
    – i.e., any string that ends with “r”
► works even if string has not been chomped

Both can be used in same regular expression
► /^dog$/ matches only if entire string is "dog"

\b means “boundary between word (\w) and non-word (\W) characters”

Llama3 pages 108-109; Camel3 pages 178-180; perlre manpage
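A sketch of the anchors above, including an unchomped string to show that $ still matches just before a trailing newline. The test strings are illustrative.

```perl
use strict;
use warnings;

my @starts = grep { /^s/ }      ("snake", "bus");                     # snake
my @ends   = grep { /r$/ }      ("car", "car\n", "cart");             # car, car\n
my @exact  = grep { /^dog$/ }   ("dog", "dog\n", "hotdog", "dogs");   # dog, dog\n
my @words  = grep { /\bcat\b/ } ("a cat sat", "catalog", "cat");      # not catalog

print scalar(@starts), " start with s\n";
print scalar(@ends),   " end with r\n";
print scalar(@exact),  " are exactly dog\n";
print scalar(@words),  " contain the word cat\n";
```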

Timeout

    # Mail headers revisited: verify mail header format.

    # Mail headers look like either of these lines:
    # word: anything after the colon
    #   continuation lines are indented

    while (<>)
    {
        # Stop when blank line reached; end of headers.
        last if /^$/;

        # Pattern matches if line starts with either
        # - at least one non-space, then colon, or
        # - a space
        unless (/^(\S+:|\s)/)
        {
            print "Bad header line:\n$_";
        }
    }

split and join

split function breaks a string up into pieces
► takes regular expression to specify how pieces are to be separated; returns the pieces as a list
► @threeparts = split / /, "cat and mouse";
► foreach (split /\s+/, $line) { ... }
► @fields = split /,/, $record; # CSV

join function joins a list into a string
► takes string to specify what goes between pieces; returns the pieces glued together as a string
► $phrase = join " and ", "cat", "mouse", "fish";
► print join " ", @words;
► $record = join ",", @fields; # CSV

Llama3 pages 125-127; Camel3 pages 794-796, 733; perlfunc manpage
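A sketch of a split and join round trip. The record contents are made up. Note the assumption that no field contains a comma: splitting on /,/ is not a full CSV parser and does not handle quoted fields.

```perl
use strict;
use warnings;

my $record = "cat,mouse,fish";

# Split on commas; assumes no field contains a comma itself.
my @fields = split /,/, $record;          # ("cat", "mouse", "fish")

# Join with a different separator.
my $phrase = join " and ", @fields;       # "cat and mouse and fish"

# \s+ treats any run of whitespace as a single separator.
my @words = split /\s+/, "cat  and\tmouse";   # ("cat", "and", "mouse")

print scalar(@fields), " fields: $phrase\n";
print scalar(@words), " words\n";
```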

Timeout

    # Iterate over every word in an input stream.

    # Read each line of input
    while (<STDIN>)
    {
        foreach (split /\s+/, $_)
        {
            next if /^$/; # Skip blank words.

            do_something($_);
        }
    }

    sub do_something
    {
        print "Saw word ", shift, "\n";
    }

Timeout

Write regular expressions to match strings containing:
► the word “dog” in any form of capitalization
► a car’s number plate
► a phone number
► a four-letter word beginning with “s”
► “s” at the beginning of the line
► no text at all (an empty line)
► a double letter
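One possible set of answers, offered as a sketch rather than a model solution. The number plate and phone number formats are hypothetical, chosen only to make the exercise concrete: three capital letters followed by three digits, and eight digits with an optional space or hyphen in the middle. The double letter is left out because it needs backreferences.

```perl
use strict;
use warnings;

# "dog" in any capitalization, using only character classes.
my @dog = grep { /[Dd][Oo][Gg]/ } ("Dog", "hotDOG", "dig");

# Four-letter word beginning with "s" (lower case assumed).
my @four = grep { /\bs[a-z][a-z][a-z]\b/ } ("the sand dune", "sandy", "sun");

# "s" at the beginning of the line, and an empty line.
my @start = grep { /^s/ } ("sand", "as");
my @empty = grep { /^$/ } ("", "\n", " ");

# Hypothetical plate format: ABC123.
my @plate = grep { /\b[A-Z][A-Z][A-Z][0-9][0-9][0-9]\b/ } ("ABC123", "AB1234");

# Hypothetical phone format: 9905 1234, 9905-1234 or 99051234.
my @phone = grep { /\b\d\d\d\d[ -]?\d\d\d\d\b/ }
            ("9905 1234", "9905-1234", "99051234", "990 51234");

print "dog: @dog\n";
print "plates: @plate\n";
print scalar(@phone), " phone numbers accepted\n";
```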

Advanced regular expressions

Most languages can process regular expressions of complexity seen so far

Perl has many more advanced features which use regular expressions
► case-insensitive matching
► interpolating patterns
► backreferences
► capturing matched strings
► substitution
► matching variables other than $_
► greedy and lazy multipliers

Case-insensitive matches

Regular expressions normally sensitive to case
► /a/ doesn’t match substring "A"

Can make pattern case-insensitive using i modifier
► put i character immediately after end of match operator
► /a/i matches substrings "a" or "A"

Llama3 page 116; Camel3 pages 147-178; perlre manpage
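A sketch comparing the same pattern with and without the i modifier. The three spellings of "dog" are illustrative.

```perl
use strict;
use warnings;

my @spellings = ("dog", "Dog", "DOG");

my @sensitive   = grep { /dog/ }  @spellings;    # only "dog"
my @insensitive = grep { /dog/i } @spellings;    # all three

print "case-sensitive:   @sensitive\n";
print "case-insensitive: @insensitive\n";
```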

Interpolating into patterns

Variables can be interpolated into regular expressions
► like double-quoted strings
► $pattern = 'fish(es)?'; /cat$pattern/
    – same as /catfish(es)?/

Llama3 page 118; Camel3 pages 190-191; perlre manpage
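A sketch of interpolation using the slide's $pattern. The second half is an addition not covered in the slides: the \Q...\E escape quotes an interpolated string so that its characters match literally, which is useful when the variable holds plain text rather than a pattern. The variable $literal and the test strings are illustrative.

```perl
use strict;
use warnings;

my $pattern = 'fish(es)?';

# The variable's contents become part of the pattern.
my @hits = grep { /cat$pattern/ } ("catfish", "catfishes", "dogfish");

# Interpolated text is treated as a pattern, so "." means any character...
my $literal = "a.b";
my @as_regex = grep { /$literal/ } ("a.b", "axb");         # both match

# ...unless it is quoted with \Q...\E.
my @as_text  = grep { /\Q$literal\E/ } ("a.b", "axb");     # only "a.b"

print "hits: @hits\n";
print "as regex: @as_regex; as text: @as_text\n";
```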

Timeout

    # Perl implementation of Unix grep program

    # Pattern is first command-line argument
    $pattern = shift;

    while (<>)
    {
        # Print the line if it matches the pattern.
        # o ("once") modifier tells Perl to assume that
        # the pattern never changes; this allows Perl
        # to re-use the compiled regular expression,
        # making the program run faster.
        print if /$pattern/o;
    }

Backreferences

So far, cannot write pattern to match double letter
► /[a-z][a-z]/ matches any two letters, even if different

Need pattern that says: “match any letter, calling the matched string ‘1’, then match string ‘1’ again”

Backreferences refer to the substrings matched by previous parts of the pattern
► put parentheses around part of pattern to remember
    – first ( and its matching ) become string 1
    – second ( and its matching ) become string 2
► write backreference as \1, \2, etc.
► /([a-z])\1/ matches substring composed of any double letter
► /\b(\w+)\b.*\b\1\b/ matches any string containing the same word twice

Llama3 pages 109-111; Camel3 pages 182-184; perlre manpage
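A sketch of the two backreference patterns from the slide. The test strings are illustrative.

```perl
use strict;
use warnings;

# \1 must match exactly what the first parentheses matched.
my @double = grep { /([a-z])\1/ } ("settee", "cattle", "mouse");

# The same whole word appearing twice in a string.
my @repeat = grep { /\b(\w+)\b.*\b\1\b/ }
             ("the cat saw the dog", "a cat saw the dog");

print "double letter: @double\n";
print "repeated word: $_\n" foreach @repeat;
```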

Capturing strings

Matched backreference substrings are available after the match succeeds
► backreference \1 is available in special variable $1
► backreference \2 is available in special variable $2
► etc.

Allows code to find out what strings matched which parts of a pattern
► $_ = "haystack"; /(t([aeiou])[a-z])/ puts "tac" in $1 and "a" in $2

Captured strings are available until next match succeeds
► if match fails, variables are not set (they keep any values from the last successful match)

Llama3 pages 109-111; Camel3 pages 182-185; perlre, perlvar manpages
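A sketch of capturing with the slide's "haystack" example. The variable names $outer and $inner are illustrative. The captures are copied only inside the if, because a failed match leaves $1 and $2 holding whatever an earlier successful match put there.

```perl
use strict;
use warnings;

$_ = "haystack";

my ($outer, $inner);
if (/(t([aeiou])[a-z])/) {
    # Groups are numbered by their opening parenthesis, left to right.
    ($outer, $inner) = ($1, $2);    # "tac" and "a"
}

print "outer group: $outer, inner group: $inner\n" if defined $outer;
```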

Timeout

    # Identify mail headers

    while (<STDIN>)
    {
        last if /^$/;

        # Extract name of header (before colon) into $1
        # and content of header (after colon to end of
        # line) into $2.
        # Match fails on continuation lines, so
        # $1 and $2 variables not set.
        if (/^(\S+):\s?(.*)$/)
        {
            print "Header name is $1, contains $2\n";
        }
    }

Timeout

    # Fancy Unix grep that identifies where a match was.

    $pattern = shift;

    # ANSI terminal escapes
    $bold = "\033[1m"; $norm = "\033[0m";

    while (<>)
    {
        # Look for pattern, capture it into $2.
        # Also capture all previous text on line into $1
        # and all following text to $3.
        if (/^(.*)($pattern)(.*)$/o)
        {
            # . never matches the newline, so $3 stops short
            # of it; print the newline explicitly.
            print "$1$bold$2$norm$3\n";
        }
    }

Multiplier greediness

Multipliers *, + and ? are normally greedy
► if there are two ways to successfully match a string, they will try to match the longest substring
► $_ = "mississippi"; /m.*ss/ matches up to second “ss”
    – because /.*/ would prefer to match “issi” rather than just “i”

Non-greedy (lazy) multipliers *?, +? and ?? exist
► will try to match the shortest substring
► $_ = "mississippi"; /m.*?ss/ matches up to first “ss”

If only one way to match, greedy and lazy multipliers match same way

Greediness only important if need to know which part of string matched a pattern
► if using \1, \2, $1, $2, etc.
► if using s/.../.../

Camel3 pages 177-178; perlre manpage
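A sketch that makes the difference visible by capturing what each version matched. It relies on the match operator returning its captured groups when used in list context, a standard Perl feature not shown on the slides. The names $greedy and $lazy are illustrative.

```perl
use strict;
use warnings;

$_ = "mississippi";

# In list context a successful match returns the captured groups.
my ($greedy) = /(m.*ss)/;     # "mississ": .* takes as much as it can
my ($lazy)   = /(m.*?ss)/;    # "miss":    .*? takes as little as it can

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Both patterns succeed on this string; only the extent of the match differs.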

Substitution

To replace a matched substring with a new substring, use s/pattern/replacement/ operator

pattern is a regular expression to find in the $_ variable

replacement is the string to replace the matching part of $_
► not a regular expression
► may contain $1, $2, etc. captured strings

If pattern not found, no change is made to $_

    s/colou?r/hue/; # Make a synonym

Llama3 pages 122-123; Camel3 pages 152-155; perlop manpage
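A sketch of substitution on $_, covering a plain replacement, a replacement that uses a captured string, and a pattern that is not found. The sample strings and variable names are illustrative.

```perl
use strict;
use warnings;

# Without modifiers, only the first match is replaced.
$_ = "The colour of the color wheel";
s/colou?r/hue/;
my $first_only = $_;      # "The hue of the color wheel"

# The replacement may use captured strings.
$_ = "haystack";
s/(t[aeiou][a-z])/<$1>/;
my $marked = $_;          # "hays<tac>k"

# When the pattern is not found, $_ is left unchanged.
$_ = "grey";
my $changed = s/colou?r/hue/ ? 1 : 0;    # 0
my $untouched = $_;                      # "grey"

print "$first_only\n$marked\n$untouched (changed: $changed)\n";
```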

Substitution

Variables are interpolated into both pattern and replacement
► s/$regex/$new/;

Substitution normally only occurs for the first match in a string
► use g (“global”) modifier to make substitution repeat as often as possible on the string
    – s/cat/dog/g;
► substitution also takes i (case-insensitive) modifier

Llama3 pages 123, 124; Camel3 page 153; perlop manpage
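A sketch of the g and i modifiers and of interpolation in both halves of a substitution. The sample text and the contents of $regex and $new are made up for illustration.

```perl
use strict;
use warnings;

my $text = "Cat chases cat; CAT wins";

# No modifiers: first case-sensitive match only.
$_ = $text;
s/cat/dog/;
my $once = $_;                  # "Cat chases dog; CAT wins"

# g repeats the substitution, i ignores case.
# s/// returns the number of substitutions made.
$_ = $text;
my $n = s/cat/dog/gi;           # 3
my $all = $_;                   # "dog chases dog; dog wins"

# Variables are interpolated into both pattern and replacement.
my ($regex, $new) = ('c[aeiou]t', 'pet');
$_ = "cat cot cut";
s/$regex/$new/g;
my $pets = $_;                  # "pet pet pet"

print "$once\n$all ($n changes)\n$pets\n";
```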

Timeout

    # Censor: change some words in input to others.

    %swearwords = (
        'Micro[s\$]oft'         => 'M.......t',
        'Windows( (95|98|ME))?' => 'Windoze',
        'Python'                => 'anti-Perl'
    );

    # Start at zero so the final message is sensible
    # even when nothing was changed.
    $count = 0;

    while (<>)
    {
        while (($bad, $euphemism) = each %swearwords)
        {
            # s/// returns number of times succeeded
            $count += s/$bad/$euphemism/gi;
        }
        print;
    }
    print "$count words changed\n";

Binding operator =~

Match /.../ and substitution s/.../.../ operators match against $_ variable by default

Can match against any variable with binding operator =~
► put variable on left of operator
► put match/substitution on right of operator
► if ($string =~ /pattern/) { ... }
► $changeme =~ s/cat/dog/g;
► ($copy = $orig) =~ s/cat/dog/g;
► if ($_ =~ /pattern/) # Redundant

Llama3 pages 117-118; Camel3 pages 93-94; perlop manpage
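A sketch of the binding operator for matching, for substituting in place, and for the copy-then-modify idiom from the slide. The string contents are illustrative, and my declarations are added so the code runs under use strict.

```perl
use strict;
use warnings;

# Match against a named variable instead of $_.
my $string  = "haystack";
my $matched = ($string =~ /t[aeiou][a-z]/) ? "yes" : "no";

# Substitute in place.
my $changeme = "cat food for the cat";
$changeme =~ s/cat/dog/g;               # "dog food for the dog"

# Copy first, then change only the copy.
my $orig = "cat nap";
(my $copy = $orig) =~ s/cat/dog/g;      # $copy is "dog nap"; $orig is unchanged

print "matched: $matched\n";
print "$changeme\n";
print "$orig -> $copy\n";
```

The parentheses in the last form matter: they make the assignment happen first, so the substitution is bound to $copy rather than to $orig.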

Covered in this topic

Regular expressions
► Character classes
    – [...], ., \s, \S, \d, etc.
► Multipliers
    – *, +, ?, non-greedy versions *?, +?, ??
► Anchors
    – ^, $

Match operator /.../

Interpolation

split and join

Alternation and grouping

Backreferences and capturing substrings
► \1, \2, $1, $2, etc.

Substitution operator s/.../.../

Binding operator =~

Going further

Advanced regular expressions
► look-ahead, look-behind, evaluating Perl expressions as regular expressions, etc.
► Camel3 pages 195-216
► Mastering Regular Expressions, by Jeffrey Friedl, O’Reilly & Associates

tr/.../.../
► transliteration operator, like Unix tr program

sed, awk, grep, vi, ...
► some of Unix’s more powerful pattern-matching tools
► man sed, man awk, ...

Next topic

File I/O

Opening and closing files

Reading from and writing to files

Manipulating files and directories

Communicating with processes

Llama3 chapter 6, pages 86-97; chapters 11-14, pages 148-207
Camel3 pages 20-22, 28-29, 97-100, 426-428, 747-755, 770
perlfunc, perlopentut manpages