www.ischool.drexel.edu
INFO 320 Server Technology I
Week 7
Regular expressions
INFO 320 week 7
Overview
• One of the most powerful tools in UNIX/Linux is the ability to match text against regular expressions
  – Regular expressions overview
  – grep
  – Character classes
  – Applications
Regular expressions overview
Mostly from Regular-Expressions.info and the man pages cited
Regular expressions?
• “A regular expression (regex or regexp for short) is a special text string for describing a search pattern”
  – Although they grew up on UNIX, regular expressions can also be used with little modification on Windows and in Perl, PHP, Java, or any .NET language
  – “Little modification?” Yes: you have to be careful which set of regex rules you’re using
Regular expressions
• The down side?
  – They look like complete and utter gibberish
    ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
• The good news?
  – There are zillions of cookbook recipes for common uses of them
  – And with commands (grep, ed, sed), they can be used in scripts
Fancy wildcards?
• The basic idea is that regexes are wildcards on steroids
• We saw that, in bash scripting
  – A star ‘*’ can substitute for zero or more of any character
  – A question mark ‘?’ can substitute for exactly one of any character
• In a regex neither does that: ‘*’ and ‘?’ modify the item that precedes them
• We’ll also refine our use of brackets [ ] to include or exclude any specific one character
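The contrast between shell wildcards and regexes can be sketched as follows. The helper names `glob_match` and `regex_match` and the sample strings are made up for illustration, and GNU grep is assumed.

```shell
#!/bin/sh
# Shell glob: '*' stands for any run of characters, '?' for exactly one.
glob_match() {
    # $1 = glob pattern, $2 = string to test
    case "$2" in
        $1) echo yes ;;
        *)  echo no ;;
    esac
}

# Regex (ERE): '*' and '?' modify the item that comes before them.
regex_match() {
    # $1 = regex, $2 = string to test (-x: the whole line must match)
    if printf '%s\n' "$2" | grep -Eqx "$1"; then echo yes; else echo no; fi
}

glob_match  'ab*'     'abxyz'   # yes: the glob '*' covers "xyz"
regex_match 'ab*'     'abxyz'   # no:  'b*' only means "zero or more b"
regex_match 'ab.*'    'abxyz'   # yes: '.*' is the regex spelling of the glob '*'
glob_match  'colo?r'  'color'   # no:  the glob '?' demands one extra character
regex_match 'colou?r' 'color'   # yes: the regex '?' makes the u optional
```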
Regex syntax
• Within UNIX, there are variations on regex syntax
  – GNU grep (our main tool) uses GNU Basic Regular Expressions syntax (BRE)
  – GNU egrep uses GNU Extended Regular Expressions syntax (ERE)
  – POSIX-compliant systems use POSIX Basic Regular Expressions for grep, or POSIX Extended Regular Expressions for egrep
BRE (grep) vs ERE (egrep)
• In GNU grep the only difference is that BREs use backslashes to give certain characters (? + { } | ( )) a special meaning, while EREs use backslashes to take away the special meaning of the same characters
• egrep has the same functions as grep; historically it was a little faster, but GNU grep uses the same matching engine for both
  – grep -E is the same as egrep, and is the preferred spelling in current GNU grep
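As a small illustration of the backslash rule, here is the same pattern ("ab" repeated exactly twice) written both ways. The sample lines are invented, and GNU grep is assumed.

```shell
sample='abab
ab
(ab){2}'

# BRE: backslashes turn ( ) { } into operators.
printf '%s\n' "$sample" | grep -x '\(ab\)\{2\}'    # prints: abab

# ERE: the bare characters are already operators.
printf '%s\n' "$sample" | grep -Ex '(ab){2}'       # prints: abab

# BRE without backslashes: the same characters are plain text.
printf '%s\n' "$sample" | grep -x '(ab){2}'        # prints: (ab){2}
```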
Ed and sed
• Similar regex rules are used by grep, ed, and sed
  – ed is a line-oriented text editor
  – sed is used to perform basic transformations on an input text stream
grep
Regular expressions and grep
• Regular expressions came to UNIX in the 1970s through the ed editor and the ‘grep’ command
  – grep = global regular expression print, from the ed command g/re/p
  – egrep = extended grep
• We’ll focus on grep
  – grep matches BREs, which are defined by IEEE Std 1003.1-2001, Section 9.3, Basic Regular Expressions (now dated 2008)
grep syntax
• The basic form is
  – grep -options pattern file
• The normal output from grep is a text list of all the lines in the file which matched the pattern
  – Notice that a word which is split across lines, like ‘re-’ at the end of one line and ‘member’ at the start of the next, is not found! grep tests one line at a time, so a match cannot span multiple lines
grep options
• Like most UNIX commands, grep has many options (see handout), including
  – -c shows the count of lines matched, instead of the lines themselves
  – -i ignores case when matching (!)
  – -n gives the line number of each line matched
  – -v gives lines which don’t match the pattern(s) as output
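A quick way to see these four options side by side, using a few invented lines fed through a pipe instead of a file:

```shell
text='Four score and seven years ago
our fathers brought forth
FOUR more lines
the end'

printf '%s\n' "$text" | grep -c  'our'   # 2  ("Four" and "our"; "FOUR" differs in case)
printf '%s\n' "$text" | grep -ci 'our'   # 3  (-i also accepts "FOUR")
printf '%s\n' "$text" | grep -n  'our'   # lines 1 and 2, each prefixed with "N:"
printf '%s\n' "$text" | grep -v  'our'   # the two lines without "our"
```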
grep options
• You can also give a list of patterns, each introduced by its own -e option
• Or use a file of patterns, one per line, with the -f option
• You can match only lines where the whole line matches the pattern, with the -x option
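These three options might be tried out like this. The word list is invented, and the pattern file is a temporary file created with `mktemp` only for the demonstration.

```shell
words='cat
dog
catalog
bird'

# -e may be repeated; a line matching either pattern is printed.
printf '%s\n' "$words" | grep -e 'cat' -e 'bird'   # cat, catalog, bird

# -f reads one pattern per line from a file.
patterns=$(mktemp)
printf '%s\n' 'dog' 'bird' > "$patterns"
printf '%s\n' "$words" | grep -f "$patterns"       # dog, bird
rm -f "$patterns"

# -x keeps only lines that the pattern matches in full.
printf '%s\n' "$words" | grep -x 'cat'             # cat (not catalog)
```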
Search patterns
• As a good habit, put the search pattern in quotes
  – Single quotes are safest: inside double quotes the shell still expands $, backticks, and some backslashes before grep sees the pattern
  – The pattern is a regular expression
• If you give an empty pattern all lines will be matched
  – So what does grep -c '' filename do?
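One way to check the answer to that question is to feed grep a known number of lines; the three-line input here is made up.

```shell
# Three lines, the middle one blank: the empty pattern matches all of them,
# so -c reports the total number of lines.
printf 'one\n\nthree\n' | grep -c ''    # 3
```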
Metacharacters
• Regex metacharacters are characters that have special meaning in this context
• We’ll look at them in groups
  – To match exactly one of any character, use a period ‘.’
    • Notice a ‘?’ did this in the context of scripting
  – The star ‘*’ is no longer a wildcard by itself: it matches zero or more of the preceding item, so ‘.*’ matches zero or more of any character (except newline)
Metacharacters
• We can identify words at the start or end of a line
• ‘^’ (the caret) marks the start of the line
  – ‘^Four’
• ‘$’ (dollar) marks the end of the line
  – ‘ago$’
  – Again, different meaning than in scripting
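The effect of the two anchors shows up when the same word appears both at the edge of a line and in the middle. The three lines below are invented sample text.

```shell
speech='Four score and seven years ago
our fathers brought forth, Four square,
a new nation, long ago conceived'

printf '%s\n' "$speech" | grep -c 'Four'    # 2: anywhere on the line
printf '%s\n' "$speech" | grep -c '^Four'   # 1: only at the start of a line
printf '%s\n' "$speech" | grep -c 'ago'     # 2: anywhere on the line
printf '%s\n' "$speech" | grep -c 'ago$'    # 1: only at the end of a line
```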
Metacharacters
• We can identify the start or end of a word
• ‘\<’ marks the start of a word
  – ‘\<eat’ would match eats or eating, not feat
• ‘\>’ marks the end of a word
  – ‘ing\>’ would match loving but not sings
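The slide's own example words can be run through grep directly. Note that `\<` and `\>` are GNU extensions, so this sketch assumes GNU grep.

```shell
words='eats
eating
feat
loving
sings'

printf '%s\n' "$words" | grep '\<eat'   # eats, eating (feat has "eat" mid-word)
printf '%s\n' "$words" | grep 'ing\>'   # eating, loving (in sings an s follows "ing")
```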
Character classes
Character classes
• With a "character class" (or set) you can tell the regex engine to match only one out of several characters
  – Simply place the possible characters you want to match between square brackets
• If you want to match an a or an e, use [ae]
  – You could use this in gr[ae]y to match either gray or grey
    • Very useful if you do not know whether the document you are searching through is written in American or British English
From http://www.regular-expressions.info/charclass.html
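A one-line check of the gray/grey example; the two misspellings are added only to show what the class rejects.

```shell
# [ae] stands for exactly one character, so "greay" and "griy" fail.
printf '%s\n' gray grey griy greay | grep -x 'gr[ae]y'   # gray, grey
```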
Character classes
• The order of the characters inside a character class does not matter
  – The results are identical for [ae] or [ea]
• The characters don’t have to be sequential
  – [dptjgm583;] is fine
• To match the special characters \ ^ $ . | ? * + ( ) { } [ literally outside a character class, add a backslash before them (ERE syntax)
  – Inside a character class most of them are already literal; in grep even the backslash is literal there
    • So [abc\\\?] matches a b c \ or ?
Character classes
• More generally in character classes
  – ‘[ ]’ matches any one character specified between the brackets
  – ‘[^abc]’ matches any one character NOT specified between the brackets
    • That example matches a single character that is not a, b or c; it does not mean ‘the line has no a, b or c in it’
    • Notice the ^ has a very different meaning in a character class than as its own metacharacter
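The difference between "one character that is not a, b or c" and "no a, b or c anywhere" can be seen with four invented words:

```shell
# A line matches if it contains at least one character outside a, b, c.
printf '%s\n' abc cab abcd xyz | grep '[^abc]'       # abcd, xyz

# To demand that no character on the line is a, b or c, repeat the class and use -x.
printf '%s\n' abc cab abcd xyz | grep -x '[^abc]*'   # xyz
```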
Character classes
• Within character classes, ranges of possible characters can be given
  – [a-z] means any lower case letter
  – [a-zA-Z] means any upper or lower case letter
  – [a-zA-Z0-9] means any letter or number
  – [^a-zA-Z0-9] could be any character that isn’t a letter or number
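A sketch of the ranges applied to four single-character lines. The helper `chars` is hypothetical, and `LC_ALL=C` is set as a precaution so that ranges follow plain ASCII order regardless of locale.

```shell
chars() { printf '%s\n' a Q 7 _; }

chars | LC_ALL=C grep -x '[a-z]'          # a
chars | LC_ALL=C grep -x '[a-zA-Z]'       # a, Q
chars | LC_ALL=C grep -x '[a-zA-Z0-9]'    # a, Q, 7
chars | LC_ALL=C grep -x '[^a-zA-Z0-9]'   # _
```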
Metacharacters
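• The pipe means logical OR in an expression, here called alternation
  – abc(def|xyz) matches abcdef or abcxyz
• Multiple alternations are allowed
  – s(i|a|o)ng matches sing, sang or song
  – For single characters a class does the same job: s[iao]ng (a | inside brackets is just a literal pipe)
• Notice the parentheses group a string of characters to be treated as one
• In grep these need ERE (grep -E); in GNU BRE they are written \| \( \)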
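Alternation and the bracket pitfall can be compared on a short invented word list, which includes the odd string `s|ng` to expose the literal pipe.

```shell
words='sing
sang
song
sung
s|ng'

printf '%s\n' "$words" | grep -E 's(i|a|o)ng'   # sing, sang, song
printf '%s\n' "$words" | grep -E 's[iao]ng'     # sing, sang, song
printf '%s\n' "$words" | grep 's[i|a|o]ng'      # sing, sang, song and also s|ng
printf '%s\n' abcdef abcxyz abcqrs | grep -E 'abc(def|xyz)'   # abcdef, abcxyz
```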
Bracket expressions
• POSIX has bracket expressions to provide abbreviations for common search terms; the names only work inside a character class, hence the double brackets
  – For example instead of [a-z] can use [[:lower:]]
  – [a-zA-Z] becomes [[:alpha:]]
  – [a-zA-Z0-9] becomes [[:alnum:]]
  – What does [A-Fa-f0-9] = [[:xdigit:]] mean?
• So [^x-z[:digit:]] matches a single character that is not x, y, z or a digit [0-9]
From http://www.regular-expressions.info/posixbrackets.html
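Two of these bracket expressions tried on single-character lines. The inputs are invented, and `LC_ALL=C` is an added precaution for the `x-z` range.

```shell
# [[:xdigit:]] accepts the hexadecimal digits 0-9, A-F and a-f.
printf '%s\n' a B 7 f g x | LC_ALL=C grep -x '[[:xdigit:]]'     # a, B, 7, f

# A named class can be combined with ranges and negation in one set of brackets.
printf '%s\n' a 5 x y | LC_ALL=C grep -x '[^x-z[:digit:]]'      # a
```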
Optional
• The question mark attempts to match the preceding token zero times or once, in effect making it optional (ERE; in GNU BRE write \?)
  – colou?r matches both colour and color
  – Nov(ember)? will match Nov and November
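Both examples can be checked with grep -E. The extra inputs `colouur` and `Novem` are made up to show what is rejected.

```shell
# '?' applies to the single token before it: the letter u ...
printf '%s\n' color colour colouur | grep -Ex 'colou?r'    # color, colour

# ... or to a whole parenthesized group.
printf '%s\n' Nov November Novem | grep -Ex 'Nov(ember)?'  # Nov, November
```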
Repetition
• The asterisk or star tells the engine to attempt to match the preceding token zero or more times.
  – ‘<[A-Za-z][A-Za-z0-9]*>’ matches an HTML tag without any attributes
• The plus tells the engine to attempt to match the preceding token once or more.
  – ‘<[A-Za-z0-9]+>’ will match a tag with any one or more alphanumeric characters
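The two tag patterns behave differently on a tag that starts with a digit. The sample tags below are invented, and grep -E is used so that `+` works.

```shell
tags='<b>
<h1>
<1h>
<>
<a href>'

# A letter first, then zero or more letters or digits.
printf '%s\n' "$tags" | grep -Ex '<[A-Za-z][A-Za-z0-9]*>'   # <b>, <h1>

# One or more letters or digits in any order, so <1h> slips through.
printf '%s\n' "$tags" | grep -Ex '<[A-Za-z0-9]+>'           # <b>, <h1>, <1h>
```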
Limiting repetition
• As a further refinement, it’s possible to specify how many times a token will be repeated, by adding {min,max} instead of a star or plus (in BRE, \{min,max\})
• Max is infinite if not specified, so
  – * = {0,}, + = {1,} and ? = {0,1}
  – But {0,3} would limit the previous token to appear zero to three times
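A sketch of interval limits on runs of the letter a followed by b. The helper `runs` and its inputs are made up.

```shell
runs() { printf '%s\n' b ab aaab aaaab; }

runs | grep -Ex 'a{0,3}b'     # b, ab, aaab   (four a's is one too many)
runs | grep -Ex 'a{2,}b'      # aaab, aaaab   (no upper limit)
runs | grep -x  'a\{0,3\}b'   # the BRE spelling of the first pattern: b, ab, aaab
```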
() [] [::]?
• So in the context of a regex
  – Parentheses ( ) are used for grouping, to treat a series of characters as one for repetition
  – Square brackets [ ] define a character class, which matches any one character in that class
  – Square brackets with colons [: :] define a POSIX bracket expression
?*+{}?
• And following any kind of grouping, character class, or bracket expression
  – ? makes a group repeated zero or one time (optional)
  – + makes a group repeated one or more times
  – * makes a group repeated zero or more times
  – Curly brackets { } are used for controlling repetition by giving min and max limits
Searching for special characters
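• Inside a character class in grep, a few characters need careful placement rather than a backslash
• To match a ], put it as the first character after the opening [ or the negating ^
• To match a -, put it right before the closing ]
• To match a ^, put it anywhere except first, for example before the final literal - or the closing ]
• Put together, []\d^-] matches ], \, d, ^ or -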
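The combined class can be verified by feeding grep one character per line. The extra `x` is an invented control that should not match, and GNU grep's POSIX bracket rules are assumed.

```shell
# Each of ], \, d, ^ and - matches; x does not.
printf '%s\n' ']' '\' 'd' '^' '-' 'x' | grep -cx '[]\d^-]'   # 5
```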
Applications
From http://www.regular-expressions.info/examples.html
Ok, now what?
• Given this terribly complex set of rules for defining a regular expression … so what?
• Regexes are very handy for searching for specific terms, or validating inputs
• Here we’ll review a few cookbook examples
Trimming Whitespace
• A mundane example is to use regular expressions to get rid of spaces at the start and end of lines
  – Search for ^[ \t]+ and replace with nothing to delete leading whitespace
  – Search for [ \t]+$ and replace with nothing to trim trailing whitespace
  – [ \t] matches a space or tab in Perl-style flavors; grep reads \t inside brackets as a backslash and a t, so use [[:blank:]] there
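With sed, the search-and-replace might look like the sketch below. The function name `trim` is hypothetical, `[[:blank:]]` stands in for `[ \t]`, and `*` replaces `+` because `+` is not part of BRE (replacing an empty match with nothing is harmless).

```shell
trim() {
    # First expression strips leading blanks, second strips trailing blanks.
    sed -e 's/^[[:blank:]]*//' -e 's/[[:blank:]]*$//'
}

printf '  \thello world \t\n' | trim    # hello world
```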
Match IP addresses
• A simplified version is \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
  – \d is Perl-style shorthand for [0-9]; grep -E does not support it, so write [0-9] or use GNU grep -P
• But that will catch illegal IP addresses above 255; to fix that use
  – \b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
  – Ok, matching numbers is tough in a text world
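In a script the long pattern is easier to read when the octet is stored once and reused. This is a sketch: the names `octet` and `is_ipv4` are hypothetical, and `-x` (whole line) takes the place of the two `\b` anchors because a single candidate string is being validated.

```shell
# One octet, 0 through 255, exactly as in the cookbook pattern.
octet='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

is_ipv4() {
    # Succeeds (exit status 0) only if $1 is four valid octets joined by dots.
    printf '%s\n' "$1" | grep -Eqx "$octet\.$octet\.$octet\.$octet"
}

for addr in 192.168.1.10 256.1.1.1 10.0.0; do
    if is_ipv4 "$addr"; then echo "$addr valid"; else echo "$addr invalid"; fi
done
```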
Numbers are challenging
• To get a real number
  – [-+]?[0-9]*\.?[0-9]+
• But if you might need exponential notation
  – [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
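The exponent version can be wrapped in a small test function. The name `is_number` and the candidate strings are made up, and `-x` requires the whole string to be a number.

```shell
is_number() {
    printf '%s\n' "$1" | grep -Eqx '[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?'
}

for candidate in 3.14 -42 .5 6.02e23 1. abc; do
    if is_number "$candidate"; then
        echo "$candidate: number"
    else
        echo "$candidate: not a number"    # "1." fails: a digit must follow the dot
    fi
done
```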
Validate email addresses
• If you get a string and want to see if it’s an email address, you could try
  – ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
  – What assumption is made here about case?
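The case question can be explored by running the pattern with and without -i. The function names and addresses are invented, and `LC_ALL=C` is an added precaution so that `[A-Z]` means only the 26 ASCII capitals.

```shell
email_re='^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$'

is_email_strict() { printf '%s\n' "$1" | LC_ALL=C grep -Eq  "$email_re"; }
is_email()        { printf '%s\n' "$1" | LC_ALL=C grep -Eqi "$email_re"; }

if is_email_strict 'user@example.com'; then echo match; else echo "no match"; fi   # no match: lowercase input
if is_email_strict 'USER@EXAMPLE.COM'; then echo match; else echo "no match"; fi   # match
if is_email 'user@example.com'; then echo match; else echo "no match"; fi          # match: -i ignores case
if is_email 'user@example.museum'; then echo match; else echo "no match"; fi       # no match: {2,4} caps the last label
```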
Validate a date
• (19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])
• Matches a date in yyyy-mm-dd format between 1900-01-01 and 2099-12-31; the separator may be a dash, space, slash or dot
• For grep -E, write each \d as [0-9]
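A grep -E version of the date check, with `\d` spelled as `[0-9]`. The names `date_re` and `is_date` are hypothetical. The last call shows that the pattern checks only the shape of the date, not the length of each month.

```shell
date_re='(19|20)[0-9][0-9][- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'

is_date() { printf '%s\n' "$1" | grep -Eqx "$date_re"; }

for d in 2024-02-29 1999/12/31 2024-13-01 1899-01-01 2023-02-31; do
    if is_date "$d"; then echo "$d accepted"; else echo "$d rejected"; fi
done
# 2024-13-01 (month 13) and 1899-01-01 (year) are rejected;
# 2023-02-31 is accepted even though February never has 31 days.
```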
Validate credit cards
• To validate a credit card, you need the issuer’s number format, and must first strip out spaces and dashes
• Visa: ^4[0-9]{12}(?:[0-9]{3})?$
  – All Visa card numbers start with a 4; new cards have 16 digits, old cards have 13
  – (?: ) is a Perl-style non-capturing group; for grep -E use plain parentheses
• MasterCard: ^5[1-5][0-9]{14}$
  – All MasterCard numbers start with the numbers 51 through 55; all have 16 digits
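Both steps (strip separators, then match the format) might be combined as below. The function name `card_type` is hypothetical, the card numbers are made-up test values rather than real accounts, and the check covers the number format only.

```shell
card_type() {
    # Remove spaces and dashes before matching.
    number=$(printf '%s' "$1" | tr -d ' -')
    if printf '%s\n' "$number" | grep -Eqx '4[0-9]{12}([0-9]{3})?'; then
        echo Visa
    elif printf '%s\n' "$number" | grep -Eqx '5[1-5][0-9]{14}'; then
        echo MasterCard
    else
        echo unknown
    fi
}

card_type '4111 1111 1111 1111'   # Visa
card_type '5500-0000-0000-0004'   # MasterCard
card_type '1234 5678'             # unknown
```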
References
• Regular-expressions.info: http://www.regular-expressions.info/
• Grep man page: http://manpages.ubuntu.com/manpages/jaunty/en/man1/grep.1posix.html
• Lots of books are also available on regular expressions