Upload
paolo-carrasco
View
357
Download
4
Tags:
Embed Size (px)
DESCRIPTION
Trainning given a few years ago about Regular Expressions
Citation preview
The Nearshore Advantage!The Nearshore Advantage!
Regular expressionsPaolo Carrasco
The Nearshore Advantage!The Nearshore Advantage!
Agenda• Definition• How a regex engine works• Applications
– Matching• Regex Pattern• Characters
– Literal Characters– Special characters – Character classes/sets
• Anchors • Repetition and optional items
– Extracting• Grouping• Alternation• Groups
– Replacing• Advanced topics
– Greediness– Lookahead and lookbehind
• Techniques for Faster Expressions
The Nearshore Advantage!The Nearshore Advantage!
Definition
• Basically, a regular expression is a pattern describing a certain amount of text.
• Their name comes from the mathematical theory on which they are based.
• Regular expressions provide a powerful, flexible, and efficient method for processing text.
• Sometimes it is also named regex or regexp.
The Nearshore Advantage!The Nearshore Advantage!
How a regex engine works
Regex Engine
The centerpiece of text processing with regular expressions is the regex engine.A regex "engine" is a piece of software that can process regular expressions, trying to match the pattern to the given string. Usually, the engine is part of a larger application and you do not access the engine directly.
API
The Nearshore Advantage!The Nearshore Advantage!
How a regex engine works
Regex Engine
Replace-ment
InputPattern
At a minimum, processing text using
regular expressions requires that the
engine be provided with the
following two items of information:
•The regular expression pattern to
identify in the text.
•The text to parse for the regular
expression pattern.
• (Optionally) A replacement string.
The Nearshore Advantage!The Nearshore Advantage!
Applications
Matching
Extracting
Replacing
• The extensive pattern-matching
notation of regular expressions enables
you to quickly parse large amounts of
text to find specific character patterns;
to validate text to ensure that it
matches a predefined pattern; to
extract, edit, replace, or delete text
substrings; and to add the extracted
strings to a collection in order to
generate a report.
The Nearshore Advantage!The Nearshore Advantage!
Applications: MatchingDetermine whether a string matches a regular expression pattern.
The Nearshore Advantage!The Nearshore Advantage!
Regular Expression Pattern
Pattern
Literal Characters
Operators
Constructs
A regular expression is a pattern that the regular expression engine attempts to match in input text. A pattern consists of one or more character literals, operators, or constructs.
The Nearshore Advantage!The Nearshore Advantage!
Characters: Literal Chars
• All characters except [\^$.|?*+() refer to the simple meaning of characters.
• All characters except the listed special characters match a single instance of themselves.
• Note: The regex engines are case sensitive by default.
The Nearshore Advantage!The Nearshore Advantage!
Characters: Special chars or metacharacters
• Most of the times we will need special characters for complex matches.
• These characters are often called metacharacters.
[ \ ^ $ . | ? * + ( ) • The dot or period is one of
the most commonly used. Unfortunately, it is also the most commonly misused metacharacter.
• When a special char is needed as literal, it requires a backslash followed by any of metacharacters.
• A backslash escapes special characters to suppress their special meaning.
The Nearshore Advantage!The Nearshore Advantage!
Escaped character Description Pattern Matches
\t Matches a tab, \u0009.
(\w+)\t "item1\t", "item2\t" in "item1\titem2\t"
\r Matches a carriage return, \u000D. (\r is not equivalent to the newline character, \n.)
\r\n(\w+) "\r\nThese" in "\r\nThese are\ntwo lines."
\n Matches a new line, \u000A.
\r\n(\w+) "\r\nThese" in "\r\nThese are\ntwo lines."
\s Matches an empty space
\w+\s\w+ “Hello world” in “Hello world”
Characters: Non-printable or escaped chars
The Nearshore Advantage!The Nearshore Advantage!
Char classes/setsA character class matches any one of a set of characters.The order of the characters inside a character class does not matter.
• A character in the input string must match one of a specified set of characters.
Positive character
group
• A character in the input string must not match one of a specified set of characters.
Negative character
group
• The dot or period character is a wildcard character that matches any character except \n
Any character
Shorthand classes• Since certain character
classes are used often, a series of shorthand character classes are available.
• Shorthand character classes can be used both inside and outside the square brackets.
• Some shorthand have negated versions.
The Nearshore Advantage!The Nearshore Advantage!
AnchorsAnchor Description
^ The match must occur at the beginning of the string or line.
$ The match must occur at the end of the string or line, or before \n at the end of the string or line.
\b The match must occur on a word boundary.
\B The match must not occur on a word boundary.
• Anchors match a position before, after or between characters.
• They can be used to "anchor" the regex match at a certain position.
The Nearshore Advantage!The Nearshore Advantage!
Repetition and optional items
• Match-zero-or-more*• Match-one-or-more+• Match-zero-or-one?• Interval{ }
The Nearshore Advantage!The Nearshore Advantage!
Hands On Lab 1Processing which emails match a request.
The Nearshore Advantage!The Nearshore Advantage!
Applications: ExtractingExtract a single match or the first match or all matches.
The Nearshore Advantage!The Nearshore Advantage!
Grouping
• A group, also known as a subexpression, consists of an "open-group operator", any number of other operators, and a "close-group operator". (blahblahblahblahblahblahblahblahblah)
• Regex treats this sequence as a unit, just as mathematics and programming languages treat a parenthesized expression as a unit.
open-group-operator close-group-operator
The Nearshore Advantage!The Nearshore Advantage!
Grouping: Alternation
• Alternation match one of a choice of regular expressions:– If you put the character(s) representing the
alternation operator between any two regular expressions A and B, the result matches the union of the strings that A and B match.
• It operates on the largest possible surrounding regular expressions. Thus, the only way you can delimit its arguments is to use grouping.
The Nearshore Advantage!The Nearshore Advantage!
Grouping: Groups
• By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together.
The Nearshore Advantage!The Nearshore Advantage!
Applications: ReplacingReplace a matched substring.
The Nearshore Advantage!The Nearshore Advantage!
Replacement TextSintax
Feature .NET Java Perl ECMA Ruby
$& (whole regex match) YES error YES YES no
$0 (whole regex match) YES YES no no no
$1 through $99 (backreference) YES YES YES YES no
${1} through ${99} (backreference) YES error YES no no
${group} (named backreference) YES error no no no
$` (backtick; subject text to the left of the match)
YES error YES YES no
$' (straight quote; subject text to the right of the match)
YES error YES YES no
$_ (entire subject string) YES error YES IE only no
$+ (highest-numbered group in the regex)
YES error noIE and Firefox
no
$$ (escape dollar with another dollar) YES error no YES no
$ (unescaped dollar as literal text) YES error no YES YES
The Nearshore Advantage!The Nearshore Advantage!
Advanced Topics"Everything should be made as simple as possible, but not simpler."
The Nearshore Advantage!The Nearshore Advantage!
Greediness
• The repetition operators or quantifiers are greedy. • They will expand the match as far as they can, and
only give back if they must to satisfy the remainder of the regex.
• The quick fix to this problem is to make the quantifier lazy instead of greedy. Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You can do this by putting a question mark behind the plus in the regex.
The Nearshore Advantage!The Nearshore Advantage!
Lookahead and lookbehind
Negative lookahead• It is indispensable if you
want to match something not followed by something else.
Positive lookahead• It is indispensable if you
want to match something not followed by something else.
Collectively, these are called "lookaround". They do not consume characters in the string, but only assert whether a match is possible or not.
The Nearshore Advantage!The Nearshore Advantage!
Techniques for Faster Expressions•Avoid recompiling •Use non-capturing parentheses •Don't add superfluous parentheses •Don't use superfluous character classes •Use leading anchors
Common Sense Techniques
•Expose ^ and \G at the front of expressions •Expose $ at the end of expressions Expose Anchors
•The repetition operators or quantifiers are greedy.Lazy Versus Greedy: Be
Specific
•Put the most likely alternative first
•Distribute into the end of alternation
Lead the Engine to a Match
The Nearshore Advantage!The Nearshore Advantage!
Hands On Lab 2Extract data from email and format it.
The Nearshore Advantage!The Nearshore Advantage!
Resources and Tools• Regex Testers
– http://www.gskinner.com/RegExr/– http://osteele.com/tools/rework/
• Regex Patterns Library– http://regexlib.com
• Complete Tutorials– http://www.regular-expressions.info/– Javascript:
• http://www.w3schools.com/jsref/jsref_obj_regexp.asp
– .NET: • http://msdn.microsoft.com/en-us/library/hs600312.aspx
– Java: • http://java.sun.com/docs/books/tutorial/essential/regex/index.html